
Conversation

@jonathonherbert
Contributor

@jonathonherbert jonathonherbert commented Feb 4, 2026

What does this change?

Clean up Shutterstock image descriptions by removing special instructions and credit information when they are already wholly defined in metadata. When they are not in the metadata, we leave them in the description. This is especially important if the description is the only place a photographer's byline is included.

How should a reviewer test this change?

  • The unit tests should pass. We've captured several cases to exercise the logic.
  • Deploy to CODE. The image descriptions should improve as above.

How can success be measured?

Staff spend less time cleaning captions.

Who should look at this?

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

@github-actions

github-actions bot commented Feb 4, 2026

@jonathonherbert jonathonherbert force-pushed the jsh/cleanup-shutterstock-descriptions branch from c64ef74 to d678af7 Compare February 4, 2026 16:41
@jonathonherbert jonathonherbert self-assigned this Feb 4, 2026
@jonathonherbert jonathonherbert added the feature Departmental tracking: work on a new feature label Feb 4, 2026
@paperboyo
Contributor

This is great! 🙏

I think it might be better to add certain belt (with braces). It doesn’t happen as often with Shutterstock any more, but it used to (so at least will affect migration of old images). It is also a common occurrence for eg. Getty’s intermediaries (Anadolu etc, example PROD id de815b821e2cfa1fdbc5a2ae1ad1f3309ec6118a). It will also provide a step towards extraction (not just removal) of tokens otherwise missing from structured fields (esp. for those Getty intermediaries).

This is the case (PROD id cd2de8ef064b2a137f32a0250447cf454afe0fe1): after earlier cleanup, supplierProcessor sees:
byline: ZUMA Wire
credit: REX/Shutterstock
while description still contains the real byline: Mandatory Credit: Photo by Sachelle Babbar/ZUMA Wire/REX/Shutterstock

The current code doesn’t check whether every token in the removed string is already available in its proper field; it removes the string regardless. This prevents humans from being able to fix the byline manually.

@jonathonherbert
Contributor Author

jonathonherbert commented Feb 5, 2026

I think it might be better to add certain belt (with braces).

Good shout, e753e11 adds a check to only remove the credit field if the first part of the slash-delimited byline in the description matches the metadata. Includes cd2de8ef064b2a137f32a0250447cf454afe0fe1 as fixture.

Presume the person we care about is (edit) always the first one?
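
For illustration, the check described above could be sketched roughly as follows. This is not the code from e753e11: the object name, `creditTokens`, `safeToRemove`, and the case-insensitive comparison are all assumptions made for this example, and it assumes the photographer is always the first slash-delimited token.

```scala
import scala.util.matching.Regex

// Hypothetical sketch, not the PR's implementation.
object CreditCheckSketch {
  // Captures the slash-delimited credit that follows "Mandatory Credit: Photo by",
  // stopping at a newline or an opening parenthesis.
  private val MandatoryCredit: Regex = """Mandatory Credit: Photo by ([^\n(]+)""".r

  // Splits the captured credit into trimmed, non-empty tokens.
  def creditTokens(description: String): Option[List[String]] =
    MandatoryCredit.findFirstMatchIn(description).map { m =>
      m.group(1).split("/").map(_.trim).filter(_.nonEmpty).toList
    }

  // The credit line is only considered safe to remove when its first token
  // matches the byline already held in metadata (comparison ignores case,
  // which is an assumption of this sketch).
  def safeToRemove(description: String, metadataByline: String): Boolean =
    creditTokens(description)
      .flatMap(_.headOption)
      .exists(_.equalsIgnoreCase(metadataByline.trim))
}
```

With the fixture discussed above, the description's first token is `Sachelle Babbar` while the metadata byline is `ZUMA Wire`, so this sketch would leave the credit line in place for a human to fix.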

@jonathonherbert jonathonherbert marked this pull request as ready for review February 5, 2026 12:16
@jonathonherbert jonathonherbert requested a review from a team as a code owner February 5, 2026 12:16
@jonathonherbert jonathonherbert force-pushed the jsh/cleanup-shutterstock-descriptions branch 2 times, most recently from c7c6e18 to 7c4fab0 Compare February 5, 2026 21:16
@jonathonherbert jonathonherbert force-pushed the jsh/cleanup-shutterstock-descriptions branch from 7c4fab0 to 3e42726 Compare February 9, 2026 09:56
}
}

private def matchMandatoryCreditBylines(suppliersReference: String) = s"Mandatory Credit: Photo by (.*)?\\(${Regex.quote(suppliersReference)}\\)\n"
Member

I don't think the group should be optional, and we can be slightly neater by leaving a space between the groups.

Suggested change
private def matchMandatoryCreditBylines(suppliersReference: String) = s"Mandatory Credit: Photo by (.*)?\\(${Regex.quote(suppliersReference)}\\)\n"
private def matchMandatoryCreditBylines(suppliersReference: String) = s"Mandatory Credit: Photo by (.*) \\(${Regex.quote(suppliersReference)}\\)\n"
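
A small sketch of how the two patterns differ in practice may help. The object and method names here are hypothetical, and the supplier reference `12345678` is invented for the example; only the two pattern strings come from the suggestion above.

```scala
import scala.util.matching.Regex

// Hypothetical comparison harness, not part of the PR.
object MandatoryCreditRegexSketch {
  // Pattern as originally written: optional group, no separating space.
  def original(ref: String): Regex =
    s"Mandatory Credit: Photo by (.*)?\\(${Regex.quote(ref)}\\)\n".r

  // Pattern as suggested: required group, with a space before the reference.
  def suggested(ref: String): Regex =
    s"Mandatory Credit: Photo by (.*) \\(${Regex.quote(ref)}\\)\n".r

  // Returns the captured byline group, if the pattern matches anywhere.
  def capture(pattern: Regex, description: String): Option[String] =
    pattern.findFirstMatchIn(description).map(_.group(1))
}
```

For a description starting `Mandatory Credit: Photo by Sachelle Babbar/ZUMA Wire/REX/Shutterstock (12345678)` followed by a newline, both patterns match, but the original captures the byline with a trailing space, whereas the suggested pattern consumes that space outside the group.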


Labels

feature Departmental tracking: work on a new feature


3 participants