URL agnostic deduplication of WARC #13

@Arkiver2

Description

This would be useful for grabs where the exact same images are fetched under different URLs. A URL whose content duplicates that of an earlier URL should get a revisit record pointing to the earlier URL. Duplicated URLs are best discovered by comparing hashes of the content.
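A minimal sketch of the idea, not warcat's actual API: the `Record` class, `payload_digest`, and `dedupe` below are hypothetical names, and records are plain in-memory objects rather than real WARC records. It assumes a SHA-1 digest over the payload is an acceptable duplicate test. The first URL seen for a digest keeps its payload; any later URL with the same digest becomes a revisit-style entry that points back to the first one.

```python
import base64
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a WARC response record."""
    url: str
    payload: bytes
    refers_to: Optional[str] = None  # URL of the earlier record with the same payload


def payload_digest(payload: bytes) -> str:
    # Label the digest as "sha1:" plus Base32-encoded SHA-1 (an assumed convention here).
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def dedupe(records: Iterable[Record]) -> List[Record]:
    """Replace repeated payloads with revisit-style entries, ignoring the URL."""
    first_url_by_digest: Dict[str, str] = {}
    result: List[Record] = []
    for record in records:
        digest = payload_digest(record.payload)
        first_url = first_url_by_digest.get(digest)
        if first_url is None:
            # First time this content is seen: keep the full record.
            first_url_by_digest[digest] = record.url
            result.append(Record(record.url, record.payload))
        else:
            # Same content under another URL: drop the payload, point to the original.
            result.append(Record(record.url, b"", refers_to=first_url))
    return result
```

For example, two image URLs with identical bytes would yield one full record and one entry whose `refers_to` is the first URL. A real implementation would also need to write proper revisit records with the matching WARC headers, which this sketch leaves out.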

This would be used for the Flickr Archive Team project. The WARCs would be post-processed with warcat deduplication.

Edit: added a better explanation of what this would be used for.
