DANS QDR Merged ORE/Bag changes for QA #12167

Open

qqmyers wants to merge 98 commits into IQSS:develop from GlobalDataverseCommunityConsortium:DANS-QDR-merged_bag_changes_for_QA

Conversation

@qqmyers commented Feb 17, 2026
What this PR does / why we need it: As discussed in Tech Hour, etc., there have been many changes to the OAI_ORE export and archival Bag generation code, driven by requirements at QDR and DANS, that have been submitted as individual PRs for review. Since the end goal is for all of these to work together, and because it is easier to set up and configure archival Bag workflows once, I've created this PR, which should include all the changes from the others.

Specifically, this PR is a merge of
#12144, #12122, #12129, and #12099 with
#12144 building on #12133, which builds on #12063, and
#12122 building on #12103, which builds on #12101, and
#12129 building on #12104 and #12103

Tracking the review statuses:

Which issue(s) this PR closes:

Not sure if the magic happens for PRs:

Special notes for your reviewer: The individual PRs should be reviewed first. This PR only includes some minor updates required to merge the different threads - see the non-merge commits from Feb 17. Assuming the earlier reviews covered the basic design, I think only some sanity checking of those commits is needed here.

Suggestions on how to test this: Nominally, QA should check all the changes across all of the other PRs; I'll try to combine the lists of changes/test instructions here to help. Overall, the PR makes changes that affect all of the different archivers - local, S3, Google, DRS, DuraCloud. Since the core team isn't set up to easily test all of those, I'd suggest testing just the local archiver and asking QDR (me) to test the Google archiver and DANS to test the S3 one. The DRS archiver was built specifically for Harvard, but the integration with DRS has been put on hold, so I'm not sure there's a way to test it now. The DuraCloud archiver was created for TDL - I'm not sure whether they're interested in testing now (or whether they'll continue using this archiver), but I'll check.

General: Configure a local archiver and set up a post-publication workflow using it, per the guides.
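For reference, a minimal local-archiver setup might look like the sketch below. The settings are the documented ones from the Installation Guide; the archive path, workflow JSON file name, and returned workflow id are placeholders for your installation.

```bash
export SERVER_URL=http://localhost:8080

# Point Dataverse at the local archiver and the setting(s) it needs
curl -X PUT -d edu.harvard.iq.dataverse.engine.command.impl.LocalSubmitToArchiveCommand \
  "$SERVER_URL/api/admin/settings/:ArchiverClassName"
curl -X PUT -d ":BagItLocalPath" "$SERVER_URL/api/admin/settings/:ArchiverSettings"
curl -X PUT -d /srv/dataverse/archive "$SERVER_URL/api/admin/settings/:BagItLocalPath"

# Register an archiver workflow (a JSON with a single :internal/archiver step, per the
# Workflows section of the guides) and make it the default PostPublishDataset workflow
curl -X POST -H 'Content-Type: application/json' -d @archiver-workflow.json \
  "$SERVER_URL/api/admin/workflows"
curl -X PUT -d 1 "$SERVER_URL/api/admin/workflows/default/PostPublishDataset"
```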

For specific changes in the PR:

General Archiving Improvements

  • Multiple performance and scaling improvements have been made for creating archival bags for large datasets, including:
    • Superusers can now see a pending status in the dataset version table while archiving is active.
    • Workflows are now triggered outside the transactions related to publication, ensuring that workflow locks and status updates are always recorded.

Try archiving a dataset with a significant number of files / overall size - you should be able to refresh the page and, as a superuser, see the pending status in the version table. You should also be able to see :BagGeneratorThreads-many parallel* files in the temp dir while the dataset files are being zipped.

  • Potential conflicts between archiving/workflows, indexing, and metadata exports after publication have been resolved, avoiding cases where the status/last update times for these actions were not recorded.

Eventually, the archiving should succeed. There should be no log errors, and the post-publish indexing and export-all should also finish and update the last indexed / last exported times in the db. The archivalcopylocation of the datasetversion should be updated with the final status of archiving (success).
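One way to check, assuming the archival status API documented in the guides and direct database access (table/column names per the current schema; the database name is a placeholder):

```bash
# Final archival status for a version (superuser API token)
curl -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/$DATASET_ID/$VERSION/archivalStatus"

# Last indexed / last exported times and the stored archival status
psql -d dvndb -c "select v.versionnumber, v.minorversionnumber, v.archivalcopylocation,
  o.indextime, d.lastexporttime
  from datasetversion v, dataset d, dvobject o
  where v.dataset_id = d.id and d.id = o.id and d.id = <dataset_id>;"
```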

  • A bug has been fixed where superusers would incorrectly see the "Submit" button to launch archiving from the dataset page version table.

The Submit button should not show for prior versions when the latest has not been archived. (You can edit the archivalcopylocation in the db for versions - e.g. set it to null to remove a success recorded by the post-publish workflow, so the Submit button reappears (as it then looks like archiving hasn't been attempted yet). Alternately, you could turn the post-publish workflow on/off to archive only certain versions.)
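For example (the DELETE form of the archival status API is the documented way to clear a status; the direct SQL update is an alternative, with the database name as a placeholder):

```bash
# Clear the recorded status for a prior version so the Submit button can reappear
curl -X DELETE -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/$DATASET_ID/$VERSION/archivalStatus"

# Or null it out directly in the database
psql -d dvndb -c "update datasetversion set archivalcopylocation = null where id = <datasetversion_id>;"
```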

  • The local, S3, and Google archivers have been updated to support deleting existing archival files for a version to allow re-creating the bag for a given version.
  • For archivers that support file deletion, it is now possible to recreate an archival bag after "Update Current Version" has been used (replacing the original bag). By default, Dataverse will mark the current version's archive as out-of-date, but will not automatically re-archive it.
    • A new 'obsolete' status has been added to indicate when an archival bag exists for a version but it was created prior to an "Update Current Version" change.

Two potential tests: leave the flag off and verify that doing an update-current-version as a superuser results in the obsolete status. Then flip the flag on, try update-current again, and verify that archiving happens and the bag is updated (it should be a different size, and the OAI-ORE file will include your metadata change).
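The flag that enables automatic re-archiving after "Update Current Version" is defined in the individual PR (its name isn't repeated here), so the sketch below only shows how to inspect the resulting status - before the flag is enabled, the stored JSON should report the new obsolete status (database name is a placeholder):

```bash
# After a superuser uses Publish -> Update Current Version, check the stored status
curl -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/$DATASET_ID/$VERSION/archivalStatus"

# ...or look at it directly in the database
psql -d dvndb -c "select archivalcopylocation from datasetversion where id = <datasetversion_id>;"
```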

  • Improvements have been made to file retrieval for bagging, including retries on errors and when download requests are being throttled.

Things should work with throttling - I'm not sure how that is best configured for testing.

  • A bug causing :BagGeneratorThreads to be ignored has been fixed, and the default has been reduced to 2.

Set that parameter, include it in your local archiver config, and verify that a different number of parallel* files appears (temporarily) in the /tmp dir as the bag is zipped. The files go away once the final zip is done, so using a dataset big enough to allow time to check is key.
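A sketch of that check (the temp dir location depends on the JVM's java.io.tmpdir; parallel* is the working-file name pattern mentioned above):

```bash
# Change the thread count and pass it to the local archiver
curl -X PUT -d 4 "$SERVER_URL/api/admin/settings/:BagGeneratorThreads"
curl -X PUT -d ":BagItLocalPath, :BagGeneratorThreads" "$SERVER_URL/api/admin/settings/:ArchiverSettings"

# While a large dataset is being bagged, watch for the transient working files
watch -n 1 'ls -l /tmp/parallel* 2>/dev/null'
```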

  • Retrieval of files for inclusion in an archival bag is no longer counted as a download.

You should not see any download counts resulting from archiving.
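Two possible ways to observe this, both assumptions about the simplest check (the download-count endpoint exists in recent releases; the guestbookresponse query needs direct database access, with the database name as a placeholder):

```bash
# Dataset download count before and after archiving - it should not change
curl "$SERVER_URL/api/datasets/$DATASET_ID/download/count"

# Or confirm no new download records were created by the archiving run
psql -d dvndb -c "select count(*) from guestbookresponse where dataset_id = <dataset_id>;"
```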

  • It is now possible to require that all previous versions have been successfully archived before archiving of a newly published version can succeed. (This is intended to support use cases where deduplication of files between dataset versions will be done and is a step towards supporting the Oxford Common File Layout (OCFL).)

Set the flag and verify that trying to archive the latest version fails if any prior version was not archived successfully (its status is null or failure).
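One way to see the prior-version statuses that the new check looks at (column names per the current schema; database name is a placeholder):

```bash
# A null or failure status on any earlier version should now cause archiving of the
# latest version to end with a failure status rather than success
psql -d dvndb -c "select versionnumber, minorversionnumber, archivalcopylocation
                  from datasetversion where dataset_id = <dataset_id>
                  order by versionnumber, minorversionnumber;"
```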

  • The pending status has changed to use the same JSON format as other statuses.

You could verify that the pending status in the db (which is only there temporarily) is a JSON structure, not just the word pending.
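For example (you have to catch it while archiving is still running; database name is a placeholder):

```bash
# Expect a JSON object such as {"status":"pending","message":"..."} rather than the bare word
psql -d dvndb -c "select archivalcopylocation from datasetversion where id = <datasetversion_id>;"
```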

OAI-ORE Export Updates

  • The export now uses URIs for checksum algorithms, conforming with JSON-LD requirements.

Look at the checksums in the OAI_ORE export (in the bag, or in the metadata export after publishing).
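For instance, pulling the OAI-ORE export for a published dataset and filtering for the checksum entries (the grep is just a quick filter; the @type values should now be URIs):

```bash
curl -s "$SERVER_URL/api/datasets/export?exporter=OAI_ORE&persistentId=$PERSISTENT_ID" \
  | jq . | grep -i -A1 checksum
```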

  • The https://schema.org/additionalType value has been updated to "Dataverse OREMap Format v1.0.2" to reflect the format changes.

You should be able to see this line in the OAI-ORE export.
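Using the same export call as above:

```bash
curl -s "$SERVER_URL/api/datasets/export?exporter=OAI_ORE&persistentId=$PERSISTENT_ID" \
  | jq . | grep additionalType
# expect a value of "Dataverse OREMap Format v1.0.2"
```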

Archival Bag (BagIt) Updates

  • The bag-info.txt file now correctly includes information for dataset contacts, fixing a bug where nothing was included when multiple contacts were defined. (Multiple contacts were always included in the OAI-ORE file in the bag; only the bag-info.txt file was affected.)
  • Values used in the bag-info.txt file that may be multi-line (i.e. with embedded CR or LF characters) are now properly indented and wrapped per the BagIt specification (Internal-Sender-Identifier, External-Description, Source-Organization, Organization-Address).
  • The dataset name is no longer used as a subdirectory within the data/ directory to reduce issues with unzipping long paths on some filesystems.

These should be visible by inspection. One test is to have more than one contact; another is to make sure you have multi-line entries for some/all of the listed fields. It should also be obvious, by looking at the bag structure, that the title is not used as a folder name.
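Inspection can be as simple as the following (the zip name is a placeholder for whatever the local archiver wrote to :BagItLocalPath):

```bash
unzip -l <bag>.zip                         # data/ should hold the files directly, no title-named subfolder
unzip -p <bag>.zip '*bag-info.txt' | cat   # check contacts and the wrapped multi-line values
```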

  • For dataset versions with no files, the empty manifest-<alg>.txt file will now use the algorithm from the :FileFixityChecksumAlgorithm setting instead of defaulting to MD5.

Publish a dataset with no files and a non-MD5 algorithm set, and verify the manifest file name is correct.
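For example, with SHA-512 (one of the documented values for :FileFixityChecksumAlgorithm; the zip name is a placeholder):

```bash
curl -X PUT -d SHA-512 "$SERVER_URL/api/admin/settings/:FileFixityChecksumAlgorithm"
# ...create and publish a dataset with no files, then check the bag:
unzip -l <bag>.zip | grep manifest-        # expect manifest-sha512.txt, not manifest-md5.txt
```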

  • A new key, Dataverse-Bag-Version, has been added to bag-info.txt with the value "1.0" to allow for tracking changes to Dataverse's archival bag generation over time.

Another one to check by inspecting the bag-info.txt file.

  • When using the holey bag option discussed above, the required fetch.txt file will be included.

For this and the related "files outside the zip" option, set the size limits (per file or total per dataset) and verify that, when you're above that limit, the extra files are either sent outside the zipped bag (in a directory of the same name - if you unzip the zip in place, the files sent separately should end up in the right place to make the bag complete) or, if you select holey=true, are listed in the fetch.txt file.
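The settings controlling the size limits and the holey option are defined in the individual PRs (names not repeated here); once they're set, the result can be inspected like this (fetch.txt lines follow the BagIt spec's "<url> <length> <path>" form; names in angle brackets are placeholders):

```bash
# holey=true: the oversized files are referenced rather than included
unzip -p <bag>.zip '*fetch.txt'

# holey=false: the oversized files sit outside the zip, in a directory named after the bag,
# so unzipping the zip in place yields a complete bag
ls -R <archive-dir>
```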

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: Aggregated from all the other PRs.

Additional documentation:
