DANS QDR Merged ORE/Bag changes for QA #12167

Open

qqmyers wants to merge 98 commits into IQSS:develop from GlobalDataverseCommunityConsortium:DANS-QDR-merged_bag_changes_for_QA

Conversation

@qqmyers commented Feb 17, 2026
What this PR does / why we need it: As discussed in Tech Hour, etc., there have been many changes to the OAI_ORE export and archival Bag generation code, driven by requirements at QDR and DANS, that have been submitted as individual PRs for review. Since the end goal is for all of these to work together, and because it is easier to set up and configure archival Bag workflows once, I've created this PR, which should include all the changes from the others.

Specifically, this PR is a merge of
#12144, #12122, #12129, and #12099 with
#12144 building on #12133, which builds on #12063, and
#12122 building on #12103, which builds on #12101, and
#12129 building on #12104 and #12103

Tracking the review statuses:

Which issue(s) this PR closes:

Not sure if the magic happens for PRs:

Special notes for your reviewer: The individual PRs should be reviewed first. This PR only includes some minor updates required to merge the different threads - see the non-merge commits from Feb 17. Assuming the earlier reviews covered the basic design, I think only some sanity checking of those commits is needed here.

Suggestions on how to test this: Nominally, QA should check all the changes across all of the other PRs; I'll try to combine the lists of changes/test instructions here to help. Overall, the PR makes changes that affect all of the different archivers - local, S3, Google, DRS, DuraCloud. Since the core team isn't set up to easily test all of those, I'd suggest testing just the local archiver and asking QDR (me) to test the Google archiver and DANS to test the S3 one. The DRS archiver was built specifically for Harvard, but the integration with DRS has been put on hold, so I'm not sure there's a way to test it now. The DuraCloud archiver was created for TDL - I'm not sure whether they're interested in testing now (or whether they'll continue using this archiver), but I'll check.

General: Configure a local archiver and set up a post-publication workflow using it, per the guides.
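For reference, a minimal local-archiver setup might look like the sketch below. The settings are the documented ones from the Installation Guide; the archive path, workflow JSON file name, and returned workflow id are placeholders for your installation.

```bash
export SERVER_URL=http://localhost:8080

# Point Dataverse at the local archiver and the setting(s) it needs
curl -X PUT -d edu.harvard.iq.dataverse.engine.command.impl.LocalSubmitToArchiveCommand \
  "$SERVER_URL/api/admin/settings/:ArchiverClassName"
curl -X PUT -d ":BagItLocalPath" "$SERVER_URL/api/admin/settings/:ArchiverSettings"
curl -X PUT -d /srv/dataverse/archive "$SERVER_URL/api/admin/settings/:BagItLocalPath"

# Register an archiver workflow (a JSON with a single :internal/archiver step, per the
# Workflows section of the guides) and make it the default PostPublishDataset workflow
curl -X POST -H 'Content-Type: application/json' -d @archiver-workflow.json \
  "$SERVER_URL/api/admin/workflows"
curl -X PUT -d 1 "$SERVER_URL/api/admin/workflows/default/PostPublishDataset"
```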

For specific changes in the PR:

General Archiving Improvements

  • Multiple performance and scaling improvements have been made for creating archival bags for large datasets, including:
    • Superusers can now see a pending status in the dataset version table while archiving is active.
    • Workflows are now triggered outside the transactions related to publication, ensuring that workflow locks and status updates are always recorded.

Try archiving a dataset with a significant number of files / overall size - you should be able to refresh the page and, as a superuser, see the pending status in the version table. You should also be able to see :BagGeneratorThreads-many parallel* files in the temp dir while the dataset files are being zipped.

  • Potential conflicts between archiving/workflows, indexing, and metadata exports after publication have been resolved, avoiding cases where the status/last update times for these actions were not recorded.

Eventually, the archiving should succeed. There should be no log errors, and the post-publish indexing and export-all should also finish and update the last indexed / last exported times in the db. The archivalcopylocation of the datasetversion should be updated with the final status of archiving (success).
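One way to check, assuming the archival status API documented in the guides and direct database access (table/column names per the current schema; the database name is a placeholder):

```bash
# Final archival status for a version (superuser API token)
curl -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/$DATASET_ID/$VERSION/archivalStatus"

# Last indexed / last exported times and the stored archival status
psql -d dvndb -c "select v.versionnumber, v.minorversionnumber, v.archivalcopylocation,
  o.indextime, d.lastexporttime
  from datasetversion v, dataset d, dvobject o
  where v.dataset_id = d.id and d.id = o.id and d.id = <dataset_id>;"
```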

  • A bug has been fixed where superusers would incorrectly see the "Submit" button to launch archiving from the dataset page version table.

The Submit button should not show for prior versions when the latest has not been archived. (You can edit the archivalcopylocation in the db for versions - e.g. set it to null to remove a success recorded by the post-publish workflow, so the Submit button reappears (as it then looks like archiving hasn't been attempted yet). Alternately, you could turn the post-publish workflow on/off to archive only certain versions.)
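For example (the DELETE form of the archival status API is the documented way to clear a status; the direct SQL update is an alternative, with the database name as a placeholder):

```bash
# Clear the recorded status for a prior version so the Submit button can reappear
curl -X DELETE -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/$DATASET_ID/$VERSION/archivalStatus"

# Or null it out directly in the database
psql -d dvndb -c "update datasetversion set archivalcopylocation = null where id = <datasetversion_id>;"
```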

  • The local, S3, and Google archivers have been updated to support deleting existing archival files for a version to allow re-creating the bag for a given version.
  • For archivers that support file deletion, it is now possible to recreate an archival bag after "Update Current Version" has been used (replacing the original bag). By default, Dataverse will mark the current version's archive as out-of-date, but will not automatically re-archive it.
    • A new 'obsolete' status has been added to indicate when an archival bag exists for a version but it was created prior to an "Update Current Version" change.

Two potential tests: leave the flag off and verify that doing an update-current-version as a superuser results in the obsolete status. Then flip the flag on, try update-current again, and verify that archiving happens and the bag is updated (it should be a different size, and the OAI-ORE file will include your metadata change).
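The flag that enables automatic re-archiving after "Update Current Version" is defined in the individual PR (its name isn't repeated here), so the sketch below only shows how to inspect the resulting status - before the flag is enabled, the stored JSON should report the new obsolete status (database name is a placeholder):

```bash
# After a superuser uses Publish -> Update Current Version, check the stored status
curl -H "X-Dataverse-key: $API_TOKEN" \
  "$SERVER_URL/api/datasets/$DATASET_ID/$VERSION/archivalStatus"

# ...or look at it directly in the database
psql -d dvndb -c "select archivalcopylocation from datasetversion where id = <datasetversion_id>;"
```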

  • Improvements have been made to file retrieval for bagging, including retries on errors and when download requests are being throttled.

Things should work with throttling - I'm not sure how that is best configured for testing.

  • A bug causing :BagGeneratorThreads to be ignored has been fixed, and the default has been reduced to 2.

Set that parameter, include it in your local archiver config, and verify that a different number of parallel* files appears (temporarily) in the /tmp dir as the bag is zipped. The files go away once the final zip is done, so using a dataset big enough to allow time to check is key.
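A sketch of that check (the temp dir location depends on the JVM's java.io.tmpdir; parallel* is the working-file name pattern mentioned above):

```bash
# Change the thread count and pass it to the local archiver
curl -X PUT -d 4 "$SERVER_URL/api/admin/settings/:BagGeneratorThreads"
curl -X PUT -d ":BagItLocalPath, :BagGeneratorThreads" "$SERVER_URL/api/admin/settings/:ArchiverSettings"

# While a large dataset is being bagged, watch for the transient working files
watch -n 1 'ls -l /tmp/parallel* 2>/dev/null'
```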

  • Retrieval of files for inclusion in an archival bag is no longer counted as a download.

You should not see any download counts resulting from archiving.
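Two possible ways to observe this, both assumptions about the simplest check (the download-count endpoint exists in recent releases; the guestbookresponse query needs direct database access, with the database name as a placeholder):

```bash
# Dataset download count before and after archiving - it should not change
curl "$SERVER_URL/api/datasets/$DATASET_ID/download/count"

# Or confirm no new download records were created by the archiving run
psql -d dvndb -c "select count(*) from guestbookresponse where dataset_id = <dataset_id>;"
```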

  • It is now possible to require that all previous versions have been successfully archived before archiving of a newly published version can succeed. (This is intended to support use cases where deduplication of files between dataset versions will be done and is a step towards supporting the Oxford Common File Layout (OCFL).)

Set the flag and verify that trying to archive the latest version fails if any prior version was not archived successfully (its status is null or failure).
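One way to see the prior-version statuses that the new check looks at (column names per the current schema; database name is a placeholder):

```bash
# A null or failure status on any earlier version should now cause archiving of the
# latest version to end with a failure status rather than success
psql -d dvndb -c "select versionnumber, minorversionnumber, archivalcopylocation
                  from datasetversion where dataset_id = <dataset_id>
                  order by versionnumber, minorversionnumber;"
```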

  • The pending status has changed to use the same JSON format as other statuses.

You could verify that the pending status in the db (which is only there temporarily) is a JSON structure, not just the word pending.
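For example (you have to catch it while archiving is still running; database name is a placeholder):

```bash
# Expect a JSON object such as {"status":"pending","message":"..."} rather than the bare word
psql -d dvndb -c "select archivalcopylocation from datasetversion where id = <datasetversion_id>;"
```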

OAI-ORE Export Updates

  • The export now uses URIs for checksum algorithms, conforming with JSON-LD requirements.

Look at the checksums in the OAI_ORE export (in the bag, or in the metadata export after publishing).
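For instance, pulling the OAI-ORE export for a published dataset and filtering for the checksum entries (the grep is just a quick filter; the @type values should now be URIs):

```bash
curl -s "$SERVER_URL/api/datasets/export?exporter=OAI_ORE&persistentId=$PERSISTENT_ID" \
  | jq . | grep -i -A1 checksum
```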

  • The https://schema.org/additionalType value has been updated to "Dataverse OREMap Format v1.0.2" to reflect the format changes.

You should be able to see this line in the OAI-ORE export.
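Using the same export call as above:

```bash
curl -s "$SERVER_URL/api/datasets/export?exporter=OAI_ORE&persistentId=$PERSISTENT_ID" \
  | jq . | grep additionalType
# expect a value of "Dataverse OREMap Format v1.0.2"
```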

Archival Bag (BagIt) Updates

  • The bag-info.txt file now correctly includes information for dataset contacts, fixing a bug where nothing was included when multiple contacts were defined. (Multiple contacts were always included in the OAI-ORE file in the bag; only the bag-info.txt file was affected.)
  • Values used in the bag-info.txt file that may be multi-line (i.e. with embedded CR or LF characters) are now properly indented and wrapped per the BagIt specification (Internal-Sender-Identifier, External-Description, Source-Organization, Organization-Address).
  • The dataset name is no longer used as a subdirectory within the data/ directory to reduce issues with unzipping long paths on some filesystems.

These should be visible by inspection. One test is to have more than one contact; another is to make sure you have multi-line entries for some/all of the listed fields. It should also be obvious, by looking at the bag structure, that the title is not used as a folder name.
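Inspection can be as simple as the following (the zip name is a placeholder for whatever the local archiver wrote to :BagItLocalPath):

```bash
unzip -l <bag>.zip                         # data/ should hold the files directly, no title-named subfolder
unzip -p <bag>.zip '*bag-info.txt' | cat   # check contacts and the wrapped multi-line values
```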

  • For dataset versions with no files, the empty manifest-<alg>.txt file will now use the algorithm from the :FileFixityChecksumAlgorithm setting instead of defaulting to MD5.

Publish a dataset with no files and a non-MD5 algorithm set, and verify the manifest file name is correct.
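For example, with SHA-512 (one of the documented values for :FileFixityChecksumAlgorithm; the zip name is a placeholder):

```bash
curl -X PUT -d SHA-512 "$SERVER_URL/api/admin/settings/:FileFixityChecksumAlgorithm"
# ...create and publish a dataset with no files, then check the bag:
unzip -l <bag>.zip | grep manifest-        # expect manifest-sha512.txt, not manifest-md5.txt
```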

  • A new key, Dataverse-Bag-Version, has been added to bag-info.txt with the value "1.0" to allow for tracking changes to Dataverse's archival bag generation over time.

Another one to check by inspecting the bag-info.txt file.

  • When using the holey bag option discussed above, the required fetch.txt file will be included.

For this and the related "files outside the zip" option, set the size limits (per file or total per dataset) and verify that, when you're above that limit, the extra files are either sent outside the zipped bag (in a directory of the same name - if you unzip the zip in place, the files sent separately should end up in the right place to make the bag complete) or, if you select holey=true, are listed in the fetch.txt file.
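The settings controlling the size limits and the holey option are defined in the individual PRs (names not repeated here); once they're set, the result can be inspected like this (fetch.txt lines follow the BagIt spec's "<url> <length> <path>" form; names in angle brackets are placeholders):

```bash
# holey=true: the oversized files are referenced rather than included
unzip -p <bag>.zip '*fetch.txt'

# holey=false: the oversized files sit outside the zip, in a directory named after the bag,
# so unzipping the zip in place yields a complete bag
ls -R <archive-dir>
```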

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: Aggregated from all the other PRs.

Additional documentation:
