DANS QDR Merged ORE/Bag changes for QA#12167
Open
qqmyers wants to merge 98 commits into IQSS:develop from DANS-QDR-merged_bag_changes_for_QA
Conversation
Spec doesn't allow empty lines, and dropping whitespace-only lines seems reasonable as well (users can't see from the Dataverse display whether an empty line would appear in bag-info.txt or not if we allow whitespace-only lines, or whitespace beyond the 78-char wrap limit).
affects manifest and pid-mapping files as well as data file placement
Added unit tests for multilineWrap
This reverts commit 884b81b.
What this PR does / why we need it: As discussed in Tech Hour, etc., there have been many changes to the OAI_ORE export and archival Bag generation code driven by requirements at QDR and DANS that have been made into individual PRs for review. Since the end goal is for all of these to work together, and because it would be easier to set up/configure archival bag workflows once, I've created this PR that should have all the changes from the others.
Specifically, this PR is a merge of
#12144, #12122, #12129, and #12099 with
#12144 building on #12133, which builds on #12063, and
#12122 building on #12103, which builds on #12101, and
#12129 building on #12104 and #12103
Tracking the review statuses:
Which issue(s) this PR closes:
Not sure if the magic happens for PRs:
Special notes for your reviewer: The individual PRs should be reviewed first. This PR only includes some minor updates required to merge the different threads - see the non-merge commits from Feb 17. Assuming the earlier reviews covered the basic design, I think only some sanity checking of those commits is needed here.
Suggestions on how to test this: Nominally QA should check all the changes across all of the other PRs. I'll try to combine the list of changes/test instructions here to help. Overall, the PR makes changes that impact all of the different archivers - local, s3, google, DRS, Duracloud. Since the core team isn't set up to easily test all of those, I'd suggest just testing the local archiver and asking QDR (me) to test the Google archiver and DANS to test the S3 one. The DRS archiver was specifically built for Harvard but the integration with DRS has been put on hold, so I'm not sure there's a way to test it now. The Duracloud archiver was created for TDL - I'm not sure if they're interested in testing now or not (not sure they'll continue using this archiver), but I'll check.
General: Configure a local archiver and set up a post-publication workflow using it per the guides.
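As a rough sketch of that setup, the local archiver can be configured through the standard Dataverse admin settings API (localhost:8080 and the bag directory path are assumptions; check the Installation Guide's archiving section for your install):

```shell
# Select the local filesystem archiver (class name per the guides)
curl -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.LocalSubmitToArchiveCommand" \
  http://localhost:8080/api/admin/settings/:ArchiverClassName

# List the settings the archiver reads, then set where bags are written
# (the path below is an example; use a directory writable by the app server)
curl -X PUT -d ":BagItLocalPath" \
  http://localhost:8080/api/admin/settings/:ArchiverSettings
curl -X PUT -d "/usr/local/payara/bags" \
  http://localhost:8080/api/admin/settings/:BagItLocalPath
```

A post-publication workflow that calls the archiver is then registered and set as the default PostPublishDataset workflow via the admin workflows API, per the guides.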
For specific changes in the PR:
General Archiving Improvements
Try archiving something with a significant number of files/overall size - you should be able to refresh the page and look at the version table as a superuser and see the pending status. You should also be able to see :BagGeneratorThreads number of parallel* files in the temp dir as the dataset files are being zipped.
Eventually, the archiving should succeed. There should be no log errors, and the post-publish indexing and export-all should also finish and update the last-indexed/last-exported times in the db. The archivalcopylocation of the datasetversion should be updated with the final status of archiving (success).
The submit button should not show for prior versions when the latest has not been archived. (You can edit the archivalcopylocation in the db for versions - e.g. set it to null to remove a success from the post-publish workflow so the Submit button will reappear, as it looks like archiving hasn't been attempted yet. Alternately, you could turn the post-publish workflow on/off to archive only certain versions.)
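For the db edit mentioned above, a hypothetical psql one-liner (database/user names and the version id are placeholders; the table and column names are as referenced in this PR):

```shell
# Clear the archival status of one dataset version so the Submit button
# reappears - dvndb/dvnapp and id=42 are example values, adjust to your install
psql -U dvnapp dvndb -c \
  "UPDATE datasetversion SET archivalcopylocation = NULL WHERE id = 42;"
```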
Two potential tests: leave the flag off and verify that doing an update-current-version as a superuser results in the obsolete status. Then flip the flag on, try update-current again, and verify that archiving happens and the bag is updated (it should be a different size, and the OAI-ORE file will have your metadata change).
Things should work with throttling - not sure how that is best configured for testing.
A bug causing :BagGeneratorThreads to be ignored has been fixed, and the default has been reduced to 2. Set that parameter, include it in your local archiver config, and verify that a different number of parallel* files appears (temporarily) in the /tmp dir as the bag is zipped. The files go away once the final zip is done, so using a big enough dataset to have time to check is key.
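A possible way to exercise this, assuming the admin settings API on localhost:8080 and that :BagGeneratorThreads has been added to your :ArchiverSettings list as described above:

```shell
# Raise the zip thread count from the new default of 2
curl -X PUT -d 4 http://localhost:8080/api/admin/settings/:BagGeneratorThreads

# While a large dataset is being archived, the scratch files should
# briefly be visible (they are removed once the final zip is written)
ls -l /tmp/parallel*
```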
Should not see any download counts due to archiving.
Set the flag and verify that trying to archive the latest version fails if any prior version was not archived successfully (status is null, or failure).
Could verify that the pending status in the db (which is only there temporarily) is a JSON structure, not just the word "pending".
OAI-ORE Export Updates
Look at the checksums in the OAI_ORE export (in the bag or the metadata export after publishing)
The https://schema.org/additionalType value has been updated to "Dataverse OREMap Format v1.0.2" to reflect format changes. Should be able to see this line in the OAI-ORE.
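One quick way to check, using the metadata export API on a published dataset (the host and DOI below are placeholders):

```shell
# Fetch the OAI_ORE export and look for the updated additionalType value
curl -s "http://localhost:8080/api/datasets/export?exporter=OAI_ORE&persistentId=doi:10.5072/FK2/EXAMPLE" \
  | grep additionalType
```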
Archival Bag (BagIt) Updates
The bag-info.txt file now correctly includes information for dataset contacts, fixing a bug where nothing was included when multiple contacts were defined. (Multiple contacts were always included in the OAI-ORE file in the bag; only the bag-info.txt file was affected.) Entries in the bag-info.txt file that may be multi-line (i.e. with embedded CR or LF characters) are now properly indented and wrapped per the BagIt specification (Internal-Sender-Identifier, External-Description, Source-Organization, Organization-Address). The dataset title is no longer used as a folder name within the data/ directory, to reduce issues with unzipping long paths on some filesystems. Should be able to see these by inspection: one test is to have more than one contact; another is to make sure you have multi-line entries for some/all of the listed fields. It should also be obvious the title is not used as a folder name by looking at the bag structure.
The manifest-<alg>.txt file will now use the algorithm from the :FileFixityChecksumAlgorithm setting instead of defaulting to MD5. Publish a dataset with no files and a non-MD5 algorithm set, and verify the file name is correct.
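A sketch of that check, again assuming the admin settings API on localhost:8080 (the bag path is a placeholder, and the expected lowercase manifest name follows the BagIt convention):

```shell
# Switch the fixity algorithm away from MD5 before publishing
curl -X PUT -d "SHA-256" \
  http://localhost:8080/api/admin/settings/:FileFixityChecksumAlgorithm

# After archiving, the bag should contain manifest-sha256.txt, not manifest-md5.txt
unzip -l /usr/local/payara/bags/example-bag.zip | grep manifest-
```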
A new entry, Dataverse-Bag-Version, has been added to bag-info.txt with the value "1.0" to allow for tracking changes to Dataverse's archival bag generation over time. Another test: inspect the bag-info.txt file.
If the holey bag option discussed above is used, the required fetch.txt file will be included. For this and the related files-outside-the-zip changes, set the size limits (per file or total per dataset) and verify that when you're above that limit, the extra files are either sent outside the zipped bag (in a dir of the same name - if you unzip the zip in place, the files sent separately should be in the right place to make the bag complete) or, if you select holey=true, are listed in the fetch.txt file.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: Aggregated from all the other PRs.
Additional documentation: