Fundamental source code structure for checking resources external to Reactome by jweiser · Pull Request #1 · reactome/release-external-dependencies

jweiser · 2020-08-18T20:02:27Z

The code is structured so that the ResourceParser class takes in a CSV or JSON file whose records describe the properties of each resource and returns a list of Resource objects. The Resource objects are then iterated over, and the proper type of ResourceChecker is created for each one using the ResourceCheckerFactory. The ResourceChecker does its work of ensuring the resource exists and is valid, and a report is produced.
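As a rough illustration of that flow, here is a minimal sketch using simplified stand-ins. Everything in it is an assumption made for illustration: the `Resource` fields, the `ResourceType` values, the `createChecker` method name, the `resourcePassesAllChecks` method, and the `simulatedReachable` flag that replaces real network checks. Parsing is omitted and the list is built by hand; the report here is a list of strings rather than the PR's actual report type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the classes named in the description.
class PipelineSketch {
    enum ResourceType { FTP_FILE, WEB_PAGE }

    static class Resource {
        final String name;
        final ResourceType type;
        final boolean simulatedReachable; // stands in for a real network check

        Resource(String name, ResourceType type, boolean simulatedReachable) {
            this.name = name;
            this.type = type;
            this.simulatedReachable = simulatedReachable;
        }
    }

    interface ResourceChecker {
        boolean resourcePassesAllChecks();
    }

    // Factory step: choose a checker based on the resource type.
    static ResourceChecker createChecker(Resource resource) {
        switch (resource.type) {
            case FTP_FILE:
                return () -> resource.simulatedReachable; // a real checker would use FTP
            case WEB_PAGE:
                return () -> resource.simulatedReachable; // a real checker would use HTTP
            default:
                throw new IllegalArgumentException("Unknown type: " + resource.type);
        }
    }

    // Iterate over the resources, check each one, and collect a simple report.
    static List<String> check(List<Resource> resources) {
        List<String> report = new ArrayList<>();
        for (Resource resource : resources) {
            ResourceChecker checker = createChecker(resource);
            String outcome = checker.resourcePassesAllChecks() ? "PASSED" : "FAILED";
            report.add(resource.name + ": " + outcome);
        }
        return report;
    }
}
```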

The code is still a work in progress (some aspects of Main and specific types of ResourceChecker classes are still being written), but it is far enough along for a review and for merging the existing source.

SolomonShorser-OICR · 2020-08-18T20:15:48Z

src/main/resources/External_Resources.json

@@ -0,0 +1,1030 @@
+[


Why is there a JSON file? Didn't I just see the same content in a CSV file?

Yes, initially I had the resource information in a Google Sheet and was building the parser based on CSV, but I realized JSON was a better way to go. The application supports both right now, but I may move it to entirely JSON.

Created issue #6

src/main/java/org/reactome/release/Resource.java

SolomonShorser-OICR · 2020-08-18T20:22:48Z

src/main/java/org/reactome/release/resourcechecker/FTPFileResourceChecker.java

+		FTPClient ftpClient = new FTPClient();
+
+		ftpClient.connect(getFtpServer());
+		ftpClient.enterLocalPassiveMode();


Might need an option to toggle this (passive/active mode) for specific resources. It's been a problem in the past, hasn't it?

Yes - I'll add it in if any of the resources accessed by FTP won't work without active mode during testing.

Created issue #5

src/main/java/org/reactome/release/resourcechecker/FileResourceChecker.java

src/main/java/org/reactome/release/resourcechecker/WebPageResourceChecker.java

cookersjs · 2020-08-20T17:27:36Z

pom.xml

+		<!--
+		integration tests run by default when running the command "mvn verify"; can be overridden using the option
+		"-DskipITs=true"
+		-->
+		<skipITs>false</skipITs>


It looks like they can also just change this <skipITs> value.

Yes, that's true, but changing the default value in the file changes the configuration in a more permanent way. I set skipITs to false because I think we would like to run integration tests by default, but the command-line option gives a convenient way to override that value on a one-time basis without affecting the default behaviour of running all tests (integration and unit tests).

src/main/java/org/reactome/release/Main.java

cookersjs · 2020-08-20T17:31:10Z

src/main/java/org/reactome/release/Main.java

+
+

Yeah :). I'll remove it. The Main class is under development and still messy :).

cookersjs · 2020-08-20T17:34:03Z

src/main/java/org/reactome/release/Resource.java

+		return resourceInfoJson;
+	}
+
+	private void checkMandatoryAttributesExist(String originalRecord) throws IllegalArgumentException {


I know you said this is a WIP, so you probably plan to do this anyway, but for this method I'm a little unsure how it accomplishes what it does. I get the gist of it from the name of the method, but putting all these values into the Map and such is something I don't quite understand -- could comments be added?

Is it the how or the why that's unclear?

I would say both, but more of the 'how'.

The map holds the attributes of the resource that should be included in the JSON that describes the resource. The key for the map is the name of the attribute that should be there and the value is the call to the appropriate "getter" method which should be able to provide that value since the record is now stored in the Resource object after being constructed.

If any of the values in the map is null or empty (checked in the for-loop iterating over the map's entries), something is wrong, because a value that should have been available after the construction of the Resource object is not. This triggers an IllegalStateException because the Resource object is not in the expected state of having values for fundamental information about the resource it represents. Either the JSON provided didn't specify that information, or something went wrong in providing it during the construction of the object. Either way, it's an illegal state and should raise an exception so that the problem with the resource info and/or code gets investigated.
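To make the "how" concrete, here is a minimal sketch of the pattern described above. It is not the method from the PR: the class name `ResourceSketch`, the attribute names `name` and `url`, and the exception message are assumptions chosen for illustration. It follows the explanation in throwing an IllegalStateException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Resource with two assumed mandatory attributes.
class ResourceSketch {
    private final String name;
    private final String url;

    ResourceSketch(String name, String url, String originalRecord) {
        this.name = name;
        this.url = url;
        // Validate only after the fields are set, so the getters can be used.
        checkMandatoryAttributesExist(originalRecord);
    }

    String getName() { return name; }
    String getUrl() { return url; }

    private void checkMandatoryAttributesExist(String originalRecord) {
        // Key: name of the attribute that must be present.
        // Value: what the corresponding getter returns for this object.
        Map<String, String> mandatoryAttributes = new LinkedHashMap<>();
        mandatoryAttributes.put("name", getName());
        mandatoryAttributes.put("url", getUrl());

        for (Map.Entry<String, String> attribute : mandatoryAttributes.entrySet()) {
            String value = attribute.getValue();
            if (value == null || value.isEmpty()) {
                throw new IllegalStateException(
                    "Missing mandatory attribute '" + attribute.getKey()
                        + "' in record: " + originalRecord);
            }
        }
    }
}
```

The benefit of the map is that adding another mandatory attribute is a one-line change, and the error message can name exactly which attribute was missing.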

cookersjs · 2020-08-20T17:39:13Z

src/main/java/org/reactome/release/Resource.java

+//	public String getHeaderAsTSVString() {
+//		return String.join("\t", getHeaderNameToValueMap().keySet());
+//	}
+//
+//	public String getValuesAsTSVString() {
+//		return String.join("\t", getHeaderNameToValueMap().values());
+//	}


No, actually this was when the output was going to be in CSV format. Good catch - I'll remove it.

SolomonShorser-OICR · 2020-08-20T17:45:47Z

src/main/java/org/reactome/release/resourcechecker/FileResourceChecker.java

+	 */
+	default boolean isFileSizeAcceptable(long previousFileSize, double acceptablePercentageDrop) {
+		long differenceInFileSize = getFileSize() - previousFileSize;
+		double percentChangeInFileSize = differenceInFileSize * 100.0d / previousFileSize;


Does it ever happen that a resource is suddenly much larger than it used to be? I think that is covered by this code but I'm not sure... Maybe change the name of acceptablePercentageDrop to acceptablePercentageChange?

A file increase of any amount is currently acceptable, but you're right - I want to implement something to alert us to a suspicious jump in size. I'm just thinking of what the threshold would/should be.

Created issue #2
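One possible shape for such a two-sided check is sketched below. This is only an illustration of the idea tracked in issue #2, not code from the PR: the class name, the method name `isFileSizeChangeAcceptable`, the separate `acceptablePercentageIncrease` parameter, and the inclusive boundaries are all assumptions, and no particular threshold for increases is implied.

```java
// Hypothetical sketch: flag both suspicious drops and suspicious jumps in file size.
class FileSizeChangeSketch {
    static boolean isFileSizeChangeAcceptable(
            long currentFileSize,
            long previousFileSize,
            double acceptablePercentageDrop,
            double acceptablePercentageIncrease) {
        if (previousFileSize <= 0) {
            throw new IllegalArgumentException("previousFileSize must be positive");
        }
        // Same percent-change formula as in the diff above; negative means the file shrank.
        double percentChange =
            (currentFileSize - previousFileSize) * 100.0d / previousFileSize;
        return percentChange >= -acceptablePercentageDrop
            && percentChange <= acceptablePercentageIncrease;
    }
}
```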

cookersjs · 2020-08-20T17:49:28Z

src/main/java/org/reactome/release/resourcechecker/FileResourceChecker.java

+	 * @see #isFileSizeAcceptable(long, double)
+	 */
+	default boolean isFileSizeAcceptable(long previousFileSize) {
+		final double acceptableFileSizePercentageDrop = 5.0;


I know for older release files we had the heuristic of a 10% change in file sizes being deemed suspicious. Any reason for the 5% threshold now? Also, should this value be configurable?

Yeah, I thought of 10% initially, but I wanted to err on the side of caution a bit more, so I went (somewhat arbitrarily) with 5%, with the caveat that we can change it as we gain experience with the feedback from the automated checking.

As for configuration, this method is overloaded, allowing you to pass in a second parameter: the percentage drop threshold beyond which the check fails.
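The overload pattern can be sketched as follows. The two method signatures and the 5.0 default come from the diff excerpts above; the interface name `FileSizeCheckerSketch`, the delegation from the one-argument method, and the final comparison are assumptions, since the diff does not show those lines.

```java
// Sketch of an interface with overloaded default methods (interface name is hypothetical).
interface FileSizeCheckerSketch {
    long getFileSize();

    // One-argument form: uses the default threshold of 5%.
    default boolean isFileSizeAcceptable(long previousFileSize) {
        final double acceptableFileSizePercentageDrop = 5.0;
        return isFileSizeAcceptable(previousFileSize, acceptableFileSizePercentageDrop);
    }

    // Two-argument form: the caller supplies the threshold.
    default boolean isFileSizeAcceptable(long previousFileSize, double acceptablePercentageDrop) {
        long differenceInFileSize = getFileSize() - previousFileSize;
        double percentChangeInFileSize = differenceInFileSize * 100.0d / previousFileSize;
        // Assumed comparison: any increase passes; a drop passes only within the threshold.
        return percentChangeInFileSize >= -acceptablePercentageDrop;
    }
}
```

For example, with `FileSizeCheckerSketch checker = () -> 920L;` the call `checker.isFileSizeAcceptable(1000L)` fails the default 5% threshold (the file shrank by 8%), while `checker.isFileSizeAcceptable(1000L, 10.0)` passes.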

src/main/java/org/reactome/release/resourcechecker/ResourceCheckerFactory.java

cookersjs · 2020-08-20T17:54:34Z

src/main/java/org/reactome/release/resourcechecker/WebPageResourceChecker.java

+//		Path exampleFile = Paths.get(".", "example.html");
+//		Files.deleteIfExists(exampleFile);
+//		Files.write(exampleFile, page.getBytes(), StandardOpenOption.CREATE);


Removable?

SolomonShorser-OICR · 2020-08-20T17:56:16Z

src/main/java/org/reactome/release/resourcechecker/FileResourceChecker.java

+			while (fileSize > BYTE_UNIT_CONVERSION_FACTOR) {
+				magnitudeOfByteConversionFactor += 1;
+				fileSize = fileSize / BYTE_UNIT_CONVERSION_FACTOR;
+			}


Since the number of possible magnitudes is fixed (and relatively small), a loop is not really necessary here. A few if-else statements would work just as well.

Okay - I'll take a look at reworking this.
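A loop-free version along the lines suggested might look like the sketch below. It is an illustration only: the value 1024 for BYTE_UNIT_CONVERSION_FACTOR, the unit labels, the class and method names, and the output format are assumptions. It also uses `>=`, so exactly 1024 bytes is reported as 1.0 KB, whereas the loop in the diff uses `>`.

```java
import java.util.Locale;

// Hypothetical sketch: choose the byte unit with a fixed chain of comparisons.
class ByteUnitSketch {
    private static final long BYTE_UNIT_CONVERSION_FACTOR = 1024L; // assumed value

    private static final long KB = BYTE_UNIT_CONVERSION_FACTOR;
    private static final long MB = KB * BYTE_UNIT_CONVERSION_FACTOR;
    private static final long GB = MB * BYTE_UNIT_CONVERSION_FACTOR;
    private static final long TB = GB * BYTE_UNIT_CONVERSION_FACTOR;

    static String toReadableSize(long fileSizeInBytes) {
        // Largest unit first, so each size matches exactly one branch.
        if (fileSizeInBytes >= TB) {
            return format(fileSizeInBytes, TB, "TB");
        } else if (fileSizeInBytes >= GB) {
            return format(fileSizeInBytes, GB, "GB");
        } else if (fileSizeInBytes >= MB) {
            return format(fileSizeInBytes, MB, "MB");
        } else if (fileSizeInBytes >= KB) {
            return format(fileSizeInBytes, KB, "KB");
        }
        return fileSizeInBytes + " B";
    }

    private static String format(long bytes, long unitSize, String unitLabel) {
        // Locale.ROOT keeps the decimal separator a period regardless of system locale.
        return String.format(Locale.ROOT, "%.1f %s", (double) bytes / unitSize, unitLabel);
    }
}
```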

cookersjs · 2020-08-20T17:59:39Z

Regarding the external resource JSON/CSV files -- is there a plan for keeping everything in sync? Is the envisioned usage to run this new module to confirm everything will work for release, and, in cases where it flags that it couldn't find a file, we would be notified and need to change this module's files as well as the affected release step?

jweiser · 2020-08-20T18:26:15Z

I think the CSV file is going to be phased out and replaced with a JSON file. I'm still giving some thought to the best way to keep the reference to the release process up to date in a straightforward way.

For the usage, yes, this project is designed to check things outside of Reactome needed/used by the release process and give us warning that something is missing or incorrect and we need to contact the maintainer of the resource and/or change something on our side.

SolomonShorser-OICR

There are a few outstanding issues that are not yet resolved:

CSV/JSON input file - please commit to a format.
Files from resources being larger than expected.
File size changes from resources not being captured in the input file that should track this.

If they are not resolved within this PR, that's OK, but in that case I'd like to see separate issues opened to track them before approving this PR, just so these issues don't get lost.

jweiser · 2020-08-20T21:07:56Z

Fair enough. I don't want this PR to get too large, so I'll create issues for them and reference them from this PR before it's closed.

jweiser · 2020-08-26T00:46:29Z

Issues for these have been opened:

#6: CSV/JSON input file - please commit to a format.
#2: Files from resources being larger than expected.
#7: Filesize changes from resources not being captured in the input file that should track this.

jweiser and others added 29 commits June 16, 2020 11:35

adds starting source code

a80ff62

ignores Maven "target" directory

db42f8d

Add README.md

a86cbc5

Add Apache 2.0 LICENSE

17f58a8

reorganizes ResourceCheckers using interfaces

6834d03

makes ResourceType enum public

385f7b8

creates new ResourceChecker classes for RESTfulAPI and Web Pages

2276751

updates Main class to produce output for debugging

8d9c4bf

adds dependency used by WebPageResourceChecker for web scraping

720149a

Merge branch 'master' of gitlab.com:jdweiser/release-external-dependencies

f100ed1

adds ability to parse JSON to a Resource object

d1428e8

adds ability to parse JSON array or resource into list of Resource objects

05360f9

updates methods to scrape website content

b98f096

updates resource raw data files

9efbcd4

updates Resource to be able to retrieve optional error and expected text

c5e1d1f

updates Main class to use JSON and to use either CSV or JSON

1368eac

adds necessary details for maven central deployment to pom.xml

8d259a3

refactors Resource class to store JSON as underlying data structure

cf67859

moves/adds logic for unit conversion of file size to ByteUnit enum

c98fdfd

removes getReport override to use FileResourceChecker impl. instead

e9adf73

adds logic to check resource checks pass and produce a report

cdc0302

changes methods to handle IOException at source rather than throwing

6e0024e

adds methods to check if resource passed checks and get post-check info

3ee4b8c

refactors logic to parent interface and changes method name to implement an interface method

82173bc

renames 'ResourceBuilder' class to 'ResourceParser'

15576cb

method to see if resource passes checks and temp. disables report

83945c2

adds expected file size for an FTP file in test resource JSON

1b82574

changes HTTP file and Web Page checkers to use default getReport from interfaces

9edddf5

returns report as a JSONObject instead of a String

5a0385f

jweiser added the enhancement label Aug 18, 2020

SolomonShorser-OICR reviewed Aug 18, 2020

src/main/java/org/reactome/release/Resource.java

SolomonShorser-OICR reviewed Aug 18, 2020

src/main/java/org/reactome/release/resourcechecker/FileResourceChecker.java (outdated)

SolomonShorser-OICR reviewed Aug 18, 2020

src/main/java/org/reactome/release/resourcechecker/WebPageResourceChecker.java

changes stream/filter/findFirst to array lookup

5f5ba7c