
mod-linked-data-import

Copyright (C) 2025 The Open Library Foundation

This software is distributed under the terms of the Apache License, Version 2.0. See the file "LICENSE" for more information.

Introduction

This module provides bulk import functionality for RDF data graphs into the mod-linked-data application. It reads RDF subgraphs in Bibframe 2 format, transforms them into the Builde vocabulary, and delivers them to mod-linked-data via Kafka.

Third party libraries used in this software

This software uses the following Weak Copyleft (Eclipse Public License 1.0 / 2.0) licensed software libraries:

How to Import Data

  1. Upload the RDF file to the S3 bucket specified by the S3_BUCKET environment variable.
  2. Inside that bucket, place the file within the subdirectory corresponding to the target tenant ID.
  3. Trigger the import by calling the following API:
POST /linked-data-import/start?fileUrl={fileNameInS3}&contentType=application/ld+json
x-okapi-tenant: {tenantId}
x-okapi-token: {token}

The response contains a job execution ID, which can later be used to retrieve the job status or download the failed lines.
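
For example, to start an import of a file named instances.jsonl for tenant diku (both names are illustrative):

POST /linked-data-import/start?fileUrl=instances.jsonl&contentType=application/ld+json
x-okapi-tenant: diku
x-okapi-token: {token}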

To check the import job status, use:

GET /linked-data-import/jobs/{jobExecutionId}
x-okapi-tenant: {tenantId}
x-okapi-token: {token}

The response includes job information such as:

  • startDate: Job start date and time
  • startedBy: User ID who started the job
  • status: Current job status (COMPLETED, STARTED, FAILED, etc.)
  • fileName: Name of the imported file
  • currentStep: Current processing step
  • linesRead: Total lines read from the file
  • linesMapped: Lines successfully mapped
  • linesFailedMapping: Lines failed during mapping
  • linesCreated: Resources created
  • linesUpdated: Resources updated
  • linesFailedSaving: Lines failed during saving
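
A response body might look like the following sketch (field values are illustrative, and the exact response shape may differ from this example):

{
  "startDate": "2025-06-01T10:15:30Z",
  "startedBy": "11111111-2222-3333-4444-555555555555",
  "status": "STARTED",
  "fileName": "instances.jsonl",
  "currentStep": "PROCESSING",
  "linesRead": 1000,
  "linesMapped": 990,
  "linesFailedMapping": 10,
  "linesCreated": 950,
  "linesUpdated": 40,
  "linesFailedSaving": 0
}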

To download failed RDF lines as a CSV file:

GET /linked-data-import/jobs/{jobExecutionId}/failed-lines
x-okapi-tenant: {tenantId}
x-okapi-token: {token}

The CSV file contains:

  • lineNumber: Line number in the original file
  • description: Error description
  • failedRdfLine: The RDF line content that failed
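
An illustrative excerpt of such a file (all values are hypothetical):

lineNumber,description,failedRdfLine
42,Mapping failed,"{""@id"": ""http://example.org/instances/42"", ...}"
57,Saving failed,"{""@id"": ""http://example.org/instances/57"", ...}"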

To cancel a running import job:

PUT /linked-data-import/jobs/{jobExecutionId}/cancel
x-okapi-tenant: {tenantId}
x-okapi-token: {token}

Note: The job will stop gracefully after completing the current processing chunk or step. It will not stop immediately.

File Format & Contents

  1. The file must be in JSON Lines (jsonl) format.
  2. Each line must contain a complete subgraph of a Bibframe Instance resource, as defined by the Bibframe 2 ontology.

For an example of a valid import file containing two RDF instances, see docs/example-import.jsonl.
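
A heavily simplified sketch of what a single line could contain (identifiers and the selection of properties are illustrative, not a complete Instance subgraph):

{"@context": {"bf": "http://id.loc.gov/ontologies/bibframe/"}, "@id": "http://example.org/instances/1", "@type": "bf:Instance", "bf:title": {"@type": "bf:Title", "bf:mainTitle": "Example title"}}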

Limitations

  1. Only RDF data serialized as application/ld+json is supported. Support for additional formats (e.g., XML, N-Triples) may be added in the future.
  2. Only Bibframe Instances and their connected resources can be imported. Standalone resources—such as a Person not linked to any Instance—cannot be processed.

Batch processing

File contents are processed in batches. Batch processing can be configured using the following environment variables:

  1. CHUNK_SIZE: Number of lines read from the input file per chunk
  2. OUTPUT_CHUNK_SIZE: Number of Graph resources sent to Kafka per chunk
  3. PROCESS_FILE_MAX_POOL_SIZE: Maximum threads used for parallel chunk processing
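
For example, to read 1000 lines per chunk, publish 100 resources to Kafka per chunk, and allow up to 4 chunks to be processed in parallel (values are illustrative; defaults are listed in the environment variable table below):

CHUNK_SIZE=1000
OUTPUT_CHUNK_SIZE=100
PROCESS_FILE_MAX_POOL_SIZE=4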

Interaction with mod-linked-data

mod-linked-data uses the Builde vocabulary for representing graph data.

During import:

  1. This module transforms Bibframe 2 subgraphs into the equivalent Builde subgraph using the lib-linked-data-rdf4ld library.
  2. The transformed subgraphs are published to the Kafka topic specified by the KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC environment variable.
  3. mod-linked-data consumes messages from this topic, performs additional processing, and persists the graph to its database.
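
For illustration, assuming the usual FOLIO convention of prefixing Kafka topic names with the environment and tenant (verify against your deployment), a setup with ENV=folio and tenant diku would publish transformed subgraphs to folio.diku.linked_data_import.output and receive processing results on folio.diku.linked_data_import.result, which matches the default KAFKA_IMPORT_RESULT_EVENT_TOPIC_PATTERN of (${ENV}\.)(.*\.)result.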

Dependencies on libraries

This module depends on the following libraries:

Compiling

mvn clean install

Skip tests:

mvn clean install -DskipTests

Environment variables

This module uses S3-compatible storage for files: both AWS S3 and MinIO Server are supported. The S3_IS_AWS variable specifies which one is used; it defaults to false, meaning MinIO Server is used as the file storage, and must be set to true when AWS S3 is used.

| Name | Default value | Description |
|------|---------------|-------------|
| SERVER_PORT | 8081 | Server port |
| DB_USERNAME | postgres | Database username |
| DB_PASSWORD | postgres | Database password |
| DB_HOST | postgres | Database host |
| DB_PORT | 5432 | Database port |
| DB_DATABASE | okapi_modules | Database name |
| DB_MAXPOOLSIZE | 100 | Maximum database connection pool size |
| KAFKA_HOST | kafka | Kafka broker host |
| KAFKA_PORT | 9092 | Kafka broker port |
| KAFKA_CONSUMER_MAX_POLL_RECORDS | 100 | Maximum number of records returned in a single poll |
| KAFKA_SECURITY_PROTOCOL | PLAINTEXT | Kafka security protocol |
| KAFKA_SSL_KEYSTORE_PASSWORD | - | Kafka SSL keystore password |
| KAFKA_SSL_KEYSTORE_LOCATION | - | Kafka SSL keystore location |
| KAFKA_SSL_TRUSTSTORE_PASSWORD | - | Kafka SSL truststore password |
| KAFKA_SSL_TRUSTSTORE_LOCATION | - | Kafka SSL truststore location |
| ENV | folio | Environment name used in Kafka topic names |
| KAFKA_RETRY_INTERVAL_MS | 2000 | Kafka retry interval in milliseconds |
| KAFKA_RETRY_DELIVERY_ATTEMPTS | 6 | Number of Kafka delivery retry attempts |
| KAFKA_IMPORT_RESULT_EVENT_CONCURRENCY | 1 | Number of concurrent consumers for import result events |
| KAFKA_IMPORT_RESULT_EVENT_TOPIC_PATTERN | (${ENV}\.)(.*\.)result | Kafka topic pattern for import result events |
| KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC | linked_data_import.output | Kafka topic where the transformed subgraph is published for mod-linked-data |
| KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC_PARTITIONS | 3 | Number of partitions for the output topic |
| KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC_REPLICATION_FACTOR | - | Replication factor for the output topic |
| KAFKA_LINKED_DATA_IMPORT_RESULT_TOPIC | linked_data_import.result | Kafka topic for import processing results |
| KAFKA_LINKED_DATA_IMPORT_RESULT_TOPIC_PARTITIONS | 3 | Number of partitions for the result topic |
| KAFKA_LINKED_DATA_IMPORT_RESULT_TOPIC_REPLICATION_FACTOR | - | Replication factor for the result topic |
| S3_URL | http://127.0.0.1:9000/ | S3 URL |
| S3_REGION | - | S3 region |
| S3_BUCKET | - | S3 bucket |
| S3_ACCESS_KEY_ID | - | S3 access key |
| S3_SECRET_ACCESS_KEY | - | S3 secret key |
| S3_IS_AWS | false | Specifies whether AWS S3 is used as file storage |
| CHUNK_SIZE | 1000 | Number of lines read from the input file per chunk |
| OUTPUT_CHUNK_SIZE | 100 | Number of Graph resources sent to Kafka per chunk |
| JOB_POOL_SIZE | 1 | Number of concurrent import jobs |
| PROCESS_FILE_MAX_POOL_SIZE | 1000 | Maximum threads used for parallel chunk processing |
| DATA_CLEANUP_CRON | 0 0 2 * * * | Cron expression for automatic cleanup of completed job data (daily at 2 AM) |
| DATA_CLEANUP_AGE_DAYS | 2 | Number of days after which job data is eligible for cleanup |

Further information

Issue tracker

See project MODLDI at the FOLIO issue tracker.
