Skip to content

Imports the discogs.com monthly XML dumps into databases

License

Notifications You must be signed in to change notification settings

philipmat/discogs-xml2db

Repository files navigation

discogs-xml2db v2

discogs-xml2db is a python program for importing discogs data dumps into several databases.

It exports to CSV as an intermediate step.
It can then import these CSV files into MySQL, PostgreSQL, and SQLite through provided scripts/instructions.

Instructions for importing into MongoDB, though these are untested.

Let us know how it goes!

Experimental version

In parallel to the original Python codebase, we're working on a parser/exporter that's even faster. This is a complete rewrite in C #, and initial results are highly promising:

File Record Count Python C#
discogs_20200806_artists.xml.gz 7,046,615 6:22 2:35
discogs_20200806_labels.xml.gz 1,571,873 1:15 0:22
discogs_20200806_masters.xml.gz 1,734,371 3:56 1:57
discogs_20200806_releases.xml.gz 12,867,980 1:45:16 42:38

If you're interested in testing one of these versions, read more about it in the .NET Parser README or grab the appropriate binaries from the Releases page.

While this version does not have complete feature-parity with the Python version (yet), the core export-to-csv is there, and it's likely it will eventually replace it.

DotNet Build

Running discogs-xml2db

Build Status - develop

Requirements

discogs-xml2db requires python3 and some python modules.
Additionally, the bash shell is used for automating some tasks.

Note: the minimum version tested is the one that still receives security updates according to https://endoflife.date/python. It might work with earlier versions, but no guarantees.

Importing to some databases may require additional dependencies, see the documentation for your target database below.

It's best that a Python virtual environment is created to install the required modules in a safe location, which does not require elevated security permissions:

# Create a virtual environment and activate it
$ python3 -m venv .discogsenv

# Activate virtual environment
# On Linux/macOS:
$ source .discogsenv/bin/activate
# on Windows, in Powershell
$ .discogsenv\Scripts\Activate.ps1

# Install requirements:
(.discogsenv) $ pip3 install -r requirements.txt

Installation instruction for other platforms can be found in the pip documentation.

Downloading discogs dumps

Download the latest dump files from discogs manually from discogs or run get_latest_dumps.sh.

To check the files' integrity, download the appropriate checksum file from https://data.discogs.com/, place it in the same directory as the dumps and compare the checksums.

# run in folder where the data dump files have been downloaded
$ sha256sum -c discogs_*_CHECKSUM.txt

Converting dumps to CSV

Run run.py to convert the dump files to csv.

There are two run modes:

  1. You can point it to a directory where the discogs dump files are and use one or multiple --export options to indicate which files to process:
# ensure the virtual environment is active
(.discogsenv) $ python3 run.py \
  --bz2 \ # compresses resulting csv files
  --apicounts \ # provides more accurate progress counts
  --export artist --export label --export master --export release \
  --output csv-dir    # folder where to output the csv files
  dump-dir \ # folder where the data dumps are
  1. You can specify the individual files instead:
# ensure the virtual environment is active
(.discogsenv) $ python3 run.py \
  --bz2 \ # compresses resulting csv files
  --apicounts \ # provides more accurate progress counts
  --output csv-dir    # folder where to output the csv files
  path/to/discogs_20200806_artist.xml.gz path/to/discogs_20200806_labels.xml.gz

run.py takes the following arguments:

  • --export: the types of dump files to export: "artist", "label", "master", "release.
    It matches the names of the dump files, e.g. "discogs_20200806_artists.xml.gz" Not needed if the individual files are specified.
  • --bz2: Compresses output csv files using bz2 compression library.
  • --limit=<lines>: Limits export to some number of entities
  • --apicounts: Makes the progress report more accurate by getting total amounts from Discogs API.
  • --output : the folder where to store the csv files; default it current directory

The exporter provides progress information in real time:

Processing      labels:  99%|█████████████████████████████████████████▊| 1523623/1531339 [01:41<00:00, 14979.04labels/s]
Processing     artists: 100%|████████████████████████████████████████▊| 6861991/6894139 [09:02<00:02, 12652.23artists/s]
Processing    releases:  78%|█████████████████████████████▌        | 9757740/12560177 [2:02:15<36:29, 1279.82releases/s]

The total amount and percentages might be off a bit as the exact amount is not known while reading the file.
Specifying --apicounts will provide more accurate predictions by getting the latest amounts from the Discogs API.

Generating XML fixtures

For parity testing between the Python and .NET parsers, you can generate a small, coherent set of XML fixtures (with relationships preserved across files).

Script: tests/fixtures/generate_fixtures.py

Example (runs against tests/samples):

uv run --with lxml python tests/fixtures/generate_fixtures.py \
  --input-dir tests/samples \
  --output-dir tests/fixtures \
  --size 25 \
  --complexity highest

Options:

  • --size: number of releases to seed (default 25).
  • --complexity: highest (default), random, or mixed.
    • highest: pick releases with the largest overall feature count (tracks, artists, labels, identifiers, videos, etc.). If availability scan is enabled, prioritizes releases that reference IDs present in the available dumps.
    • random: uniform random sample (deterministic with --seed).
    • mixed: combine a top slice of complex releases with a random remainder. Controlled by --mixed-ratio.
  • --mixed-ratio: only used when --complexity mixed. Fraction of releases taken from the "top complexity" set; the remainder is random. Default 0.7.
  • --availability-scan: auto (default), always, or never.
    • always: scan artists/labels/masters to prefer releases whose references exist in those files (best coherence, slower on big dumps).
    • never: skip scanning; selection is purely by complexity/randomness.
    • auto: only scan if the combined size of artists/labels/masters is below --availability-max-mb.
  • --availability-max-mb: size threshold used when --availability-scan auto (default 256).
  • --seed: RNG seed used for deterministic sampling (default 1).
  • --input-dir: folder containing the Discogs dumps (default tests/samples).
  • --output-dir: folder to write fixtures (default tests/fixtures).
  • --progress-every: print progress every N parsed entities (default 50,000; set to 0 to disable).
  • --manifest: reuse an existing manifest.json to extract exactly the listed IDs. In this mode, selection/complexity options are ignored, and no graph expansion is performed; the script only pulls the specified entities from the dumps.

Outputs:

  • Fixture XML files are written to tests/fixtures/.
  • A tests/fixtures/manifest.json is produced with IDs, counts, and missing references for debugging.

The script supports both .xml and .xml.gz files and preserves XML header/doctype/namespace if present. To run against the latest dumps in ./tmp, use --input-dir ./tmp.

Importing

If pv is available it will be used to display progress during import.
To install it run $ sudo apt-get install pv on Ubuntu and Debian or check the installation instructions for other platforms.

Example output if using pv:

$ mysql/importcsv.sh 2020-05-01/csv/*
artist_alias.csv.bz2: 12,5MiB 0:00:03 [3,75MiB/s] [===================================>] 100%
artist.csv.bz2:  121MiB 0:00:29 [4,09MiB/s] [=========================================>] 100%
artist_image.csv.bz2:  7,3MiB 0:00:01 [3,72MiB/s] [===================================>] 100%
artist_namevariation.csv.bz2: 2,84MiB 0:00:01 [2,76MiB/s] [==>                         ] 12% ETA 0:00:07

Importing into PostgreSQL

# install PostgreSQL libraries (might be required for next step)
$ sudo apt-get install libpq-dev

# install the PostgreSQL package for python
# ensure the virtual environment has been activated
(.discogsenv) $ pip3 install -r postgresql/requirements.txt

# Configure PostgreSQL username, password, database, ...
$ nano postgresql/postgresql.conf

# Create database tables
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateTables.sql

# Import CSV files
(.discogsenv) $ python3 postgresql/importcsv.py /csvdir/*

# Configure primary keys and constraints, build indexes
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreatePrimaryKeys.sql
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateFKConstraints.sql
(.discogsenv) $ python3 postgresql/psql.py < postgresql/sql/CreateIndexes.sql

Importing into Mysql

# Configure MySQL username, password, database, ...
$ nano mysql/mysql.conf

# Create database tables
$ mysql/exec_sql.sh < mysql/CreateTables.sql

# Import CSV files
$ mysql/importcsv.sh /csvdir/*

# Configure primary keys and build indexes
$ mysql/exec_sql.sh < mysql/AssignPrimaryKeys.sql

Importing into SQLite

# Create database tables
$ sqlite3 /path/to/discogs.sqlite < sqlite/sql/CreateTables.sql

# Import CSV files
$ python3 sqlite/importcsv.py --db=/path/to/discogs.sqlite /csvdir/*

# Create indexes (optional but recommended for querying)
$ sqlite3 /path/to/discogs.sqlite < sqlite/sql/CreateIndexes.sql

Importing into MongoDB

The CSV files can be imported into MongoDB using mongoimport.

mongoimport --db=discogs --collection=releases --type=csv --headerline --file=release.csv

Database schema changes

The database schema was changed in v2.0 to be more consistent and normalize some more data. The following things changed compared to classic discogs-xml2db:

  • renamed table: releases_labels => release_label
  • renamed table: releases_formats => release_format
  • renamed table: releases_artists => release_artist
  • renamed table: tracks_artists => release_track_artist
  • renamed table: track => release_track
  • renamed column: release_artists.join_relation => release_artist.join_string
  • renamed column: release_track_artist.join_relation => release_track_artist.join_string
  • renamed column: release_format.format_name => release_format.name
  • renamed column: label.contactinfo => label.contact_info
  • renamed column: label.parent_label => label.parent_name
  • added: label has new parent_id field
  • added: release_label has extra fields
  • moved: aliases now in artist_alias table
  • moved: tracks_extra_artists now in track_artist table with an extra flag
  • moved: releases_extra_artists now in release_track_artist table with an extra flag
  • moved: release.genres now in own release_genre table
  • moved: release.styles now in own release_style table
  • moved: release.barcode now in release_identifier table
  • moved: artist.anv fields now in artist_namevariation table
  • moved: artist.url fields now in artist_url table
  • removed: release_format.position no longer exists but can use id field to preserve order when release has multiple formats.
  • release_track_artist now use tmp_track_id to match to tmp_track in release_track

Running discogs-xml2db classic

To run the classic version of discogs-xml2db, check out the v1.99 git tag.
It contains both the classic and the speed-up version.

Please be aware that the classic version is no longer maintained.

About

Imports the discogs.com monthly XML dumps into databases

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 17