This project provides a statistical analysis of the bitfield used to represent the unspent transaction output (UTXO) set, as described in the Transaction Output Bitfield Compression research problem. The goal is to visualize and understand the distribution of runs of zeros and ones in the so-called "hints file" to determine which compression method may be most suitable for this data structure.
The UTXO set can be represented as a bitfield over all transaction outputs (TXOs), where each bit indicates whether a TXO is spent (0) or unspent (1). Efficiently compressing this bitfield is important for reducing the size of the hints file, which is used in SwiftSync to speed up initial block download (IBD) and may be integrated into Bitcoin Core in the future.
Note that this repository does not implement compression algorithms; it is a statistical study to inform future development. For more details on the problem, see the original research post.
When you run the analysis, the program displays a graph of the run-length distribution on the screen and also saves it locally as run_distribution.pkl for further inspection.
Additionally, the program outputs statistics similar to the following example, which was generated using the main.hints file based on Bitcoin mainnet at block 916,874 (file size: 447 MB):
runs >= 9 (RLE candidates): 53,510,451x
bit packing with less than 64 bits happens: 52,540,220x
bit packing with more than 64 bits happens: 587,583x
total spent (0): 3,293,272,927
total unspent (1): 167,110,763
Graph output:
This plot shows how often consecutive runs of identical bits appear. The X-axis is the run length (number of consecutive equal bits), and the Y-axis is how many times runs of that length occur; both axes use a logarithmic scale (you can turn it off by pressing 'j' on graph view). Red points represent runs of bit 0 (spent) and green points represent runs of bit 1 (unspent). Each point answers: “how many times does a run of N identical bits appear?”
For example, at run length = 1, a green point near 10^8 means single-bit runs like 1 occur about 10^8 times, while a red point near 10^7 means single-bit runs like 0 occur about 10^7 times. At run length = 10, points correspond to sequences such as 0000000000 or 1111111111, and their vertical position gives the frequency of those exact-length runs.
Runs with more than 9 consecutive zeros are considered candidates for run-length encoding (RLE), which could provide significant compression gains. The statistics on bit packing with less or more than 64 bits give a sense of the size and sparsity of the data, helping to evaluate the effectiveness of different packing and compression strategies.
This analysis is intended as a first step toward selecting the most appropriate compression method for the hints file, based on real-world data characteristics.
To build and run this project in Python, follow these steps:
Make sure you have Python 3.13.11 or newer installed. You can check your version with:
python --versionIt is recommended to use a virtual environment to manage dependencies:
python -m venv .venv
source .venv/bin/activateIf your project has dependencies, list them in a requirements.txt file. Then run:
pip install -r requirements.txtYou can now run the analysis script. It accepts an optional parameter for the hint file to analyze. If not specified, it defaults to main.hints:
python hints_analyser.py [hintfile]After running you will have a run_distribution.pkl file. You can plot the graphic at any time without running the hints_analyser again by executing:
python view_graph.py