This repository provides implementations of run-length BWTs and BWT-based compressed suffix arrays (CSA) leveraging variable-length blocking (VLB), a novel technique that exploits the skew distribution of BWT runs to balance space usage and query speed.
State of the art in practical CSAs:
- The
$sr$ -index is the most space-efficient BWT-based CSA. - The move data structure is the most query-efficient BWT-based CSA.
In terms of time–space trade-offs, they lie at opposite ends of the Pareto frontier.
This repository implements two data structures:
- A run-length BWT with support for
$count$ queries. - The fast variant of the
$sr$ -index (valid areas).
The key takeaway is this:
Our VLB framework strikes a balance: its space usage is comparable to the
VLBT is a promising alternative, as it can effectively handle variation to produce compact representations—essential for terabyte-scale inputs—while still supporting fast pattern-matching queries.
A compressed suffix array (CSA) is a data structure that stores a text in compressed form and allows counting and locating occurrences of a given pattern in the text.
This idea takes many forms, but the most popular are those based on the Burrows-Wheeler Transform (BWT). Combining the BWT of a text with samples of the suffix array yields the so-called FM index, the algorithmic workhorse behind popular bioinformatics tools such as BWA-MEM and Bowtie2.
In practical implementations, the FM index usually uses space proportional to the plain text, which is not ideal for
large texts. So, Gagie et al. created the
However, the
Another relevant problem is that performing pattern matching on the
A way to deal with the space issue is to group BWT runs into blocks, storing global indexing information about the blocks, and recomputing the missing information on the fly during query time. This idea, in principle, should keep variation-related space overhead controlled. The challenge is to find a suitable way to distribute the indexing information across the runs such that we still achieve good query performance.
VLB constructs an unbalanced, shallow tree over the run-length BWT (the VLB-tree), where the leaves encode variable-length BWT blocks and the internal nodes store indexing information that speeds up access to those blocks. Answering rank and successor queries (core operations in pattern matching) involves descending the tree until a leaf is reached and then performing a cache-friendly scan of a bounded number of runs in the leaf.
The key feature of our design is that compressible areas are placed near the tree root and are fast to access (close to one cache miss), while incompressible areas are placed at deeper levels. Incompressible areas contain many short runs that are more costly to access with a linear scan. However, the indexing information in the internal nodes of the path allows skipping many runs, improving the query performance. Deeper nodes trigger more cache misses, but they are still faster than a linear scan. You can think of the VLB-tree as a data structure that relocates space from compressible BWT areas to incompressible ones.
The VLB-tree can also place suffix array samples near their corresponding BWT runs, so you can efficiently get the lexicographically smallest occurrence of the queried pattern.
Additionally, the VLB-tree can be used to encode the function
- C++17 compiler
- CMake
So far, we have tested VLBT on Linux (GCC >=8.2) and OSx (Clang 17). We do not guarantee that VLBT will work on other platforms. We will keep testing on more platforms as we develop the code.
- CLI (already included in the repository)
Enter the repository directory and run:
$ mkdir build
$ cd build
$ cmake ..
$ make
This process will generate a binary called vlbt-cli (among other things) in the build directory, which allows
building and querying indexes. This binary is not a full-fledged sequence aligner. It is meant to be used for
testing the performance of our framework.
We use a reference block size
We limited
This should be a reasonable choice for most applications:
- Text has near-identical sequences: 16,384 or 65,536
- Text highly repetitive but with variation (e.g., metagenomes or pangenomes with some noise): 4,096 or 16,384
These values are rough and based on our experiments. You can explore others. We would like to devise a
mechanism to recommend a suitable
For the moment, we do not provide a mechanism to compute the BWT and/or the
To create the run-length BWT, you can run
./vlbt-cli build test_data/example_text.txt -b 4096 -d 0
The command above will search for the file text_data/example_text.txt.bwt (BWT of example_text.txt) and build
the VLBT index. The input BWT has to be in one-byte-per-symbol encoding (i.e., plain). The -b option specifies the
block size, while the -d option specifies the structure we are building (0 means run-lenth BWT). The option --help
gives more details.
In this case, you also have to have exaple_text.txt.bwt beforehand, but also the files
example_text.txt.ssaexample_text.txt.esa
The first (ssa extension) stores the suffix array samples corresponding to BWT run heads, and the second (esa
extension) stores the suffix array samples corresponding to BWT run tails. Both files must store the samples using
five bytes per symbol and in suffix array order. Notice that if a BWT run has length
The folder test_data contains an example of such files.
The command to build the CSA is
./vlbt-cli build test_data/example_text.txt -b 4096 -d 2 -s 5
Where -d 2 indicates that we are building the CSA and s is the subsampling parameter of the example_text.txt. Our VLB-based CSA for
the moment uses the same block size
To
./vlbt-cli count example_text.rlbwt_vlbt test_data/example_text.pat
Where example_text.rlbwt_vlbt is the run-length BWT and example_text.pat is the pattern file in
Pizza&Chilli format. The same ideas apply to the
To
./vlbt-cli locate example_text.sri_vlbt test_data/example_text.pat
The input index in this case must be a VLBT CSA. The program will automatically detect the template parameters from the file header.
Note: this interface performs the queries, but it only reports statistics (speed, number of occurrences,
etc.). See below how to actually get the
It is also possible to include VLBT as a library in your own project. Copy the include folder and add the following
lines to your source files:
Run-length BWT:
#include "include/vlbt_build_bwt.h"
#include "include/vlbt_bwt.h"
int main() {
//build the index
vlbt_rlbwt<4096> bwt;//block size as a template parameter
build_bwt(bwt, input_bwt_file, PLAIN, "/tmp/folder");//PLAIN means BWT format
//store the index to disk
store_to_file("/path/to/bwt_index", bwt);
//count occurrences
std::string pattern = "atggagag";
size_t count = bwt.count(pattern);
//load from disk
vlbt_rlbwt<4096> bwt2;
load_from_file("/path/to/bwt_index", bwt2);//make sure template parameters match
}The construction algorithm will place temporary files in /tmp/folder. It then will delete them.
CSA:
#include "include/vlbt_sr_index.h"
#include "include/vlbt_build_sr_index.h"
int main() {
size_t s = 5;//subsampling parameter
vlbt_sri_va<4096, 4096> csa;//left is block size for the BWT and right for $\phi^{-1}$
build_sr_index<uint64_t>(csa, bwt_file, PLAIN, s, "/tmp/folder");//SA samples are stored in uint64_t cells
//store to disk
store_to_file("/output/csa/index", csa);
std::string pattern = "atggagag";
//count
size_t count = csa.count(pattern);
//locate
std::vector<size_t> positions = csa.locate(pattern);
//load from disk
vlbt_sri_va<4096, 4096> csa2;
load_from_file("/output/csa/index", csa2);//make sure template parameters match
}We have not tested using VLBT as a library yet, but it should work. If not, please open an issue. The fix should be straightforward.
On x86-64 systems (Intel/AMD), the code relies on SIMD instructions and should be compiled with the -msse4.2 flag.
Failing to enable this flag may result in compilation errors or degraded performance.
On ARM architectures supporting NEON (e.g., Apple Silicon / M1–M3), no additional compiler flags are required for SIMD instructions to be enabled.
It may also be convenient to use the -mbmi2 flag to optimize some bitwise operations.
-
BAC: genome assemblies of 30 bacterial species from the AllTheBacteria collection. Strings belonging to the same species are highly repetitive, while strings from different species are dissimilar.
-
COVID:
$4{,}494{,}508$ SARS-CoV-2 genomes downloaded from the NCBI genome portal. These sequences are short and near-identical, with an average length of$29{,}748$ . -
HUM: genome assemblies of 40 individuals from the Human Pangenome Reference Consortium (HPRC). Individual genomes are near-identical, but assembly differences introduced variability.
-
KERNEL:
$2{,}609{,}417$ versions of the Linux kernel repository. This dataset is highly repetitive and has a large alphabet.
In DNA collections (BAC, COVID, and HUM), we also considered the DNA reverse complement of each string (as is standard in bioinformatics). The numbers presented in the table below already consider these extra sequences.
| Dataset | Size (GB) | Alphabet | Longest BWT run (MB) | |
|---|---|---|---|---|
| BAC | 133.12 | 7 | 116.77 | 0.23 |
| COVID | 267.41 | 17 | 940.49 | 4.11 |
| HUM | 241.24 | 7 | 61.82 | 6.66 |
| KERNEL | 54.45 | 190 | 263.11 | 70.6 |
- mn (release 2.1.1): the run-length BWT of Mäkinen and Navarro as implemented in the SDSL library.
- fbb (commit 9238178): BWT encoding using fixed block boosting.
-
movc and movl (commit bed2fe9): optimized
implementations of the move data structure. The variant movc encodes only the BWT, while movl also
includes
$r$ -suffix array samples. -
ri (commit 7009b53): the original
$r$ -index. -
sri (commit f99b54a): the original
$sr$ -index. We varied the sampling$s$ across values$8,16$ .
Random patterns of length 105 were generated using the script genpatterns of Pizza&Chilli.
The table shows bx in vlbt-bwt instances indicates the block size used for the VLBT tree. For instance, b6 means
that the tree was built using a block size of
| run-length BWT | 30bac | 30bac | 40hum | 40hum | covid | covid | kernel | kernel |
|---|---|---|---|---|---|---|---|---|
| bps | μs/pat | bps | μs/pat | bps | μs/pat | bps | μs/pat | |
| vlbt-bwt-b6 | 0.142 | 45.83 | 0.332 | 52.99 | 0.024 | 33.27 | 0.172 | 44.85 |
| vlbt-bwt-b7 | 0.133 | 53.72 | 0.317 | 75.52 | 0.017 | 34.14 | 0.127 | 39.62 |
| vlbt-bwt-b8 | 0.129 | 72.25 | 0.313 | 91.02 | 0.016 | 41.08 | 0.109 | 49.93 |
| vlbt-bwt-b9 | 0.128 | 83.05 | 0.313 | 101.44 | 0.015 | 49.85 | 0.104 | 60.65 |
| fbb | 0.232 | 102.33 | 0.273 | 121.73 | 0.069 | 71.73 | 0.179 | 69.77 |
| mn | 0.192 | 190.25 | 0.331 | 204.25 | 0.031 | 184.80 | 0.115 | 204.06 |
| movc | 0.962 | 17.99 | NA | NA | 0.125 | 11.96 | 0.438 | 15.53 |
The table shows sri instances indicates the index variant (va=valid area, vm=valid mark) and the subsampling parameter
sri-va-s8 means that the index was built using the valid area variant with subsampling parameter of
| CSA | 30bac | 30bac | hum | hum | covid | covid | kernel | kernel |
|---|---|---|---|---|---|---|---|---|
| bps | μs/occ | bps | μs/occ | bps | μs/occ | bps | μs/occ | |
| movloc | 2.038 | 0.092 | NA | NA | 0.255 | 0.148 | 0.946 | 0.105 |
| ri | 0.824 | 1.406 | 1.552 | 1.545 | 0.109 | 0.741 | 0.381 | 2.544 |
| sri-s8 | 0.363 | 0.765 | 0.627 | 1.086 | 0.096 | 0.600 | 0.232 | 1.543 |
| sri-vm-s8 | 0.365 | 0.771 | 0.632 | 1.089 | 0.097 | 0.612 | 0.233 | 1.543 |
| sri-va-s8 | 0.397 | 0.754 | 0.685 | 1.028 | 0.098 | 0.609 | 0.258 | 1.490 |
| vlbt-sri-va-b6-s8 | 0.392 | 0.337 | 0.762 | 0.344 | 0.117 | 0.415 | 0.341 | 0.342 |
| vlbt-sri-va-b7-s8 | 0.378 | 0.250 | 0.742 | 0.387 | 0.104 | 0.362 | 0.291 | 0.273 |
| vlbt-sri-va-b8-s8 | 0.373 | 0.262 | 0.737 | 0.454 | 0.101 | 0.357 | 0.271 | 0.283 |
| vlbt-sri-va-b9-s8 | 0.372 | 0.292 | 0.736 | 0.510 | 0.100 | 0.473 | 0.266 | 0.335 |
| sri-s16 | 0.300 | 0.691 | 0.492 | 1.307 | 0.091 | 0.647 | 0.182 | 1.675 |
| sri-vm-s16 | 0.301 | 0.703 | 0.495 | 1.093 | 0.092 | 0.627 | 0.183 | 1.674 |
| sri-va-s16 | 0.322 | 0.648 | 0.528 | 1.044 | 0.094 | 0.616 | 0.203 | 1.461 |
| vlbt-sri-va-b6-s16 | 0.325 | 0.317 | 0.607 | 0.345 | 0.112 | 0.411 | 0.285 | 0.340 |
| vlbt-sri-va-b7-s16 | 0.311 | 0.237 | 0.587 | 0.387 | 0.099 | 0.365 | 0.234 | 0.276 |
| vlbt-sri-va-b8-s16 | 0.306 | 0.248 | 0.583 | 0.457 | 0.096 | 0.354 | 0.215 | 0.282 |
| vlbt-sri-va-b9-s16 | 0.305 | 0.279 | 0.582 | 0.513 | 0.095 | 0.467 | 0.210 | 0.342 |
Experimental code
This repository contains experimental research code.
The software is provided “as is”, without any warranty of any kind, express or implied. The authors make no guarantees regarding correctness, performance, or fitness for any particular purpose. Use at your own risk.
If you encounter bugs, unexpected behavior, or have suggestions for improvement, please open an issue in this repository and include:
- a minimal reproducible example (if possible)
- your environment (OS, compiler version, etc.)
- any relevant logs or error messages
VLBT is released as open-source software for research and educational purposes. Commercial use is permitted under the terms of the BSD 3-Clause License.
If you use this software or its ideas in academic work, please cite the associated paper or this repository.
Proper citation helps support and sustain academic research.
Article pending