This repository builds on top of the original RAVE framework. It reduces the pipeline to a non-variational autoencoder and adds, among other things:
- RAVE-centric speaker encoder pretrained on VoxCeleb
- HuBERT-based knowledge distillation for linguistic teacher guidance
- Pitch contour conditioning
- Information perturbation
- Random background noise addition
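Background noise addition mixes a noise clip into the clean signal at a randomly drawn signal-to-noise ratio. The following is a minimal NumPy sketch of the idea; the function name and SNR range are illustrative and not the repository's actual implementation:

```python
import numpy as np

def add_background_noise(clean, noise, snr_db_range=(5.0, 20.0), rng=None):
    """Mix `noise` into `clean` at a random SNR drawn from `snr_db_range`.

    Both inputs are 1-D float arrays; `noise` is tiled/cropped to match.
    """
    rng = rng or np.random.default_rng()
    # Tile or crop the noise to the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Draw a random target SNR and scale the noise accordingly.
    snr_db = rng.uniform(*snr_db_range)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

During training, a fresh noise segment and SNR would typically be drawn for every example, so the model never sees the same mixture twice.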
To get started:

1. Run preprocessing with `--lazy` as usual (see below).
2. Create speaker statistics:

   ```bash
   python rave/pitch_utils.py --root_folder <data_dir> --pitch_estimator <fcpe|yin>
   ```

   Then set the `speaker_stats.json` path in `configs/v2/RAVE`.
3. Download the DEMAND noise dataset (https://zenodo.org/api/records/1227121/files-archive) and run:

   ```bash
   python scripts/decode.py -i <input_directory> -o <output_directory> -sr <sample_rate>
   ```

   Then set `noise_dir` in `configs/v2/RAVE`.
4. Set `speaker_encoder_dir` in `configs/v2/RAVE`.
5. If desired, disable the `additive_noise` bool in the `train_rave.py` dataset.
6. Train!
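The speaker statistics file typically stores per-speaker pitch statistics, such as the mean and standard deviation of the voiced F0 contour in log-Hz, used later to normalize or shift pitch conditioning. A hedged sketch of such a computation, assuming pitch contours have already been extracted (the exact schema written by `rave/pitch_utils.py` may differ):

```python
import json
import numpy as np

def compute_speaker_stats(pitch_by_speaker, out_path="speaker_stats.json"):
    """Compute log-F0 mean/std per speaker from voiced pitch frames.

    `pitch_by_speaker` maps a speaker id to a 1-D array of F0 values in Hz;
    unvoiced frames are assumed to be marked with 0 and are excluded.
    """
    stats = {}
    for speaker, f0 in pitch_by_speaker.items():
        f0 = np.asarray(f0, dtype=np.float64)
        voiced = f0[f0 > 0]          # drop unvoiced frames
        log_f0 = np.log(voiced)
        stats[speaker] = {
            "mean": float(log_f0.mean()),
            "std": float(log_f0.std()),
        }
    with open(out_path, "w") as fh:
        json.dump(stats, fh, indent=2)
    return stats
```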
Official implementation of RAVE: A variational autoencoder for fast and high-quality neural audio synthesis (article link) by Antoine Caillon and Philippe Esling.
If you use RAVE as a part of a music performance or installation, be sure to cite either this repository or the article!
If you want to share / discuss / ask things about RAVE, you can do so in our Discord server!
The original implementation of the RAVE model can be restored using

```bash
git checkout v1
```

Install RAVE using

```bash
pip install acids-rave
```

You will need ffmpeg on your computer. You can install it locally inside your virtual environment using

```bash
conda install ffmpeg
```

A colab to train RAVEv2 is now available thanks to hexorcismos!
Training a RAVE model usually involves 3 separate steps, namely dataset preparation, training and export.
You can now prepare a dataset using two methods: regular and lazy. Lazy preprocessing allows RAVE to be trained directly on raw files (i.e. mp3, ogg) without converting them first. Warning: lazy dataset loading will increase your CPU load by a large margin during training, especially on Windows. It can however be useful when training on a large audio corpus that would not fit on a hard drive when uncompressed. In any case, prepare your dataset using
```bash
rave preprocess --input_path /audio/folder --output_path /dataset/path (--lazy)
```

RAVEv2 has many different configurations. The improved version of v1 is called v2, and can therefore be trained with

```bash
rave train --config v2 --db_path /dataset/path --name give_a_name
```

We also provide a discrete configuration, similar to SoundStream or EnCodec:

```bash
rave train --config discrete ...
```

By default, RAVE is built with non-causal convolutions. If you want to make the model causal (hence lowering the overall latency of the model), you can use the causal mode:

```bash
rave train --config discrete --config causal ...
```

Many other configuration files are available in `rave/configs` and can be combined. Here is a list of all the available configurations:
| Type | Name | Description |
|---|---|---|
| Architecture | v1 | Original continuous model |
| | v2 | Improved continuous model (faster, higher quality) |
| | v3 | v2 with Snake activation, descript discriminator and Adaptive Instance Normalization for real style transfer |
| | discrete | Discrete model (similar to SoundStream or EnCodec) |
| | onnx | Noiseless v1 configuration for onnx usage |
| | raspberry | Lightweight configuration compatible with realtime RaspberryPi 4 inference |
| Regularization (v2 only) | default | Variational Auto Encoder objective (ELBO) |
| | wasserstein | Wasserstein Auto Encoder objective (MMD) |
| | spherical | Spherical Auto Encoder objective |
| Discriminator | spectral_discriminator | Use the MultiScale discriminator from EnCodec |
| Others | causal | Use causal convolutions |
| | noise | Enables noise synthesizer V2 |
Once trained, export your model to a torchscript file using

```bash
rave export --run /path/to/your/run (--streaming)
```

Setting the `--streaming` flag will enable cached convolutions, making the model compatible with realtime processing. If you forget to use the streaming mode and try to load the model in Max, you will hear clicking artifacts.
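The reason streaming mode matters: without cached convolutions, each audio buffer is convolved in isolation, so the receptive field is truncated at buffer boundaries and the output clicks. A minimal NumPy illustration of the caching idea (not RAVE's actual implementation): a causal convolution that carries its left context between calls produces block-wise output identical to processing the whole signal at once.

```python
import numpy as np

class CachedCausalConv:
    """1-D causal convolution that caches left context across calls."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=np.float64)
        # Cache holds the last (len(kernel) - 1) samples of the previous block.
        self.cache = np.zeros(len(self.kernel) - 1)

    def __call__(self, block):
        padded = np.concatenate([self.cache, block])
        self.cache = padded[-(len(self.kernel) - 1):]
        # 'valid' convolution over the padded block yields len(block) samples.
        return np.convolve(padded, self.kernel, mode="valid")

# Block-wise (streaming) processing matches offline processing of the
# full signal, which is what removes the clicking at buffer boundaries.
kernel = np.array([0.5, 0.3, 0.2])
signal = np.random.default_rng(0).normal(size=64)

offline = np.convolve(np.concatenate([np.zeros(2), signal]), kernel, mode="valid")
conv = CachedCausalConv(kernel)
streamed = np.concatenate([conv(b) for b in np.split(signal, 4)])
assert np.allclose(offline, streamed)
```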
This section presents how RAVE can be loaded inside nn~ in order to be used live with Max/MSP or PureData.
A pretrained RAVE model named darbouka.gin available on your computer can be loaded inside nn~ using the following syntax, where the default method is set to forward (i.e. encode then decode)
This does the same thing as the following patch, but slightly faster.
Having an explicit access to the latent representation yielded by RAVE allows us to interact with the representation using Max/MSP or PureData signal processing tools:
By default, RAVE can be used as a style transfer tool, based on the large compression ratio of the model. We recently added a technique inspired by StyleGAN that applies Adaptive Instance Normalization to the reconstruction process, effectively allowing you to define source and target styles directly inside Max/MSP or PureData using the attribute system of nn~.
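Adaptive Instance Normalization replaces the per-channel statistics of a source latent with those of a target latent. A hedged NumPy sketch of the operation itself (RAVE v3 applies this inside the network; this only illustrates the math):

```python
import numpy as np

def adain(source, target, eps=1e-8):
    """Adaptive Instance Normalization over (channels, time) latents.

    Normalizes `source` per channel, then rescales it with the
    per-channel mean and standard deviation of `target`.
    """
    src_mean = source.mean(axis=-1, keepdims=True)
    src_std = source.std(axis=-1, keepdims=True)
    tgt_mean = target.mean(axis=-1, keepdims=True)
    tgt_std = target.std(axis=-1, keepdims=True)
    normalized = (source - src_mean) / (src_std + eps)
    return normalized * tgt_std + tgt_mean
```

After the transfer, the output latent carries the source's temporal structure with the target's per-channel statistics, which is what "source and target styles" refers to.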
Other attributes, such as `enable` or `gpu`, can enable/disable computation or use the GPU to speed things up (still experimental).
Several pretrained streaming models are available here. We'll keep the list updated with new models.
If you have questions, want to share your experience with RAVE, or share musical pieces made with the model, you can use the Discussion tab!
Demonstration of what you can do with RAVE and the nn~ external for Max/MSP!
Using nn~ for PureData, RAVE can be used in realtime on embedded platforms!
This work is led at IRCAM, and has been funded by the following projects
- ANR MakiMono
- ACTOR
- DAFNE+ N° 101061548