A neural network for end-to-end music source separation, as described in End-to-end music source separation: is it possible in the waveform domain?
Listen to separated samples here
The Wavenet for Music Source Separation is a fully convolutional neural network that directly operates on the raw audio waveform.
It is an adaptation of Wavenet that turns the original causal model (that is generative and slow), into a non-causal model (that is discriminative and parallelizable). This idea was originally proposed by Rethage et al. for speech denoising and now it is adapted for monaural music source separation. Their code is reused.
The main difference between the original Wavenet and the non-causal adaptation used, is that some samples from the future can be used to predict the present one. As a result of removing the autoregressive causal nature of the original Wavenet, this fully convolutional model is now able to predict a target field instead of one sample at a time – due to this parallelization, it is possible to run the model in real-time on a GPU.
See the diagram below for a summary of the network architecture.
git clone https://github.com/francesclluis/source-separation-wavenet.git
- Install conda
conda env create -f environment.yml
source activate sswavenet
Currently the project requires Keras 2.1 and Theano 1.0.1, the large dilations present in the architecture are not supported by the current version of Tensorflow
A pre-trained multi-instrument model (best-performing model described in the paper) can be found in sessions/multi-instrument/checkpoints
and is ready to be used out-of-the-box. The parameterization of this model is specified in sessions/multi-instrument/config.json
A pre-trained singing-voice model (best-performing model described in the paper) can be found in sessions/singing-voice/checkpoints
and is ready to be used out-of-the-box. The parameterization of this model is specified in sessions/singing-voice/config.json
Download the dataset as described below
Example (multi-instrument): THEANO_FLAGS=device=cuda python main.py --mode inference --config sessions/multi-instrument/config.json --mixture_input_path audio/
Example (singing-voice): THEANO_FLAGS=device=cuda python main.py --mode inference --config sessions/singing-voice/config.json --mixture_input_path audio/
To achieve faster source separation, one can increase the target-field length by use of the optional --target_field_length
argument. This defines the amount of samples that are separated in a single forward propagation, saving redundant calculations. In the following example, it is increased 10x that of when the model was trained, the batch_size is reduced to 4.
Faster Example: THEANO_FLAGS=device=cuda python main.py --mode inference --target_field_length 16001 --batch_size 4 --config sessions/multi-instrument/config.json --mixture_input_path audio/
Example (multi-instrument): THEANO_FLAGS=device=cuda python main.py --mode training --target multi-instrument --config config_multi_instrument.json
Example (singing-voice): THEANO_FLAGS=device=cuda python main.py --mode training --target singing-voice --config config_singing_voice.json
A detailed description of all configurable parameters can be found in config.md
Argument | Valid Inputs | Default | Description |
---|---|---|---|
mode | [training, inference] | training | |
target | [multi-instrument, singing-voice] | multi-instrument | Target of the model to train |
config | string | config.json | Path to JSON-formatted config file |
print_model_summary | bool | False | Prints verbose summary of the model |
load_checkpoint | string | None | Path to hdf5 file containing a snapshot of model weights |
Argument | Valid Inputs | Default | Description |
---|---|---|---|
one_shot | bool | False | Separates each audio file in a single forward propagation |
target_field_length | int | as defined in config.json | Overrides parameter in config.json for separating with different target-field lengths than used in training |
batch_size | int | as defined in config.json | # of samples per batch |
The MUSDB18 is used for training the model. It is provided by the Community-Based Signal Separation Evaluation Campaign (SISEC).
- Download here
- Decode dataset to WAV format as explained here
- Extract to
data/MUSDB