
Post your model #1

Open · ZFTurbo opened this issue Nov 6, 2023 · 31 comments
Labels: enhancement (New feature or request)

@ZFTurbo (Owner) commented Nov 6, 2023

To post your model, please fill in the form:

Description: 
Instruments:
Dataset (if known):
Metrics (if known):
Config link: 
Checkpoint link: 
@Beatloo-Labs commented Apr 6, 2024

Model Type: bs_roformer
Description: My first five days of model training, run on three servers, each with 2× T4 GPUs
Instruments: vocals, drums, bass
Dataset: musdb18hq

How to run (example for vocals): download the config and checkpoint, save them to the folder with @ZFTurbo's training code, and run this command:

python inference.py --model_type bs_roformer --config_path config_musdb18_bs_roformer_vocals.yaml --start_check_point model_vocals_bs_roformer_ep_5_sdr_8.0972.ckpt --input_folder input/ --store_dir separation_results/

Input files go in the input/ folder; results are written to the separation_results/ folder.

DEMO:
https://disk.yandex.ru/d/zc5Bca9nTuB7jg (old version)

Vocals:
SDR: 8.09 (up from 7.55)
Config link: https://disk.yandex.com/d/eTOZ9BGpTIRNYw
Checkpoint link: https://disk.yandex.com/d/wPdwPZTQMJfAZQ

Drums:
SDR: 7.22 (up from 7.15)
Config link: https://disk.yandex.com/d/ab8glguWFltifA
Checkpoint link: https://disk.yandex.com/d/auEl3aovvWhYMw

Bass:
SDR: 5.78 (up from 5.28)
Config link: https://disk.yandex.com/d/mgqAPCahZQwEgQ
Checkpoint link: https://disk.yandex.com/d/BISwCadSNyYb-g

Last update: 21.07.24

@yolkispalkis commented Jun 8, 2024

Model Type: mel_band_roformer
Description: My first attempt at training; trained for 5 days on an RTX 4070
Instruments: percussion
Dataset: musdb18hq, moisesdb
SDR values are based on the musdb18hq test set

SDR (per checkpoint): 6.86, 7.10, 7.44, 7.68
Checkpoints: https://disk.yandex.ru/d/MxZ4k-kZ2Q5QqA

Updated on: 14.06.2024 18:20:00 UTC+3

@alexclarke236 commented:

Model Type: mel_roformer
Description: My first attempt at training
Instruments: timpani
Dataset: mvsep

@alexclarke236 commented:

Model Type: mel_band_roformer
Description: My first AI training
Instruments: percussion
Dataset: mvsep

@jarredou (Contributor) commented:

@alexclarke236 Would you like to share the checkpoints you've trained? The best way is to host them on a file-sharing site and post the link here, as previous users have done.

@alexclarke236 commented Jun 12, 2024 via email

@verosment commented Jun 13, 2024

Architecture: MDX23c
Description: My first somewhat successful attempt at training. Hardware used was my personal machine: RTX 3060 12 GB, 64 GB DDR4 RAM, Ryzen 5 5600X, Windows 11.
Training stopped due to the inconvenience of training on my personal machine and the slow pace of progress. Had I had the funds, I would've rented a GPU from vast.ai. Trained for a total of 208 epochs, or roughly 2,500 minutes.
Instruments: Strings (Cello, Double Bass, Violin, Viola), Brass (Trumpet, English Horn, Tuba, Trombone), Wind (Piccolo, Flute, Clarinet, Saxophone), Mellotron Flute & Cello. Other instruments that have a similar quality or sound may be present in the dataset but unaccounted for.
Dataset (if known): Custom 97-pair dataset using tracks from isolated-tracks.com, songstems.net, MoisesDB, ARME-Virtuoso-Strings-2.2, traditional-flute-dataset, a bunch of Toby Fox FLP fan recreations, and a Dolby Atmos rip of the center track of Eleanor Rigby layered over a song from MoisesDB.
Metrics (if known): SDR 4.4174 on my very small validation set. Performance of the model depends heavily on the input.
Config link: https://drive.google.com/file/d/1OTuF3534Ax5SJSsk08e2QLgoxiljqelH/view?usp=sharing
Checkpoint link: https://drive.google.com/file/d/1juOW6Q_Puqp_uxMsQpWSWkAm1QSbXIdg/view?usp=sharing

@verosment commented Jun 13, 2024

Same model as above, but trained for a further 54 epochs. It sounds better to the ear than the older model in quite a few cases, but scores a lower SDR on the validation set. It picks up wind instruments better than the older model in my testing and may pick up string sections better too.
Instruments: Same as above
Dataset (if known): Same as above
Metrics (if known): SDR 4.0870 on my very small validation set. Performance of the model depends heavily on the input.
Config link: https://drive.google.com/file/d/1OTuF3534Ax5SJSsk08e2QLgoxiljqelH/view?usp=sharing
Checkpoint link: https://drive.google.com/file/d/1gB6RPUw_knozcY3qF--cpTczoxpkDw5O/view?usp=sharing

Edit (3/08/2024): currently retraining this model with a larger dataset on the same machine, so it will take a while. Results will be posted here if it gets anywhere decent.

@jarredou (Contributor) commented Jul 3, 2024

Description:
MDX23C drum-element separation model (to apply on drums-only audio).
n_fft = 2048 was used instead of the default 8192 to reduce resource requirements.
Baseline training (141 epochs) was done by @aufr33; it is not fully finished and can be improved.

Instruments:
kick, snare, toms, hh, ride, crash

Dataset:
Created by myself for this task, though it had some issues.

Metrics:
Instr SDR kick: 18.4312
Instr SDR snare: 13.6083
Instr SDR toms: 13.2693
Instr SDR hh: 6.6887
Instr SDR ride: 5.3227
Instr SDR crash: 7.5152
SDR Avg: 10.8059

Config & checkpoint: https://github.com/jarredou/models/releases/tag/aufr33-jarredou_MDX23C_DrumSep_model_v0.1

ZFTurbo added the enhancement label Jul 4, 2024
@anvuew (Contributor) commented Jul 13, 2024

Description:

mel_band_roformer dereverb model
chunk_size: 352768
dim: 256
depth: 6

Instruments:

noreverb, reverb

Dataset:

~90h vocals, 76 types of reverb

Metrics:

SDR noreverb: 7.5669 (small validation set)

Config link: config
Checkpoint link: ckpt

@ZFTurbo (Owner) commented Jul 14, 2024

@anvuew thank you for the great model. I have a question about your training: did you apply reverb to full tracks or to the vocal part only? Can you share your validation set? I'd like to compare your model with older ones.

@anvuew (Contributor) commented Jul 14, 2024

> @anvuew thank you for the great model. I have a question about your training: did you apply reverb to full tracks or to the vocal part only? Can you share your validation set? I'd like to compare your model with older ones.

noreverb is vocals only. My validation set is too inadequate to share; MDX Reverb-HQ scores an SDR of 6.5 on it.

ZFTurbo pinned this issue Jul 15, 2024
@deton24 commented Jul 15, 2024

For those who have issues running Roformers from this thread in UVR, you must delete the following line from the YAML file:
linear_transformer_depth: 0

@anvuew (Contributor) commented Jul 15, 2024

Description:

bs_roformer dereverb model
chunk_size: 352768
dim: 256
depth: 8

Metrics:

SDR noreverb: 8.0770 (small validation set)

Config link: config
Checkpoint link: ckpt

Although this is a dereverb model, it will also remove harmonies or vocal effects that are not in the center channel.

If you want to add this model to UVR5, first place the config file and weights in the corresponding directories (weights in Ultimate Vocal Remover\models\MDX_Net_Models, config file in Ultimate Vocal Remover\models\MDX_Net_Models\model_data\mdx_c_configs). Delete linear_transformer_depth: 0 from the config file and change stft_hop_length: 512 to stft_hop_length: 441; a script for these two edits is sketched below. Then open UVR5, and you will be prompted to add the model (select the MDX architecture if not). Choose the corresponding config file and check the Roformer model checkbox.
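
For those who prefer to script those two config edits, here is a minimal Python sketch. It assumes PyYAML is installed, that the parameters sit under a top-level model: section (as in ZFTurbo's configs), and that the filename is a placeholder to adjust for your install:

```python
import yaml

# Hypothetical filename; point this at the config inside your UVR install.
config_path = "config_dereverb_bs_roformer.yaml"

with open(config_path, "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# UVR does not recognize this parameter, so drop it if present.
cfg["model"].pop("linear_transformer_depth", None)
# The post above says to change the hop length from 512 to 441 for UVR.
cfg["model"]["stft_hop_length"] = 441

with open(config_path, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```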

@musicalman commented:

Congratulations on these fine dereverb models!
I just tried using the BS Roformer dereverb model inside UVR and got this error:
RuntimeError: The size of tensor a (352768) must match the size of tensor b (352800) at non-singleton dimension 1
Weirdly, the Mel Band Roformer one works fine. But since the BS model has a slightly better SDR, I wanted to see if it was worth switching to it.

@OMARK313 commented:

I need a separation model whose vocal SDR after separation is higher than 12.97, please.

@lgkt commented Jul 28, 2024

BS Roformer (viperx) is 12.9755? That one is old. The latest is BS Roformer (finetuned), shown in the image, but where can it be downloaded? Who can tell me?
[image]

@deton24 commented Jul 30, 2024 via email

@SUC-DriverOld (Contributor) commented Sep 13, 2024

Aspiration_Mel_Band_Roformer

model repo: https://huggingface.co/Sucial/Aspiration_Mel_Band_Roformer
You can try listening to the performance of this model here.

Description: The model separates aspiration (breath sounds), which can be useful to some mixers.
Instruments: aspiration, other
Dataset: My own dataset (171 songs for training and 17 songs for validation).
Metrics: based on SDR over the 17 validation songs.
Finetuned from: model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt

Configs: config_aspiration_mel_band_roformer.yaml

Model: aspiration_mel_band_roformer_sdr_18.9845.ckpt
Epoch: 123
Instr SDR aspiration: 9.8554
Instr SDR other: 28.1136
SDR Avg: 18.9845

Model: aspiration_mel_band_roformer_less_aggr_sdr_18.1201.ckpt
Epoch: 27
Instr SDR aspiration: 9.0704
Instr SDR other: 27.1699
SDR Avg: 18.1201

@deton24 commented Sep 18, 2024

To use it in the UVR5 GUI, use this config (thx Essid):

linear_transformer_depth: 0 removed; UVR doesn't recognize that parameter.
target_instrument: null → aspiration

https://drive.google.com/file/d/1IrzeCliVNS8zuSJ51IbACMN9FNuza-hE/view?usp=sharing

@wesleyr36 commented Oct 16, 2024

Description:
MDX23C Similarity/Phantom Centre extraction model based on the lightweight MDX23C config used by jarredou in his drumsep model.

Instruments:
Similarity
Difference

Dataset:
Custom dataset containing 1057 randomly mixed pairs for training and 107 pairs for validation. The difference stems consisted of one channel each from two randomly picked songs, and the similarity stem was a dual-mono version of a third randomly chosen song.

Metrics:
Model_1 (ep_237):
Similarity: 71.9982 L1_Freq

Model_2 (ep_271):
Similarity: 72.2383 L1_Freq

Downloads:
https://drive.google.com/drive/folders/1KHEnvsrvvIDlO-pBT-Hw8UKKebpJCyTe?usp=sharing

@ZFTurbo (Owner) commented Oct 16, 2024

@wesleyr36 can you explain what this model does? What are the use cases?

@wesleyr36 commented:

It extracts the phantom centre from stereo audio, i.e. the content that is the same between the two channels and is perceived to be in the middle. Its main intended use case was inside a similarity-extractor recreation, which can be done in the following way (see the sketch after the steps):

  1. Take the L channel from Audio_1 and the L channel from Audio_2 and merge them into a stereo file.
  2. Run that through the model.
  3. Repeat for the R channels.
  4. Merge the L and R channels back together, and you have the similarity, assuming the audio files were perfectly aligned.
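
A minimal Python sketch of that recipe, assuming soundfile and numpy are available and that separate() is a hypothetical wrapper around the model's inference that returns the extracted similarity as a mono array:

```python
import numpy as np
import soundfile as sf

def extract_similarity(path_1, path_2, separate):
    """Recreate a similarity extractor using the phantom-centre model."""
    a, sr = sf.read(path_1)   # shape (samples, 2)
    b, _ = sf.read(path_2)
    n = min(len(a), len(b))   # assumes the files are already aligned

    channels = []
    for ch in range(2):       # steps 1-3: pair up L channels, then R channels
        stereo_pair = np.stack([a[:n, ch], b[:n, ch]], axis=1)
        channels.append(separate(stereo_pair, sr))  # model pulls the centre
    # Step 4: merge the two results back into one stereo similarity signal.
    return np.stack(channels, axis=1)
```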

@Super-YH commented Nov 5, 2024

Description: SYH99999/MelBandRoformerSYHFTV2
Mel-Band Roformer SYH Fine-tuned model Beta V2
Instruments: Vocals
Config link: same as Kim's model
Checkpoint link: https://huggingface.co/SYH99999/MelBandRoformerSYHFTV2

Please check the quality by ear.

@SUC-DriverOld (Contributor) commented Nov 22, 2024

Dereverb-Echo_Mel_Band_Roformer

Huggingface link: https://huggingface.co/Sucial/Dereverb-Echo_Mel_Band_Roformer
Description: This model separates reverb and delay effects from vocals. It can also partially separate harmonies, though not completely. You can try listening to the performance of this model here!
Config: config_dereverb-echo_mel_band_roformer.yaml
Model: dereverb-echo_mel_band_roformer_sdr_10.0169.ckpt
Instruments: [dry, other]
Finetuned from: model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt
Datasets:

  • Training datasets: 270 songs from opencpop and GTSinger
  • Validation datasets: 30 songs from my own collection
  • All random reverbs and delay effects are generated by this python script and sorted into the musdb18 dataset format.

Metrics: based on SDR over the 30 validation songs.

Instr dry sdr: 13.1507 (Std: 4.1088)
Instr dry l1_freq: 53.7715 (Std: 13.3363)
Instr dry si_sdr: 12.7707 (Std: 4.6134)
Instr other sdr: 6.8830 (Std: 2.5547)
Instr other l1_freq: 52.7358 (Std: 11.8587)
Instr other si_sdr: 5.9448 (Std: 2.8721)
Metric avg sdr: 10.0169
Metric avg l1_freq: 53.2536
Metric avg si_sdr: 9.3577

Training logs: train.log, Tensorboard image: tensorboard.png

Thanks

@SUC-DriverOld (Contributor) commented Dec 7, 2024

Huggingface model link: https://huggingface.co/Sucial/Chorus_Male_Female_BS_Roformer

Recently, I attempted to train a model for separating male and female voices in choir singing, and the results were quite good, far exceeding my expectations. However, because the training and validation data lack a degree of universality (all of it consists of Chinese songs), I personally classify this model as experimental.

The model can separate the male and female voices in a chorus. However, if the male and female parts are sung alternately (one at a time), they cannot be separated. The model's separation quality can be heard here!

I used a total of 750 songs, of which 700 were used as the training set and 50 as the validation set. All the songs are from the opencpop and m4singer datasets. Fine-tuned from model_bs_roformer_ep_317_sdr_12.9755.ckpt.

Of these, model_chorus_bs_roformer_ep_267_sdr_24.1275.ckpt has the following validation values:

Train epoch: 267
Instr male sdr: 24.4762 (Std: 1.5505)
Instr female sdr: 23.7788 (Std: 1.5168)
Metric avg sdr: 24.1275

Thanks to CN17161 for the GPU compute support!

@deton24 commented Dec 8, 2024

  • If you want to use the model in UVR, use this config (thx Essid)
  • If you get "The size of tensor a (352768) must match the size of tensor b (352800) at non-singleton dimension 1",
    e.g. in python-audio-separator, use this config (thx Eddycrack864)

@SUC-DriverOld (Contributor) commented Dec 11, 2024

Dereverb-Echo_Mel_Band_Roformer (v2 update)

> (Quoting my Dereverb-Echo_Mel_Band_Roformer post from Nov 22 above.)

I used more data (1000+ songs) and the same reverb & delay creation script to fine-tune this model. The dry-stem SDR has now reached 13.4843 (Std: 4.8675). I also set target_instrument: dry, which greatly reduced the size of the model.
The model is in the same Hugging Face repository.
Model: Download · Config: Download

@happyTonakai commented Dec 18, 2024

> (Quoting @SUC-DriverOld's Dereverb-Echo_Mel_Band_Roformer posts above.)

Thank you for sharing the great dereverb model. The model works fine on most audio, but I found some bad cases when the reverb is strong. The original model considers most of the sound as 'other', and 'dry' is silent much of the time. The fine-tuned v2 model improves the performance, but the problem still exists. You can find the files from before and after the model here; please have a look at them. I think increasing the reverberation intensity during the training phase may help. You may also need to add a penalty term to the loss function when the target stem (dry) is small enough (silent), because we can assume that the reverb audio is generated entirely from the dry audio by convolution with the impulse response; one reading of this idea is sketched below.
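
A minimal PyTorch sketch of one reading of that penalty suggestion, assuming an L1 base loss; the threshold eps and weight lam are hypothetical illustrations, not values from any posted training code:

```python
import torch
import torch.nn.functional as F

def dereverb_loss(pred_dry, target_dry, eps=1e-3, lam=0.1):
    """L1 reconstruction plus a penalty on dry output where the target is silent.

    Rationale from the comment above: the reverb audio is generated from the
    dry audio by convolution with an impulse response, so where the target dry
    stem is (near-)silent, the model should output silence there as well.
    """
    recon = F.l1_loss(pred_dry, target_dry)
    # Per-example mean absolute amplitude over the time axis.
    tgt_energy = target_dry.abs().mean(dim=-1)
    pred_energy = pred_dry.abs().mean(dim=-1)
    # Penalize predicted dry energy only where the target is effectively silent.
    silent_target = (tgt_energy < eps).float()
    penalty = (silent_target * pred_energy).mean()
    return recon + lam * penalty
```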

@deton24 commented Dec 19, 2024

Is it any better with anvuew's mel v2 model?
Model
Config
Colab

@SUC-DriverOld (Contributor) commented Dec 27, 2024

@happyTonakai @deton24
I sincerely apologize for the delay in responding to you. Due to academic commitments, I was unable to reply sooner.

First of all, I must acknowledge that my v2 model is still somewhat behind anvuew's v2 model in certain aspects. Over the past few days, I have continued to fine-tune my model, focusing on handling large reverb, an area where anvuew's v2 model has limitations. I also took note of the issue raised by happyTonakai: "some bad cases when the reverb is strong. The original model considers most of the sound waves as 'other,' and 'dry' is silent in many cases." In response, I made some adjustments to the validation code and trained two new models specifically targeting large-reverb removal. After training, I combined these two models with my v2 model through a blending process to better handle all scenarios.

At this stage, I am still unsure whether my new models outperform anvuew's v2 model overall, but I can confidently say that they are more effective at removing large reverb. You can listen to sample outputs here. For more detailed information about these two models and the blended version, please refer to the README.md in the HuggingFace model repository.

Here are the three new models:
de_big_reverb_mbr_ep_362.ckpt
de_super_big_reverb_mbr_ep_346.ckpt
dereverb_echo_mbr_fused_0.5_v2_0.25_big_0.25_super.ckpt
The configuration file is the same as for the v2 model. A sketch of the checkpoint blending follows below.
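
The fused filename suggests a 0.5/0.25/0.25 weighted blend of the three checkpoints. One common way to do such blending is a weighted average of the state dicts; a minimal sketch under that assumption (the actual blending method was not posted, the v2 filename below is a placeholder, and plain tensor state dicts with identical keys are assumed):

```python
import torch

# Blend weights read off the fused checkpoint's name; v2 filename is hypothetical.
parts = {
    "dereverb_echo_mbr_v2.ckpt": 0.5,
    "de_big_reverb_mbr_ep_362.ckpt": 0.25,
    "de_super_big_reverb_mbr_ep_346.ckpt": 0.25,
}

fused = None
for path, weight in parts.items():
    state = torch.load(path, map_location="cpu")
    if fused is None:
        fused = {k: weight * v.float() for k, v in state.items()}
    else:
        for k, v in state.items():
            fused[k] += weight * v.float()  # identical keys/shapes assumed

torch.save(fused, "dereverb_echo_mbr_fused_0.5_v2_0.25_big_0.25_super.ckpt")
```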

Update:
I would also like to thank ZFTurbo for releasing the code to train LoRA models, which makes fine-tuning easier. Unfortunately, when I was training these models the LoRA code had not yet been released, so I still used the original method for fine-tuning.
