Mvsep – AI-driven music and voice separation

Mvsep launched iOS and Android apps with auto-update and review features, introduced new instrument separation models including SATB Choir, added VibeVoice for voice cloning and TTS, and integrated Transkun for piano transcription and Basic Pitch from Spotify.

1) We added an iOS app and updated our Android app. They are both now live. https://apps.apple.com/us/app/mvsep/id6753040381
The latest release adds the following features:
- Auto-update checker
- You can now send reviews for separations, just like on the website
- Bug fixes

2) We have introduced a variety of new models for separating individual instruments:
- MVSep Plucked Strings ../../home?sep type=102 | Demo ../../result/20251120092757-f0bb276157-mixture.wav
- MVSep Percussion ../../home?sep type=105 | Demo ../../result/20251128141738-f0bb276157-mixture.wav
- MVSep Keys ../../home?sep type=106 | Demo ../../result/20251128142835-f0bb276157-mixture.wav
- MVSep Brass ../../home?sep type=107 | Demo ../../result/20251128142905-f0bb276157-mixture.wav
- MVSep Woodwind ../../home?sep type=108 | Demo ../../result/20251128143157-f0bb276157-mixture.wav
- MVSep Xylophone ../../home?sep type=109 | Demo ../../result/20251223210226-f0bb276157-mixture.wav
- MVSep Celesta ../../home?sep type=110 | Demo ../../result/20251230133507-f0bb276157-mixture.wav
- MVSep Choir ../../home?sep type=112 | Demo ../../result/20260107221631-f0bb276157-mixture.wav
- MVSep Bagpipes ../../home?sep type=116 | Demo ../../result/20260221123255-f0bb276157-mixture.wav
- MVSep Braam ../../home?sep type=117 | Demo ../../result/20260221124005-f0bb276157-mixture.wav
- MVSep FX ../../home?sep type=122 | Demo ../../result/20260318224517-f0bb276157-mixture.wav
The current separation scheme can be found below:

3) A new model, MVSep SATB Choir (soprano, alto, tenor, bass), has been added.
Description: https://mvsep.com/algorithms/104
Demo 1 (vocals) ../../result/20260108154639-f0bb276157-mixture.wav | Demo 2 (vocals) ../../result/20260108155023-f0bb276157-mixture.wav | Demo (strings) ../../result/20260108154828-f0bb276157-mixture.wav
A huge thanks to @Dry Paint Dealer Undr for helping me create this model.
P.S. The model works not only with vocals but also with strings and some other instruments.

4) We added the powerful VibeVoice model to the Experimental section. It is available in 2 variants: Voice Cloning and Text-to-Speech.
Key Features:
- Two models: small (1.5B parameters) and large (7B parameters)
- Up to 4 speakers in a single recording
- Up to 90 minutes of generated audio
- Language support: Officially supports English (default) and Chinese, but it has been verified to work decently for other languages as well.
- Voice cloning: The ability to upload a reference audio recording
VibeVoice Voice Cloning: Info ../../algorithms/89 | Demo 1 ../../result/20251127132310-e047a5397d-mhmw0h.mp3 | Demo 2 ../../en/demo?algorithm id=103
VibeVoice TTS: Info ../../algorithms/91 | Demo 1 ../../result/20251126143934-904cf527ec-jeyeyotuga.wav
We also noted that if a sample contains some music along with words, it can make the generated voice sing.

5) We added a new Crowd removal model based on the BSRoformer architecture. It's available in "MVSep Crowd removal (crowd, other)" under the name "BS Roformer (SDR crowd: 7.21)". The SDR has increased from 6.27 to 7.21.

6) Three new vocal models have been added.
In BS Roformer (vocals, instrumental):
- unwa BS Roformer HyperACE v2 instrum (SDR instrum: 17.40)
- unwa BS Roformer HyperACE v2 vocals (SDR vocals: 11.39)
In MelBand Roformer (vocals, instrumental):
- becruily deux (SDR vocals: 11.35, SDR instrum: 17.66)

7) We added the new Transkun model. Transkun is a modern, open-source model for automatic piano music transcription (Audio-to-MIDI). The official page for the model is here: https://github.com/Yujia-Yan/Transkun. It is considered one of the best (SOTA, State of the Art) in its class. The model can recognize not only the notes themselves but also their duration, loudness (velocity), and pedal usage.

8) We added the new Basic Pitch model. Basic Pitch is a modern neural network from Spotify’s Audio Intelligence Lab that converts melodic audio recordings into notes (MIDI format).
Unlike outdated converters, this model can "hear" not only individual notes but also chords, along with the finest nuances of a performance. Basic Pitch is an "instrument-agnostic" model. This means it handles different timbres equally well:
- Vocals
- Strings: Acoustic and electric guitars, violins, and cellos.
- Keyboards: Pianos, organs, and synthesizers.
- Winds: Flutes, saxophones, trumpets, and others.
Important: The model is designed for melodic instruments. It is not suitable for drums or percussion, as it focuses on pitch rather than rhythmic noise.
Demo ../../result/20260130215808-15168fece5-c82db6e6-c96e-435d-8721-9dacaa69d256.wav | Description ../../algorithms/109?lang=en | Model link ../../home?sep type=114

9) We added the Bark Speech Gen algorithm to the Experimental section. Bark is a transformer-based model created by Suno. It is not just a traditional text-to-speech tool but a fully generative "text-to-audio" system. Its capabilities go beyond ordinary speech synthesis: besides creating highly realistic speech in multiple languages, Bark can generate music, background noises, and simple sound effects. A unique feature of the model is its ability to reproduce non-verbal sounds such as laughter, sighs, and crying, which makes the result sound more lively and natural. In our experiments, it sometimes doesn't follow the text or instructions. See the demo as an example.

10) We added Qwen3-TTS, a powerful speech generation model offering support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control. At MVSep, we use the largest model, with 1.7 billion parameters.
The model is available in 3 variants:
- Qwen3-TTS Custom Voice - A model with predefined speakers | Demo ../../result/20260227230307-b3932b7068-hiyipedave.wav
- Qwen3-TTS Voice Design - A model capable of creating a voice based on a description | Demo ../../result/20260227230729-b31350612e-mobicenusa.wav
- Qwen3-TTS Voice Cloning - A model capable of cloning a voice based on a reference audio file | Demo ../../result/20260227231021-c5fcc365ef-hello-everyone-try-this-new-qwen3-text-to-speech-model-on-mvsepc.flac

11) We added the new HeartMuLa algorithm to the site. It is an advanced open-source family of multimodal foundation models (Apache 2.0 license) designed for high-quality music synthesis and audio processing. Unlike proprietary cloud services such as Suno or Udio, HeartMuLa gives developers the ability to run it locally on their own hardware. The quality of the generated songs is quite good.
Official repository https://github.com/HeartMuLa/heartlib | Demos 1 ../../result/20260321140712-9dd81cde49-debuhopije.wav | Demos 2 ../../ru/demo?algorithm id=121 | Documentation ../../algorithms/123?lang=en
Current limitations:
1) The model struggles to follow tags.
2) The model is computationally heavy and uses a lot of VRAM.
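
A note on the SDR figures quoted in items 5 and 6: SDR (signal-to-distortion ratio) is reported in decibels, so the Crowd removal model's move from 6.27 to 7.21 is a gain of 0.94 dB. The sketch below shows what that means in terms of residual error. It assumes the plain definition, 10·log10 of target energy over error energy; the post does not say which SDR variant or evaluation set MVSep uses, and the function names and toy signal are purely illustrative.

```python
import math


def sdr_db(reference, estimate):
    """Plain signal-to-distortion ratio in dB (an assumed definition,
    not necessarily the exact metric MVSep reports)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0 or error_energy == 0.0:
        raise ValueError("SDR is undefined for a silent reference or a perfect estimate")
    return 10.0 * math.log10(signal_energy / error_energy)


def error_energy_ratio(sdr_before_db, sdr_after_db):
    """Factor by which the residual error energy changes for a given SDR change,
    assuming both scores are measured against the same reference."""
    return 10.0 ** (-(sdr_after_db - sdr_before_db) / 10.0)


# Toy signal: one second of a 440 Hz sine, and an estimate that is 10% too quiet.
reference = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(44100)]
estimate = [0.9 * r for r in reference]

# The error is 0.1 * reference, so the energy ratio is 100 and the SDR is about 20 dB.
toy_sdr = sdr_db(reference, estimate)

# SDR values taken from item 5 (old and new Crowd removal models).
crowd_ratio = error_energy_ratio(6.27, 7.21)
```

Under these assumptions `crowd_ratio` comes out at about 0.8, meaning the new model leaves roughly 20% less residual error energy than the old one on the same material.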