Summary

Creating extensive datasets for training source separation models is a time-consuming and resource-intensive task, often requiring acoustically isolated recording environments for each source. While there is a wealth of available live recordings, they cannot be directly utilized for training such models due to the presence of significant bleeding effects. Bleeding refers to the undesired pickup of sound from sources other than the intended one, complicating the task of source separation.


Approaches

Learning-free optimisation algorithm

We propose an optimization-based technique that iteratively estimates the extent of interference (bleed) between sources and derives clean, interference-free signals from raw time-domain multi-source recordings. These bleed-reduced outputs are used as high-quality training targets for large-scale source separation models. Experiments show that this method significantly outperforms prior spectrogram-based approaches, particularly in terms of Source-to-Distortion Ratio (SDR) and perceptual sound quality.

  • Pros: Achieves high SDR performance; produces high-fidelity targets for training.
  • Cons: Assumes instantaneous mixing, which limits real-world applicability; slow due to its iterative nature.
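The iterative estimate-and-invert loop can be sketched as follows. This is a minimal NumPy illustration under the instantaneous model Y = A·S (one microphone per source), not the exact published algorithm: the function name `reduce_bleed`, the soft-threshold sparsity step, and the parameter values are illustrative assumptions.

```python
import numpy as np

def reduce_bleed(Y, n_iter=50, lam=0.05):
    """Alternately estimate the interference matrix A and the clean
    sources S under the instantaneous mixing model Y = A @ S.
    Y: (n_mics, T) raw multi-microphone recordings, one source per mic.
    Returns bleed-reduced source estimates of the same shape."""
    n_mics, _ = Y.shape
    A = np.eye(n_mics)                      # start from "no bleed"
    S = Y.copy()
    for _ in range(n_iter):
        S = np.linalg.pinv(A) @ Y           # sources given current A
        # Soft-threshold: a simple sparsity prior that breaks the
        # trivial fixed point A = I, S = Y.
        S = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
        A = Y @ np.linalg.pinv(S)           # least-squares A given S
        np.fill_diagonal(A, 1.0)            # mic i mainly captures source i
    return S
```

The diagonal of A is pinned to 1 to encode the assumption that each microphone predominantly captures its own source, with off-diagonal entries modelling the bleed levels.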

Convolutional Autoencoders

Assuming interference behaves like additive noise, a simple convolutional autoencoder (CAE) is trained separately for each source. The model performs well on both instantaneous and convolutive mixtures, producing clean outputs with competitive SDR values.

  • Pros: Effective for convolutive mixtures; fast training; low computational cost.
  • Cons: Requires a dedicated CAE per source; phase information is not preserved.

t-UNets

This approach models the problem as instantaneous mixing and operates directly in the waveform domain. Neural networks replace optimization routines and learn the interference matrix implicitly. The network captures inter-microphone relationships and leverages them to suppress interference.

  • Pros: Fast training; low computational load; fast inference; minimal artifacts.
  • Cons: Assumes instantaneous mixing; limited performance on real-world live recordings.
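The key structural idea, feeding all microphone channels into one waveform network with skip connections so the inter-microphone relationships can be learned implicitly, can be sketched as a toy U-Net. The class name `TinyTUNet`, the depth, and the layer sizes are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class TinyTUNet(nn.Module):
    """Toy waveform U-Net: all microphone channels enter together so the
    network can exploit inter-microphone relationships; the skip
    connection preserves fine temporal detail."""
    def __init__(self, n_mics=4, ch=16):
        super().__init__()
        self.down1 = nn.Conv1d(n_mics, ch, 15, stride=2, padding=7)
        self.down2 = nn.Conv1d(ch, ch * 2, 15, stride=2, padding=7)
        self.up1 = nn.ConvTranspose1d(ch * 2, ch, 15, stride=2,
                                      padding=7, output_padding=1)
        self.up2 = nn.ConvTranspose1d(ch * 2, n_mics, 15, stride=2,
                                      padding=7, output_padding=1)
        self.act = nn.ReLU()

    def forward(self, x):                  # x: (batch, n_mics, time)
        d1 = self.act(self.down1(x))       # (batch, ch, time/2)
        d2 = self.act(self.down2(d1))      # (batch, 2*ch, time/4)
        u1 = self.act(self.up1(d2))        # (batch, ch, time/2)
        u1 = torch.cat([u1, d1], dim=1)    # skip connection
        return self.up2(u1)                # (batch, n_mics, time)
```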

GIRNet

GIRNet is designed for convolutive mixtures with additional noise and works in the time domain. It uses a graph attention mechanism to directly estimate interference-reduced signals. Each microphone recording is modeled as a node in a graph, and the graph attention network captures their dependencies to reduce interference.

  • Pros: Handles convolutive mixing; performs well on out-of-domain data with post-processing.
  • Cons: High training time and compute requirements.
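One attention step over the microphone graph can be sketched as below. For brevity this uses scaled dot-product attention rather than GIRNet's exact graph-attention formulation, and `mic_graph_attention`, `Wq`, and `Wk` are illustrative names with random weights standing in for learned projections.

```python
import numpy as np

def mic_graph_attention(H, Wq, Wk):
    """One attention step over microphone nodes.
    H:  (n_mics, d) per-microphone feature vectors (node embeddings).
    Wq, Wk: (d, d) projection matrices (learned in practice).
    Each microphone node aggregates features from the mics whose
    signals are most related to it, i.e. the likely bleed paths."""
    scores = (H @ Wq) @ (H @ Wk).T / np.sqrt(H.shape[1])  # pairwise mic affinities
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)               # row-wise softmax
    return attn @ H                                       # attention-weighted features
```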

Generative approach

(Under Review)


Learnable front ends

(Under Review)


Results on MUSDB18HQ

We evaluate the proposed approaches on the MUSDB18HQ dataset—a widely-used benchmark for music source separation. To simulate realistic interference, the dataset was augmented with:

  • Instantaneous Mixtures: Basic signal-level bleed simulation.
  • Reverberant Mixing: Convolutive mixtures generated using room impulse responses via Pyroomacoustics.
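The difference between the two augmentations can be sketched in NumPy. The gain matrix `A` and the synthetic decaying-noise impulse responses below are illustrative stand-ins; for the actual reverberant condition, realistic room impulse responses come from a simulator such as Pyroomacoustics.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, T = 16000, 16000                    # 1 s of audio at 16 kHz
sources = rng.standard_normal((2, T))   # stand-ins for two clean stems

# Instantaneous bleed: a gain matrix applied sample-by-sample.
A = np.array([[1.0, 0.3],
              [0.2, 1.0]])              # off-diagonals are the bleed levels
inst_mix = A @ sources                  # (2 mics, T)

# Convolutive bleed: each source reaches each mic through a short
# impulse response (exponentially decaying noise here).
rir_len = 512
conv_mix = np.zeros((2, T))
for m in range(2):
    for s in range(2):
        rir = rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / 100.0)
        rir *= A[m, s] / np.abs(rir).max()      # scale by the bleed gain
        conv_mix[m] += np.convolve(sources[s], rir)[:T]
```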
Performance of Proposed Models

The table below compares various proposed methods against the baseline KAMIR algorithm. The median SDR (Source-to-Distortion Ratio) is reported across vocal, bass, drums, and other stems:

| Models        | Vocal | Bass  | Drums | Others | Overall SDR |
|---------------|-------|-------|-------|--------|-------------|
| Reference     | 1.86  | 4.44  | 6.78  | 5.96   | 5.82        |
| KAMIR         | 13.84 | 6.75  | 6.83  | 5.61   | 7.00        |
| DI-CAE        | 1.89  | 5.81  | 6.18  | 4.48   | 6.92        |
| Optimisation* | 39.25 | 42.90 | 44.22 | 42.11  | 42.12       |
| t-UNet        | 8.05  | 9.05  | 8.255 | 6.69   | 8.83        |
| f-UNet        | 6.15  | 9.41  | 7.01  | 7.23   | 7.45        |
| df-UNet       | 6.50  | 9.84  | 8.85  | 8.32   | 8.37        |
| df-UNet-GAT   | 12.53 | 11.97 | 11.77 | 13.00  | 12.31       |

* Optimization is non-real-time and assumes ideal multichannel access.