Datasets for automix systems#
In this notebook, we will first discuss the datasets used to train automix systems. Thereafter, we will see how to pre-process the data and set up the dataloaders for training the deep learning models for these systems.
Training automix models requires paired multitrack stems and their corresponding mixdowns. Listed below are the desired properties for these datasets:
Time-aligned stems and mixes: We require time-aligned stems and mixes to allow the models to learn time-wise transformation relationships.
Diverse instrument categories: The more diverse the instruments in the dataset, the more likely the trained system is to perform well on real-world songs.
Diverse genres of songs: Mixing practices vary slightly from one genre to another. Hence, if the dataset has multitrack mixes from different genres, the trained system will be exposed to a more diverse distribution of data.
Dry multitrack stems: Mixing involves processing the recorded dry stems for corrective and aesthetic reasons before summing them to form a cohesive mixture. For a model to learn the correct way to process stems into mixes, we need to train it on pairs of dry, unprocessed stems and mixes. More recently, however, approaches that train automix systems on processed stems from datasets like MUSDB18 have been explored. These approaches use an effect normalisation pre-processing method to deal with pre-processed wet stems. For the scope of this tutorial, we do not discuss these methods. However, we recommend having a look at this paper presented at ISMIR 2022.
Here we list the datasets available for training automix systems.
| Dataset | Size (hrs) | No. of songs | No. of instrument categories | No. of tracks | Type | Usage permissions | Other info | Remarks |
|---|---|---|---|---|---|---|---|---|
| MedleyDB | 7.2 | 122 | 82 | 1-26 | Multitrack, Wav | Open | 44.1 kHz, 16 bit, stereo | - |
| ENST-Drums | 1.25 | - | 1 | 8 | Drums, Wav/AVI | Limited | 44.1 kHz, 16 bit, stereo | Drums-only dataset |
| Cambridge Multitrack | >3 | >50 | >5 | 5-70 | Multitrack, Wav | Open | 44.1 kHz, 16/24 bit, stereo | Not time aligned; recordings are not uniform across songs |
| MUSDB18 | ~10 | 150 | 4 | 4 | Multitrack, Wav | Open | 44.1 kHz, stereo | Used mainly for source separation; wet stems |
| Slakh | 145 | 2100 | 34 | 4-48 | Synthesised, Flac | Open | 44.1 kHz, 16 bit, stereo | Used mainly for source separation; sometimes wet stems |
| Shaking Through | 4.5 | 68 | >30 | >40 | Multitrack, Wav | User only | 44.1/88.2 kHz, 16/24 bit, stereo | - |
| - | - | >1M | >5 | >5 | Multitrack MIDI | Open | MIDI data | MIDI data submitted by users across the world |
For this tutorial, we will use ENST-Drums for training the Wave-U-Net, and ENST-Drums, DSD100, and MedleyDB for training the Differentiable Mixing Console (DMC).
In the following section, we will discuss the recommended pre-processing methods for these datasets and how to set up dataloaders for training the models. This notebook assumes that you have already installed the `automix` package.
We define dataset classes for DSD100, MedleyDB, and ENST-Drums, and then use the `__getitem__()` method to load the audio data into the dataloader for training and testing.
Listed below are a few recommended variables to define in the dataset class:#
Root directory of the folder containing the dataset.
Length of the audio you wish to load for training/testing.
Sample rate at which you wish to load the audio data.
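To make this concrete, here is a minimal skeleton of such a class (the class name and docstring are illustrative, not the toolkit's exact code):
```
import torch

class MultitrackDataset(torch.utils.data.Dataset):
    """Minimal skeleton for a multitrack mixing dataset (illustrative)."""

    def __init__(self, root_dir: str, length: int, sample_rate: float) -> None:
        self.root_dir = root_dir        # root directory of the dataset
        self.length = length            # number of samples to load per example
        self.sample_rate = sample_rate  # sample rate at which audio is loaded
```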
Pre-processing advice for loading multitrack data:#
Discard the examples from the dataset that are shorter than the prescribed length.
```
# code from automix/data/drums.py
# remove any mixes that are shorter than the requested length
self.mix_filepaths = [
    fp
    for fp in self.mix_filepaths
    # use torchaudio.info to read metadata; much faster than loading the whole file
    if torchaudio.info(fp).num_frames > self.length
]
```
Normalise the stems and the mixes after loading (the code below peak-normalises).
```
# code from automix/data/drums.py
y /= y.abs().max().clamp(1e-8)  # peak normalise; clamp avoids dividing by zero
```
Look out for silence in the loaded audio: common practice is to generate a random starting index for the frame from which the audio is loaded. However, it is likely that some multitrack stem, or the mix as a whole, contains only silence in this chunk of loaded audio. This results in NaNs in the audio tensor when it is normalised. In the code block below, we show how to check for silence: we keep generating a new starting index (`offset`) for loading the audio until the loaded chunk has some content and is not just silence (`silent` is `False`).
```
# code from automix/data/drums.py
# load the chunk of the mix
silent = True
while silent:
    # get random offset
    offset = np.random.randint(0, md.num_frames - self.length - 1)
    y, sr = torchaudio.load(
        mix_filepath,
        frame_offset=offset,
        num_frames=self.length,
    )
    energy = (y**2).mean()
    if energy > 1e-8:
        silent = False

# only normalise audio that is not silent
y /= y.abs().max().clamp(1e-8)  # peak normalize
```
ENST Drums#
Described below is the folder structure of the ENST-Drums dataset:
```
ENST-Drums
├── drummer_1
│   ├── annotation
│   └── audio
│       ├── accompaniment
│       ├── dry mix
│       ├── hi-hat
│       ├── kick
│       ├── overhead L
│       ├── overhead R
│       ├── snare
│       ├── tom 1
│       ├── tom 2
│       └── wet mix
├── drummer_2
│   └── (same structure as drummer_1)
└── drummer_3
    └── (same structure as drummer_1)
```
We are going to use the audio from the wet mix folder for this tutorial.
In `automix/data/drums.py`, we define an `ENSTDrumsDataset` class and use `__getitem__()` to load data for the dataloader in our training loop.
```
class ENSTDrumsDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        root_dir: str,
        length: int,
        sample_rate: float,
        drummers: List[int] = [1, 2],
        track_names: List[str] = [
            "kick",
            "snare",
            "hi-hat",
            "overhead_L",
            "overhead_R",
            "tom_1",
            "tom_2",
            "tom_3",
        ],
        indices: Tuple[int, int] = [0, 1],
        wet_mix: bool = False,
        hits: bool = False,
        num_examples_per_epoch: int = 1000,
        seed: int = 42,
    ) -> None:
```
We use indices to define the train-test split.
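As an illustration (a sketch, not necessarily the toolkit's exact logic), the indices could be applied inside `__init__()` to keep only the mixes belonging to the current split:
```
# hypothetical illustration of applying the train-test split
start_idx, stop_idx = indices
self.mix_filepaths = self.mix_filepaths[start_idx:stop_idx]
```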
In the `__getitem__()` of the dataset class, we first generate a `mix_idx`, which is a random number between 0 and the number of songs in the directory (the length of `mix_filepaths`). This allows us to randomly pick a mix/song from `mix_filepaths`.
```
def __getitem__(self, _):
    # select a mix at random
    mix_idx = np.random.randint(0, len(self.mix_filepaths))
    mix_filepath = self.mix_filepaths[mix_idx]
    example_id = os.path.basename(mix_filepath)
    drummer_id = os.path.normpath(mix_filepath).split(os.path.sep)[-4]
    md = torchaudio.info(mix_filepath)  # check length
```
Next, we load the mix (`y`) from the filepath. Make sure to check for silence as discussed above. Once the mix is loaded, peak-normalise it.
```
# load the chunk of the mix
silent = True
while silent:
    # get random offset
    offset = np.random.randint(0, md.num_frames - self.length - 1)
    y, sr = torchaudio.load(
        mix_filepath,
        frame_offset=offset,
        num_frames=self.length,
    )
    energy = (y**2).mean()
    if energy > 1e-8:
        silent = False

y /= y.abs().max().clamp(1e-8)  # peak normalize
```
The last step is to load the stems. `max_num_tracks` is the maximum number of tracks you want to load; some songs might have fewer or more stems than this number. We keep track of empty stems using `pad`, which is `True` whenever the stem is empty. `__getitem__()` returns the stems tensor (`x`), the mix (`y`), and the `pad` information.
```
# -------------------- load the tracks from disk --------------------
x = torch.zeros((self.max_num_tracks, self.length))
pad = [True] * self.max_num_tracks  # note which tracks are empty

for tidx, track_name in enumerate(self.track_names):
    track_filepath = os.path.join(
        self.root_dir,
        drummer_id,
        "audio",
        track_name,
        example_id,
    )
    if os.path.isfile(track_filepath):
        x_s, sr = torchaudio.load(
            track_filepath,
            frame_offset=offset,
            num_frames=self.length,
        )
        x_s /= x_s.abs().max().clamp(1e-6)
        x_s *= 10 ** (-12 / 20.0)  # set the peak level to -12 dBFS
        x[tidx, :] = x_s
        pad[tidx] = False

return x, y, pad
```
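As a usage sketch (the `root_dir` path is a placeholder; adjust it to wherever you extracted the dataset), the class can then be wrapped in a standard PyTorch dataloader:
```
import torch
from automix.data import ENSTDrumsDataset

train_dataset = ENSTDrumsDataset(
    root_dir="./ENST-drums",  # placeholder path
    length=65536,
    sample_rate=44100,
    drummers=[1, 2],
    wet_mix=True,  # use the wet mix folder, as discussed above
)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=1)
x, y, pad = next(iter(train_dataloader))  # stems, mix, empty-track flags
```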
DSD100 dataset#
Described below is the folder structure of the DSD100 dataset:
```
DSD100
├── Train
│   └── Songdir (songname)
│       ├── vocals.wav
│       ├── bass.wav
│       ├── drums.wav
│       ├── other.wav
│       ├── accompaniment.wav
│       └── mixture.wav
└── Test
    └── Songdir (songname)
        ├── vocals.wav
        ├── bass.wav
        ├── drums.wav
        ├── other.wav
        ├── accompaniment.wav
        └── mixture.wav
```
Note: Accompaniment is the sum of bass, drums, and other.
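If you want to verify this yourself, a small sketch like the following (the song directory is a placeholder) sums the three stems and compares the result to accompaniment.wav:
```
import torch
import torchaudio

song_dir = "path/to/DSD100/Test/some_song"  # placeholder

acc, sr = torchaudio.load(f"{song_dir}/accompaniment.wav")
parts = [torchaudio.load(f"{song_dir}/{name}.wav")[0] for name in ["bass", "drums", "other"]]
summed = torch.stack(parts).sum(dim=0)

# allow a small tolerance for quantisation error in the files
print(torch.allclose(acc, summed, atol=1e-3))
```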
For the purpose of training our models, we use:
Input: vocals, bass, drums, and other
Output: Mixture
We will first define a dataset class and use the `__getitem__()` function to load items into the dataloader.
```
# Code from automix/data/dsd100.py
class DSD100Dataset(torch.utils.data.Dataset):
    def __init__(
        self,
        root_dir: str,
        length: int,
        sample_rate: float,
        indices: Tuple[int, int],
        track_names: List[str] = ["bass", "drums", "other", "vocals"],
        num_examples_per_epoch: int = 1000,
    ) -> None:
```
Hereafter, we follow a similar structure in `__getitem__()` as in the case of ENSTDrums.
We first pick a `mix_filepath` at random and then look for a non-silent part to load the mix (`y`).
Then, we load the stems (`x`) starting from the same `offset`, for the prescribed length.
We peak-normalise all the loaded stems and the mix, and save the empty-stem information in the `pad` variable.
We then return `x`, `y`, and `pad`.
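Putting these steps together, a condensed sketch of this logic could look as follows (simplified and illustrative, assuming the same imports as the toolkit code above; the actual implementation in automix/data/dsd100.py differs in its details, e.g. here each stereo stem is stored as two rows of `x`, which is why the stems tensor loaded later has 8 tracks for 4 stereo stems):
```
def __getitem__(self, _):
    # 1. pick a mix at random and search for a non-silent offset
    mix_filepath = self.mix_filepaths[np.random.randint(0, len(self.mix_filepaths))]
    md = torchaudio.info(mix_filepath)
    silent = True
    while silent:
        offset = np.random.randint(0, md.num_frames - self.length - 1)
        y, sr = torchaudio.load(mix_filepath, frame_offset=offset, num_frames=self.length)
        silent = (y**2).mean() <= 1e-8
    y /= y.abs().max().clamp(1e-8)  # peak normalise the mix

    # 2. load each stereo stem from the same offset and peak normalise it
    x = torch.zeros((2 * len(self.track_names), self.length))
    pad = [True] * (2 * len(self.track_names))
    track_dir = os.path.dirname(mix_filepath).replace("Mixtures", "Sources")
    for tidx, name in enumerate(self.track_names):
        track_filepath = os.path.join(track_dir, f"{name}.wav")
        if os.path.isfile(track_filepath):
            x_s, sr = torchaudio.load(track_filepath, frame_offset=offset, num_frames=self.length)
            x_s /= x_s.abs().max().clamp(1e-8)
            x[2 * tidx : 2 * tidx + 2, :] = x_s
            pad[2 * tidx] = pad[2 * tidx + 1] = False

    # 3. return stems, mix, and the empty-stem flags
    return x, y, pad
```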
MedleyDB Dataset#
Described below is the folder structure for MedleyDB:
```
MedleyDB
└── songname
    ├── songname_MIX.wav
    ├── songname_STEMS
    │   └── songname_STEM_{stem_number}.wav
    └── songname_RAW
        └── songname_RAW_{stem_number}_{track_number}.wav
```
The STEMS folder contains some of the RAW audio tracks combined into single audio files, while the RAW folder contains all of the audio tracks individually.
We define the corresponding dataset class like before.
```
class MedleyDBDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        root_dirs: List[str],
        length: int,
        sample_rate: float,
        indices: List[int] = [0, 100],
        max_num_tracks: int = 16,
        num_examples_per_epoch: int = 1000,
        buffer_size_gb: float = 3.0,
        buffer_reload_rate: int = 200,
        normalization: str = "peak",
    ) -> None:
```
`indices` defines the train-test split.
`buffer_size_gb` specifies the amount of data loaded into RAM.
`buffer_reload_rate` specifies how many items are served before new data is loaded into RAM.
In the case of large datasets like MedleyDB, where songs have a large number of stems, it can be very time-consuming to always load audio tracks from disk. Instead, we can load a small random subset of the dataset into RAM every few iterations to speed up the process.
We load data into RAM (tracking the number of bytes loaded with `nbytes_loaded`) every time `items_since_load` exceeds `buffer_reload_rate`:
```
# code from automix/data/medleydb.py
def reload_buffer(self):
    self.examples = []  # clear buffer
    self.items_since_load = 0  # reset iteration counter
    nbytes_loaded = 0  # counter for data in RAM

    # different subset in each reload
    random.shuffle(self.mix_dirs)

    # load files into RAM
    for mix_dir in self.mix_dirs:
        mix_id = os.path.basename(mix_dir)
        mix_filepath = glob.glob(os.path.join(mix_dir, "*.wav"))[0]

        # now check the length of the mix
        try:
            y, sr = torchaudio.load(mix_filepath)
        except:
            print(f"Skipping {mix_filepath}")
            continue

        mix_num_frames = y.shape[-1]
        nbytes = y.element_size() * y.nelement()
        nbytes_loaded += nbytes

        # now find all the track filepaths
        track_filepaths = glob.glob(os.path.join(mix_dir, f"{mix_id}_RAW", "*.wav"))
        if len(track_filepaths) > self.max_num_tracks:
            continue

        # check length of each track
        tracks = []
        for tidx, track_filepath in enumerate(track_filepaths):
            x, sr = torchaudio.load(track_filepath)
            tracks.append(x)
            nbytes = x.element_size() * x.nelement()
            nbytes_loaded += nbytes
            track_num_frames = x.shape[-1]
            if track_num_frames < mix_num_frames:
                mix_num_frames = track_num_frames

        # store this example
        example = {
            "mix_id": os.path.dirname(mix_filepath).split(os.sep)[-1],
            "mix_filepath": mix_filepath,
            "mix_audio": y,
            "num_frames": mix_num_frames,
            "track_filepaths": track_filepaths,
            "track_audio": tracks,
        }
        self.examples.append(example)

        # check the size of loaded data
        if nbytes_loaded > self.buffer_size_gb * 1e9:
            break
```
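On the loading side, a sketch of how `__getitem__()` could consume this buffer (illustrative; the toolkit's implementation also handles chunking and normalisation):
```
# hypothetical sketch of the buffer check inside __getitem__()
def __getitem__(self, _):
    self.items_since_load += 1
    if self.items_since_load > self.buffer_reload_rate:
        self.reload_buffer()  # swap in a fresh random subset from disk
    example = random.choice(self.examples)  # sample a song already held in RAM
    # ... then chunk, normalise, and return the stems and mix as before
```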
```
!pip install git+https://github.com/csteinmetz1/automix-toolkit
```
```
from automix.data import DSD100Dataset

import torch
import torchaudio
import matplotlib.pyplot as plt
import librosa
import librosa.display
import IPython
import numpy as np
import os
```
Now we will download a subset of DSD100 and load it using the dataloader.
```
# First, let's download a subset of DSD100
!wget https://huggingface.co/csteinmetz1/automix-toolkit/resolve/main/DSD100subset.zip
!unzip -o DSD100subset.zip
```
--2024-08-29 16:40:46--  https://huggingface.co/csteinmetz1/automix-toolkit/resolve/main/DSD100subset.zip
Resolving huggingface.co (huggingface.co)... connected.
HTTP request sent, awaiting response... 302 Found [following redirect to cdn-lfs.huggingface.co]
HTTP request sent, awaiting response... 200 OK
Length: 126074934 (120M) [application/zip]
Saving to: ‘DSD100subset.zip.7’
DSD100subset.zip.7 100%[===================>] 120.23M  7.73MB/s    in 11s
2024-08-29 16:40:58 (10.7 MB/s) - ‘DSD100subset.zip.7’ saved [126074934/126074934]
Archive: DSD100subset.zip
inflating: DSD100subset/dsd100.xlsx
inflating: DSD100subset/Sources/Dev/081 - Patrick Talbot - Set Me Free/drums.wav
inflating: DSD100subset/Sources/Dev/081 - Patrick Talbot - Set Me Free/other.wav
inflating: DSD100subset/Sources/Dev/081 - Patrick Talbot - Set Me Free/bass.wav
inflating: DSD100subset/Sources/Dev/081 - Patrick Talbot - Set Me Free/vocals.wav
inflating: DSD100subset/Sources/Dev/055 - Angels In Amplifiers - I'm Alright/vocals.wav
inflating: DSD100subset/Sources/Dev/055 - Angels In Amplifiers - I'm Alright/bass.wav
inflating: DSD100subset/Sources/Dev/055 - Angels In Amplifiers - I'm Alright/drums.wav
inflating: DSD100subset/Sources/Dev/055 - Angels In Amplifiers - I'm Alright/other.wav
inflating: DSD100subset/Sources/Test/049 - Young Griffo - Facade/bass.wav
inflating: DSD100subset/Sources/Test/049 - Young Griffo - Facade/vocals.wav
inflating: DSD100subset/Sources/Test/049 - Young Griffo - Facade/other.wav
inflating: DSD100subset/Sources/Test/049 - Young Griffo - Facade/drums.wav
inflating: DSD100subset/Sources/Test/005 - Angela Thomas Wade - Milk Cow Blues/vocals.wav
inflating: DSD100subset/Sources/Test/005 - Angela Thomas Wade - Milk Cow Blues/drums.wav
inflating: DSD100subset/Sources/Test/005 - Angela Thomas Wade - Milk Cow Blues/other.wav
inflating: DSD100subset/Sources/Test/005 - Angela Thomas Wade - Milk Cow Blues/bass.wav
inflating: DSD100subset/Mixtures/Test/005 - Angela Thomas Wade - Milk Cow Blues/mixture.wav
inflating: DSD100subset/Mixtures/Test/049 - Young Griffo - Facade/mixture.wav
inflating: DSD100subset/Mixtures/Dev/055 - Angels In Amplifiers - I'm Alright/mixture.wav
inflating: DSD100subset/Mixtures/Dev/081 - Patrick Talbot - Set Me Free/mixture.wav
inflating: DSD100subset/dsd100subset.txt
Load the dataset.#
We will use the DSD100Dataset class from the automix.data module. We load data at a 44.1 kHz sample rate, with a training length of 65536 frames. We split the dataset so that the first four examples form the train set and the rest the test set; this is indicated using indices.
```
num_frames = 65536
sample_rate = 44100

train_dataset = DSD100Dataset(
    "./DSD100subset",
    num_frames,
    sample_rate,
    indices=[0, 4],
    num_examples_per_epoch=100,
)

# Define the dataloader
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=1,
    shuffle=False,
    num_workers=1,
    persistent_workers=True,
)

print(train_dataloader)
```
100%|██████████| 4/4 [00:00<00:00, 1800.52it/s]
Found 4 mixes. Using 4 in this subset.
<torch.utils.data.dataloader.DataLoader object at 0x72b81f5401f0>
Loop over the dataloader to load examples with a batch size of 1. We will see the shape of the loaded data.
```
for i, (stems, mix, pad) in enumerate(train_dataloader):
    print("Stems shape: ", stems.shape)
    print("Mix shape: ", mix.shape)
    print("Pad shape: ", len(pad))
    print("Pad: ", pad)
    break
```
Stems shape: torch.Size([1, 8, 65536])
Mix shape: torch.Size([1, 2, 65536])
Pad shape: 1
Pad: tensor([[False, False, False, False, False, False, False, False]])
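Finally, to sanity-check what the dataloader returns, we can plot and audition the loaded mix chunk using the libraries imported earlier (an illustrative add-on, not part of the toolkit):
```
mix_np = mix[0].numpy()  # shape: (channels, samples)

# plot the waveform of the loaded mix chunk
fig, ax = plt.subplots(figsize=(10, 3))
librosa.display.waveshow(mix_np, sr=sample_rate, ax=ax)
ax.set_title("Loaded mix chunk")
plt.show()

# listen to the chunk in the notebook
IPython.display.display(IPython.display.Audio(mix_np, rate=sample_rate))
```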