Datasets for automix systems#
In this notebook, we will first discuss the datasets used to train automix systems. Thereafter, we will see how to pre-process the data and set up the dataloaders for training the deep learning models for these systems.
Training automix models requires paired multitrack stems and their corresponding mixdowns. Listed below are the desired properties for these datasets:
Time-aligned stems and mixes: We require time-aligned stems and mixes to allow the models to learn time-wise transformation relationships.
Diverse instrument categories: The more diverse the instruments in the dataset, the more likely the trained system is to perform well on real-world songs.
Diverse genres of songs: Mixing practices vary slightly from one genre to another. Hence, if the dataset has multitrack mixes from different genres, the trained system will be exposed to a more diverse distribution of data.
Dry multitrack stems: Mixing involves processing the recorded dry stems for corrective and aesthetic reasons before summing them to form a cohesive mixture. For a model to learn the correct way to process stems into mixes, we need to train it on pairs of dry, unprocessed stems and mixes. More recently, however, approaches that train automix systems on processed stems from datasets like MUSDB18 have been explored. These approaches use an effect normalisation pre-processing method to deal with pre-processed wet stems. For the scope of this tutorial, we do not discuss these methods. However, we recommend having a look at this paper presented at ISMIR 2022.
Here we list the datasets available for training automix systems.
| Dataset | Size (hrs) | No. of songs | No. of instrument categories | No. of tracks | Type | Usage permissions | Other info | Remarks |
|---|---|---|---|---|---|---|---|---|
| MedleyDB | 7.2 | 122 | 82 | 1-26 | Multitrack, Wav | Open | 44.1 kHz, 16 bit, stereo | - |
| ENST-Drums | 1.25 | - | 1 | 8 | Drums, Wav/AVI | Limited | 44.1 kHz, 16 bit, stereo | Drums-only dataset |
| Cambridge Multitrack | >3 | >50 | >5 | 5-70 | Multitrack, Wav | Open | 44.1 kHz, 16/24 bit, stereo | Not time aligned; recordings are not uniform across songs |
| MUSDB18 | ~10 | 150 | 4 | 4 | Multitrack, Wav | Open | 44.1 kHz, stereo | Used mainly for source separation; wet stems |
| Slakh | 145 | 2100 | 34 | 4-48 | Synthesised, Flac | Open | 44.1 kHz, 16 bit, stereo | Used mainly for source separation; sometimes wet stems |
| Shaking Through | 4.5 | 68 | >30 | >40 | Multitrack, Wav | User only | 44.1/88.2 kHz, 16/24 bit, stereo | - |
| - | - | >1M | >5 | >5 | Multitrack MIDI | Open | MIDI data | MIDI data submitted by users across the world |
For this tutorial, we will use ENST-Drums for training the Wave-U-Net, and ENST-Drums, DSD100, and MedleyDB for training the Differentiable Mixing Console (DMC).
In the following section, we will discuss the recommended pre-processing methods for these datasets and how to set up dataloaders for training the models. This notebook assumes that you have already installed the `automix` package.
We define dataset classes for DSD100, MedleyDB, and ENST-Drums, and then use the `__getitem__()` method to load the audio data into the dataloader for training and testing.
Listed below are a few recommended variables to define in the dataset class:#
Root directory of the folder containing the dataset.
Length of the audio you wish to load for training/testing.
Sample rate at which you wish to load the audio data.
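To make this concrete, here is a minimal skeleton of such a class (the class name and docstring are illustrative, not the toolkit's exact code):
```
import torch

class MultitrackDataset(torch.utils.data.Dataset):
    """Minimal skeleton for a multitrack mixing dataset (illustrative)."""

    def __init__(self, root_dir: str, length: int, sample_rate: float) -> None:
        self.root_dir = root_dir        # root directory of the dataset
        self.length = length            # number of samples to load per example
        self.sample_rate = sample_rate  # sample rate at which audio is loaded
```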
Pre-processing advice for loading multitrack data:#
Discard the examples from the dataset that are shorter than the prescribed length.
```
# code from automix/data/drums.py
# remove any mixes that are shorter than the requested length
self.mix_filepaths = [
    fp
    for fp in self.mix_filepaths
    # use torchaudio.info to read metadata; much faster than loading the whole file
    if torchaudio.info(fp).num_frames > self.length
]
```
Normalise the stems and the mixes after loading (the code below peak-normalises).
```
# code from automix/data/drums.py
y /= y.abs().max().clamp(1e-8)  # peak normalise; clamp avoids dividing by zero
```
Look out for silence in the loaded audio: common practice is to generate a random starting index for the frame from which the audio is loaded. However, it is likely that some multitrack stem, or the mix as a whole, contains only silence in this chunk of loaded audio. This results in NaNs in the audio tensor when it is normalised. In the code block below, we show how to check for silence: we keep generating a new starting index (`offset`) for loading the audio until the loaded chunk has some content and is not just silence (`silent` is `False`).
```
# code from automix/data/drums.py
# load the chunk of the mix
silent = True
while silent:
    # get random offset
    offset = np.random.randint(0, md.num_frames - self.length - 1)
    y, sr = torchaudio.load(
        mix_filepath,
        frame_offset=offset,
        num_frames=self.length,
    )
    energy = (y**2).mean()
    if energy > 1e-8:
        silent = False

# only normalise audio that is not silent
y /= y.abs().max().clamp(1e-8)  # peak normalize
```
ENST Drums#
Described below is the folder structure of the ENST-Drums dataset:
```
ENST-Drums
├── drummer_1
│   ├── annotation
│   └── audio
│       ├── accompaniment
│       ├── dry mix
│       ├── hi-hat
│       ├── kick
│       ├── overhead L
│       ├── overhead R
│       ├── snare
│       ├── tom 1
│       ├── tom 2
│       └── wet mix
├── drummer_2
│   └── (same structure as drummer_1)
└── drummer_3
    └── (same structure as drummer_1)
```
We are going to use the audio from the wet mix folder for this tutorial.
In `automix/data/drums.py`, we define an `ENSTDrumsDataset` class and use `__getitem__()` to load data for the dataloader in our training loop.
```
class ENSTDrumsDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        root_dir: str,
        length: int,
        sample_rate: float,
        drummers: List[int] = [1, 2],
        track_names: List[str] = [
            "kick",
            "snare",
            "hi-hat",
            "overhead_L",
            "overhead_R",
            "tom_1",
            "tom_2",
            "tom_3",
        ],
        indices: Tuple[int, int] = [0, 1],
        wet_mix: bool = False,
        hits: bool = False,
        num_examples_per_epoch: int = 1000,
        seed: int = 42,
    ) -> None:
```
We use indices to define the train-test split.
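As an illustration (a sketch, not necessarily the toolkit's exact logic), the indices could be applied inside `__init__()` to keep only the mixes belonging to the current split:
```
# hypothetical illustration of applying the train-test split
start_idx, stop_idx = indices
self.mix_filepaths = self.mix_filepaths[start_idx:stop_idx]
```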
In the `__getitem__()` of the dataset class, we first generate a `mix_idx`, which is a random number between 0 and the number of songs in the directory (the length of `mix_filepaths`). This allows us to randomly pick a mix/song from `mix_filepaths`.
```
def __getitem__(self, _):
    # select a mix at random
    mix_idx = np.random.randint(0, len(self.mix_filepaths))
    mix_filepath = self.mix_filepaths[mix_idx]
    example_id = os.path.basename(mix_filepath)
    drummer_id = os.path.normpath(mix_filepath).split(os.path.sep)[-4]
    md = torchaudio.info(mix_filepath)  # check length
```
Next, we load the mix (`y`) from the filepath. Make sure to check for silence as discussed above. Once the mix is loaded, peak-normalise it.
```
# load the chunk of the mix
silent = True
while silent:
    # get random offset
    offset = np.random.randint(0, md.num_frames - self.length - 1)
    y, sr = torchaudio.load(
        mix_filepath,
        frame_offset=offset,
        num_frames=self.length,
    )
    energy = (y**2).mean()
    if energy > 1e-8:
        silent = False

y /= y.abs().max().clamp(1e-8)  # peak normalize
```
The last step is to load the stems. `max_num_tracks` is the maximum number of tracks you want to load; some songs might have fewer or more stems than this number. We keep track of empty stems using `pad`, which is `True` whenever the stem is empty. `__getitem__()` returns the stems tensor (`x`), the mix (`y`), and the `pad` information.
```
# -------------------- load the tracks from disk --------------------
x = torch.zeros((self.max_num_tracks, self.length))
pad = [True] * self.max_num_tracks  # note which tracks are empty

for tidx, track_name in enumerate(self.track_names):
    track_filepath = os.path.join(
        self.root_dir,
        drummer_id,
        "audio",
        track_name,
        example_id,
    )
    if os.path.isfile(track_filepath):
        x_s, sr = torchaudio.load(
            track_filepath,
            frame_offset=offset,
            num_frames=self.length,
        )
        x_s /= x_s.abs().max().clamp(1e-6)
        x_s *= 10 ** (-12 / 20.0)  # set the peak level to -12 dBFS
        x[tidx, :] = x_s
        pad[tidx] = False

return x, y, pad
```
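As a usage sketch (the `root_dir` path is a placeholder; adjust it to wherever you extracted the dataset), the class can then be wrapped in a standard PyTorch dataloader:
```
import torch
from automix.data import ENSTDrumsDataset

train_dataset = ENSTDrumsDataset(
    root_dir="./ENST-drums",  # placeholder path
    length=65536,
    sample_rate=44100,
    drummers=[1, 2],
    wet_mix=True,  # use the wet mix folder, as discussed above
)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=1)
x, y, pad = next(iter(train_dataloader))  # stems, mix, empty-track flags
```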
DSD100 dataset#
Described below is the folder structure of the DSD100 dataset:
```
DSD100
├── Train
│   └── Songdir (songname)
│       ├── vocals.wav
│       ├── bass.wav
│       ├── drums.wav
│       ├── other.wav
│       ├── accompaniment.wav
│       └── mixture.wav
└── Test
    └── Songdir (songname)
        ├── vocals.wav
        ├── bass.wav
        ├── drums.wav
        ├── other.wav
        ├── accompaniment.wav
        └── mixture.wav
```
Note: Accompaniment is the sum of bass, drums, and other.
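If you want to verify this yourself, a small sketch like the following (the song directory is a placeholder) sums the three stems and compares the result to accompaniment.wav:
```
import torch
import torchaudio

song_dir = "path/to/DSD100/Test/some_song"  # placeholder

acc, sr = torchaudio.load(f"{song_dir}/accompaniment.wav")
parts = [torchaudio.load(f"{song_dir}/{name}.wav")[0] for name in ["bass", "drums", "other"]]
summed = torch.stack(parts).sum(dim=0)

# allow a small tolerance for quantisation error in the files
print(torch.allclose(acc, summed, atol=1e-3))
```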
For the purpose of training our models, we use:
Input: vocals, bass, drums, and other
Output: Mixture
We will first define a dataset class and use the `__getitem__()` function to load items into the dataloader.
```
# Code from automix/data/dsd100.py
class DSD100Dataset(torch.utils.data.Dataset):
    def __init__(
        self,
        root_dir: str,
        length: int,
        sample_rate: float,
        indices: Tuple[int, int],
        track_names: List[str] = ["bass", "drums", "other", "vocals"],
        num_examples_per_epoch: int = 1000,
    ) -> None:
```
Hereafter, we follow a similar structure in `__getitem__()` as in the case of ENSTDrums.
We first pick a `mix_filepath` at random and then look for a non-silent part to load the mix (`y`).
Then, we load the stems (`x`) starting from the same `offset`, for the prescribed length.
We peak-normalise all the loaded stems and the mix, and save the empty-stem information in the `pad` variable.
We then return `x`, `y`, and `pad`.
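Putting these steps together, a condensed sketch of this logic could look as follows (simplified and illustrative, assuming the same imports as the toolkit code above; the actual implementation in automix/data/dsd100.py differs in its details, e.g. here each stereo stem is stored as two rows of `x`, which is why the stems tensor loaded later has 8 tracks for 4 stereo stems):
```
def __getitem__(self, _):
    # 1. pick a mix at random and search for a non-silent offset
    mix_filepath = self.mix_filepaths[np.random.randint(0, len(self.mix_filepaths))]
    md = torchaudio.info(mix_filepath)
    silent = True
    while silent:
        offset = np.random.randint(0, md.num_frames - self.length - 1)
        y, sr = torchaudio.load(mix_filepath, frame_offset=offset, num_frames=self.length)
        silent = (y**2).mean() <= 1e-8
    y /= y.abs().max().clamp(1e-8)  # peak normalise the mix

    # 2. load each stereo stem from the same offset and peak normalise it
    x = torch.zeros((2 * len(self.track_names), self.length))
    pad = [True] * (2 * len(self.track_names))
    track_dir = os.path.dirname(mix_filepath).replace("Mixtures", "Sources")
    for tidx, name in enumerate(self.track_names):
        track_filepath = os.path.join(track_dir, f"{name}.wav")
        if os.path.isfile(track_filepath):
            x_s, sr = torchaudio.load(track_filepath, frame_offset=offset, num_frames=self.length)
            x_s /= x_s.abs().max().clamp(1e-8)
            x[2 * tidx : 2 * tidx + 2, :] = x_s
            pad[2 * tidx] = pad[2 * tidx + 1] = False

    # 3. return stems, mix, and the empty-stem flags
    return x, y, pad
```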
MedleyDB Dataset#
Described below is the folder structure for MedleyDB:
```
MedleyDB
└── songname
    ├── songname_MIX.wav
    ├── songname_STEMS
    │   └── songname_STEM_{stem_number}.wav
    └── songname_RAW
        └── songname_RAW_{stem_number}_{track_number}.wav
```
The STEMS folder contains some of the RAW audio tracks combined into single audio files, while the RAW folder contains all of the audio tracks individually.
We define the corresponding dataset class like before.
```
class MedleyDBDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        root_dirs: List[str],
        length: int,
        sample_rate: float,
        indices: List[int] = [0, 100],
        max_num_tracks: int = 16,
        num_examples_per_epoch: int = 1000,
        buffer_size_gb: float = 3.0,
        buffer_reload_rate: int = 200,
        normalization: str = "peak",
    ) -> None:
```
`indices` defines the train-test split.
`buffer_size_gb` specifies the amount of data loaded into RAM.
`buffer_reload_rate` specifies how many items are served before new data is loaded into RAM.
In the case of large datasets like MedleyDB, where songs have a large number of stems, it can be very time-consuming to always load audio tracks from disk. Instead, we can load a small random subset of the dataset into RAM every few iterations to speed up the process.
We load data into RAM (tracking the number of bytes loaded with `nbytes_loaded`) every time `items_since_load` exceeds `buffer_reload_rate`:
```
# code from automix/data/medleydb.py
def reload_buffer(self):
    self.examples = []  # clear buffer
    self.items_since_load = 0  # reset iteration counter
    nbytes_loaded = 0  # counter for data in RAM

    # different subset in each reload
    random.shuffle(self.mix_dirs)

    # load files into RAM
    for mix_dir in self.mix_dirs:
        mix_id = os.path.basename(mix_dir)
        mix_filepath = glob.glob(os.path.join(mix_dir, "*.wav"))[0]

        # now check the length of the mix
        try:
            y, sr = torchaudio.load(mix_filepath)
        except:
            print(f"Skipping {mix_filepath}")
            continue

        mix_num_frames = y.shape[-1]
        nbytes = y.element_size() * y.nelement()
        nbytes_loaded += nbytes

        # now find all the track filepaths
        track_filepaths = glob.glob(os.path.join(mix_dir, f"{mix_id}_RAW", "*.wav"))
        if len(track_filepaths) > self.max_num_tracks:
            continue

        # check length of each track
        tracks = []
        for tidx, track_filepath in enumerate(track_filepaths):
            x, sr = torchaudio.load(track_filepath)
            tracks.append(x)
            nbytes = x.element_size() * x.nelement()
            nbytes_loaded += nbytes
            track_num_frames = x.shape[-1]
            if track_num_frames < mix_num_frames:
                mix_num_frames = track_num_frames

        # store this example
        example = {
            "mix_id": os.path.dirname(mix_filepath).split(os.sep)[-1],
            "mix_filepath": mix_filepath,
            "mix_audio": y,
            "num_frames": mix_num_frames,
            "track_filepaths": track_filepaths,
            "track_audio": tracks,
        }
        self.examples.append(example)

        # check the size of loaded data
        if nbytes_loaded > self.buffer_size_gb * 1e9:
            break
```
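On the loading side, a sketch of how `__getitem__()` could consume this buffer (illustrative; the toolkit's implementation also handles chunking and normalisation):
```
# hypothetical sketch of the buffer check inside __getitem__()
def __getitem__(self, _):
    self.items_since_load += 1
    if self.items_since_load > self.buffer_reload_rate:
        self.reload_buffer()  # swap in a fresh random subset from disk
    example = random.choice(self.examples)  # sample a song already held in RAM
    # ... then chunk, normalise, and return the stems and mix as before
```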
```
!pip install git+https://github.com/csteinmetz1/automix-toolkit
```
```
from automix.data import DSD100Dataset

import torch
import torchaudio
import matplotlib.pyplot as plt
import librosa
import librosa.display
import IPython
import numpy as np
import os
```
Now we will download a subset of DSD100 and load it using the dataloader.
```
# First, let's download a subset of DSD100
!wget https://huggingface.co/csteinmetz1/automix-toolkit/resolve/main/DSD100subset.zip
!unzip -o DSD100subset.zip
```
--2024-08-29 16:40:46--  https://huggingface.co/csteinmetz1/automix-toolkit/resolve/main/DSD100subset.zip
Resolving huggingface.co (huggingface.co)... connected.
HTTP request sent, awaiting response... 302 Found [following redirect to cdn-lfs.huggingface.co]
HTTP request sent, awaiting response... 200 OK
Length: 126074934 (120M) [application/zip]
Saving to: ‘DSD100subset.zip.7’
DSD100subset.zip.7 100%[===================>] 120.23M  7.73MB/s    in 11s
2024-08-29 16:40:58 (10.7 MB/s) - ‘DSD100subset.zip.7’ saved [126074934/126074934]
Archive: DSD100subset.zip
inflating: DSD100subset/dsd100.xlsx
inflating: DSD100subset/Sources/Dev/081 - Patrick Talbot - Set Me Free/drums.wav
inflating: DSD100subset/Sources/Dev/081 - Patrick Talbot - Set Me Free/other.wav
inflating: DSD100subset/Sources/Dev/081 - Patrick Talbot - Set Me Free/bass.wav
inflating: DSD100subset/Sources/Dev/081 - Patrick Talbot - Set Me Free/vocals.wav
inflating: DSD100subset/Sources/Dev/055 - Angels In Amplifiers - I'm Alright/vocals.wav
inflating: DSD100subset/Sources/Dev/055 - Angels In Amplifiers - I'm Alright/bass.wav
inflating: DSD100subset/Sources/Dev/055 - Angels In Amplifiers - I'm Alright/drums.wav
inflating: DSD100subset/Sources/Dev/055 - Angels In Amplifiers - I'm Alright/other.wav
inflating: DSD100subset/Sources/Test/049 - Young Griffo - Facade/bass.wav
inflating: DSD100subset/Sources/Test/049 - Young Griffo - Facade/vocals.wav
inflating: DSD100subset/Sources/Test/049 - Young Griffo - Facade/other.wav
inflating: DSD100subset/Sources/Test/049 - Young Griffo - Facade/drums.wav
inflating: DSD100subset/Sources/Test/005 - Angela Thomas Wade - Milk Cow Blues/vocals.wav
inflating: DSD100subset/Sources/Test/005 - Angela Thomas Wade - Milk Cow Blues/drums.wav
inflating: DSD100subset/Sources/Test/005 - Angela Thomas Wade - Milk Cow Blues/other.wav
inflating: DSD100subset/Sources/Test/005 - Angela Thomas Wade - Milk Cow Blues/bass.wav
inflating: DSD100subset/Mixtures/Test/005 - Angela Thomas Wade - Milk Cow Blues/mixture.wav
inflating: DSD100subset/Mixtures/Test/049 - Young Griffo - Facade/mixture.wav
inflating: DSD100subset/Mixtures/Dev/055 - Angels In Amplifiers - I'm Alright/mixture.wav
inflating: DSD100subset/Mixtures/Dev/081 - Patrick Talbot - Set Me Free/mixture.wav
inflating: DSD100subset/dsd100subset.txt
Load the dataset.#
We will use the DSD100Dataset class from the automix.data module. We load data at a 44.1 kHz sample rate, with a training length of 65536 frames. We split the dataset so that the first four examples form the train set and the rest the test set; this is indicated using indices.
```
num_frames = 65536
sample_rate = 44100

train_dataset = DSD100Dataset(
    "./DSD100subset",
    num_frames,
    sample_rate,
    indices=[0, 4],
    num_examples_per_epoch=100,
)

# Define the dataloader
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=1,
    shuffle=False,
    num_workers=1,
    persistent_workers=True,
)

print(train_dataloader)
```
100%|██████████| 4/4 [00:00<00:00, 1800.52it/s]
Found 4 mixes. Using 4 in this subset.
<torch.utils.data.dataloader.DataLoader object at 0x72b81f5401f0>
Loop over the dataloader to load examples with a batch size of 1. We will see the shape of the loaded data.
```
for i, (stems, mix, pad) in enumerate(train_dataloader):
    print("Stems shape: ", stems.shape)
    print("Mix shape: ", mix.shape)
    print("Pad shape: ", len(pad))
    print("Pad: ", pad)
    break
```
Stems shape: torch.Size([1, 8, 65536])
Mix shape: torch.Size([1, 2, 65536])
Pad shape: 1
Pad: tensor([[False, False, False, False, False, False, False, False]])
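Finally, to sanity-check what the dataloader returns, we can plot and audition the loaded mix chunk using the libraries imported earlier (an illustrative add-on, not part of the toolkit):
```
mix_np = mix[0].numpy()  # shape: (channels, samples)

# plot the waveform of the loaded mix chunk
fig, ax = plt.subplots(figsize=(10, 3))
librosa.display.waveshow(mix_np, sr=sample_rate, ax=ax)
ax.set_title("Loaded mix chunk")
plt.show()

# listen to the chunk in the notebook
IPython.display.display(IPython.display.Audio(mix_np, rate=sample_rate))
```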