Abstract

We propose a system for mapping arbitrary percussive sound gestures to high-fidelity drum recordings. Our system, dubbed TRIA (The Rhythm In Anything), takes two audio prompts as input, one specifying the desired drum timbre and one specifying the desired rhythm, and generates audio that satisfies both (i.e., plays the desired rhythm with the desired timbre). TRIA can synthesize realistic drum audio from rhythm prompts drawn from a variety of non-drum sound sources (e.g., beatboxing, environmental sounds) in a zero-shot manner, enabling novel creative interactions.



[Figure: TRIA processes timbre and rhythm prompts]

TRIA is trained as a masked language model to generate neural codec tokens given contextual tokens extracted from the timbre prompt and rhythm features extracted from the rhythm prompt. For our rhythm features, we adaptively split a spectrogram into two equal-energy bands and perform normalization and quantization. During training, we augment audio with pitch shift, noise, and other distortions prior to rhythm feature extraction, allowing our system to process a wide variety of rhythm prompts under realistic recording conditions.
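
The rhythm feature step described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain magnitude STFT, a single global split frequency chosen so that each band holds roughly half of the total energy, per-band min-max normalization, and uniform quantization. The function names and the parameters `n_fft`, `hop`, and `n_levels` are hypothetical and chosen only for illustration.

```python
import numpy as np


def magnitude_spectrogram(audio, n_fft=512, hop=128):
    """Magnitude STFT with a Hann window; returns (n_freq_bins, n_frames).

    Assumes len(audio) >= n_fft.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


def two_band_rhythm_features(spec, n_levels=8, eps=1e-8):
    """Split a spectrogram into two roughly equal-energy bands,
    then normalize and quantize each band's energy envelope.

    Assumption: one split bin for the whole clip, min-max normalization,
    and uniform quantization into n_levels integer levels.
    """
    energy = spec ** 2
    per_bin = energy.sum(axis=1)          # total energy in each frequency bin
    cumulative = np.cumsum(per_bin)
    # First bin at which the cumulative energy reaches half of the total.
    split = int(np.searchsorted(cumulative, 0.5 * cumulative[-1])) + 1
    split = min(max(split, 1), spec.shape[0] - 1)  # keep both bands non-empty

    # Per-frame energy of the low and high bands: shape (2, n_frames).
    bands = np.stack([energy[:split].sum(axis=0), energy[split:].sum(axis=0)])

    # Min-max normalize each band to [0, 1).
    lo = bands.min(axis=1, keepdims=True)
    hi = bands.max(axis=1, keepdims=True)
    normalized = (bands - lo) / (hi - lo + eps)

    # Uniform quantization to integers in {0, ..., n_levels - 1}.
    quantized = np.minimum((normalized * n_levels).astype(np.int64), n_levels - 1)
    return quantized, split
```

Calling `two_band_rhythm_features(magnitude_spectrogram(audio))` on a mono waveform yields a `(2, n_frames)` integer array. Because the split point is derived from the prompt's own energy distribution, the same procedure applies to drums, beatboxing, or environmental sound without a fixed crossover frequency.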



[Figure: The proposed architecture]

Below, we provide examples of TRIA combining selected timbre and rhythm prompts to create new output audio.


Audio Examples


# | Timbre Prompt | Rhythm Prompt | Output
1 | Timbre Prompt 1 | Rhythm Prompt 1 | Output 1
2 | Timbre Prompt 2 | Rhythm Prompt 2 | Output 2
3 | Timbre Prompt 3 | Rhythm Prompt 3 | Output 3
4 | Timbre Prompt 4 | Rhythm Prompt 4 | Output 4
5 | Timbre Prompt 5 | Rhythm Prompt 5 | Output 5
6 | Timbre Prompt 6 | Rhythm Prompt 6 | Output 6