Code Drift: Towards Idempotent Neural Audio Codecs

Abstract

Neural codecs have demonstrated strong performance in high-fidelity compression of audio signals at low bitrates. The token-based representations produced by these codecs have proven particularly useful for generative modeling. While much research has focused on improvements in compression ratio and perceptual transparency, recent works have largely overlooked another desirable codec property – idempotence, the stability of compressed outputs under multiple rounds of encoding. We find that state-of-the-art neural codecs exhibit varied degrees of idempotence, with some degrading audio outputs significantly after as few as three encodings.

We investigate possible causes of low idempotence and devise a method for improving idempotence through fine-tuning a codec model. We then examine the effect of idempotence on a simple conditional generative modeling task, and find that increased idempotence can be achieved without negatively impacting downstream modeling performance – potentially extending the usefulness of neural codecs for practical file compression and iterative generative modeling workflows.

Re-Encoding Examples (Existing Codecs)

We find that many existing codecs exhibit a significant drop in audio quality after just a few encodings, and that token representations often change considerably. Iteration-to-iteration match rates can increase as codecs converge to stable tokenizations.
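The two statistics reported in the tables can be illustrated with a small re-encoding loop. The sketch below reads "match vs. orig." as the fraction of tokens that agree with the first encoding and "match vs. prev." as the fraction that agree with the previous round; that reading is inferred from the labels. The codec here is a hypothetical stand-in (a uniform quantizer whose decoder slightly attenuates the signal), not any of the codecs listed, and the names `reencode`, `drift_report`, `toy_encode`, and `toy_decode` are illustrative only.

```python
import random


def token_match_rate(a, b):
    """Fraction of positions at which two equal-length token sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("token sequences must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def reencode(encode, decode, audio, n_rounds):
    """Encode and decode repeatedly, keeping the tokens from every round."""
    history = []
    for _ in range(n_rounds):
        tokens = encode(audio)
        history.append(tokens)
        audio = decode(tokens)  # the next round encodes the decoded audio
    return history


def drift_report(history):
    """(match vs. orig., match vs. prev.) for rounds 2..N."""
    return [
        (token_match_rate(history[i], history[0]),
         token_match_rate(history[i], history[i - 1]))
        for i in range(1, len(history))
    ]


# Hypothetical toy codec: a uniform quantizer with an imperfect decoder.
STEP = 0.01   # quantizer step size (assumption)
GAIN = 0.97   # decoder attenuation that causes drift (assumption)


def toy_encode(audio):
    return [round(sample / STEP) for sample in audio]


def toy_decode(tokens):
    return [token * STEP * GAIN for token in tokens]


rng = random.Random(0)
audio = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
history = reencode(toy_encode, toy_decode, audio, n_rounds=100)
report = drift_report(history)

for n in (5, 10, 25, 50, 100):
    vs_orig, vs_prev = report[n - 2]  # report[0] describes round 2
    print(f"{n:>3} encodings: {vs_orig:.0%} match vs. orig., "
          f"{vs_prev:.0%} match vs. prev.")
```

In this toy setting the tokens eventually reach a fixed point, so the match against the previous round climbs to 100% while the match against the original stays low, which mirrors the convergence pattern described above. A real codec would replace `toy_encode` and `toy_decode` with its own encoder and decoder, and the token sequences would typically be flattened across codebook levels before comparison.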

Example (Expresso)

Codec	1 Encoding	5 Encodings	10 Encodings	25 Encodings	50 Encodings	500 Encodings
DAC (44.1kHz / 7.7kbps)
		45% match vs. orig.	38% match vs. orig.	36% match vs. orig.	35% match vs. orig.	34% match vs. orig.
		85% match vs. prev.	94% match vs. prev.	98% match vs. prev.	98% match vs. prev.	100% match vs. prev.
ESC (16kHz / 9.0kbps)
		29% match vs. orig.	24% match vs. orig.	21% match vs. orig.	19% match vs. orig.	19% match vs. orig.
		73% match vs. prev.	84% match vs. prev.	95% match vs. prev.	99% match vs. prev.	100% match vs. prev.
Encodec-VoiceCraft (16kHz / 2.2kbps)
		36% match vs. orig.	27% match vs. orig.	25% match vs. orig.	25% match vs. orig.	25% match vs. orig.
	84% match vs. prev.	86% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.
Spectral-Codec (22.05kHz / 6.9kbps)
		24% match vs. orig.	18% match vs. orig.	16% match vs. orig.	14% match vs. orig.	13% match vs. orig.
		64% match vs. prev.	74% match vs. prev.	83% match vs. prev.	85% match vs. prev.	98% match vs. prev.
SpeechTokenizer (16kHz / 4.0kbps)
		23% match vs. orig.	13% match vs. orig.	2% match vs. orig.	1% match vs. orig.	1% match vs. orig.
		55% match vs. prev.	58% match vs. prev.	52% match vs. prev.	71% match vs. prev.	100% match vs. prev.
FACodec (16kHz / 4.8kbps)
		11% match vs. orig.	5% match vs. orig.	1% match vs. orig.	1% match vs. orig.	1% match vs. orig.
		49% match vs. prev.	53% match vs. prev.	53% match vs. prev.	55% match vs. prev.	75% match vs. prev.

Example (VCTK)

Codec	5 Encodings	10 Encodings	25 Encodings	50 Encodings	500 Encodings
DAC (44.1kHz / 7.7kbps)
	65% match vs. orig.	62% match vs. orig.	61% match vs. orig.	61% match vs. orig.	61% match vs. orig.
	95% match vs. prev.	99% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.
ESC (16kHz / 9.0kbps)
	34% match vs. orig.	28% match vs. orig.	27% match vs. orig.	27% match vs. orig.	27% match vs. orig.
	78% match vs. prev.	91% match vs. prev.	99% match vs. prev.	100% match vs. prev.	100% match vs. prev.
Encodec-VoiceCraft (16kHz / 2.2kbps)
	41% match vs. orig.	35% match vs. orig.	33% match vs. orig.	33% match vs. orig.	33% match vs. orig.
	87% match vs. prev.	95% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.
Spectral-Codec (22.05kHz / 6.9kbps)
	20% match vs. orig.	13% match vs. orig.	9% match vs. orig.	8% match vs. orig.	8% match vs. orig.
	68% match vs. prev.	74% match vs. prev.	82% match vs. prev.	83% match vs. prev.	99% match vs. prev.
SpeechTokenizer (16kHz / 4.0kbps)
	27% match vs. orig.	16% match vs. orig.	6% match vs. orig.	2% match vs. orig.	1% match vs. orig.
	64% match vs. prev.	68% match vs. prev.	71% match vs. prev.	76% match vs. prev.	71% match vs. prev.
FACodec (16kHz / 4.8kbps)
	8% match vs. orig.	3% match vs. orig.	1% match vs. orig.	1% match vs. orig.	1% match vs. orig.
	50% match vs. prev.	54% match vs. prev.	59% match vs. prev.	55% match vs. prev.	79% match vs. prev.

Re-Encoding Examples (Fine-Tuned Codecs)

We experiment with fine-tuning codec models to improve idempotence through auxiliary losses on encoded representations. For our re-implementation of DAC, we find that an auxiliary loss penalizing the distance between codebook vectors and re-encoded latents yields strong gains in idempotence without significantly degrading audio quality. In particular, fine-tuning with our codebook loss L_Code results in over 90% token idempotence on average at the first encoding iteration across all 9 RVQ levels of our DAC reproduction.
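As a rough illustration of the idea behind a codebook loss of this kind, the sketch below computes the mean squared distance between the selected codebook vectors and the latents obtained by re-encoding the decoded output. It is a sketch under stated assumptions rather than the formulation used for DAC-Ours: it uses a single vector-quantization level instead of 9 RVQ levels, it only evaluates the loss value (no gradients or training), and the function names and toy encoder and decoder are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(latents, codebook):
    """Nearest-neighbour VQ: return token indices and the chosen codebook vectors."""
    indices = [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]
    return indices, [codebook[k] for k in indices]


def codebook_loss(x, encode, decode, codebook):
    """Mean squared distance between codebook vectors and re-encoded latents.

    Returns (loss, tokens, re-encoded tokens). Single-level VQ is an
    assumption made for brevity.
    """
    z = encode(x)                      # latents of the input
    tokens, q = quantize(z, codebook)  # first tokenization
    z_re = encode(decode(q))           # latents after one decode/encode round trip
    n_values = len(q) * len(q[0])
    loss = sum(sq_dist(a, b) for a, b in zip(z_re, q)) / n_values
    tokens_re, _ = quantize(z_re, codebook)
    return loss, tokens, tokens_re


# Hypothetical toy setup: 2-D latents, 4 codebook entries, identity encoder.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8]]


def identity_encode(x):
    return [list(frame) for frame in x]


def lossless_decode(q):
    return [list(vec) for vec in q]


def drifting_decode(q):
    return [[value + 0.1 for value in vec] for vec in q]  # constant offset models drift


loss_ideal, tokens, tokens_ideal = codebook_loss(
    frames, identity_encode, lossless_decode, codebook)
loss_drift, _, tokens_drift = codebook_loss(
    frames, identity_encode, drifting_decode, codebook)
```

When the decode and re-encode round trip returns exactly the quantized latents, the loss is zero and the tokens are trivially stable. A nonzero value measures how far re-encoded latents have moved from their codebook vectors, and therefore how close they are to crossing into a different codebook cell on the next encoding.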

Example (Expresso)

Codec	5 Encodings	10 Encodings	25 Encodings	50 Encodings	500 Encodings
DAC-Ours (48kHz / 6.7kbps)
	46% match vs. orig.	40% match vs. orig.	38% match vs. orig.	38% match vs. orig.	38% match vs. orig.
	86% match vs. prev.	94% match vs. prev.	97% match vs. prev.	100% match vs. prev.	100% match vs. prev.
DAC-Ours + L_Enc (48kHz / 6.7kbps)
	43% match vs. orig.	38% match vs. orig.	33% match vs. orig.	32% match vs. orig.	32% match vs. orig.
	85% match vs. prev.	92% match vs. prev.	97% match vs. prev.	99% match vs. prev.	100% match vs. prev.
DAC-Ours + L_Proj (48kHz / 6.7kbps)
	55% match vs. orig.	47% match vs. orig.	45% match vs. orig.	45% match vs. orig.	45% match vs. orig.
	89% match vs. prev.	94% match vs. prev.	99% match vs. prev.	100% match vs. prev.	100% match vs. prev.
DAC-Ours + L_Code (48kHz / 6.7kbps)
	72% match vs. orig.	68% match vs. orig.	65% match vs. orig.	65% match vs. orig.	65% match vs. orig.
	94% match vs. prev.	97% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.

Example (VCTK)

Codec	5 Encodings	10 Encodings	25 Encodings	50 Encodings	500 Encodings
DAC-Ours (48kHz / 6.7kbps)
	61% match vs. orig.	57% match vs. orig.	57% match vs. orig.	57% match vs. orig.	57% match vs. orig.
	94% match vs. prev.	99% match vs. prev.	99% match vs. prev.	100% match vs. prev.	100% match vs. prev.
DAC-Ours + L_Enc (48kHz / 6.7kbps)
	55% match vs. orig.	51% match vs. orig.	48% match vs. orig.	47% match vs. orig.	47% match vs. orig.
	91% match vs. prev.	96% match vs. prev.	99% match vs. prev.	100% match vs. prev.	100% match vs. prev.
DAC-Ours + L_Proj (48kHz / 6.7kbps)
	75% match vs. orig.	73% match vs. orig.	73% match vs. orig.	73% match vs. orig.	73% match vs. orig.
	96% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.
DAC-Ours + L_Code (48kHz / 6.7kbps)
	92% match vs. orig.	92% match vs. orig.	92% match vs. orig.	92% match vs. orig.	92% match vs. orig.
	99% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.	100% match vs. prev.

Vocoding Examples (Fine-Tuned Codecs)

DAC-Ours (48kHz / 6.7kbps)	DAC-Ours + L_Enc (48kHz / 6.7kbps)	DAC-Ours + L_Proj (48kHz / 6.7kbps)	DAC-Ours + L_Code (48kHz / 6.7kbps)

SQUIM-MOS: 4.38	SQUIM-MOS: 4.39	SQUIM-MOS: 4.46	SQUIM-MOS: 4.42

SQUIM-MOS: 4.41	SQUIM-MOS: 4.35	SQUIM-MOS: 4.42	SQUIM-MOS: 4.41
