Neural codecs have demonstrated strong performance in high-fidelity compression of audio signals at low bitrates. The token-based representations produced by these codecs have proven particularly useful for generative modeling. While much research has focused on improvements in compression ratio and perceptual transparency, recent works have largely overlooked another desirable codec property – idempotence, the stability of compressed outputs under multiple rounds of encoding. We find that state-of-the-art neural codecs exhibit varied degrees of idempotence, with some degrading audio outputs significantly after as few as three encodings.
We investigate possible causes of low idempotence and devise a method for improving idempotence through fine-tuning a codec model. We then examine the effect of idempotence on a simple conditional generative modeling task, and find that increased idempotence can be achieved without negatively impacting downstream modeling performance – potentially extending the usefulness of neural codecs for practical file compression and iterative generative modeling workflows.
We find that many existing codecs exhibit a significant drop in audio quality after just a few encodings, and that their token representations often change considerably relative to the original encoding. Iteration-to-iteration match rates can nonetheless increase over repeated encodings as codecs converge to stable tokenizations.
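The two match rates reported in the examples below can be illustrated with a small sketch. This is not the evaluation code behind the tables: `toy_encode`, `toy_decode`, and the lossy `gain` are hypothetical stand-ins for a real codec, and the sketch assumes "match vs. orig." compares tokens to the first encoding of the original audio while "match vs. prev." compares them to the preceding encoding.

```python
import numpy as np

def toy_encode(audio, codebook):
    """Hypothetical encoder: index of the nearest scalar codebook entry per sample."""
    return np.abs(audio[:, None] - codebook[None, :]).argmin(axis=1)

def toy_decode(tokens, codebook, gain=0.8):
    """Hypothetical decoder: codebook lookup; gain != 1 makes the round trip lossy."""
    return gain * codebook[tokens]

def match_rates(audio, codebook, n_rounds):
    """Re-encode repeatedly and return (match vs. orig., match vs. prev.) per round."""
    orig = toy_encode(audio, codebook)
    prev = orig
    rates = []
    for _ in range(n_rounds):
        audio = toy_decode(prev, codebook)      # decode the previous tokens
        tokens = toy_encode(audio, codebook)    # encode the decoded audio again
        rates.append((float(np.mean(tokens == orig)),
                      float(np.mean(tokens == prev))))
        prev = tokens
    return rates

# Toy data: uniform noise and a 9-entry scalar codebook (both assumptions).
rng = np.random.default_rng(0)
codebook = np.linspace(-1.0, 1.0, 9)
audio = rng.uniform(-1.0, 1.0, size=1000)
rates = match_rates(audio, codebook, n_rounds=5)

for i, (vs_orig, vs_prev) in enumerate(rates, start=1):
    print(f"re-encoding {i}: {vs_orig:.0%} match vs. orig., {vs_prev:.0%} match vs. prev.")
```

In this toy setting the tokens drift away from the original encoding for a few rounds and then stop changing, which mirrors the qualitative pattern in the tables: the match against the original stays low while the match against the previous iteration climbs toward 100%.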
Example (Expresso) |
---|
Codec | 1 Encoding | 5 Encodings | 10 Encodings | 25 Encodings | 50 Encodings | 500 Encodings |
---|---|---|---|---|---|---|
DAC (44.1kHz / 7.7kbps) |
||||||
45% match vs. orig. | 38% match vs. orig. | 36% match vs. orig. | 35% match vs. orig. | 34% match vs. orig. | ||
85% match vs. prev. | 94% match vs. prev. | 98% match vs. prev. | 98% match vs. prev. | 100% match vs. prev. | ||
ESC (16kHz / 9.0kbps) |
||||||
29% match vs. orig. | 24% match vs. orig. | 21% match vs. orig. | 19% match vs. orig. | 19% match vs. orig. | ||
73% match vs. prev. | 84% match vs. prev. | 95% match vs. prev. | 99% match vs. prev. | 100% match vs. prev. | ||
Encodec-VoiceCraft (16kHz / 2.2kbps) |
||||||
36% match vs. orig. | 27% match vs. orig. | 25% match vs. orig. | 25% match vs. orig. | 25% match vs. orig. | ||
84% match vs. prev. | 86% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | |
Spectral-Codec (22.05kHz / 6.9kbps) |
||||||
24% match vs. orig. | 18% match vs. orig. | 16% match vs. orig. | 14% match vs. orig. | 13% match vs. orig. | ||
64% match vs. prev. | 74% match vs. prev. | 83% match vs. prev. | 85% match vs. prev. | 98% match vs. prev. | ||
SpeechTokenizer (16kHz / 4.0kbps) |
||||||
23% match vs. orig. | 13% match vs. orig. | 2% match vs. orig. | 1% match vs. orig. | 1% match vs. orig. | ||
55% match vs. prev. | 58% match vs. prev. | 52% match vs. prev. | 71% match vs. prev. | 100% match vs. prev. | ||
FACodec (16kHz / 4.8kbps) |
||||||
11% match vs. orig. | 5% match vs. orig. | 1% match vs. orig. | 1% match vs. orig. | 1% match vs. orig. | ||
49% match vs. prev. | 53% match vs. prev. | 53% match vs. prev. | 55% match vs. prev. | 75% match vs. prev. |
Example (VCTK) |
---|
Codec | 1 Encoding | 5 Encodings | 10 Encodings | 25 Encodings | 50 Encodings | 500 Encodings |
---|---|---|---|---|---|---|
DAC (44.1kHz / 7.7kbps) |
||||||
65% match vs. orig. | 62% match vs. orig. | 61% match vs. orig. | 61% match vs. orig. | 61% match vs. orig. | ||
95% match vs. prev. | 99% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | ||
ESC (16kHz / 9.0kbps) |
||||||
34% match vs. orig. | 28% match vs. orig. | 27% match vs. orig. | 27% match vs. orig. | 27% match vs. orig. | ||
78% match vs. prev. | 91% match vs. prev. | 99% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | ||
Encodec-VoiceCraft (16kHz / 2.2kbps) |
||||||
41% match vs. orig. | 35% match vs. orig. | 33% match vs. orig. | 33% match vs. orig. | 33% match vs. orig. | ||
87% match vs. prev. | 95% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | ||
Spectral-Codec (22.05kHz / 6.9kbps) |
||||||
20% match vs. orig. | 13% match vs. orig. | 9% match vs. orig. | 8% match vs. orig. | 8% match vs. orig. | ||
68% match vs. prev. | 74% match vs. prev. | 82% match vs. prev. | 83% match vs. prev. | 99% match vs. prev. | ||
SpeechTokenizer (16kHz / 4.0kbps) |
||||||
27% match vs. orig. | 16% match vs. orig. | 6% match vs. orig. | 2% match vs. orig. | 1% match vs. orig. | ||
64% match vs. prev. | 68% match vs. prev. | 71% match vs. prev. | 76% match vs. prev. | 71% match vs. prev. | ||
FACodec (16kHz / 4.8kbps) |
||||||
8% match vs. orig. | 3% match vs. orig. | 1% match vs. orig. | 1% match vs. orig. | 1% match vs. orig. | ||
50% match vs. prev. | 54% match vs. prev. | 59% match vs. prev. | 55% match vs. prev. | 79% match vs. prev. |
We experiment with fine-tuning codec models to improve idempotence through auxiliary losses on encoded representations. For our re-implementation of DAC, we find that an auxiliary loss penalizing the distance between codebook vectors and re-encoded latents yields strong gains in idempotence without significantly degrading audio quality. In particular, fine-tuning with our codebook loss LCode results in over 90% token idempotence on average at the first encoding iteration across all 9 RVQ levels of our DAC reproduction.
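To make the idea of a loss on re-encoded latents concrete, here is a minimal NumPy sketch of one way such a penalty could be computed. It is an illustration under stated assumptions, not the formulation used for LCode: it uses a single vector-quantization level instead of 9 RVQ levels, a mean squared distance, and hypothetical linear `encode`/`decode` maps in place of the codec networks. In actual fine-tuning the loss would be computed in an autodiff framework, and which terms receive gradients is not specified here.

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-codebook indices and vectors for latents of shape (T, D)."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def codebook_loss(latents, codebook, decode, encode):
    """Mean squared distance between chosen codebook vectors and re-encoded latents."""
    idx, q = quantize(latents, codebook)
    re_latents = encode(decode(q))          # decode to "audio", then encode again
    return float(np.mean((re_latents - q) ** 2)), idx, re_latents

# Hypothetical toy codec: linear decoder (D -> N) and linear encoders (N -> D).
rng = np.random.default_rng(0)
D, N, K, T = 4, 16, 32, 100
w_dec = rng.normal(size=(D, N))
w_enc_exact = np.linalg.pinv(w_dec)                         # exactly inverts the decoder
w_enc_noisy = w_enc_exact + 0.05 * rng.normal(size=(N, D))  # imperfect inverse

codebook = rng.normal(size=(K, D))
latents = rng.normal(size=(T, D))
decode = lambda q: q @ w_dec

loss_exact, idx, re_exact = codebook_loss(latents, codebook, decode, lambda x: x @ w_enc_exact)
loss_noisy, _, re_noisy = codebook_loss(latents, codebook, decode, lambda x: x @ w_enc_noisy)

# Token idempotence: fraction of re-encoded latents that quantize to the same index.
match_exact = float(np.mean(quantize(re_exact, codebook)[0] == idx))
match_noisy = float(np.mean(quantize(re_noisy, codebook)[0] == idx))
print(loss_exact, match_exact)
print(loss_noisy, match_noisy)
```

When the encoder maps decoded audio back onto the chosen codebook vectors, the penalty vanishes and re-quantization returns the same tokens; the further re-encoded latents land from those vectors, the larger the penalty and the more likely a token is to flip.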
Example (Expresso) |
---|
Codec | 1 Encoding | 5 Encodings | 10 Encodings | 25 Encodings | 50 Encodings | 500 Encodings |
---|---|---|---|---|---|---|
DAC-Ours (48kHz / 6.7kbps) |
||||||
46% match vs. orig. | 40% match vs. orig. | 38% match vs. orig. | 38% match vs. orig. | 38% match vs. orig. | ||
86% match vs. prev. | 94% match vs. prev. | 97% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | ||
DAC-Ours + LEnc (48kHz / 6.7kbps) |
||||||
43% match vs. orig. | 38% match vs. orig. | 33% match vs. orig. | 32% match vs. orig. | 32% match vs. orig. | ||
85% match vs. prev. | 92% match vs. prev. | 97% match vs. prev. | 99% match vs. prev. | 100% match vs. prev. | ||
DAC-Ours + LProj (48kHz / 6.7kbps) |
||||||
55% match vs. orig. | 47% match vs. orig. | 45% match vs. orig. | 45% match vs. orig. | 45% match vs. orig. | ||
89% match vs. prev. | 94% match vs. prev. | 99% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | ||
DAC-Ours + LCode (48kHz / 6.7kbps) |
||||||
72% match vs. orig. | 68% match vs. orig. | 65% match vs. orig. | 65% match vs. orig. | 65% match vs. orig. | ||
94% match vs. prev. | 97% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. |
Example (VCTK) |
---|
Codec | 1 Encoding | 5 Encodings | 10 Encodings | 25 Encodings | 50 Encodings | 500 Encodings |
---|---|---|---|---|---|---|
DAC-Ours (48kHz / 6.7kbps) |
||||||
61% match vs. orig. | 57% match vs. orig. | 57% match vs. orig. | 57% match vs. orig. | 57% match vs. orig. | ||
94% match vs. prev. | 99% match vs. prev. | 99% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | ||
DAC-Ours + LEnc (48kHz / 6.7kbps) |
||||||
55% match vs. orig. | 51% match vs. orig. | 48% match vs. orig. | 47% match vs. orig. | 47% match vs. orig. | ||
91% match vs. prev. | 96% match vs. prev. | 99% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | ||
DAC-Ours + LProj (48kHz / 6.7kbps) |
||||||
75% match vs. orig. | 73% match vs. orig. | 73% match vs. orig. | 73% match vs. orig. | 73% match vs. orig. | ||
96% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | ||
DAC-Ours + LCode (48kHz / 6.7kbps) |
||||||
92% match vs. orig. | 92% match vs. orig. | 92% match vs. orig. | 92% match vs. orig. | 92% match vs. orig. | ||
99% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. | 100% match vs. prev. |
Original | DAC-Ours (48kHz / 6.7kbps) | DAC-Ours + LEnc (48kHz / 6.7kbps) | DAC-Ours + LProj (48kHz / 6.7kbps) | DAC-Ours + LCode (48kHz / 6.7kbps) |
---|---|---|---|---|
SQUIM-MOS: 4.38 | SQUIM-MOS: 4.39 | SQUIM-MOS: 4.46 | SQUIM-MOS: 4.42 | |
SQUIM-MOS: 4.41 | SQUIM-MOS: 4.35 | SQUIM-MOS: 4.42 | SQUIM-MOS: 4.41 |
@misc{oreilly2024codedrift,
title={Code Drift: Towards Idempotent Neural Audio Codecs},
author={O'Reilly, Patrick and Seetharaman, Prem and Su, Jiaqi and Jin, Zeyu and Pardo, Bryan},
year={2024},
eprint={2410.11025},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2410.11025},
}