Code Drift: Towards Idempotent Neural Audio Codecs

¹Department of Computer Science, Northwestern University  ²Adobe Research

Existing neural audio codecs lack idempotence: their encoded token representations are unstable under successive re-encodings. We fine-tune codec models to improve idempotence without degrading audio quality.

Abstract

Neural codecs have demonstrated strong performance in high-fidelity compression of audio signals at low bitrates. The token-based representations produced by these codecs have proven particularly useful for generative modeling. While much research has focused on improvements in compression ratio and perceptual transparency, recent work has largely overlooked another desirable codec property: idempotence, the stability of compressed outputs under multiple rounds of encoding. We find that state-of-the-art neural codecs exhibit varying degrees of idempotence, with some degrading audio outputs significantly after as few as three encodings.

We investigate possible causes of low idempotence and devise a method for improving idempotence by fine-tuning a codec model. We then examine the effect of idempotence on a simple conditional generative modeling task, and find that increased idempotence can be achieved without negatively impacting downstream modeling performance. This potentially extends the usefulness of neural codecs for practical file compression and iterative generative modeling workflows.

Re-Encoding Examples (Existing Codecs)

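We find that many existing codecs exhibit a significant drop in audio quality after just a few encodings, and that token representations often change considerably. Iteration-to-iteration match rates can increase as codecs converge to stable tokenizations. In the examples below, "match vs. orig." is the percentage of tokens that agree with the first encoding, and "match vs. prev." is the percentage that agree with the immediately preceding encoding.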
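The two match rates can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the "codec" is a toy scalar quantizer whose decoder slightly attenuates its output (`GAIN`), which is not how any of the codecs on this page work, and the names `encode`, `decode`, `match_rate`, and `drift_report` are hypothetical. The sketch only shows how tokens can drift away from the first encoding and then settle into a stable tokenization.

```python
import math

STEP = 0.05  # hypothetical quantizer step size
GAIN = 0.9   # hypothetical decoder attenuation; makes the toy codec non-idempotent


def encode(samples, step=STEP):
    """Toy encoder: map each sample to an integer token by uniform quantization."""
    return [round(x / step) for x in samples]


def decode(tokens, step=STEP, gain=GAIN):
    """Toy decoder: dequantize, then attenuate slightly (the source of drift)."""
    return [t * step * gain for t in tokens]


def match_rate(a, b):
    """Fraction of positions at which two equal-length token sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def drift_report(samples, n_encodings, gain=GAIN):
    """Re-encode repeatedly; return (n, match vs. orig., match vs. prev.) for n >= 2."""
    orig = encode(samples)  # tokens from the first encoding
    prev = orig
    report = []
    for n in range(2, n_encodings + 1):
        cur = encode(decode(prev, gain=gain))  # one decode / re-encode round
        report.append((n, match_rate(cur, orig), match_rate(cur, prev)))
        prev = cur
    return report


if __name__ == "__main__":
    samples = [math.sin(2 * math.pi * i / 50) for i in range(200)]
    for n, vs_orig, vs_prev in drift_report(samples, 30):
        if n in (2, 5, 10, 25, 30):
            print(f"{n:>3} encodings: {vs_orig:.0%} vs. orig., {vs_prev:.0%} vs. prev.")
```

In this toy setting, setting `gain=1.0` makes the codec idempotent, so both rates stay at 100%. With attenuation, the match against the first encoding falls while the match against the previous iteration eventually reaches 100%, mirroring the qualitative pattern in the examples below.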



Example (Expresso)

Codec 1 Encoding 5 Encodings 10 Encodings 25 Encodings 50 Encodings 500 Encodings

DAC (44.1kHz / 7.7kbps)

           Match vs. orig.: 45%, 38%, 36%, 35%, 34%
           Match vs. prev.: 85%, 94%, 98%, 98%, 100%

ESC (16kHz / 9.0kbps)

           Match vs. orig.: 29%, 24%, 21%, 19%, 19%
           Match vs. prev.: 73%, 84%, 95%, 99%, 100%

Encodec-VoiceCraft (16kHz / 2.2kbps)

           Match vs. orig.: 36%, 27%, 25%, 25%, 25%
           Match vs. prev.: 84%, 86%, 100%, 100%, 100%, 100%

Spectral-Codec (22.05kHz / 6.9kbps)

           Match vs. orig.: 24%, 18%, 16%, 14%, 13%
           Match vs. prev.: 64%, 74%, 83%, 85%, 98%

SpeechTokenizer (16kHz / 4.0kbps)

           Match vs. orig.: 23%, 13%, 2%, 1%, 1%
           Match vs. prev.: 55%, 58%, 52%, 71%, 100%

FACodec (16kHz / 4.8kbps)

           Match vs. orig.: 11%, 5%, 1%, 1%, 1%
           Match vs. prev.: 49%, 53%, 53%, 55%, 75%


Example (VCTK)

Codec 1 Encoding 5 Encodings 10 Encodings 25 Encodings 50 Encodings 500 Encodings

DAC (44.1kHz / 7.7kbps)

           Match vs. orig.: 65%, 62%, 61%, 61%, 61%
           Match vs. prev.: 95%, 99%, 100%, 100%, 100%

ESC (16kHz / 9.0kbps)

           Match vs. orig.: 34%, 28%, 27%, 27%, 27%
           Match vs. prev.: 78%, 91%, 99%, 100%, 100%

Encodec-VoiceCraft (16kHz / 2.2kbps)

           Match vs. orig.: 41%, 35%, 33%, 33%, 33%
           Match vs. prev.: 87%, 95%, 100%, 100%, 100%

Spectral-Codec (22.05kHz / 6.9kbps)

           Match vs. orig.: 20%, 13%, 9%, 8%, 8%
           Match vs. prev.: 68%, 74%, 82%, 83%, 99%

SpeechTokenizer (16kHz / 4.0kbps)

           Match vs. orig.: 27%, 16%, 6%, 2%, 1%
           Match vs. prev.: 64%, 68%, 71%, 76%, 71%

FACodec (16kHz / 4.8kbps)

           Match vs. orig.: 8%, 3%, 1%, 1%, 1%
           Match vs. prev.: 50%, 54%, 59%, 55%, 79%


Re-Encoding Examples (Fine-Tuned Codecs)

We experiment with fine-tuning codec models to improve idempotence through auxiliary losses on encoded representations. For our re-implementation of DAC, we find that an auxiliary loss penalizing the distance between codebook vectors and re-encoded latents yields strong gains in idempotence without significantly degrading audio quality. In particular, fine-tuning with our codebook loss LCode results in over 90% token idempotence on average at the first encoding iteration across all 9 RVQ levels of our DAC re-implementation.
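As a rough illustration of the idea behind the codebook loss, the sketch below evaluates a penalty on the distance between each selected codebook vector and the latent obtained by decoding that vector and encoding the result again. It is a sketch under stated assumptions rather than the training code: the encoder, decoder, and single 2-D codebook are hypothetical stand-ins, the squared Euclidean distance is an assumed choice, and details such as loss weighting, gradient handling, and the treatment of multiple RVQ levels are not specified on this page. In actual fine-tuning this quantity would be minimized with an automatic-differentiation framework; here only its value is computed.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z, codebook):
    """Return (index, vector) of the codebook entry nearest to latent z."""
    idx = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return idx, codebook[idx]


def codebook_loss(frames, codebook, encoder, decoder):
    """Mean squared distance between chosen codebook vectors and re-encoded latents."""
    total = 0.0
    for x in frames:
        _, q = quantize(encoder(x), codebook)  # codebook vector from the first encoding
        z_re = encoder(decoder(q))             # latent after a decode / re-encode round
        total += sq_dist(q, z_re)
    return total / len(frames)


# Hypothetical toy components, chosen only so the example is easy to follow.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def toy_encoder(x):
    return (2.0 * x[0], 2.0 * x[1])


def exact_decoder(q):
    return (q[0] / 2.0, q[1] / 2.0)    # exact inverse of toy_encoder


def lossy_decoder(q):
    return (0.45 * q[0], 0.45 * q[1])  # slightly mismatched inverse


if __name__ == "__main__":
    frames = [(0.5, 0.1), (0.05, 0.45), (0.4, 0.6)]
    print(codebook_loss(frames, CODEBOOK, toy_encoder, exact_decoder))
    print(codebook_loss(frames, CODEBOOK, toy_encoder, lossy_decoder))
```

When the decoder exactly inverts the encoder, re-encoded latents land on the codebook vectors and the penalty vanishes. Any mismatch leaves a positive penalty, which is the gap that fine-tuning with such a loss would push toward zero.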



Example (Expresso)

Codec 1 Encoding 5 Encodings 10 Encodings 25 Encodings 50 Encodings 500 Encodings

DAC-Ours (48kHz / 6.7kbps)

           Match vs. orig.: 46%, 40%, 38%, 38%, 38%
           Match vs. prev.: 86%, 94%, 97%, 100%, 100%

DAC-Ours + LEnc (48kHz / 6.7kbps)

           Match vs. orig.: 43%, 38%, 33%, 32%, 32%
           Match vs. prev.: 85%, 92%, 97%, 99%, 100%

DAC-Ours + LProj (48kHz / 6.7kbps)

           Match vs. orig.: 55%, 47%, 45%, 45%, 45%
           Match vs. prev.: 89%, 94%, 99%, 100%, 100%

DAC-Ours + LCode (48kHz / 6.7kbps)

           Match vs. orig.: 72%, 68%, 65%, 65%, 65%
           Match vs. prev.: 94%, 97%, 100%, 100%, 100%


Example (VCTK)

Codec 1 Encoding 5 Encodings 10 Encodings 25 Encodings 50 Encodings 500 Encodings

DAC-Ours (48kHz / 6.7kbps)

           Match vs. orig.: 61%, 57%, 57%, 57%, 57%
           Match vs. prev.: 94%, 99%, 99%, 100%, 100%

DAC-Ours + LEnc (48kHz / 6.7kbps)

           Match vs. orig.: 55%, 51%, 48%, 47%, 47%
           Match vs. prev.: 91%, 96%, 99%, 100%, 100%

DAC-Ours + LProj (48kHz / 6.7kbps)

           Match vs. orig.: 75%, 73%, 73%, 73%, 73%
           Match vs. prev.: 96%, 100%, 100%, 100%, 100%

DAC-Ours + LCode (48kHz / 6.7kbps)

           Match vs. orig.: 92%, 92%, 92%, 92%, 92%
           Match vs. prev.: 99%, 100%, 100%, 100%, 100%


Vocoding Examples (Fine-Tuned Codecs)

Systems: Original; DAC-Ours; DAC-Ours + LEnc; DAC-Ours + LProj; DAC-Ours + LCode (all codecs at 48kHz / 6.7kbps)

SQUIM-MOS: 4.38, 4.39, 4.46, 4.42
SQUIM-MOS: 4.41, 4.35, 4.42, 4.41


BibTeX

@misc{oreilly2024codedrift,
      title={Code Drift: Towards Idempotent Neural Audio Codecs},
      author={O'Reilly, Patrick and Seetharaman, Prem and Su, Jiaqi and Jin, Zeyu and Pardo, Bryan},
      year={2024},
      eprint={2410.11025},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2410.11025},
}