Recent developments in Generative AI for Audio

Company

AssemblyAI

Date Published

June 27, 2023

Author

Marco Ramponi

Word count

4075

Language

English

Hacker News points

URL

www.assemblyai.com/blog/recent-developments-in-generative-ai-for-audio

Summary

Generative audio models have advanced rapidly in recent years, with notable models developed for both music generation and text-to-speech synthesis. This article discusses some of the key developments.

Music generation, text-to-music synthesis: A growing trend among AI researchers is the development of text-to-music generative models that produce music from natural language descriptions, much as text-to-image diffusion models produce images from prompts. One such model, MuLan, is a transformer-based model trained on soundtracks from 44 million online music videos paired with their text descriptions. It produces embeddings for both a text prompt and a spectrogram of the target audio, placing them in a shared space. Once trained, MuLan works in both directions: given a piece of music, it can associate it with textual descriptions and attributes, and given a textual description, it outputs a representation of musical elements that align with the text.

Music generation, Generative Adversarial Networks (GANs): Another approach to music generation uses GANs, which have been applied successfully to content generation in various domains. For instance, GANSynth is a GAN-based model that generates high-quality audio samples of musical notes from random noise inputs; its authors compare it against autoregressive WaveNet baselines rather than using WaveNet as part of the model.

Speech synthesis, text-to-speech (TTS): In TTS synthesis, models such as VALL-E, NaturalSpeech 2, and Voicebox have shown strong performance in voice cloning and naturalness. NaturalSpeech 2 builds on a latent diffusion model and Voicebox on flow matching, both of which generate audio non-autoregressively.

In summary, generative audio models have made significant strides in recent years, with a range of approaches being explored across music generation and speech synthesis.
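As an illustration of the joint text and audio embedding idea described for MuLan, the sketch below ranks audio clips against a text prompt by cosine similarity. It is a minimal sketch under stated assumptions, not MuLan's implementation: the embedding vectors are hand-written placeholders standing in for the outputs of trained text and audio encoders, and the names `cosine_similarity`, `rank_clips`, and the clip labels are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_clips(text_embedding, audio_embeddings):
    """Return clip names sorted from best to worst match for the text."""
    scores = {
        name: cosine_similarity(text_embedding, emb)
        for name, emb in audio_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical placeholder embeddings (assumption): in a real system these
# would come from a trained text encoder and a trained audio encoder that
# map into the same vector space.
text_embedding = [0.9, 0.1, 0.0]  # stands in for a prompt like "calm piano melody"
audio_embeddings = {
    "piano_clip": [0.8, 0.2, 0.1],
    "drum_loop": [0.1, 0.9, 0.2],
    "synth_pad": [0.3, 0.3, 0.9],
}

ranking = rank_clips(text_embedding, audio_embeddings)
print(ranking)
```

Because text and audio live in one space, the same similarity score can be read in either direction: from a prompt to the closest clips, or from a clip to the closest descriptions.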