Recent developments in Generative AI for Audio
In recent years, generative audio models have advanced rapidly, with notable models developed for both music generation and text-to-speech synthesis. This article discusses some of the key developments.

Music Generation Models: Text-to-Music Synthesis
A growing trend among AI researchers is the development of text-to-music generative models that produce music from natural language descriptions, much as text-to-image diffusion models produce images from prompts. One building block for such systems is MuLan, a joint audio-text embedding model trained on soundtracks from 44 million online music videos alongside their text descriptions. It computes an embedding for a text prompt and an embedding for a spectrogram of the target audio, and is trained so that matching text and music land close together in a shared vector space. Once trained, MuLan works in both directions: given a piece of music, it can be matched to textual descriptions and attributes, and given a textual description, it outputs a representation that aligns with music fitting that text. MuLan does not generate audio itself; text-to-music generators such as MusicLM use its embeddings as conditioning.

Music Generation Models: Generative Adversarial Networks (GANs)
Another approach to music generation uses GANs, which have been applied successfully to content generation in various domains. For instance, GANSynth generates high-quality audio samples of musical notes from random noise inputs, conditioned on pitch. Rather than predicting a waveform sample by sample as the autoregressive WaveNet does, GANSynth generates spectral representations (log-magnitude and instantaneous frequency) with a convolutional generator and discriminator, which makes synthesis much faster than the WaveNet baselines it is compared against.

Speech Synthesis Models: Text-to-Speech (TTS)
In TTS synthesis, several breakthroughs have been made over the past few years, with VALL-E, NaturalSpeech 2, and Voicebox showing strong performance in voice cloning and naturalness. These models take different architectural routes: VALL-E treats TTS as language modeling over discrete neural codec tokens, NaturalSpeech 2 uses a latent diffusion model, and Voicebox uses flow matching, the latter two for non-autoregressive audio generation.

In summary, generative audio models have made significant strides in recent years, with innovative approaches being explored across music generation and speech synthesis.
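To make the shared text-audio embedding idea behind MuLan more concrete, here is a minimal sketch of how a text embedding could be compared against audio embeddings by cosine similarity. It is not MuLan's implementation: the three-dimensional vectors are made-up stand-ins for the outputs of trained text and audio encoders, and the names `cosine_similarity`, `rank_by_similarity`, and the clip labels are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumption: hand-written toy vectors, NOT real model outputs.
# In a real system these would come from a trained text encoder
# and a trained audio (spectrogram) encoder sharing one vector space.
text_embedding = [0.9, 0.1, 0.0]
audio_embeddings = {
    "clip_a": [0.8, 0.2, 0.1],
    "clip_b": [0.0, 0.1, 0.9],
    "clip_c": [0.3, 0.9, 0.1],
}

ranking = rank_by_similarity(text_embedding, audio_embeddings)
best_clip = ranking[0][0]  # the clip whose embedding points closest to the text's
```

The same comparison works in the other direction (one audio embedding ranked against many text embeddings), which is what lets a joint embedding model link music to descriptions as well as descriptions to music.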
Company
AssemblyAI
Date published
June 27, 2023
Author(s)
Marco Ramponi
Word count
4075
Language
English
Hacker News points
7