How to build a LiveKit app with real-time Speech-to-Text

Company

AssemblyAI

Date Published

Dec. 18, 2024

Author

Ryan O'Connor

Word count

2748

Language

English

Hacker News points

None

URL

www.assemblyai.com/blog/livekit-realtime-speech-to-text

Summary

LiveKit is an open-source platform for building real-time audio and video applications. It abstracts away the low-level details of real-time media, so developers can quickly build and deploy applications such as video conferencing, livestreaming, and interactive virtual events. LiveKit also provides a flexible agents system for adding programmatic agents to an application. This guide shows how to add real-time Speech-to-Text to a LiveKit application using AssemblyAI's new Python LiveKit integration, so that you can transcribe audio streams in real time for backend processing or for display in your application's UI. A real-time Speech-to-Text agent requires three components: a LiveKit Server, a frontend application, and an AI agent that transcribes the audio streams. You'll set up the LiveKit Server by creating a project directory, navigating into it, and creating a .env file to store your application's credentials. Next, you'll set up the frontend using the LiveKit Agents Playground, a web application for testing the LiveKit agents system. You'll then build the agent by defining an entrypoint function, which executes when the agent connects to the room, and inner functions that handle two parallel tasks: sending audio to the STT service and forwarding transcriptions back to the app. Finally, you'll define the agent's main loop, which connects to the LiveKit room and runs the entrypoint function. The script starts the agent by calling LiveKit's cli.run_app with the entrypoint function specified as the agent's entrypoint. With the agent connected to your LiveKit application, you can transcribe audio streams in real time and display the transcripts in your application's UI.
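
The agent structure described in the summary (an entrypoint function whose inner functions run in parallel, one sending audio to the STT service and one forwarding transcriptions back to the app) can be sketched with plain asyncio. This is only an illustrative sketch, not the article's code: it does not use the LiveKit or AssemblyAI libraries, and every name in it (fake_stt, send_audio, forward_transcripts, run_app, the queue-based streams) is a hypothetical stand-in chosen for the example.

```python
import asyncio

# Hypothetical stand-ins only: nothing here is part of the LiveKit or
# AssemblyAI APIs. Queues simulate the audio and transcript streams.
END = None  # sentinel marking the end of a stream


async def fake_stt(audio_in: asyncio.Queue, text_out: asyncio.Queue) -> None:
    """Pretend STT service: turns each audio frame into a transcript string."""
    while (frame := await audio_in.get()) is not END:
        await text_out.put(f"transcript of {frame}")
    await text_out.put(END)  # propagate end-of-stream to the consumer


async def entrypoint(frames: list[str]) -> list[str]:
    """Runs when the (simulated) agent joins a room."""
    audio_q: asyncio.Queue = asyncio.Queue()
    text_q: asyncio.Queue = asyncio.Queue()
    displayed: list[str] = []  # stands in for the application's UI

    async def send_audio() -> None:
        # Inner task 1: push incoming audio frames to the STT service.
        for frame in frames:
            await audio_q.put(frame)
        await audio_q.put(END)

    async def forward_transcripts() -> None:
        # Inner task 2: forward finished transcripts back to the app.
        while (text := await text_q.get()) is not END:
            displayed.append(text)

    # Run both inner tasks and the fake STT service concurrently.
    await asyncio.gather(
        send_audio(),
        fake_stt(audio_q, text_q),
        forward_transcripts(),
    )
    return displayed


def run_app(frames: list[str]) -> list[str]:
    """Hypothetical runner: starts the event loop and runs the entrypoint."""
    return asyncio.run(entrypoint(frames))


if __name__ == "__main__":
    for line in run_app(["frame-1", "frame-2"]):
        print(line)
```

In a real agent, the frame list would presumably be replaced by the room's audio track and the displayed list by whatever mechanism the frontend uses to receive transcripts. The two-task split is what lets audio keep flowing to the STT service while earlier transcripts are still being forwarded.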