Speed is crucial for voice AI interfaces: a voice-to-voice response time of about 500ms is typical of human conversation, and anything longer than 800ms starts to feel unnatural. The key technical factors to optimize for fast voice-to-voice response are network architecture, AI model performance, and voice processing logic. Today's state-of-the-art components include WebRTC for streaming audio from the user's device to the cloud, Deepgram's fast transcription models for speech-to-text, Llama 3 70B or 8B as the LLM, and Deepgram's Aura model for text-to-speech. By self-hosting all three AI models together in the same Cerebrium container, it is possible to achieve median voice-to-voice response times as low as 500ms.
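One way to reason about hitting that 500ms target is as a per-stage latency budget across the pipeline the paragraph describes (network, speech-to-text, LLM, text-to-speech). The sketch below is illustrative only: the stage names mirror the components above, but every timing is an assumed placeholder, not a measurement from the source.

```python
# Illustrative voice-to-voice latency budget for a WebRTC -> STT -> LLM -> TTS
# pipeline. All per-stage timings below are assumptions for the sketch,
# not measured values for any specific deployment.
pipeline_ms = {
    "webrtc uplink (device -> cloud)": 40,
    "speech-to-text (transcription model)": 100,
    "llm time-to-first-token": 200,
    "text-to-speech time-to-first-byte": 80,
    "webrtc downlink (cloud -> device)": 40,
}

total = sum(pipeline_ms.values())
for stage, ms in pipeline_ms.items():
    print(f"{stage:40s} {ms:4d} ms")
print(f"{'total voice-to-voice':40s} {total:4d} ms")
```

Framing latency this way makes the design trade-off visible: colocating the three models in one container mainly attacks the inter-stage network hops, while the LLM's time-to-first-token typically remains the single largest line item in the budget.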