Building real-time voice and video communication on top of Large Language Models (LLMs) faces significant challenges, chief among them latency. Conversational latency can be broken down into two components: mouth-to-ear delay (how long one party's speech takes to reach the other) and turn-taking delay (how long it takes a response to begin). The ideal mouth-to-ear delay is around 208 ms, comparable to human response time in conversation. When users are geographically separated, however, total mouth-to-ear delay grows significantly due to network stack and transit delays, and those added milliseconds are what make a conversational AI experience feel sluggish and unsatisfying. To minimize latency, it's essential to partner with a provider that optimizes both device-level and network-level latencies, and to choose LLM providers with demonstrated low turn-taking delay. Developers who understand how latency affects speech-driven conversational AI applications are better positioned to build experiences that feel natural to users.
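
To make the latency budget concrete, here is a minimal Python sketch that sums a hypothetical mouth-to-ear budget and compares it against the target. The component names and millisecond values are illustrative assumptions for this example, not measurements from the article; only the ~208 ms target comes from the discussion above.

```python
# A minimal sketch of a mouth-to-ear latency budget.
# All per-component values below are illustrative assumptions, not measurements;
# only the ~208 ms target comes from the text above.

TARGET_MS = 208  # ideal mouth-to-ear delay cited above

# Hypothetical per-component delays, in milliseconds.
budget_ms = {
    "device_capture": 20,    # microphone/ADC buffering
    "encode": 5,             # audio codec encoding
    "network_transit": 80,   # sender -> receiver; grows with distance
    "jitter_buffer": 40,     # smoothing out packet arrival variance
    "decode": 5,             # audio codec decoding
    "device_playback": 20,   # DAC/speaker buffering
}

total_ms = sum(budget_ms.values())
print(f"Estimated mouth-to-ear delay: {total_ms} ms (target ~{TARGET_MS} ms)")

if total_ms > TARGET_MS:
    over = total_ms - TARGET_MS
    print(f"Over budget by {over} ms -- optimize device- and network-level latency.")
else:
    print("Within the target budget.")
```

The point of the exercise is that no single component blows the budget; it is the sum of device-level and network-level delays that pushes the total past what users perceive as natural, which is why optimizing both matters.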