Built with AssemblyAI - Real-time Speech-to-Image Generation
At ASU HACKML 2022, students used AssemblyAI's Core Transcription API to build real-time speech-to-image generation, reproducing elements of DALL-E 2's zero-shot capabilities with a simpler model. The project integrates machine learning models and a web interface framework with the AssemblyAI API, enabling corrective language modeling, and was inspired by OpenAI's paper on Zero-Shot Text-to-Image Generation. The build consists of real-time audio transcription via the AssemblyAI API, an HTML/CSS client-side interface, a Node.js and Express server, pretrained models running in parallel with the client and server, and Selenium to pass messages between components. The main takeaways are the impressive improvements in audio transcription tools, the potential for smaller models trained on less data to approach the results of larger ones, and the impact of changing pretraining paradigms. Future directions include incorporating knowledge graphs for semantic-correctness checks, associating natural language with image sequences, and generating videos from the resultant image vectors.
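A minimal sketch of the real-time transcription hop, assuming AssemblyAI's real-time WebSocket endpoint and the ws npm package; the microphone capture and the hand-off to the image model are left abstract, and sendAudioChunk and the ASSEMBLYAI_API_KEY environment variable are illustrative names, not details confirmed by the post:

```javascript
// Sketch: open a real-time transcription session, forward 16 kHz PCM audio
// chunks, and log finalized transcripts (the text the project would pass on
// as prompts to its text-to-image model).
const WebSocket = require("ws");

const socket = new WebSocket(
  "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000",
  { headers: { authorization: process.env.ASSEMBLYAI_API_KEY } } // assumed env var
);

socket.on("open", () => console.log("Real-time transcription session opened"));

socket.on("message", (raw) => {
  const data = JSON.parse(raw);
  // Partial transcripts stream in as the user speaks; FinalTranscript marks a
  // completed utterance, which is the natural unit to use as an image prompt.
  if (data.message_type === "FinalTranscript" && data.text) {
    console.log("Prompt for image generation:", data.text);
  }
});

// Audio is sent as raw 16 kHz single-channel PCM, base64-encoded in JSON frames.
function sendAudioChunk(pcmBuffer) {
  socket.send(JSON.stringify({ audio_data: pcmBuffer.toString("base64") }));
}
```

Forwarding only finalized utterances keeps the downstream image model from re-rendering on every partial hypothesis, which fits the project's transcript-to-model-to-image pipeline.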
Company
AssemblyAI
Date published
April 18, 2022
Author(s)
Kelsey Foster
Word count
481
Language
English
Hacker News points
None found.