HackathonParty

Framing the Problem

At a global tech conference, most of the room is not hearing the talk in their first language. English is the default stage language, yet it is a second language for most developers worldwide. Attendees miss nuance and rarely risk asking questions in a language that isn't their own. Both sides lose: attendees get a degraded version of the content, and organisers can't make a talk accessible to a multilingual room without expensive interpreter booths and headsets.

Idea Explanation

Unison is real-time conference dubbing. A speaker talks in one language, and every attendee hears and reads the talk in their own, live, with no headsets, no app, and no human interpreters. Unison transcribes the speech as it streams, translates it per language, and speaks it back in native neural voices, all in about three seconds. It's also two-way: attendees ask questions in their language and the speaker receives them in theirs.

Implementation

Unison runs as two processes. A Next.js + Kendo UI app serves the speaker, attendee, schedule, and organiser surfaces, with API routes for AI and proxying. A separate WebSocket + HTTP server is the real-time core: the speaker streams microphone audio up, the server passes it through Deepgram Nova-2 (STT), Google Translate (cached per language), and Deepgram Aura-2 (neural TTS over a persistent socket pool), and then fans the dubbed audio and transcript out to each listener.
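The per-language translation cache and fan-out step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Unison's actual code: `Translate`, `CachedTranslator`, `Listener`, and `fanOut` are hypothetical names, and the translator is injected as a plain function so that no specific Google Translate client call is assumed.

```typescript
// Hypothetical translator signature; the real server calls Google Translate here.
type Translate = (text: string, targetLang: string) => Promise<string>;

// Caches one in-flight or finished translation per (language, sentence) pair,
// so a sentence is translated at most once per target language.
class CachedTranslator {
  private cache = new Map<string, Promise<string>>();
  private translate: Translate;

  constructor(translate: Translate) {
    this.translate = translate;
  }

  get(text: string, targetLang: string): Promise<string> {
    const key = `${targetLang}\u0000${text}`;
    let hit = this.cache.get(key);
    if (!hit) {
      hit = this.translate(text, targetLang);
      // Drop failed lookups so a later call can retry.
      hit.catch(() => {
        this.cache.delete(key);
      });
      this.cache.set(key, hit);
    }
    return hit;
  }
}

// Hypothetical listener shape: a language code plus a way to push text to that client.
interface Listener {
  lang: string;
  send: (text: string) => void;
}

// Translates a sentence once per distinct language, then delivers it to every listener.
async function fanOut(
  sentence: string,
  listeners: Listener[],
  translator: CachedTranslator
): Promise<void> {
  const langs = Array.from(new Set(listeners.map((l) => l.lang)));
  const translated = new Map<string, string>();
  await Promise.all(
    langs.map(async (lang) => {
      translated.set(lang, await translator.get(sentence, lang));
    })
  );
  for (const listener of listeners) {
    listener.send(translated.get(listener.lang) ?? sentence);
  }
}
```

Caching the promise rather than the resolved string means that listeners who join mid-sentence share the same in-flight request instead of triggering a duplicate one.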

The frontend talks to Next.js over HTTP, and the API routes proxy to the WS server for stats, transcript, questions, and session CRUD. AI Q&A and summaries run through OpenRouter. Instead of a database, state lives in in-memory maps plus file-backed stores, so organiser stats still reflect real usage after a talk ends.
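One way to combine an in-memory map with a file-backed store is sketched below. The names `Backing` and `PersistedMap` are hypothetical, and the storage layer is injected as an interface; in a Node.js process it could plausibly be implemented with `fs.readFileSync` and `fs.writeFileSync` on a JSON file, but that wiring is an assumption and is left out to keep the sketch self-contained.

```typescript
// Hypothetical storage interface. A file-backed version would read and write
// one JSON file; here it is abstract so the sketch has no dependencies.
interface Backing {
  load(): string | null; // returns null when nothing has been saved yet
  save(json: string): void;
}

// An in-memory map that writes itself through to the backing store on every change.
class PersistedMap<V> {
  private items = new Map<string, V>();
  private backing: Backing;

  constructor(backing: Backing) {
    this.backing = backing;
    const raw = backing.load();
    if (raw !== null) {
      const parsed = JSON.parse(raw) as Record<string, V>;
      for (const key of Object.keys(parsed)) {
        this.items.set(key, parsed[key] as V);
      }
    }
  }

  get(key: string): V | undefined {
    return this.items.get(key);
  }

  set(key: string, value: V): void {
    this.items.set(key, value);
    this.flush();
  }

  delete(key: string): void {
    if (this.items.delete(key)) {
      this.flush();
    }
  }

  // Serialises the whole map; fine for small stores, too slow for large ones.
  private flush(): void {
    const out: Record<string, V> = {};
    this.items.forEach((value, key) => {
      out[key] = value;
    });
    this.backing.save(JSON.stringify(out));
  }
}
```

Reads stay as fast as a plain `Map`, while a restart or a finished talk only loses what was never flushed. Rewriting the whole file on each change is a deliberate simplification that suits hackathon-scale data.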

Unison builds on PolyDub, an earlier dubbing prototype. For this hackathon I added the whole conference layer: the Kendo UI organiser dashboard, the sessions schedule, transcript-grounded AI Q&A, post-talk summaries, the two-way question loop, attendee networking, persisted session/event CRUD, and real ended-session stats.

Challenges

The hardest part was keeping latency low without sacrificing audio quality. Generating speech with a fresh request per sentence was too slow, so the neural TTS runs over a persistent WebSocket pool with one long-lived connection per voice, which cut out repeated handshakes and got the full speech-to-dub pipeline down to about three seconds.
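The pooling idea, one long-lived connection per voice that is opened lazily and reused for every sentence, might look like the sketch below. `VoiceConnection`, `OpenConnection`, and `VoicePool` are hypothetical names, and the connection is abstracted behind an interface rather than a real Deepgram Aura-2 socket, so nothing here should be read as that API's actual shape.

```typescript
// Hypothetical view of a TTS socket: just enough to check liveness and send text.
interface VoiceConnection {
  isOpen(): boolean;
  send(text: string): void;
}

// Hypothetical factory that performs the (expensive) handshake for one voice.
type OpenConnection = (voice: string) => VoiceConnection;

class VoicePool {
  private connections = new Map<string, VoiceConnection>();
  private openConnection: OpenConnection;

  constructor(openConnection: OpenConnection) {
    this.openConnection = openConnection;
  }

  // Sends text on the existing connection for this voice, opening or
  // replacing it only when there is none or the old one has closed.
  speak(voice: string, text: string): void {
    let conn = this.connections.get(voice);
    if (!conn || !conn.isOpen()) {
      conn = this.openConnection(voice);
      this.connections.set(voice, conn);
    }
    conn.send(text);
  }
}
```

With this shape the handshake cost is paid once per voice instead of once per sentence, which is the saving the paragraph above describes. A production pool would also need keep-alives and backoff on reconnect, which are omitted here.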

The other challenge was making the organiser analytics trustworthy. When a host disconnects, the session shouldn't vanish or fall back to fake demo data. Stats therefore persist a per-session aggregate (peak, history, and served-vs-offered languages) that survives the host leaving, and reach is computed as languages actually served over languages offered rather than a hardcoded number. Grounding the AI Q&A was a smaller but real challenge: early answers drifted, so the model is now constrained to answer strictly from the live transcript.
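The per-session aggregate and the reach calculation described above can be sketched as follows. The field and function names (`SessionAggregate`, `recordSample`, `reach`) are hypothetical, and two details are assumptions rather than facts from the project: "peak" is taken to mean peak concurrent listeners, and a language counts as served once it has had at least one listener.

```typescript
// Hypothetical aggregate shape; stored per session so it outlives the host connection.
interface SessionAggregate {
  peakListeners: number;
  history: { at: number; listeners: number }[];
  offeredLanguages: string[];
  servedLanguages: string[];
}

function newAggregate(offeredLanguages: string[]): SessionAggregate {
  return { peakListeners: 0, history: [], offeredLanguages, servedLanguages: [] };
}

// Folds one snapshot of listener counts (per language) into the aggregate.
function recordSample(
  agg: SessionAggregate,
  listenersByLang: Record<string, number>,
  at: number
): void {
  let total = 0;
  for (const lang of Object.keys(listenersByLang)) {
    const count = listenersByLang[lang] ?? 0;
    total += count;
    // Assumption: only offered languages count, and each counts once.
    const offered = agg.offeredLanguages.indexOf(lang) !== -1;
    const alreadyServed = agg.servedLanguages.indexOf(lang) !== -1;
    if (count > 0 && offered && !alreadyServed) {
      agg.servedLanguages.push(lang);
    }
  }
  agg.peakListeners = Math.max(agg.peakListeners, total);
  agg.history.push({ at, listeners: total });
}

// Reach = languages actually served / languages offered, in the range 0..1.
function reach(agg: SessionAggregate): number {
  if (agg.offeredLanguages.length === 0) {
    return 0;
  }
  return agg.servedLanguages.length / agg.offeredLanguages.length;
}
```

Because the aggregate only ever grows from real samples, a host disconnect simply stops new samples from arriving; the peak, history, and reach recorded so far remain valid.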

Accomplishments

I shipped a full, live, two-way conference dubbing experience, wired end to end across STT, translation, neural TTS, AI, and a polished Kendo UI front end. I learned a lot about taming real-time audio, and I'm proud the organiser analytics reflect real usage with zero hardware needed on the attendee side.

Next Steps

Speaker voice cloning, so the dubbed audio keeps the speaker's own timbre across languages.
More languages and dialect coverage, plus per-region voice tuning.
Recording and on-demand playback of dubbed talks, so attendees can revisit any session in their language.
Scale testing for large rooms (hundreds of concurrent listeners per language) and a hosted deployment.
