Built at Gator Hack IV (10/26/2025)
Visual art has always been one of humanity’s most powerful forms of expression — yet for people who are blind or visually impaired, most artworks remain inaccessible.
While text-based content can be read by screen readers, images and paintings are often reduced to minimal alt-text like “a landscape” or “a portrait.” This strips away the emotional, thematic, and visual richness that defines art.
The lack of detailed, human-like image descriptions limits access not only for people who are visually impaired, but also for anyone using assistive technology, for educators teaching art remotely, and for audiences engaging with online exhibitions.
The goal is to make visual art audibly understandable, preserving as much context, structure, and emotion as possible.
Talk Art to Me bridges art and health by making visual experiences accessible to people with visual impairments or neurodiverse conditions. By converting visual art into spoken descriptions using AI, it promotes inclusion, emotional well-being, and cognitive engagement — demonstrating how accessibility technology can directly enhance quality of life.
Using multimodal AI, the app analyzes an image to produce both a holistic description and localized captions for different regions of the artwork. These are then read aloud using built-in text-to-speech, allowing users to explore art through sound.
Users can tap on different areas of an artwork to hear specific details, toggle an interactive grid overlay, or load different paintings to explore. The web app works best on a touchscreen, since tapping directly on the artwork gives more precise control over which region is selected than a mouse pointer.
The idea bridges art and accessibility — giving voice to visual creativity, and enabling those who can’t see to still experience art.
The app is built with Next.js (React + TypeScript) on the frontend and leverages a custom backend API route that connects to Google Gemini’s multimodal vision model.
The frontend handles image loading, grid rendering, and speech synthesis. It first scales the image to normalize its size, then splits it into smaller regions before sending them to the backend.
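As a rough sketch of what that preprocessing can look like (the grid size, maximum dimension, and helper names here are illustrative, not the app's exact values):

```typescript
// Sketch: scale an artwork down and slice it into a GRID_SIZE x GRID_SIZE set of regions.
// MAX_DIM and GRID_SIZE are illustrative values, not the app's actual settings.
const MAX_DIM = 1024;
const GRID_SIZE = 3;

async function preprocessArtwork(img: HTMLImageElement) {
  // Normalize the size so the payload sent to the backend stays small.
  const scale = Math.min(1, MAX_DIM / Math.max(img.naturalWidth, img.naturalHeight));
  const width = Math.round(img.naturalWidth * scale);
  const height = Math.round(img.naturalHeight * scale);

  const canvas = document.createElement("canvas");
  canvas.width = width;
  canvas.height = height;
  canvas.getContext("2d")!.drawImage(img, 0, 0, width, height);

  // Slice the scaled image into grid cells, each encoded as a JPEG data URL.
  const cellW = Math.floor(width / GRID_SIZE);
  const cellH = Math.floor(height / GRID_SIZE);
  const cell = document.createElement("canvas");
  cell.width = cellW;
  cell.height = cellH;
  const cellCtx = cell.getContext("2d")!;

  const regions: { row: number; col: number; dataUrl: string }[] = [];
  for (let row = 0; row < GRID_SIZE; row++) {
    for (let col = 0; col < GRID_SIZE; col++) {
      cellCtx.drawImage(canvas, col * cellW, row * cellH, cellW, cellH, 0, 0, cellW, cellH);
      regions.push({ row, col, dataUrl: cell.toDataURL("image/jpeg", 0.8) });
    }
  }

  return { fullImage: canvas.toDataURL("image/jpeg", 0.8), regions };
}
```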
The backend receives the image and region data, forwards them to Gemini, and returns a structured JSON response containing both a global description and per-region captions.
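A minimal sketch of such a route handler, assuming the @google/generative-ai SDK; the model name, prompt, and response shape are assumptions for illustration, not the app's exact implementation:

```typescript
// app/api/describe/route.ts — sketch of a Next.js route handler that forwards the
// scaled image and region crops to Gemini and returns structured descriptions.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

export async function POST(req: Request) {
  const { fullImage, regions } = await req.json();
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });

  const prompt =
    "Describe this artwork for a blind listener: one overall description, " +
    "then one short caption per numbered region. Respond as JSON with " +
    '{ "global": string, "regions": string[] }.';

  const parts = [
    { text: prompt },
    { inlineData: { mimeType: "image/jpeg", data: stripPrefix(fullImage) } },
    ...regions.map((r: { dataUrl: string }) => ({
      inlineData: { mimeType: "image/jpeg", data: stripPrefix(r.dataUrl) },
    })),
  ];

  const result = await model.generateContent(parts);
  // The model is asked to reply with JSON; in practice the output may need
  // cleanup (e.g. stripping markdown fences) before parsing.
  return Response.json(JSON.parse(result.response.text()));
}

// Data URLs look like "data:image/jpeg;base64,<payload>"; Gemini wants only the payload.
function stripPrefix(dataUrl: string) {
  return dataUrl.split(",")[1];
}
```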
The frontend then displays the results, overlaying a grid that users can interact with. Clicking or tapping on a region triggers text-to-speech for that section’s description.
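The interaction itself boils down to mapping a tap to a grid cell and handing that cell's caption to the browser's speech synthesis. Roughly, with illustrative names and GRID_SIZE matching the preprocessing sketch above:

```typescript
// Sketch: map a click/tap on the rendered artwork to a grid cell and speak its caption.
// `captions` is the per-region list returned by the backend, in row-major order.
function handleRegionTap(target: HTMLElement, e: PointerEvent, captions: string[]) {
  const rect = target.getBoundingClientRect();
  const col = Math.floor(((e.clientX - rect.left) / rect.width) * GRID_SIZE);
  const row = Math.floor(((e.clientY - rect.top) / rect.height) * GRID_SIZE);
  const caption = captions[row * GRID_SIZE + col];
  if (!caption) return;

  // Cancel any narration already in progress so taps feel responsive, then speak.
  window.speechSynthesis.cancel();
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(caption));
}
```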
I also optimized how artworks are laid out on screen, maximizing the visible area so that users get a better view and finer control.
Currently, the app runs fully client-side with on-demand backend requests — future versions will include server-side caching and local uploads for user-submitted artworks.
Browser Audio Restrictions — Many browsers block audio playback until the user interacts with the page. I solved this by detecting the first click or touch event to “unlock” the speech synthesis context before playing the AI-generated narration (see the first sketch after this list).
Responsive Scaling — Ensuring the clickable grid perfectly aligns with the image across devices required dynamic scaling calculations and resize observers (see the second sketch after this list).
Async Timing Between Model & TTS — The AI output sometimes arrived before the browser was ready to play speech, so I used controlled delays and state tracking to synchronize them smoothly (the first sketch below also covers this case).
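A condensed sketch of how the audio unlock and that synchronization can fit together; the real code tracks more state, and the names here are illustrative:

```typescript
// Sketch: unlock speech synthesis on the first user gesture, and hold any narration
// that arrives before the unlock so it can be spoken once audio is allowed.
let speechUnlocked = false;
let pendingText: string | null = null;

function unlockSpeech() {
  if (speechUnlocked) return;
  speechUnlocked = true;
  // Resuming/speaking inside a user gesture satisfies the browser's autoplay policy.
  window.speechSynthesis.resume();
  if (pendingText) {
    speak(pendingText);
    pendingText = null;
  }
}

window.addEventListener("pointerdown", unlockSpeech, { once: true });
window.addEventListener("keydown", unlockSpeech, { once: true });

function speak(text: string) {
  if (!speechUnlocked) {
    // AI output arrived before the user interacted; keep it until we can play it.
    pendingText = text;
    return;
  }
  window.speechSynthesis.cancel();
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}
```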
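And a sketch of the grid-alignment idea using a ResizeObserver, with a hypothetical React hook name:

```typescript
import { useEffect, useState } from "react";

// Sketch: keep the clickable grid overlay sized to the rendered image by re-measuring
// whenever the image element changes size (window resize, device rotation, etc.).
function useImageBox(imgRef: { current: HTMLImageElement | null }) {
  const [box, setBox] = useState({ width: 0, height: 0 });

  useEffect(() => {
    const img = imgRef.current;
    if (!img) return;
    const observer = new ResizeObserver(([entry]) => {
      setBox({ width: entry.contentRect.width, height: entry.contentRect.height });
    });
    observer.observe(img);
    return () => observer.disconnect();
  }, [imgRef]);

  // The returned box drives the overlay's width/height and each cell's size,
  // so grid cells stay aligned with the pixels they describe.
  return box;
}
```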
Because I'm using the browser's built-in TTS, I have only tested on the browsers and devices I have (Chrome on Windows and Safari on iPhone). The TTS might not work on other machines or browsers.
I worked alone because I couldn't find any teammates. With only 48 hours, my tasks included coming up with an idea, selecting the technical stack and designing the technical framework, implementing it, making the video, and writing the documentation. Time management was very difficult, and I didn't achieve all the functionality I was intending to, but I will continue working on the project.
Successfully built a fully interactive, accessible AI art narrator with real-time text-to-speech.
Learned how to combine multimodal AI, frontend state management, and accessibility design principles in one project.
Created a responsive interface that dynamically adjusts to any artwork and device.
Produced meaningful output that demonstrates how AI can assist inclusivity in digital media and art.
Gained experience in time management and project planning.
My next steps focus on improving both accessibility and scalability:
Accessibility Enhancements: Add keyboard navigation, ARIA roles, and user-adjustable speaking speed and language options.
Performance: Implement server-side caching for faster responses and reduced API calls.
User Features: Allow local artwork uploads for personal exploration and support more AI models for descriptive variety.
Compatibility: Test on more browsers and machines to ensure compatibility.
Stability: Refine the frontend-backend connection for smoother refreshing and image loading.
Backend: Improve performance to enable splitting the image into finer grids for more detail.
Expansion: Adapt Talk Art to Me into a browser plug-in or API service that can bring AI-generated art narration to digital galleries, educational platforms, and museum websites.
Ultimately, the goal is to make Talk Art to Me a tool that empowers more people — whether they’re visually impaired, students, or simply art lovers — to experience creativity through sound.