10/26/2025
A Project Made By
Hudson Shields
Engineer
Ryan Cannings
Engineer
William Munro
Engineer
Patrick King
Engineer
Joseph Egan
Engineer
Built At
Gator Hack IV
Visually impaired individuals often miss out on understanding visual content. While existing tools attempt to solve this problem, their descriptions are generic and fail to convey emotion and mood. As a result, visually impaired individuals miss out on a wide range of experiences that could change their lives and their connection to the world.
Our idea is an AI-powered captioning system designed specifically for visually impaired audiences. It provides rich, meaningful image descriptions that go beyond identifying the objects in the frame; instead, it conveys how the image could make someone feel. The goal is to translate the visual experience of processing an image into emotional and contextual understanding for someone who can't see it.
The frontend is built using React, where users are guided to upload an image. Once an image is uploaded, our Flask backend generates a description of the image using OpenAI's API. The description is then converted to speech using Google Text-to-Speech (gTTS) and sent back to the browser. The entire pipeline runs in memory for each request.
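A minimal sketch of what that request path could look like, assuming a single Flask endpoint; the route name, model choice, and prompt wording here are illustrative, not taken verbatim from our code:

```python
import base64
import io

from flask import Flask, request, send_file
from gtts import gTTS
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment


@app.route("/describe", methods=["POST"])
def describe():
    # Read the uploaded image entirely in memory; nothing touches disk.
    image_bytes = request.files["image"].read()
    b64 = base64.b64encode(image_bytes).decode("utf-8")

    # Ask the model for an emotionally rich description, not just object labels.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the write-up doesn't name one
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image for a blind listener, "
                         "conveying mood and emotion, not just objects."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    description = response.choices[0].message.content

    # Convert the description to speech in memory and stream it back.
    audio = io.BytesIO()
    gTTS(description).write_to_fp(audio)
    audio.seek(0)
    return send_file(audio, mimetype="audio/mpeg")
```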
One of our biggest struggles was deciding what kind of model to use to process images and turn them into descriptive text. We originally worked with a PyTorch model trained on a dataset of 10,000 images with corresponding descriptions. After training, however, we found that the model struggled to capture the complexity and mood present in the images. So we switched to calling OpenAI's API rather than running a PyTorch model locally, which solved the problem of generating text that faithfully represented the nuance of the uploaded images. At the same time, this change forced us to completely restart the work we had already done. Deciding to start over was difficult, but we understood it would make our project better.
We built a complete pipeline that takes an uploaded image, processes it through a Flask backend that calls OpenAI to create a vivid image description, and returns audio of that description to the user. Our design integrates speech prompts and simple controls so that visually impaired users can navigate the app. We learned how to use audio to engage users and keep the interface simple to navigate without visual access.
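As a rough end-to-end check, a script like the following could exercise that pipeline; the URL, endpoint, and filenames are placeholders under the assumptions in the sketch above:

```python
import requests

# Hypothetical smoke test: upload an image, receive spoken audio
# of its description from the backend.
with open("sample.jpg", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/describe",  # assumed local dev address
        files={"image": f},
    )
resp.raise_for_status()

# Save the returned MP3 so it can be played with any audio player.
with open("description.mp3", "wb") as out:
    out.write(resp.content)
```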
The next step for our project is to add a database, probably using PostgreSQL, so that users can store images and captions for future playback. Another addition we would like to make is support for multiple languages. Right now, the project operates only in English, with no way to change it, which limits its reach to people who already understand English.
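If we go the PostgreSQL route, the storage layer could start as a single table. Here is a sketch using SQLAlchemy; the table, column names, and connection string are hypothetical, not part of the current project:

```python
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, LargeBinary, Text, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Capture(Base):
    """One stored image with its generated caption, for later playback."""
    __tablename__ = "captures"

    id = Column(Integer, primary_key=True)
    image = Column(LargeBinary, nullable=False)   # raw uploaded bytes
    caption = Column(Text, nullable=False)        # generated description
    created_at = Column(DateTime,
                        default=lambda: datetime.now(timezone.utc))


# Hypothetical connection string; any PostgreSQL instance would do.
engine = create_engine("postgresql://localhost/captions")
Base.metadata.create_all(engine)
```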