A Digital Picture Frame that Listens to your Conversations and Illustrates Them

I had this idea: hook a digital picture frame up to a microphone, listen to everything said in the room around it, and use an AI art generator to create images based on what it overheard. I liked watching AI art get produced for previous projects like the generated Magic cards, and I wanted another project with that same fun element.

The basic idea is this:

  • A microphone will record audio, and a speech-to-text tool will transcribe it
  • An LLM will take that speech and suggest an artistic description of an image inspired by it
  • An AI art tool will convert that text into an image
  • The overall picture will be updated to incorporate the new image
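
Here's a rough sketch of that loop in Python. The names are illustrative rather than the actual code, and each helper gets fleshed out in the sections below:

```python
import time

def listen_for_speech() -> str:
    return ""  # record a chunk of audio and transcribe it (see "Audio")

def make_image_prompt(transcript: str) -> str:
    return ""  # ask an LLM for an artistic description (see "Making the Art")

def update_canvas(prompt: str) -> None:
    pass  # inpaint one region of the big image (see "Updating the Image")

def run_frame() -> None:
    transcript = ""
    while True:
        transcript += " " + listen_for_speech()
        update_canvas(make_image_prompt(transcript))
        time.sleep(60)  # refresh roughly once a minute
```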

It ends up producing an image that changes over time, like this:

This post explains how it all works. The code's all on GitHub if you want to mess with it or follow along.

Audio

To my surprise, this turned out to be the hardest part of the project. I used OpenAI's Whisper API because I'd heard good things and it was convenient, but it turned out to be the wrong choice. It expects audio files rather than streaming audio, so the code saves a new audio file every 5 seconds and sends it to the API. Chopping the audio like this splits up words at chunk boundaries and creates transcription artifacts, and the quality generally isn't great. It also means the Python program has to use multithreading to record audio continuously. If I were to keep working on this project, this is the first thing I'd rewrite.
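
For the curious, the chunked-recording approach looks roughly like this, assuming the sounddevice, soundfile, and openai packages (a simplified sketch, not the project's actual code):

```python
import queue
import threading

import sounddevice as sd
import soundfile as sf
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
SAMPLE_RATE = 16_000
CHUNK_SECONDS = 5
chunks: queue.Queue = queue.Queue()

def record_loop() -> None:
    """Record fixed-length chunks forever and hand them to the transcriber."""
    while True:
        audio = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                       samplerate=SAMPLE_RATE, channels=1)
        sd.wait()  # block until this chunk is full, then immediately loop
        chunks.put(audio)

def transcribe_loop() -> None:
    """Write each chunk to a WAV file and send it to the Whisper API."""
    while True:
        sf.write("chunk.wav", chunks.get(), SAMPLE_RATE)
        with open("chunk.wav", "rb") as f:
            text = client.audio.transcriptions.create(
                model="whisper-1", file=f).text
        print(text)  # in the real app this feeds the subtitle and the LLM

threading.Thread(target=record_loop, daemon=True).start()
transcribe_loop()
```

The word-splitting artifacts come from the gap between one chunk ending and the next beginning; a streaming API would avoid the hard boundaries entirely.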

I display the text on screen like a subtitle; it looks something like this:

I added a little “Recording” message in the top corner to let people know, because I care about privacy.

Making the Art

I wanted the generated images to be artistic and interesting while still taking inspiration from what people were saying, so I use an LLM to create the prompt that gets sent to the AI image generator.

You can see the full prompt here, but basically it asks the LLM to fill out a sort of Mad Libs-style template:

I’m making art based on a conversation that I’m having. Here’s a loose transcription of the conversation so far:

{audio_transcript}

I want you to think about the imagery suggested by this conversation, and then fill out this form:

“A picture of [concrete adjective] [concrete noun] with [lighting]. The art is [concrete or realistic style] and [creative technique], in the style of [artist].”

message from the programmer to the LLM, 2024
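
The shape of that call is simple. Here's a simplified sketch using the openai package (the helper name is made up; the full prompt lives in the repo):

```python
from openai import OpenAI

client = OpenAI()

TEMPLATE = """I'm making art based on a conversation that I'm having. \
Here's a loose transcription of the conversation so far:

{audio_transcript}

I want you to think about the imagery suggested by this conversation, \
and then fill out this form:

"A picture of [concrete adjective] [concrete noun] with [lighting]. \
The art is [concrete or realistic style] and [creative technique], \
in the style of [artist].\""""

def make_image_prompt(audio_transcript: str) -> str:
    """Turn a raw conversation transcript into an art-directed prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": TEMPLATE.format(audio_transcript=audio_transcript)}],
    )
    return response.choices[0].message.content
```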

I tried generating art without this step, just sending the recorded text directly to the art generator, but it turns out that DALL-E has trouble making artistic sense of random snippets of conversation, so the results look pretty bad. One side effect of the extra step is that GPT-4 is a real prude that practices self-censorship, so it's not possible to get risqué images such as nudity through it.

I found that the art was often conceptually disorganized and thematically jumbled, and that it helped to tell the LLM to keep all of the images in the same style. Here, all of the images except the initial one mimic the style of Rebecca Guay:

Updating the Image

My idea for this project was that the image in the picture frame would update every minute or so to include new art based on the conversation it overheard. I didn't want to fully overwrite the image, just add a bit of new content. To do that, the program maintains one large, high-resolution image and keeps track of how long ago each section of the image was last edited. It then chooses the oldest part of the image and commissions new content for it.
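
The bookkeeping for that is simple. Here's roughly the idea, sketched with made-up names and a coarse grid over the canvas:

```python
import numpy as np

GRID_ROWS, GRID_COLS = 5, 8  # carve the canvas into a coarse grid
last_edited = np.zeros((GRID_ROWS, GRID_COLS))  # step when each cell last changed
step = 0

def pick_stalest_cell() -> tuple[int, int]:
    """Return the (row, col) of the cell that was edited longest ago."""
    row, col = np.unravel_index(np.argmin(last_edited), last_edited.shape)
    return int(row), int(col)

def mark_edited(row: int, col: int) -> None:
    """Record that this cell was just repainted."""
    global step
    step += 1
    last_edited[row, col] = step
```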

To pick the exact area, the program generates a random polygon in the region it wants to edit and inpaints into that polygon. I had originally wanted to use image segmentation to split the image up intelligently, but I didn't get to that step; I also experimented with Voronoi cells, but in the end, random polygons worked well. To generate one, the program samples a bunch of random points within a circle and takes their convex hull.
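
That polygon generation looks roughly like this, assuming numpy and scipy (again, a sketch rather than the exact code):

```python
import numpy as np
from scipy.spatial import ConvexHull

def random_polygon(cx: float, cy: float, radius: float,
                   n_points: int = 12) -> np.ndarray:
    """Return the ordered vertices of a random convex polygon."""
    # Sample uniformly inside the circle: the sqrt on the radius keeps
    # the points uniform over area rather than clustered at the center.
    angles = np.random.uniform(0, 2 * np.pi, n_points)
    radii = radius * np.sqrt(np.random.uniform(0, 1, n_points))
    points = np.column_stack([cx + radii * np.cos(angles),
                              cy + radii * np.sin(angles)])
    hull = ConvexHull(points)
    return points[hull.vertices]  # hull vertices in counterclockwise order
```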

I wanted to inpaint into an existing image, so I went with Stability.ai, which offers an API call for inpainting: I can send it the image so far, and it will draw into just one part of that image. The full image is large and rectangular, but Stability's model wants a square image at a fixed resolution, so the program zooms in on the area around the polygon being edited.
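
The call itself is a plain HTTP request. Something like the sketch below should work against Stability's inpaint endpoint (this follows their REST docs at the time of writing, and the project may use a different API version, so double-check the parameters before relying on it):

```python
import os
import requests

def inpaint(image_path: str, mask_path: str, prompt: str) -> bytes:
    """Send a square crop plus a mask to Stability; get back edited PNG bytes."""
    response = requests.post(
        "https://api.stability.ai/v2beta/stable-image/edit/inpaint",
        headers={
            "authorization": f"Bearer {os.environ['STABILITY_API_KEY']}",
            "accept": "image/*",
        },
        files={
            "image": open(image_path, "rb"),  # the zoomed-in square crop
            "mask": open(mask_path, "rb"),    # white where the polygon is
        },
        data={"prompt": prompt, "output_format": "png"},
    )
    response.raise_for_status()
    return response.content
```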

Here, you can see a series of images showing how the inpainting works. First, the large image is cropped down to a smaller square section (image 2). Then a random polygon is generated to select the part of the image to be replaced (image 3). Finally, Stability's API is called with the image, the mask, and a prompt like “weird alien face”; it sends back an edited version of the cropped image with the new content, which is then copied onto the original image.
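
The crop-and-paste bookkeeping around that call looks roughly like this, using Pillow and the inpaint helper sketched above (EDIT_SIZE and the file names are made up for illustration):

```python
import io

from PIL import Image

EDIT_SIZE = 1024  # the square resolution the inpainting model expects

def edit_region(canvas: Image.Image, box: tuple[int, int, int, int],
                mask: Image.Image, prompt: str) -> None:
    """Crop a square around `box`, inpaint into the masked polygon,
    and paste the result back onto the big canvas."""
    crop = canvas.crop(box).resize((EDIT_SIZE, EDIT_SIZE))
    crop.save("crop.png")
    mask.resize((EDIT_SIZE, EDIT_SIZE)).save("mask.png")
    edited = Image.open(io.BytesIO(inpaint("crop.png", "mask.png", prompt)))
    # Scale the edited square back to the crop's original size and paste it in.
    w, h = box[2] - box[0], box[3] - box[1]
    canvas.paste(edited.resize((w, h)), (box[0], box[1]))
```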

Future Work

I’m happy with how this project turned out. It’s fun to put it on in the background while I’m talking to a friend and see what shows up on screen.

The biggest improvement it needs is better audio transcription, probably through a streaming API. A local transcription service could work instead, but I wanted to keep everything lightweight enough to live inside a digital picture frame. I'd like to set this project up on a small single-board computer like a Raspberry Pi so that I can keep it going all the time on my wall, maybe driving something like The Frame. For now, it runs on an external monitor acting as a picture frame, driven by an old laptop.

There are some improvements that would make the resulting image more coherent. To start with, the image could be segmented properly rather than with random polygons, perhaps using something like Meta's Segment Anything. After that, the prompt sent to the AI artist could include more direction about what is where: instead of an image depicting a dragon, it could be an image of “a dragon looking at the castle to its left, in the style of the forest to its right”. Similarly, if the conversation is about a topic related to art already on the canvas, it might make sense to colocate those images.

But overall, it’s good art and I’m happy I did the project!

If you'd like to run the code, I'd recommend starting with dev_make_gif.py, which does the iterative image generation without messing around with the microphone. Or you can run app.py, which does the full thing, although it's kind of janky. Please reach out if you want to run it but run into issues.

