Can you see the forest for the trees?
You might need to squint.
A few weeks ago, a viral post on Reddit caught my attention. It was about embedding text in plain sight within AI-generated images using Stable Diffusion and ControlNet, creating an ingenious blend of art and technology. Here’s one example from that post:
If you’re like me, you’re probably a bit confused right now: where’s the text? I couldn’t see it myself until I looked at a thumbnail preview of the same image:
It’s hard to believe those are the same picture! Try resizing (or squinting) to prove it to yourself.
(Mobile readers: you’re probably REALLY confused right now because you see the text in both images, which are naturally small on your device. Zoom in or come back from a laptop to see the full effect.)
Essentially, the text is hidden as a low-frequency component of the image. Zooming out or even squinting can hide the high-frequency details that prevent your brain from resolving it.
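If you want to convince yourself of this, downscaling an image is a crude low-pass filter: it throws away the high-frequency detail and leaves the hidden message behind. A minimal sketch with Pillow (the filename here is just a placeholder for one of the generated images):

```python
from PIL import Image

# Shrinking the image discards high-frequency detail, acting as a rough
# low-pass filter, so the low-frequency text becomes legible in the thumbnail.
img = Image.open("hidden_text.png")  # placeholder filename
thumb = img.resize((img.width // 8, img.height // 8), Image.LANCZOS)
thumb.save("hidden_text_thumbnail.png")
```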
I thought this was so clever and fascinating that I wanted to try it myself. I only found one tutorial online (unfortunately, it has since disappeared), and it relied on the popular Automatic1111 UI. I wanted a way to generate these images programmatically, so I began writing a script to do so. This gave me a greater opportunity to learn how they worked, as well as to exercise more control over the output.
I tweeted some of my early results and got such a warm response that I decided to publish my code and write this piece.
What’s going on here?
Stable Diffusion (SD) is an open-source text-to-image model, meaning it translates a description of a scene into a picture of that scene. This technology is incredibly powerful, but it can be difficult to control. For example, I can use it to generate infinite variations of “a house,” but getting it to produce one that looks the same as the house I grew up in would defy even the most precise natural language prompt.
ControlNet (CN) is a technique that can precisely guide the output of a SD model. There are a variety of ControlNets optimized for different types of images, but broadly speaking it would let me show the SD model a picture of my house and have it generate an image that appeared to be of the same house, but under different lighting, weather, or completely fanciful conditions.
Recently, a CN was developed that excelled at producing natural-looking images that are actually valid QR codes, like this one (try scanning it):
It works by using the black-and-white QR code pattern to inform the areas of high and low contrast in the final image, with SD filling in the details in an aesthetic way.
It turns out we can use this same technique to embed text instead of QR codes. The effect can be extremely subtle or painfully obvious, depending on the settings used. The most fascinating thing to me is how the text completely disappears when the image is large and reappears when it’s small (or when you stand back, or squint) so that you only perceive the low-frequency components.
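Concretely, the whole thing fits in a short script with Hugging Face’s diffusers library. The sketch below is illustrative rather than my exact code: the ControlNet checkpoint, base model, font, prompt, and scales are all stand-ins you’d want to swap and tune.

```python
import torch
from PIL import Image, ImageDraw, ImageFont
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Render the hidden message as a high-contrast black-on-white control image.
control = Image.new("RGB", (768, 768), "white")
draw = ImageDraw.Draw(control)
# Font path is an assumption; point this at any bold font on your system.
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 220)
draw.text((384, 384), "HI", font=font, fill="black", anchor="mm")

# Load a ControlNet trained on QR-code-style conditioning (checkpoint name is
# one example; any similar contrast-conditioning ControlNet should work).
controlnet = ControlNetModel.from_pretrained(
    "monster-labs/control_v1p_sd15_qrcode_monster", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a dense forest in autumn, detailed, soft light",
    image=control,
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=1.3,  # higher = more legible text, less natural scene
).images[0]
image.save("forest_with_hidden_text.png")
```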
Here are a few examples I generated:
Try it yourself
When I started generating these images, I quickly ran into a problem. My M1 MacBook Pro takes minutes to generate a single image, which is just too slow. I’m a fan of Modal for rapid access to serverless compute, so I used it to offload computation to a remote A10G GPU, which brought generation time down to seconds.
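The core of the Modal setup is just a decorated function that requests a GPU. Here’s a rough sketch against Modal’s current API (names are illustrative, and the published script may differ; the pipeline body is the same idea as the diffusers sketch above):

```python
import modal

app = modal.App("hidden-text")

# The container needs the generation stack installed; exact pins are up to you.
image = modal.Image.debian_slim().pip_install(
    "torch", "diffusers", "transformers", "accelerate"
)

@app.function(gpu="A10G", image=image)
def generate(prompt: str, message: str) -> bytes:
    # Build the black-and-white control image from `message` and run the
    # ControlNet pipeline here, then return the resulting PNG bytes.
    ...

@app.local_entrypoint()
def main():
    png = generate.remote("a dense forest in autumn", "HI")
```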
I’ve published the script that I used to generate many of these images. Its only requirement is a Modal account.
The process isn't without its challenges. Fine-tuning the visibility of the text in the images is intricate and depends on various factors, including guidance scales, image size, and the prompt. Some scenes are more amenable to this technique than others, and naturally high-contrast scenes lend themselves particularly well.
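In practice, the knob that matters most is the ControlNet conditioning scale, and I found it useful to sweep it to find the sweet spot between invisible and garish. Continuing the earlier sketch (same hypothetical pipe and control image):

```python
# Low scales let the scene dominate (text nearly invisible); high scales force
# the text through at the cost of a natural-looking image.
for scale in (0.8, 1.0, 1.2, 1.4, 1.6):
    out = pipe(
        prompt="a dense forest in autumn, detailed, soft light",
        image=control,
        num_inference_steps=30,
        guidance_scale=7.5,
        controlnet_conditioning_scale=scale,
    ).images[0]
    out.save(f"forest_scale_{scale}.png")
```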
There's a lot to be explored in the interplay of text and images in this way. The melding of high and low-frequency components to create, hide, and reveal meaning opens up exciting possibilities, from art to cryptography to novel forms of communication. There is a frontier here that's ripe for exploration, and I’m thrilled to get to dive into it! I hope you’ll give it a shot, too.