So around two weeks back, I was experimenting with different prompts against different models to try to squeeze some of the promise of multimodality out of the drawing game. It was a little terrifying.
For some quick context, I've been working with two versions of the game: one where the user's lines are composited onto an image that the model modifies directly by returning an edited .png, and one where the model does something closer to drawing with a pen by passing back path commands.
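For a rough idea of what one turn of the path-command variant looks like, here's a minimal sketch against the google-genai Python SDK. The prompt wording, file names, and the exact command format are stand-ins, not the game's actual code, and it assumes the current canvas gets rasterized and sent along with the prompt.

```python
# Rough sketch of one turn of the path-command variant (not the game's real
# code). Assumes the google-genai SDK and the current canvas rasterized to
# canvas.png; the prompt and command format are illustrative.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("canvas.png", "rb") as f:
    canvas_png = f.read()

PROMPT = (
    "You are playing a collaborative drawing game. Add to the sketch by "
    "replying with SVG-style path commands (M x y, C x1 y1 x2 y2 x y), "
    "one command per line, plus a sentence explaining your reasoning."
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[types.Part.from_bytes(data=canvas_png, mime_type="image/png"), PROMPT],
)

# The model's "pen" is just text: any line of the reply that parses as a path
# command gets drawn onto the canvas before the next turn.
path_commands = [
    line.strip()
    for line in response.text.splitlines()
    if line.strip().startswith(("M", "C"))
]
print(path_commands)
```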
gemini-2.5-flash is absolutely terrible at drawing curves with path commands. It only ever adds a little squiggle here or there which barely corresponds to its reasoning. I tried out gpt-5, and it worked okay!
So at this point I'm thinking two things: either gemini-2.5-flash just needs a little push to take bigger swings, or else I need to lean into the multimodality and ask it to plan its changes by modifying the image directly before attempting to recreate those modifications with path commands.
Beat the other player at their own game
I tried giving gemini the push first with a quick addition to the prompt:
Don't just add one curve or two. Go absolutely crazy and err heavily on the side of "too much." You're going to beat the other player at their own game.
And it responded in the most childlike way it could. Scribbling all over the page.
Just incredible stuff.
Multimodality???
Maybe some planning is a good idea. I swapped out gemini-2.5-flash (which can only return text) for the image-generating gemini-2.5-flash-image-preview and updated the prompt:
PLANNING:
- Before writing any path commands, plan your addition by modifying the rasterized image with these rules, but don't send it back:
  - Only use 2px black strokes against the white background
  - Draw with a single line, think "don't lift the pen"
  - Don't change the size of the image
- After planning, use as many curves as necessary to approximate your changes
But during the test run, I noticed that the response would occasionally come back with an image part anyway. I decided to render these images to see if the model was really planning as directed, and at first it seemed pretty promising!
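Here's roughly what that setup looks like as code, with the image-generating model swapped in, the PLANNING block tacked onto the prompt, and any stray image parts dumped to disk so they can be rendered. Again, file names and prompt wording are stand-ins, not the game's actual code.

```python
# Sketch of the planning variant (not the game's real code): same setup as
# before, but with the image-generating model, the PLANNING block appended to
# the prompt, and any inline image parts saved for inspection.
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("canvas.png", "rb") as f:
    canvas_png = f.read()

# planning_rules.txt holds the PLANNING block quoted above.
with open("planning_rules.txt") as f:
    planning_rules = f.read()

prompt = (
    "You are playing a collaborative drawing game. Add to the sketch by "
    "replying with SVG-style path commands (M x y, C x1 y1 x2 y2 x y), "
    "one command per line.\n\n" + planning_rules
)

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents=[types.Part.from_bytes(data=canvas_png, mime_type="image/png"), prompt],
)

# The prompt says not to send the planning image back, but image parts show up
# anyway sometimes -- save them to disk so they can be rendered and inspected.
path_text = ""
for i, part in enumerate(response.candidates[0].content.parts):
    if part.text is not None:
        path_text += part.text
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save(f"plan_part_{i}.png")

print(path_text)  # only the text part feeds back into the game
```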
I mean, the image part doesn't correspond to the path in the text part whatsoever, but it could be related to planning—possibly by approximately replicating the geometry of the image? Pretty promising! I kept going for a couple more turns without receiving another image part, and then...
Is that a basement? Lit with a flashlight? Why is the subject a blank wall? Why is the wall so scratched up? Why does it look so compressed? I have no answers, but I do feel threatened.
I had to keep going, but none of the other responses were quite as interesting, much less terrifying. It did send this nice clay-ified version of the sketch at one point.
And then "fixed" the sketch later on as gemini loves to do.
That seemed like as good a place as any to call it. Clearly, this didn't make gemini any better at drawing with the pen. A total failure in that sense, but I am a big fan of horror, so in some ways a clear success!