
Three AI Models Walk Into an Unsolvable Puzzle

Get curious. Stay grounded. Keep testing.

Tools tested: Claude Opus 4.5, Google Gemini 3, GPT 5.2
Cost: Subscription tiers across platforms

My second grader came home with a logic puzzle worksheet. You know the type: a grid where you match people to attributes using process of elimination. This one had four kids (Mia, Zoey, Sam, and Leo) who each had an ornament color and a type of cookie they baked.

After some frustration at the kitchen table, I took a closer look at the worksheet and realized there wasn't enough information to solve it. Something was missing.

I decided to test whether AI could confirm my suspicion. I uploaded the same image to three different models and asked: "Is this missing a clue?"

The Setup

The puzzle had four clues:

  1. The student with the green ornament baked sugar cookies
  2. The student with the gold ornament baked chocolate cookies
  3. Mia has the green ornament
  4. Zoey did not bake chocolate cookies and does not have the red ornament

Simple enough for second grade, right?
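If you want to confirm the dead end yourself, a quick brute-force pass does it. Here's a minimal Python sketch (my own, not output from any of the models). Since the worksheet only names sugar and chocolate cookies, it assigns ornaments only, and it folds the cookie half of clue 4 into an ornament constraint via clue 2: the gold student baked chocolate, so Zoey can't have gold.

    from itertools import permutations

    kids = ["Mia", "Zoey", "Sam", "Leo"]
    ornaments = ["green", "gold", "red", "blue"]

    def valid(assignment):
        orn = dict(zip(kids, assignment))
        if orn["Mia"] != "green":      # Clue 3: Mia has the green ornament
            return False
        if orn["Zoey"] == "red":       # Clue 4: Zoey doesn't have red...
            return False
        if orn["Zoey"] == "gold":      # ...and didn't bake chocolate, so not gold (clues 2 + 4)
            return False
        return True

    # Try every way of handing out the four ornaments and keep the valid ones
    solutions = [dict(zip(kids, p)) for p in permutations(ornaments) if valid(p)]
    for s in solutions:
        print(s)

Two assignments survive: Sam and Leo simply swap red and gold. That's exactly the ambiguity all three models flagged.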

How Each Model Approached It

Claude Opus 4.5 started conversationally, working through what we could determine step by step. It identified what was certain (Mia has green, Mia baked sugar cookies, Zoey must have blue by elimination) and then clearly stated the problem: "There's no clue that distinguishes Sam from Leo."

It wrapped up by asking if this was a worksheet from school and suggesting a clue might have been cut off at the bottom of the page. That's exactly what I suspected.

Gemini 3 took a more structured, almost academic approach. It walked through the logic in organized sections: "What We Can Solve," "The Missing Information," and "Conclusion." It even offered to create a hypothetical fifth clue so my second grader could finish the puzzle for fun.

GPT 5.2 landed somewhere in the middle. It used checkmarks and clear headers to separate what could be determined from what couldn't. Like the others, it correctly identified that Sam and Leo were indistinguishable with the given clues, and offered to show both possible solution grids.

Different Approaches Meet Different Needs

All three models reached the same correct conclusion: the puzzle is missing information. None of them guessed or made up an answer. That's actually reassuring when you're using AI with kids.

But the differences in approach reveal something about how these models communicate:

Opus 4.5 felt like a tutor. It talked through the problem the way you might explain it to a child, building understanding step by step. It also picked up on the real-world context (school worksheet, possible copying error) without me mentioning it.

Gemini 3 felt like a textbook. The formal sections and systematic breakdown were great for someone who wants comprehensive documentation of the reasoning process. Probably overwhelming for a second grader looking at it over your shoulder.

GPT 5.2 felt like a middle ground. Clear organization with visual markers, but less formal than Gemini. The offer to show both possible solutions was practical and kid-friendly.

What This Means for Homework Help

If you're using AI to help kids with schoolwork, the model you choose might matter more than you think. Not because one got the answer right and others didn't (they all nailed it), but because of how they explain their thinking.

For my second grader, the Opus response was the easiest to follow. We could read it together and understand why the puzzle couldn't be solved. The Gemini response would have been great for me to understand the full picture, but the formal structure wasn't what a frustrated kid needed in the moment.

The Takeaway

  • All three models correctly identified an unsolvable puzzle. This is a win for AI reliability on logical reasoning tasks.
  • Presentation style varies significantly. Same correct answer, very different user experience.
  • Context awareness differs. Opus was the only one that guessed this might be a photocopying error from school, picking up on cues I hadn't explicitly provided.
  • "Helpful" looks different for different users. Gemini's offer to create a hypothetical clue was thoughtful. Opus asking about the worksheet's origin was practical. GPT's offer to show both solutions was hands-on. All valid approaches, depending on what you need.

Oh, and the missing clue? After I emailed the teacher, she confirmed there should have been a fifth clue: "Sam has the red ornament." Mystery solved. Puzzle completed. Second grader satisfied.
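(For the curious: dropping that fifth clue into the earlier sketch collapses the two surviving assignments to one.)

    # Reusing kids, ornaments, and valid() from the earlier sketch,
    # plus the teacher's fifth clue: Sam has the red ornament.
    unique = [dict(zip(kids, p)) for p in permutations(ornaments)
              if valid(p) and dict(zip(kids, p))["Sam"] == "red"]
    print(unique)   # exactly one assignment: Mia green, Zoey blue, Sam red, Leo gold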