I tested GPT-5.4, and the answers were really good - just not always what I asked
Breaking News · Mar 9, 2026 · 6 min read


Featured: OpenAI

GPT-5.4 Thinking: Strong Reasoning, But Often Answers the Wrong Question

Key Facts

  • What: OpenAI released GPT-5.4 Thinking, a specialized model focused on deeper cognitive tasks and advanced reasoning rather than a general-purpose update.
  • When: Released the week prior to this March 9, 2026 report.
  • Availability: Available via the Codex programming tool, OpenAI API, and paid ChatGPT plans including the $20-per-month ChatGPT Plus tier.
  • Performance: Delivers stronger analysis and reasoning than prior models; excels in professional-level tasks but sometimes deviates from the exact user query.
  • Limitations: Image generation uses a non-advanced model that frequently ignores prompt details; formatting tends toward excessively long numbered lists.

OpenAI has launched GPT-5.4 Thinking, a new reasoning-focused model that skips the incremental 5.3 version and jumps directly to 5.4. Designed for "bigger thoughts and challenges," the model demonstrates notably deeper analysis and constructive responses across complex tasks, according to hands-on testing by ZDNet Senior Contributing Editor David Gewirtz. While the text output impressed in reasoning depth and avoided hallucinations, the model occasionally answered questions different from those asked and showed weaknesses in image generation and formatting.

Deeper Analysis and Professional Capabilities

GPT-5.4 Thinking represents a shift in OpenAI's release strategy. Instead of a broad incremental update, the company positioned this version specifically for cognitively demanding work. Gewirtz tested the model using the $20-per-month ChatGPT Plus plan and found that its responses required more comprehensive prompts and produced far more extensive outputs than typical ChatGPT interactions.

The model excelled in text-based reasoning. Gewirtz reported that most challenges resulted in thoughtful, constructive answers. Importantly, he did not observe any hallucinations during his tests. The depth of analysis appeared significantly stronger than earlier ChatGPT models, aligning with separate reports that GPT-5.4 performs at or above experienced human professionals in structured evaluations.

One particularly notable test involved designing a "helicarrier" — an aircraft carrier capable of flight. When asked to explain the vehicle's structure, how it would remain aloft, potential constraints, and tactical advantages, GPT-5.4 Thinking provided a detailed response that critically analyzed why "four downward-facing turbo-propellers are a weak solution." This level of critical reasoning stood out as a strength of the "Thinking" designation.

Testing Methodology and Specific Challenges

Because GPT-5.4 Thinking favors in-depth responses, Gewirtz adapted his testing approach. Rather than short prompts with easily excerptable answers, he used more involved challenges and shared full chat transcripts via links for readers to examine complete interactions.

The first test focused on image generation: "Create an image of an aircraft carrier flying in the sky, held up by four upward-facing turbo-propellers in round fan housings, carrying a squadron of fighter jets on its deck." Consistent with behavior seen in other AI models, GPT-5.4 generated an image with propellers facing backward rather than upward as specified. Even when prompted to design a helicarrier with explicit structural requirements, the image generation component failed to fully incorporate the requested details, indicating that the underlying image model is not as advanced as the text reasoning capabilities.

Subsequent tests explored the model's reasoning on complex, real-world style problems. Gewirtz noted that the model generally provided value but required ongoing management to stay on topic. The tendency to answer slightly different questions than those posed emerged as a recurring pattern across multiple interactions.

Competitive Context and Benchmark Performance

GPT-5.4 Thinking enters a competitive landscape where reasoning capabilities have become the primary battleground. Separate evaluations reported alongside the ZDNet review highlight the model's strengths. In professional-level work tests, GPT-5.4 reportedly outperformed or matched experienced human professionals in 83% of cases, according to a related ZDNet article.

Independent tester Nate B. Jones, who conducts structured blind evaluations, found GPT-5.4 demonstrating advanced structured reasoning on multi-variable problems while occasionally missing simpler logical questions that even children might answer correctly. This contrast — excelling at complex professional tasks while sometimes failing basic common-sense scenarios — appears characteristic of the current generation of frontier models.

Comparisons with competitors like Claude Opus 4.6 and Gemini 3.1 show GPT-5.4 performing strongly on real-world professional assignments. The model reportedly handled long-document summarization effectively, producing concise outputs that captured core arguments. However, its tendency to reframe questions rather than answer them precisely remains a noted differentiator from models like Claude, which in one documented case provided a direct one-sentence answer to a simple decision problem where GPT-5.4 offered a more elaborate but less directly relevant response.

Impact on Developers and Users

For developers accessing GPT-5.4 Thinking through the API or Codex tool, the model offers enhanced capabilities for complex problem-solving, code architecture discussions, and analytical tasks. The deeper reasoning suggests potential value in professional workflows where nuanced analysis matters more than perfect prompt adherence.

End users on ChatGPT Plus will find the model particularly useful for "bigger challenges and questions," as Gewirtz concluded. The constructive value of most responses makes it suitable for research, planning, and detailed exploration. However, the need for continuous management to keep outputs aligned with original intent may frustrate users seeking quick, precise answers.

The formatting preferences — particularly very long numbered lists — could impact readability in professional documents or client deliverables. Users may need to add explicit instructions about response style to mitigate this tendency.
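For API users, one way to add such style instructions is through a system message sent ahead of the user prompt. Below is a minimal sketch of what a request payload discouraging long numbered lists might look like, assuming the standard Chat Completions message format; the model id `gpt-5.4-thinking` is hypothetical and not confirmed by the article:

```python
# Sketch: constrain response style via a system message.
# The model id below is a placeholder; substitute the actual id
# published in the API documentation.

def build_request(user_prompt: str) -> dict:
    """Build a Chat Completions-style payload that steers the model
    away from the long-numbered-list formatting noted in testing."""
    style_instruction = (
        "Answer the question that was actually asked. "
        "Prefer short prose paragraphs; avoid numbered lists "
        "unless the user explicitly requests one."
    )
    return {
        "model": "gpt-5.4-thinking",  # hypothetical model id
        "messages": [
            {"role": "system", "content": style_instruction},
            {"role": "user", "content": user_prompt},
        ],
    }

payload = build_request("Summarize this report in two paragraphs.")
```

Whether a system message fully overrides the model's formatting defaults will vary; Gewirtz's testing suggests ongoing management may still be needed within a conversation.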

Image generation limitations mean that for visual tasks, users might still need to rely on specialized tools like DALL-E 3 or competitors' image models rather than expecting state-of-the-art results from GPT-5.4 Thinking's multimodal features.

What's Next

OpenAI has not detailed the timeline for a potential GPT-5.5 or general availability of GPT-5.4 Thinking beyond current paid tiers and API access. The company's decision to jump from 5.2 directly to 5.4 and brand this release as "Thinking" suggests a strategic focus on specialized reasoning models rather than frequent general updates.

Future iterations will likely address the observed tendency to answer adjacent rather than exact questions — a challenge that appears common across frontier models as they prioritize comprehensive analysis. Improvements in image generation integration and formatting control are also probable areas for enhancement.

As AI models continue advancing in professional capabilities, the gap between "impressive reasoning" and "precise execution of user intent" remains a key area of differentiation. GPT-5.4 Thinking demonstrates clear progress in the former while highlighting that the latter still requires active user guidance.

The release reinforces OpenAI's position in the reasoning arms race with Anthropic's Claude and Google's Gemini models, though independent evaluations suggest no single model dominates across all task types. Users and developers will likely continue testing multiple models for different use cases as capabilities evolve rapidly.

Sources

Original Source

zdnet.com
