I tested GPT-5.4, and the answers were really good - just not always what I asked
Vibe Coding Guide · Mar 9, 2026 · 7 min read


Featured: OpenAI

Vibe Coding Guide: Building a "Stay on Target" Prompt Wrapper for GPT-5.4 Thinking

Why this matters for builders
GPT-5.4 Thinking lets you tackle complex, multi-step professional tasks with deeper reasoning and structured analysis than previous models, but it frequently answers questions you didn’t ask and produces overly long numbered lists or off-track responses. This guide gives you a reliable, code-first process to wrap the model so you get the high-quality reasoning while forcing it to stay exactly on the prompt you gave it.

The release (March 2026) jumps straight to 5.4 and ships a dedicated “Thinking” variant optimized for bigger cognitive challenges. It is available today in the Codex programming tool, via the OpenAI API, and for all paid ChatGPT plans ($20/mo ChatGPT Plus and higher). Benchmarks show it beats experienced human professionals on pro-level work 83% of the time, yet real-world tests reveal a persistent “answer drift” problem that breaks many production workflows.

When to use it

  • You are building internal tools, research agents, or code generators that must produce exactly the requested output format.
  • You need reliable multi-step reasoning (business case analysis, system design, long-document summarization) but cannot tolerate hallucinations or scope creep.
  • You want to ship a customer-facing feature where predictability and formatting matter.
  • You are already using GPT-4o or Claude and want a drop-in upgrade path that adds guardrails instead of switching models entirely.

The full process

1. Define the goal

Write a one-sentence success metric before you touch any code.

Example goal:
“Build a reusable TypeScript wrapper that sends any user request to GPT-5.4 Thinking, forces the model to answer only the exact question asked, and returns clean Markdown with strict section headings and a maximum of 3 bullet levels.”

Success looks like: zero off-topic paragraphs, consistent formatting across 50 test cases, and <2% hallucination rate on factual queries.
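The success criteria above can be encoded so a test run passes or fails automatically. A minimal sketch; the `TestResult` shape is an assumption, not part of the wrapper's API, and the thresholds mirror the goal statement:

```typescript
// Sketch: encode the success criteria as a programmatic check.
// The TestResult shape is an assumption introduced for illustration.
interface TestResult {
  offTopicParagraphs: number; // paragraphs unrelated to the exact question
  hallucinated: boolean;      // factual query answered with an invented fact
}

export function meetsGoal(results: TestResult[]): boolean {
  if (results.length === 0) return false;
  const offTopic = results.reduce((n, r) => n + r.offTopicParagraphs, 0);
  const hallucinationRate =
    results.filter(r => r.hallucinated).length / results.length;
  // Goal: zero off-topic paragraphs and <2% hallucination rate.
  return offTopic === 0 && hallucinationRate < 0.02;
}
```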

2. Shape the spec/prompt

Create a system prompt that acts as a permanent “target lock.” This is the single most important artifact.

You are GPT-5.4-Thinking-TargetLock, an extremely precise reasoning engine.
Rules (never break them):
1. Answer ONLY the exact question or task the user gave you. Do not add related advice, alternatives, or “while you’re here” thoughts.
2. Never output more than 3 levels of bullets or numbered lists.
3. Use this exact Markdown structure:
   # Main Title (one line)
   ## Section 1
   Content...
   ## Section 2
   ...
4. If the request is for an image or diagram, describe it in text only and say “Image generation not supported in this wrapper.”
5. End every response with exactly this line: "---\nResponse strictly followed the target question."

Store this as systemTargetLock.md in your repo. Every new feature starts by editing this file.

3. Scaffold the wrapper

Use a minimal Node.js + TypeScript project (or Python + Pydantic if you prefer).

mkdir gpt54-targetlock
cd gpt54-targetlock
npm init -y
npm install openai zod dotenv
tsc --init

Create src/wrapper.ts:

import "dotenv/config";          // load OPENAI_API_KEY from .env
import OpenAI from "openai";
import { z } from "zod";
import fs from "fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Reserved for structured-output validation of responses (not yet wired in).
const TargetResponseSchema = z.object({
  content: z.string(),
  followedExactly: z.boolean(),
});

export async function askThinkingLocked(userPrompt: string): Promise<string> {
  const systemPrompt = fs.readFileSync("./systemTargetLock.md", "utf8");

  const completion = await openai.chat.completions.create({
    model: "gpt-5.4-thinking",   // exact model name per OpenAI docs
    temperature: 0.3,
    max_tokens: 4096,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
  });

  const raw = completion.choices[0].message.content || "";
  
  // Light post-processing guard
  const cleaned = raw
    .replace(/^\s*\d+\.\s+/gm, "- ")               // flatten long numbered lists into bullets
    .replace(/^(?!#|##|---).+$/gm, m => m.trim()); // trim stray whitespace on body lines

  const followed = cleaned.includes("Response strictly followed the target question");
  if (!followed) {
    console.warn("⚠️ Target-lock footer missing; response may have drifted off the question.");
  }

  return cleaned;
}

4. Implement the core loop

Build a CLI and a simple web playground so you can iterate fast.

CLI example (src/cli.ts)

import { askThinkingLocked } from "./wrapper";

async function main() {
  const question = process.argv.slice(2).join(" ");
  if (!question) {
    console.error("Usage: tsx cli.ts \"your exact question here\"");
    process.exit(1);
  }

  console.log("🔒 Sending to GPT-5.4 Thinking with target lock...");
  const start = Date.now();
  const answer = await askThinkingLocked(question);
  const duration = Date.now() - start;

  console.log(answer);
  console.log(`\n⏱️  ${duration}ms | Model: gpt-5.4-thinking`);
}

main();

Add a tiny Express playground if you want instant browser testing.
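The playground's request handler can be kept pure so it is testable without a live server. A sketch; the route shape and JSON body format are assumptions, and `ask` would be `askThinkingLocked` from the wrapper, injected here so the handler has no hard dependency:

```typescript
// Sketch of the playground's POST /ask handler, kept framework-agnostic.
// Pass askThinkingLocked as the second argument in production.
export async function handleAsk(
  body: string,
  ask: (q: string) => Promise<string>,
): Promise<{ status: number; json: Record<string, string> }> {
  let question: string | undefined;
  try {
    question = JSON.parse(body).question;
  } catch {
    return { status: 400, json: { error: "body must be JSON" } };
  }
  if (!question) return { status: 400, json: { error: "missing 'question' field" } };
  return { status: 200, json: { answer: await ask(question) } };
}
```

Wiring it into Express is then one route: read the raw body, call `handleAsk(body, askThinkingLocked)`, and send the returned status and JSON.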

5. Validate ruthlessly

Run the exact tests from the ZDNet article plus your own regression suite.

Create tests/regression.ts:

import { askThinkingLocked } from "../src/wrapper";

const testCases = [
  {
    name: "Aircraft carrier physics",
    prompt: "Explain why four downward-facing turbo-propellers are a weak solution for keeping an aircraft carrier aloft.",
    mustContain: ["aerodynamics", "structural", "energy"],
    mustNotContain: ["design a new vehicle", "tactical advantages"] // must stay on exact question
  },
  {
    name: "Simple decision drift",
    prompt: "I need to wash my car. The carwash is 100 meters away. Should I walk or drive?",
    mustContain: ["walk", "drive", "100 meters"],
    mustNotContain: ["climate change", "exercise benefits"] // classic drift example
  },
  // add 20+ more from your domain
];

async function runRegression() {
  for (const t of testCases) {
    const answer = await askThinkingLocked(t.prompt);
    const passed = t.mustContain.every(w => answer.includes(w)) &&
                   !t.mustNotContain.some(w => answer.includes(w));
    console.log(`${passed ? "✅" : "❌"} ${t.name}`);
  }
}

runRegression();

Run this suite after every model update. Aim for >95% exact-match rate.
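The >95% target can be turned into a CI gate by computing the rate from per-case results. A sketch; collecting the booleans from `runRegression` is assumed, not shown above:

```typescript
// Sketch: compute the exact-match rate from per-case pass/fail booleans
// and produce an exit code so CI fails when the rate drops below target.
export function passRate(results: boolean[]): number {
  if (results.length === 0) return 0;
  return results.filter(Boolean).length / results.length;
}

export function ciGate(results: boolean[], threshold = 0.95): number {
  const rate = passRate(results);
  console.log(`Exact-match rate: ${(rate * 100).toFixed(1)}%`);
  return rate >= threshold ? 0 : 1; // use as process.exitCode
}
```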

6. Ship it safely

  • Add rate-limit and cost guards (track tokens, set monthly budget).
  • Version the system prompt in Git so you can rollback when OpenAI changes behavior.
  • Expose a /strict endpoint in your product that always uses the wrapper; keep a /creative endpoint for when you want the full Thinking power.
  • Document the exact model name and temperature you settled on so the next developer doesn’t undo your guardrails.
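The cost guard from the first bullet can be sketched as a small tracker. The per-token price and budget below are placeholders; check OpenAI's current pricing before relying on them:

```typescript
// Sketch of a monthly spend guard. Prices are placeholder assumptions,
// not OpenAI's actual rates.
export class BudgetGuard {
  private spentUsd = 0;

  constructor(
    private monthlyBudgetUsd: number,
    private usdPerOutputToken: number, // assumption: flat output-token price
  ) {}

  record(outputTokens: number): void {
    this.spentUsd += outputTokens * this.usdPerOutputToken;
  }

  allow(): boolean {
    return this.spentUsd < this.monthlyBudgetUsd;
  }

  get spent(): number {
    return this.spentUsd;
  }
}
```

After each request, call `guard.record(completion.usage?.completion_tokens ?? 0)` and check `guard.allow()` before sending the next one.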

Copy-paste starter templates

Improved system prompt (v2)
Add this paragraph if you still see occasional drift:

Before answering, write a one-sentence internal thought: "Exact question: [repeat user question verbatim]". 
Then answer ONLY that question. Delete the internal thought before outputting.
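Models occasionally leak the internal thought into the final output despite the instruction to delete it. A small post-processing stripper, sketched here to match the exact phrasing of the v2 prompt, catches those leaks:

```typescript
// Sketch: remove any leaked "Exact question: ..." internal-thought lines
// that the v2 prompt tells the model to delete before outputting.
export function stripInternalThought(answer: string): string {
  return answer
    .split("\n")
    .filter(line => !/^\s*"?Exact question:/.test(line))
    .join("\n")
    .trimStart();
}
```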

Python version snippet

from openai import OpenAI
client = OpenAI()

def ask_locked(prompt: str) -> str:
    with open("systemTargetLock.md") as f:
        system = f.read()
    resp = client.chat.completions.create(
        model="gpt-5.4-thinking",
        temperature=0.2,
        messages=[{"role": "system", "content": system}, {"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

Pitfalls and guardrails

### What if the model still drifts even with the system prompt?
Lower temperature to 0.2 and add the “internal thought” trick above. If that fails, switch to a two-step chain: first call a cheaper model (gpt-4o-mini) to rewrite the user question into an ultra-explicit version, then send that to GPT-5.4 Thinking.
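The two-step chain can be sketched with injected model callers so the control flow is testable. The prompt wording and function names here are assumptions; in the real wrapper the two callers would hit gpt-4o-mini and GPT-5.4 Thinking respectively:

```typescript
// Sketch of the two-step chain: a cheap model rewrites the question into an
// ultra-explicit version, then the strict wrapper answers it.
type ModelCall = (prompt: string) => Promise<string>;

export async function twoStepAsk(
  userQuestion: string,
  rewriteWithCheapModel: ModelCall, // e.g. a gpt-4o-mini call
  askStrictModel: ModelCall,        // e.g. askThinkingLocked
): Promise<string> {
  const explicit = await rewriteWithCheapModel(
    "Rewrite this question so it is maximally explicit and self-contained. " +
      "Do not answer it. Question: " + userQuestion,
  );
  return askStrictModel(explicit);
}
```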

### What if formatting is still ugly (long numbered lists)?
Post-process with a small regex or use a follow-up call to “rewrite the previous answer using only headings and 2-level bullets.” This adds ~15% cost but guarantees clean output.
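The small regex mentioned above might look like this. A sketch that clamps bullet nesting to two levels, assuming two-space indentation per level:

```typescript
// Sketch: re-indent any bullet deeper than the allowed level back to the
// deepest permitted level. Assumes two spaces of indentation per level.
export function clampBulletDepth(md: string, maxLevels = 2): string {
  const maxIndent = (maxLevels - 1) * 2;
  return md.replace(/^(\s*)([-*] )/gm, (_m, indent: string, bullet: string) =>
    " ".repeat(Math.min(indent.length, maxIndent)) + bullet,
  );
}
```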

### What if image generation is required?
The Thinking model uses a weaker image engine. Route image requests to DALL·E 3 or Flux via a separate flag and return a placeholder card that says “Image generated by dedicated model.”

### What if API pricing changes?
GPT-5.4 Thinking is currently only on paid plans. Monitor your token usage; complex reasoning prompts can easily exceed 15k output tokens. Implement a hard cap and fallback to gpt-4o when budget is exceeded.
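The hard cap and fallback can be sketched as a per-request model selector. The fallback model name mirrors the text; the cost estimate is an assumption you would derive from prompt length and pricing:

```typescript
// Sketch: choose the model per request based on remaining budget,
// falling back to the cheaper model when the cap would be exceeded.
export function pickModel(
  remainingBudgetUsd: number,
  estimatedCostUsd: number,
): { model: string; capped: boolean } {
  if (estimatedCostUsd > remainingBudgetUsd) {
    return { model: "gpt-4o", capped: true }; // fallback per the text above
  }
  return { model: "gpt-5.4-thinking", capped: false };
}
```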

### What if OpenAI deprecates the exact model name?
Keep the model string in a config file. The wrapper pattern lets you swap gpt-5.4-thinking for gpt-5.5-thinking or even Claude 4 in one line.
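A minimal config module makes that one-line swap concrete. The file name, env variable, and shape are assumptions; the wrapper would import `config.model` instead of hard-coding the string:

```typescript
// Sketch: src/config.ts -- single source of truth for model and temperature,
// overridable via an environment variable (name is an assumption).
export interface WrapperConfig {
  model: string;
  temperature: number;
}

export const config: WrapperConfig = {
  model: process.env.TARGETLOCK_MODEL ?? "gpt-5.4-thinking",
  temperature: 0.3,
};
```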

What to do next

  1. Ship the wrapper as an internal package (@yourorg/gpt54-lock).
  2. Instrument 3 production features with it this week.
  3. Run the regression suite against the next OpenAI release the same day it drops.
  4. A/B test “locked” vs “raw” mode with real users and measure task completion rate.
  5. Write a short internal style guide: “All new agents must use TargetLock unless creativity is the primary goal.”

This pattern turns GPT-5.4 Thinking from an impressive but occasionally disobedient researcher into a dependable engineering teammate.

Original source: zdnet.com