I tested GPT-5.4, and the answers were really good - just not always what I asked
Vibe Coding Guide · Mar 9, 2026 · 7 min read


Featured: OpenAI

Vibe Coding Guide: Building a "Stay on Target" Prompt Wrapper for GPT-5.4 Thinking

Why this matters for builders
GPT-5.4 Thinking lets you tackle complex, multi-step professional tasks with deeper reasoning and structured analysis than previous models, but it frequently answers questions you didn’t ask and produces overly long numbered lists or off-track responses. This guide gives you a reliable, code-first process to wrap the model so you get the high-quality reasoning while forcing it to stay exactly on the prompt you gave it.

The release (March 2026) jumps straight to 5.4 and ships a dedicated “Thinking” variant optimized for bigger cognitive challenges. It is available today in the Codex programming tool, via the OpenAI API, and for all paid ChatGPT plans ($20/mo ChatGPT Plus and higher). Benchmarks show it beats experienced human professionals on pro-level work 83% of the time, yet real-world tests reveal a persistent “answer drift” problem that breaks many production workflows.

When to use it

  • You are building internal tools, research agents, or code generators that must produce exactly the requested output format.
  • You need reliable multi-step reasoning (business case analysis, system design, long-document summarization) but cannot tolerate hallucinations or scope creep.
  • You want to ship a customer-facing feature where predictability and formatting matter.
  • You are already using GPT-4o or Claude and want a drop-in upgrade path that adds guardrails instead of switching models entirely.

The full process

1. Define the goal

Write a one-sentence success metric before you touch any code.

Example goal:
“Build a reusable TypeScript wrapper that sends any user request to GPT-5.4 Thinking, forces the model to answer only the exact question asked, and returns clean Markdown with strict section headings and a maximum of 3 bullet levels.”

Success looks like: zero off-topic paragraphs, consistent formatting across 50 test cases, and <2% hallucination rate on factual queries.
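The success criteria above can be encoded so a test run passes or fails automatically. A minimal sketch; the `TestResult` shape is an assumption, not part of the wrapper's API, and the thresholds mirror the goal statement:

```typescript
// Sketch: encode the success criteria as a programmatic check.
// The TestResult shape is an assumption introduced for illustration.
interface TestResult {
  offTopicParagraphs: number; // paragraphs unrelated to the exact question
  hallucinated: boolean;      // factual query answered with an invented fact
}

export function meetsGoal(results: TestResult[]): boolean {
  if (results.length === 0) return false;
  const offTopic = results.reduce((n, r) => n + r.offTopicParagraphs, 0);
  const hallucinationRate =
    results.filter(r => r.hallucinated).length / results.length;
  // Goal: zero off-topic paragraphs and <2% hallucination rate.
  return offTopic === 0 && hallucinationRate < 0.02;
}
```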

2. Shape the spec/prompt

Create a system prompt that acts as a permanent “target lock.” This is the single most important artifact.

You are GPT-5.4-Thinking-TargetLock, an extremely precise reasoning engine.
Rules (never break them):
1. Answer ONLY the exact question or task the user gave you. Do not add related advice, alternatives, or “while you’re here” thoughts.
2. Never output more than 3 levels of bullets or numbered lists.
3. Use this exact Markdown structure:
   # Main Title (one line)
   ## Section 1
   Content...
   ## Section 2
   ...
4. If the request is for an image or diagram, describe it in text only and say “Image generation not supported in this wrapper.”
5. End every response with exactly this line: "---\nResponse strictly followed the target question."

Store this as systemTargetLock.md in your repo. Every new feature starts by editing this file.

3. Scaffold the wrapper

Use a minimal Node.js + TypeScript project (or Python + Pydantic if you prefer).

mkdir gpt54-targetlock
cd gpt54-targetlock
npm init -y
npm install openai zod dotenv
tsc --init

Create src/wrapper.ts:

import "dotenv/config";          // load OPENAI_API_KEY from .env
import OpenAI from "openai";
import { z } from "zod";
import fs from "fs";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Reserved for structured-output validation of responses (not yet wired in).
const TargetResponseSchema = z.object({
  content: z.string(),
  followedExactly: z.boolean(),
});

export async function askThinkingLocked(userPrompt: string): Promise<string> {
  const systemPrompt = fs.readFileSync("./systemTargetLock.md", "utf8");

  const completion = await openai.chat.completions.create({
    model: "gpt-5.4-thinking",   // exact model name per OpenAI docs
    temperature: 0.3,
    max_tokens: 4096,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
  });

  const raw = completion.choices[0].message.content || "";
  
  // Light post-processing guard
  const cleaned = raw
    .replace(/^\s*\d+\.\s+/gm, "- ")               // flatten long numbered lists into bullets
    .replace(/^(?!#|##|---).+$/gm, m => m.trim()); // trim stray whitespace on body lines

  const followed = cleaned.includes("Response strictly followed the target question");
  if (!followed) {
    console.warn("⚠️ Target-lock footer missing; response may have drifted off the question.");
  }

  return cleaned;
}

4. Implement the core loop

Build a CLI and a simple web playground so you can iterate fast.

CLI example (src/cli.ts)

import { askThinkingLocked } from "./wrapper";

async function main() {
  const question = process.argv.slice(2).join(" ");
  if (!question) {
    console.error("Usage: tsx cli.ts \"your exact question here\"");
    process.exit(1);
  }

  console.log("🔒 Sending to GPT-5.4 Thinking with target lock...");
  const start = Date.now();
  const answer = await askThinkingLocked(question);
  const duration = Date.now() - start;

  console.log(answer);
  console.log(`\n⏱️  ${duration}ms | Model: gpt-5.4-thinking`);
}

main();

Add a tiny Express playground if you want instant browser testing.
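The playground's request handler can be kept pure so it is testable without a live server. A sketch; the route shape and JSON body format are assumptions, and `ask` would be `askThinkingLocked` from the wrapper, injected here so the handler has no hard dependency:

```typescript
// Sketch of the playground's POST /ask handler, kept framework-agnostic.
// Pass askThinkingLocked as the second argument in production.
export async function handleAsk(
  body: string,
  ask: (q: string) => Promise<string>,
): Promise<{ status: number; json: Record<string, string> }> {
  let question: string | undefined;
  try {
    question = JSON.parse(body).question;
  } catch {
    return { status: 400, json: { error: "body must be JSON" } };
  }
  if (!question) return { status: 400, json: { error: "missing 'question' field" } };
  return { status: 200, json: { answer: await ask(question) } };
}
```

Wiring it into Express is then one route: read the raw body, call `handleAsk(body, askThinkingLocked)`, and send the returned status and JSON.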

5. Validate ruthlessly

Run the exact tests from the ZDNet article plus your own regression suite.

Create tests/regression.ts:

import { askThinkingLocked } from "../src/wrapper";

const testCases = [
  {
    name: "Aircraft carrier physics",
    prompt: "Explain why four downward-facing turbo-propellers are a weak solution for keeping an aircraft carrier aloft.",
    mustContain: ["aerodynamics", "structural", "energy"],
    mustNotContain: ["design a new vehicle", "tactical advantages"] // must stay on exact question
  },
  {
    name: "Simple decision drift",
    prompt: "I need to wash my car. The carwash is 100 meters away. Should I walk or drive?",
    mustContain: ["walk", "drive", "100 meters"],
    mustNotContain: ["climate change", "exercise benefits"] // classic drift example
  },
  // add 20+ more from your domain
];

async function runRegression() {
  for (const t of testCases) {
    const answer = await askThinkingLocked(t.prompt);
    const passed = t.mustContain.every(w => answer.includes(w)) &&
                   !t.mustNotContain.some(w => answer.includes(w));
    console.log(`${passed ? "✅" : "❌"} ${t.name}`);
  }
}

runRegression();

Run this suite after every model update. Aim for >95% exact-match rate.
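The >95% target can be turned into a CI gate by computing the rate from per-case results. A sketch; collecting the booleans from `runRegression` is assumed, not shown above:

```typescript
// Sketch: compute the exact-match rate from per-case pass/fail booleans
// and produce an exit code so CI fails when the rate drops below target.
export function passRate(results: boolean[]): number {
  if (results.length === 0) return 0;
  return results.filter(Boolean).length / results.length;
}

export function ciGate(results: boolean[], threshold = 0.95): number {
  const rate = passRate(results);
  console.log(`Exact-match rate: ${(rate * 100).toFixed(1)}%`);
  return rate >= threshold ? 0 : 1; // use as process.exitCode
}
```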

6. Ship it safely

  • Add rate-limit and cost guards (track tokens, set monthly budget).
  • Version the system prompt in Git so you can rollback when OpenAI changes behavior.
  • Expose a /strict endpoint in your product that always uses the wrapper; keep a /creative endpoint for when you want the full Thinking power.
  • Document the exact model name and temperature you settled on so the next developer doesn’t undo your guardrails.
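The cost guard from the first bullet can be sketched as a small tracker. The per-token price and budget below are placeholders; check OpenAI's current pricing before relying on them:

```typescript
// Sketch of a monthly spend guard. Prices are placeholder assumptions,
// not OpenAI's actual rates.
export class BudgetGuard {
  private spentUsd = 0;

  constructor(
    private monthlyBudgetUsd: number,
    private usdPerOutputToken: number, // assumption: flat output-token price
  ) {}

  record(outputTokens: number): void {
    this.spentUsd += outputTokens * this.usdPerOutputToken;
  }

  allow(): boolean {
    return this.spentUsd < this.monthlyBudgetUsd;
  }

  get spent(): number {
    return this.spentUsd;
  }
}
```

After each request, call `guard.record(completion.usage?.completion_tokens ?? 0)` and check `guard.allow()` before sending the next one.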

Copy-paste starter templates

Improved system prompt (v2)
Add this paragraph if you still see occasional drift:

Before answering, write a one-sentence internal thought: "Exact question: [repeat user question verbatim]". 
Then answer ONLY that question. Delete the internal thought before outputting.
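Models occasionally leak the internal thought into the final output despite the instruction to delete it. A small post-processing stripper, sketched here to match the exact phrasing of the v2 prompt, catches those leaks:

```typescript
// Sketch: remove any leaked "Exact question: ..." internal-thought lines
// that the v2 prompt tells the model to delete before outputting.
export function stripInternalThought(answer: string): string {
  return answer
    .split("\n")
    .filter(line => !/^\s*"?Exact question:/.test(line))
    .join("\n")
    .trimStart();
}
```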

Python version snippet

from openai import OpenAI
client = OpenAI()

def ask_locked(prompt: str) -> str:
    with open("systemTargetLock.md") as f:
        system = f.read()
    resp = client.chat.completions.create(
        model="gpt-5.4-thinking",
        temperature=0.2,
        messages=[{"role": "system", "content": system}, {"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content or ""

Pitfalls and guardrails

### What if the model still drifts even with the system prompt?
Lower temperature to 0.2 and add the “internal thought” trick above. If that fails, switch to a two-step chain: first call a cheaper model (gpt-4o-mini) to rewrite the user question into an ultra-explicit version, then send that to GPT-5.4 Thinking.
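The two-step chain can be sketched with injected model callers so the control flow is testable. The prompt wording and function names here are assumptions; in the real wrapper the two callers would hit gpt-4o-mini and GPT-5.4 Thinking respectively:

```typescript
// Sketch of the two-step chain: a cheap model rewrites the question into an
// ultra-explicit version, then the strict wrapper answers it.
type ModelCall = (prompt: string) => Promise<string>;

export async function twoStepAsk(
  userQuestion: string,
  rewriteWithCheapModel: ModelCall, // e.g. a gpt-4o-mini call
  askStrictModel: ModelCall,        // e.g. askThinkingLocked
): Promise<string> {
  const explicit = await rewriteWithCheapModel(
    "Rewrite this question so it is maximally explicit and self-contained. " +
      "Do not answer it. Question: " + userQuestion,
  );
  return askStrictModel(explicit);
}
```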

### What if formatting is still ugly (long numbered lists)?
Post-process with a small regex or use a follow-up call to “rewrite the previous answer using only headings and 2-level bullets.” This adds ~15% cost but guarantees clean output.
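The small regex mentioned above might look like this. A sketch that clamps bullet nesting to two levels, assuming two-space indentation per level:

```typescript
// Sketch: re-indent any bullet deeper than the allowed level back to the
// deepest permitted level. Assumes two spaces of indentation per level.
export function clampBulletDepth(md: string, maxLevels = 2): string {
  const maxIndent = (maxLevels - 1) * 2;
  return md.replace(/^(\s*)([-*] )/gm, (_m, indent: string, bullet: string) =>
    " ".repeat(Math.min(indent.length, maxIndent)) + bullet,
  );
}
```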

### What if image generation is required?
The Thinking model uses a weaker image engine. Route image requests to DALL·E 3 or Flux via a separate flag and return a placeholder card that says “Image generated by dedicated model.”

### What if API pricing changes?
GPT-5.4 Thinking is currently only on paid plans. Monitor your token usage; complex reasoning prompts can easily exceed 15k output tokens. Implement a hard cap and fallback to gpt-4o when budget is exceeded.
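The hard cap and fallback can be sketched as a per-request model selector. The fallback model name mirrors the text; the cost estimate is an assumption you would derive from prompt length and pricing:

```typescript
// Sketch: choose the model per request based on remaining budget,
// falling back to the cheaper model when the cap would be exceeded.
export function pickModel(
  remainingBudgetUsd: number,
  estimatedCostUsd: number,
): { model: string; capped: boolean } {
  if (estimatedCostUsd > remainingBudgetUsd) {
    return { model: "gpt-4o", capped: true }; // fallback per the text above
  }
  return { model: "gpt-5.4-thinking", capped: false };
}
```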

### What if OpenAI deprecates the exact model name?
Keep the model string in a config file. The wrapper pattern lets you swap gpt-5.4-thinking for gpt-5.5-thinking or even Claude 4 in one line.
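A minimal config module makes that one-line swap concrete. The file name, env variable, and shape are assumptions; the wrapper would import `config.model` instead of hard-coding the string:

```typescript
// Sketch: src/config.ts -- single source of truth for model and temperature,
// overridable via an environment variable (name is an assumption).
export interface WrapperConfig {
  model: string;
  temperature: number;
}

export const config: WrapperConfig = {
  model: process.env.TARGETLOCK_MODEL ?? "gpt-5.4-thinking",
  temperature: 0.3,
};
```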

What to do next

  1. Ship the wrapper as an internal package (@yourorg/gpt54-lock).
  2. Instrument 3 production features with it this week.
  3. Run the regression suite against the next OpenAI release the same day it drops.
  4. A/B test “locked” vs “raw” mode with real users and measure task completion rate.
  5. Write a short internal style guide: “All new agents must use TargetLock unless creativity is the primary goal.”

This pattern turns GPT-5.4 Thinking from an impressive but occasionally disobedient researcher into a dependable engineering teammate.

Original source: zdnet.com