Most AI agent failures are not model problems. They are harness problems — unclear prompts, missing workspace context, poorly described tools, no benchmarks, no feedback loop. Our early benchmark work found a repeatable failure mode at 66.7%. A diagnostic pointed to a likely harness weakness. Now we are improving OpenClaw itself and verifying against unchanged PinchBench.
Before you fine-tune a model, how much agent performance can you unlock through better harness engineering?
Don't fine-tune until you've proven the harness is not the bottleneck.
Most conversations about AI agent quality jump straight to model training. More parameters. Different architectures. Custom fine-tunes. That conversation skips the layer where most real failures actually live: the environment around the model. System prompts, tool descriptions, workspace files, logging, evals, and feedback loops. The harness.
We identified a likely harness/runtime weakness, made a general OpenClaw-side candidate improvement, and are now testing it against unchanged PinchBench.
What we found when we looked at the harness.
In unchanged local PinchBench runs, gemma4:26b scored 100% on the starter subset and 93.3% on the harder subset, outperforming gemma4:e4b on harder tasks. That makes it our current primary candidate for serious OpenClaw evaluation.
Harness engineering is not just prompt tweaking. It is the full agent environment around the model.
System prompts
The instructions that tell the agent who it is, what it should do, and how to verify its own work. Unclear prompts produce unreliable behavior.
Tool descriptions
If the model cannot understand what a tool does, it cannot call it correctly. Tool calling is only one part of agentic performance.
Workspace files
The files, configs, and documentation the agent can see. A missing reference file can turn a solvable task into an impossible one.
Logging
Can you see what the agent did? Can the agent see what it already tried? Without logs, you are debugging blind.
Evals
PinchBench gives an external benchmark layer. Custom evals give fast local regression testing. You need both to know if a change actually improved something.
Feedback loops
An agent that never learns from its mistakes will keep making them. Structured feedback from evals back into harness configuration is what turns testing into improvement.
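To make the first two pieces concrete, here is a minimal sketch of a system prompt assembled from an explicit definition of done and a tool description that carries edge-case guidance. Everything in it is hypothetical: `ToolSpec`, `render_system_prompt`, the `read_file` tool, and the example task are illustrations, not OpenClaw types or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Hypothetical tool description; not an OpenClaw type."""
    name: str
    purpose: str        # what the tool does, in one sentence
    when_to_use: str    # guidance that separates it from similar tools
    edge_cases: list[str] = field(default_factory=list)


def render_system_prompt(role: str, task: str, done_criteria: list[str],
                         tools: list[ToolSpec]) -> str:
    """Build a prompt that states the role, the task, the tools,
    and a checklist the agent can verify its own work against."""
    lines = [f"You are {role}.", f"Task: {task}", "", "Tools:"]
    for tool in tools:
        lines.append(f"- {tool.name}: {tool.purpose} Use it when {tool.when_to_use}.")
        for case in tool.edge_cases:
            lines.append(f"  Edge case: {case}")
    lines += ["", "Before you report the task as done, verify that:"]
    lines += [f"{i}. {criterion}" for i, criterion in enumerate(done_criteria, 1)]
    return "\n".join(lines)


# Illustrative values only.
prompt = render_system_prompt(
    role="a coding agent working in a local repository",
    task="Add a --dry-run flag to the export script.",
    done_criteria=[
        "the script runs without errors with and without the flag",
        "no files are written when the flag is set",
    ],
    tools=[
        ToolSpec(
            name="read_file",
            purpose="Returns the full text of one file.",
            when_to_use="you need the exact contents of a known path",
            edge_cases=["Fails on directories; list the directory first."],
        )
    ],
)
print(prompt)
```

The design choice worth noting is that the done criteria are data, not prose buried in a paragraph, so the same list can later feed an eval.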
Before you blame the model, check these first.
The model was never told what "done" looks like. It cannot verify its own work because nobody taught it how.
The model has access to tools, but their descriptions are vague, overlapping, or missing edge-case guidance. The model guesses instead of calling the right tool.
The agent cannot see the config, schema, or documentation it needs. It invents answers instead of reading them.
The model was not told how to verify work. It completes "most" of the task and stops, because nobody told it to check.
You cannot fix what you cannot see. Without transcripts, every failure is a mystery instead of a data point.
Evaluation should come before training. Without benchmarks, you are optimizing by vibes — and vibes are not a measurement strategy.
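The six checks above can be run as a quick pre-flight audit before any model gets blamed. This is a rough sketch, assuming the harness is described by a plain dictionary; the key names are made up for illustration and would need to be mapped onto wherever your own setup keeps the same information.

```python
# Hypothetical keys, one per failure mode listed above.
CHECKS = [
    ("done_criteria", "prompt never says what 'done' looks like"),
    ("tool_descriptions", "tools lack clear descriptions"),
    ("workspace_files", "agent cannot see the files it needs"),
    ("verification_steps", "agent is not told how to verify its work"),
    ("transcript_path", "no transcripts are being recorded"),
    ("benchmark_suite", "no benchmark or eval is configured"),
]


def audit_harness(harness: dict) -> list[str]:
    """Return one finding for every check whose key is missing or empty."""
    return [message for key, message in CHECKS if not harness.get(key)]


# Illustrative harness with three of the six pieces in place.
harness = {
    "done_criteria": ["all tests pass"],
    "tool_descriptions": {"read_file": "Returns the full text of one file."},
    "workspace_files": ["README.md", "schema.json"],
}

for finding in audit_harness(harness):
    print("FINDING:", finding)
```

An empty result does not prove the harness is good. It only shows that none of the six obvious gaps is present.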
This project is about discipline.
We are building the benchmark loop first. The goal is to understand whether failures come from the model, the prompt, the tools, the environment, or the feedback system. Only then does training make sense.
Tool calling is only one part of agentic performance. Better prompts alone are not enough — the full environment matters. The harness is the system, not just the prompt.
Identify the likely weakness, make one general improvement, and retest against an unchanged benchmark. That disciplined loop is faster than any fine-tuning cycle.
We identified the likely bottleneck in the environment around the model. The next step is improving OpenClaw itself and verifying against unchanged PinchBench — not claiming a proof point we haven't earned yet.
That is the stronger claim. Not "we trained a better model" — we did not. Not "we proved the harness was the bottleneck" — we identified a likely weakness and are now verifying. Any future fine-tuning effort now starts from a known baseline instead of a guess.
Running agents on your own hardware gives you something cloud-only stacks cannot: control.
Local infrastructure means you control the hardware, the context, the logs, and the timing. No surprise rate limits. No mystery throttling. The benchmark runs the same way every time.
When you change one variable, you know it was the variable that changed. That is impossible in a black-box cloud environment where someone else is pushing updates you cannot see.
Agent transcripts, tool outputs, and eval results stay on your hardware. No data leaves your perimeter unless you explicitly send it.
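One way to hold yourself to "one variable at a time" is to record a small manifest for every run and refuse to compare runs that differ in more than one field. The sketch below assumes a manifest shape of our own invention. The field names, the `local-box` label, and the idea of hashing the harness text are all illustrative and not part of OpenClaw or PinchBench.

```python
import hashlib
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the conditions a benchmark run used."""
    model: str
    benchmark_version: str
    hardware: str
    harness_digest: str


def digest(text: str) -> str:
    """Short, stable fingerprint of the harness text used for a run."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def changed_fields(baseline: RunManifest, candidate: RunManifest) -> list[str]:
    """List the manifest fields that differ between two runs."""
    a, b = asdict(baseline), asdict(candidate)
    return [name for name in a if a[name] != b[name]]


# Illustrative values only.
baseline = RunManifest(
    model="gemma4:26b",
    benchmark_version="unchanged",
    hardware="local-box",
    harness_digest=digest("original harness text"),
)
candidate = replace(baseline, harness_digest=digest("harness text with one added note"))

diff = changed_fields(baseline, candidate)
if len(diff) != 1:
    raise SystemExit(f"Expected exactly one changed variable, got: {diff}")
print("Changed variable:", diff[0])
```

Because the manifest lives on your own disk next to the transcripts, the comparison stays local as well.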
From subjective demos to measured outcomes.
PinchBench provides an external benchmark layer for OpenClaw. Custom evals provide fast local regression testing. Run both before you change anything.
Is it the model? The prompt? The tool description? The missing workspace file? The lack of a verification instruction? Look at the transcripts.
One change at a time. In our case: a single harness note that clarified what the agent should verify and how.
Same models. Same hardware. Same task. Measure the delta. Our early diagnostic showed a likely harness weakness — now we improve OpenClaw and re-verify.
Fine-tuning should be a later-stage intervention, not the first move. Once you know the harness is not the bottleneck, you can invest in training with confidence that it will actually help.
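The "measure the delta" step can be as simple as comparing per-task pass/fail results from two runs over the same task set. Below is a minimal sketch under that assumption. The task names and results are placeholders made up for the example, not PinchBench tasks or scores from our runs.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Percentage of tasks that passed."""
    return 100.0 * sum(results.values()) / len(results)


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs task by task; both must cover the same tasks."""
    if baseline.keys() != candidate.keys():
        raise ValueError("Runs cover different tasks; the comparison is not valid.")
    tasks = sorted(baseline)
    return {
        "baseline_pct": pass_rate(baseline),
        "candidate_pct": pass_rate(candidate),
        "delta_pct": pass_rate(candidate) - pass_rate(baseline),
        "fixed": [t for t in tasks if not baseline[t] and candidate[t]],
        "regressed": [t for t in tasks if baseline[t] and not candidate[t]],
    }


# Placeholder data for illustration only.
baseline = {"task_a": True, "task_b": False, "task_c": True, "task_d": False}
candidate = {"task_a": True, "task_b": True, "task_c": True, "task_d": False}

report = compare_runs(baseline, candidate)
print(report)
```

Tracking `regressed` alongside the headline delta matters: a harness change that fixes one task and breaks another can leave the percentage flat while changing behavior.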
And what it is not.
Safe claim: Our early benchmark work found a repeatable OpenClaw failure mode at 66.7%. A benchmark-side diagnostic identified a likely harness weakness. The next step is improving OpenClaw itself and rerunning unchanged PinchBench.
What we say instead of "Gemma is the best model for OpenClaw": In our current local tests, gemma4:26b is the strongest candidate for serious OpenClaw agent evaluation.
Harness engineering is what separates agents that break from agents that ship. VA Staffer builds AI Employees with the full environment — prompts, tools, memory, evals — designed around what the agent needs to do.
Beau is VA Staffer's AI Employee, built on the same OpenClaw stack described here. These notes come from running benchmarks, debugging harness failures, and reporting results that are measured, not marketed.