The 8-Question Audit That Separates AI Strategy From AI Spend
Anyone can buy a Claude license. Almost nobody builds the harness around it — the orchestration, context, and discipline that turn a model into compounding leverage.
That gap is the difference between teams getting compounding leverage from AI and teams writing big checks for a productivity bump that disappears the moment a competitor signs the same contract.
This is the 8-question audit I use with operators deploying AI in 2026. Each question is anchored to a real event from the last two weeks of June 2026: the framework was built in response to those events, not ahead of them.
Answer all 8 for every AI deployment in your stack and you have a strategy. Answer 0–3 and you have a procurement habit — run this audit before your next vendor call.
What is “the harness”?
The term comes from Nate Jones, who used it for everything that isn’t the model itself — the orchestration, context, deployment structure, review gates, and operating discipline that turn raw tokens into actual work output.
Salesforce paid $3.6B to acquire Fin (formerly Intercom) on June 16 — not because Fin had a better model. Fin runs on a proprietary fine-tune of off-the-shelf weights. Salesforce paid for Fin’s harness: the workflow architecture that resolves 76% of customer-support queries autonomously across chat, email, WhatsApp, SMS, phone, and Slack.
When Marc Benioff signed that check, he wasn’t admitting Salesforce can’t build models. He was admitting Salesforce can’t build harnesses fast enough.
You’re not Salesforce. The audit below is how you start.
Eight questions. Eight layers.
Where does context get injected, and who owns those prompts?
Every AI deployment runs on a context window someone wrote. If you can’t name the human who owns the prompts and the data they pull, you can’t optimize the output — you can’t even debug it.
This is also where the routing layer lives. On June 24, Lenny Rachitsky publicly defected from Claude Opus to GLM 5.2, an open-weight model from Z.ai that costs 82% less ($4.40 vs $25 per million output tokens) for a 0.7-point gap on FrontierSWE. If your team runs Opus by default, you may be paying more than 5x per output token without ever having measured the trade-off.
Name the prompt owner for every production AI workflow. If you can’t name one, you have orphaned spend.
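The routing arithmetic above is easy to check yourself. Here is a minimal sketch in Python that uses only the two per-million-output-token prices quoted above. The function names are illustrative, and the comparison ignores input-token pricing, latency, and the benchmark gap, so treat it as a first-pass check, not a full cost model.

```python
def savings_fraction(current_price: float, alternative_price: float) -> float:
    """Fraction of spend saved by switching (0.82 means 82% cheaper)."""
    return 1 - alternative_price / current_price


def price_multiple(current_price: float, alternative_price: float) -> float:
    """How many times more the current model costs than the alternative."""
    return current_price / alternative_price


# Prices per million output tokens, as quoted in the text.
OPUS_PRICE = 25.00
GLM_PRICE = 4.40

print(f"Savings: {savings_fraction(OPUS_PRICE, GLM_PRICE):.1%}")   # Savings: 82.4%
print(f"Multiple: {price_multiple(OPUS_PRICE, GLM_PRICE):.1f}x")   # Multiple: 5.7x
```

Swap in your own default model's price and your monthly output-token volume to turn the multiple into a dollar figure.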
Where does output go, and who reviews it before production?
Most AI deployments have no formal review of model output before it drives an action — sending an email, updating a record, qualifying a lead.
Every Salesforce admin has an AI horror story: Apex that looked perfect until it didn’t, flows that passed review and broke in production. What separates teams getting value from teams getting burned isn’t model choice — it’s whether DevOps foundations existed before AI arrived.
For each workflow, identify the human or process that catches errors before production. If there isn’t one, you’re shipping into the void.
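A review gate does not have to be elaborate to exist. The sketch below, in Python, shows one possible shape: every model output passes a list of checks before it is allowed to drive an action, and anything that fails is held for a human. The class name, the checks, and the email example are all hypothetical; real checks would be specific to your workflow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReviewGate:
    """Runs every check on a model output before letting it drive an action."""
    checks: List[Callable[[str], bool]]
    held_for_human: List[str] = field(default_factory=list)

    def submit(self, output: str, action: Callable[[str], None]) -> bool:
        """Perform `action` only if all checks pass; otherwise hold the output."""
        if all(check(output) for check in self.checks):
            action(output)
            return True
        self.held_for_human.append(output)
        return False


# Hypothetical checks for an outbound-email workflow.
def not_empty(text: str) -> bool:
    return bool(text.strip())


def no_unfilled_placeholders(text: str) -> bool:
    return "{{" not in text and "[NAME]" not in text


sent: List[str] = []  # stands in for the real "send email" action
gate = ReviewGate(checks=[not_empty, no_unfilled_placeholders])

gate.submit("Hi Dana, your renewal quote is attached.", sent.append)
gate.submit("Hi [NAME], your renewal quote is attached.", sent.append)

print(len(sent), "sent;", len(gate.held_for_human), "held for review")
```

The point of the design is that the action is only reachable through the gate, so there is always a named place where errors get caught.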
What gets logged so you can debug and improve?
If a prompt drove a customer interaction last Tuesday, can you replay it on Friday? Measure whether the response was correct? A/B test a different prompt against the same context?
You can’t optimize what you can’t see. Most AI deployments in 2026 are invisible to the company running them.
Open one production workflow and ask the team to show you last week’s run log. If they can’t, you’re flying blind on the most expensive line item of 2026.
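For a sense of what a replayable run log might look like, here is a minimal Python sketch. It assumes a JSON Lines file and a model call that takes a prompt and a context. The field names, the `fake_model` stand-in, and the sample ticket text are all invented for illustration, not taken from any particular tool.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Callable, Iterator, Optional


def log_run(log_path: Path, workflow: str, prompt: str, context: str, output: str) -> None:
    """Append one model call to a JSON Lines file."""
    record = {
        "ts": time.time(),
        "workflow": workflow,
        "prompt": prompt,
        "context": context,
        "output": output,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(log_path: Path) -> Iterator[dict]:
    """Yield logged runs in the order they were written."""
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def replay(record: dict, model: Callable[[str, str], str], prompt: Optional[str] = None) -> str:
    """Re-run a logged call on the same context, optionally with a different prompt."""
    return model(prompt if prompt is not None else record["prompt"], record["context"])


# Stand-in for a real model call so the sketch runs offline.
def fake_model(prompt: str, context: str) -> str:
    return f"[{prompt}] {context}"


log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
first_output = fake_model("Summarize politely", "Customer asks for a refund")
log_run(log_path, "support-reply", "Summarize politely", "Customer asks for a refund", first_output)

last_run = list(load_runs(log_path))[-1]
same = replay(last_run, fake_model)                                      # Friday's replay
variant = replay(last_run, fake_model, prompt="Summarize in one line")  # A/B candidate
print(same == last_run["output"], "|", variant)
```

Because the context is stored alongside the prompt, the same record supports all three questions above: replay, correctness review, and prompt A/B tests.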
What share of work does this deployment remove from a human’s plate?
The Removal % is the outcome metric that matters more than every vendor pitch combined. The benchmark to beat: Fin (now Salesforce) resolves 76% of support queries autonomously — in production, not pilot.
On June 25, Salesforce shipped Agentforce Help Agent with pay-per-resolution pricing: $2 per autonomous resolution, $0 on escalation, $0 on unresolved cases. The Removal % just became a SKU. If your vendor still charges per query or seat regardless of outcome, you’re funding their R&D.
Name the % of work each deployment removes from a human’s plate. No number, no AI program.
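The Removal % and the pay-per-resolution bill come from the same two counts. Here is a small Python sketch using the $2 price quoted above. The monthly volumes are hypothetical, chosen only so the result lands on the 76% benchmark.

```python
def removal_pct(resolved_autonomously: int, total_cases: int) -> float:
    """Share of cases the deployment handled with no human involved, in percent."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return 100 * resolved_autonomously / total_cases


def pay_per_resolution_cost(resolved_autonomously: int, price_per_resolution: float = 2.00) -> float:
    """Bill under outcome pricing: escalated and unresolved cases cost nothing."""
    return resolved_autonomously * price_per_resolution


# Hypothetical month: 1,000 support cases, 760 resolved without a human.
total_cases = 1_000
resolved = 760

print(f"Removal %: {removal_pct(resolved, total_cases):.0f}%")   # Removal %: 76%
print(f"Bill: ${pay_per_resolution_cost(resolved):,.2f}")         # Bill: $1,520.00
```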
How many tools does this agent actually need?
Most agents are bloated: 12, 15, 20 tools added “just in case.” Most of those tools confuse the agent more than they help it.
Vercel rebuilt their internal text-to-SQL agent in June, stripping 16 specialized tools down to one: arbitrary bash execution. Accuracy went from 80% to 100%, responses got 3.5x faster, and token cost fell 37%. Every team bolting more tools on is solving the wrong problem.
Open the tool list for your most-used agent. Cut three. Measure what changed.
What data does your stack see that no one else can?
Generic AI output looks generic because the model is trained on public data. Feed it the same web your competitors feed it and you sound like your competitors.
Jordan Crawford named it on June 19: the 2026 differentiator isn’t the prompt or the model — it’s the proprietary signal your harness sees that no one else can. Closed-deal data, real customer conversations, your ICP’s behavior, your private benchmarks. Clay’s new Audiences feature (open beta June 23) is the cleanest sign of this going portable.
List the data sources flowing into your AI workflows. Mark public vs proprietary. The ratio is your differentiation ceiling.
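One rough way to put a number on that ratio is sketched below in Python. The inventory is hypothetical, and counting sources equally is an assumption: a stricter version would weight each source by how much of the context it actually supplies.

```python
from typing import Dict


def proprietary_share(sources: Dict[str, bool]) -> float:
    """Fraction of sources marked proprietary (True); 0.0 for an empty inventory."""
    if not sources:
        return 0.0
    return sum(sources.values()) / len(sources)


# Hypothetical inventory: True = proprietary, False = public.
sources = {
    "closed-deal data": True,
    "customer call transcripts": True,
    "private benchmarks": True,
    "public web search": False,
    "purchased firmographic list": False,
}

print(f"{proprietary_share(sources):.0%} of sources are proprietary")  # 60% of sources are proprietary
```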
Who can verify what AI generated, at the speed it generated it?
This is the question that flipped everything I was advising in May 2026. Fiona Fung (Manager of Claude Code and Cowork at Anthropic) put it cleanly on June 22: “Verification and systems thinking, not generation, is now the scarce engineering skill.” Anthropic engineers ship 8x more code per quarter than 18 months ago; designers and PMs commit code directly.
AI killed the bottleneck on speed. The new bottleneck is verification at the speed of generation. If your team can’t keep up with what your AI produces, you’re accruing a quality-control debt your customers will collect on.
For each workflow, name the human who verifies output — then ask whether they can keep up with the volume. If not, you have a hire to make.
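Whether a verifier can keep up is a throughput question, and the arithmetic is short. The Python sketch below assumes a fixed review time per item, which is a simplification, and the weekly numbers are hypothetical.

```python
def weekly_review_capacity(reviewer_hours: float, minutes_per_item: float) -> float:
    """Items one verifier can review per week at a fixed time per item."""
    if minutes_per_item <= 0:
        raise ValueError("minutes_per_item must be positive")
    return reviewer_hours * 60 / minutes_per_item


def unverified_per_week(generated: float, reviewer_hours: float, minutes_per_item: float) -> float:
    """Items per week that ship unreviewed, or pile up, at current capacity."""
    return max(0.0, generated - weekly_review_capacity(reviewer_hours, minutes_per_item))


# Hypothetical workflow: 400 AI outputs a week, one reviewer, 10 minutes each.
capacity = weekly_review_capacity(reviewer_hours=40, minutes_per_item=10)
gap = unverified_per_week(generated=400, reviewer_hours=40, minutes_per_item=10)

print(f"Capacity: {capacity:.0f}/week, unverified: {gap:.0f}/week")  # Capacity: 240/week, unverified: 160/week
```

A gap above zero is the quality-control debt described above. It closes only with more reviewer hours, faster review, or less unreviewed volume.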
Can your team write a Whole-Job Spec?
The newest question, named by Nate Jones on June 23. Three years of “prompt engineering” trained operators to think small: better instructions for one turn. Jones argues that’s obsolete; the new core skill is task imagination, meaning the ability to spec a whole job and hand it off to a frontier model.
His framework has 9 fields, takes 20 minutes to fill in, and delegates an entire work unit, including what the first run will get wrong and how to correct it. One line of his is worth reading twice: “Asking bigger is the first instruction in three years that matches the size of the technology.”
Pick one workflow and write the Whole-Job Spec. Hand it to your AI stack. See how close the first run gets.
How to use the audit
This isn’t a one-time exercise. The harness changes every release cycle. Run it three ways:
What this audit doesn’t tell you
It doesn’t tell you which AI to buy, which use case to start with, or what to pay. It tells you whether you’ve built the infrastructure around the AI you already bought to make it pay back. Most companies haven’t.
The standing reference
The framework evolves — there will be a Q9 the moment a real signal makes it necessary. I publish the working evidence for each question on LinkedIn most weekdays, ranked from 34 voices across LinkedIn, X, Substack, YouTube, and blogs. Free, 6 AM PT.
Daily intelligence brief (top 5 signals): aventary.com/intelligence. For the operator-level Q&A — lightly edited and shipped — subscribe to Ask Mendy on LinkedIn.
Run the audit on your stack.
Aventary builds the harness around the AI you’ve already bought — routing, review gates, observability, and the verification capacity to keep up.