The discourse around AI agents has gotten ahead of the practice. Most of what you read assumes we already know which work to point them at, and the rest treats the question as one of how aggressive to be rather than whether the task fits at all. What's missing is not urgency but clarity about which work actually fits.

The split I keep coming back to is simple, and most real workflows have both halves of it. Some work has a verifiable correct answer. Some does not. That single distinction does most of the work in this essay, though not in the way the framing usually goes.

It does not tell you whether to use an agent. Agents have a place on both sides of the line. Sometimes they are the most powerful tool for work that is clearly non-deterministic, and sometimes they are the only practical way to do deterministic work whose inputs were too messy for traditional software to handle.

What the distinction does tell you is how to architect the system around the work. Work with a right answer needs guarantees, validation, and a deterministic spine, even when an agent is the thing producing the answer. Work without a right answer is where probabilistic systems earn their keep. There is also a third axis running across both halves of the split, which is human oversight. In a business setting a human stays accountable for what the agent produces, regardless of which half the work falls on, and the architecture has to reflect that.

The mistakes of this moment cluster around blurring those lines, whether by treating deterministic work as if it has no right answer, holding out for a deterministic answer where there is not one, or letting an agent run unsupervised on output a human's name will end up on. The rest of the essay is examples, edge cases, and a table worth keeping on hand.

Categories

Deterministic work has a verifiable correct answer. The procedure to get there is known, the cost of being subtly wrong is usually high, and across domains it looks like arithmetic, applying a rule, lookups, reproducible procedures, and compliance against a checklist.

Non-deterministic work has no single correct answer. It requires judgment, synthesis, ambiguity resolution, and creativity; the value of the work is in navigating an under-specified space. Across domains it looks like drafting, deciding, prioritizing, reviewing for intent, summarizing, and naming.

Three clarifications before the rest of the essay leans on this.

The split is not the same as easy versus hard. Plenty of deterministic work is brutally hard, since IRS tax calculations and drug dosing in unusual populations both have a right answer that takes specialist training to reach. Plenty of non-deterministic work is easy, since picking what to eat for lunch has no objectively correct answer and the stakes are small. Most jobs are also a mix. Almost no real workflow is purely one or the other, and the skill is decomposing it.

Verifiable usually means against a standard, not against objective truth. A drug dose is correct because a clinical reference says so. A tax filing is correct because the tax code defines it. The category is "the right answer is determined by a stable standard you can encode", not "the right answer is obviously true". That is still a clean line, and it is the one that matters for system design.

Deterministic, in this essay, is about what the task demands rather than about what tools happen to exist for it. A task demands deterministic treatment when it requires guarantees, reproducibility, and auditability, regardless of whether a CPU tool, an agent, or a human is the thing producing the answer. The category is a property of the task. It is not a property of the technology pointed at it.

Metaphor

You do not replace your CPU with a GPU. They are not competing. Modern computers have both because the two architectures are specialized for different shapes of computation. A CPU runs sequential, exact, deterministic operations. A GPU runs massively parallel workloads where exactness is not the constraint.

Same with knowledge work. A healthy organization has CPU work, meaning rules and processes and automation, and it has GPU work, meaning judgment and creativity and synthesis. AI agents are a new kind of GPU. You still need a CPU.

The mistake of this moment is not believing in AI. It is mistaking your new GPU for a better CPU.

There is one asterisk on the metaphor that the next section is about. Sometimes you recruit a GPU to do a CPU-shaped job because no CPU could read the input. That is real and increasingly common, and it does not change the rules. The work is still CPU-shaped, and you architect accordingly.

Bridging

This is the part of the framework that gets oversimplified, so it is worth pausing on.

Some deterministic work has been done by humans for decades, not because it required judgment but because the inputs were too unstructured for traditional software to handle. The work has a verifiable correct output and the procedure is known, but the input is a messy invoice in fourteen different vendor formats, or a referral letter, or a freeform return request. Other examples in the same shape: filling out a form on a vendor portal that has no API, categorizing inbound leads against a known taxonomy from a freeform LinkedIn bio, reading lab reports and pulling out standardized values for an EHR, translating "the color is wrong, I want to send it back" into the right refund category, reviewing a stack of receipts and assigning each to an expense category.

This work is deterministic. There is a right answer per input. For years it was a human's job because traditional software could not bridge the messy input to the structured output.

This is where AI agents are doing something new. There was no tool before. They are not replacing a CPU tool, they are replacing a human who was doing deterministic work the slow way, and that is a real unlock. It should not be confused with the over-corrections later in the essay.

The crucial nuance is that the work is still deterministic, even though it is now being done by a probabilistic system. Treat it that way. Wrap the agent in deterministic guardrails, meaning schema validation, range checks, known-answer spot tests, and anomaly flags. Spot-verify outputs, because the agent is a probabilistic worker on a deterministic task. The architecture is straightforward in shape: agent parses the unstructured input, deterministic system validates and acts.
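A minimal sketch of that shape, assuming a hypothetical `agent_extract` function standing in for the LLM call; everything after it is the deterministic spine:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Invoice:
    vendor: str
    total_cents: int
    currency: str

# Hypothetical stand-in for the probabilistic worker. In a real system this
# would be an LLM call that extracts fields from a messy vendor document.
def agent_extract(raw_document: str) -> str:
    return '{"vendor": "Acme GmbH", "total_cents": 418250, "currency": "EUR"}'

KNOWN_CURRENCIES = {"USD", "EUR", "GBP"}
MAX_PLAUSIBLE_CENTS = 10_000_000  # range check: flag anything over 100k units

def parse_invoice(raw_document: str) -> Invoice:
    """Agent parses; the deterministic spine validates before anything acts."""
    data = json.loads(agent_extract(raw_document))        # schema: must be JSON
    invoice = Invoice(
        vendor=str(data["vendor"]),                       # schema: required fields
        total_cents=int(data["total_cents"]),
        currency=str(data["currency"]),
    )
    if invoice.currency not in KNOWN_CURRENCIES:          # range check
        raise ValueError(f"unknown currency: {invoice.currency}")
    if not (0 < invoice.total_cents <= MAX_PLAUSIBLE_CENTS):  # anomaly flag
        raise ValueError(f"implausible total: {invoice.total_cents}")
    return invoice

print(parse_invoice("...fourteenth vendor format..."))
```

A known-answer spot test is the same idea one level up: feed the pipeline documents whose correct extraction you already know, and alarm on any drift.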

You are using a probabilistic tool to do CPU-shaped work. Useful, but you wrap it in CPU-style guardrails, and you do not trust the output the way you trust a compiler.

This is also the answer to "but I replaced a clerk with an agent and it works great". Yes. That is a fit. The work was always deterministic, and the constraint was that no machine could read the inputs until now. The framework does not say to avoid agents on deterministic work. It says to know that the work is deterministic, and architect accordingly.

Oversight

One more layer running across both halves of the framework. In any business setting, a human stays accountable for what the agent produces.

This is partly legal and professional. The lawyer who files the brief is responsible for what is in it, regardless of whether an LLM drafted it. The doctor who signs the chart owns the diagnosis. The engineer on call inherits the system the agent helped build. The CEO answers to the board for what the company did, even if half of it ran through agents. Agents do not take the fall, get sued, get promoted, or get fired. Humans inherit both the blame and the credit; where accountability sits is structural, and the architecture has to reflect it.

Anywhere the stakes are above casual, oversight is a baseline rather than a nice-to-have.

What oversight looks like depends on which side of the framework the work is on. On the deterministic side, including the bridging case from earlier, oversight is sampling, anomaly review, schema validation, and spot-checks against the encoded standard. The question is "did the agent produce the right answer", and you can answer it because the right answer exists. On the non-deterministic side, oversight is editorial. There is no standard to check against, and the question becomes "would I be willing to put my name on this". Someone whose taste or expertise you trust reviews. The bar is fitness for purpose, not correctness.

The stakes scale the oversight. Low-stakes outputs tolerate light review, like sampling a few outputs and looking for outliers. High-stakes outputs need every result reviewed before it leaves the building, sometimes by more than one person.

A practical rule. Design the oversight in, not around. The pipeline should make human review a step with a clear queue and a clear surface for the human to act on. If oversight is "manual, when someone remembers" then it is the thing that drops first when the team is busy. If it is a tracked stage with an explicit owner, it survives.
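One way to make that concrete, sketched here with hypothetical names; the point is that review is a tracked stage with an owner and a queue, not a side channel:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ReviewItem:
    output_id: str
    payload: str
    owner: str                     # a named human, not "someone"
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    approved: bool | None = None   # None = still in the queue

class ReviewQueue:
    """Human review as a tracked pipeline stage, not an afterthought."""
    def __init__(self) -> None:
        self.items: list[ReviewItem] = []

    def submit(self, output_id: str, payload: str, owner: str) -> None:
        self.items.append(ReviewItem(output_id, payload, owner))

    def pending(self, owner: str) -> list[ReviewItem]:
        return [i for i in self.items if i.owner == owner and i.approved is None]

    def decide(self, output_id: str, approved: bool) -> None:
        for item in self.items:
            if item.output_id == output_id:
                item.approved = approved
                return
        raise KeyError(output_id)

queue = ReviewQueue()
queue.submit("inv-0042", "refund draft ...", owner="dana")
print(len(queue.pending("dana")))   # 1: visible, owned, cannot silently drop
queue.decide("inv-0042", approved=True)
```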

One nuance worth saying out loud. Oversight is not really about distrust of the agent. Even if the agent is more reliable than the human reviewing it, the human is the one whose name is on the output. Oversight is a function of where accountability sits, not where confidence sits. That is why "the agent never makes mistakes" is not an argument against oversight. Accountability and accuracy are not the same problem, and the architecture has to address both.

Patterns

Three patterns worth naming. They share a root mistake, which is treating work that has a right answer as if it does not. They are not the bridging case from earlier, where the work has a right answer and no deterministic tool ever existed. Here, the work has a right answer and a deterministic tool already existed, and an agent got pointed at it anyway. The patterns also tend to drop the oversight layer, because part of the appeal is that the agent just handles it.

Using AI for problems that are already solved. Asking a probabilistic system to do something a deterministic tool does perfectly, or near-perfectly. Asking ChatGPT to total a column when Excel exists. Asking an agent to look up a customer record instead of querying the CRM. Generating an NDA from scratch when a template is sitting in the shared drive. A doctor using an LLM to compute a standard-population drug dose, when a calculator and a dosing reference are built for that exact case. The appeal is convenience, since typing what you want into a chat is faster than digging up the right cell or query or template. The cost is that you have traded a guarantee for an interface.

Throwing away repeatable processes. Replacing documented, auditable systems with free-form prompts. Swapping a runbook for "ask the agent". Replacing a return-policy decision tree with case-by-case agent rulings. Letting AI "do" payroll instead of payroll software with AI handling the edge cases. A law firm replacing a conflict-check system with "we will just ask Claude". Documented processes require upfront work that agents let you skip, which is most of the appeal. What you give up is reproducibility, audit trail, and the surface that oversight lands on. When something breaks, there is nothing to inspect.

Conflating deciding with doing. Almost every workflow has both halves. AI is good at the deciding half. It is dangerous on the doing half when "doing" has a verifiable right answer. A doctor decides which drug, which is GPU work, and computes the dose, which is CPU. A lawyer decides the legal strategy and finds the actual cited authorities, and the Mata v. Avianca case from 2023 is the canonical version of what happens when those two steps collapse into one. A finance team judges which anomalies matter, which is GPU, and matches transactions or cites peer comps that exist, which is CPU. An HR team decides on a termination, which is GPU, and executes it correctly under employment law, which is CPU. A founder decides what to build, which is GPU, and runs the deploy, which is CPU. The deciding part is where judgment matters most, so the doing part feels like a safe delegation. What you actually get is confidently-wrong execution where there should have been deterministic execution, and this is the failure mode hardest to catch, because the output looks right, oversight gets thin, and the human only finds out when something downstream breaks.
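A sketch of keeping the halves separate, with hypothetical names: the agent proposes a category, which is the judgment half, and the refund amount comes from the encoded policy, never from the model:

```python
REFUND_POLICY = {  # the encoded standard: category -> refund percentage
    "defective": 100,
    "wrong_item": 100,
    "changed_mind": 80,
    "outside_window": 0,
}

# Hypothetical stand-in for the judgment half: the agent reads "the color is
# wrong, I want to send it back" and proposes a category.
def agent_classify(complaint: str) -> str:
    return "wrong_item"

def refund_cents(complaint: str, price_cents: int) -> int:
    category = agent_classify(complaint)        # GPU: judgment, reviewable
    if category not in REFUND_POLICY:           # guardrail: no novel categories
        raise ValueError(f"unknown category: {category}")
    return price_cents * REFUND_POLICY[category] // 100  # CPU: arithmetic, guaranteed

print(refund_cents("the color is wrong, I want to send it back", 4999))  # 4999
```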

Table

The table to keep on hand. The left column is deterministic work. Some of it has a long-standing tool, like Excel or the CRM or the compiler. Some of it has historically required a human because the inputs were messy, which is the bridging case, and an agent wrapped in guardrails is now a legitimate fit. Either way the work is deterministic and the oversight model is verification, meaning sampling, anomaly review, and spot-checks against the standard. The right column is where judgment did not scale before, and where agents earn their keep. Oversight there is editorial.

| Domain | CPU work (deterministic) | GPU work (non-deterministic) |
| --- | --- | --- |
| Programming | Compiling, running tests, deploying, applying lints, schema migrations | Architecting, naming, code review for intent, debugging hypotheses, spec writing |
| Sales and GTM | Updating CRM, computing commission, scheduling sequences, generating quotes from a price book, scraping and normalizing lead data | Crafting custom outreach, qualifying ambiguous leads, deal strategy, account research |
| Marketing | Scheduling posts, computing CAC, A/B test math, applying brand templates | Copywriting, campaign concepts, brand voice, audience synthesis |
| Recruiting and HR | Scheduling interviews, generating standard offers, applicant tracking, payroll, parsing résumés into a structured pipeline | Evaluating fit, writing role descriptions, candidate shortlisting, performance feedback |
| Customer support | Looking up accounts, applying refund policy, routing tickets by rules, classifying inbound by category | Drafting empathetic replies, triaging vague reports, escalation judgment |
| Finance and accounting | Reconciliation, tax calculations, expense classification, invoice generation, parsing receipts and invoices into a ledger | Strategy, forecasting, audit judgment, board narrative |
| Legal | NDAs from templates, conflict-of-interest checks, filing deadlines, citation lookup against real databases, extracting structured fields from contracts | Drafting novel arguments, weighing precedent relevance, reviewing unique contracts for risk, client advising |
| Medicine | Drug dose calculation for standard populations, drug-interaction checks, billing codes, lab-value flagging, extracting standardized values from referral letters | Differential diagnosis, treatment discussions, edge-case judgment, dosing where weight or renal or interaction nuance matters |
| Restaurants and hospitality | Inventory counts, tip math, loyalty rules, scheduling, parsing handwritten inventory notes into a digital count | Menu design, complaint resolution, hiring, brand decisions |
| Logistics and ops | Route optimization with static constraints, tracking, billing, compliance reports, parsing freight documents | Dynamic re-routing, supplier negotiation, capacity planning, exception handling |
| Construction and trades | Material quantity calculations, code-compliance checklists, scheduling, parsing supplier quotes | Bid strategy, design trade-offs, customer disputes |
| Education | Grading multiple choice, attendance, scheduling, transcript generation | Lesson design, feedback on essays, mentoring conversations |

Reading the table left to right shows the boring half of every job, most of which has long been automated, and that is the point. AI does not replace that automation. It routes around it (the over-correction) or bridges into it where there was no tool (the unlock). Reading top to bottom, the right column is the work that has historically been under-leveraged, because judgment did not scale.

The wrong move is dragging items from the left column into the right column "because AI", which treats deterministic work as if it has no right answer.

Programming

Programming is the most-discussed domain right now and a clean illustration of why decomposition matters. A feature ships through a sequence of stages, and each stage sits firmly on one side of the split. The confusion in the current discourse comes from bundling "writing code" into a single step, when in practice it includes both judgment and verification.

In 2026 this also has a practical interface consequence: in many teams, a large share of "coding" time is no longer spent typing code but typing instructions. The unit of work is a prompt, a diff request, or a spec for an agent, and the model produces the first draft of the code. Engineers are still doing programming, but the muscle has shifted from keystrokes to directing and constraining a generator.

Deciding what to build is GPU, since customer signals and trade-offs and judgment do not have a single right answer. Deciding the architecture is GPU as well, since future flexibility and fit with existing systems are judgment calls. Translating intent into an implementation is also usually GPU, because the spec is under-specified and there are many acceptable implementations. This is exactly why AI is now widely used to generate code: it is good at producing plausible drafts under ambiguity.

The important distinction is that "prompting" is GPU work, but the thing it produces is still CPU-shaped. The program has deterministic semantics once it is pinned to a language, compiler, dependencies, and environment. So modern code generation is best understood as a bridging case: a probabilistic system produces a structured output that must then be forced through deterministic guardrails. The engineer is not replaced; their role moves up a layer, from authoring every line to specifying constraints, reviewing diffs, and insisting on proofs.

Compiling and running is CPU, because the compiler is god and you never use AI for it. Running tests is CPU, because assertions are deterministic and AI does not "evaluate" them. Linting, typechecking, schema validation, reproducible builds, and policy enforcement are CPU. Deploying is CPU, and the CI/CD pipeline is deterministic on purpose, which is why you do not replace it with a chat agent.

Where AI shines on the programming side is in the GPU stages: proposing designs, generating implementation options, explaining code, spotting suspicious patterns, and producing debugging hypotheses. It is also unusually good at security and functional review when framed as hypothesis-generation: "here are three ways this could fail" or "here is a plausible exploit path". But the confirmation step is still CPU: a test, a proof, a reproducer, a static analysis finding, a trace.

The high-leverage architecture is therefore: AI drafts and proposes; deterministic systems validate and execute; a human stays accountable for the outcome. Production agentic systems often blur this in practice, with the agent both deciding and executing and validation happening after the fact, and that is a defensible trade-off when the stakes are low. As stakes rise, the architecture should bend toward strict separation, and toward the bridging pattern from earlier: agent plus deterministic guardrails plus a named human on the hook.
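At its smallest, that separation can look like the sketch below; the commands are illustrative stand-ins for whatever your toolchain actually runs, and the gate does not care who wrote the diff:

```python
import subprocess

# The deterministic gate: each step either passes or the draft does not land.
# Commands are illustrative; substitute your own linter, typechecker, tests.
GATE = [
    ["ruff", "check", "."],   # lint
    ["mypy", "."],            # typecheck
    ["pytest", "-q"],         # tests: assertions are deterministic
]

def gate_passes() -> bool:
    for cmd in GATE:
        if subprocess.run(cmd).returncode != 0:
            return False
    return True

# The agent drafts upstream of this; the gate is indifferent to authorship.
if gate_passes():
    print("ready for human review and merge")  # accountability stays with a person
else:
    print("rejected: back to the drafting stage")
```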

Objections

A few honest pushbacks worth addressing.

Agents will get more reliable. Does this distinction stop mattering then? Some failure modes will get better. The architectural distinction will not, because it is not really about today's models. It is about the shape of the work. A task with a verifiable correct answer needs a system optimized for verifiable correct answers, and that is a property of the task. In raw reliability terms, even 99.9 percent reliability creates unacceptable downstream risk on deterministic tasks, because small errors compound: a thousand-step pipeline at 99.9 percent per step finishes clean only about 37 percent of the time (0.999^1000 ≈ 0.37). And we already have 100 percent tools available, like calculators and databases and compilers. The bar to replace those is not "good enough". It is "as reliable as what is already there, plus something the existing tool cannot do".

What is actually verifiable? Most right answers are just standards somebody wrote down. True, and that is the right level of verifiable. Verifiable almost never means "true in nature". It means "checkable against a stable encoded standard", whether the standard is the IRS code, a clinical reference, an SLA, or a refund policy. That is still a clean line. If the standard is stable enough to encode, the work is deterministic and you treat it that way. If the "standard" is contested, evolving, or context-dependent, that is the non-deterministic side.

What about work that is deterministic in principle but too expensive to verify in practice? The honest case is something like "is this codebase secure". There is a correct answer in principle, but finding and proving it is expensive: the space of possible bugs and exploit chains is enormous, and verification often requires stitching together evidence across code, dependencies, configuration, and runtime behavior. In practice you make probabilistic progress by combining cheap deterministic proxies (static analysis, fuzzing, known-vulnerability scans, config checks) with hypothesis-driven investigation. This is also a place where agents can help, not by being the authority on "secure", but by generating and prioritizing hypotheses, connecting weak signals, and proposing exploit paths. The architecture is still: probabilistic search and triage upstream, deterministic proof/PoC and regression checks downstream, with escalation to humans for the residual.
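A sketch of that layering, with every name hypothetical: deterministic proxies emit checkable facts, the agent connects and ranks them, and nothing counts as a result until a deterministic reproducer confirms it:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    source: str      # "static_analysis", "dependency_scan", ...
    detail: str
    severity: int    # from the deterministic tool, not the model

def deterministic_proxies() -> list[Finding]:
    # Stand-ins for real scanners; each emits checkable facts.
    return [
        Finding("dependency_scan", "lib-x 1.2 has CVE-XXXX-YYYY", 7),
        Finding("static_analysis", "unsanitized input reaches query builder", 5),
    ]

# Hypothetical: the agent connects weak signals into ranked hypotheses.
def agent_prioritize(findings: list[Finding]) -> list[str]:
    return ["chain: tainted input + vulnerable lib -> injection path"]

def confirmed(hypothesis: str) -> bool:
    # CPU step: a proof-of-concept, a failing test, a reproducer. Until this
    # passes, the hypothesis is a lead, not a result.
    return False

leads = agent_prioritize(deterministic_proxies())
results = [h for h in leads if confirmed(h)]
print(f"{len(leads)} leads, {len(results)} confirmed")  # humans take the residual
```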

Is human oversight not just a bottleneck? Does it not kill the speed AI is supposed to bring? Sometimes. But oversight is not really negotiable in any setting where a human's name is on the output, because accountability is not a speed dial. The right move is not to skip oversight but to design it in so it is the cheapest version of itself: sampling, anomaly review, exception queues, and automated checks first, with humans only on the residual. Speed is not lost to oversight. Speed is lost to oversight that was not designed in and got bolted on later.

What about hallucinations? Does that not make the whole thing unreliable? The distinction between hallucination and draft is what this framework is about. On non-deterministic work, "hallucination" is the wrong word, because there is no ground truth to deviate from. On deterministic work, it is the right word, and the answer is not to make the agent more reliable. The answer is to not put a probabilistic system in a place that demands a guarantee.

Test

A field test you can run on any task before reaching for an agent.

Where does the work fall? Is there a verifiable correct answer against some encoded standard? If yes, the task is deterministic, regardless of who or what does the work. If the output has multiple legitimate answers requiring judgment that cannot be reduced to rules, or the search space is too large for a human to enumerate, the task is non-deterministic.

On the deterministic side, what tools exist? If a reliable deterministic tool already exists, use it, possibly with an agent calling it. If no tool exists but a human does it because the inputs are messy, that is the bridging case, and an agent plus deterministic guardrails fits.

Then check the stakes. What is the cost of being subtly wrong? If high, wrap whatever is doing the work in verification, agent or tool. Whose name is on the output, and how are they staying informed by design rather than by hope? Answer that before the agent ships, not after.
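The same test compressed into a routing sketch, with the attributes as hypothetical booleans you answer honestly per task:

```python
def route(has_right_answer: bool, tool_exists: bool, messy_inputs: bool,
          high_stakes: bool) -> str:
    """The field test as code. Inputs are your honest answers, per task."""
    if not has_right_answer:
        plan = "agent with editorial review"                # GPU side
    elif tool_exists:
        plan = "deterministic tool, agent may call it"
    elif messy_inputs:
        plan = "agent wrapped in deterministic guardrails"  # bridging case
    else:
        plan = "build the missing deterministic tool"
    if high_stakes:
        plan += "; every output verified, named human on the hook"
    return plan

print(route(has_right_answer=True, tool_exists=False,
            messy_inputs=True, high_stakes=True))
```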

The one-liner. If it has a right answer, treat it as deterministic, even if an agent is the only thing that can do it. If it does not, that is where agents earn their keep. Either way, somebody human is on the hook.

Closing

A few things follow from taking the distinction seriously.

The first is that most organizations probably need more deterministic systems than they have, not fewer. A lot of CPU work is still being done by hand, sometimes because no tool existed (the bridging case, where agents paired with deterministic guardrails are now the primary system), and sometimes because the existing tool was never adopted or resourced. In practice, cheaper GPUs (agents) can lower the barrier to building the missing CPU tools: you can use a probabilistic worker to translate messy reality into structured inputs, then let deterministic systems do the guaranteed parts. The Jevons-paradox version is that as the cost of "automation" drops, you end up doing more of it, not less. Either way, the move is not "let an agent freelance". It is deterministic outcomes by whatever route, whether a script, a workflow, or an agent wrapped in guardrails.

The second is that GPU work has been quietly under-served for a long time, because judgment did not scale. That is where agents earn their keep, and it is the place where "use AI more, not less" is the right answer.

The third is that human oversight scales with stakes, not with how impressive the agent is. The question that determines how much oversight a system needs is not "how reliable is this model". It is "whose name is on the output, and what happens to them if it is wrong". Nobody likes to get fired. Design oversight in. Make it a queue, an owner, a step. Otherwise it is the thing that drops first.

The fourth is that the strategic skill is the decomposition itself: looking at a workflow and seeing which steps need a deterministic spine and which do not, whether to deliver that spine with a tool or an agent or a human, and where the human accountability lives. That skill is undervalued right now, partly because the loud version of the discourse skips it.

Keep your CPU. Add a GPU. Keep a human on the hook. Do not confuse the three.