AI | 24 May 2026 | 5 min read

AI Workflow Design Patterns: A Practical Reference

Eight production-tested AI workflow design patterns with when to use each, failure modes, and concrete implementation notes for engineering teams

By AI Advisory team

Most AI workflows in production are not single LLM calls. They are compositions: a retrieval step, a routing decision, a tool call, a validation pass, sometimes a human review. The difference between a demo that works on stage and a system that runs 50,000 times a week is usually structural, not model-related. The wrong pattern with GPT-4 will lose to the right pattern with a smaller, cheaper model nine times out of ten.

This is a working reference to the design patterns we use when building AI workflows for mid-market clients. Each pattern includes when to reach for it, the failure modes that show up in production, and notes on implementation. The patterns are not mutually exclusive - real systems usually combine three or four of them.

Why patterns matter more than frameworks

The framework discourse (LangChain vs LlamaIndex vs LangGraph vs CrewAI vs raw SDK calls) misses the point. Frameworks are vehicles for patterns. If you understand the patterns, you can implement them in any framework, or in plain Python with the OpenAI or Anthropic SDK. If you do not understand the patterns, no framework will save you - you will end up with a fragile chain of abstractions that fails in ways you cannot debug.

Anthropic's engineering team published a useful taxonomy in late 2024 distinguishing workflows (predefined paths through code) from agents (LLM-directed control flow). The headline finding from their work with customers building production systems: most successful deployments use composable, well-defined workflows rather than open-ended agents. Agents are necessary when the task genuinely cannot be decomposed in advance. For everything else, structured workflows win on reliability, cost, and debuggability. See Anthropic's Building Effective Agents for the primary source.

The patterns below are ordered roughly by how often we reach for them in practice, starting with the simple ones that solve 80% of problems.

Pattern 1: Prompt chaining

Decompose a task into a fixed sequence of LLM calls, where each call's output feeds the next. The classic example: extract structured data from a document, validate it against business rules, generate a summary, and translate. Four calls, fixed order, no branching.

When to use it. The task has clear sequential subtasks and you want to trade latency for accuracy. Smaller, cheaper steps with focused prompts almost always beat one monolithic prompt that tries to do everything at once.

Failure modes. Error compounding. If each of four steps is 95% accurate and their errors are independent, your end-to-end accuracy is around 81% (0.95^4 ≈ 0.815). Mitigate with validation gates between steps - a quick structural check, a regex, or a cheap classifier that catches obvious failures before they propagate.
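A quick way to check the compounding arithmetic for a chain of any length. This is a minimal sketch that assumes step errors are independent, which real pipelines only approximate; the function name is illustrative.

```python
def end_to_end_accuracy(step_accuracies):
    """Multiply per-step accuracies, assuming step errors are independent."""
    total = 1.0
    for accuracy in step_accuracies:
        total *= accuracy
    return total


# Four chained steps, each right 95% of the time.
four_steps = end_to_end_accuracy([0.95] * 4)
print(f"{four_steps:.3f}")  # 0.815
```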

Implementation note. Use structured outputs (JSON schema, function calling, or Pydantic models with the Instructor library) at every boundary. Unstructured text passing between LLM calls is the single biggest source of production failures we see in client codebases. A field that should be a date arrives as "sometime next Tuesday" and breaks the downstream parser at 3am.
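One way to enforce a typed boundary between two steps might look like the sketch below. It uses only the standard library rather than Pydantic or Instructor so that it runs without dependencies; `Invoice`, `BoundaryError`, and `parse_invoice` are hypothetical names, and the fields are invented for illustration.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class Invoice:
    customer_id: str
    due_date: date
    amount_pence: int


class BoundaryError(ValueError):
    """Raised when a step's output does not match the expected structure."""


def parse_invoice(raw: str) -> Invoice:
    """Validation gate: turn one step's raw JSON output into a typed record."""
    try:
        data = json.loads(raw)
        return Invoice(
            customer_id=str(data["customer_id"]),
            due_date=date.fromisoformat(data["due_date"]),  # rejects free-text dates
            amount_pence=int(data["amount_pence"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise BoundaryError(f"step output rejected: {exc!r}") from exc


good = parse_invoice(
    '{"customer_id": "C-17", "due_date": "2026-06-02", "amount_pence": 12500}'
)

try:
    parse_invoice(
        '{"customer_id": "C-17", "due_date": "sometime next Tuesday", "amount_pence": 12500}'
    )
    rejected = False
except BoundaryError:
    rejected = True  # the bad date is caught at the boundary, not downstream
```

Failing loudly at the boundary means the chain can retry or escalate that one step instead of passing a malformed value to the next call.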

Pattern 2: Routing

A classifier (often an LLM, sometimes a small fine-tuned model or even a regex) decides which downstream workflow to invoke. Customer support is the canonical case: an inbound message gets classified as billing, technical, sales, or refund, and routes to a specialised chain for each.

When to use it. Inputs are heterogeneous and benefit from specialised handling. A single prompt trying to handle billing queries and technical troubleshooting will be mediocre at both. Two focused prompts, each tuned for one job, will outperform.

Failure modes. Misclassification cascades. If the router sends a security incident to the marketing queue, no downstream prompt will recover. Always include an "unknown" or "unclear" route that escalates to human review rather than forcing a low-confidence classification. Log router confidence scores and alert when the distribution shifts - it usually means new query types are appearing that you have not modelled.

Implementation note. Routing is one of the few places where a small fine-tuned model (or even a logistic regression on embeddings) often beats a large generalist model on cost and latency. If you are doing 100,000 routing decisions a day, the economics are obvious.
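The "unknown" route can be sketched as a thin wrapper around whatever classifier you use. Everything here is illustrative: the keyword classifier is a stand-in for an LLM or fine-tuned model, and the 0.7 confidence floor is an assumed value to tune against your own data.

```python
from dataclasses import dataclass

ROUTES = {"billing", "technical", "sales", "refund"}
CONFIDENCE_FLOOR = 0.7  # assumed threshold, not a recommendation

KEYWORDS = {
    "billing": ("invoice", "charge"),
    "technical": ("error", "crash"),
    "sales": ("pricing", "upgrade"),
    "refund": ("refund", "money back"),
}


@dataclass
class RouteDecision:
    label: str
    confidence: float


def classify(message: str) -> RouteDecision:
    """Stand-in classifier: keyword hits, with confidence as share of all hits."""
    text = message.lower()
    hits = {label: sum(word in text for word in words) for label, words in KEYWORDS.items()}
    best = max(hits, key=hits.get)
    total = sum(hits.values())
    confidence = hits[best] / total if total else 0.0
    return RouteDecision(best, confidence)


def route(message: str) -> str:
    """Return a known route, or 'unknown' so the message escalates to a human."""
    decision = classify(message)
    if decision.label not in ROUTES or decision.confidence < CONFIDENCE_FLOOR:
        return "unknown"
    return decision.label
```

In a real system you would also log `decision.confidence` for every message, so that a shift in its distribution can trigger an alert.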

Pattern 3: Retrieval-augmented generation (RAG)

The most discussed pattern, and the one most often implemented badly. Retrieve relevant context from a knowledge base, inject into the prompt, generate grounded response. The standard architecture: chunk documents, embed chunks, store in a vector database (we default to Postgres with pgvector for most mid-market deployments), retrieve top-k by similarity at query time, pass to the LLM with the question.

When to use it. The model needs access to information it was not trained on - private docs, recent data, customer-specific context. RAG is almost always cheaper, faster, and more maintainable than fine-tuning for this use case.

Failure modes. Naive top-k retrieval misses obvious matches. A query about "Q3 revenue" will not retrieve a document titled "Third quarter financials" if you rely only on dense embedding similarity. The fix is hybrid retrieval: combine dense (semantic) search with sparse (BM25 or keyword) search and rerank the combined results. Microsoft's research on hybrid retrieval shows consistent 10-30% improvements over either method alone on enterprise document sets.
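One common way to combine a dense ranking and a sparse ranking is reciprocal rank fusion (RRF). The article does not name a fusion method, so treat this as one option rather than the approach described above. The document ids are made up, `k=60` is the conventional default, and a reranker would normally run on the fused list afterwards.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc-a", "doc-b", "doc-c"]   # e.g. from embedding similarity
sparse_hits = ["doc-c", "doc-a", "doc-d"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

Documents that appear in both lists rise to the top, while a document found by only one retriever still survives into the candidate set.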

Implementation note. Chunking strategy matters more than vector database choice. Sentence-level chunks lose context; document-level chunks dilute relevance. For most business documents we use 500-800 token chunks with 100 token overlap, plus a parent-document retrieval pattern where the chunk is used for matching but the surrounding context is included in the prompt. Evaluate retrieval quality separately from generation quality - a RAG system with bad retrieval cannot be saved by a better LLM.
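Fixed-size chunking with overlap can be sketched in a few lines. The example assumes the document has already been tokenised into a list (the tokens here are placeholders), and uses 600-token chunks with a 100-token overlap as one point inside the range quoted above.

```python
def chunk_tokens(tokens, chunk_size=600, overlap=100):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1300)]  # placeholder tokens
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # [600, 600, 300]
```

For parent-document retrieval you would additionally store each chunk's start offset, so the surrounding context can be looked up once the chunk matches.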

Pattern 4: Tool use and function calling

The LLM does not generate the final answer directly; instead, it decides to call a tool (a function, an API, a database query) and uses the result. Reading customer data from a CRM, executing a calculation, hitting a search API, writing to a ticketing system - all tool calls.

When to use it. The task requires action in the real world, access to live data, or precise computation that LLMs are bad at. Never ask an LLM to do arithmetic on financial figures or check inventory levels from memory. Give it a tool.

Failure modes. Tools that fail silently. The LLM calls a tool, the tool returns an error, the LLM hallucinates a plausible-looking answer based on what the result probably would have been. Always surface tool errors back to the model explicitly, and design tools to return structured error responses that the model can reason about ("customer not found", "date range invalid") rather than raw exceptions.
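A sketch of what structured tool errors might look like. The CRM lookup, the `ToolError` class, and the error codes are all hypothetical; the point is that every outcome, including failure, comes back as data the model can read.

```python
CUSTOMERS = {"C-17": {"name": "Acme Ltd", "balance_pence": 12500}}  # toy data


class ToolError(Exception):
    """An expected failure with a short machine-readable code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def get_customer(customer_id):
    """Hypothetical CRM lookup."""
    if customer_id not in CUSTOMERS:
        raise ToolError("customer_not_found", f"no customer with id {customer_id}")
    return CUSTOMERS[customer_id]


def call_tool(tool, **kwargs):
    """Run a tool and always return a structured result, never a raw exception."""
    try:
        return {"ok": True, "result": tool(**kwargs)}
    except ToolError as exc:
        return {"ok": False, "error": exc.code, "message": str(exc)}
    except Exception as exc:  # unexpected failure: still reported, never swallowed
        return {"ok": False, "error": "tool_failed", "message": type(exc).__name__}


found = call_tool(get_customer, customer_id="C-17")
missing = call_tool(get_customer, customer_id="C-99")
```

The `missing` result is what gets passed back to the model as the tool output, so it can apologise, ask for a different id, or escalate instead of inventing a balance.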

Implementation note. Tool descriptions are prompts. The docstring you write for a function is what the LLM reads to decide whether and how to call it. Treat tool descriptions as carefully as you treat your main system prompt. OpenAI and Anthropic both publish guidance on writing effective tool descriptions in their respective API documentation - read it before you build.

Pattern 5: Evaluator-optimiser (the critique loop)

One LLM generates a response. A second LLM (or the same one with a different prompt) critiques it against criteria. The generator revises based on the critique. Loop until satisfied or until a budget is exhausted.

When to use it. Quality matters more than latency, and you can articulate explicit evaluation criteria. Code generation, content drafting, complex reasoning tasks where the first pass is usually close but not quite right.

Failure modes. Infinite loops where the critic and generator disagree forever. The critic that hallucinates standards the generator cannot meet. Mitigate with hard iteration caps (we default to three iterations), explicit "good enough" thresholds, and structured critic outputs that have to score against named criteria rather than producing open-ended feedback.

Implementation note. The critic prompt is harder to write than the generator prompt. It needs explicit success criteria, examples of good and bad outputs, and a structured response format. Investing time in the critic pays back across all generations.
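The loop itself is short once the generator and critic are in place. In this sketch both are stub functions standing in for LLM calls, the critic's output shape is an assumption, and the 0.8 threshold is illustrative; the three-iteration cap follows the default mentioned above.

```python
MAX_ITERATIONS = 3  # hard cap on generate-critique rounds
GOOD_ENOUGH = 0.8   # assumed threshold on a 0-1 score


def refine(task, generate, critique):
    """Generate, critique, revise; stop on a passing score or at the cap."""
    feedback = None
    best_draft, best_score = None, -1.0
    for iteration in range(1, MAX_ITERATIONS + 1):
        draft = generate(task, feedback)
        review = critique(task, draft)  # {"scores": {criterion: float}, "feedback": str}
        score = min(review["scores"].values())  # the weakest named criterion decides
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= GOOD_ENOUGH:
            break
        feedback = review["feedback"]
    return best_draft, best_score, iteration


def fake_generate(task, feedback):
    """Stub generator: adds a disclaimer only once the critic asks for it."""
    return task if feedback is None else f"{task} Disclaimer: not financial advice."


def fake_critique(task, draft):
    """Stub critic: scores against named criteria, not open-ended feedback."""
    has_disclaimer = "Disclaimer" in draft
    return {
        "scores": {"disclaimer_present": 1.0 if has_disclaimer else 0.0, "on_topic": 1.0},
        "feedback": "" if has_disclaimer else "Add the standard disclaimer.",
    }


draft, score, iterations = refine("Your balance is 125 GBP.", fake_generate, fake_critique)
```

Returning the best draft seen so far, rather than the last one, guards against a revision that makes things worse just before the cap is hit.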

Pattern 6: Parallelisation

Run multiple LLM calls in parallel, then aggregate. Two flavours: sectioning (decompose a task into independent subtasks that run concurrently) and voting (run the same prompt N times and take majority or best result).

When to use it. Sectioning is for tasks where subproblems are genuinely independent - extracting fields from a long document, processing a batch of items, multi-aspect analysis. Voting is for high-stakes decisions where consistency matters more than cost, or for content moderation where you want multiple independent judgements.

Failure modes. The aggregation step becomes a bottleneck. If you fan out to ten parallel calls and then ask one LLM to synthesise the results, the synthesis is doing most of the cognitive work and the parallelism bought you little. Make sure the aggregation is deterministic where possible (concatenation, structured merge, vote counting) rather than another expensive LLM call.
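For the voting flavour, deterministic aggregation can be as simple as counting. A minimal sketch, assuming each parallel call returns a single label; the rule that a winner needs more than half the votes is a design choice, not something the pattern requires.

```python
from collections import Counter


def majority_vote(labels, min_share=0.5):
    """Return the winning label, or None when no label has a clear majority."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_share else None


print(majority_vote(["allow", "block", "allow"]))  # allow
print(majority_vote(["allow", "block"]))           # None
```

A `None` result is a natural trigger for escalation to a human reviewer.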

Implementation note. Watch your rate limits. Parallel calls trip API quotas faster than you expect. Use a semaphore to cap concurrency at a level your account can sustain, and implement exponential backoff on 429 errors. For high-volume work, consider batch APIs (OpenAI and Anthropic both offer them at significant discount) when latency allows.
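A sketch of the semaphore-plus-backoff combination using `asyncio`. `RateLimitError` stands in for whatever exception your SDK raises on a 429, the worker is a fake that throttles even-numbered items once, and the delays are kept tiny so the example runs quickly; real base delays are closer to a second.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 from the provider's API."""


async def with_backoff(make_call, max_retries=5, base_delay=0.01):
    """Retry a coroutine factory on rate limits, doubling the delay each time."""
    for attempt in range(max_retries + 1):
        try:
            return await make_call()
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay))  # add jitter


async def run_all(items, worker, max_concurrency=4):
    """Fan out over items, never running more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await with_backoff(lambda: worker(item))

    return await asyncio.gather(*(guarded(item) for item in items))


attempts = {}


async def flaky_worker(item):
    """Fake LLM call: even-numbered items are rate-limited on the first try."""
    attempts[item] = attempts.get(item, 0) + 1
    if attempts[item] == 1 and item % 2 == 0:
        raise RateLimitError
    await asyncio.sleep(0)
    return item * 10


results = asyncio.run(run_all(range(6), flaky_worker))
print(results)  # [0, 10, 20, 30, 40, 50]
```

`asyncio.gather` returns results in input order, so the aggregation step does not need to re-sort them.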

Pattern 7: Human-in-the-loop checkpoints

The workflow pauses at defined points for human review before continuing. Used for high-stakes actions (sending external communications, executing financial transactions, publishing public content) and for ambiguous cases that fall below confidence thresholds.

When to use it. The cost of a wrong action is high relative to the cost of human time. Regulatory or compliance contexts where audit trails matter. New deployments where you do not yet trust the system enough to let it run unattended.

Failure modes. Review queues that grow faster than humans can clear them. The system either accumulates a backlog that destroys the productivity gain, or humans start rubber-stamping to keep up, which is worse than not having review at all. Track queue depth, time-to-review, and disagreement rate as core metrics.

Implementation note. Design the review UI carefully. The reviewer needs to see the input, the model's proposed output, the model's confidence and reasoning, and ideally a one-click approve, one-click edit, one-click reject flow. Anything that takes more than 30 seconds per item will not scale. Capture the human decisions as training data for the next iteration - this is how you eventually shrink the review queue.
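The checkpoint decision and one of the health metrics can be sketched as two small functions. The action names, the 0.85 confidence threshold, and the decision labels are assumptions for illustration.

```python
from dataclasses import dataclass

HIGH_STAKES_ACTIONS = {"send_external_email", "execute_payment", "publish_content"}
CONFIDENCE_THRESHOLD = 0.85  # assumed value; set from your own disagreement data


@dataclass
class ProposedAction:
    kind: str
    confidence: float


def needs_review(action):
    """Pause for a human on high-stakes actions or low-confidence outputs."""
    return action.kind in HIGH_STAKES_ACTIONS or action.confidence < CONFIDENCE_THRESHOLD


def disagreement_rate(decisions):
    """Share of reviewed items where the human edited or rejected the proposal."""
    if not decisions:
        return 0.0
    changed = sum(decision in {"edit", "reject"} for decision in decisions)
    return changed / len(decisions)


print(needs_review(ProposedAction("execute_payment", 0.99)))          # True
print(disagreement_rate(["approve", "edit", "approve", "reject"]))    # 0.5
```

A disagreement rate that stays near zero while queue depth climbs is one signal that reviewers may be rubber-stamping.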

Pattern 8: Agent loops (use sparingly)

The LLM directs its own control flow: it decides what tools to call, in what order, when to stop, based on the evolving state of the task. This is the "agent" in the strict sense - autonomous, multi-step, goal-directed.

When to use it. The task genuinely cannot be decomposed in advance. Open-ended research, complex debugging, tasks where the path to the answer depends on intermediate results that cannot be predicted. Cursor, Claude Code, and Devin are agent-shaped because software engineering is genuinely open-ended.

Failure modes. Cost explosions when the agent loops longer than expected. Drift, where the agent gradually loses sight of the original goal. Tool misuse, where the agent finds creative ways to use tools that you did not anticipate. Always implement hard budgets (max iterations, max tokens, max wall-clock time) and observability that lets you see what the agent did and why.
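Hard budgets can be enforced outside the model, in the loop that drives it. In this sketch the `step` callable stands in for one agent turn and its return shape is an assumption; the default limits are illustrative. The trace doubles as a minimal record of what the agent did.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    pass


@dataclass
class Budget:
    max_iterations: int = 10
    max_tokens: int = 50_000
    max_seconds: float = 120.0
    iterations: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def check(self):
        """Raise before a new step starts if any limit has been reached."""
        if self.iterations >= self.max_iterations:
            raise BudgetExceeded("max iterations")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded("max tokens")
        if time.monotonic() - self.started >= self.max_seconds:
            raise BudgetExceeded("max wall-clock time")

    def record(self, tokens_used):
        self.iterations += 1
        self.tokens += tokens_used


def run_agent(step, budget):
    """Call step() until it reports done or the budget runs out."""
    trace = []
    try:
        while True:
            budget.check()
            result = step(trace)  # assumed shape: {"done": bool, "tokens": int, "note": str}
            budget.record(result["tokens"])
            trace.append(result["note"])
            if result["done"]:
                return "finished", trace
    except BudgetExceeded as exc:
        return f"stopped: {exc}", trace


def looping_step(trace):
    """Fake agent turn that never finishes, to show the cap working."""
    return {"done": False, "tokens": 100, "note": f"step {len(trace) + 1}"}


status, trace = run_agent(looping_step, Budget(max_iterations=3))
print(status, trace)  # stopped: max iterations ['step 1', 'step 2', 'step 3']
```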

Implementation note. Most teams reaching for agents should be using a routed workflow with checkpoints instead. The question to ask: can I list the steps in advance? If yes, you do not need an agent, you need a workflow. Agents are the right answer for maybe 10% of the use cases where they get reached for.

Composing the patterns: a worked example

A real customer support automation we built last year for a fintech client combines six of these patterns. Inbound email arrives. Routing classifies into one of seven categories. For "account query", a prompt chain extracts the customer identifier, validates the request type, and pulls account state via tool calls to the core banking API. A RAG step retrieves relevant policy documents. The generator produces a draft response. An evaluator checks the draft against compliance criteria (no financial advice, no PII leakage, correct disclaimer present). If the evaluator scores below threshold, the draft routes to a human reviewer; if above, it sends automatically.

Six patterns, one workflow, 80% of tickets handled without human touch, 100% of high-risk responses reviewed. The agent pattern appears nowhere - this system never needs to decide its own next step, because we already know the steps.

That is the underlying lesson. Production AI workflows are mostly engineering, not machine learning. The patterns are not glamorous. They are the difference between a system that works and one that does not.

Frequently asked questions

How do I decide between building a workflow and building an agent?

Write out the steps required to complete the task. If you can list them in order, build a workflow - even if it has routing and branching. If the steps depend on intermediate results in ways you genuinely cannot predict ("I will not know what to search for until I have read the first document"), an agent may be justified. In practice, 80-90% of use cases that teams initially scope as agents are better served by structured workflows with branching and checkpoints. Agents are more expensive, harder to debug, and less reliable. Reach for them when the task demands it, not by default.

Which framework should I use to implement these patterns?

For most teams: start with the raw vendor SDK (OpenAI, Anthropic) and add a thin orchestration layer of your own. The patterns above are 50-200 lines of Python each. LangGraph and LlamaIndex are reasonable choices when complexity grows, particularly for stateful agent loops or complex RAG. We default to LangGraph for agent-shaped work and a custom orchestration layer for everything else. Avoid premature framework adoption - it locks you into abstractions you have not yet learned to need. The framework you pick matters far less than getting the patterns right.

How do I evaluate whether a workflow is good enough for production?

Build an evaluation set before you build the workflow. 50-200 representative inputs with known good outputs, ideally drawn from real historical data. Score the workflow on accuracy, latency, and cost against the evaluation set on every code change. Track production metrics separately: end-to-end success rate, human escalation rate, customer satisfaction, error rates per pattern. A workflow that scores 95% on offline evals but escalates 40% of production traffic to humans is not production-ready - the evaluation set is unrepresentative.
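A bare-bones evaluation harness might look like the sketch below. It measures accuracy by exact match and records latency; cost is left out because it depends on token counts reported by your provider. The toy workflow and the four-example evaluation set are invented for illustration.

```python
import time


def evaluate(workflow, eval_set):
    """Run workflow over (input, expected) pairs; report accuracy and latency."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    correct = 0
    latencies = []
    failures = []
    for example_input, expected in eval_set:
        started = time.perf_counter()
        actual = workflow(example_input)
        latencies.append(time.perf_counter() - started)
        if actual == expected:
            correct += 1
        else:
            failures.append((example_input, expected, actual))
    return {
        "accuracy": correct / len(eval_set),
        "mean_latency_s": sum(latencies) / len(latencies),
        "failures": failures,  # keep these: they are the debugging starting point
    }


def toy_workflow(message):
    """Stand-in for the real workflow under test."""
    return "refund" if "refund" in message.lower() else "billing"


EVAL_SET = [
    ("Please refund my order", "refund"),
    ("Why was I charged twice?", "billing"),
    ("I want my money back", "refund"),
    ("Update my invoice address", "billing"),
]

report = evaluate(toy_workflow, EVAL_SET)
print(report["accuracy"])  # 0.75
```

Running this on every code change, and failing the build when accuracy drops below an agreed floor, is what turns the evaluation set into a regression test.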

What does a typical AI workflow cost to run at scale?

Highly dependent on volume, model choice, and pattern composition. As a rough benchmark from recent client work: a routing-plus-RAG-plus-generation customer support workflow handling moderate complexity queries costs around £0.02 to £0.08 per ticket using mid-tier models (GPT-4o mini, Claude Haiku, Gemini Flash). A complex multi-step extraction and analysis workflow on long documents using frontier models can run £0.50 to £2 per document. Token costs have fallen roughly 80% year-on-year for equivalent capability, so build with an architecture that lets you swap models without restructuring.

How do I handle data privacy and GDPR when sending data to LLM APIs?

Three practical layers. First, use enterprise API tiers (OpenAI, Anthropic, Google all offer zero-retention agreements and signed DPAs) rather than consumer APIs. Second, minimise what you send - redact PII before the LLM call where possible, use pseudonymisation, do not send entire customer records when a single field will do. Third, for genuinely sensitive workloads (health data, financial detail, legal privilege), consider self-hosted open models (Llama 3, Mistral, Qwen) on infrastructure you control. The ICO's guidance on AI and data protection is the authoritative UK reference and worth reading in full before any production deployment.
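As a rough illustration of redacting before the LLM call, the sketch below masks email addresses and long digit runs with regular expressions. Both patterns are assumptions and will miss plenty of real PII (names, addresses, formatted phone numbers), so treat this as a first filter rather than a complete solution.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_NUMBER = re.compile(r"\b\d{8,}\b")  # account-like digit runs; assumed pattern


def redact(text):
    """Replace obvious identifiers with placeholders before sending text out."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


print(redact("Contact jo.bloggs@example.com about account 12345678."))
# Contact [EMAIL] about account [NUMBER].
```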

How long does it take to build a production AI workflow?

A single-pattern workflow (e.g. straightforward RAG over a defined document set) can be in production in 3-6 weeks for a competent team. A multi-pattern workflow with routing, tool use, evaluation, and human review typically runs 8-16 weeks from kickoff. The build time is usually dominated by the unglamorous work: data preparation, evaluation harness, observability, integration with existing systems, and the change-management work to get human reviewers and operators trained. Budget 40% for the LLM-specific work and 60% for the surrounding engineering and operations.

What is the most common mistake teams make when building these workflows?

Skipping evaluation infrastructure. Teams build a workflow, see it work on a handful of test inputs, ship it, and then have no way to detect when it starts failing in production. The second most common mistake is reaching for agents when a workflow would do, which produces systems that are expensive, slow, and unreliable. The third is over-engineering retrieval - building elaborate RAG pipelines when the underlying problem is poor source document quality or weak chunking. Fix the data before you fix the retrieval algorithm.

Getting these patterns into production

The patterns above are simple to describe and harder to compose well. The choices that determine whether a workflow survives contact with production - chunking strategy, evaluation criteria, error handling between steps, where to put human checkpoints - are the ones that only become obvious after you have shipped a few of these systems. If you would value a second pair of eyes on a workflow you are scoping or building, AI Advisory works with mid-market teams on exactly this kind of work; the contact form is the fastest route to a conversation.

