AI Workflow Agency
AI · 5 min read

Custom AI Development Agency: How to Choose, Scope and Run One

A practical guide to choosing a custom AI development agency: pricing, scoping, vendor checks, GDPR, build vs buy, and what good delivery looks like

By AI Advisory team

Hiring a custom AI development agency is now a normal line item for mid-market operations, engineering and product teams. The market has moved fast: in 2023 most procurement conversations were about pilots and proofs of concept; by late 2025 they are about production systems, evaluation harnesses, and how to keep models behaving when the underlying APIs change. This guide is for buyers who have decided that an agency makes more sense than an internal build, and who now need to scope the work, vet vendors, and run the engagement without losing control of cost or quality.

The framing assumed throughout: you are buying a system that needs to keep working, not a one-off deliverable. That changes who you hire and how you contract.

What a custom AI development agency actually does

The phrase covers a wide range of work. At one end, an agency might build a single retrieval-augmented generation (RAG) assistant against your document store: ingestion pipeline, vector index, query rewriting, evaluation harness, and a thin UI. At the other end, the same agency might run a multi-agent system that handles inbound sales qualification across CRM, calendar, and email, with human handoff and audit logging.

What separates a custom agency from a SaaS implementation partner is that the agency writes code. A HubSpot partner configures HubSpot. An AI development agency builds the bit that does not exist yet: a custom retriever, a fine-tuned classifier, an orchestration layer between three vendor APIs, or a RAG pipeline grounded in your specific data with your specific refusal rules.

Typical scope areas:

  • RAG systems over internal knowledge bases, customer documentation, contracts, or product catalogues.
  • Workflow automations that include an LLM step but also handle the unsexy parts: webhooks, idempotency, retries, dead-letter queues.
  • Internal AI tools for ops, sales, or analyst teams - the kind of thing that used to be a brittle spreadsheet plus three Zapier zaps.
  • Customer-facing assistants with guardrails, escalation paths, and evaluation pipelines.
  • Multi-agent orchestration where several LLM calls coordinate to complete a task, with tool use and structured outputs.
  • Fine-tuning when prompting and retrieval are demonstrably insufficient - which is rarer than vendors admit.
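
To make the "unsexy parts" in the workflow-automation bullet concrete, here is a minimal sketch of idempotent event handling with bounded retries and a dead-letter list. Everything in it is an assumption for illustration: the event schema (a dict with an "id" key), the `TransientError` type, and the in-memory `seen` set and `dead_letters` list, which a real system would replace with durable storage and a proper queue, plus backoff between attempts.

```python
class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, rate limits)."""


def process_events(events, handler, max_attempts=3):
    """Handle each webhook event once; park events that keep failing.

    events: iterable of dicts with a unique "id" key (assumed schema).
    handler: callable that runs the LLM step and any downstream writes.
    Returns (results, dead_letters).
    """
    seen = set()        # idempotency keys; persist these in a real system
    results = {}
    dead_letters = []   # stand-in for a dead-letter queue
    for event in events:
        key = event["id"]
        if key in seen:
            continue    # duplicate delivery: do not process twice
        seen.add(key)
        for attempt in range(1, max_attempts + 1):
            try:
                results[key] = handler(event)
                break
            except TransientError as exc:
                # A real system would wait with backoff before the next attempt.
                if attempt == max_attempts:
                    dead_letters.append({"event": event, "error": str(exc)})
    return results, dead_letters


# Usage: one duplicate delivery and one event that always fails.
calls = {"n": 0}


def flaky_handler(event):
    calls["n"] += 1
    if event["id"] == "evt-2":
        raise TransientError("upstream timeout")
    return f"handled {event['id']}"


events = [{"id": "evt-1"}, {"id": "evt-1"}, {"id": "evt-2"}]
results, dead_letters = process_events(events, flaky_handler)
print(results, dead_letters)
```

The duplicate is skipped rather than reprocessed, and the failing event ends up somewhere a human can inspect it instead of disappearing. When you vet an agency, ask where the equivalent of `seen` and `dead_letters` lives in their systems.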

The mistake buyers make is shopping for a model rather than a system. The model is a component. The system is the contract.

When to hire an agency vs build in-house

The honest answer is that this depends on what your engineering team already does. A useful test: if you asked your head of engineering to ship a production RAG system in eight weeks, would they say yes confidently, yes nervously, or no? The nervous yes is where agencies earn their fee.

Hire an agency when:

  • You need the first production system shipped in under three months and your team is busy with the core product.
  • You want the patterns established by someone who has shipped this category before, so your team can take it over later without inheriting an architectural mistake.
  • The work spans disciplines your team does not have together: ML evaluation, prompt engineering, retrieval tuning, infrastructure, and product design.
  • The cost of getting it wrong (regulatory exposure, customer-visible hallucinations) is higher than the cost of the engagement.

Build in-house when:

  • The system is core IP and will need continuous iteration tied tightly to the product roadmap.
  • You already have ML or platform engineers with relevant experience.
  • The use case is well-understood and the patterns are public (a basic internal chatbot over a documentation site, for example).

A common hybrid model works well: the agency builds the first one or two systems, documents the patterns, runs them in production for six to twelve months, then transitions ownership to an internal team. McKinsey's State of AI reporting has consistently found that organisations getting measurable EBIT impact from AI tend to combine external delivery capacity with internal product ownership, rather than picking one model exclusively.

Pricing: what to expect in 2026

UK mid-market pricing for custom AI development clusters into three bands. These are blended day rates and project totals from agency benchmark data (Productive's annual agency benchmark, the UK Agencies Benchmark Report, and observed RFP responses), not list prices on websites.

Discovery and strategy phase: £8,000 to £25,000 fixed fee for a two- to four-week engagement that produces a costed roadmap, technical architecture, and prioritised backlog. If an agency wants to skip this and quote a full build cold, that is a signal to slow down.

First production build: £25,000 to £150,000 depending on integration count, data complexity, and evaluation rigour. A single-source RAG assistant with a clean document corpus and one channel sits at the bottom. A multi-system workflow with CRM writes, compliance logging, and human-in-the-loop review sits at the top.

Ongoing operation: £3,000 to £15,000 per month retained, covering monitoring, evaluation, prompt and retrieval iteration, model upgrades, and incident response. Agencies that try to ship and walk away tend to deliver systems that quietly degrade as models change underneath them.

Day rates for senior engineers on these engagements in the UK currently sit at £900 to £1,400. Below that, you are typically buying junior delivery with senior oversight; above that, you are buying named senior practitioners or specialist domains (regulated industries, multi-agent research work).

The two pricing models that work in practice: fixed-fee discovery followed by time-and-materials build with a not-to-exceed cap, or milestone-based fixed pricing where each milestone has an acceptance test written before work starts. Avoid pure time-and-materials with no cap - it transfers all the estimation risk to you and leaves the agency with no incentive to ship efficiently.
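
As a rough illustration of how a not-to-exceed cap moves estimation risk, the sketch below bills time-and-materials up to the cap and shows the overrun the agency absorbs. The function name and the figures (60 days, a £1,100 day rate, a £60,000 cap) are hypothetical; actual contracts may define the cap differently.

```python
def capped_invoice(days_worked, day_rate, cap):
    """Time-and-materials billing with a not-to-exceed cap (all amounts in GBP)."""
    uncapped = days_worked * day_rate
    billed = min(uncapped, cap)
    return {
        "uncapped": uncapped,
        "billed": billed,
        "absorbed_by_agency": uncapped - billed,
    }


# Hypothetical engagement: the build overruns the estimate by a few days.
invoice = capped_invoice(days_worked=60, day_rate=1100, cap=60000)
print(invoice)
```

With these illustrative numbers the work costs £66,000 at day rate, you pay £60,000, and the agency carries the £6,000 difference. Without the cap, that difference lands on you.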

How to vet a custom AI development agency

Most agency selection processes overweight credentials and underweight delivery evidence. The credentials look similar across vendors; the delivery evidence does not.

Ask for the following:

  1. Two production systems you can actually use. Not screenshots, not case studies - a live URL or a screen-shared demo. If they cannot show you something running, they have not shipped recently.
  2. The evaluation harness for one of those systems. Any agency doing serious LLM work has a way to measure whether a prompt or retrieval change made things better or worse. If they do not, every iteration is guesswork.
  3. Their incident log for the last six months. Real production systems break. You want to see that they detected, diagnosed and resolved issues - not that they had none.
  4. The handover artefact from a previous engagement. Runbooks, architecture diagrams, decision logs. This tells you whether you will own the system at the end or be locked in.
  5. The named team who will deliver. Senior partners often pitch, juniors often deliver. Get CVs of the actual engineers and a contractual commitment that they will not be swapped without notice.

Also worth checking: do they have a position on open-weights vs frontier models, on self-hosting vs managed APIs, on n8n vs custom code? An agency without opinions on these tradeoffs will default to whatever is easiest for them rather than what fits your situation. Strong opinions, loosely held, is what you want.

For UK buyers in regulated sectors, ask specifically about UK GDPR handling, data processing agreements, sub-processor disclosure, and whether they have implemented systems aligned with the ICO's guidance on AI and data protection. The right answer is detailed and specific; the wrong answer is "we follow GDPR".

Scoping the first engagement

The single biggest predictor of a successful agency engagement is whether the first project is scoped narrowly enough to ship in 8 to 12 weeks. Wider scopes fail not because the agency is bad but because the feedback loop is too long - by the time anything ships, the model landscape has moved and half the assumptions are stale.

A good first scope has these properties:

  • One primary user. Not "sales and marketing and ops". Pick one team.
  • One primary integration. If the system needs to write to three systems, scope the first version to read from three and write to one.
  • A measurable success metric defined before kickoff. Examples: "40% of inbound support queries handled without human touch with CSAT above 4.0", "sales reps save 30 minutes per day on CRM updates", "document review time reduced from 45 minutes to 10".
  • An evaluation set built in week one. 100 to 300 representative inputs with expected behaviours, used to measure every subsequent change.
  • A clear refusal pattern. What does the system do when it does not know? Saying "I don't know" is a feature, not a bug.

Anything that does not fit in the first scope goes into a phase-two backlog. The discipline of refusing scope creep in phase one is what makes phase two possible.
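
An evaluation set does not need heavy tooling to be useful in week one. The sketch below is one possible shape, not a prescribed format: each case pairs an input with a substring an acceptable answer must contain, and refusal is treated as an expected behaviour like any other. `EvalCase`, `run_eval`, `toy_system` and the sample questions are all hypothetical, and substring matching is a deliberately crude stand-in for whatever scoring the real system needs.

```python
from dataclasses import dataclass

REFUSAL = "I don't know"


@dataclass(frozen=True)
class EvalCase:
    question: str
    must_contain: str  # substring an acceptable answer has to include


def run_eval(system, cases):
    """Run every case through `system`; return (pass_rate, failed_questions)."""
    failures = [
        case.question
        for case in cases
        if case.must_contain.lower() not in system(case.question).lower()
    ]
    return 1 - len(failures) / len(cases), failures


# Toy stand-in for the system under test: answers what it knows, refuses otherwise.
KNOWN = {"What is the refund window?": "Refunds are accepted within 30 days."}


def toy_system(question):
    return KNOWN.get(question, REFUSAL)


cases = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("What is the CEO's home address?", REFUSAL),  # refusal is the correct behaviour
    EvalCase("Do you ship to Norway?", "yes"),
]
pass_rate, failures = run_eval(toy_system, cases)
print(f"pass rate {pass_rate:.0%}, failures: {failures}")
```

The same list of cases, locked and versioned, is what every later prompt, retrieval or model change gets measured against.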

Common failure modes and how to avoid them

Six patterns account for most failed custom AI engagements:

1. Demo-driven development. The agency builds something that demos well but breaks on real inputs. Mitigation: insist on an evaluation harness from week one and run it against production-like data, not curated examples.

2. Vendor lock-in by accident. The system is built so tightly around one model provider that switching costs three months of rework. Mitigation: require an abstraction layer between application logic and model calls, and ask to see a test that runs the same prompts against two providers.

3. No observability. The system runs in production but nobody can answer "why did it say that?" when a user complains. Mitigation: require structured logging of every LLM call with inputs, outputs, model version, and trace ID from day one.

4. Hallucination tolerance creep. Early in the project, the team accepts a 5% hallucination rate as a known limitation; by month six, it is 12% and nobody noticed. Mitigation: lock the evaluation set, run it weekly, alert on regression.

5. Hidden inference costs. The system works but costs £8,000 a month in tokens because nobody optimised the prompt length or cached responses. Mitigation: include unit economics in the success metric (cost per resolved query, cost per generated document) from the start.

6. The handover that never happens. Six months in, the agency still owns the system because no internal team has time to pick it up. Mitigation: name the receiving team at kickoff and budget their time into the engagement, not after it.
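
Failure modes 3 and 5 share a fix: if every LLM call is logged as a structured record, both "why did it say that?" and "what does a resolved query cost?" become queries over the log. The sketch below assumes a hypothetical record shape and made-up token prices in GBP per 1,000 tokens; the field names, the `resolved` flag and the prices are illustrative, not any provider's actual schema or tariff.

```python
import json
import uuid
from dataclasses import dataclass, asdict

# Hypothetical prices in GBP per 1,000 tokens; real prices vary by provider and model.
PRICE_IN_PER_1K = 0.002
PRICE_OUT_PER_1K = 0.008


@dataclass
class LLMCallRecord:
    trace_id: str
    model_version: str
    prompt: str
    output: str
    input_tokens: int
    output_tokens: int
    resolved: bool  # did this call resolve the user's query without a human?


def log_line(record):
    """Serialise one call as a single JSON line for the log pipeline."""
    return json.dumps(asdict(record), sort_keys=True)


def call_cost(record):
    return (record.input_tokens / 1000 * PRICE_IN_PER_1K
            + record.output_tokens / 1000 * PRICE_OUT_PER_1K)


def cost_per_resolved_query(records):
    """Total token spend divided by resolved queries; None if nothing was resolved."""
    resolved = sum(r.resolved for r in records)
    if resolved == 0:
        return None
    return sum(call_cost(r) for r in records) / resolved


records = [
    LLMCallRecord(uuid.uuid4().hex, "example-model-2025-01",
                  "Where is my order?", "It ships tomorrow.", 1000, 500, True),
    LLMCallRecord(uuid.uuid4().hex, "example-model-2025-01",
                  "Cancel my contract", "Escalating to a human.", 2000, 250, False),
]
for record in records:
    print(log_line(record))
print(cost_per_resolved_query(records))
```

Note that unresolved calls still count towards spend, which is the point: a system that escalates most queries can look cheap per call and expensive per outcome.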

Recent Fortune coverage of AI implementation in mid-market firms has highlighted that the gap between organisations capturing measurable value and those still in pilot phase is now defined less by model quality and more by exactly these operational discipline issues. The technology is no longer the limiting factor.

What good looks like after six months

If the engagement is going well, six months in you should see: a system running in production with documented uptime, an evaluation harness that runs on every change, a monthly cost that matches the original estimate within 20%, an internal owner identified and trained, a backlog of phase-two work prioritised against measured impact, and an agency relationship that has shifted from "build" to "iterate and operate".

If instead you see: a system that works in demos but not in production, no evaluation data, costs trending up with no explanation, no internal owner, and the agency still doing all the work, the engagement is in trouble regardless of what the status reports say. The fix is usually not more budget; it is reverting to a smaller scope and rebuilding the operational discipline.

Frequently asked questions

How long does a typical custom AI build take?

For a well-scoped first system, plan on 8 to 16 weeks from kickoff to production. Weeks one and two are discovery, architecture and evaluation set construction. Weeks three to ten are build and iteration, with working software demonstrable from week four. Weeks eleven to sixteen cover hardening, observability, security review and handover. Anything quoted at four weeks is either a thin prototype or has skipped the parts that matter in production. Anything quoted at six months has scope that should be broken into two engagements - long projects fail at a much higher rate than short ones, and the model landscape moves too quickly for half-year build cycles to be safe.

What is a realistic budget for a first project?

For UK mid-market buyers, budget £30,000 to £100,000 for a first production system, plus £3,000 to £10,000 per month for ongoing operation. The lower end covers a single-channel RAG assistant against a clean data source with one integration. The higher end covers multi-step workflows, multiple integrations, evaluation infrastructure and audit logging. Below £25,000 you are typically buying a prototype, which is fine if you label it as such but a mistake if you expect production behaviour. Above £150,000 for a first project usually means the scope is too wide and would be better split into two sequential engagements.

Should we use an agency or hire AI engineers directly?

The right answer depends on whether AI systems are core to your product or a supporting capability. If they are core - the system is what customers pay for - hire in-house and use an agency only for surge capacity or specialist work. If they are a supporting capability - making your existing operations more efficient - an agency model typically delivers faster and at lower total cost for the first two to three years. The hybrid that works best: agency builds the first two systems and runs them for six to twelve months, then internal hires take over operation while the agency stays on for new builds.

How do we handle UK GDPR and data protection?

Three things matter. First, a Data Processing Agreement with the agency that names every sub-processor (model providers, vector databases, monitoring tools) and commits to notifying you of changes. Second, a clear position on data residency - whether prompts and outputs are processed in the UK, EU, or US, and whether they are used for model training (they should not be, on enterprise tiers). Third, a Data Protection Impact Assessment for any system processing personal data, following ICO guidance. Agencies that have done this work before will have templates; ones that have not will struggle, and that struggle is itself diagnostic.

How do we avoid vendor lock-in with the agency itself?

Contract for ownership of code, prompts, evaluation data and infrastructure-as-code from day one. Any dependence on the agency should come from the quality of what they deliver, not from their holding the only working knowledge of the system: if you replace them, the next team should be able to take over without rebuilding. Specific clauses to require: code in a repository you own, infrastructure in your cloud accounts, prompts and evaluation sets as versioned artefacts in your repository, monthly handover-ready documentation, and a written runbook for each production system. If the agency resists any of these, treat that as a serious signal.

What is the difference between RAG, fine-tuning and prompt engineering, and which do we need?

Prompt engineering is shaping the instructions sent to a model. RAG (retrieval-augmented generation) is fetching relevant context from your data and including it in the prompt. Fine-tuning is training the model itself on examples. The decision order is almost always: prompt engineering first, then RAG, then fine-tuning only if the first two are demonstrably insufficient. Most custom AI systems in production today are RAG plus careful prompting; fine-tuning is justified roughly 15% of the time, mostly for style consistency, structured output reliability, or domain-specific classification. An agency that defaults to fine-tuning early is usually solving the wrong problem.
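
To show how the first two techniques fit together, the sketch below assembles a RAG prompt: fixed instructions (the prompt engineering) plus context fetched for the specific question (the retrieval). The retriever here is a toy that ranks documents by shared words; production systems typically use embeddings and a vector index instead. The function names, instruction text and sample documents are all hypothetical.

```python
INSTRUCTIONS = (
    "Answer using only the context below. "
    "If the context does not contain the answer, say you do not know."
)


def retrieve(question, documents, k=2):
    """Toy retriever: rank documents by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(question, documents, k=2):
    """Combine fixed instructions with retrieved context and the user's question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"


documents = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on bank holidays.",
    "Shipping to the EU takes five working days.",
]
question = "How many days do refunds take?"
prompt = build_rag_prompt(question, documents, k=1)
print(prompt)
```

Nothing in this sketch changes the model itself, which is why both techniques are cheaper to iterate on than fine-tuning and come first in the decision order.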

How do we measure ROI on a custom AI system?

Define the metric before the build, not after. The strongest metrics are time-based (hours saved per week per user), volume-based (queries resolved without human touch), or quality-based (error rate reduction in a defined process). Avoid vague metrics like "improved productivity". Measure baseline for two to four weeks before launch, then measure the same metric monthly after. Subtract total cost (build amortised over expected life plus monthly operation) from value created (hours saved times loaded cost, or revenue uplift) to get net return; divide that by total cost if you want ROI as a percentage. Systems that cannot show positive ROI within six months of production usually cannot show it at twelve either - cut early.
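
The arithmetic described above can be sketched as follows. All figures are hypothetical (a £60,000 build amortised over 24 months, £5,000 a month to operate, 200 hours saved a month at a £45 loaded hourly cost), and the `roi` ratio, net return divided by total cost, is added here for illustration.

```python
def net_return(build_cost, expected_life_months, monthly_operating_cost,
               hours_saved_per_month, loaded_hourly_cost, months):
    """Net return over `months` in production, time-based metric only (amounts in GBP)."""
    amortised_build = build_cost / expected_life_months * months
    total_cost = amortised_build + monthly_operating_cost * months
    value = hours_saved_per_month * loaded_hourly_cost * months
    net = value - total_cost
    return {
        "total_cost": total_cost,
        "value": value,
        "net": net,
        "roi": net / total_cost,
    }


# Hypothetical six-month review of a first system.
review = net_return(
    build_cost=60000,
    expected_life_months=24,
    monthly_operating_cost=5000,
    hours_saved_per_month=200,
    loaded_hourly_cost=45,
    months=6,
)
print(review)
```

The result is sensitive to the expected life you amortise over, so agree that number with finance before the build rather than choosing it afterwards to make the figures work.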

What happens when the underlying models change?

Model providers release new versions every three to six months, and deprecate older versions on similar cycles. A well-built system handles this through an abstraction layer (so the model is configurable, not hard-coded), an evaluation harness (so you can measure whether a new model improves or regresses behaviour on your specific use case), and a monthly review cadence with the operating team. Expect to upgrade models two to four times a year. Agencies that have only ever shipped one version of a system tend to underestimate this; ones that have operated systems for 18 months or more build for it from day one.
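
A minimal sketch of that combination, under stated assumptions: models sit behind a registry keyed by a configurable name, and a candidate is only promoted if it does not regress on the evaluation set. The two "models" here are hard-coded stand-ins rather than real provider clients, and the registry, names and exact-match scoring are hypothetical choices for illustration.

```python
MODELS = {}  # name -> callable taking a prompt and returning text


def register(name):
    """Register a model callable under a configurable name."""
    def wrap(fn):
        MODELS[name] = fn
        return fn
    return wrap


@register("provider-a-v1")
def provider_a_v1(prompt):
    # Stand-in for a real API client.
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


@register("provider-b-v2")
def provider_b_v2(prompt):
    # Stand-in for a newer model that regresses on one case.
    return {"2+2": "4"}.get(prompt, "unknown")


EVAL_SET = [("2+2", "4"), ("capital of France", "Paris")]


def pass_rate(model_name, eval_set):
    model = MODELS[model_name]
    passed = sum(model(prompt) == expected for prompt, expected in eval_set)
    return passed / len(eval_set)


def safe_to_switch(current, candidate, eval_set, tolerance=0.0):
    """Allow the switch only if the candidate is no worse than current, within tolerance."""
    return pass_rate(candidate, eval_set) >= pass_rate(current, eval_set) - tolerance


ACTIVE_MODEL = "provider-a-v1"  # configuration, not hard-coded at call sites
print(safe_to_switch(ACTIVE_MODEL, "provider-b-v2", EVAL_SET))
```

Because application code only ever refers to `ACTIVE_MODEL`, running the same evaluation set against two providers is a configuration change, which is also the test to ask an agency to demonstrate.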

Where to go next

If you are evaluating custom AI development agencies for a first project, the most useful next step is not another vendor call - it is writing down the success metric and the evaluation set for the first system you want built. Doing that work before procurement clarifies scope, sharpens vendor conversations, and makes contracts measurable. AI Advisory runs a two-week fixed-fee discovery designed exactly for this stage; whether you work with us or someone else, do not skip the discovery step.
