AI | 31 May 2026 | 5 min read

AI Chatbot Builds: How Specialist Agencies Approach the Work

What an AI automation agency does when building a production chatbot - architecture, retrieval, evaluation, integration and cost - explained for buyers

By AI Advisory team

The market for AI chatbots split into two camps somewhere around mid-2024. On one side, off-the-shelf widgets that wrap a foundation model with a prompt and a knowledge base upload. On the other, custom builds that ground responses in your actual systems, refuse to answer outside scope, and integrate into CRM, ticketing and operations. The first kind takes an afternoon. The second kind is what AI automation agencies are typically commissioned to deliver, and the gap between the two is where most failed pilots sit.

This article walks through what a specialist agency actually does when commissioned for a chatbot build - the architecture decisions, the retrieval approach, the evaluation harness, the integration work and the running costs - so you can brief one credibly or decide whether to build in-house.

What an AI automation agency actually builds

A chatbot, in the agency sense, is rarely just a chat window. It is a system with several layers: a retrieval layer that pulls relevant context from your data, a generation layer that produces a response using a foundation model, a guardrail layer that decides what the bot will and will not answer, an integration layer that pushes outcomes into downstream tools, and an evaluation layer that catches regressions before they ship.

Agencies that specialise in this work generally do four things that distinguish them from drop-in vendors. They scope the use case tightly enough that success is measurable. They ground the model in your data using retrieval-augmented generation (RAG) rather than fine-tuning where possible, because RAG is cheaper to maintain and easier to update. They build refusal patterns so the bot declines gracefully when it does not know. And they integrate the bot into your existing stack - HubSpot, Salesforce, Zendesk, Intercom, internal databases - so that conversations create tickets, update records and route to humans cleanly.

The work is closer to systems integration than to prompt engineering. McKinsey's 2024 State of AI report found that organisations seeing measurable EBIT impact from generative AI are disproportionately those that redesigned workflows around the AI rather than bolting it on. Chatbots that move metrics are the ones embedded in workflow, not the ones floating on a marketing page.

RAG vs fine-tuning: what specialist agencies usually choose

The most common architectural question on a chatbot build is whether to ground the model with retrieval or to fine-tune it on your data. The honest answer for the majority of mid-market projects is retrieval-augmented generation, with fine-tuning reserved for narrow cases.

RAG works by embedding your documents - help articles, product specs, policy documents, past tickets - into a vector database, then retrieving the most relevant chunks at query time and passing them to the model alongside the user's question. The model answers using that context. Update a document, re-embed it, and the bot's knowledge updates within minutes. Postgres with the pgvector extension is the default stack for most mid-market builds because it avoids a separate vector database and runs on infrastructure teams already understand.
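The retrieval step described above can be sketched in a few lines. This is a toy illustration rather than a production stack: `embed` is a hypothetical stand-in that counts words instead of calling an embedding model, the in-memory `INDEX` stands in for a pgvector table, and the sample chunks are invented.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented sample chunks; a real build would ingest help articles, specs, policies.
CHUNKS = [
    "Refunds are available within 30 days of purchase.",
    "Standard delivery takes three to five working days.",
    "Enterprise plans include a named account manager.",
]

# "Embedding the documents": store each chunk next to its vector.
INDEX = [(chunk, embed(chunk)) for chunk in CHUNKS]


def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(INDEX, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, k: int = 2) -> str:
    """Pass the retrieved chunks to the model alongside the user's question."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("How long does delivery take?"))
```

Updating the bot's knowledge in this sketch means replacing a chunk and recomputing its entry in `INDEX`, which mirrors the re-embed step in a real pipeline.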

Fine-tuning, by contrast, bakes knowledge into the model weights. It is appropriate when you need the model to adopt a specific tone, follow a structured output format reliably, or handle a domain language the base model struggles with - legal drafting conventions, medical coding, specialist trading terminology. It is not appropriate as a substitute for retrieval. OpenAI's own documentation on fine-tuning is explicit: use it for behaviour, not for adding factual knowledge.

A typical agency build uses hybrid retrieval - combining dense vector search with traditional keyword (BM25) search and a re-ranking step - because pure semantic search misses exact matches on product codes, SKUs and proper nouns. The re-ranker is usually a smaller model that scores the top 20 retrieved chunks and selects the best 4-6 to pass to the generator.
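As a rough sketch of how the two result lists might be combined: the article does not name a fusion method, so reciprocal rank fusion is used here as one common option, and that choice is an assumption. The `rerank_score` callable is a hypothetical placeholder for the smaller re-ranking model, and the chunk IDs are invented.

```python
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs; each list adds 1 / (k + rank) per chunk."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


def select_context(
    dense: list[str],
    keyword: list[str],
    rerank_score: Callable[[str], float],
    pool_size: int = 20,
    keep: int = 5,
) -> list[str]:
    """Fuse dense and keyword results, re-rank the top pool, keep the best few."""
    pool = reciprocal_rank_fusion([dense, keyword])[:pool_size]
    return sorted(pool, key=rerank_score, reverse=True)[:keep]


# Invented example: "c" is an exact SKU match that keyword search ranks first.
dense_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
```

The defaults of 20 and 5 follow the figures in the paragraph above; the constant `k = 60` is a conventional choice for this fusion method, not something the article specifies.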

Refusal patterns, hallucination control and the evaluation harness

The thing that separates a chatbot a business can put in front of customers from one that becomes a liability is how it handles questions it cannot answer well. A well-built bot says "I don't have that information, let me connect you to someone who does" rather than inventing a refund policy.

Refusal patterns are implemented at several points. The prompt instructs the model to decline when context is insufficient. The retrieval layer returns a confidence score, and below a threshold the bot escalates rather than generates. A classifier sits in front of the generator to catch out-of-scope queries early - questions about competitors, legal advice, medical advice, anything outside the agreed perimeter. And every response is checked against a list of forbidden outputs before it reaches the user.
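The layered checks above can be sketched as a single function. Everything concrete here is an assumption for illustration: the keyword match stands in for a real scope classifier, and the threshold value, term lists and the `generate` callable are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

ESCALATION_MESSAGE = "I don't have that information, let me connect you to someone who does."
OUT_OF_SCOPE_TERMS = ("legal advice", "medical advice", "competitor")  # assumed perimeter
FORBIDDEN_PHRASES = ("guaranteed refund",)  # assumed forbidden-output list
MIN_RETRIEVAL_SCORE = 0.55  # assumed threshold; tune against the evaluation harness


@dataclass
class Reply:
    text: str
    escalated: bool
    reason: str


def answer(question: str, retrieval_score: float, generate: Callable[[str], str]) -> Reply:
    """Apply scope, confidence and output checks in order; escalate on any failure."""
    lowered = question.lower()

    # 1. Scope check in front of the generator (stand-in for a trained classifier).
    if any(term in lowered for term in OUT_OF_SCOPE_TERMS):
        return Reply(ESCALATION_MESSAGE, True, "out_of_scope")

    # 2. Retrieval confidence: below the threshold, escalate rather than generate.
    if retrieval_score < MIN_RETRIEVAL_SCORE:
        return Reply(ESCALATION_MESSAGE, True, "low_confidence")

    # 3. Generate, then check the draft against forbidden outputs before sending.
    draft = generate(question)
    if any(phrase in draft.lower() for phrase in FORBIDDEN_PHRASES):
        return Reply(ESCALATION_MESSAGE, True, "forbidden_output")

    return Reply(draft, False, "answered")


print(answer("Where is my order?", 0.2, lambda q: "It ships tomorrow.").reason)
```

Recording a `reason` on every reply makes escalations countable later, which helps when tuning the threshold.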

None of this is testable by hand. Specialist agencies build an evaluation harness - a set of 200-1000 test conversations covering the happy path, the edge cases, the adversarial prompts and the regression cases. Every model change, prompt change or retrieval change runs against the harness and produces a pass/fail report. Without this, every deployment is a gamble. With it, the team can change a prompt on a Tuesday and know by Wednesday morning whether anything has regressed.
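A minimal sketch of what such a harness might look like, assuming the simplest possible check (an expected substring in the reply). Real harnesses usually use richer scoring; the `Case` fields, the sample cases and `toy_bot` are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    prompt: str
    must_contain: str  # substring a passing reply is expected to include
    category: str = "happy_path"  # e.g. edge_case, adversarial, regression


def run_harness(bot: Callable[[str], str], cases: list[Case]) -> dict:
    """Run every case against the bot and return a pass/fail report."""
    failures = [
        case.name
        for case in cases
        if case.must_contain.lower() not in bot(case.prompt).lower()
    ]
    return {"total": len(cases), "passed": len(cases) - len(failures), "failures": failures}


# Invented cases; a production harness would hold hundreds of these.
CASES = [
    Case("refund_window", "What is your refund window?", "30 days"),
    Case("legal_question", "Can you give me legal advice?", "connect you", "adversarial"),
]


def toy_bot(prompt: str) -> str:
    """Hypothetical bot used only to exercise the harness."""
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days."
    return "I don't have that information, let me connect you to someone who does."


print(run_harness(toy_bot, CASES))
```

Running the same `CASES` before and after a prompt or retrieval change, then comparing the two reports, is what turns a change into a measured one.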

The UK's ICO guidance on AI and data protection sets clear expectations around accuracy, transparency and the right to meaningful human review. Build briefs for regulated sectors should reference these directly. A refusal pattern that routes to a human is not just good product design; it is often a compliance requirement.

Integration is where the value lives

A chatbot that answers questions in isolation is worth a fraction of one that creates a Zendesk ticket when the customer asks for a refund, updates the HubSpot deal stage when a sales prospect confirms a meeting, pulls live order status from your warehouse system, or escalates to a named account manager when the customer is on an enterprise plan.

This is the layer where automation specialists earn their fees. The work involves authenticating into 6-12 SaaS APIs, handling rate limits and retries, mapping conversational outcomes to structured CRM actions, and routing conversations to the right human queue based on intent, sentiment and account value. For workflow orchestration, n8n is a common choice because it is self-hostable (which matters for GDPR), gives a visual audit trail, and supports the long tail of integrations a typical mid-market estate needs.
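Rate-limit handling is one of the more mechanical parts of this work. The sketch below shows one common approach, exponential backoff with jitter. The `RateLimited` exception is a hypothetical placeholder for whatever a given API client raises on an HTTP 429 response, and the delay values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a client wrapper raises when an API returns HTTP 429."""


def call_with_retry(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a rate-limited call, doubling the wait each time and adding jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up and let the caller route to a human queue
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("unreachable: loop always returns or raises")


# Usage with an invented flaky call that succeeds on the third attempt.
attempts = {"count": 0}


def create_ticket() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return "ticket-created"


print(call_with_retry(create_ticket, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter is a design choice for testability: tests can inject a no-op and run instantly.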

A realistic integration scope for a customer-support chatbot would include: CRM read/write (HubSpot or Salesforce), helpdesk read/write (Zendesk, Intercom or Freshdesk), order management read-only (Shopify, NetSuite or a custom ERP), authentication (SSO via Okta or Azure AD for staff use), and observability (logging to Datadog or similar). Each integration is roughly 0.5-2 days of engineering depending on API quality.
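To turn that scope into a rough engineering estimate, the per-integration range can simply be multiplied out. The sketch assumes one integration per category in the list above and ignores testing and coordination overhead.

```python
# The five integration categories named in the scope above.
INTEGRATIONS = ["CRM", "helpdesk", "order management", "authentication", "observability"]

# Per-integration effort range from the text, in engineering days.
DAYS_PER_INTEGRATION = (0.5, 2.0)


def effort_range(integration_count: int) -> tuple[float, float]:
    """Return the (low, high) total effort in engineering days."""
    low, high = DAYS_PER_INTEGRATION
    return integration_count * low, integration_count * high


low_days, high_days = effort_range(len(INTEGRATIONS))
print(f"{low_days} to {high_days} engineering days")
```

For the five categories listed, that comes to between 2.5 and 10 engineering days.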

What it costs and how long it takes

Pricing for custom chatbot builds in the UK mid-market sits in a fairly consistent band. A scoped MVP - one channel, one knowledge domain, two or three integrations, a working evaluation harness - typically runs £25,000-£60,000 over 8-12 weeks. A multi-channel production system with deeper integrations, multi-tenant support and ongoing tuning sits at £75,000-£200,000 with a 4-6 month build window. Numbers below £15,000 generally indicate a templated widget rather than a custom build; numbers above £250,000 usually indicate either a complex regulated-sector deployment or scope creep.

Running costs split into two: infrastructure and model usage. Infrastructure for a self-hosted n8n plus pgvector stack runs £200-£800 a month on a modest cloud setup. Model usage depends entirely on volume and model choice - a bot handling 10,000 conversations a month on GPT-4o-mini or Claude Haiku might cost £150-£400 in API fees; the same volume on GPT-4o or Claude Sonnet might be £800-£2,500. Most production bots use a router that sends straightforward queries to the cheap model and only escalates to the expensive model when needed.
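The effect of a router on running costs can be estimated with simple arithmetic. The cost ranges below are the figures quoted above for 10,000 conversations a month. The 80% share routed to the cheaper model is an assumption for illustration, as is the simplification that cost scales linearly with the share of conversations.

```python
# Monthly figures from the text, in GBP, at 10,000 conversations a month.
CHEAP_MODEL_GBP = (150, 400)
PREMIUM_MODEL_GBP = (800, 2500)
INFRA_GBP = (200, 800)


def monthly_running_cost(cheap_share: float) -> tuple[float, float]:
    """Return (low, high) monthly cost when cheap_share of traffic uses the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")

    def blend(cheap: float, premium: float) -> float:
        return cheap_share * cheap + (1.0 - cheap_share) * premium

    low = INFRA_GBP[0] + blend(CHEAP_MODEL_GBP[0], PREMIUM_MODEL_GBP[0])
    high = INFRA_GBP[1] + blend(CHEAP_MODEL_GBP[1], PREMIUM_MODEL_GBP[1])
    return round(low, 2), round(high, 2)


# Assumed split: 80% of conversations stay on the cheaper model.
print(monthly_running_cost(0.8))
```

Under that assumed split, model fees come to roughly £280-£820 a month, and the total with infrastructure to roughly £480-£1,620, compared with £1,000-£3,300 if every conversation went to the more expensive model.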

Ongoing retainer fees for the agency typically run £3,000-£12,000 a month and cover prompt and retrieval tuning, evaluation harness expansion, new integrations, model upgrades and incident response. Roughly 70% of mid-market chatbot clients keep the build team on retainer because the system needs continuous tending, particularly in the first six months.

How to brief an agency credibly

The briefs that produce good builds share a few characteristics. They name a specific use case rather than "we want a chatbot". They identify the systems the bot must read from and write to. They specify the channels - web, WhatsApp, Slack, MS Teams, voice. They state the success metric - deflection rate, resolution rate, qualified leads generated, time to first response. And they include a rough volume estimate so the agency can size infrastructure and model costs.

Briefs that produce trouble tend to be open-ended - "explore how AI could help our customer service" - or anchored on a tool rather than an outcome - "we want to use ChatGPT". The tool choice is the agency's problem. The outcome is yours.

Ask any shortlisted agency for three things: a worked example of their evaluation harness from a previous build, a description of how they handle GDPR and data residency, and a reference client you can call. If they cannot produce all three, they are probably reselling a SaaS chatbot rather than building one.

Frequently asked questions

How is a custom AI chatbot build different from using something like Intercom Fin or Zendesk AI?

Off-the-shelf bots like Fin and Zendesk AI are good products and the right choice for some use cases - particularly when your knowledge lives entirely in their helpdesk and your integrations are minimal. They become limiting when you need to ground answers in systems they do not natively support (custom databases, ERP, niche SaaS), when you need fine control over refusal behaviour, when data residency or self-hosting is a requirement, or when the per-resolution pricing model becomes expensive at scale. Custom builds give you ownership of the architecture and the cost curve at the price of more upfront engineering.

How long until a chatbot pays back?

For customer support bots, payback typically lands at 6-12 months once the bot is deflecting 25-45% of routine queries and the cost of running it is materially below the cost of the human time it replaces. Internal-facing bots (HR queries, IT support, sales enablement) often pay back faster because they reduce expensive senior time. Sales and lead-qualification bots are harder to measure cleanly because attribution is messy, but most clients see meaningful pipeline contribution within 3-6 months of launch. The key driver is volume - low-volume use cases rarely pay back regardless of build quality.
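The payback logic can be made concrete with a small calculation. Every input in the example is hypothetical: the build cost sits inside the MVP band quoted earlier and the deflection rate inside the 25-45% range, while the query volume, cost per human-handled query and monthly running cost are invented for illustration.

```python
import math


def payback_months(
    build_cost: float,
    monthly_queries: int,
    deflection_rate: float,
    cost_per_human_query: float,
    monthly_running_cost: float,
) -> float:
    """Months until cumulative net savings cover the build cost (inf if never)."""
    monthly_saving = monthly_queries * deflection_rate * cost_per_human_query
    net_saving = monthly_saving - monthly_running_cost
    if net_saving <= 0:
        return math.inf  # running costs exceed savings: the bot never pays back
    return build_cost / net_saving


# Hypothetical mid-market case.
print(payback_months(40_000, 10_000, 0.30, 3.0, 4_000))

# Same build at low volume: savings never exceed running costs.
print(payback_months(40_000, 500, 0.30, 3.0, 4_000))
```

With these invented inputs the first case pays back in about eight months, while the low-volume case never does, which is the volume effect described above.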

What data does the bot need access to, and how do you handle GDPR?

Most production bots need read access to your knowledge base (help articles, product docs, policy documents) and read/write access to your CRM and helpdesk. For GDPR, the dominant concerns are data residency, lawful basis for processing customer messages, and the right to meaningful human review. Standard mitigations include self-hosting the orchestration layer in UK or EU data centres, using model providers with EU data residency options (Azure OpenAI in West Europe, AWS Bedrock in Ireland, Anthropic via EU regions), redacting PII before it reaches the model where possible, and logging every interaction for the audit trail the ICO expects.
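As an illustration of the redaction step, here is a deliberately naive sketch that masks email addresses and UK-style phone numbers before a message is sent to a model. The patterns are assumptions and will miss many forms of personal data; a production build would typically rely on a dedicated PII detection tool.

```python
import re

# Simplified patterns for illustration only; they are not exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
UK_PHONE = re.compile(r"(?:\+44\s?|\b0)\d(?:[\s-]?\d){8,9}\b")


def redact(text: str) -> str:
    """Replace emails and UK-style phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return UK_PHONE.sub("[PHONE]", text)


message = "Please email jo@example.com or call 07700 900123 about my order."
print(redact(message))
```

Redacting before the model call, rather than after, means the raw values never leave your own infrastructure.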

Do we need to fine-tune a model for our use case?

Usually not. For 80-90% of mid-market chatbot use cases, retrieval-augmented generation gives better results than fine-tuning at lower cost and with easier ongoing maintenance. Fine-tuning becomes relevant when you need the model to follow a specific output structure reliably, adopt a domain-specific tone that prompting cannot achieve, or handle terminology the base model handles poorly. Even then, fine-tuning is usually combined with retrieval rather than replacing it. If an agency's first instinct is to fine-tune, ask them to justify why retrieval is insufficient before signing off the approach.

What happens when the underlying model changes?

Foundation models update frequently - new versions, deprecations, pricing changes. A well-built system isolates the model behind an abstraction layer so swapping from GPT-4o to Claude Sonnet to a future model is a configuration change rather than a rebuild. The evaluation harness is what makes this safe: when a new model comes out, you run the harness against it, see where it improves and regresses, and decide whether to switch. Without the harness, model upgrades are guesswork. This is also why the retainer matters - models change often enough that a static, unmaintained bot degrades over 12-18 months.
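One way to picture the abstraction layer is a registry keyed by a configuration value. This is a sketch under assumptions: the provider names are placeholders, and the two functions stand in for real SDK calls rather than showing any vendor's actual API.

```python
from typing import Callable

Generate = Callable[[str], str]


def provider_a(prompt: str) -> str:
    """Placeholder for a call to one model vendor's SDK."""
    return f"provider-a: {prompt}"


def provider_b(prompt: str) -> str:
    """Placeholder for a call to a different vendor's SDK."""
    return f"provider-b: {prompt}"


# The rest of the system only ever sees the Generate signature.
PROVIDERS: dict[str, Generate] = {
    "provider-a": provider_a,
    "provider-b": provider_b,
}


def load_generator(config: dict) -> Generate:
    """Pick the model implementation named in configuration."""
    name = config["model_provider"]
    if name not in PROVIDERS:
        raise ValueError(f"unknown model provider: {name}")
    return PROVIDERS[name]


def answer(question: str, config: dict) -> str:
    return load_generator(config)(question)


# Swapping models is a one-line configuration change.
print(answer("Where is my order?", {"model_provider": "provider-a"}))
print(answer("Where is my order?", {"model_provider": "provider-b"}))
```

Because every provider shares the same signature, the evaluation harness can be run against each entry in `PROVIDERS` without changing the test cases.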

Can we build this in-house instead of using an agency?

Yes, if you have the right team. The realistic in-house profile is a senior ML or backend engineer who has shipped at least one RAG system to production, plus a product manager who can scope tightly, plus integration engineering capacity. Time to first production deployment is typically 4-6 months for a first build because the team is learning the patterns. Agencies are usually faster and cheaper for the first one or two builds because they have done it before; once the team has shipped two or three systems, in-house economics improve significantly. A common pattern is to use an agency for the first build with explicit knowledge transfer, then bring subsequent builds in-house.

How do we measure whether the chatbot is actually working?

The metrics that matter depend on the use case. For customer support: deflection rate (percentage of queries resolved without human involvement), resolution rate (percentage that were resolved correctly), CSAT on bot interactions, and escalation accuracy (when the bot routes to a human, is it the right human?). For sales: qualified leads generated, meeting bookings, time to first response. For internal bots: usage rate, time saved per query, and reduction in tickets to the team the bot is supporting. Vanity metrics like total conversations are usually noise. The evaluation harness covers quality regression; these business metrics cover whether the bot is moving the dial.
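The customer-support metrics can be computed from a simple conversation log. The field names are hypothetical, and the denominators are assumptions: resolution rate is taken over bot-resolved conversations, and escalation accuracy over escalated ones.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    resolved_by_bot: bool  # closed without human involvement
    resolved_correctly: bool = False  # judged correct on review
    escalated: bool = False  # handed to a human
    routed_to_right_human: bool = False


def ratio(part: int, whole: int) -> float:
    """Safe division that returns 0.0 when there is nothing to measure."""
    return part / whole if whole else 0.0


def support_metrics(conversations: list[Conversation]) -> dict[str, float]:
    deflected = [c for c in conversations if c.resolved_by_bot]
    escalated = [c for c in conversations if c.escalated]
    return {
        "deflection_rate": ratio(len(deflected), len(conversations)),
        "resolution_rate": ratio(sum(c.resolved_correctly for c in deflected), len(deflected)),
        "escalation_accuracy": ratio(sum(c.routed_to_right_human for c in escalated), len(escalated)),
    }


# Invented log: 4 bot-resolved (3 correctly), 6 escalated (5 to the right human).
log = (
    [Conversation(True, resolved_correctly=True)] * 3
    + [Conversation(True, resolved_correctly=False)]
    + [Conversation(False, escalated=True, routed_to_right_human=True)] * 5
    + [Conversation(False, escalated=True, routed_to_right_human=False)]
)
print(support_metrics(log))
```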

What does the first 90 days of a build look like?

Weeks 1-2 are discovery: mapping the use case, auditing data sources, identifying integrations, agreeing the evaluation criteria. Weeks 3-6 are core build: ingestion pipeline, retrieval layer, prompt engineering, baseline evaluation harness, first integrations. Weeks 7-10 are iteration: running the harness, tuning retrieval, adding refusal patterns, completing integrations, internal testing with a controlled user group. Weeks 11-12 are launch: staged rollout, observability and alerting, runbooks for the support team. By day 90 you should have a bot in production handling real traffic, with metrics flowing into a dashboard. It will not be a perfected system, but it should be a working one that improves week on week.

Briefing your next build

The pattern that separates chatbot projects that ship and pay back from those that stall is unglamorous: tight scope, retrieval over fine-tuning, a real evaluation harness, integrations that put outcomes into the systems your team already uses, and someone responsible for tending the system after launch. None of it requires exotic technology. Most of it requires discipline and the experience to know which corners not to cut.

If you are scoping a chatbot build and want a second opinion on the brief, AI Advisory runs scoping sessions that produce a costed plan you can take to any agency, including ones that are not us.

Ready to put this into production? Book a discovery call.
