AI Chatbot Agency: How to Choose, Scope and Run a Build That Works
A practitioner's guide to choosing an AI chatbot agency: scoping, retrieval architecture, evaluation, compliance and what good delivery looks like
Most AI chatbot projects fail for the same three reasons: the scope is wrong, the retrieval architecture is wrong, and nobody owns evaluation after launch. Picking the right agency is mostly about avoiding those three failures. This guide walks through how to scope a chatbot build, what to look for in an agency, what good architecture looks like in 2025, and how to budget realistically.
If you are commissioning a chatbot for the first time, or you have a stalled internal build and want to bring in outside help, the questions below will save you from a six-figure mistake.
What an AI chatbot agency actually does
The term covers a wide spread of work. At the shallow end, an agency configures a SaaS chatbot product (Intercom Fin, Ada, Zendesk AI) on top of your help centre and tunes the prompts. At the deep end, it builds a custom Retrieval-Augmented Generation (RAG) system on your private data, with a custom UI, integrations into your CRM and ticketing platform, evaluation harnesses, and a feedback loop into the underlying knowledge base.
Most mid-market buyers need something between those poles. A typical engagement looks like this:
- Discovery and scope (1-2 weeks): user research, conversation mining from existing support transcripts, definition of intents the bot will handle and intents it will refuse.
- Knowledge ingestion (2-3 weeks): pulling content from Confluence, SharePoint, Notion, Salesforce Knowledge, PDFs and websites; chunking, embedding and indexing it.
- Build (4-8 weeks): retrieval pipeline, prompt design, refusal patterns, handover-to-human logic, integrations.
- Evaluation (ongoing from week 3): a test set of questions with expected answers, run on every change.
- Deployment and tuning (2-4 weeks): channel integration (web widget, WhatsApp, Slack, Teams, in-app), staged rollout, monitoring.
- Operate (retainer): weekly review of failed conversations, content gap closure, model upgrades.
The agencies worth hiring spend more time on scope and evaluation than on the model itself. Picking GPT-4o or Claude Sonnet 4.5 is a 30-minute decision. Defining what "correct" means for your bot is a 3-week decision.
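The "test set of questions with expected answers" can start very small. The sketch below is one possible shape rather than a prescribed harness: EvalCase, run_eval, the substring-based pass check and the refusal marker are all illustrative assumptions, and answer_fn stands in for whatever function calls your bot.

```python
from dataclasses import dataclass, field
from typing import Callable

REFUSAL_MARKER = "i don't know"  # assumed phrase the bot uses when it declines


@dataclass
class EvalCase:
    question: str
    must_contain: list[str] = field(default_factory=list)  # facts a correct answer mentions
    should_refuse: bool = False  # True for intents the bot must decline


def run_eval(cases: list[EvalCase], answer_fn: Callable[[str], str]) -> dict:
    """Run every case through the bot; return the pass rate and failed questions."""
    failures = []
    for case in cases:
        answer = answer_fn(case.question).lower()
        refused = REFUSAL_MARKER in answer
        if case.should_refuse:
            ok = refused
        else:
            ok = not refused and all(f.lower() in answer for f in case.must_contain)
        if not ok:
            failures.append(case.question)
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases) if cases else 0.0, "failures": failures}


def fake_bot(question: str) -> str:
    """Stand-in for the real bot, used only to exercise the harness."""
    if "refund" in question.lower():
        return "Refunds are processed within 14 days."
    return "I don't know, let me connect you to a colleague."


cases = [
    EvalCase("How long do refunds take?", ["14 days"]),
    EvalCase("Can you give me legal advice?", should_refuse=True),
    EvalCase("What is the delivery time?", ["3 working days"]),
]
report = run_eval(cases, fake_bot)
print(report)
```

Running a file like this on every prompt, content or model change is what turns "correct" from an opinion into a number. Substring checks are crude; many teams later replace them with graded rubrics, but the structure stays the same.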
The retrieval architecture question
Almost every serious chatbot in 2025 is a RAG system. The model itself does not "know" your product, your policies or your prices. When a user asks a question, the system retrieves relevant passages from your private content, and the model writes an answer grounded in those passages.
The quality of the retrieval step matters more than the choice of model. A weak retriever paired with GPT-4o produces confident hallucinations. A strong retriever paired with a mid-tier model produces accurate, cited answers.
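Reduced to its skeleton, the retrieve-then-answer flow looks like the sketch below. It is a toy under loud assumptions: the word-overlap retriever stands in for a real vector or keyword index, and PASSAGES, retrieve, build_prompt and the prompt wording are hypothetical names and data chosen for illustration.

```python
import re

# Hypothetical private content; in a real system this lives in an index.
PASSAGES = {
    "returns-policy#2": "Items can be returned within 30 days of delivery.",
    "shipping#1": "Standard shipping takes 3 to 5 working days.",
    "warranty#4": "Hardware is covered by a 2 year warranty.",
}


def tokens(text: str) -> set[str]:
    """Lowercase word set used by the toy retriever."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, k: int = 2) -> list[tuple[str, str]]:
    """Toy retriever: rank passages by the number of words shared with the question."""
    q = tokens(question)
    scored = [(len(q & tokens(text)), pid, text) for pid, text in PASSAGES.items()]
    scored = [s for s in scored if s[0] > 0]   # drop passages with no overlap
    scored.sort(key=lambda s: (-s[0], s[1]))   # best overlap first, id as tie-break
    return [(pid, text) for _, pid, text in scored[:k]]


def build_prompt(question: str) -> str:
    """Assemble the grounded prompt that would be sent to the answer model."""
    context = "\n".join(f"[{pid}] {text}" for pid, text in retrieve(question))
    return (
        "Answer using only the passages below and cite their ids.\n"
        f"{context}\n"
        f"Question: {question}"
    )


print(build_prompt("How long does shipping take?"))
```

The model only ever sees what retrieve returns, which is why a weak retriever cannot be rescued by a stronger model.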
Good agencies will talk about:
- Hybrid retrieval - combining dense vector search (semantic) with sparse keyword search (BM25). Anthropic's contextual retrieval research reported that combining contextual embeddings with contextual BM25 cut the top-20 retrieval failure rate by 49% compared with standard embedding-only retrieval.
- Chunking strategy - how content is split. Splitting a policy document on arbitrary 500-character boundaries destroys context. Splitting on semantic structure (headings, clauses) preserves it.
- Reranking - a second-pass model that reorders the top 20-50 retrieved chunks before sending the top 3-5 to the answer model. Cohere Rerank, BGE Reranker or a cross-encoder are common choices.
- Refusal and grounding - when the bot cannot find good evidence, it should say so, not improvise. This is the single largest difference between a production-safe bot and a demo bot.
- Citations - every answer should be traceable to the underlying source. This is both a trust feature for users and a debugging tool for you.
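One common way to merge the dense and sparse result lists from the first bullet is reciprocal rank fusion (RRF). Nothing above says which fusion method a given agency uses, so treat this as one illustrative option: the document ids are made up, and k=60 is simply the constant usually quoted for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["faq-12", "policy-3", "faq-7"]        # hypothetical vector-search order
sparse_hits = ["policy-3", "pricing-1", "faq-12"]   # hypothetical BM25 order

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that appear high in both lists rise to the top, and the fused list is what a reranker would then reorder before the top few chunks reach the answer model.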
If an agency cannot explain its retrieval architecture in concrete terms - which vector store, which embedding model, what chunking rules, what reranker, what eval set - assume it is reselling a SaaS product with thin custom prompting on top. That can be the right answer for some use cases, but you should know what you are buying.
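As an example of what a concrete chunking rule looks like, the sketch below splits on headings rather than on fixed character counts. It assumes, purely for illustration, that headings are lines starting with "#"; content exported from Confluence, SharePoint or PDFs would need its own structure detection.

```python
def chunk_by_heading(document: str) -> list[dict]:
    """Split text into one chunk per heading, keeping the heading as context."""
    chunks = []
    heading, body = "Introduction", []
    # The trailing sentinel heading forces the final section to be flushed.
    for line in document.splitlines() + ["# END"]:
        if line.startswith("#"):
            text = "\n".join(body).strip()
            if text:
                chunks.append({"heading": heading, "text": text})
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    return chunks


policy = """# Refunds
Refunds are issued within 14 days.
Original shipping is not refunded.

# Exchanges
Exchanges are free within 30 days."""

chunks = chunk_by_heading(policy)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk keeps its clause intact and carries its heading, so the retriever never returns half a sentence about refunds glued to half a sentence about exchanges.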
How to scope your first chatbot project
The most common scoping mistake is trying to build one bot that handles everything. Customer support, sales qualification, internal HR queries and developer documentation are four different products, with different users, different data, different tone and different risk profiles. Pick one.
Within that one, narrow further. A useful framework:
- Pull 3-6 months of conversation data from your existing channel (Zendesk tickets, Intercom conversations, live chat transcripts, contact-form submissions).
- Cluster the conversations by topic. Most support volumes follow a long tail: the top 10-20 intents typically cover 60-80% of volume.
- For each intent, classify it as deflectable (the bot can resolve it from documented information), triageable (the bot collects information and routes to a human), or human-only (legal, complex billing, complaints).
- Build for the top 5-8 deflectable intents first. Measure deflection rate, customer satisfaction (CSAT) on bot-resolved conversations, and false-resolution rate (cases where the bot claimed to resolve something but the user came back).
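The clustering and classification steps above can be sketched as a small coverage calculation. The intent labels, counts and the classification mapping are invented for illustration; in practice the labels come from clustering your own transcripts and the classification from a review with the support team.

```python
from collections import Counter

# Hypothetical intent label per conversation, e.g. the output of a clustering pass.
labels = (
    ["order_status"] * 400
    + ["password_reset"] * 250
    + ["refund_request"] * 150
    + ["billing_dispute"] * 120
    + ["complaint"] * 80
)

# Hypothetical manual classification of each intent.
classification = {
    "order_status": "deflectable",
    "password_reset": "deflectable",
    "refund_request": "triageable",
    "billing_dispute": "human-only",
    "complaint": "human-only",
}


def coverage_table(labels: list[str]) -> list[tuple[str, int, float]]:
    """Return (intent, count, cumulative share of volume), largest intent first."""
    total = len(labels)
    running = 0
    rows = []
    for intent, count in Counter(labels).most_common():
        running += count
        rows.append((intent, count, running / total))
    return rows


rows = coverage_table(labels)
first_build = [intent for intent, _, _ in rows if classification[intent] == "deflectable"]
deflectable_share = sum(
    count for intent, count, _ in rows if classification[intent] == "deflectable"
) / len(labels)

print(rows)
print(first_build, deflectable_share)
```

The deflectable share is the ceiling on deflection rate for the first release, which makes it a useful number to agree on before setting targets.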
This narrow scoping is what produces the public success cases. Klarna's widely reported chatbot handled two-thirds of customer service chats within a month of launch because the team was disciplined about which intents the bot owned and which routed to humans. They did not try to replace all human work on day one.
What to look for in an agency
Ten questions that separate practitioners from resellers:
- Show me an evaluation harness from a past project. If they cannot, they are shipping unmeasured systems. Walk away.
- What does your retrieval stack look like and why? Specifics about embedding models, vector store, reranking and chunking. Vague answers mean off-the-shelf SaaS.
- How do you handle hallucinations and refusals? Look for grounding rules, citation requirements, and explicit "I don't know" patterns.
- How do you handle PII and GDPR? The ICO's guidance on AI sets clear expectations on lawful basis, transparency and data minimisation. The agency should reference it without prompting.
- Where does the data sit and which models process it? If your data cannot leave the UK or EU, the agency should know which providers offer EU-resident inference and which do not.
- What is the handover to humans? A bot without a clean escalation path is a complaint generator.
- Who owns the prompts, the eval set, and the code at the end? The correct answer is you. Be wary of black-box deliverables.
- What does the retainer cover? Bots drift. Content changes, models update, user behaviour shifts. You need someone monitoring this monthly, not annually.
- Show me a project that did not work and what you learned. Anyone with real experience has at least one. Refusal to discuss failures is a red flag.
- What is the smallest version of this we could ship in 6 weeks? Good agencies always have an answer. Bad ones want to sell a 9-month programme.
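On the hallucination and refusal question, a grounding rule can be as plain as a relevance threshold in front of the answer model. This is a hedged sketch: MIN_SCORE, the Passage type, the reply format and the generate callback are assumptions, and a sensible threshold depends on the retriever and has to be tuned against an evaluation set.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Passage:
    source: str   # where the text came from, reused as the citation
    text: str
    score: float  # retriever or reranker relevance, assumed to lie in [0, 1]


MIN_SCORE = 0.5  # hypothetical cut-off, to be tuned per system


def grounded_reply(
    passages: list[Passage],
    generate: Callable[[list[Passage]], str],
) -> dict:
    """Answer only from sufficiently relevant passages; otherwise refuse and hand over."""
    evidence = [p for p in passages if p.score >= MIN_SCORE]
    if not evidence:
        # No usable evidence: say so explicitly instead of calling the model.
        return {
            "answer": "I don't know. I'm passing you to a colleague who can help.",
            "citations": [],
            "handover": True,
        }
    return {
        "answer": generate(evidence),  # generate() stands in for the model call
        "citations": [p.source for p in evidence],
        "handover": False,
    }
```

An agency with a real answer to this question will be able to show where this gate sits in its pipeline and how the threshold was chosen.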
Budget, timeline and operating cost
UK market rates as of late 2025, for mid-market projects:
- SaaS chatbot configuration (Intercom Fin, Ada, Zendesk AI): £8,000-£25,000 setup, plus the SaaS licence (typically £0.40-£1.20 per resolved conversation).
- Custom RAG chatbot, single channel, one knowledge domain: £30,000-£70,000 build, 8-12 weeks.
- Custom RAG chatbot, multi-channel, multiple integrations (CRM, ticketing, auth): £70,000-£180,000 build, 12-20 weeks.
- Operate retainer: £2,500-£12,000 per month depending on volume, channel count and evaluation depth.
On running cost, model inference is usually the smallest line item. For a bot handling 10,000 conversations a month with an average of 6 turns and ~2,500 tokens per turn, expect API costs in the £400-£1,200 range monthly on current Anthropic or OpenAI pricing. Infrastructure (vector store, app hosting, monitoring) adds £100-£500. The expensive line items are the humans: content curation, evaluation runs, and the engineer who fixes things when retrieval drifts.
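The arithmetic behind that estimate can be sketched as follows. The traffic figures come from the paragraph above; the blended price per million tokens and the exchange rate are placeholder assumptions, not quoted prices, so substitute your provider's current rates.

```python
def monthly_api_cost_gbp(
    conversations: int,
    turns_per_conversation: int,
    tokens_per_turn: int,
    usd_per_million_tokens: float,  # assumed blended input/output price
    gbp_per_usd: float,             # assumed exchange rate
) -> float:
    """Estimate monthly model spend in pounds from traffic and a blended token price."""
    total_tokens = conversations * turns_per_conversation * tokens_per_turn
    return total_tokens / 1_000_000 * usd_per_million_tokens * gbp_per_usd


# 10,000 conversations x 6 turns x 2,500 tokens = 150 million tokens a month.
low = monthly_api_cost_gbp(10_000, 6, 2_500, usd_per_million_tokens=3.5, gbp_per_usd=0.8)
high = monthly_api_cost_gbp(10_000, 6, 2_500, usd_per_million_tokens=10.0, gbp_per_usd=0.8)

print(round(low), round(high))
```

With these placeholder prices the result spans roughly £420 to £1,200 a month. The blended price moves with the input/output mix and with prompt caching, so the useful habit is re-running the sum whenever the model or prompt size changes.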
Payback timelines on customer support deflection are usually 4-9 months if scoped properly. McKinsey's State of AI surveys consistently show service operations as the most common function reporting cost savings from generative AI, with median reported cost reductions in the 10-30% range when implementations are mature.
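A payback estimate can be sanity-checked with simple arithmetic. Every figure in the example is hypothetical and chosen only to show the calculation: the build cost and running cost sit inside the ranges quoted above, while the deflected volume and the cost of a human-handled conversation are assumptions to replace with your own data.

```python
import math
from typing import Optional


def payback_months(
    build_cost: float,
    monthly_running_cost: float,
    deflected_per_month: int,
    human_cost_per_conversation: float,
) -> Optional[int]:
    """Whole months until cumulative net savings cover the build cost; None if never."""
    monthly_saving = deflected_per_month * human_cost_per_conversation
    net = monthly_saving - monthly_running_cost
    if net <= 0:
        return None  # the bot costs more to run than it saves
    return math.ceil(build_cost / net)


months = payback_months(
    build_cost=50_000,
    monthly_running_cost=5_000,        # retainer plus API and infrastructure (assumed)
    deflected_per_month=4_000,         # assumed deflected conversations
    human_cost_per_conversation=3.50,  # assumed fully loaded agent cost
)
print(months)
```

The None branch matters: at low volumes the retainer alone can exceed the savings, which is the case the scoping exercise is meant to catch before a build is commissioned.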
UK compliance and the things that actually matter
For UK buyers, three things sit above the rest:
Lawful basis and transparency. Under UK GDPR, users interacting with the bot need to know they are talking to a machine, what data is being collected, and what it is used for. The ICO is explicit that AI does not get a special pass on transparency requirements. Your privacy notice needs updating before launch, not after.
Data residency and sub-processors. If you are in regulated sectors (financial services, healthcare, legal), check where the model provider processes data. Anthropic and OpenAI both offer enterprise tiers with regional inference and zero data retention; the consumer tiers do not. Your DPIA needs to name every sub-processor in the chain.
Automated decision-making. If the bot makes decisions that significantly affect users (denying a claim, declining service, assigning a credit limit), you are likely in Article 22 territory and need a human-in-the-loop and a right of appeal. Most customer-service bots do not cross this line, but anything in lending, insurance underwriting or HR screening does. The UK government's AI regulation white paper sets out the principles regulators are applying.
A competent agency raises these before you ask. An incompetent one will tell you GDPR is a marketing department problem.
Build vs buy vs hybrid
Three legitimate paths exist, and the right one depends on your situation:
Buy SaaS if your use case is mainstream customer support, your knowledge base is reasonably clean, and you do not need deep integration into proprietary systems. Intercom Fin, Ada and Zendesk AI are all credible. You will trade flexibility for speed-to-launch and reduce build risk significantly.
Build custom if your data is sensitive, your integrations are non-trivial, your domain is specialised (legal, medical, financial), or the conversational logic is genuinely complex. A custom RAG system gives you full control of retrieval, prompting, refusal logic and citations.
Hybrid - and this is increasingly the right answer - if you want a SaaS front-end (Intercom, Zendesk) for ticket management and human handover, with a custom RAG backend doing the actual answering via API. You get the operational maturity of the SaaS product and the answer quality of a custom system.
An agency that pushes you into one path without understanding your data, your integrations and your risk profile is selling its capacity, not solving your problem.
What good delivery looks like in the first 90 days
By week 4, you should see a working bot in a staging environment answering a defined set of test questions, with a published evaluation report.
By week 8, you should see a soft launch on a single channel with a small percentage of traffic, monitored daily, with a published weekly metrics dashboard covering deflection rate, CSAT, refusal rate, false-resolution rate and average cost per conversation.
By week 12, you should see the bot at full traffic on the launch channel, with a documented content backlog (the gaps the bot has surfaced in your knowledge base), a documented model evaluation showing performance against the test set, and a plan for the next channel or intent expansion.
If any of those milestones slip with no clear remediation plan, the project is in trouble. Good agencies surface bad news early. Bad ones surface it at the end of the contract.
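Sending "a small percentage of traffic" to the bot during the soft launch is usually done with deterministic bucketing, so the same user always gets the same experience. A minimal sketch, assuming a stable user id string is available; the function name and salt are illustrative.

```python
import hashlib


def in_rollout(user_id: str, percent: int, salt: str = "chatbot-soft-launch") -> bool:
    """Place the user in one of 100 stable buckets and admit buckets below percent."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


users = [f"user-{n}" for n in range(10_000)]
at_10 = [u for u in users if in_rollout(u, 10)]
at_50 = [u for u in users if in_rollout(u, 50)]
print(len(at_10), len(at_50))
```

Because the bucket depends only on the salt and the user id, raising the percentage from 10 to 50 keeps every existing bot user in the bot group, which keeps week-on-week metrics comparable.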
FAQ
How long does it take to launch an AI chatbot with an agency?
For a well-scoped first project on a single channel covering 5-8 intents, expect 8-12 weeks from kickoff to soft launch. The first two weeks are discovery and conversation mining. Weeks three to eight cover knowledge ingestion, retrieval setup, prompt design and evaluation. Weeks nine to twelve are integration, staged rollout and tuning. Multi-channel, multi-integration projects with complex auth or compliance requirements push to 16-20 weeks. Be wary of agencies promising production launches in under six weeks unless the scope is genuinely tiny - a 30-day launch usually means skipping evaluation, which means the bot will hallucinate in production.
How much should a mid-market business budget for a custom AI chatbot?
For a custom RAG chatbot covering one channel and one knowledge domain, UK market rates sit between £30,000 and £70,000 for the build, with most projects landing around £45,000-£55,000. Multi-channel projects with CRM and ticketing integration run £70,000-£180,000. On top of build, budget £2,500-£12,000 per month for an operate retainer covering monitoring, evaluation, content updates and model upgrades. API and infrastructure running costs are typically the smallest line item at £500-£1,700 monthly for moderate volumes. The total first-year cost for a mid-market deployment is usually £80,000-£150,000.
What is the difference between RAG and fine-tuning, and which does my chatbot need?
RAG (Retrieval-Augmented Generation) keeps your knowledge in an external store and retrieves relevant passages at query time, feeding them into the model's context. Fine-tuning bakes patterns directly into the model's weights through additional training. For the large majority of customer-facing chatbots, RAG is the right choice because it lets you update knowledge quickly (update the source content and re-index it), provides citations, and avoids retraining costs. Fine-tuning is appropriate when you need a specific output style or format the base model cannot produce reliably, or for narrow classification tasks. Most agencies pushing fine-tuning for general chatbot work are solving the wrong problem.
How do AI chatbots handle GDPR and UK data protection?
Three areas need attention: transparency (users must know they are interacting with AI and what data is collected), lawful basis (usually legitimate interests for customer service, consent for marketing chatbots), and data residency (where the model provider processes the data). The ICO has published explicit guidance on AI under UK GDPR and expects organisations to complete a Data Protection Impact Assessment before deploying chatbots that process personal data at scale. Anthropic, OpenAI and Azure OpenAI all offer enterprise tiers with EU or UK data residency and zero retention. Consumer-tier APIs typically do not, so check your contract before launch.
Can an AI chatbot replace our customer support team?
Not at any responsible scope, no. The realistic outcome is deflection of 30-65% of incoming volume on common intents, freeing the team to focus on complex cases, sensitive issues and escalations. The Klarna deployment that handled two-thirds of chats was unusual in both scale and intent design, and even there the human team remained essential for complaints, regulatory matters and edge cases. Treat chatbots as capacity expansion, not headcount replacement. Teams that try to replace humans entirely tend to see CSAT drop sharply within three months and reverse the decision within twelve.
Who owns the chatbot, the prompts and the code at the end of the engagement?
You should, in every meaningful sense. The build code, the system prompts, the evaluation set, the vector indices and the integration code should all be delivered to your repositories under your accounts. Be explicit about this in the statement of work. Some agencies retain ownership of "proprietary frameworks" - this can be acceptable if it covers genuinely reusable internal tooling, but it should never cover your prompts, your evaluation data, or anything trained on your content. If you cannot, in principle, take the system to another vendor at the end of the contract, you have a lock-in problem.
How do we measure whether the chatbot is actually working?
Five metrics matter. Deflection rate (percentage of conversations resolved without human handover). CSAT on bot-resolved conversations (survey the user after resolution). False-resolution rate (conversations the bot marked resolved but the user re-contacted within 7 days). Refusal rate (how often the bot declined to answer - too high means weak retrieval, too low means it is improvising). Cost per resolved conversation (total operating cost divided by resolutions). Track these weekly and review monthly. If your agency cannot supply this dashboard from day one, they are not running a serious operation.
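The five metrics can be computed from a plain conversation log. The record fields below (handed_over, marked_resolved, recontacted_within_7d, refused, csat) are an assumed schema for illustration, not any platform's export format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    handed_over: bool            # escalated to a human
    marked_resolved: bool        # the bot closed it as resolved
    recontacted_within_7d: bool  # the user came back about the same issue
    refused: bool                # the bot declined to answer
    csat: Optional[int] = None   # 1-5 survey score, if the user answered


def weekly_metrics(convs: list[Conversation], total_operating_cost: float) -> dict:
    """Compute the five dashboard metrics for one reporting period."""
    if not convs:
        raise ValueError("no conversations in this period")
    resolved = [c for c in convs if c.marked_resolved and not c.handed_over]
    scores = [c.csat for c in resolved if c.csat is not None]
    false_resolved = [c for c in resolved if c.recontacted_within_7d]
    return {
        "deflection_rate": len(resolved) / len(convs),
        "csat_bot_resolved": sum(scores) / len(scores) if scores else None,
        "false_resolution_rate": len(false_resolved) / len(resolved) if resolved else None,
        "refusal_rate": sum(c.refused for c in convs) / len(convs),
        "cost_per_resolution": total_operating_cost / len(resolved) if resolved else None,
    }


sample = [
    Conversation(handed_over=False, marked_resolved=True, recontacted_within_7d=False, refused=False, csat=5),
    Conversation(handed_over=False, marked_resolved=True, recontacted_within_7d=True, refused=False, csat=3),
    Conversation(handed_over=True, marked_resolved=False, recontacted_within_7d=False, refused=False),
    Conversation(handed_over=True, marked_resolved=False, recontacted_within_7d=False, refused=True),
]
metrics = weekly_metrics(sample, total_operating_cost=10.0)
print(metrics)
```

Note that this sketch counts a false resolution inside the deflection rate; some teams subtract them to report a "true" deflection figure, and either convention works as long as the dashboard states which one it uses.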
What is the biggest reason chatbot projects fail?
Undefined success criteria, by a wide margin. Projects launch without an evaluation set, without a deflection target, without a CSAT baseline, and without a definition of what "the bot is working" means. Six months later, nobody can say whether it has paid back, the team has stopped maintaining the knowledge base, and the bot is quietly hallucinating on 15% of conversations. The fix is upstream: insist on an evaluation harness, a baseline measurement of your current support metrics, and a written definition of success before any code is written. Agencies that resist this are the ones whose past projects failed.
Getting started
The best first step is a 2-week discovery, not a 12-week build. Mine your existing conversations, pick one narrow use case, define success in measurable terms, and only then commission a build. If you want help working through this scoping process, AI Advisory runs fixed-fee discovery engagements designed to produce a costed, buildable plan rather than a slide deck.
Ready to put this into production? Book a discovery call.