AI | 31 May 2026 | 5 min read

RAG Pipelines: How to Choose and Work With a Specialist Agency

What a RAG pipeline agency actually does, what good looks like, realistic costs and timelines, and how to brief one without wasting six weeks

By AI Advisory team

Retrieval-augmented generation has moved from research curiosity to standard production pattern in about 18 months. Most mid-market businesses now want one: an internal assistant grounded in their own documents, a customer-facing bot that answers from real product data, a support copilot that cites policy. The problem is that a working demo takes a weekend and a reliable production system takes six months, and the gap between the two is where most projects die. That is the gap a RAG pipeline agency exists to close.

This guide covers what a specialist RAG agency actually does, how to tell a good one from a glorified prompt-writer, what costs and timelines to expect, and how to brief the engagement so you do not lose the first six weeks to discovery theatre.

What a RAG pipeline agency actually builds

RAG, at its simplest, means a language model that answers questions by first retrieving relevant chunks of your own content and then generating a response grounded in those chunks. The architecture is well-documented - the original paper from Lewis et al. at Facebook AI Research (now part of Meta AI) set the template in 2020 - but the production engineering around it is where agencies earn their fee.

A working pipeline has roughly seven moving parts: ingestion (pulling content from SharePoint, Confluence, a CMS, a database, a ticket system), parsing and chunking (turning PDFs and HTML into clean text segments at the right granularity), embedding (converting chunks to vectors), storage (Postgres with pgvector, Pinecone, Weaviate, Qdrant, or similar), retrieval (semantic search, usually hybrid with BM25 keyword search bolted on), reranking (a second-pass model that reorders the top-k results), and generation (the LLM call with a carefully constructed prompt that includes retrieved context, refusal instructions, and citation requirements).

A specialist agency builds, evaluates, and operates all of this as one system. Sub-tasks an agency typically owns: building the ingestion connectors and keeping them synced as source documents change; designing the chunking strategy (this matters more than most people realise - chunk too small and you lose context, chunk too large and retrieval gets noisy); choosing and tuning the embedding model; building an evaluation harness so you can actually measure whether the system is getting better; designing refusal patterns so the bot says "I don't know" instead of inventing answers; and instrumenting the whole thing so you can debug a bad response three months after launch.

The signs of a serious RAG agency

The market is noisy. A lot of agencies that sold WordPress builds in 2022 now sell "AI solutions," and the gap between marketing and capability is wide. A few signals separate the practitioners from the slideware shops.

They lead with evaluation, not architecture. A serious agency will ask, within the first conversation, how you plan to measure whether the system is working. They will talk about retrieval metrics (recall@k, MRR), answer quality metrics (faithfulness, relevance, groundedness), and how they will build a golden test set from your real questions. If the first deck is all about vector databases and not about how you will know it works, you are talking to a hobbyist.
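
As a rough illustration of the retrieval metrics named above, here is a minimal sketch of recall@k and MRR computed over a golden test set. The data layout (a list of dicts with `retrieved` and `relevant` keys) and the chunk IDs are hypothetical, chosen only for this example; tools like Ragas or LangSmith have their own formats.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden_set: List[dict], k: int = 5) -> Dict[str, float]:
    """Average recall@k and MRR (mean reciprocal rank) over all questions."""
    n = len(golden_set)
    recall = sum(recall_at_k(q["retrieved"], q["relevant"], k) for q in golden_set) / n
    mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"]) for q in golden_set) / n
    return {f"recall@{k}": recall, "mrr": mrr}


# Hypothetical golden set: what the retriever returned (in ranked order)
# versus the chunks a human marked as relevant for each question.
golden_set = [
    {"retrieved": ["c7", "c2", "c9"], "relevant": {"c2"}},
    {"retrieved": ["c1", "c4", "c5"], "relevant": {"c4", "c8"}},
]
scores = evaluate(golden_set, k=3)
print(scores)  # {'recall@3': 0.75, 'mrr': 0.5}
```

These two numbers only cover retrieval. Answer-quality metrics such as faithfulness need a judge (human or model) and are not shown here.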

They have opinions about chunking. Ask how they chunk a 200-page PDF policy document versus a knowledge base of 800-word support articles. The right answer involves semantic chunking, parent-document retrieval, or hierarchical strategies depending on the source. The wrong answer is "we split by 1000 characters."
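
To make parent-document retrieval concrete, here is a minimal sketch under simple assumptions: each document is already split into named sections, child chunks are paragraphs separated by blank lines, and the search step itself is out of scope. The `Chunk` class, function names, and sample policy text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    chunk_id: str
    parent_id: str
    text: str


def chunk_with_parents(sections: Dict[str, str]) -> List[Chunk]:
    """Split each section (the parent) into paragraph-level child chunks."""
    chunks: List[Chunk] = []
    for section_id, body in sections.items():
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        for i, paragraph in enumerate(paragraphs):
            chunks.append(Chunk(f"{section_id}#{i}", section_id, paragraph))
    return chunks


def expand_to_parents(hit_ids: List[str], chunks: List[Chunk],
                      sections: Dict[str, str]) -> List[str]:
    """Map matched child chunks to their full parent sections, de-duplicated in hit order."""
    parent_of = {c.chunk_id: c.parent_id for c in chunks}
    ordered_parents: List[str] = []
    for hit in hit_ids:
        parent = parent_of[hit]
        if parent not in ordered_parents:
            ordered_parents.append(parent)
    return [sections[p] for p in ordered_parents]


# Hypothetical policy document, already split into sections.
sections = {
    "refunds": "Refunds are issued within 14 days.\n\nRefunds over £500 need manager approval.",
    "cancellation": "Customers may cancel at any time.",
}
chunks = chunk_with_parents(sections)

# Suppose search matched both refund paragraphs: the model still receives
# the refunds section once, in full, rather than two fragments.
context = expand_to_parents(["refunds#1", "refunds#0"], chunks, sections)
```

The small chunks are what you embed and search; the larger parent is what the model reads. That is one way to avoid the "too small loses context, too large gets noisy" trade-off.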

They run hybrid retrieval by default. Pure semantic search misses exact-match terms (product codes, error codes, names). Pure keyword search misses paraphrases. Production systems run both and combine the scores. If an agency only talks about vector similarity, they have not shipped enough RAG to know better.
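
One common way to combine the two result lists is reciprocal rank fusion (RRF). It fuses rank positions rather than raw scores, which sidesteps the fact that BM25 and cosine similarity are on different scales. The sketch below is an illustration, not the only approach; the document IDs are hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document earns 1 / (k + rank) per list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the query "refund for error ERR-4012".
keyword_hits = ["err-4012", "billing-faq", "refund-policy"]       # e.g. from BM25
semantic_hits = ["refund-policy", "err-4012", "chargeback-guide"]  # e.g. from vector search

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
# ['err-4012', 'refund-policy', 'billing-faq', 'chargeback-guide']
```

Documents that appear in both lists rise to the top, while an exact-match hit that only keyword search found (here `billing-faq`) still survives into the candidate set for reranking.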

They show you a working evaluation harness on day one. Tools like Ragas, LangSmith, and TruLens exist precisely so you can measure RAG quality systematically. An agency that does not run one of these (or a custom equivalent) is shipping by vibes.

They talk about hallucination as an engineering problem, not a marketing one. Good RAG reduces hallucination but does not eliminate it. A serious agency will discuss faithfulness scoring, citation enforcement, refusal prompts, and the trade-off between recall and precision. They will not promise "zero hallucinations."
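
A minimal sketch of what refusal prompts and citation enforcement can look like in code. The refusal wording, the `[n]` citation format, and both function names are assumptions made for this example. The check only confirms that citations exist and point at a retrieved source; it does not confirm that the cited source supports the claim, which is what faithfulness scoring is for.

```python
import re
from typing import List

REFUSAL = "I don't know based on the documents I can access."


def build_prompt(question: str, chunks: List[str]) -> str:
    """Number the retrieved chunks and instruct the model to cite or refuse."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered sources below. "
        "Cite every claim as [n]. "
        f"If the sources do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )


def enforce_citations(answer: str, num_chunks: int) -> str:
    """Pass refusals through; otherwise require at least one citation, all in range."""
    if answer.strip() == REFUSAL:
        return answer
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    if not cited or any(n < 1 or n > num_chunks for n in cited):
        return REFUSAL
    return answer


prompt = build_prompt("How long do refunds take?", ["Refunds are issued within 14 days."])

# An answer with a valid citation passes; an uncited one is replaced by the refusal.
print(enforce_citations("Refunds take 14 days [1].", num_chunks=1))
print(enforce_citations("Refunds take 30 days.", num_chunks=1))
```

Swapping an uncited answer for a refusal trades recall for precision: the bot will sometimes decline questions it could have answered, which is usually the right default for policy content.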

They have a defensible position on the stack. There is no single right stack, but there are wrong stacks. A defensible default for mid-market UK work is Python plus FastAPI, Postgres with pgvector for storage (saves you a vendor relationship), OpenAI or Anthropic for generation, an open-source embedding model or one of OpenAI's text-embedding-3 models, and a thin orchestration layer in LangChain or LlamaIndex. An agency that insists on a proprietary platform you cannot inspect should be your last choice.

What it costs and how long it takes

The honest answer is that pricing varies more than most service lines because the scope variance is huge. A single-source RAG chatbot over a 500-document knowledge base is a different project from a multi-tenant assistant that retrieves across SharePoint, Salesforce, a ticketing system, and a product database with row-level permissions.

Rough UK market ranges as of 2026:

Proof of concept (4-6 weeks): £15,000-£35,000. One source, basic evaluation, demo-quality UI. Useful for validating the use case and getting stakeholder buy-in.
Production v1 (10-16 weeks): £40,000-£120,000. Multiple sources, real ingestion pipeline, evaluation harness, refusal patterns, basic observability, deployed into your environment with auth.
Enterprise rollout (4-9 months): £120,000-£400,000+. Row-level security, multi-tenancy, complex permission inheritance, full observability, A/B testing infrastructure, SLAs.
Ongoing operation: £4,000-£20,000 per month retainer covers monitoring, re-evaluation as content changes, prompt tuning, model upgrades, and incident response. Skip this and the system degrades within six months.

The retainer matters more than people expect. RAG systems are not deploy-and-forget. Source content changes, models get deprecated (OpenAI, for example, has already retired its first-generation embedding models), user questions drift, and what passed evaluation in week one fails by month four. Budget for the ongoing cost or do not start.
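
To see why the retainer deserves its own budget line, here is the arithmetic for a Production v1 build plus its first year of operation, using the ranges quoted above. It assumes the retainer is paid for a full 12 months after go-live, which is an illustrative simplification; real contracts may phase it differently.

```python
from typing import Tuple

# Ranges quoted above, in GBP.
PRODUCTION_V1_BUILD = (40_000, 120_000)
MONTHLY_RETAINER = (4_000, 20_000)


def first_year_cost(build: Tuple[int, int], retainer: Tuple[int, int],
                    months: int = 12) -> Tuple[int, int]:
    """Return the (low, high) total for the build plus `months` of retainer."""
    low = build[0] + retainer[0] * months
    high = build[1] + retainer[1] * months
    return low, high


low, high = first_year_cost(PRODUCTION_V1_BUILD, MONTHLY_RETAINER)
print(f"Build plus 12 months of operation: £{low:,} to £{high:,}")
# Build plus 12 months of operation: £88,000 to £360,000
```

At both ends of the range, a year of retainer (£48,000 to £240,000) costs more than the build it supports.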

Build vs buy vs hybrid

Before commissioning a custom RAG build, it is worth checking whether you actually need one. The build-vs-buy decision has three honest options.

Buy a platform. Tools like Glean, Guru, and Microsoft 365 Copilot offer pre-built RAG over standard enterprise sources. If your need is "search across our SharePoint and Slack with a chat interface" and you do not need bespoke logic, a platform costs less and ships faster. Per-seat pricing typically runs £15-£40 per user per month.

Build custom. Choose this when you need bespoke retrieval logic (e.g. permissions tied to your CRM hierarchy), when the assistant is customer-facing and brand-critical, when you need to embed it deeply into your own product, or when data residency rules out a SaaS option. Custom builds also win when you have unusual sources - legacy databases, proprietary file formats, niche vertical content.

Hybrid. An increasingly common pattern: use a platform for internal employee search, build custom for the customer-facing or product-embedded use case. A good agency will tell you when buying is the right call, even though it costs them the engagement.

The McKinsey State of AI 2024 report found that organisations seeing the highest return from generative AI were those who customised or built bespoke systems for high-value workflows, rather than relying solely on off-the-shelf assistants. That tracks with what we see in practice: platforms are fine for horizontal productivity, custom wins for anything close to your core value proposition.

Compliance, security, and the UK angle

If you are processing personal data through a RAG pipeline - and most internal assistants will, even incidentally - ICO guidance on AI and data protection is your starting point. The key questions an agency should be able to answer cleanly:

Where does data sit during inference? If you are calling OpenAI's API from the UK or EU, you need to understand the data processing terms and whether you are on a zero-retention agreement. The same applies to Anthropic, Cohere, and any other model vendor.
Where do embeddings sit? Embeddings are derived data but in some interpretations still personal data if reversible. Storing them in a UK or EU region matters.
What gets logged? Prompt and completion logs are often a compliance gap. Agencies should propose redaction at log time, not just at retrieval.
How do you handle access control? If user A should not see documents available to user B, retrieval must enforce that. Bolting permissions on after the fact is painful.
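What is the DPIA story? UK GDPR requires a Data Protection Impact Assessment wherever processing is likely to result in a high risk to individuals, and many internal RAG deployments touching personal data will either meet that bar or come close enough that doing one is the safer course. A good agency will help you scope it.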
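
Two of the questions above, access control and logging, translate directly into code. The sketch below is illustrative only: the `Chunk` class, the group names, and the email-only redaction pattern are assumptions for this example. Real redaction needs to cover far more than email addresses, and real permissions usually come from the source system rather than a hard-coded set.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, List, Set

# Deliberately simple pattern: email addresses only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: FrozenSet[str]


def filter_by_access(candidates: List[Chunk], user_groups: Set[str]) -> List[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in candidates if c.allowed_groups & user_groups]


def redact_for_log(text: str) -> str:
    """Mask email addresses before a prompt or completion is written to logs."""
    return EMAIL.sub("[EMAIL]", text)


# Hypothetical candidates returned by retrieval, before any ranking or generation.
candidates = [
    Chunk("hr-1", "Salary bands for 2026", frozenset({"hr"})),
    Chunk("kb-1", "How to reset a password", frozenset({"hr", "support"})),
]
visible = filter_by_access(candidates, {"support"})
print([c.chunk_id for c in visible])  # ['kb-1']

print(redact_for_log("Contact jane.doe@example.co.uk about invoice 4412"))
# Contact [EMAIL] about invoice 4412
```

The filter runs before anything reaches the model, so a restricted chunk can never leak into an answer. In production the same condition is usually pushed down into the vector store query rather than applied afterwards in Python.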

For regulated industries - financial services, healthcare, legal - layer on the relevant regulator's expectations. The FCA's feedback statement on AI sets out their current thinking for financial services. None of this is a reason not to build; it is a reason to build with someone who has done it before.

How to brief a RAG agency without wasting six weeks

The single biggest cause of slow RAG projects is unclear scoping. You can compress the first six weeks to two if you arrive with the following:

A specific use case with a specific user. Not "an AI assistant for the business." Instead: "a tool for our 40 customer support agents to answer billing questions by searching across our knowledge base, Stripe, and the customer's ticket history." Narrow scope ships. Broad scope drifts.

A list of sources with rough volumes. Document count, total size, update frequency, access method. "We have about 1,200 articles in Zendesk, updated weekly, accessible via the Zendesk API" is briefable. "We have a lot of content" is not.

20-50 real questions from real users. These become the seed of your evaluation set. Without them, the agency invents synthetic questions and you optimise for the wrong thing.

A baseline answer quality bar. What does "good" look like? Is it "correct answer with citation 80% of the time" or "never invents a policy that does not exist"? Both are valid; they lead to different builds.

Clarity on who runs it after go-live. If you have an internal team who will take over, the build should be transparent and documented. If the agency will operate it on retainer, the build can use more sophisticated tooling.

A decision on data residency early. EU-only, UK-only, or global. This constrains the model vendor and the hosting region and is expensive to change later.

The best engagements start with a short paid discovery sprint, typically one to two weeks, that produces a costed technical specification before any build commitment. Treat agencies that refuse to do this with suspicion - either they are not used to scoping properly or they are hoping to use your project as their learning exercise.

Frequently asked questions

How is a RAG pipeline agency different from a general AI consultancy?

A general AI consultancy typically advises on strategy, vendor selection, and roadmap, often without building. A RAG pipeline agency builds and operates the system. The distinction matters because RAG quality is dominated by engineering decisions - chunking, retrieval tuning, evaluation, prompt design - that only show up once you start shipping. Consultancies that have not shipped production RAG tend to underestimate the operational work and overestimate how much off-the-shelf tooling will do for you. If you need a roadmap, hire a consultancy. If you need a working system, hire a build agency, ideally one that does both under one roof.

Can we build a RAG pipeline in-house instead?

You can, and many teams do. The honest test is whether you have at least one engineer with production ML or applied AI experience, a willingness to invest 3-6 months on a first version, and an operational plan for the next 18 months. Tutorials make RAG look like a weekend project; production RAG is closer to building a search engine combined with a content pipeline. In-house works well when AI is core to your product and you want the capability long-term. It fails when it is a side project for an already-overloaded platform team. A common pattern: agency builds v1 and trains your team, who then take it over after six months.

How long before we know if it is working?

You should have a working prototype within 4-6 weeks and meaningful evaluation results within 8-10 weeks. "Working" at week six means real users can ask real questions and get grounded answers most of the time. "Production-ready" usually lands at 12-16 weeks once you have refusal patterns, observability, security review, and the evaluation harness running continuously. Be wary of agencies promising production in four weeks - that timeline only works for the simplest single-source use cases, and even then you are skipping the evaluation work you will regret in month three.

What is the biggest hidden cost?

Content preparation. Most organisations discover that their source documents are not as clean as they thought: PDFs with terrible OCR, knowledge base articles that contradict each other, deprecated policies still indexed, internal jargon that does not match how users ask questions. Cleaning and structuring source content is often 20-40% of project effort and is rarely scoped accurately upfront. The second hidden cost is ongoing evaluation - building the golden test set is one thing; maintaining it as your product and content change is a quarterly job that someone has to own.

Will RAG be obsolete when context windows get bigger?

No, though the architecture will keep evolving. Long-context models (Gemini 1.5, Claude with 200k+ tokens) reduce the need for retrieval in some cases but introduce their own problems: cost scales with input size, latency increases, and accuracy degrades on "needle in a haystack" tasks at the limits of the context window. For corpora of any meaningful size - hundreds of thousands of documents, millions of records - retrieval is still the only economic option. Expect hybrid patterns where retrieval narrows to a few thousand candidates and a long-context model reasons over them. The skill of building retrieval pipelines is not going away.

What does ongoing maintenance actually involve?

Four recurring jobs. First, content sync: as source documents change, the index needs to reflect them, with appropriate deletion of stale chunks. Second, evaluation: re-running the golden test set monthly and investigating regressions. Third, prompt and retrieval tuning as user behaviour shifts - real users ask questions you never anticipated, and the system needs to adapt. Fourth, model and infrastructure upgrades: embedding models get deprecated, generation models improve, vector databases release new features. A typical retainer covers a half to one day per week of senior engineering plus on-call response for incidents.
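
The first of those jobs, content sync, can be sketched as a fingerprint comparison: hash each source document, compare against the hash stored when it was last indexed, and work out what to re-embed and what to delete. The function names and document IDs are hypothetical, and the sketch assumes you can list every source document on each run.

```python
import hashlib
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Stable content hash used to detect changed documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source: Dict[str, str], indexed: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare current source text against fingerprints stored at last index time.

    source:  doc_id -> current document text
    indexed: doc_id -> fingerprint recorded when the document was last indexed
    """
    to_upsert = sorted(d for d, text in source.items() if indexed.get(d) != fingerprint(text))
    to_delete = sorted(d for d in indexed if d not in source)
    return {"upsert": to_upsert, "delete": to_delete}


# Hypothetical state: policy-b was edited, policy-c was withdrawn, policy-d is new.
indexed = {
    "policy-a": fingerprint("v1"),
    "policy-b": fingerprint("old wording"),
    "policy-c": fingerprint("withdrawn policy"),
}
source = {"policy-a": "v1", "policy-b": "new wording", "policy-d": "brand new"}

plan = plan_sync(source, indexed)
print(plan)  # {'upsert': ['policy-b', 'policy-d'], 'delete': ['policy-c']}
```

The delete list is the part that gets forgotten: without it, chunks from withdrawn policies stay in the index and keep being retrieved.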

How do we measure ROI on a RAG build?

The cleanest measures are task-specific. For support assistants: deflection rate, average handle time, CSAT on AI-handled tickets. For internal knowledge tools: time-to-answer for common questions, reduction in escalations to subject-matter experts. For sales enablement: ramp time for new hires, win rate on RFPs where the assistant was used. Avoid generic "productivity" metrics - they are unfalsifiable and erode trust. Set two or three specific operational metrics before build starts, baseline them, and review at 90 days. BCG's 2024 research on AI value found that the top quartile of AI adopters were 1.5x more likely to track outcome-based metrics rather than usage metrics, and our own engagements point the same way.

What should we own versus what should the agency own?

You should own the business logic, the evaluation criteria, the source content, and the relationships with internal stakeholders. The agency should own the technical architecture, the build, the operational tooling, and (during the engagement) the day-to-day system performance. Where it gets ambiguous: prompt engineering is collaborative - you know the domain, they know the patterns. Evaluation set curation is yours, but the harness is theirs. Aim for a contract that transfers code, documentation, and operational runbooks to you at exit, with no proprietary lock-in. If an agency wants to keep your prompts or your evaluation data as trade secrets, walk away.

Getting started

RAG is a maturing discipline, not a hype cycle. The teams winning with it are the ones who treat it as a product engineering problem - with evaluation, observability, and an honest operational plan - rather than a magic demo. If you are scoping a build, the next step is usually a short, paid discovery sprint that ends with a costed technical specification and a clear go or no-go decision. AI Advisory runs these as a fixed-fee two-week engagement for mid-market UK clients; if that fits your situation, get in touch and we will tell you honestly whether a custom build is the right call for your use case.