AI & Development · 9 min read · Fabiano Simm

How to integrate LLMs into an existing SaaS without rewriting the stack

A technical, pragmatic guide for founders and CTOs who want to add LLM capabilities to a product already in production — without big-bang rewrites or vendor promises.

LLM · SaaS · integration · Vercel AI SDK · OpenAI · Anthropic

The problem with the 'big-bang' approach

When a SaaS decides to adopt LLMs, the first temptation is to redesign everything: new vector database, new infrastructure, new backend. Six months later, the product still hasn't shipped, the original roadmap has stalled, and technical debt has piled up.

There's a saner alternative: incremental integration. Identify one high-impact feature, add the LLM as an isolated service layer, measure, and only then expand. It's the strangler fig principle applied to AI.

This guide assumes you have a SaaS in production — Node.js, Python, or Rails, it doesn't matter much — and you want to add intelligence without stopping the product.

Step 1: Map use cases by complexity

Before touching code, map use cases on a complexity/value axis. Low complexity and high value come first:

  • Structured text generation from existing data (reports, summaries, automated emails)
  • Data classification and enrichment (categorising tickets, extracting entities from free-form inputs)
  • Internal search assistant with RAG over your documentation or customer base
  • Contextual suggestions within an existing editor or dashboard

Step 2: Isolate the LLM as an internal service

The recommended architecture for getting started: a dedicated module (or lightweight microservice) that encapsulates all prompting logic, output parsing, and fallback. The rest of the application calls it like any other service.

This ensures business code isn't coupled to the model vendor. If you switch from OpenAI to Anthropic tomorrow, the change is confined to a single module.

In TypeScript, the Vercel AI SDK is the most practical choice: it abstracts providers, offers streaming out of the box, and has native support for tool calling and structured output with Zod.
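A minimal sketch of that isolation, written in plain TypeScript with no SDK dependency so it runs as-is. The names `LlmProvider`, `buildSummaryPrompt`, `createSummaryService`, and the two stub providers are hypothetical; in a real module each stub would wrap a vendor client, for instance through the Vercel AI SDK.

```typescript
// The only contract business code depends on. Hypothetical interface, not a vendor API.
interface LlmProvider {
  readonly name: string;
  complete(prompt: string): Promise<string>;
}

// Stub providers standing in for real vendor adapters (no network calls).
const stubOpenAi: LlmProvider = {
  name: "openai",
  complete: async (prompt) => `[openai] ${prompt.length} chars`,
};
const stubAnthropic: LlmProvider = {
  name: "anthropic",
  complete: async (prompt) => `[anthropic] ${prompt.length} chars`,
};

// Prompting logic lives in the module, not in API handlers.
function buildSummaryPrompt(text: string): string {
  return `Summarise the following text in one sentence:\n\n${text}`;
}

// The internal service: prompt construction, provider call, and fallback in one place.
function createSummaryService(provider: LlmProvider, fallback = "Summary unavailable.") {
  return {
    async summarise(text: string): Promise<string> {
      try {
        return await provider.complete(buildSummaryPrompt(text));
      } catch {
        // Degrade gracefully instead of failing the whole request.
        return fallback;
      }
    },
  };
}

// Swapping vendors is a one-line change where the service is wired up.
const summaryService = createSummaryService(stubOpenAi);
```

The rest of the application only ever sees `summaryService.summarise(...)`, so replacing `stubOpenAi` with `stubAnthropic` touches nothing outside this module.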

Step 3: Structured output before any logic

The biggest mistake we see in LLM integrations in production: trusting free text from the model and trying to parse it manually with regex. Works in demos, fails with real users.

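The solution: define the output schema with Zod (TypeScript) or Pydantic (Python) from day one. The main providers support JSON mode, structured outputs, and function calling to constrain the response; validate it against the schema in your own code as well, since JSON mode alone only guarantees syntactically valid JSON.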

Practical example: if the LLM will classify a support ticket, the output must be `{ category: 'billing' | 'technical' | 'other', confidence: number, summary: string }`, validated by the schema and never a free-form string.
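To make the ticket example concrete, here is a dependency-free sketch of the validation step. It is a hand-written stand-in for what a Zod schema would do, kept library-free so it runs on its own. The function name `parseTicketClassification` and the rule that `confidence` must lie between 0 and 1 are assumptions for illustration.

```typescript
type Category = "billing" | "technical" | "other";

interface TicketClassification {
  category: Category;
  confidence: number;
  summary: string;
}

type ParseResult =
  | { ok: true; value: TicketClassification }
  | { ok: false; error: string };

const CATEGORIES: readonly string[] = ["billing", "technical", "other"];

function isCategory(value: unknown): value is Category {
  return typeof value === "string" && CATEGORIES.includes(value);
}

// Parses raw model output and rejects anything that does not match the schema.
function parseTicketClassification(raw: string): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "output is not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "expected a JSON object" };
  }
  const { category, confidence, summary } = data as Record<string, unknown>;
  if (!isCategory(category)) {
    return { ok: false, error: "category must be billing, technical, or other" };
  }
  // Assumption: confidence is a probability-like score in [0, 1].
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    return { ok: false, error: "confidence must be a number between 0 and 1" };
  }
  if (typeof summary !== "string" || summary.length === 0) {
    return { ok: false, error: "summary must be a non-empty string" };
  }
  return { ok: true, value: { category, confidence, summary } };
}
```

Returning a result object instead of throwing lets the caller decide what a rejected output means: retry the call, fall back to a default category, or queue the ticket for manual triage.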

Step 4: RAG only when dynamic context is truly needed

Retrieval-Augmented Generation (RAG) is powerful but adds complexity: embeddings, vector store, retrieval pipeline, relevance to calibrate. Only implement it when the use case requires context that changes frequently and is too large to fit in the context window.

For many B2B SaaS products, a prompt with static context (product, policies, FAQ) covers the first 80% of cases. Start simple.

When RAG is needed, pgvector on existing Postgres is the path with least overhead for small teams. Only migrate to Pinecone, Qdrant, or Weaviate when you have concrete reasons (scale, latency, advanced filters).
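For intuition about what the retrieval step computes, the sketch below ranks stored chunks by cosine similarity to a query embedding, entirely in memory. With pgvector the same ranking would be done in SQL (its `<=>` operator is cosine distance); the `Chunk` shape, the `topK` helper, and the toy two-dimensional vectors used in the usage note are hypothetical.

```typescript
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity: 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the k chunks most similar to the query embedding, best first.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

Given chunks with embeddings `[1, 0]`, `[0, 1]`, and `[0.9, 0.1]`, a query of `[1, 0]` with `k = 2` returns the first and third. A linear scan like this is only reasonable for a few thousand chunks; beyond that, an indexed store is the point of using pgvector.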

Step 5: Observability from the first deploy

LLMs in production without observability are an expensive black box. The minimum viable: log tokens used per request, latency, and a sample of prompts + outputs for manual auditing.

For teams that want to go further: Langfuse and LangSmith are among the most widely adopted tools for LLM pipeline tracing. Both have free plans sufficient to get started.

Without this visibility floor, it's impossible to tell if the model is regressing, where costs are growing, or what's failing in responses.
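The minimum described above can be sketched as a thin wrapper around whatever function performs the model call. Everything here is hypothetical naming (`withLogging`, `LlmCallLog`, the `sink` callback), and it assumes the wrapped call reports its own token counts, as provider responses generally do.

```typescript
interface LlmResult {
  text: string;
  inputTokens: number;
  outputTokens: number;
}

interface LlmCallLog {
  model: string;
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  // Present only for the sampled fraction of requests kept for manual audit.
  sample?: { prompt: string; output: string };
}

type LlmCall = (prompt: string) => Promise<LlmResult>;

// Wraps a model call so every successful request emits one log entry.
function withLogging(
  model: string,
  call: LlmCall,
  sink: (entry: LlmCallLog) => void,
  sampleRate = 0.05,
  random: () => number = Math.random,
): LlmCall {
  return async (prompt) => {
    const startedAt = Date.now();
    const result = await call(prompt);
    const entry: LlmCallLog = {
      model,
      latencyMs: Date.now() - startedAt,
      inputTokens: result.inputTokens,
      outputTokens: result.outputTokens,
    };
    // Keep full prompt and output for a small sample only, to limit storage and exposure.
    if (random() < sampleRate) {
      entry.sample = { prompt, output: result.text };
    }
    sink(entry);
    return result;
  };
}
```

The `sink` can start as a write to your existing application logs and later forward to a tracing tool. Note that this sketch logs successful calls only; failures would need a `catch` branch that records the error before rethrowing.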

What not to do

Avoid these patterns that create technical debt:

  • Inline prompts in the middle of API handlers — centralise them in versioned template files
  • Zero error handling for LLM calls — models fail, have rate limits, and have unpredictable latencies; treat them like external services that can fail
  • Fine-tuning as the first solution — in most cases, prompt engineering and RAG solve the problem without the cost and maintenance of fine-tuning
  • Storing LLM output in DB without validation — always validate the schema before persisting
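Treating the model like an external service that can fail can be sketched as a retry helper with exponential backoff and a fallback value. The helper name `callWithRetry` and the default attempt count and delay are assumptions, not recommendations from any provider.

```typescript
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a failing operation with exponential backoff, then returns the fallback.
async function callWithRetry<T>(
  operation: () => Promise<T>,
  fallback: T,
  options: RetryOptions = { maxAttempts: 3, baseDelayMs: 200 },
): Promise<T> {
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch {
      if (attempt < options.maxAttempts) {
        // Wait baseDelay, then 2x, then 4x, ... before the next attempt.
        await sleep(options.baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // All attempts failed: degrade instead of propagating the error to the user.
  return fallback;
}
```

In production you would likely retry only on errors that are worth retrying (rate limits, timeouts) and fail fast on the rest, such as an invalid API key; this sketch retries on every error for brevity.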

Frequently asked questions

Do I need to change my database to use LLMs?

Not necessarily. Most cases start with the existing DB. You only add pgvector (Postgres) or a separate vector store (Pinecone, Qdrant) when dynamic context is truly needed. Don't optimise prematurely.

OpenAI vs Anthropic vs open-source — which to choose?

For most European B2B SaaS: OpenAI GPT-4o for general quality, Anthropic Claude 3.5 Sonnet for long-text tasks and structured reasoning, and open-source models (Llama, Mistral) if your data must stay on-premise or you want to avoid per-token fees at volume. Start with a hosted API and only migrate to self-hosted when you have real usage data.

How do I prevent the model from making things up (hallucinations)?

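RAG (Retrieval-Augmented Generation) with citable sources is the most robust answer. The model is instructed to respond only from documents you control, and each answer can be traced back to a source, which reduces hallucinations without eliminating them. For critical flows, add output verification: schema validation with Zod, or a second model validating the response.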
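One cheap check that follows from "citable sources": if the model is asked to return the ids of the documents it used, you can verify that every cited id was actually retrieved. The `CitedAnswer` shape and helper names below are hypothetical, and the check confirms only that citations exist, not that the claims are true.

```typescript
interface Source {
  id: string;
  text: string;
}

// Assumed output shape: the model returns its answer plus the ids it relied on.
interface CitedAnswer {
  answer: string;
  citations: string[];
}

// Returns cited ids that do not match any retrieved document.
function ungroundedCitations(answer: CitedAnswer, retrieved: Source[]): string[] {
  const knownIds = new Set(retrieved.map((source) => source.id));
  return answer.citations.filter((id) => !knownIds.has(id));
}

// An answer counts as grounded only if it cites something and every citation is known.
function isGrounded(answer: CitedAnswer, retrieved: Source[]): boolean {
  return answer.citations.length > 0 && ungroundedCitations(answer, retrieved).length === 0;
}
```

An answer that fails this check can be withheld, regenerated, or shown with a warning, depending on how critical the flow is.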

How much does it cost per month at real usage volume?

It depends heavily on the model and the number of tokens per request. GPT-4o mini costs ~$0.15 per million input tokens; for most B2B SaaS, costs typically stay under €200/month in the first 6 months. Our article on real AI costs for SMEs goes deeper into the calculations.
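A back-of-the-envelope estimate is easy to script. In the sketch below, only the $0.15 input price comes from the text above; the output price and the traffic figures in the usage note are illustrative assumptions, so check your provider's current price list before relying on the result.

```typescript
interface Pricing {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

interface Usage {
  requestsPerDay: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
  daysPerMonth: number;
}

// Monthly cost = (total tokens / 1,000,000) x price per million, for input and output.
function monthlyCostUsd(usage: Usage, pricing: Pricing): number {
  const requests = usage.requestsPerDay * usage.daysPerMonth;
  const inputMillions = (requests * usage.inputTokensPerRequest) / 1e6;
  const outputMillions = (requests * usage.outputTokensPerRequest) / 1e6;
  return inputMillions * pricing.inputUsdPerMillion + outputMillions * pricing.outputUsdPerMillion;
}

// Input price from the article; the output price is an assumption for illustration.
const assumedPricing: Pricing = { inputUsdPerMillion: 0.15, outputUsdPerMillion: 0.6 };

// Hypothetical traffic: 1,000 requests a day, 2,000 tokens in and 300 tokens out per request.
const assumedUsage: Usage = {
  requestsPerDay: 1000,
  inputTokensPerRequest: 2000,
  outputTokensPerRequest: 300,
  daysPerMonth: 30,
};

const estimate = monthlyCostUsd(assumedUsage, assumedPricing);
```

With these assumed numbers the month comes to 60 million input tokens and 9 million output tokens, or roughly $14. Prompt size is usually the lever that matters most: doubling the static context in every request doubles the input side of the bill.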

Next step

Integrating LLMs into an existing product? We help you evaluate the right approach for your stack — without selling hype.

Talk to Simmple