The tooling for building LLM-backed applications has matured quickly. LangChain, the OpenAI API, FastAPI as a backend layer, vector databases for retrieval: you can get a working RAG pipeline or an AI-powered feature in front of users in days. The application layer is genuinely easier than it used to be.
What hasn’t changed is everything underneath. The database still needs constraints. The API still needs error handling. The deployment still needs to be repeatable. And LLM-backed applications introduce a new category of failure mode that traditional production checklists don’t cover: the non-deterministic layer.
This article is about both: the standard production infrastructure that AI-built applications tend to skip, and the LLM-specific concerns that most teams don’t think about until something goes wrong.
The non-deterministic layer
A traditional API call either succeeds or fails in a predictable way. An LLM call does neither reliably. The model might return a well-formed response, a malformed one, a refusal, or something that passes validation but is semantically wrong. Response times vary by an order of magnitude. The upstream provider has its own availability characteristics that are outside your control.
This changes how you need to think about error handling. The question is not just “did the call succeed?” but “is the response usable?”
A minimal defensive wrapper around an LLM call looks something like this:
```python
import os

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Retry transient failures up to three times with exponential backoff
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
async def call_llm(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
            json={
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": prompt}],
            },
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
```
This handles transient failures with exponential backoff. It does not handle response quality. That requires a separate validation step whose shape depends on what you need the response to do. If you are parsing structured output, validate the schema before it touches your database. If you are generating text for display, decide in advance what an unacceptable response looks like and have a fallback.
Timeouts and the request lifecycle
LLM calls are slow relative to everything else in a web request. A p95 response time of three to eight seconds is normal; p99 can be much higher. If your application makes an LLM call in the request path, you have a few options and each has tradeoffs.
The naive approach is to block: the user waits while the LLM responds, the request times out if it takes too long, and your web server worker is occupied throughout. This works at low scale and fails badly at high scale.
The better approach for most applications is to move LLM calls out of the synchronous request path entirely. The user submits a request, you return immediately with a job ID, a background worker processes the LLM call, and the client polls or receives a webhook when the result is ready. This requires a task queue (Celery with Redis, or a managed alternative like Inngest), but it decouples your web server response times from LLM latency.
```python
# FastAPI endpoint: accept the request, enqueue the job, return immediately
@router.post("/generate")
async def generate(request: GenerateRequest, background_tasks: BackgroundTasks):
    job_id = create_job(db, request)
    background_tasks.add_task(process_llm_job, job_id)
    return {"job_id": job_id, "status": "pending"}

# Separate task: runs outside the request lifecycle
async def process_llm_job(job_id: str):
    job = get_job(db, job_id)
    try:
        result = await call_llm(job.prompt)
        update_job(db, job_id, status="complete", result=result)
    except Exception as e:
        update_job(db, job_id, status="failed", error=str(e))
```
The tradeoff is complexity. If your application is simple and your LLM calls are fast enough for the use case, the synchronous approach is fine. The point is to make the choice deliberately, not to discover at load that your workers are all blocked waiting for the OpenAI API.
Retrieval and the vector database layer
RAG applications introduce a retrieval step before the LLM call: embed the query, find similar vectors, inject the results into the prompt.
The most common failure mode in this layer is retrieval quality degrading silently. If your embeddings are stale, your similarity threshold is wrong, or your chunking strategy is poor, the model receives low-quality context and produces low-quality output. Nothing raises an exception. The application appears to work; the responses are just worse than they should be.
Monitoring retrieval quality means logging what was retrieved for a sample of queries, not just whether the retrieval call succeeded. If you can compare retrieved chunks against expected results for a test set, do that in CI before deploying embedding or chunking changes.
A second RAG failure mode is cost. Embedding every document in your corpus on every re-index is expensive at scale. Cache embeddings where the source content hasn’t changed, and be deliberate about when re-indexing is triggered.
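One way to sketch that caching is to key embeddings on a hash of the source content, so unchanged documents cost nothing to re-index. The in-process dict below stands in for a persistent store, and `embed_fn` is whatever calls your embedding provider:

```python
import hashlib
from typing import Callable

# Illustrative cache; in production this is a table or key-value store
_embedding_cache: dict[str, list[float]] = {}

def embed_with_cache(text: str, embed_fn: Callable[[str], list[float]]) -> list[float]:
    """Only call the embedding provider when the content is new.

    The cache key is a hash of the content itself, so any edit to a
    document produces a miss and a fresh embedding.
    """
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]
```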
Observability for non-deterministic systems
Standard observability (response times, error rates, uptime) is necessary but not sufficient for LLM-backed applications. You also need visibility into the AI layer: what prompts are being sent, what responses are coming back, how often the model refuses or produces unusable output, and how response quality changes over time.
At minimum, log the following for every LLM call:
- The prompt (or a hash of it, if the content is sensitive)
- The model and any parameters
- The response, or the error if the call failed
- Latency
- Token counts, for cost tracking
This gives you a baseline for debugging and a record for detecting quality regressions when you upgrade models or change prompts. Prompt changes are deployments. Treat them as such.
The standard checklist still applies
None of the above replaces the standard production infrastructure concerns. LLM-backed applications still need the same things any production application needs: database constraints, proper error handling at every external boundary, a repeatable deployment process, and an auth layer that has been reviewed rather than assumed to work.
The difference is that AI-built applications tend to get to production faster, which means they arrive with less of this in place. The retrieval pipeline works. The LLM calls return something useful. And underneath, the database has no foreign key constraints, errors are swallowed silently, and the deployment is a manual SSH session.
If your application is at this stage, working and in front of users but built faster than its infrastructure could keep up, that is a production hardening problem. The AI layer is rarely where the reliability issues are. It is usually everything else.
Get in touch if that is where you are.