The uncomfortable truth: Most engineers who’ve built rock-solid distributed systems are now building their AI applications like they did in 2015 — and it’s costing them.
You know the drill. Async/await. Connection pooling. Non-blocking I/O. These patterns served us well. But when you’re orchestrating calls to LLMs, vector databases, and 10+ external APIs in a single request, the rules change. What felt like a “best practice” five years ago is now a latency cliff.
I learned this the hard way last month. We had a document processing service hitting P95 latencies of 8+ seconds. Our async code was textbook perfect. Our profiling showed… nothing obvious. Until I realized: we weren’t optimizing for the right thing.
The LLM Latency Trap
Traditional backend optimization assumes: “If I make my code non-blocking, throughput goes up and latency goes down.”
This holds when your own code dominates the request time. But LLM applications are radically different. Your own code might run in milliseconds, but you’re waiting on:
- LLM inference time: 500ms–3000ms per call (depending on model and load)
- Vector database queries: 100–500ms with cold indices
- Embedding generation: Another 200–800ms
- Chain-of-thought logic: Multiple sequential calls that seem parallel but aren’t
You can make your code beautifully async, but if you’re still calling these services sequentially — waiting for one to finish before starting the next — you’ve missed the point entirely.
Here’s what I see constantly: Engineers write elegant concurrent code but structure their business logic as if they’re still writing REST API handlers from 2010. The async/await syntax hides the sin.
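To make that concrete, here is a minimal sketch of the same three calls written both ways. The `call_service` helper and the service names are hypothetical stand-ins (an `asyncio.sleep` simulates the network wait), so treat this as an illustration rather than production code.

```python
import asyncio
import time


async def call_service(name: str, seconds: float) -> str:
    """Hypothetical stand-in for a network call; the sleep simulates the wait."""
    await asyncio.sleep(seconds)
    return name


async def sequential() -> list[str]:
    # Non-blocking, yet each await must finish before the next call starts.
    a = await call_service("profile", 0.05)
    b = await call_service("permissions", 0.05)
    c = await call_service("history", 0.05)
    return [a, b, c]


async def concurrent() -> list[str]:
    # The three calls do not depend on each other, so start them together.
    results = await asyncio.gather(
        call_service("profile", 0.05),
        call_service("permissions", 0.05),
        call_service("history", 0.05),
    )
    return list(results)


def timed(coro_fn) -> tuple[list[str], float]:
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start
```

Both versions are "async" and return the same data. Only the second overlaps the waits, so its wall-clock time is close to the slowest single call instead of the sum of all three.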
The Real Problem: Sequential Thinking in a Parallel World
Let’s say you’re building a search service over company docs. The flow looks like this:
- User asks a question
- Embed the query → 400ms
- Search vector DB for top-K results → 150ms
- Call LLM with context → 1200ms
- Return response
Total: ~1750ms. Not terrible, but not great either.
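One way to sanity-check a total like this is to treat the flow as a dependency graph and compute its longest chain. The sketch below does that for the three timed steps above; the step names and the `critical_path_ms` helper are my own labels, not part of any library.

```python
from functools import lru_cache

# Durations (ms) come from the flow above. DEPS lists what must finish first.
DURATION_MS = {"embed_query": 400, "vector_search": 150, "llm_call": 1200}
DEPS = {
    "embed_query": [],
    "vector_search": ["embed_query"],
    "llm_call": ["vector_search"],
}


def critical_path_ms(durations: dict[str, int], deps: dict[str, list[str]]) -> int:
    """Length of the longest dependency chain, i.e. the best-case latency."""

    @lru_cache(maxsize=None)
    def finish(step: str) -> int:
        # A step can start only once all of its dependencies have finished.
        earliest_start = max((finish(d) for d in deps[step]), default=0)
        return earliest_start + durations[step]

    return max(finish(step) for step in durations)


print(critical_path_ms(DURATION_MS, DEPS))  # 1750
```

Because every step here depends on the previous one, the critical path equals the plain sum. Any step you can add without a dependency on that chain costs nothing in latency as long as it is shorter than the chain itself.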
Now, here’s what most senior engineers miss: embedding isn’t the only thing you can do with the raw query. While the query is being embedded, you can run a BM25 lexical search on the user’s raw text in parallel. Both finish around the same time, and you feed both sets of results into the LLM.
Or better yet: don’t wait for the embedding at all. Optimistically start the BM25 half of a hybrid search while embedding happens in the background. In the example above, this can cut latency from 1750ms to about 1400ms, a 20% improvement, as long as the lexical results are good enough to start the LLM call before the vector search completes.
The code looks the same. Async/await. Non-blocking. But the architecture is completely different. You’re thinking in terms of: “What can I do in parallel?” not “Is this call non-blocking?”
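Here is a rough sketch of that optimistic hybrid flow. All four service functions are hypothetical stand-ins with scaled-down sleeps, and `extra_wait_s` is a knob I made up to express "how long are we willing to wait for the semantic results once lexical results are in."

```python
import asyncio


# Hypothetical stand-ins for real services; sleeps are scaled-down latencies.
async def embed(query: str) -> list[float]:
    await asyncio.sleep(0.04)
    return [float(len(query))]


async def vector_search(vector: list[float]) -> list[str]:
    await asyncio.sleep(0.015)
    return ["semantic match"]


async def bm25_search(query: str) -> list[str]:
    await asyncio.sleep(0.02)
    return ["lexical match"]


async def call_llm(query: str, context: list[str]) -> str:
    await asyncio.sleep(0.05)
    return f"answer built from {len(context)} passage(s)"


async def semantic_branch(query: str) -> list[str]:
    # Vector search really does depend on the embedding, so this stays sequential.
    return await vector_search(await embed(query))


async def answer(query: str, extra_wait_s: float) -> str:
    # Both branches need only the raw query, so start them at the same time.
    lexical_task = asyncio.create_task(bm25_search(query))
    semantic_task = asyncio.create_task(semantic_branch(query))
    lexical = await lexical_task
    try:
        # Give the semantic branch a bounded amount of extra time.
        semantic = await asyncio.wait_for(semantic_task, timeout=extra_wait_s)
    except asyncio.TimeoutError:
        semantic = []  # wait_for cancelled the branch; continue lexical-only
    return await call_llm(query, lexical + semantic)
```

With a generous `extra_wait_s` you get both result sets, and with a tiny one the LLM call starts as soon as the lexical results arrive. Where to set it is a quality versus latency trade-off you would have to measure on your own data.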
Speculative Execution and LLM-Aware Caching
Here’s a technique I don’t see enough teams using: speculative execution.
The moment you know the user’s intent (intent classification can happen very fast with a small model), you can start preparing data for multiple possible outcomes. If there’s a 70% chance they want recent posts, start fetching those while you’re still processing. If there’s a 30% chance they want analytics, prepare that too.
When the full inference returns, most of what you need is already in memory. Few or no extra network calls. Near-instant delivery.
This is alien to most backend engineers trained on Kafka and REST APIs. But it’s table stakes now.
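A minimal sketch of the idea, assuming a fast classifier has already produced rough intent probabilities. The fetchers, `resolve_intent`, and the `min_prob` cutoff are all hypothetical names for illustration.

```python
import asyncio


# Hypothetical data fetchers, one per possible intent.
async def fetch_recent_posts() -> str:
    await asyncio.sleep(0.03)
    return "recent posts"


async def fetch_analytics() -> str:
    await asyncio.sleep(0.03)
    return "analytics"


FETCHERS = {"recent_posts": fetch_recent_posts, "analytics": fetch_analytics}


async def resolve_intent(query: str) -> str:
    """Stand-in for the slower, authoritative intent decision."""
    await asyncio.sleep(0.05)
    return "analytics" if "metric" in query else "recent_posts"


async def handle(query: str, intent_probs: dict[str, float], min_prob: float = 0.25) -> str:
    # Start fetching for every intent that is likely enough to be worth the cost.
    speculative = {
        name: asyncio.create_task(FETCHERS[name]())
        for name, prob in intent_probs.items()
        if prob >= min_prob
    }
    intent = await resolve_intent(query)
    # Cancel the branches that lost so they stop consuming resources.
    for name, task in speculative.items():
        if name != intent:
            task.cancel()
    if intent in speculative:
        return await speculative[intent]
    # The guess missed: fall back to fetching on demand.
    return await FETCHERS[intent]()
```

The trade is extra load on downstream services in exchange for latency, so the cutoff matters. Speculating on everything is rarely worth it.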
Similarly, LLM response caching is criminally underutilized. Not naive caching — I mean semantic caching. If the user asks “What are our Q2 metrics?” and then asks “What about Q2 revenue specifically?”, the second question’s embedding is close enough to the first that you can probably serve most of the context from cache, then do a small targeted update.
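Provider-side features like Anthropic’s prompt caching and Ollama’s KV cache give you complementary primitives: they reuse computation for repeated prompt content rather than matching similar questions. Very few teams actually use them strategically.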
The Tricky Part: Timeout and Error Handling in the Age of Uncertainty
Here’s where it gets complicated. When you have 6+ concurrent operations (search, embedding, classification, fallback models, feature prep, etc.), timeouts become a design problem, not a code detail.
Your P99 latency is now controlled by the slowest operation across all your parallel tasks. If even one of them hangs or times out, your entire response degrades.
Senior engineers know about timeouts, yes. But they often set a global timeout (e.g., “5 seconds”) and hope everything finishes. This doesn’t work with LLMs.
Instead, you need:
- Graduated timeouts: Different limits for different operations. Embedding models might get 1 second; LLM inference gets 3 seconds; search gets 500ms.
- Graceful degradation: If vector search times out, continue with full-text (lexical) results. If embedding times out, fall back to lexical matching. Don’t fail the whole request.
- Observability everywhere: You need to know which operation caused a timeout, how often it happens, and why. Logging P50/P95/P99 for each operation independently is non-negotiable.
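The first two points can be sketched with `asyncio.wait_for` and a per-operation budget table. The limits mirror the examples in the list above; `with_budget`, `retrieve`, and the slow search function are hypothetical names for illustration.

```python
import asyncio

# Per-operation limits in seconds, mirroring the examples in the list above.
TIMEOUTS_S = {"embedding": 1.0, "search": 0.5, "llm": 3.0}


async def with_budget(op_name, coro, fallback, timeouts=TIMEOUTS_S):
    """Await coro within its own budget; return (value, degraded_flag)."""
    try:
        return await asyncio.wait_for(coro, timeout=timeouts[op_name]), False
    except asyncio.TimeoutError:
        # wait_for has cancelled the slow call; hand back the fallback instead.
        return fallback, True


async def slow_vector_search() -> list[str]:
    """Hypothetical vector search that takes 200ms."""
    await asyncio.sleep(0.2)
    return ["vector hit"]


async def retrieve(timeouts=TIMEOUTS_S):
    # If vector search blows its budget, degrade to lexical results.
    return await with_budget("search", slow_vector_search(), ["lexical hit"], timeouts)
```

Returning the `degraded` flag alongside the value is a deliberate choice here: it lets the caller log which operation fell back, which feeds directly into the observability point.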
I’ve seen teams ignore this and hit production incidents where one flaky embedding service dragged down the entire application’s latency. They had good async code. They had good monitoring. But they didn’t think about the system holistically.
The Pragmatic Path Forward
So what should you actually do?
- Map your critical path. Draw out every operation in your LLM workflow. Identify which are sequential dependencies and which can run in parallel. This alone will often reveal 20–30% of your latency waste.
- Measure before you optimize. Instrument every external call: LLM, vector DB, embedding service, everything. Percentile latencies matter more than averages.
- Batch and parallelize aggressively. If you’re making 10 calls, make them concurrently. If you can embed multiple queries at once, do it.
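- Implement semantic caching. Start with a simple approach: store each query’s embedding next to its result, and serve the cached result when a new query’s embedding is within a similarity threshold. (Hashing the embedding only catches exact repeats.) Graduate to more sophisticated schemes as your traffic grows.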
- Design for graceful degradation. Every external service can fail. Your system should still work, just less ideally. Cache, fallbacks, and partial results are your friends.
- Monitor latency by operation, not just end-to-end. You can’t optimize what you don’t see. Distributed tracing is not optional anymore.
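If you don’t have tracing in place yet, per-operation percentiles are cheap to start collecting in-process. The sketch below uses only the standard library; `LatencyRecorder` is a made-up helper, and a real setup would more likely export these timings to a tracing or metrics system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles


class LatencyRecorder:
    """Collects per-operation timings and reports P50/P95/P99 in milliseconds."""

    def __init__(self) -> None:
        self.samples_ms: dict[str, list[float]] = defaultdict(list)

    @contextmanager
    def track(self, operation: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the timing even if the wrapped call raised.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.samples_ms[operation].append(elapsed_ms)

    def percentiles(self, operation: str) -> dict[str, float]:
        # quantiles(n=100) returns 99 cut points; it needs at least two samples.
        cuts = quantiles(self.samples_ms[operation], n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Usage is a plain `with recorder.track("embedding"):` around each external call, and it works the same way around an `await` inside async code.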
The Lesson
The async/await revolution did amazing things for throughput. But LLM applications aren’t throughput-constrained — they’re latency-constrained. Your user doesn’t care if you can handle 10,000 concurrent requests if each one takes 5 seconds.
Most senior engineers know this intellectually. But then they write code that sequences operations, hides them behind async/await, and ships it. The code is correct. The system is fast by 2015 standards. But it’s slow in 2026.
The fix isn’t better async code. It’s rethinking the architecture. Parallelizing ruthlessly. Caching semantically. Failing gracefully. And obsessively measuring where time is actually spent.
Do that, and your LLM applications will feel instant. Ignore it, and they’ll feel bloated — no matter how good your async code is.