The most expensive API call is the one you make twice.
In production AI applications, many requests are near-duplicates of ones that have already been answered. A user asks the same question in slightly different words. A code assistant gets called on the same function twice in an afternoon. A summarization pipeline reprocesses documents that have barely changed.
Semantic caching addresses this by matching new requests against previous ones based on meaning, not exact text.
How it works
When a request comes in, we generate an embedding: a vector representation of the prompt's meaning, produced by a small local embedding model. We then run a nearest-neighbor search against our cache index to find the most similar previous request.
If the similarity score of that closest match exceeds the threshold (0.92 by default), we return its cached response immediately. The API never gets called.
If the score falls below the threshold, we compress and forward the request as usual, then store the result in the cache with its embedding for future lookups.
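The lookup-then-store flow described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `embed` is a toy bag-of-words stand-in for the local embedding model, the nearest-neighbor search is a brute-force scan rather than a real index, and `SemanticCache`, `answer`, and `call_api` are hypothetical names. The compression step is omitted.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by prompt embeddings."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def lookup(self, prompt: str):
        """Return the nearest entry's response, or None on a miss."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:  # brute-force nearest neighbor
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        # A hit requires the best score to exceed the threshold.
        return best_response if best_score > self.threshold else None

    def store(self, prompt: str, response: str) -> None:
        """Save a response together with its prompt embedding."""
        self.entries.append((embed(prompt), response))


def answer(cache: SemanticCache, prompt: str, call_api):
    """Serve from the cache when possible; otherwise call the API and store."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached  # cache hit: the API is never called
    response = call_api(prompt)
    cache.store(prompt, response)
    return response
```

With a real embedding model, differently worded prompts with the same meaning would land close together; the toy `embed` here only rewards shared words, so it is useful for showing the control flow, not the matching quality.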
Choosing the threshold
The threshold is the single most important parameter. Set it too high and you will miss valid cache hits. Set it too low and you will return incorrect responses for semantically different queries.
In our testing, 0.92 strikes the right balance for code-related prompts. For conversational or open-ended prompts, we recommend 0.95. The threshold is configurable per request type in your .woozcode.json.
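To make the effect of per-type thresholds concrete, here is a small sketch. The request-type names (`code`, `conversational`) and the helper functions are assumptions for illustration and do not reflect the actual .woozcode.json schema; only the two threshold values come from the text above.

```python
DEFAULT_THRESHOLD = 0.92

# Hypothetical request-type names; the values are the ones recommended above.
THRESHOLDS = {
    "code": 0.92,
    "conversational": 0.95,
}


def threshold_for(request_type: str) -> float:
    """Look up the threshold for a request type, falling back to the default."""
    return THRESHOLDS.get(request_type, DEFAULT_THRESHOLD)


def is_hit(score: float, request_type: str) -> bool:
    """A cached entry counts as a hit only if its score exceeds the threshold."""
    return score > threshold_for(request_type)


# The same similarity score can be a hit for one type and a miss for another.
print(is_hit(0.93, "code"))            # True
print(is_hit(0.93, "conversational"))  # False
```

The stricter conversational threshold trades some hits for a lower risk of returning an answer to a question that only looks similar.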
What to expect
In a typical development session with heavy iteration on the same codebase, semantic cache hit rates run between 15% and 35%. That range sounds modest, but each hit eliminates an API call entirely, along with its cost and latency.
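As a quick back-of-the-envelope check, the hit-rate range translates into avoided calls like this. The session size of 200 requests is a made-up figure for illustration, and `calls_avoided` is a hypothetical helper.

```python
def calls_avoided(total_requests: int, hit_rate: float) -> int:
    """Estimate how many API calls a given cache hit rate eliminates."""
    return round(total_requests * hit_rate)


session_requests = 200  # hypothetical session size

low = calls_avoided(session_requests, 0.15)
high = calls_avoided(session_requests, 0.35)
print(f"{low} to {high} of {session_requests} calls never reach the API")
# prints "30 to 70 of 200 calls never reach the API"
```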
The cache is stored locally as a flat binary index. It persists across sessions and can be cleared with wooz cache clear. You can also inspect cache statistics with wooz cache stats.
Limitations
Semantic caching is not suitable for prompts that depend on real-time data, user-specific state, or random variation. For these, disable caching on a per-call basis using the skipCache: true option.
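The effect of a per-call bypass can be sketched as below. This is an assumption-laden illustration: `cached_call` and its `skip_cache` parameter are hypothetical Python analogues of the skipCache: true option, the cache is a plain exact-match dictionary to keep the example short, and whether a bypassed call also skips storing its result is assumed here rather than documented.

```python
import itertools


def cached_call(prompt: str, call_api, cache: dict, skip_cache: bool = False):
    """Call the API through a cache, unless the caller opts out for this call."""
    if not skip_cache and prompt in cache:
        return cache[prompt]
    response = call_api(prompt)
    if not skip_cache:
        # Assumption: bypassed calls are not written back to the cache.
        cache[prompt] = response
    return response


# A stand-in for a prompt whose answer depends on real-time data.
ticks = itertools.count(1)


def current_tick(prompt: str) -> str:
    return f"tick {next(ticks)}"


cache = {}
print(cached_call("what tick is it?", current_tick, cache))  # tick 1
print(cached_call("what tick is it?", current_tick, cache))  # tick 1 (stale)
print(cached_call("what tick is it?", current_tick, cache, skip_cache=True))  # tick 2
```

The second call shows the failure mode the bypass exists for: a cached answer to a time-sensitive prompt is stale by construction.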