Skip to main content

Prompt Caching

When you chat with a Space attached, a large chunk of your message — the background context drawn from that Space — stays the same turn after turn. Prompt caching lets the AI provider remember that repeated chunk instead of re-reading (and re-charging for) it every single time. The result: lower cost and, often, faster responses, with no change to the answers you get.

Rephlo actively enables prompt caching for Anthropic (Claude) — it's on by default, with a configurable lifetime. This page explains when caching kicks in and how to see the benefit.

How it works (in plain terms)

A request to the AI is built in layers: a stable instruction, your Space's background context, then the new part of the conversation. The background context that repeats each turn is what gets cached.

  • The stable prefix (instruction + Space context) is reused from the cache.
  • Only the new content in each turn is processed fresh.
  • For repeated context, this can cut the cost of those input tokens dramatically.

When caching applies

Caching engages when both of these are true:

  1. A Space is attached to the conversation. The Space context is what creates a large, repeatable prefix worth caching. Without an attached Space, there's no stable block to cache, so caching has nothing to do.
  2. You're on a provider that caches. Rephlo drives caching for Anthropic (Claude) so repeated context is served from cache on later turns. Some other cloud providers cache repeated context automatically on their own side — that's their behavior, not something Rephlo controls. On-device models run locally and don't cache, so caching simply doesn't apply there.

There's also a practical floor: the cached context needs to be reasonably large before providers actually serve it from cache. Small Spaces may not cross that threshold, in which case you'll see little or no caching benefit — which is harmless, just not a saving.

Anthropic (Claude) specifics

Different providers cache in different ways. Most cloud vendors cache repeated context automatically. Anthropic (Claude) is a bit different — Rephlo drives caching for it and gives you two settings in the Anthropic provider configuration:

SettingWhat it does
Prompt caching enabledTurns Anthropic caching on. On by default.
Cache lifetime (TTL)How long the cache lives between turns: ephemeral (about 5 minutes, the default) or extended (about 1 hour). A longer lifetime costs slightly more to write the cache but keeps it warm longer.

The default of ephemeral caching enabled is a good fit for most conversations, where your follow-up messages come within a few minutes of each other. If you tend to step away and return to a chat much later, extended keeps the cache warm so the next message still gets a cache hit.

You'll find these options on the Providers screen under your Anthropic configuration. For where models and providers fit together, see Models & Inference.

Seeing the savings

Rephlo tracks how many input tokens were served from the cache versus processed fresh on each response. With detailed cache logging enabled (the default for Anthropic), this information is recorded so you can confirm caching is doing its job — a high proportion of cached tokens on later turns means the cache is being reused as intended.

The savings come from the provider charging less for cached input than for fresh input, so longer conversations with a stable Space context are where caching pays off most.

Quick tips

  • Attach a Space to a chat to make caching possible — it's the trigger.
  • Keep the Space context stable across turns. Rephlo handles this for you by only adding genuinely new context each turn rather than rebuilding the whole block.
  • Don't worry about a "cold" cache. The first turn writes the cache; there's no penalty if a later turn happens to miss — it just processes normally.