Models & Inference
A provider is who runs the AI (OpenAI, Anthropic, your own machine, or Rephlo). A model is the actual engine that produces text. One provider usually offers several models, and you choose which one runs each command or chat.
Getting this distinction right is the key to controlling speed, quality, privacy, and cost.
The three ways Rephlo runs AI
Rephlo can run a request in one of three different ways. You pick the one that fits your needs.
| Way to run AI | Where it runs | Who pays | Plan required | Best for |
|---|---|---|---|---|
| Cloud (your own key) | The provider's servers (OpenAI, Anthropic, Google, etc.) | You — billed by the provider on your own API key | Pro plan or perpetual license | Top-tier models, full control |
| On-device | Your own computer, fully offline | No one — no credits, no provider bill | Pro plan or perpetual license | Privacy, no internet, no usage cost |
| Rephlo-hosted | Rephlo's inference service | Rephlo credits from your plan | Included with paid plans | Convenience, no keys to manage |
Plan gating: Bring-your-own-key providers (cloud and Ollama) and on-device models are Pro-tier features — they require an active Pro plan (or higher) or a perpetual license to use. For on-device models that gate covers both downloading and running them; once unlocked, they run with no usage cost and fully offline. The Rephlo-hosted Dedicated API is included with paid plans and draws credits.
1. Cloud providers (bring your own key)
You paste an API key from a provider you already have an account with. Rephlo supports eight BYOK provider types — seven cloud vendors (OpenAI, Anthropic, Google, Groq, xAI (Grok), OpenRouter, and any OpenAI-compatible endpoint — including Azure OpenAI) plus local Ollama, which runs on your own machine at localhost:11434. Each exposes its own list of models — you choose from whatever that provider currently offers (the latest GPT, Claude, Gemini, and other models).
Usage is billed by the provider, on your key — Rephlo never charges credits for these calls. Set these up under Providers.
2. On-device models (private, free, offline)
Rephlo can run open models entirely on your own computer using GGUF model files via a built-in engine. Once a model is downloaded:
- It runs with zero internet — nothing is sent to any server.
- It uses no credits and costs nothing to run.
- Everything stays completely private on your machine.
Using on-device models — both downloading and running them — requires an active Pro plan (or higher) or a perpetual license, the same as BYOK cloud and Ollama providers. Once unlocked, they run credit-free and fully offline.
Each on-device model carries metadata so you can pick the right one: a download size, a speed rating and accuracy rating (1–10), and the RAM it needs (a minimum and a recommended amount). Pick a smaller, faster model if your machine has limited memory, or a larger, more accurate one if you have RAM to spare.
Some on-device models support a thinking mode (visible reasoning before the final answer). Reasoning-capable models such as Qwen and Gemma produce their thoughts in special blocks; Rephlo detects this automatically so the reasoning is handled correctly.
Browse, download, and manage these under On-Device Models.
3. Rephlo-hosted inference (uses credits)
If you don't want to manage any keys, Rephlo can run inference for you on its own hosted service. These requests consume credits from your Rephlo plan. Responses stream back in real time so you see output as it's generated.
Credits, balances, and what each plan includes live on the web — see Credits & Usage and Plans & Pricing.
How a model gets chosen
Rephlo decides which model to use based on a clear order of precedence:
- A command's own model override, if the command specifies one.
- The model picked for the current conversation (in chat), which is remembered per conversation.
- The active provider's default model, used as the fallback — for example, quick context-menu (overlay) actions run on the active provider's default model unless the command overrides it.
Switching models on the fly
In chat, a unified model selector in the toolbar lets you change the model for the current conversation. Your choice is remembered for that conversation, so reopening it keeps the same model. See Chat & Conversations for details.
Model parameters
Beyond which model you use, you can tune how it behaves with parameters such as temperature (randomness), max tokens (length of the reply), top-p, and top-k. Not every model supports every parameter, and some impose limits — Rephlo validates and adjusts your settings to match each model's constraints, so you don't have to memorize the rules. You'll find these in the model parameter editor and in Advanced Configuration.
Lowering cost with prompt caching
When you use Anthropic (Claude) with repeated context — the same long instructions or documents across many requests — Rephlo can reuse a cached copy of that context instead of re-sending it every time. This is on by default and can dramatically reduce cost for repeated work. Some other cloud providers cache repeated context automatically on their own side, and on-device models don't cache at all. Learn more in Prompt Caching.
Choosing the right approach
- Want the best quality and already have an API key? Use a cloud provider (BYOK — a Pro-tier feature).
- Care most about privacy, or working offline? Use an on-device model — free and local to run (access is a Pro-tier feature).
- Don't want to manage keys at all? Use Rephlo-hosted inference (included with paid plans; uses credits).