Model Hosting Guide

How models work in Ona, which model backends are supported, and how to host them in local, VPS, or hybrid setups.


What This Covers

This guide answers two core questions:

  1. What models/providers can Ona use?
  2. How can those models be hosted and routed in production?

Ona supports both cloud-hosted and self-hosted inference, plus mixed deployments.


Supported Model Backends

Ona can connect to these model sources:

  • Cloud API: Anthropic, OpenAI, Groq, Gemini, OpenRouter (provider-managed cloud)
  • Local model server: Ollama (your machine/server)
  • OpenAI-compatible server: llama.cpp, LM Studio, vLLM (your machine/server)
  • Hybrid: any combination of the above (split across local + cloud)

You only need one backend configured to run missions, but multi-backend setups are recommended for cost/performance control.
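As a rough illustration, a multi-backend setup can be described as a small configuration map. This is a hypothetical sketch: the keys, field names, and endpoint URLs below are illustrative, not Ona's actual configuration schema.

```python
# Hypothetical multi-backend configuration. Field names and URLs are
# illustrative only; they are not Ona's real config schema.
BACKENDS = {
    "anthropic": {"type": "cloud_api", "auth": "api_key"},
    "ollama": {"type": "local", "base_url": "http://localhost:11434"},
    "llamacpp": {"type": "openai_compatible", "base_url": "http://localhost:8080/v1"},
}

def configured_backends(config: dict) -> list[str]:
    """Return backend names that have enough settings to be usable."""
    usable = []
    for name, settings in config.items():
        if settings.get("type") == "cloud_api" and settings.get("auth"):
            usable.append(name)
        elif settings.get("base_url"):
            usable.append(name)
    return usable
```

With a single entry the same structure covers the minimal one-backend case; adding entries later is what enables role- and cost-based routing.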


Core Hosting Modes

1) Cloud-only

All inference calls go to cloud providers through API keys.

  • Fast setup
  • No local GPU requirement
  • Ongoing API cost
  • Good for low-maintenance deployments

2) Self-hosted only

All inference calls run through local services like Ollama or llama.cpp.

  • Maximum control and privacy
  • No per-token cloud billing
  • Requires model hosting resources (CPU/GPU/RAM)
  • Good for local-first or private infrastructure

3) Hybrid (recommended for many teams)

Run some roles/tasks locally and route others to cloud models.

  • Balance cost, speed, and quality
  • Keep sensitive/private flows local
  • Use cloud for heavier reasoning when needed

How Ona Chooses a Model

Model resolution is layered; Ona tries each layer in order until a model resolves:

  1. Role-specific override (for example Developer)
  2. Task-size routing (mini, mid, large)
  3. Default model
  4. Provider fallback

This lets you control quality/cost by assigning stronger models to high-value tasks and cheaper/faster models to routine work.
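The four-layer resolution above can be sketched as a simple fallback chain. The function name, config shape, and model names here are assumptions for illustration, not Ona's internal API.

```python
# Sketch of the layered model-resolution order described above.
# Config shape and model names are illustrative assumptions.
def resolve_model(role: str, task_size: str, config: dict) -> str:
    # 1. Role-specific override (e.g. a dedicated model for "Developer")
    if role in config.get("role_overrides", {}):
        return config["role_overrides"][role]
    # 2. Task-size routing (mini / mid / large)
    if task_size in config.get("size_routes", {}):
        return config["size_routes"][task_size]
    # 3. Default model
    if config.get("default"):
        return config["default"]
    # 4. Provider fallback
    return config.get("fallback", "")

config = {
    "role_overrides": {"Developer": "qwen2.5-coder"},
    "size_routes": {"large": "claude-sonnet"},
    "default": "llama3.1",
}
```

Each layer only fires when the earlier, more specific layers don't match, which is what makes the scheme safe to extend incrementally.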


Deep Research Routing (Local-first)

Recommended operating model:

  • Keep Solin and core system workflows on local/self-hosted models.
  • Escalate only selected tasks to larger cloud models for:
    • deep research
    • online activity
    • high-depth reasoning
  • Preferred cloud escalations: Gemini, ChatGPT (OpenAI), Claude.

Solin escalation flow

  1. User sends mission to Solin.
  2. Solin classifies intent (normal vs deep_research / online_activity).
  3. If normal: execute on local model path.
  4. If escalated: Solin sends a targeted prompt to selected larger cloud model.
  5. Solin receives the response and returns it to the user as part of mission output.
  6. Solin stores useful result content in memory for future recall.
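The classification step in the flow above (step 2) can be sketched as follows. The intent labels come from the text; the keyword heuristic is a stand-in for whatever classifier Solin actually uses.

```python
# Sketch of the escalation decision: normal missions stay local,
# deep_research / online_activity escalate to a larger cloud model.
# The keyword heuristic below is illustrative only.
ESCALATE_INTENTS = {"deep_research", "online_activity"}

def classify_intent(mission: str) -> str:
    text = mission.lower()
    if "research" in text:
        return "deep_research"
    if any(k in text for k in ("browse", "online", "web")):
        return "online_activity"
    return "normal"

def route(mission: str) -> str:
    intent = classify_intent(mission)
    # Escalated intents go to a larger cloud model; everything else stays local.
    return "cloud" if intent in ESCALATE_INTENTS else "local"
```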

Memory behavior after escalation

When an escalated cloud response is accepted, store it in memory with metadata:

  • provider/model used
  • task intent (deep_research or online_activity)
  • timestamp and scope
  • summary + reusable findings

This preserves long-term usefulness while keeping default operation local-first.
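The metadata fields listed above can be captured in a small record type. This is a sketch; the field names are illustrative and do not reflect Ona's actual memory schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Sketch of the metadata stored with an accepted escalation result.
# Field names are illustrative, not Ona's memory schema.
@dataclass
class EscalationRecord:
    provider: str  # provider used, e.g. "gemini"
    model: str     # specific model, e.g. "gemini-1.5-pro"
    intent: str    # "deep_research" or "online_activity"
    scope: str     # mission or project scope the result applies to
    summary: str   # condensed, reusable findings
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = EscalationRecord(
    provider="gemini",
    model="gemini-1.5-pro",
    intent="deep_research",
    scope="mission-42",
    summary="Key findings for later recall...",
)
```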


Important Architecture Rule (Dual Bot Separation)

In dual-bot deployments:

  • Solin (the main assistant) and the Customer Service assistant should each use their own LLM service.

Do not run both bots on the same backend endpoint when separation is required. Keep service boundaries explicit for reliability, isolation, and operational clarity.
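A startup guard can enforce this rule mechanically. The service names and endpoint URLs below are illustrative, but the check itself is just "no two services share an endpoint":

```python
# Sketch of a dual-bot separation guard: Solin and the CS assistant
# must not share a backend endpoint. Names/URLs are illustrative.
def check_separation(services: dict[str, str]) -> bool:
    """Return True if every service has its own distinct endpoint."""
    endpoints = list(services.values())
    return len(endpoints) == len(set(endpoints))

services = {
    "solin": "http://localhost:8080/v1",       # e.g. llama.cpp
    "cs_assistant": "http://localhost:11434",  # e.g. Ollama
}
```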


Self-Hosted Options

Ollama

Best for simple self-hosted model operations.

  • Host location: local machine or server
  • API shape: Ollama native (/api/tags, /api/generate, etc.)
  • Typical use: general local inference, lightweight role routing
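A quick reachability check against Ollama's native API might look like this. `/api/tags` is Ollama's real model-listing endpoint and 11434 its default port; the helper names are my own.

```python
import json
import urllib.request

def parse_ollama_tags(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    # /api/tags lists the models available on this Ollama instance.
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_ollama_tags(json.load(resp))
```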

llama.cpp (OpenAI-compatible)

Best for GGUF-based local serving with explicit runtime control.

  • Host location: local machine or server
  • API shape: OpenAI-compatible (/v1/*)
  • Typical use: Solin/main assistant on dedicated local model

LM Studio / vLLM

Useful when you already run OpenAI-compatible local endpoints.

  • Host location: local machine or server
  • API shape: OpenAI-compatible (/v1/*)
  • Typical use: dev workflows, custom serving stacks, advanced hosting
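Because llama.cpp, LM Studio, and vLLM all expose the OpenAI-compatible `/v1/chat/completions` route, one request builder covers all three. The base URL and model name below are illustrative; port 8080 is a common llama.cpp default, not a fixed value.

```python
import json
import urllib.request

# Sketch of a chat request against any OpenAI-compatible local server
# (llama.cpp, LM Studio, vLLM). Base URL and model name are illustrative.
def build_chat_request(base_url: str, model: str, prompt: str):
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8080", "local-model", "Hello")
# Send with: urllib.request.urlopen(req, timeout=30)
```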

Cloud Options

Cloud providers work through API credentials (prefer lockbox-backed storage).

Typical use:

  • high-reasoning tasks
  • burst capacity
  • deployments with no local hardware to operate

Cloud backends can be combined with local backends in one deployment.


Recommended Deployment Patterns

Recommended: Private-first hybrid

  • Run local models for Solin and core system workflows (for example llama.cpp or Ollama).
  • Use Gemini, ChatGPT (OpenAI), and Claude for deep research, web-heavy reasoning, and online activity.
  • Keep day-to-day memory and private operational context on local/self-hosted paths whenever possible.
  • Result: strong cloud reasoning when needed, while most sensitive data and routine context stay private.

Pattern A: Simple local starter

  • Primary: Ollama
  • Default model: one local model for all roles
  • Best for: single-user local setup

Pattern B: Local main + cloud specialist

  • Main/default: local (Ollama or llama.cpp)
  • Research/large tasks: cloud model
  • Best for: cost-sensitive but quality-aware usage

Pattern C: Split local services by role

  • Solin: llama.cpp (dedicated endpoint)
  • CS assistant: Ollama (separate endpoint)
  • Optional extra local endpoint for embeddings
  • Best for: advanced self-hosted control

Pattern D: Cloud-first fallback with local resilience

  • Primary: cloud
  • Local model as fallback/resilience path
  • Best for: high uptime expectations with local continuity option

Startup and Operations Behavior

Ona startup should honor configured model services:

  • If local model services are configured, startup should bring them up (or validate they are reachable).
  • If CS model service is configured, startup should validate/pull required CS model.
  • If a required model endpoint is unavailable, diagnostics should report it clearly.

This keeps setup/settings aligned with runtime behavior.
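The "validate they are reachable" step above can be sketched as a simple TCP probe per configured endpoint. This is one possible shape for such a doctor check, not Ona's actual diagnostics code.

```python
import socket
from urllib.parse import urlparse

def endpoint_port(url: str) -> int:
    """Port for an endpoint URL, falling back to scheme defaults."""
    parsed = urlparse(url)
    return parsed.port or (443 if parsed.scheme == "https" else 80)

# Sketch of a startup/doctor check: verify each configured model endpoint
# accepts TCP connections and report failures per service.
def check_endpoints(endpoints: dict[str, str]) -> dict[str, bool]:
    results = {}
    for name, url in endpoints.items():
        host = urlparse(url).hostname
        try:
            with socket.create_connection((host, endpoint_port(url)), timeout=3):
                results[name] = True
        except OSError:
            results[name] = False
    return results
```

A failed entry in the result map is exactly what diagnostics should surface clearly instead of letting missions fail mid-run.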


Security and Credential Handling

For cloud providers:

  • Store API keys in lockbox when available.
  • Avoid plaintext key sprawl across configs/logs.
  • Ensure logs and outputs redact secret values.
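Redaction can be as simple as a pattern filter applied before anything is logged. The key prefixes below (`sk-`, `gsk_`) are illustrative examples of common provider formats, not an exhaustive list.

```python
import re

# Sketch of log redaction for API keys, as recommended above.
# The prefixes matched here are illustrative, not exhaustive.
SECRET_PATTERN = re.compile(r"\b(sk-[A-Za-z0-9_-]{8,}|gsk_[A-Za-z0-9]{8,})\b")

def redact(text: str) -> str:
    """Replace anything that looks like an API key with a placeholder."""
    return SECRET_PATTERN.sub("[REDACTED]", text)
```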

For self-hosted providers:

  • Restrict model endpoints to trusted networks.
  • Keep per-service endpoint ownership clear (especially in dual-bot setups).

Privacy note:

  • A hybrid setup can significantly improve privacy posture: keep personal/system context local, and only send scoped prompts to cloud models for deep research or online tasks.

Validation Checklist

Before production use, verify:

  • Each configured endpoint is reachable.
  • At least one usable model exists per required backend.
  • Role/task routing resolves as expected.
  • Dual-bot service separation is preserved.
  • Startup/doctor checks reflect real model health.

Practical Summary

Ona supports:

  • Cloud-hosted models (Anthropic, OpenAI, Groq, Gemini, OpenRouter)
  • Self-hosted models (Ollama, llama.cpp, LM Studio, vLLM)
  • Hybrid model architectures (mix local + cloud)

For most deployments:

  • start with one backend,
  • add role-based routing,
  • then split services (main vs CS) as the system grows.