Model Hosting Guide

How models work in Ona, which model backends are supported, and how to host them in local, VPS, or hybrid setups.


What This Covers

This guide answers two core questions:

  1. What models/providers can Ona use?
  2. How can those models be hosted and routed in production?

Ona supports both cloud-hosted and self-hosted inference, plus mixed deployments.


Supported Model Backends

Ona can connect to these model sources:

  • Cloud API: Anthropic, OpenAI, Groq, Gemini, OpenRouter (provider-managed cloud)
  • Local model server: Ollama (your machine/server)
  • OpenAI-compatible server: llama.cpp, LM Studio, vLLM (your machine/server)
  • Hybrid: any combination of the above (split across local + cloud)

You only need one backend configured to run missions, but multi-backend setups are recommended for cost/performance control.
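As a rough illustration, a multi-backend setup can be described as a small configuration map. This is a hypothetical sketch: the keys, field names, and endpoint URLs below are illustrative, not Ona's actual configuration schema.

```python
# Hypothetical multi-backend configuration. Field names and URLs are
# illustrative only; they are not Ona's real config schema.
BACKENDS = {
    "anthropic": {"type": "cloud_api", "auth": "api_key"},
    "ollama": {"type": "local", "base_url": "http://localhost:11434"},
    "llamacpp": {"type": "openai_compatible", "base_url": "http://localhost:8080/v1"},
}

def configured_backends(config: dict) -> list[str]:
    """Return backend names that have enough settings to be usable."""
    usable = []
    for name, settings in config.items():
        if settings.get("type") == "cloud_api" and settings.get("auth"):
            usable.append(name)
        elif settings.get("base_url"):
            usable.append(name)
    return usable
```

With a single entry the same structure covers the minimal one-backend case; adding entries later is what enables role- and cost-based routing.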


Core Hosting Modes

1) Cloud-only

All inference calls go to cloud providers through API keys.

  • Fast setup
  • No local GPU requirement
  • Ongoing API cost
  • Good for low-maintenance deployments

2) Self-hosted only

All inference calls run through local services like Ollama or llama.cpp.

  • Maximum control and privacy
  • No per-token cloud billing
  • Requires model hosting resources (CPU/GPU/RAM)
  • Good for local-first or private infrastructure

3) Hybrid (recommended for many teams)

Run some roles/tasks locally and route others to cloud models.

  • Balance cost, speed, and quality
  • Keep sensitive/private flows local
  • Use cloud for heavier reasoning when needed

How Ona Chooses a Model

Model resolution is layered; Ona tries each layer in order until a model resolves:

  1. Role-specific override (for example Developer)
  2. Task-size routing (mini, mid, large)
  3. Default model
  4. Provider fallback

This lets you control quality/cost by assigning stronger models to high-value tasks and cheaper/faster models to routine work.
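The four-layer resolution above can be sketched as a simple fallback chain. The function name, config shape, and model names here are assumptions for illustration, not Ona's internal API.

```python
# Sketch of the layered model-resolution order described above.
# Config shape and model names are illustrative assumptions.
def resolve_model(role: str, task_size: str, config: dict) -> str:
    # 1. Role-specific override (e.g. a dedicated model for "Developer")
    if role in config.get("role_overrides", {}):
        return config["role_overrides"][role]
    # 2. Task-size routing (mini / mid / large)
    if task_size in config.get("size_routes", {}):
        return config["size_routes"][task_size]
    # 3. Default model
    if config.get("default"):
        return config["default"]
    # 4. Provider fallback
    return config.get("fallback", "")

config = {
    "role_overrides": {"Developer": "qwen2.5-coder"},
    "size_routes": {"large": "claude-sonnet"},
    "default": "llama3.1",
}
```

Each layer only fires when the earlier, more specific layers don't match, which is what makes the scheme safe to extend incrementally.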


Deep Research Routing (Local-first)

Recommended operating model:

  • Keep Solin and core system workflows on local/self-hosted models.
  • Escalate only selected tasks to larger cloud models for:
    • deep research
    • online activity
    • high-depth reasoning
  • Preferred cloud escalations: Gemini, ChatGPT (OpenAI), Claude.

Solin escalation flow

  1. User sends mission to Solin.
  2. Solin classifies intent (normal vs deep_research / online_activity).
  3. If normal: execute on local model path.
  4. If escalated: Solin sends a targeted prompt to selected larger cloud model.
  5. Solin receives the response and returns it to the user as part of mission output.
  6. Solin stores useful result content in memory for future recall.
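The classification step in the flow above (step 2) can be sketched as follows. The intent labels come from the text; the keyword heuristic is a stand-in for whatever classifier Solin actually uses.

```python
# Sketch of the escalation decision: normal missions stay local,
# deep_research / online_activity escalate to a larger cloud model.
# The keyword heuristic below is illustrative only.
ESCALATE_INTENTS = {"deep_research", "online_activity"}

def classify_intent(mission: str) -> str:
    text = mission.lower()
    if "research" in text:
        return "deep_research"
    if any(k in text for k in ("browse", "online", "web")):
        return "online_activity"
    return "normal"

def route(mission: str) -> str:
    intent = classify_intent(mission)
    # Escalated intents go to a larger cloud model; everything else stays local.
    return "cloud" if intent in ESCALATE_INTENTS else "local"
```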

Memory behavior after escalation

When an escalated cloud response is accepted, store it in memory with metadata:

  • provider/model used
  • task intent (deep_research or online_activity)
  • timestamp and scope
  • summary + reusable findings

This preserves long-term usefulness while keeping default operation local-first.
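The metadata fields listed above can be captured in a small record type. This is a sketch; the field names are illustrative and do not reflect Ona's actual memory schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Sketch of the metadata stored with an accepted escalation result.
# Field names are illustrative, not Ona's memory schema.
@dataclass
class EscalationRecord:
    provider: str  # provider used, e.g. "gemini"
    model: str     # specific model, e.g. "gemini-1.5-pro"
    intent: str    # "deep_research" or "online_activity"
    scope: str     # mission or project scope the result applies to
    summary: str   # condensed, reusable findings
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = EscalationRecord(
    provider="gemini",
    model="gemini-1.5-pro",
    intent="deep_research",
    scope="mission-42",
    summary="Key findings for later recall...",
)
```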


Important Architecture Rule (Dual Bot Separation)

In dual-bot deployments:

  • Solin (the main assistant) and the Customer Service assistant should each use their own LLM service.

Do not run both bots on the same backend endpoint when separation is required. Keep service boundaries explicit for reliability, isolation, and operational clarity.
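A startup guard can enforce this rule mechanically. The service names and endpoint URLs below are illustrative, but the check itself is just "no two services share an endpoint":

```python
# Sketch of a dual-bot separation guard: Solin and the CS assistant
# must not share a backend endpoint. Names/URLs are illustrative.
def check_separation(services: dict[str, str]) -> bool:
    """Return True if every service has its own distinct endpoint."""
    endpoints = list(services.values())
    return len(endpoints) == len(set(endpoints))

services = {
    "solin": "http://localhost:8080/v1",       # e.g. llama.cpp
    "cs_assistant": "http://localhost:11434",  # e.g. Ollama
}
```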


Self-Hosted Options

Ollama

Best for simple self-hosted model operations.

  • Host location: local machine or server
  • API shape: Ollama native (/api/tags, /api/generate, etc.)
  • Typical use: general local inference, lightweight role routing
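A quick reachability check against Ollama's native API might look like this. `/api/tags` is Ollama's real model-listing endpoint and 11434 its default port; the helper names are my own.

```python
import json
import urllib.request

def parse_ollama_tags(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    # /api/tags lists the models available on this Ollama instance.
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_ollama_tags(json.load(resp))
```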

llama.cpp (OpenAI-compatible)

Best for GGUF-based local serving with explicit runtime control.

  • Host location: local machine or server
  • API shape: OpenAI-compatible (/v1/*)
  • Typical use: Solin/main assistant on dedicated local model

LM Studio / vLLM

Useful when you already run OpenAI-compatible local endpoints.

  • Host location: local machine or server
  • API shape: OpenAI-compatible (/v1/*)
  • Typical use: dev workflows, custom serving stacks, advanced hosting
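Because llama.cpp, LM Studio, and vLLM all expose the OpenAI-compatible `/v1/chat/completions` route, one request builder covers all three. The base URL and model name below are illustrative; port 8080 is a common llama.cpp default, not a fixed value.

```python
import json
import urllib.request

# Sketch of a chat request against any OpenAI-compatible local server
# (llama.cpp, LM Studio, vLLM). Base URL and model name are illustrative.
def build_chat_request(base_url: str, model: str, prompt: str):
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8080", "local-model", "Hello")
# Send with: urllib.request.urlopen(req, timeout=30)
```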

Cloud Options

Cloud providers work through API credentials (prefer lockbox-backed storage).

Typical use:

  • high-reasoning tasks
  • burst capacity
  • deployments with no local hardware to operate

Cloud backends can be combined with local backends in one deployment.


Recommended Deployment Patterns

Recommended: Private-first hybrid

  • Run local models for Solin and core system workflows (for example llama.cpp or Ollama).
  • Use Gemini, ChatGPT (OpenAI), and Claude for deep research, web-heavy reasoning, and online activity.
  • Keep day-to-day memory and private operational context on local/self-hosted paths whenever possible.
  • Result: strong cloud reasoning when needed, while most sensitive data and routine context stay private.

Pattern A: Simple local starter

  • Primary: Ollama
  • Default model: one local model for all roles
  • Best for: single-user local setup

Pattern B: Local main + cloud specialist

  • Main/default: local (Ollama or llama.cpp)
  • Research/large tasks: cloud model
  • Best for: cost-sensitive but quality-aware usage

Pattern C: Split local services by role

  • Solin: llama.cpp (dedicated endpoint)
  • CS assistant: Ollama (separate endpoint)
  • Optional extra local endpoint for embeddings
  • Best for: advanced self-hosted control

Pattern D: Cloud-first fallback with local resilience

  • Primary: cloud
  • Local model as fallback/resilience path
  • Best for: high uptime expectations with local continuity option

Startup and Operations Behavior

Ona startup should honor configured model services:

  • If local model services are configured, startup should bring them up (or validate they are reachable).
  • If CS model service is configured, startup should validate/pull required CS model.
  • If a required model endpoint is unavailable, diagnostics should report it clearly.

This keeps setup/settings aligned with runtime behavior.
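The "validate they are reachable" step above can be sketched as a simple TCP probe per configured endpoint. This is one possible shape for such a doctor check, not Ona's actual diagnostics code.

```python
import socket
from urllib.parse import urlparse

def endpoint_port(url: str) -> int:
    """Port for an endpoint URL, falling back to scheme defaults."""
    parsed = urlparse(url)
    return parsed.port or (443 if parsed.scheme == "https" else 80)

# Sketch of a startup/doctor check: verify each configured model endpoint
# accepts TCP connections and report failures per service.
def check_endpoints(endpoints: dict[str, str]) -> dict[str, bool]:
    results = {}
    for name, url in endpoints.items():
        host = urlparse(url).hostname
        try:
            with socket.create_connection((host, endpoint_port(url)), timeout=3):
                results[name] = True
        except OSError:
            results[name] = False
    return results
```

A failed entry in the result map is exactly what diagnostics should surface clearly instead of letting missions fail mid-run.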


Security and Credential Handling

For cloud providers:

  • Store API keys in lockbox when available.
  • Avoid plaintext key sprawl across configs/logs.
  • Ensure logs and outputs redact secret values.
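Redaction can be as simple as a pattern filter applied before anything is logged. The key prefixes below (`sk-`, `gsk_`) are illustrative examples of common provider formats, not an exhaustive list.

```python
import re

# Sketch of log redaction for API keys, as recommended above.
# The prefixes matched here are illustrative, not exhaustive.
SECRET_PATTERN = re.compile(r"\b(sk-[A-Za-z0-9_-]{8,}|gsk_[A-Za-z0-9]{8,})\b")

def redact(text: str) -> str:
    """Replace anything that looks like an API key with a placeholder."""
    return SECRET_PATTERN.sub("[REDACTED]", text)
```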

For self-hosted providers:

  • Restrict model endpoints to trusted networks.
  • Keep per-service endpoint ownership clear (especially in dual-bot setups).

Privacy note:

  • A hybrid setup can significantly improve privacy posture: keep personal/system context local, and only send scoped prompts to cloud models for deep research or online tasks.

Validation Checklist

Before production use, verify:

  • Each configured endpoint is reachable.
  • At least one usable model exists per required backend.
  • Role/task routing resolves as expected.
  • Dual-bot service separation is preserved.
  • Startup/doctor checks reflect real model health.

Practical Summary

Ona supports:

  • Cloud-hosted models (Anthropic, OpenAI, Groq, Gemini, OpenRouter)
  • Self-hosted models (Ollama, llama.cpp, LM Studio, vLLM)
  • Hybrid model architectures (mix local + cloud)

For most deployments:

  • start with one backend,
  • add role-based routing,
  • then split services (main vs CS) as the system grows.