Model Hosting Guide

AI Model Architecture

How models work in Ona, which providers and engines are supported, and how to host and route model traffic in local, VPS, or hybrid production setups.

What This Covers

  1. What models and providers Ona can use.
  2. How those models are hosted and routed in production.

Ona supports cloud-hosted and self-hosted inference, plus mixed deployments.

Supported Model Backends

| Backend Type | Providers / Engines | Hosted Where |
| --- | --- | --- |
| Cloud API | Anthropic, OpenAI, Groq, Gemini, OpenRouter | Provider-managed cloud |
| Local model server | Ollama | Your machine or server |
| OpenAI-compatible server | llama.cpp, LM Studio, vLLM | Your machine or server |
| Docker-hosted services | Containerized Ollama, vLLM, llama.cpp, or custom gateways | Your Docker host (local or server) |
| Hybrid | Any combination of local + cloud | Split across local and cloud |

One backend is enough to run missions, but multi-backend setups are recommended for quality, cost, and speed control.

Core Hosting Modes

Cloud-only

  • Fast setup and no local GPU requirement.
  • Ongoing API cost with provider-managed operations.
  • Best for low-maintenance deployments.

Self-hosted only

  • Maximum control and privacy.
  • No per-token cloud billing.
  • Requires CPU/GPU/RAM planning for model hosting.

Hybrid

  • Route sensitive flows locally and heavier tasks to cloud.
  • Balance cost, speed, and output quality.
  • Recommended for many teams and growing deployments.

How Ona Chooses a Model

  1. Role-specific override (for example, Developer).
  2. Task-size routing (`mini`, `mid`, `large`).
  3. Default model selection.
  4. Provider fallback when needed.

This layered routing lets you reserve stronger models for high-value missions while routine workloads stay on faster or cheaper backends.

Deep Research Routing (Local-first)

  • Keep Solin and core workflows on local or self-hosted models.
  • Escalate selected tasks to cloud models for deep research, online activity, and high-depth reasoning.
  • Preferred cloud escalation options: Gemini, ChatGPT (OpenAI), Claude.

Solin escalation flow

  1. User sends mission to Solin.
  2. Solin classifies intent (`normal` vs `deep_research` / `online_activity`).
  3. Normal missions run on the local model path.
  4. Escalated missions send targeted prompts to selected cloud models.
  5. Solin returns combined output to the user.
  6. Useful accepted findings are stored in memory for future recall.
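The escalation flow above can be sketched as classify-then-route. The keyword heuristic is a stand-in for real intent classification (a deployment would typically use the local model itself), and the backend labels are assumptions.

```python
# Sketch of the Solin escalation flow: classify intent, keep normal
# missions on the local model path, escalate deep research and online
# activity to cloud models. The keyword heuristic is illustrative only.

ESCALATION_INTENTS = {"deep_research", "online_activity"}

def classify_intent(mission: str) -> str:
    """Toy classifier; a real deployment would use the local model itself."""
    text = mission.lower()
    if "research" in text or "investigate" in text:
        return "deep_research"
    if "browse" in text or "online" in text:
        return "online_activity"
    return "normal"

def route_mission(mission: str) -> dict:
    intent = classify_intent(mission)
    # Escalated intents go to a cloud model (e.g. Gemini, ChatGPT, Claude);
    # everything else stays on the local path.
    backend = "cloud" if intent in ESCALATION_INTENTS else "local"
    return {"intent": intent, "backend": backend}
```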

Memory metadata after escalation

  • Provider and model used.
  • Task intent (`deep_research` or `online_activity`).
  • Timestamp and scope.
  • Summary and reusable findings.
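The metadata fields above could be captured in a record like the following; the field and type names are assumptions based on that list, not Ona's storage schema.

```python
# Sketch of the memory record written after a cloud escalation; field
# names are assumptions based on the metadata listed above.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class EscalationMemory:
    provider: str   # e.g. "gemini"
    model: str      # e.g. "gemini-1.5-pro"
    intent: str     # "deep_research" or "online_activity"
    scope: str      # mission or project identifier
    summary: str    # reusable findings for future recall
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = EscalationMemory(
    provider="gemini",
    model="gemini-1.5-pro",
    intent="deep_research",
    scope="mission-42",
    summary="Key accepted findings from the escalated task.",
)
```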

Important Rule: Dual Bot Separation

In dual-bot deployments, Solin (main assistant) should run on a different LLM service than the Customer Service assistant. Keep service boundaries explicit for reliability, isolation, and operational clarity.

Self-Hosted Options

Ollama

  • Simple self-hosted model operations.
  • Native Ollama API (`/api/tags`, `/api/generate`).
  • Typical use: local inference and lightweight routing.
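A quick way to enumerate installed models is to parse the `/api/tags` response. The response shape `{"models": [{"name": ...}]}` matches current Ollama builds, but treat it as an assumption and check your server's output.

```python
# Sketch: list installed models from Ollama's native /api/tags endpoint.
# Response shape {"models": [{"name": ...}]} is assumed from current
# Ollama behavior; 11434 is Ollama's default port.
import json
from urllib.request import urlopen

def parse_tags(body: str) -> list[str]:
    """Extract model names from an /api/tags JSON body."""
    return [m.get("name", "") for m in json.loads(body).get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_tags(resp.read().decode())
```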

llama.cpp

  • OpenAI-compatible local serving (`/v1/*`).
  • Good GGUF runtime control and dedicated local endpoints.
  • Typical use: Solin/main assistant on private infrastructure.

LM Studio / vLLM

  • OpenAI-compatible local endpoints (`/v1/*`).
  • Fits custom stacks and advanced serving workflows.
  • Typical use: development and specialized hosting paths.
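All three servers accept the same request shape on their `/v1/*` surface; a minimal chat payload looks like the following. The model name and port are illustrative assumptions.

```python
# Sketch of a request body for an OpenAI-compatible local server
# (llama.cpp, LM Studio, vLLM). Model name and port are assumptions.
import json

def build_chat_request(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Build a /v1/chat/completions payload for a local OpenAI-compatible server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

payload = build_chat_request("local-gguf-model", "Summarize the deployment notes.")
body = json.dumps(payload)  # POST to e.g. http://localhost:8080/v1/chat/completions
```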

Cloud Options

  • Connect cloud providers through API credentials (prefer lockbox-backed storage).
  • Typical use: high-reasoning tasks, burst capacity, and no local hardware operations.
  • Cloud and local backends can be combined in one deployment.

Docker-Hosted Models

Docker is a practical way to run self-hosted model services with consistent startup, upgrades, and rollback. It works for local machines, home servers, and VPS deployments.

What to Containerize

  • Ollama for local-first inference endpoints.
  • vLLM or llama.cpp gateways for OpenAI-compatible serving.
  • Optional routing gateway for model selection and failover.
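A minimal compose file for the first item above might look like this, assuming the official `ollama/ollama` image and its default port 11434; adapt volumes and network bindings to your host.

```yaml
# docker-compose.yml — minimal Ollama service (illustrative sketch)
services:
  ollama:
    image: ollama/ollama
    ports:
      - "127.0.0.1:11434:11434"   # bind to loopback: trusted-network baseline
    volumes:
      - ollama-data:/root/.ollama # persist pulled models across restarts
    restart: unless-stopped

volumes:
  ollama-data:
```

Binding to `127.0.0.1` rather than `0.0.0.0` keeps the model port off untrusted networks by default, matching the security baseline below.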

Operational Benefits

  • Repeatable environments across development and production.
  • Simple lifecycle control with compose or orchestrators.
  • Clear service boundaries for Solin vs CS assistant backends.

Security baseline: expose model ports only on trusted networks and keep secrets in env/lockbox-managed paths.

Routing baseline: keep private default flows on local Docker services and escalate only scoped deep research tasks to cloud models.

Recommended Deployment Patterns

Private-first hybrid (recommended)

  • Run Solin and core workflows on local models (for example llama.cpp or Ollama).
  • Use Gemini, ChatGPT (OpenAI), and Claude for deep research and web-heavy reasoning.
  • Keep day-to-day memory and private operational context local whenever possible.

Pattern A: Simple local starter

Primary: Ollama. One local model for all roles. Best for single-user local setups.

Pattern B: Local main + cloud specialist

Default local model path plus cloud escalation for large research tasks. Best for cost-sensitive, quality-aware usage.

Pattern C: Split local services by role

Solin on llama.cpp, CS assistant on Ollama, optional embeddings endpoint. Best for advanced self-hosted control.

Pattern D: Cloud-first with local resilience

Primary cloud path with local fallback for continuity and uptime resilience.

Startup and Operations Behavior

  • On startup, bring up configured local model services, or validate that already-running services are reachable.
  • If a CS model service is configured, validate it or pull the required CS model at startup.
  • If any required endpoint is unavailable, diagnostics should report it clearly.

Security and Credential Handling

Cloud providers

  • Store API keys in lockbox when available.
  • Avoid plaintext key sprawl across configs and logs.
  • Ensure logs and outputs redact secret values.
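Redaction can be enforced at the logging boundary; a sketch follows. The key patterns below (`sk-...`, `AIza...`) are illustrative shapes, not an exhaustive list of provider formats.

```python
# Sketch of log redaction for provider credentials; patterns are
# illustrative, not exhaustive. Extend per provider in use.
import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{10,}"),   # OpenAI/Anthropic-style keys
    re.compile(r"AIza[A-Za-z0-9_-]{10,}"),  # Google-style keys
]

def redact(line: str) -> str:
    """Replace anything that looks like an API key before it reaches logs."""
    for pattern in SECRET_PATTERNS:
        line = pattern.sub("[REDACTED]", line)
    return line
```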

Self-hosted providers

  • Restrict model endpoints to trusted networks.
  • Keep per-service endpoint ownership clear, especially in dual-bot setups.
  • Use hybrid routing to keep sensitive context local by default.

Validation Checklist

  • Each configured endpoint is reachable.
  • At least one usable model exists per required backend.
  • Role and task routing resolves as expected.
  • Dual-bot service separation is preserved.
  • Startup and doctor checks reflect real model health.
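The reachability items in the checklist reduce to a TCP probe per configured endpoint. The endpoint map below uses illustrative defaults (11434 for Ollama, 8080 for llama.cpp); substitute your own hosts and ports.

```python
# Sketch of a doctor-style reachability check: one TCP probe per
# configured endpoint. Ports below are illustrative defaults.
import socket

def endpoint_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

ENDPOINTS = {
    "ollama": ("127.0.0.1", 11434),
    "llama.cpp": ("127.0.0.1", 8080),
}

def doctor() -> dict:
    """Report per-endpoint health for startup/doctor diagnostics."""
    return {name: endpoint_reachable(h, p) for name, (h, p) in ENDPOINTS.items()}
```

A real doctor check would layer a model-list query (e.g. `/api/tags` or `/v1/models`) on top of the raw socket probe, since a reachable port does not guarantee a loaded model.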

Practical Summary

  • Ona supports cloud-hosted models, self-hosted models, and hybrid architectures.
  • Start with one backend, then add role-based routing.
  • As your system grows, split services for Solin and CS assistant to preserve isolation.