▸ Sovereign Models · Inference + Fine-tuning

Open-weight LLMs.
Closed-tenant inference.

Run Llama, Mistral, Qwen, DeepSeek, Phi and Gemma on dedicated infrastructure in your tenant. OpenAI-compatible endpoint. Zero data egress. Audit-grade logs.

14 models · EU data residency · 0 egress · OpenAI-compatible

14 sovereign models, ready to serve.

Curated, quantized and packaged for cost-efficient inference. Hosted in EU AZs. Replicated on demand. Per-token billing with reserved capacity available.

llama-3.3-70b
Meta · open weight
70B

Top-tier generalist. Strong reasoning, multilingual, 128k context. Great default for new deployments.

q4 · q8 · fp16 · 128k ctx
from €0.40 / 1M tok
Use →
mistral-large
Mistral · open weight
123B

European frontier model. Excellent at multilingual reasoning, code, and structured output. 128k context.

q4 · fp16 · 128k ctx
from €0.55 / 1M tok
Use →
qwen-2.5-72b
Alibaba · open weight
72B

Strong coder, especially at Python and SQL. Best-in-class for tool calling and agentic workflows.

q4 · q8 · fp16 · 32k ctx
from €0.38 / 1M tok
Use →
deepseek-v3
DeepSeek · open weight
671B MoE

Frontier mixture-of-experts: ~37B parameters activated per token. Premier reasoning and math.

q4 · fp8 · 128k ctx
from €1.20 / 1M tok
Use →
phi-4
Microsoft · open weight
14B

Small but mighty. Reasoning that punches above its weight class. Excellent for cost-sensitive workloads.

q4 · q8 · fp16 · 16k ctx
from €0.12 / 1M tok
Use →
gemma-3-27b
Google · open weight
27B

Best-in-class multimodal. Reads images, OCRs documents. Strong default for visual-document workflows.

q4 · q8 · multimodal · 128k ctx
from €0.22 / 1M tok
Use →
llama-3.2-3b
Meta · open weight
3B

Edge-friendly mini-model. Sub-100ms TTFT. Perfect for autocomplete, classification, intent detection.

q4 · fp16 · 128k ctx
from €0.04 / 1M tok
Use →
nomic-embed-v2
Nomic · open weight
embed

Long-context embeddings. 8192-token chunks, MoE-routed. Use with your tenant Qdrant or Chroma.

fp16 · 768 dim · multilingual
from €0.02 / 1M tok
Use →
+ custom fine-tune
Your weights
LoRA

Bring your own LoRA, QLoRA, or fully fine-tuned weights. We host, version, and serve them. SafeTensors only.

any base · any quant · versioned
from €89 / mo
Upload →
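The nomic-embed-v2 card above mentions 8192-token chunks. Here is a minimal sketch of chunking a long document before embedding it; it assumes the tenant endpoint mirrors OpenAI's /v1/embeddings route (check your tenant docs), and the `chunk_text` / `embed_document` helper names are ours, not part of any SDK.

```python
# pip install openai  (reuses the same OpenAI-compatible client as the samples below)

def chunk_text(text: str, max_chars: int = 8192 * 4) -> list[str]:
    """Split a document into pieces that fit the 8192-token window,
    using a rough 4-characters-per-token heuristic. Swap in a real
    tokenizer for exact budgeting."""
    if not text:
        return [""]
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def embed_document(client, text: str, model: str = "nomic-embed-v2") -> list[list[float]]:
    """client is an openai.OpenAI instance pointed at your tenant
    base_url. Returns one vector per chunk (768-dim, per the card),
    ready to upsert into your tenant Qdrant or Chroma."""
    resp = client.embeddings.create(model=model, input=chunk_text(text))
    return [d.embedding for d in resp.data]
```

The character heuristic deliberately under-fills the window; precise budgeting needs the model's own tokenizer.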

Three guarantees, written into the contract.

01

No egress, ever

Inference happens on dedicated GPUs in EU AZs. Your prompts, completions and embeddings never leave the tenant. Egress is blocked at the VPC level as a technical control, not just a policy.

02

Customer-managed keys

BYO HSM-backed keys for encryption at rest and audit log signing. Rotate on your schedule. Pull a signed evidence pack any quarter for your auditors.

03

Auditable infrastructure

Every request is logged with the model version, parameters, latency and token counts. Logs are append-only, signed, and exportable as CSV or streamed to your SIEM.
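As a concrete picture of what "signed" can mean, here is a sketch of verifying one exported CSV row with HMAC-SHA256. The field list, separator, and key handling are illustrative assumptions, not the actual tenant scheme; the evidence pack defines the real one.

```python
import hashlib
import hmac

def verify_log_row(row: dict, key: bytes) -> bool:
    """Illustrative only: recompute an HMAC-SHA256 over the logged
    fields and compare it to the exported signature in constant time.
    Field order and canonicalisation here are assumptions."""
    payload = "|".join(row[k] for k in ("ts", "model", "latency_ms", "tokens"))
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, row["sig"])
```

Any tampering with a logged field, even a single token count, makes the recomputed digest diverge and the check fail.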

Your OpenAI code already works.

Change two lines — the base URL and the API key. Same endpoints, same response shapes, same SDKs.

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.kawkav.com/v1",
    api_key="kwk_…",  # your tenant key
)

resp = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a sovereign assistant."},
        {"role": "user", "content": "Summarize GDPR Article 25."},
    ],
    temperature=0.2,
)
print(resp.choices[0].message.content)
// npm i openai
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.kawkav.com/v1",
  apiKey: process.env.KAWKAV_KEY,
});

const r = await client.chat.completions.create({
  model: "mistral-large",
  messages: [
    { role: "system", content: "Be precise." },
    { role: "user", content: "Outline a HIPAA-safe RAG architecture." },
  ],
});
console.log(r.choices[0].message.content);
curl https://api.kawkav.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer kwk_…" \
  -d '{
    "model": "qwen-2.5-72b",
    "messages": [
      {"role": "user", "content": "Generate a Postgres schema for clinical trials."}
    ]
  }'

Per-token, monthly minimum, or fully reserved.

Pay as you go
€0.04 / 1M tok
Starting price · phi-4 mini
  • Per-token billing, no monthly commit
  • All 14 catalog models
  • Shared-tenant inference pool
  • 99.5% SLA
  • EU data residency
Get an API key →
Reserved capacity
From €1,200/mo
Dedicated GPU(s) · single-tenant
  • Reserved GPU — A100, H100, B200
  • Unlimited tokens at served capacity
  • Pin specific model versions
  • Air-gap / on-prem option
  • DPA + SCC + DPIA support
  • 99.99% SLA available
Talk to enterprise →

Try it. Five minutes.

Sign up, get an API key, point your OpenAI client at us. Migrate later or stay forever — same SDK either way.

Get an API key → · Back to home