THE ML ENGINE

Kompile Model Server

A standalone lifecycle service to download, convert, optimize, train, and serve models entirely on your own hardware. Move from expensive hosted APIs to self-hosted inference with zero code changes.

Model Types by RAG Stack Layer

Every layer of the RAG pipeline uses a different kind of model. Kompile manages all of them — download, convert, optimize, and serve from a single platform.

Query Understanding→

Embedding→

Retrieval→

Reranking→

Generation→

Guardrails→

Vision & Document Understanding→

Speech→

Knowledge Graph Embeddings

Query Understanding

LLM-powered query transformers reshape user queries before retrieval.

LLM (Query Transformer)

OpenAI, Anthropic, Gemini, Local SameDiff

HyDE — generate hypothetical documents for embedding-based retrieval
Multi-Query — expand a single query into N alternative formulations
Step-Back — abstract to a higher-level query for broader context
Compression — distill verbose input into a focused retrieval query
Expansion — augment queries with related terms and synonyms

Embedding

Encode documents and queries into vector representations for similarity search.

Dense Encoder

SameDiff, ONNX, Anserini, OpenAI, Sentence Transformers, PostgresML

BGE Base/Small/Large EN v1.5 — BAAI general-purpose (384–1024 dim)
BGE-M3 — multilingual, 100+ languages (1024 dim, 8192 tokens)
Arctic Embed L — Snowflake long-context (1024 dim, 8192 tokens)
E5 Base v2 — intfloat general-purpose (768 dim)
CosDPR-Distil — cosine-similarity DPR (768 dim)

Sparse Encoder

Anserini SameDiff

SPLADE++ EfficientDistil — learned sparse, BERT vocab (30522 dim)
SPLADE++ SelfDistil — self-distilled variant (30522 dim)

Image Encoder

ONNX via VLM pipeline

SigLIP Base — Google, 224×224 images (768 dim)
CLIP ViT-B/32 — OpenAI, vision + text (512 dim)

Retrieval

Fetch candidate documents using sparse, dense, or hybrid search.

Retrieval Engines

Anserini (Lucene), pgvector, Chroma, Vespa

BM25 — classic sparse keyword retrieval via Lucene/Anserini
HNSW Dense — approximate nearest neighbor over dense embeddings
Hybrid — BM25 + dense retrieval with score fusion
Prebuilt Indexes — MS MARCO passage/doc Lucene indexes

Reranking

Re-score retrieved candidates with neural cross-encoders or statistical methods.

Cross-Encoder (Neural)

SameDiff (.sdz), ONNX

MS MARCO MiniLM L-6-v2 — 6-layer, 384-dim, fast passage reranker
MS MARCO MiniLM L-12-v2 — 12-layer, higher accuracy
mMARCO mMiniLM L12 — multilingual reranker
STS-B TinyBERT L-4 — lightweight semantic similarity
QNLI DistilRoBERTa — question-answering NLI reranker (768 dim)

Statistical Rerankers

Built-in (Anserini)

RRF — Reciprocal Rank Fusion across multiple retrievers
MMR — Maximal Marginal Relevance for diversity
RM3 — relevance-model query expansion
BM25-PRF — pseudo-relevance feedback
Rocchio — vector space expansion
Axiom — axiomatic semantic feedback

Generation

Generate responses grounded in retrieved context — cloud or fully local.

Cloud LLM

API

OpenAI — GPT-4o, GPT-4, GPT-3.5 Turbo
Anthropic — Claude 4, Claude 3.5 Sonnet
Google Gemini — Gemini Pro, Gemini Flash
Spring AI — abstraction over multiple backends

Local LLM (GGUF)

GGUF → SameDiff conversion, OpenAI-compatible API

LLaMA 2 / 3 — Meta
Mistral / Mixtral — Mistral AI
Phi-2 / Phi-3 — Microsoft
Qwen / Qwen2 — Alibaba
Gemma — Google
Falcon — TII
StarCoder / StarCoder2 — BigCode
MPT — MosaicML

Local LLM (Native SameDiff)

SameDiff FlatBuffers (.fb)

SmolLM 135M/360M Instruct — HuggingFace, edge-ready
Phi-2 (2.7B) — Microsoft, 32 layers, 2560-dim

Guardrails

LLM-based safety and quality gates on both input and output.

Input Guardrails

LLM-based (uses configured LLM)

Prompt Injection Detection — threshold-based classifier
Toxicity Detection — configurable category thresholds
PII Detection — email, phone, SSN, credit card
Topic Guardrail — allowed/blocked topic enforcement

Output Guardrails

LLM-based (uses configured LLM)

Hallucination Detection — context-grounded fact checking with retry
Relevancy Check — response-to-query relevance scoring with retry
Format Guardrail — length, structure, and format validation

Vision & Document Understanding

Extract text, tables, and structure from images, PDFs, and scanned documents.

Vision-Language Models (VLM)

ONNX, SameDiff

SmolDocling 256M — SigLIP + Qwen2 decoder; DocTags, Markdown, JSON output
Donut Base — Swin Transformer + BART; form and structured extraction
Florence-2 Base/Large — Microsoft; general vision understanding

OCR Models

ONNX, SameDiff

DBNet v2 — text region detection
CRNN v2 — text recognition
TableFormer — table structure extraction (Docling)
LayoutLM / LayoutLMv2 / LayoutLMv3 — document layout analysis
PaddleOCR / DocTR — end-to-end OCR pipelines

Speech

Transcribe audio into text for ingestion into the RAG pipeline.

Speech-to-Text

GGUF, ONNX (auto-download from HuggingFace)

Whisper Tiny — fastest, lowest accuracy
Whisper Base — lightweight general purpose
Whisper Small — balanced speed/accuracy
Whisper Medium — higher accuracy
Whisper Large — highest accuracy, multilingual

Knowledge Graph Embeddings

Embed entities and relations for graph-based reasoning and GraphRAG.

KG Embedding Algorithms

Built-in (Neo4j integration)

TransE — translational embedding (h + r ≈ t)
RotatE — rotational embedding in complex space (h ∘ r ≈ t)

Pipeline Layers

15+

Model Categories

50+

Supported Models

All Local

Every Model Runs On-Prem

Multi-GPU Scheduling & Memory Management

Automatic workload routing across all available GPUs with reservation-based memory management and continuous batching.

Per-Service Device Routing

Pin each workload type to specific CUDA devices: embedding, LLM inference, VLM encoder, VLM decoder, ingest, and vector population. Vision models auto-route to the largest GPU. Configured via DeviceRoutingConfig.

GPU Resource Manager

Reservation-based memory management with priority preemption. Tracks per-service GPU reservations, nvidia-smi to CUDA runtime index mappings, and per-device memory pools.

Continuous Batching

Dynamic batching scheduler with per-model priority queues, configurable preferred batch sizes, and max queue delay. Requests are batched continuously to maximize GPU utilization.

KV Cache Management

Full KV cache lifecycle with prefix indexing (content-hash based), priority eviction policies, checkpoint/restore, and statistics collection. Shared across LLM and VLM pipelines.

RUNS ON YOUR HARDWARE

Graph Optimization Pipeline

Models are compiled through a multi-pass pipeline that strips waste, fuses operations, and targets your hardware. Every pass is individually toggleable.

Cleanup

UnusedFunctionOptimizations — remove dead constants and ops

ConstantFunctionOptimizations — constant folding, pre-execute const-input ops

IdentityFunctionOptimizations — strip identity ops and no-op permutations

Algebraic simplifications — AddZero, SubtractZero, MultiplyOne, MultiplyZero, SubtractSelf, DivideOne

Shape & Linear Fusion

ShapeFunctionOptimizations — fuse chained permutes, reshapes, and concats

LinearFusionOptimizations — merge matmul + add into fused xw_plus_b

Attention & Activation Fusion

AttentionFusionOptimizations — detect and fuse Q*K^T/sqrt(d)*V patterns with causal masking and projection fusion

FuseSigmoidMulToSwish — collapse Sigmoid * Mul into Swish activation

FuseSwiGLUPattern — recognize and fuse SwiGLU gating

FuseRMSNormPattern — detect and fuse RMSNorm sequences

FuseMeanSquarePattern — fuse mean-square normalization

Hardware Targeting

CuDNNFunctionOptimizations — convert to cuDNN kernel implementations

QuantizationOptimizations — INT8, UINT8, Float16, BFloat16 quantization

Triton GPU compilation — warp tuning, stage tuning, graph capture, FP fusion

Performance Profiles

DEBUG_FAST

Minimal passes for rapid iteration. No fusion, no quantization.

BALANCED

Cleanup + basic fusion. Good for development and testing.

MAX_PERF

All optimization passes enabled. Maximum hardware throughput.

LLM_OPTIMAL

Tuned for transformer inference — attention fusion, KV cache, Triton compilation, speculative decoding.

LLM_BASIC

Lightweight LLM profile without GPU compilation passes.

Training & Fine-Tuning

Customize models on your proprietary data. Every method runs natively inside the model server — no fragmented external tooling.

PEFT / Adapter Methods

LoRA

Low-rank adaptation of attention weights

QLoRA

4-bit quantized base model + LoRA adapters for memory-efficient fine-tuning

AdaLoRA

Adaptive rank allocation across layers

DyLoRA

Dynamic rank during training

DoRA

Weight-decomposed low-rank adaptation for better convergence

IA3

Inhibit/amplify activations with learned scaling vectors

Prompt Tuning

Trainable soft prompt token embeddings

Prefix Tuning

Trainable prefix vectors per transformer layer

Alignment Methods

DPO

Direct Preference Optimization

KTO

Kahneman-Tversky Optimization

ORPO

Odds Ratio Preference Optimization

PPO

Proximal Policy Optimization with reward model

GRPO

Group Relative Policy Optimization

Distillation

Logit KD

Teacher-student distillation

Feature KD

Teacher-student distillation

Attention KD

Teacher-student distillation

Combined

Teacher-student distillation

Model Registry & Air-Gapping

A proper model registry with full lifecycle management. Package into .karch archives and deploy to fully air-gapped environments.

.karch Archives

Self-contained archives with manifest.karch.json and checksums.sha256. Export, import, publish, and download via CLI or REST API. Move models across air-gapped boundaries with a single file.

Model Lifecycle

Full promote, replace, convert, and delete workflows with status tracking. Import from HuggingFace with resumable downloads. Configurable source modes: ARCHIVE, REGISTRY, or HYBRID.

Format Importers

Native importers for ONNX, TensorFlow, Keras, and GGUF/GGML. The GGML importer supports architecture-specific weight mapping for all major model families.

OpenAI-Compatible API

Expose any loaded model through an OpenAI-compatible REST API. Existing integrations can point to the model server as a drop-in replacement for hosted APIs.

Supported Model Architectures

LLaMAMistralMixtralPhiQwenGemmaFalconMPTStarCoder

IMPORT FROM FRAMEWORKS YOU TRUST

Inference Capabilities

LLM Execution

Text generation with streaming, speculative decoding (n-gram, draft model), KV cache reuse, and configurable sampling strategies.

VLM Execution

Vision-language model inference with separate encoder/decoder device routing, KV cache integration, and image+text multimodal input.

Embedding Generation

Generate embeddings from local ONNX or SameDiff models. Batch processing with automatic device placement and memory management.

Ready to compile your models?

Get early access and run models on your own hardware.

Request Early Access