THE ML ENGINE

Kompile Model Server

A standalone lifecycle service to download, convert, optimize, train, and serve models entirely on your own hardware. Move from expensive hosted APIs to self-hosted inference with zero code changes.

Model Types by RAG Stack Layer

Every layer of the RAG pipeline uses a different kind of model. Kompile manages all of them — download, convert, optimize, and serve from a single platform.

Query Understanding
Embedding
Retrieval
Reranking
Generation
Guardrails
Vision & Document Understanding
Speech
Knowledge Graph Embeddings
Query Understanding

LLM-powered query transformers reshape user queries before retrieval.

LLM (Query Transformer)

OpenAI, Anthropic, Gemini, Local SameDiff
  • HyDE generate hypothetical documents for embedding-based retrieval
  • Multi-Query expand a single query into N alternative formulations
  • Step-Back abstract to a higher-level query for broader context
  • Compression distill verbose input into a focused retrieval query
  • Expansion augment queries with related terms and synonyms
Embedding

Encode documents and queries into vector representations for similarity search.

Dense Encoder

SameDiff, ONNX, Anserini, OpenAI, Sentence Transformers, PostgresML
  • BGE Base/Small/Large EN v1.5 BAAI general-purpose (384–1024 dim)
  • BGE-M3 multilingual, 100+ languages (1024 dim, 8192 tokens)
  • Arctic Embed L Snowflake long-context (1024 dim, 8192 tokens)
  • E5 Base v2 intfloat general-purpose (768 dim)
  • CosDPR-Distil cosine-similarity DPR (768 dim)

Sparse Encoder

Anserini SameDiff
  • SPLADE++ EfficientDistil learned sparse, BERT vocab (30522 dim)
  • SPLADE++ SelfDistil self-distilled variant (30522 dim)

Image Encoder

ONNX via VLM pipeline
  • SigLIP Base Google, 224×224 images (768 dim)
  • CLIP ViT-B/32 OpenAI, vision + text (512 dim)
Retrieval

Fetch candidate documents using sparse, dense, or hybrid search.

Retrieval Engines

Anserini (Lucene), pgvector, Chroma, Vespa
  • BM25 classic sparse keyword retrieval via Lucene/Anserini
  • HNSW Dense approximate nearest neighbor over dense embeddings
  • Hybrid BM25 + dense retrieval with score fusion
  • Prebuilt Indexes MS MARCO passage/doc Lucene indexes
Reranking

Re-score retrieved candidates with neural cross-encoders or statistical methods.

Cross-Encoder (Neural)

SameDiff (.sdz), ONNX
  • MS MARCO MiniLM L-6-v2 6-layer, 384-dim, fast passage reranker
  • MS MARCO MiniLM L-12-v2 12-layer, higher accuracy
  • mMARCO mMiniLM L12 multilingual reranker
  • STS-B TinyBERT L-4 lightweight semantic similarity
  • QNLI DistilRoBERTa question-answering NLI reranker (768 dim)

Statistical Rerankers

Built-in (Anserini)
  • RRF Reciprocal Rank Fusion across multiple retrievers
  • MMR Maximal Marginal Relevance for diversity
  • RM3 relevance-model query expansion
  • BM25-PRF pseudo-relevance feedback
  • Rocchio vector space expansion
  • Axiom axiomatic semantic feedback
Generation

Generate responses grounded in retrieved context — cloud or fully local.

Cloud LLM

API
  • OpenAI GPT-4o, GPT-4, GPT-3.5 Turbo
  • Anthropic Claude 4, Claude 3.5 Sonnet
  • Google Gemini Gemini Pro, Gemini Flash
  • Spring AI abstraction over multiple backends

Local LLM (GGUF)

GGUF → SameDiff conversion, OpenAI-compatible API
  • LLaMA 2 / 3 Meta
  • Mistral / Mixtral Mistral AI
  • Phi-2 / Phi-3 Microsoft
  • Qwen / Qwen2 Alibaba
  • Gemma Google
  • Falcon TII
  • StarCoder / StarCoder2 BigCode
  • MPT MosaicML

Local LLM (Native SameDiff)

SameDiff FlatBuffers (.fb)
  • SmolLM 135M/360M Instruct HuggingFace, edge-ready
  • Phi-2 (2.7B) Microsoft, 32 layers, 2560-dim
Guardrails

LLM-based safety and quality gates on both input and output.

Input Guardrails

LLM-based (uses configured LLM)
  • Prompt Injection Detection threshold-based classifier
  • Toxicity Detection configurable category thresholds
  • PII Detection email, phone, SSN, credit card
  • Topic Guardrail allowed/blocked topic enforcement

Output Guardrails

LLM-based (uses configured LLM)
  • Hallucination Detection context-grounded fact checking with retry
  • Relevancy Check response-to-query relevance scoring with retry
  • Format Guardrail length, structure, and format validation
Vision & Document Understanding

Extract text, tables, and structure from images, PDFs, and scanned documents.

Vision-Language Models (VLM)

ONNX, SameDiff
  • SmolDocling 256M SigLIP + Qwen2 decoder; DocTags, Markdown, JSON output
  • Donut Base Swin Transformer + BART; form and structured extraction
  • Florence-2 Base/Large Microsoft; general vision understanding

OCR Models

ONNX, SameDiff
  • DBNet v2 text region detection
  • CRNN v2 text recognition
  • TableFormer table structure extraction (Docling)
  • LayoutLM / LayoutLMv2 / LayoutLMv3 document layout analysis
  • PaddleOCR / DocTR end-to-end OCR pipelines
Speech

Transcribe audio into text for ingestion into the RAG pipeline.

Speech-to-Text

GGUF, ONNX (auto-download from HuggingFace)
  • Whisper Tiny fastest, lowest accuracy
  • Whisper Base lightweight general purpose
  • Whisper Small balanced speed/accuracy
  • Whisper Medium higher accuracy
  • Whisper Large highest accuracy, multilingual
Knowledge Graph Embeddings

Embed entities and relations for graph-based reasoning and GraphRAG.

KG Embedding Algorithms

Built-in (Neo4j integration)
  • TransE translational embedding (h + r ≈ t)
  • RotatE rotational embedding in complex space (h ∘ r ≈ t)

9

Pipeline Layers

15+

Model Categories

50+

Supported Models

All Local

Every Model Runs On-Prem

Multi-GPU Scheduling & Memory Management

Automatic workload routing across all available GPUs with reservation-based memory management and continuous batching.

Per-Service Device Routing

Pin each workload type to specific CUDA devices: embedding, LLM inference, VLM encoder, VLM decoder, ingest, and vector population. Vision models auto-route to the largest GPU. Configured via DeviceRoutingConfig.

GPU Resource Manager

Reservation-based memory management with priority preemption. Tracks per-service GPU reservations, nvidia-smi to CUDA runtime index mappings, and per-device memory pools.

Continuous Batching

Dynamic batching scheduler with per-model priority queues, configurable preferred batch sizes, and max queue delay. Requests are batched continuously to maximize GPU utilization.

KV Cache Management

Full KV cache lifecycle with prefix indexing (content-hash based), priority eviction policies, checkpoint/restore, and statistics collection. Shared across LLM and VLM pipelines.

RUNS ON YOUR HARDWARE

NVIDIAAMDIntel

Graph Optimization Pipeline

Models are compiled through a multi-pass pipeline that strips waste, fuses operations, and targets your hardware. Every pass is individually toggleable.

Cleanup

UnusedFunctionOptimizations remove dead constants and ops
ConstantFunctionOptimizations constant folding, pre-execute const-input ops
IdentityFunctionOptimizations strip identity ops and no-op permutations
Algebraic simplifications AddZero, SubtractZero, MultiplyOne, MultiplyZero, SubtractSelf, DivideOne

Shape & Linear Fusion

ShapeFunctionOptimizations fuse chained permutes, reshapes, and concats
LinearFusionOptimizations merge matmul + add into fused xw_plus_b

Attention & Activation Fusion

AttentionFusionOptimizations detect and fuse Q*K^T/sqrt(d)*V patterns with causal masking and projection fusion
FuseSigmoidMulToSwish collapse Sigmoid * Mul into Swish activation
FuseSwiGLUPattern recognize and fuse SwiGLU gating
FuseRMSNormPattern detect and fuse RMSNorm sequences
FuseMeanSquarePattern fuse mean-square normalization

Hardware Targeting

CuDNNFunctionOptimizations convert to cuDNN kernel implementations
QuantizationOptimizations INT8, UINT8, Float16, BFloat16 quantization
Triton GPU compilation warp tuning, stage tuning, graph capture, FP fusion

Performance Profiles

DEBUG_FAST

Minimal passes for rapid iteration. No fusion, no quantization.

BALANCED

Cleanup + basic fusion. Good for development and testing.

MAX_PERF

All optimization passes enabled. Maximum hardware throughput.

LLM_OPTIMAL

Tuned for transformer inference — attention fusion, KV cache, Triton compilation, speculative decoding.

LLM_BASIC

Lightweight LLM profile without GPU compilation passes.

Training & Fine-Tuning

Customize models on your proprietary data. Every method runs natively inside the model server — no fragmented external tooling.

PEFT / Adapter Methods

LoRA

Low-rank adaptation of attention weights

QLoRA

4-bit quantized base model + LoRA adapters for memory-efficient fine-tuning

AdaLoRA

Adaptive rank allocation across layers

DyLoRA

Dynamic rank during training

DoRA

Weight-decomposed low-rank adaptation for better convergence

IA3

Inhibit/amplify activations with learned scaling vectors

Prompt Tuning

Trainable soft prompt token embeddings

Prefix Tuning

Trainable prefix vectors per transformer layer

Alignment Methods

DPO

Direct Preference Optimization

KTO

Kahneman-Tversky Optimization

ORPO

Odds Ratio Preference Optimization

PPO

Proximal Policy Optimization with reward model

GRPO

Group Relative Policy Optimization

Distillation

Logit KD

Teacher-student distillation

Feature KD

Teacher-student distillation

Attention KD

Teacher-student distillation

Combined

Teacher-student distillation

Model Registry & Air-Gapping

A proper model registry with full lifecycle management. Package into .karch archives and deploy to fully air-gapped environments.

.karch Archives

Self-contained archives with manifest.karch.json and checksums.sha256. Export, import, publish, and download via CLI or REST API. Move models across air-gapped boundaries with a single file.

Model Lifecycle

Full promote, replace, convert, and delete workflows with status tracking. Import from HuggingFace with resumable downloads. Configurable source modes: ARCHIVE, REGISTRY, or HYBRID.

Format Importers

Native importers for ONNX, TensorFlow, Keras, and GGUF/GGML. The GGML importer supports architecture-specific weight mapping for all major model families.

OpenAI-Compatible API

Expose any loaded model through an OpenAI-compatible REST API. Existing integrations can point to the model server as a drop-in replacement for hosted APIs.

Supported Model Architectures

LLaMAMistralMixtralPhiQwenGemmaFalconMPTStarCoder

IMPORT FROM FRAMEWORKS YOU TRUST

TensorFlowPyTorchONNXKerasJAX

Inference Capabilities

LLM Execution

Text generation with streaming, speculative decoding (n-gram, draft model), KV cache reuse, and configurable sampling strategies.

VLM Execution

Vision-language model inference with separate encoder/decoder device routing, KV cache integration, and image+text multimodal input.

Embedding Generation

Generate embeddings from local ONNX or SameDiff models. Batch processing with automatic device placement and memory management.

Ready to compile your models?

Get early access and run models on your own hardware.

Request Early Access