THE ML ENGINE
Kompile Model Server
A standalone lifecycle service to download, convert, optimize, train, and serve models entirely on your own hardware. Move from expensive hosted APIs to self-hosted inference with zero code changes.
Model Types by RAG Stack Layer
Every layer of the RAG pipeline uses a different kind of model. Kompile manages all of them — download, convert, optimize, and serve from a single platform.
LLM-powered query transformers reshape user queries before retrieval.
LLM (Query Transformer)
OpenAI, Anthropic, Gemini, Local SameDiffHyDE— generate hypothetical documents for embedding-based retrievalMulti-Query— expand a single query into N alternative formulationsStep-Back— abstract to a higher-level query for broader contextCompression— distill verbose input into a focused retrieval queryExpansion— augment queries with related terms and synonyms
Encode documents and queries into vector representations for similarity search.
Dense Encoder
SameDiff, ONNX, Anserini, OpenAI, Sentence Transformers, PostgresMLBGE Base/Small/Large EN v1.5— BAAI general-purpose (384–1024 dim)BGE-M3— multilingual, 100+ languages (1024 dim, 8192 tokens)Arctic Embed L— Snowflake long-context (1024 dim, 8192 tokens)E5 Base v2— intfloat general-purpose (768 dim)CosDPR-Distil— cosine-similarity DPR (768 dim)
Sparse Encoder
Anserini SameDiffSPLADE++ EfficientDistil— learned sparse, BERT vocab (30522 dim)SPLADE++ SelfDistil— self-distilled variant (30522 dim)
Image Encoder
ONNX via VLM pipelineSigLIP Base— Google, 224×224 images (768 dim)CLIP ViT-B/32— OpenAI, vision + text (512 dim)
Fetch candidate documents using sparse, dense, or hybrid search.
Retrieval Engines
Anserini (Lucene), pgvector, Chroma, VespaBM25— classic sparse keyword retrieval via Lucene/AnseriniHNSW Dense— approximate nearest neighbor over dense embeddingsHybrid— BM25 + dense retrieval with score fusionPrebuilt Indexes— MS MARCO passage/doc Lucene indexes
Re-score retrieved candidates with neural cross-encoders or statistical methods.
Cross-Encoder (Neural)
SameDiff (.sdz), ONNXMS MARCO MiniLM L-6-v2— 6-layer, 384-dim, fast passage rerankerMS MARCO MiniLM L-12-v2— 12-layer, higher accuracymMARCO mMiniLM L12— multilingual rerankerSTS-B TinyBERT L-4— lightweight semantic similarityQNLI DistilRoBERTa— question-answering NLI reranker (768 dim)
Statistical Rerankers
Built-in (Anserini)RRF— Reciprocal Rank Fusion across multiple retrieversMMR— Maximal Marginal Relevance for diversityRM3— relevance-model query expansionBM25-PRF— pseudo-relevance feedbackRocchio— vector space expansionAxiom— axiomatic semantic feedback
Generate responses grounded in retrieved context — cloud or fully local.
Cloud LLM
APIOpenAI— GPT-4o, GPT-4, GPT-3.5 TurboAnthropic— Claude 4, Claude 3.5 SonnetGoogle Gemini— Gemini Pro, Gemini FlashSpring AI— abstraction over multiple backends
Local LLM (GGUF)
GGUF → SameDiff conversion, OpenAI-compatible APILLaMA 2 / 3— MetaMistral / Mixtral— Mistral AIPhi-2 / Phi-3— MicrosoftQwen / Qwen2— AlibabaGemma— GoogleFalcon— TIIStarCoder / StarCoder2— BigCodeMPT— MosaicML
Local LLM (Native SameDiff)
SameDiff FlatBuffers (.fb)SmolLM 135M/360M Instruct— HuggingFace, edge-readyPhi-2 (2.7B)— Microsoft, 32 layers, 2560-dim
LLM-based safety and quality gates on both input and output.
Input Guardrails
LLM-based (uses configured LLM)Prompt Injection Detection— threshold-based classifierToxicity Detection— configurable category thresholdsPII Detection— email, phone, SSN, credit cardTopic Guardrail— allowed/blocked topic enforcement
Output Guardrails
LLM-based (uses configured LLM)Hallucination Detection— context-grounded fact checking with retryRelevancy Check— response-to-query relevance scoring with retryFormat Guardrail— length, structure, and format validation
Extract text, tables, and structure from images, PDFs, and scanned documents.
Vision-Language Models (VLM)
ONNX, SameDiffSmolDocling 256M— SigLIP + Qwen2 decoder; DocTags, Markdown, JSON outputDonut Base— Swin Transformer + BART; form and structured extractionFlorence-2 Base/Large— Microsoft; general vision understanding
OCR Models
ONNX, SameDiffDBNet v2— text region detectionCRNN v2— text recognitionTableFormer— table structure extraction (Docling)LayoutLM / LayoutLMv2 / LayoutLMv3— document layout analysisPaddleOCR / DocTR— end-to-end OCR pipelines
Transcribe audio into text for ingestion into the RAG pipeline.
Speech-to-Text
GGUF, ONNX (auto-download from HuggingFace)Whisper Tiny— fastest, lowest accuracyWhisper Base— lightweight general purposeWhisper Small— balanced speed/accuracyWhisper Medium— higher accuracyWhisper Large— highest accuracy, multilingual
Embed entities and relations for graph-based reasoning and GraphRAG.
KG Embedding Algorithms
Built-in (Neo4j integration)TransE— translational embedding (h + r ≈ t)RotatE— rotational embedding in complex space (h ∘ r ≈ t)
9
Pipeline Layers
15+
Model Categories
50+
Supported Models
All Local
Every Model Runs On-Prem
Multi-GPU Scheduling & Memory Management
Automatic workload routing across all available GPUs with reservation-based memory management and continuous batching.
Per-Service Device Routing
Pin each workload type to specific CUDA devices: embedding, LLM inference, VLM encoder, VLM decoder, ingest, and vector population. Vision models auto-route to the largest GPU. Configured via DeviceRoutingConfig.
GPU Resource Manager
Reservation-based memory management with priority preemption. Tracks per-service GPU reservations, nvidia-smi to CUDA runtime index mappings, and per-device memory pools.
Continuous Batching
Dynamic batching scheduler with per-model priority queues, configurable preferred batch sizes, and max queue delay. Requests are batched continuously to maximize GPU utilization.
KV Cache Management
Full KV cache lifecycle with prefix indexing (content-hash based), priority eviction policies, checkpoint/restore, and statistics collection. Shared across LLM and VLM pipelines.
RUNS ON YOUR HARDWARE
Graph Optimization Pipeline
Models are compiled through a multi-pass pipeline that strips waste, fuses operations, and targets your hardware. Every pass is individually toggleable.
Cleanup
UnusedFunctionOptimizations — remove dead constants and opsConstantFunctionOptimizations — constant folding, pre-execute const-input opsIdentityFunctionOptimizations — strip identity ops and no-op permutationsAlgebraic simplifications — AddZero, SubtractZero, MultiplyOne, MultiplyZero, SubtractSelf, DivideOneShape & Linear Fusion
ShapeFunctionOptimizations — fuse chained permutes, reshapes, and concatsLinearFusionOptimizations — merge matmul + add into fused xw_plus_bAttention & Activation Fusion
AttentionFusionOptimizations — detect and fuse Q*K^T/sqrt(d)*V patterns with causal masking and projection fusionFuseSigmoidMulToSwish — collapse Sigmoid * Mul into Swish activationFuseSwiGLUPattern — recognize and fuse SwiGLU gatingFuseRMSNormPattern — detect and fuse RMSNorm sequencesFuseMeanSquarePattern — fuse mean-square normalizationHardware Targeting
CuDNNFunctionOptimizations — convert to cuDNN kernel implementationsQuantizationOptimizations — INT8, UINT8, Float16, BFloat16 quantizationTriton GPU compilation — warp tuning, stage tuning, graph capture, FP fusionPerformance Profiles
DEBUG_FASTMinimal passes for rapid iteration. No fusion, no quantization.
BALANCEDCleanup + basic fusion. Good for development and testing.
MAX_PERFAll optimization passes enabled. Maximum hardware throughput.
LLM_OPTIMALTuned for transformer inference — attention fusion, KV cache, Triton compilation, speculative decoding.
LLM_BASICLightweight LLM profile without GPU compilation passes.
Training & Fine-Tuning
Customize models on your proprietary data. Every method runs natively inside the model server — no fragmented external tooling.
PEFT / Adapter Methods
LoRA
Low-rank adaptation of attention weights
QLoRA
4-bit quantized base model + LoRA adapters for memory-efficient fine-tuning
AdaLoRA
Adaptive rank allocation across layers
DyLoRA
Dynamic rank during training
DoRA
Weight-decomposed low-rank adaptation for better convergence
IA3
Inhibit/amplify activations with learned scaling vectors
Prompt Tuning
Trainable soft prompt token embeddings
Prefix Tuning
Trainable prefix vectors per transformer layer
Alignment Methods
DPO
Direct Preference Optimization
KTO
Kahneman-Tversky Optimization
ORPO
Odds Ratio Preference Optimization
PPO
Proximal Policy Optimization with reward model
GRPO
Group Relative Policy Optimization
Distillation
Logit KD
Teacher-student distillation
Feature KD
Teacher-student distillation
Attention KD
Teacher-student distillation
Combined
Teacher-student distillation
Model Registry & Air-Gapping
A proper model registry with full lifecycle management. Package into .karch archives and deploy to fully air-gapped environments.
.karch Archives
Self-contained archives with manifest.karch.json and checksums.sha256. Export, import, publish, and download via CLI or REST API. Move models across air-gapped boundaries with a single file.
Model Lifecycle
Full promote, replace, convert, and delete workflows with status tracking. Import from HuggingFace with resumable downloads. Configurable source modes: ARCHIVE, REGISTRY, or HYBRID.
Format Importers
Native importers for ONNX, TensorFlow, Keras, and GGUF/GGML. The GGML importer supports architecture-specific weight mapping for all major model families.
OpenAI-Compatible API
Expose any loaded model through an OpenAI-compatible REST API. Existing integrations can point to the model server as a drop-in replacement for hosted APIs.
Supported Model Architectures
IMPORT FROM FRAMEWORKS YOU TRUST
Inference Capabilities
LLM Execution
Text generation with streaming, speculative decoding (n-gram, draft model), KV cache reuse, and configurable sampling strategies.
VLM Execution
Vision-language model inference with separate encoder/decoder device routing, KV cache integration, and image+text multimodal input.
Embedding Generation
Generate embeddings from local ONNX or SameDiff models. Batch processing with automatic device placement and memory management.
Ready to compile your models?
Get early access and run models on your own hardware.
Request Early Access