YOUR ORGANIZATION'S MOST VALUABLE ASSET
Knowledge Graphs & Crawl
Models come and go. Your organization's knowledge is permanent. Kompile crawls your entire data estate and compiles it into typed, versioned knowledge graphs with probabilistic reasoning, causal inference, and full provenance — the structured context layer that makes every AI system smarter.
Why Knowledge Graphs Are Your Core Asset
LLM providers change quarterly. Your organizational knowledge — relationships between people, processes, decisions, and documents — is the competitive advantage that compounds over time.
Own
Your knowledge graph runs on your infrastructure. No vendor lock-in. No data leaving your network. Full sovereignty over your organizational intelligence.
Compound
Every crawl enriches the graph. Entity resolution merges duplicates. Community detection discovers structure. The graph gets smarter with every ingestion cycle.
Audit
Every node and edge carries provenance, confidence scores, and timestamps. Every mutation is logged with before/after snapshots. Regulatory-grade traceability built in.
The Crawl Pipeline
One command ingests your entire data estate through an 8-phase pipeline with adaptive memory-aware parallelism.
$ kompile crawl start https://docs.example.com \
--depth 3 --graph --graph-schema-mode STRICT \
--chunker tableAwareChunker --watch
Phase 1/8 Loading........... 142 documents from 3 sources
Phase 2/8 Classifying....... 98 text, 31 mixed PDF, 13 spreadsheet
Phase 3/8 Preprocessing..... lang detect, PII redact, dedup
Phase 4/8 Chunking.......... 4,210 chunks (table-aware)
Phase 5/8 Extracting........ 1,847 entities, 2,391 relationships
Phase 6/8 Resolving......... 312 duplicates merged (cosine > 0.88)
Phase 7/8 Indexing.......... 4,210 chunks embedded (BGE-M3)
Phase 8/8 Enriching......... taxonomy discovered, 14 categories
Done. Graph: 1,535 nodes, 2,391 edges, 14 communities
Phase 1: Loading
Pull data from 20+ sources with OAuth2 auth, robots.txt compliance, rate limiting, and incremental content-hash crawling.
Phase 2: Classifying
Auto-classify PDFs as text/image/mixed. Route tables, formulas, slides, audio, and email to specialized extractors.
Phase 3: Preprocessing
Language detection, translation, boilerplate removal, Unicode normalization, PII redaction, content-hash + SimHash deduplication.
Phase 4: Chunking
Five strategies: sentence, recursive character, markdown-aware, token-based, table-aware. Register SNIPPET nodes with CONTAINS edges.
Phase 5: Extracting
Multi-agent LLM + pattern NER extraction with cost-balanced batch planning. Schema enforcement: None, Lenient, or Strict.
Phase 6: Resolving
Entity resolution via Levenshtein + embedding cosine + MEBN probabilistic scoring. Compute shared-entity and similarity edges.
Phase 7: Indexing
Embed chunks with BGE, Arctic, or SPLADE++ and store in Anserini HNSW, pgvector, Vespa, or Chroma.
Phase 8: Enriching
Deduplicate, prune, validate, normalize. Discover domain taxonomy via LLM, categorize entities, rebuild search indexes.
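As a rough illustration of the resolving phase, the sketch below combines a normalized Levenshtein name similarity with an embedding cosine check. It is not Kompile's implementation: the function names, the 0.8 name threshold, and the rule of accepting either signal are assumptions made for the example, and the MEBN probabilistic scoring is left out entirely. Only the 0.88 cosine cutoff comes from the sample crawl output.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard row-by-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 0.0 if norm == 0 else dot / norm


def is_duplicate(name_a, vec_a, name_b, vec_b,
                 cosine_threshold=0.88,   # from the sample crawl log
                 name_threshold=0.8):     # hypothetical value
    """Assumed rule: merge when either signal clears its threshold."""
    return (cosine(vec_a, vec_b) > cosine_threshold
            or name_similarity(name_a, name_b) >= name_threshold)
```

A production resolver would more likely feed both signals into a single score, which is where the MEBN step would sit, rather than treating them as independent gates.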
Supported Data Sources
Confluence, Jira, Notion, Slack, Discord, Google Workspace, Gmail, OneDrive, Reddit, IMAP / POP3, MBOX / PST, S3 / SFTP / SMB, SQL Databases, Web Crawler, Local Filesystem
Graph Architecture
A typed, hierarchical graph model with seven node levels, eleven edge types, and full provenance on every mutation. Multi-tenant isolation via fact sheets and named graph scoping.
Node Hierarchy
Edge Types
HIERARCHICAL, EMBEDDING_SIMILARITY, SHARED_ENTITY, EXTRACTED_FROM, CONTAINS, CITATION, TEMPORAL, CROSS_SOURCE, AUTHORED_BY, ADDRESSED_TO, USER_DEFINED
Provenance & Confidence
Every node carries source attribution, confidence score, provenance string, occurrence/observation/creation timestamps, and optional TTL expiry. User-pinned nodes resist automated pruning.
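To make the per-node metadata concrete, here is a minimal sketch of what such a record could look like. The class and field names are hypothetical (only validUntil appears elsewhere on this page), and treating a pinned node as never prunable is one assumed reading of "resist automated pruning".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class NodeProvenance:
    """Illustrative shape only; not Kompile's actual schema."""
    source: str                       # where the fact was extracted from
    confidence: float                 # 0.0 to 1.0
    provenance: str                   # free-form provenance string
    observed_at: datetime             # when the crawler saw it
    created_at: datetime              # when the node was written
    occurred_at: Optional[datetime] = None   # when the fact happened, if known
    valid_until: Optional[datetime] = None   # optional TTL expiry
    pinned: bool = False              # user-pinned nodes

    def is_expired(self, now: datetime) -> bool:
        return self.valid_until is not None and now >= self.valid_until

    def is_prunable(self, now: datetime, min_confidence: float) -> bool:
        # Assumption: pinning overrides both TTL and confidence pruning.
        if self.pinned:
            return False
        return self.is_expired(now) or self.confidence < min_confidence


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
stale = NodeProvenance("confluence", 0.9, "example-page", now, now,
                       valid_until=now - timedelta(days=1))
pinned = NodeProvenance("confluence", 0.1, "example-page", now, now,
                        valid_until=now - timedelta(days=1), pinned=True)
```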
KG Embeddings
Each node stores an ND4J INDArray embedding computed via TransE or RotatE. Used for link prediction, entity similarity, and embedding-similarity edges across the graph.
Multi-Tenant Isolation
Every node and edge is scoped by factSheetId for complete tenant isolation. Named graph IDs enable logical sub-graph grouping within a fact sheet. All queries, algorithm caches, and maintenance ops respect scoping.
Probabilistic Reasoning & Causal Inference
Kompile graphs aren't just structural — they support probabilistic reasoning over live knowledge graph state through Multi-Entity Bayesian Networks and causal inference.
Multi-Entity Bayesian Networks (MEBN)
First-order probabilistic logic grounded against live knowledge graph state. Build MTheory templates with reusable MFrag fragments, then ground them into situation-specific networks at query time.
MFrag Templates
EntityRelevance, CausalInfluence, InformationFlow, RiskPropagation
SSBN Grounding
BFS expansion from seed nodes through live KG builds query-specific network
Variable Elimination
Compute P(query | evidence) with CPTs, Noisy-OR gates, and ND4J factor operations
Entity Resolution
MEBN-scored P(isSameEntity | signals) alongside Levenshtein and cosine scoring
FOL Predicates
GraphKnowledgeBase evaluates atomic predicates with universal/existential quantifiers
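The Noisy-OR gate mentioned above has a compact closed form, shown here as a plain Python sketch rather than the ND4J factor operations Kompile uses. Each active parent independently fails to produce the effect with probability 1 - p_i, so the effect is absent only if every active parent (and the optional leak term) fails. The function name and the leak parameter are assumptions for illustration.

```python
def noisy_or(cause_probs, active, leak=0.0):
    """P(effect = true | parents) under a Noisy-OR gate.

    cause_probs[i]: probability that parent i alone produces the effect
    active[i]:      whether parent i is true in the current evidence
    leak:           probability of the effect with no active parent
    """
    p_no_effect = 1.0 - leak
    for p, is_active in zip(cause_probs, active):
        if is_active:
            p_no_effect *= 1.0 - p
    return 1.0 - p_no_effect


# Two active risk factors with strengths 0.8 and 0.5:
# 1 - (0.2 * 0.5) = 0.9
both_active = noisy_or([0.8, 0.5], [True, True])
```

The appeal of Noisy-OR in a grounded network is that a node with n parents needs n parameters instead of a full conditional probability table with 2^n rows.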
Causal Inference
Eight W3C PROV-DM aligned causal edge types for modeling how events, decisions, and data flow through your organization. Temporal chain extraction, attribution paths, and counterfactual modeling.
CAUSES
Direct causal relationship between events
ENABLES
Precondition that makes an outcome possible
TRIGGERS
Event that initiates a downstream process
CONTRIBUTES_TO
Partial influence on an outcome
PREVENTS
Blocks or inhibits an outcome
CORRELATES_WITH
Statistical association without proven causation
INFLUENCES
Indirect effect on downstream state
DERIVED_FROM
Data lineage and provenance chain (W3C PROV-DM)
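One way to picture attribution paths is as a depth-limited walk that follows only edges of selected causal types. The sketch below is an assumption-laden illustration: the function name, the triple format, the hop limit, and the choice of which edge types count as forward-causal are all made up for the example and do not describe Kompile's internals.

```python
# Assumed subset of edge types treated as forward-causal for attribution.
FORWARD_CAUSAL = {"CAUSES", "ENABLES", "TRIGGERS", "CONTRIBUTES_TO"}


def attribution_paths(edges, source, target, allowed=FORWARD_CAUSAL, max_hops=5):
    """Return every simple path from source to target over allowed edge types.

    edges: iterable of (head, edge_type, tail) triples.
    """
    adjacency = {}
    for head, edge_type, tail in edges:
        if edge_type in allowed:
            adjacency.setdefault(head, []).append((edge_type, tail))

    paths = []

    def walk(node, path, visited):
        if node == target and path:
            paths.append(list(path))
            return
        if len(path) >= max_hops:
            return
        for edge_type, nxt in adjacency.get(node, []):
            if nxt in visited:        # keep paths simple (no cycles)
                continue
            path.append((node, edge_type, nxt))
            walk(nxt, path, visited | {nxt})
            path.pop()

    walk(source, [], {source})
    return paths


edges = [
    ("outage", "TRIGGERS", "rollback"),
    ("rollback", "CAUSES", "delay"),
    ("outage", "CORRELATES_WITH", "delay"),   # ignored: not causal
]
paths = attribution_paths(edges, "outage", "delay")
```

Here the direct CORRELATES_WITH edge is skipped, so the only attribution path runs through the rollback.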
Graph Maintenance
Knowledge graphs are living systems. Kompile provides nine automated maintenance primitives that keep your graph clean, consistent, and trustworthy — with full audit logging on every operation.
TTL Sweep
Expire nodes past their validUntil timestamp
Orphan Cleanup
Remove nodes with no remaining edges
Confidence Pruning
Remove low-confidence nodes and edges
Component Pruning
Remove small isolated subgraph components
Contradiction Detection
Find and flag conflicting facts
Source Validation
Validate provenance references are still valid
Entity Re-Resolution
Re-run entity resolution with MEBN scoring
Stats Refresh
Recompute cached graph statistics
Community Rebuild
Re-run Louvain community detection
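Two of these primitives are simple enough to sketch in a few lines. The example below assumes a toy in-memory graph (a dict of node confidences and a list of scored edges) and hypothetical function names. It shows why ordering matters: confidence pruning can create orphans, so orphan cleanup runs afterwards.

```python
def confidence_prune(nodes, edges, min_confidence):
    """Drop nodes and edges below the threshold, plus edges left dangling.

    nodes: {node_id: confidence}
    edges: [(head, tail, confidence), ...]
    """
    kept_nodes = {n: c for n, c in nodes.items() if c >= min_confidence}
    kept_edges = [(h, t, c) for h, t, c in edges
                  if c >= min_confidence and h in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges


def orphan_cleanup(nodes, edges):
    """Remove nodes that no remaining edge touches."""
    connected = {n for h, t, _ in edges for n in (h, t)}
    return {n: c for n, c in nodes.items() if n in connected}


nodes = {"a": 0.9, "b": 0.95, "c": 0.2, "d": 0.8}
edges = [("a", "b", 0.9), ("b", "c", 0.9), ("c", "d", 0.9)]

nodes, edges = confidence_prune(nodes, edges, min_confidence=0.5)
nodes = orphan_cleanup(nodes, edges)   # "d" lost its only edge when "c" went
```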
Mutation Audit Trail
Every node/edge create, update, and delete is captured as a GraphMutationRecord with full before/after JSON snapshots, changeset correlation IDs, trigger source (user, crawler, maintenance), and actor attribution. Up to 50 maintenance history records retained.
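The before/after snapshot idea can be sketched as follows. This is not the GraphMutationRecord class itself: the field names, the dict-backed store, and the update_node helper are hypothetical stand-ins chosen to show how a changeset ID ties related writes together.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MutationRecord:
    """Illustrative record; field names are assumptions."""
    changeset_id: str        # correlates writes made in one logical change
    operation: str           # "CREATE" or "UPDATE" in this sketch
    trigger: str             # "user", "crawler", or "maintenance"
    actor: str
    before: Optional[str]    # JSON snapshot, None on create
    after: Optional[str]     # JSON snapshot after the write


def update_node(store, log, node_id, changes, *, changeset_id, trigger, actor):
    """Apply changes to a node and append an audit record for the write."""
    before = store.get(node_id)
    after = {**(before or {}), **changes}
    store[node_id] = after
    log.append(MutationRecord(
        changeset_id=changeset_id,
        operation="UPDATE" if before is not None else "CREATE",
        trigger=trigger,
        actor=actor,
        before=json.dumps(before, sort_keys=True) if before is not None else None,
        after=json.dumps(after, sort_keys=True),
    ))
    return after


store, log = {}, []
update_node(store, log, "n1", {"name": "Acme"},
            changeset_id="cs-1", trigger="crawler", actor="crawl-job")
update_node(store, log, "n1", {"name": "Acme Corp"},
            changeset_id="cs-1", trigger="crawler", actor="crawl-job")
```

Serializing snapshots at write time, rather than holding references to live objects, is what keeps the "before" state intact after later mutations.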
Real-Time Change Broadcasting
A Spring event-driven decorator publishes a NodeMutationEvent or EdgeMutationEvent on every write. GraphChangeWebSocketBroadcaster streams changes to connected clients in real time. Post-mutation hooks are configurable via GraphUpdateHookRegistry.
MCP Graph Tools for AI Agents
30+ MCP tool operations exposed via Spring AI @Tool annotations. Any MCP-compatible agent (Claude Code, Codex, Gemini CLI, Qwen) can read, write, traverse, and analyze your knowledge graph natively.
Mutation Tools
Create, update, delete nodes and edges. Bulk edge creation. Merge nodes (redirects all edges, then deletes the merged source). Algorithm cache auto-invalidation on every mutation.
Search Tools
Full-text graph search on nodes and edges. Node detail and edge detail lookups. Edges-between queries. Metadata search across the graph.
Traversal Tools
BFS traversal (depth 5), ego networks (radius 3), neighborhood queries, shortest path (Dijkstra/BFS), hybrid search (local/global/hybrid with vector weight and hop depth), and visualization data export.
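An ego network is just a breadth-first search that stops expanding at a fixed radius. A minimal sketch, assuming an adjacency-list dict and a hypothetical function name (the MCP tool's actual signature is not shown on this page):

```python
from collections import deque


def ego_network(adjacency, center, radius):
    """Return {node: hop distance} for every node within radius of center."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:   # do not expand past the radius
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
within_two = ego_network(adjacency, "a", radius=2)
# → {"a": 0, "b": 1, "c": 2}
```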
Algorithm Tools
PageRank, degree centrality (in/out/total), betweenness centrality with sampling, Jaccard node similarity. All backed by ND4J matrix operations with adjacency view caching.
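Jaccard node similarity compares two nodes by the overlap of their neighbor sets: |A ∩ B| / |A ∪ B|. The sketch below uses plain Python sets instead of ND4J matrices, and the function names are assumptions. It scores every pair, which is fine for illustration but quadratic in the number of nodes.

```python
import heapq
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_similar_pairs(neighbors, k):
    """neighbors: {node: set of neighbor ids}. Returns [(score, u, v), ...]."""
    scored = ((jaccard(neighbors[u], neighbors[v]), u, v)
              for u, v in combinations(sorted(neighbors), 2))
    return heapq.nlargest(k, scored)


neighbors = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"y", "z"}}
best = top_k_similar_pairs(neighbors, k=1)
# → [(1.0, "a", "b")]
```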
Community Tools
Louvain and weakly-connected-component detection. List community members. LLM-generated community summaries. Per-node community lookup. Jaccard top-k similar pairs.
Label & Named Graph Tools
Tag nodes with labels. Create and manage named graphs for logical sub-graph grouping. All operations respect fact sheet multi-tenant scoping.
Graph Algorithms & Embeddings
Graph Algorithms
ND4J-backed matrix algorithms running on your GPU or CPU. Adjacency view caching per fact sheet for fast repeated queries.
PageRank (power iteration)
Louvain community detection (modularity optimization)
Betweenness centrality (with sampling)
Degree centrality (in / out / total)
Dijkstra shortest path (weighted)
BFS traversal (unweighted)
Weakly connected components
Jaccard similarity (top-k)
LLM community summarization
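For readers unfamiliar with power iteration, here is a small pure-Python PageRank. Kompile runs this as ND4J matrix operations; the dict-based version below, its parameter defaults, and its handling of dangling nodes (spreading their rank uniformly) are assumptions chosen to keep the example self-contained.

```python
def pagerank(adjacency, damping=0.85, tolerance=1e-10, max_iterations=100):
    """adjacency: {node: [outgoing neighbors]}. Returns {node: rank}, sum = 1."""
    nodes = sorted(set(adjacency) | {t for ts in adjacency.values() for t in ts})
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iterations):
        # Rank held by nodes with no out-edges is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not adjacency.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}

        for node in nodes:
            targets = adjacency.get(node) or []
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)

        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tolerance:
            break
    return rank


cycle = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})   # symmetric: all equal
star = pagerank({"a": ["c"], "b": ["c"], "c": []})       # "c" collects rank
```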
KG Embeddings
Turn your graph into a queryable vector space. Two native embedding models trained directly on your graph topology.
TransE
Scoring: ||h + r - t||. SGD training with margin ranking loss and negative corruption sampling. Predict tails, heads, relations, or find similar entities.
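The TransE score and its margin ranking loss fit in a few lines. This sketch uses plain Python lists and the L2 norm; the function names and the margin default are assumptions, and the training loop (SGD with negative sampling) is omitted.

```python
import math


def transe_score(head, relation, tail):
    """||h + r - t|| under the L2 norm; lower means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2
                         for h, r, t in zip(head, relation, tail)))


def margin_ranking_loss(positive_score, negative_score, margin=1.0):
    """Zero once the corrupted triple scores worse by at least the margin."""
    return max(0.0, margin + positive_score - negative_score)


def predict_tail(head, relation, candidates):
    """candidates: {name: embedding}. Returns the best-scoring tail name."""
    return min(candidates,
               key=lambda name: transe_score(head, relation, candidates[name]))


h, r = [1.0, 0.0], [0.0, 1.0]
candidates = {"true_tail": [1.0, 1.0], "corrupted": [4.0, 5.0]}

positive = transe_score(h, r, candidates["true_tail"])   # 0.0
negative = transe_score(h, r, candidates["corrupted"])   # 5.0
```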
RotatE
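Complex-space rotation: t = h ∘ r, an element-wise product in which each relation component has unit modulus, so a relation acts as a per-dimension rotation. Self-adversarial negative sampling with softmax weighting. Handles symmetric, antisymmetric, inversion, and composition relation patterns.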
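A short sketch of the RotatE distance, using Python's built-in complex numbers. Each relation component is parameterized by a phase angle, so it has unit modulus and acts as a rotation. Summing per-dimension moduli is one common choice of norm; that choice and the function name are assumptions here, not a statement about Kompile's ND4J implementation.

```python
import cmath
import math


def rotate_score(head, phases, tail):
    """Sum over dimensions of |h_i * exp(i * phase_i) - t_i|; lower is better."""
    return sum(abs(h * cmath.exp(1j * phase) - t)
               for h, phase, t in zip(head, phases, tail))


head = [1 + 0j, 0 + 1j]

# A relation that rotates dimension 0 by 90 degrees and dimension 1 by 180.
phases = [math.pi / 2, math.pi]
tail = [0 + 1j, 0 - 1j]
exact = rotate_score(head, phases, tail)          # ~0: tail is head rotated

# A symmetric relation: rotating by pi twice returns to the start,
# so the relation holds in both directions.
flip = [math.pi, math.pi]
negated = [-h for h in head]
forward = rotate_score(head, flip, negated)       # ~0
backward = rotate_score(negated, flip, head)      # ~0
```

The symmetric case is the one TransE cannot represent without collapsing the relation vector to zero, which is the usual motivation for moving to rotations.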
Operations
Link prediction, entity similarity, head/tail/relation prediction. All tensor operations via ND4J. Serializable to disk for offline use.
Export & Interop
Nine export formats for downstream consumption. Merge graphs across environments with fuzzy Levenshtein + type deduplication and ID remapping.
JSON, JSON-LD, CSV, GraphML, Cypher Dump, HTML + D3.js, SVG Diagram, MediaWiki, Obsidian Vault
Storage backends: Neo4j with APOC upserts and Cypher queries, or built-in JPA + ND4J adjacency matrix for embedded deployments. Deterministic entity IDs ensure idempotent writes.
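Deterministic IDs make writes idempotent because re-ingesting the same entity lands on the same key. The page does not say how Kompile derives its IDs, so the sketch below is purely an assumed scheme: hash the tenant scope, entity type, and a normalized name. All names here are hypothetical.

```python
import hashlib
import unicodedata


def entity_id(fact_sheet_id: str, entity_type: str, name: str) -> str:
    """Assumed scheme: SHA-256 over scope, type, and normalized name."""
    normalized = " ".join(unicodedata.normalize("NFKC", name).casefold().split())
    key = "\x1f".join([fact_sheet_id, entity_type.upper(), normalized])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def upsert(store, fact_sheet_id, entity_type, name, **properties):
    """Create the node if missing, otherwise merge properties into it."""
    node_id = entity_id(fact_sheet_id, entity_type, name)
    node = store.setdefault(node_id, {"type": entity_type, "name": name})
    node.update(properties)
    return node_id


store = {}
first = upsert(store, "fs1", "Org", "Acme  Corp", industry="tools")
second = upsert(store, "fs1", "org", " acme corp ", country="US")
# Both calls resolve to the same node, so the second write is a merge.
```

Including the fact sheet ID in the hash input keeps identical names in different tenants from colliding.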
GraphRAG Search Modes
Three retrieval modes, each with three backend implementations (JPA, Neo4j Cypher, ND4J matrix).
LOCAL
Ego-network queries centered on matched entities. Fast, focused answers from the immediate neighborhood.
GLOBAL
Community-level summaries for broad questions. Aggregates knowledge across Louvain-detected communities.
HYBRID
Combined local + global with configurable vector weight and hop depth. Best of both for complex multi-hop reasoning.
Your knowledge graph is your organization's memory.
Start building the asset that compounds with every crawl.