YOUR ORGANIZATION'S MOST VALUABLE ASSET

Knowledge Graphs & Crawl

Models come and go. Your organization's knowledge is permanent. Kompile crawls your entire data estate and compiles it into typed, versioned knowledge graphs with probabilistic reasoning, causal inference, and full provenance — the structured context layer that makes every AI system smarter.


Why Knowledge Graphs Are Your Core Asset

LLM providers change quarterly. Your organizational knowledge — relationships between people, processes, decisions, and documents — is the competitive advantage that compounds over time.

Own

Your knowledge graph runs on your infrastructure. No vendor lock-in. No data leaving your network. Full sovereignty over your organizational intelligence.

Compound

Every crawl enriches the graph. Entity resolution merges duplicates. Community detection discovers structure. The graph gets smarter with every ingestion cycle.

Audit

Every node and edge carries provenance, confidence scores, and timestamps. Every mutation is logged with before/after snapshots. Regulatory-grade traceability built in.

The Crawl Pipeline

One command ingests your entire data estate through an 8-phase pipeline with adaptive memory-aware parallelism.

terminal

$ kompile crawl start https://docs.example.com \
    --depth 3 --graph --graph-schema-mode STRICT \
    --chunker tableAwareChunker --watch

  Phase 1/8  Loading........... 142 documents from 3 sources
  Phase 2/8  Classifying....... 98 text, 31 mixed PDF, 13 spreadsheet
  Phase 3/8  Preprocessing..... lang detect, PII redact, dedup
  Phase 4/8  Chunking.......... 4,210 chunks (table-aware)
  Phase 5/8  Extracting........ 1,847 entities, 2,391 relationships
  Phase 6/8  Resolving......... 312 duplicates merged (cosine > 0.88)
  Phase 7/8  Indexing.......... 4,210 chunks embedded (BGE-M3)
  Phase 8/8  Enriching......... taxonomy discovered, 14 categories

  Done. Graph: 1,535 nodes, 2,391 edges, 14 communities

1. Load

Pull data from 20+ sources with OAuth2 auth, robots.txt compliance, rate limiting, and incremental content-hash crawling.
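
Incremental content-hash crawling can be pictured as follows. This is a minimal sketch, not Kompile's implementation: the function names and the in-memory `seen` map are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def content_hash(body: bytes) -> str:
    """Return a stable SHA-256 fingerprint of a fetched document body."""
    return hashlib.sha256(body).hexdigest()


def changed_documents(fetched: dict, seen: dict) -> list:
    """Return the URLs whose content differs from the previous crawl.

    `fetched` maps URL -> raw bytes; `seen` maps URL -> last known hash
    and is updated in place (hypothetical stand-in for persistent state).
    """
    changed = []
    for url, body in fetched.items():
        digest = content_hash(body)
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed
```

On a second crawl, only documents whose bytes changed are returned, so downstream phases can skip everything else.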

2. Classify & Route

Auto-classify PDFs as text/image/mixed. Route tables, formulas, slides, audio, and email to specialized extractors.

3. Preprocess

Language detection, translation, boilerplate removal, Unicode normalization, PII redaction, content-hash + SimHash deduplication.
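
SimHash deduplication catches near-duplicates that an exact content hash misses. The sketch below shows the general technique under stated assumptions: whitespace tokenization, 64-bit fingerprints, and a Hamming-distance threshold of 3 are illustrative choices, not documented Kompile settings.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint (up to 64 bits) from lowercase tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        # Take the first 8 bytes of SHA-256 as a 64-bit token hash.
        h = int.from_bytes(hashlib.sha256(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close."""
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

Because each bit is decided by a vote across all tokens, small edits flip only a few bits, while unrelated texts differ in roughly half of them.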

4. Chunk

Five strategies: sentence, recursive character, markdown-aware, token-based, table-aware. Register SNIPPET nodes with CONTAINS edges.

5. Extract Graph

Multi-agent LLM + pattern NER extraction with cost-balanced batch planning. Schema enforcement: None, Lenient, or Strict.

6. Resolve

Entity resolution via Levenshtein + embedding cosine + MEBN probabilistic scoring. Compute shared-entity and similarity edges.
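
A simplified view of the string and embedding signals is sketched below. The 0.88 cosine threshold comes from the sample crawl output above; the name-similarity threshold, the way the two signals are combined, and all function names are assumptions, and the MEBN scoring step is omitted.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def same_entity(name_a, vec_a, name_b, vec_b,
                cosine_threshold=0.88, name_threshold=0.8) -> bool:
    """Hypothetical merge rule: both signals must clear their thresholds."""
    return (cosine(vec_a, vec_b) > cosine_threshold
            and name_similarity(name_a.lower(), name_b.lower()) >= name_threshold)
```

Requiring both signals is the conservative choice: it avoids merging entities that merely share a name or merely appear in similar contexts.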

7. Index

Embed chunks with BGE, Arctic, or SPLADE++ and store in Anserini HNSW, pgvector, Vespa, or Chroma.

8. Enrich

Deduplicate, prune, validate, normalize. Discover domain taxonomy via LLM, categorize entities, rebuild search indexes.

Supported Data Sources

Confluence, Jira, Notion, Slack, Discord, Google Workspace, Gmail, OneDrive, Reddit, IMAP / POP3, MBOX / PST, S3 / SFTP / SMB, SQL Databases, Web Crawler, Local Filesystem

Graph Architecture

A typed, hierarchical graph model with seven node levels, eleven edge types, and full provenance on every mutation. Multi-tenant isolation via fact sheets and named graph scoping.

Node Hierarchy

SOURCE: Root document source (URL, file, database)

DOCUMENT: A single document within a source

SNIPPET: A chunk or paragraph within a document

ENTITY: Extracted named entity (person, org, concept)

TABLE: Tabular data extracted from documents

ATTACHMENT: Binary attachments (images, embedded PDFs)

CUSTOM: User-defined nodes for domain-specific types

Edge Types

HIERARCHICAL, EMBEDDING_SIMILARITY, SHARED_ENTITY, EXTRACTED_FROM, CONTAINS, CITATION, TEMPORAL, CROSS_SOURCE, AUTHORED_BY, ADDRESSED_TO, USER_DEFINED

Provenance & Confidence

Every node carries source attribution, confidence score, provenance string, occurrence/observation/creation timestamps, and optional TTL expiry. User-pinned nodes resist automated pruning.

KG Embeddings

Each node stores an ND4J INDArray embedding computed via TransE or RotatE. Used for link prediction, entity similarity, and embedding-similarity edges across the graph.

Multi-Tenant Isolation

Every node and edge is scoped by factSheetId for complete tenant isolation. Named graph IDs enable logical sub-graph grouping within a fact sheet. All queries, algorithm caches, and maintenance ops respect scoping.

Probabilistic Reasoning & Causal Inference

Kompile graphs aren't just structural — they support probabilistic reasoning over live knowledge graph state through Multi-Entity Bayesian Networks and causal inference.

Multi-Entity Bayesian Networks (MEBN)

First-order probabilistic logic grounded against live knowledge graph state. Build MTheory templates with reusable MFrag fragments, then ground them into situation-specific networks at query time.

MFrag Templates

EntityRelevance, CausalInfluence, InformationFlow, RiskPropagation

SSBN Grounding

BFS expansion from seed nodes through the live KG builds a query-specific network

Variable Elimination

Compute P(query | evidence) with CPTs, Noisy-OR gates, and ND4J factor operations

Entity Resolution

MEBN-scored P(isSameEntity | signals) alongside Levenshtein and cosine scoring

FOL Predicates

GraphKnowledgeBase evaluates atomic predicates with universal/existential quantifiers
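
To make the Noisy-OR gate concrete: each active cause independently has some chance of producing the effect, and the effect fails only if every active cause fails. The sketch below shows that one building block only. The function name, argument shapes, and the optional leak term are illustrative assumptions; full variable elimination over a grounded network is not shown.

```python
def noisy_or(link_probs: dict, active: set, leak: float = 0.0) -> float:
    """Return P(effect = true | active causes) under a Noisy-OR gate.

    link_probs maps each cause to the probability that it alone
    produces the effect; `leak` is the probability that the effect
    occurs with no active cause at all.
    """
    p_effect_absent = 1.0 - leak
    for cause, p in link_probs.items():
        if cause in active:
            p_effect_absent *= 1.0 - p
    return 1.0 - p_effect_absent
```

With two active causes of strength 0.5 each and no leak, the effect probability is 1 - 0.5 × 0.5 = 0.75. The gate needs one parameter per cause rather than a full conditional probability table, which is exponential in the number of causes.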

Causal Inference

Eight W3C PROV-DM aligned causal edge types for modeling how events, decisions, and data flow through your organization. Temporal chain extraction, attribution paths, and counterfactual modeling.

CAUSES

Direct causal relationship between events

ENABLES

Precondition that makes an outcome possible

TRIGGERS

Event that initiates a downstream process

CONTRIBUTES_TO

Partial influence on an outcome

PREVENTS

Blocks or inhibits an outcome

CORRELATES_WITH

Statistical association without proven causation

INFLUENCES

Indirect effect on downstream state

DERIVED_FROM

Data lineage and provenance chain (W3C PROV-DM)

Graph Maintenance

Knowledge graphs are living systems. Kompile provides nine automated maintenance primitives that keep your graph clean, consistent, and trustworthy — with full audit logging on every operation.

TTL Sweep

Expire nodes past their validUntil timestamp

Orphan Cleanup

Remove nodes with no remaining edges

Confidence Pruning

Remove low-confidence nodes and edges

Component Pruning

Remove small isolated subgraph components

Contradiction Detection

Find and flag conflicting facts

Source Validation

Validate provenance references are still valid

Entity Re-Resolution

Re-run entity resolution with MEBN scoring

Stats Refresh

Recompute cached graph statistics

Community Rebuild

Re-run Louvain community detection
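
Two of these primitives, TTL sweep and orphan cleanup, reduce to simple set computations. The sketch below assumes a hypothetical `Node` record with an epoch-seconds `valid_until` and a `pinned` flag, and assumes that pinned nodes are exempt from both operations. Neither the schema nor that exemption rule is taken from Kompile's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: str
    valid_until: Optional[float] = None  # epoch seconds; None means no expiry
    pinned: bool = False                 # user-pinned nodes are never swept


def ttl_expired_ids(nodes, now: float) -> set:
    """Return ids of unpinned nodes whose valid_until lies in the past."""
    return {
        n.node_id
        for n in nodes
        if not n.pinned and n.valid_until is not None and n.valid_until < now
    }


def orphan_ids(nodes, edges) -> set:
    """Return ids of unpinned nodes that appear in no (source, target) edge."""
    connected = {endpoint for edge in edges for endpoint in edge}
    return {n.node_id for n in nodes if not n.pinned and n.node_id not in connected}
```

Returning the candidate ids instead of deleting in place keeps each primitive easy to audit: the caller can log the set before applying the removal.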

Mutation Audit Trail

Every node/edge create, update, and delete is captured as a GraphMutationRecord with full before/after JSON snapshots, changeset correlation IDs, trigger source (user, crawler, maintenance), and actor attribution. Up to 50 maintenance history records are retained.
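
One way to picture such a record is sketched below. Only the name GraphMutationRecord and the listed contents come from the description above; every field name, the helper function, and the bounded-deque history are hypothetical.

```python
import json
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GraphMutationRecord:
    operation: str          # "CREATE", "UPDATE", or "DELETE"
    target_id: str          # id of the node or edge that changed
    before: Optional[str]   # JSON snapshot; None for creates
    after: Optional[str]    # JSON snapshot; None for deletes
    changeset_id: str       # correlates mutations made in one batch
    trigger: str            # "user", "crawler", or "maintenance"
    actor: str              # who or what performed the change


def record_update(target_id, old: dict, new: dict, changeset_id, trigger, actor):
    """Build an UPDATE record with canonical (sorted-key) JSON snapshots."""
    return GraphMutationRecord(
        "UPDATE", target_id,
        json.dumps(old, sort_keys=True), json.dumps(new, sort_keys=True),
        changeset_id, trigger, actor,
    )


# A bounded history: once 50 entries exist, the oldest is dropped.
maintenance_history = deque(maxlen=50)
```

Freezing the dataclass makes records immutable after creation, which is the property an audit trail needs.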

Real-Time Change Broadcasting

A Spring event-driven decorator publishes NodeMutationEvent and EdgeMutationEvent on every write. GraphChangeWebSocketBroadcaster streams changes to connected clients in real time. Post-mutation hooks are configurable via GraphUpdateHookRegistry.

MCP Graph Tools for AI Agents

30+ MCP tool operations exposed via Spring AI @Tool annotations. Any MCP-compatible agent (Claude Code, Codex, Gemini CLI, Qwen) can read, write, traverse, and analyze your knowledge graph natively.

Mutation Tools

Create, update, delete nodes and edges. Bulk edge creation. Merge nodes (redirects all edges, then deletes the merged source). Algorithm cache auto-invalidation on every mutation.

Search Tools

Full-text graph search on nodes and edges. Node detail and edge detail lookups. Edges-between queries. Metadata search across the graph.

Traversal Tools

BFS traversal (depth 5), ego networks (radius 3), neighborhood queries, shortest path (Dijkstra/BFS), hybrid search (local/global/hybrid with vector weight and hop depth), and visualization data export.
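
An ego network is a breadth-first search cut off at a fixed radius. The sketch below illustrates the idea on a plain adjacency dictionary. The function name and data layout are assumptions, and the default radius of 3 mirrors the figure quoted above.

```python
from collections import deque


def ego_network(adjacency: dict, seed: str, radius: int = 3) -> dict:
    """Return {node: hop distance} for every node within `radius` hops of `seed`."""
    distances = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distances[node] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances
```

The same loop without the radius check is an unweighted shortest-path search, since BFS reaches every node by a minimum-hop route.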

Algorithm Tools

PageRank, degree centrality (in/out/total), betweenness centrality with sampling, Jaccard node similarity. All backed by ND4J matrix operations with adjacency view caching.

Community Tools

Louvain and weakly-connected-component detection. List community members. LLM-generated community summaries. Per-node community lookup. Jaccard top-k similar pairs.

Label & Named Graph Tools

Tag nodes with labels. Create and manage named graphs for logical sub-graph grouping. All operations respect fact sheet multi-tenant scoping.

Graph Algorithms & Embeddings

Graph Algorithms

ND4J-backed matrix algorithms running on your GPU or CPU. Adjacency view caching per fact sheet for fast repeated queries.

PageRank (power iteration)

Louvain community detection (modularity optimization)

Betweenness centrality (with sampling)

Degree centrality (in / out / total)

Dijkstra shortest path (weighted)

BFS traversal (unweighted)

Weakly connected components

Jaccard similarity (top-k)

LLM community summarization
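
PageRank by power iteration fits in a few lines. The version below is a pure-Python sketch for illustration, not the ND4J matrix implementation. The damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from nodes without outgoing edges are conventional assumptions.

```python
def pagerank(adjacency: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Approximate PageRank for a directed graph given as {source: [targets]}."""
    nodes = sorted(set(adjacency) | {t for ts in adjacency.values() for t in ts})
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not adjacency.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in adjacency.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Each iteration preserves a total rank of 1, so the result can be read as a probability distribution over nodes.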

KG Embeddings

Turn your graph into a queryable vector space. Two native embedding models trained directly on your graph topology.

TransE

Scoring: ||h + r - t||. SGD training with margin ranking loss and negative corruption sampling. Predict tails, heads, relations, or find similar entities.
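
The TransE score and its training objective can be written out directly. This sketch assumes the L2 norm (the description above does not say whether L1 or L2 is used) and a margin of 1.0; the function names are illustrative.

```python
import math


def transe_distance(h, r, t) -> float:
    """L2 norm of h + r - t; a smaller value means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_ranking_loss(positive: float, negative: float, margin: float = 1.0) -> float:
    """Hinge loss: zero once the true triple is at least `margin` closer
    than the corrupted (negative) triple."""
    return max(0.0, margin + positive - negative)
```

Negative triples are typically built by replacing the head or tail of a true triple with a random entity. Training then lowers the distance of true triples relative to those corruptions.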

RotatE

Complex-space rotation: t = h ∘ r, where ∘ is the element-wise product and every component of r has unit modulus. Self-adversarial negative sampling with softmax weighting. Handles symmetric, antisymmetric, inversion, and composition relation patterns.
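
In RotatE, a relation is a vector of rotation angles applied dimension by dimension to a complex-valued head embedding. The sketch below uses Python's built-in complex numbers. The function name and the choice to sum per-dimension moduli are illustrative assumptions.

```python
import cmath


def rotate_distance(h, phases, t) -> float:
    """Sum over dimensions of |h_i * e^(i * theta_i) - t_i|.

    h and t are sequences of complex numbers; phases holds one rotation
    angle (radians) per dimension, so each relation component has modulus 1.
    """
    return sum(
        abs(hi * cmath.exp(1j * theta) - ti)
        for hi, theta, ti in zip(h, phases, t)
    )
```

A rotation by π in every dimension is its own inverse, which is how a symmetric relation is represented: applying it to h gives t, and applying it to t gives h back.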

Operations

Link prediction, entity similarity, head/tail/relation prediction. All tensor operations via ND4J. Serializable to disk for offline use.

Export & Interop

Nine export formats for downstream consumption. Merge graphs across environments with fuzzy Levenshtein + type deduplication and ID remapping.

JSON, JSON-LD, CSV, GraphML, Cypher Dump, HTML + D3.js, SVG Diagram, MediaWiki, Obsidian Vault

Storage backends: Neo4j with APOC upserts and Cypher queries, or built-in JPA + ND4J adjacency matrix for embedded deployments. Deterministic entity IDs ensure idempotent writes.
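
Deterministic IDs make writes idempotent because re-ingesting the same entity always targets the same key. The sketch below hashes a normalized (fact sheet, type, name) triple. Which fields feed the ID, the normalization rules, and the 32-character truncation are all assumptions made for illustration.

```python
import hashlib


def entity_id(fact_sheet_id: str, entity_type: str, name: str) -> str:
    """Derive a stable id from tenant scope, entity type, and normalized name."""
    normalized_name = " ".join(name.lower().split())
    canonical = "\x1f".join([fact_sheet_id, entity_type.upper(), normalized_name])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]


def upsert(store: dict, fact_sheet_id: str, entity_type: str, name: str,
           properties: dict) -> str:
    """Insert or update an entity; repeating the call never creates a duplicate."""
    key = entity_id(fact_sheet_id, entity_type, name)
    store.setdefault(key, {}).update(properties)
    return key
```

Including the fact sheet ID in the hash keeps identically named entities in different tenants from colliding.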

GraphRAG Search Modes

Three retrieval modes, each with three backend implementations (JPA, Neo4j Cypher, ND4J matrix).

LOCAL

Ego-network queries centered on matched entities. Fast, focused answers from the immediate neighborhood.

GLOBAL

Community-level summaries for broad questions. Aggregates knowledge across Louvain-detected communities.

HYBRID

Combined local + global with configurable vector weight and hop depth. Best of both for complex multi-hop reasoning.
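
A configurable vector weight suggests a blend between a vector-similarity score and a graph-derived score. The sketch below uses plain linear interpolation. That formula, the function names, and the assumption that both scores are already normalized to a comparable range are illustrative guesses, and hop depth is not modeled.

```python
def hybrid_score(vector_score: float, graph_score: float,
                 vector_weight: float = 0.5) -> float:
    """Linearly interpolate between vector and graph relevance."""
    if not 0.0 <= vector_weight <= 1.0:
        raise ValueError("vector_weight must be between 0 and 1")
    return vector_weight * vector_score + (1.0 - vector_weight) * graph_score


def rank_hybrid(candidates: dict, vector_weight: float = 0.5) -> list:
    """Rank node ids given {node_id: (vector_score, graph_score)}, best first."""
    return sorted(
        candidates,
        key=lambda node_id: hybrid_score(*candidates[node_id], vector_weight),
        reverse=True,
    )
```

Setting the weight to 1.0 reduces to pure vector retrieval and 0.0 to pure graph ranking, so one knob covers the spectrum between the two.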

Your knowledge graph is your organization's memory.

Start building the asset that compounds with every crawl.
