Tags: LightRAG, Retrieval‑Augmented Generation, Knowledge Graph, Neo4j, Qdrant, Mistral‑7B, Ollama, Healthcare Audit, Dual‑Level Retrieval, On‑Prem LLM, Incremental Ingest, Grafana, Prometheus
Pilot Project | Graph‑Enhanced RAG Chatbot for an Auditing and Consulting Firm
Morice Nouvertne

A firm in the public sector was drowning in tens of thousands of pages of legislation, audit manuals, and funding guidelines. Staff spent hours searching PDFs or emailing “the expert down the hall.” My brief was clear:
Give consultants an offline, private large‑language‑model assistant that can:
- answer nuanced regulatory questions,
- cite sources,
- keep pace with weekly policy updates,

…without sending sensitive data to external APIs.
Why LightRAG?
Traditional chunk‑based RAG returned isolated paragraphs that lacked cross‑document context (“flat retrieval”). LightRAG’s graph‑first design solved three pain points:
- Entity relations matter in healthcare law (e.g., hospital ➜ financing method ➜ county ordinance).
- Dual‑level retrieval lets users ask both “What does § 301 SGB V change?” and “How does that affect funding for geriatric rehab?”
- Incremental graph updates keep the bot current without re‑embedding an entire 8 GB corpus every week.
Architecture at a Glance
1. Document Ingestion
- Source documents included PDFs, DOCX files, and structured CSVs containing legal texts, audit guides, and compliance documentation.
- Ingestion ran on an air-gapped virtual machine, ensuring no sensitive content left the internal network.
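In outline, this stage can look like the following sketch. Library choices (pypdf, python-docx) and the sliding‑window chunk size are illustrative assumptions, not the exact production pipeline:

```python
# Illustrative ingestion sketch: load PDFs/DOCX/CSVs from a local folder and
# split them into overlapping text chunks (loaders and parameters are assumptions).
from pathlib import Path
import csv

from pypdf import PdfReader   # pip install pypdf
from docx import Document     # pip install python-docx

def load_text(path: Path) -> str:
    if path.suffix.lower() == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
    if path.suffix.lower() == ".docx":
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    if path.suffix.lower() == ".csv":
        with open(path, newline="", encoding="utf-8") as f:
            return "\n".join(" | ".join(row) for row in csv.reader(f))
    raise ValueError(f"Unsupported file type: {path}")

def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    # Simple sliding-window chunking; LightRAG's own splitter may differ.
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

docs = {p.name: chunk(load_text(p)) for p in Path("corpus").glob("*.*")}
```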
2. Entity & Relationship Extraction
- Text segments were passed through Mistral-7B (via Ollama) to identify entities (e.g., organizations, policies) and their relationships (e.g., funding paths, jurisdictional links).
- Extracted entities and relations were transformed into node and edge data structures.
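A hedged sketch of the extraction call against a local Ollama instance follows. The prompt wording, JSON schema, and model tag are illustrative assumptions; /api/generate is Ollama's standard non‑streaming generation endpoint:

```python
# Extraction sketch: ask the local Mistral-7B model (via Ollama) to return
# entities and relations as JSON. Prompt and schema are illustrative only.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local deployment

EXTRACTION_PROMPT = """Extract entities and relationships from the text below.
Return JSON: {{"entities": [{{"name": ..., "type": ..., "description": ...}}],
"relations": [{{"source": ..., "target": ..., "type": ..., "description": ...}}]}}

Text:
{chunk}
"""

def extract(chunk: str) -> dict:
    resp = requests.post(OLLAMA_URL, json={
        "model": "mistral:7b-instruct",
        "prompt": EXTRACTION_PROMPT.format(chunk=chunk),
        "stream": False,
        "format": "json",   # constrain Ollama's output to valid JSON
    }, timeout=120)
    resp.raise_for_status()
    return json.loads(resp.json()["response"])
```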
3. Knowledge Graph Construction
- Extracted data was ingested into a Neo4j graph database.
- Nodes represented entities; edges represented relationships.
- Deduplication and profiling ensured a clean and consistent graph structure (~2 million nodes and 6 million relationships).
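A sketch of the graph‑write step, assuming the official neo4j Python driver and a single Entity label; MERGE keeps re‑ingestion idempotent, though the real schema and deduplication logic are more involved:

```python
# Graph-construction sketch: upsert entities and relations into Neo4j.
# Connection details, labels, and property names are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def upsert(entities: list[dict], relations: list[dict]) -> None:
    with driver.session() as session:
        for e in entities:
            session.run(
                "MERGE (n:Entity {name: $name}) "
                "SET n.type = $type, n.description = $description",
                **e,
            )
        for r in relations:
            session.run(
                "MATCH (a:Entity {name: $source}), (b:Entity {name: $target}) "
                "MERGE (a)-[rel:RELATED {type: $type}]->(b) "
                "SET rel.description = $description",
                **r,
            )
```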
4. Vector Embedding Layer
- Descriptions of each node and edge were embedded using Mistral‑7B Instruct.
- These embeddings were stored in Qdrant, enabling fast similarity search.
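The indexing step can be sketched as below, assuming Ollama's /api/embeddings endpoint and the qdrant-client library; the collection name, model tag, and payload layout are placeholders:

```python
# Embedding sketch: embed node/edge descriptions via Ollama and store them in
# a Qdrant collection configured for cosine similarity.
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

EMBED_URL = "http://localhost:11434/api/embeddings"
qdrant = QdrantClient(url="http://localhost:6333")

def embed(text: str) -> list[float]:
    resp = requests.post(EMBED_URL, json={"model": "mistral:7b-instruct", "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

dim = len(embed("probe"))  # infer vector size from the model
qdrant.recreate_collection(
    collection_name="lightrag_descriptions",
    vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)

def index(node_id: int, description: str) -> None:
    qdrant.upsert(
        collection_name="lightrag_descriptions",
        points=[PointStruct(id=node_id, vector=embed(description),
                            payload={"description": description})],
    )
```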
5. Dual-Level Retrieval Logic
- The LightRAG orchestrator accepted user questions and extracted both low-level (specific entities) and high-level (conceptual topics) keywords.
- These were matched against the graph (via Neo4j queries) and vector space (via Qdrant).
- The resulting context included descriptions from connected entities and semantic neighbors.
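A simplified sketch of how the two levels can be combined, reusing the illustrative extract() and embed() helpers above: entities found in the question drive a Neo4j neighborhood lookup (low level), while the whole question drives a Qdrant similarity search (high level). LightRAG's own keyword prompts are richer than this:

```python
def retrieve(question: str, top_k: int = 8) -> list[str]:
    context: list[str] = []

    # Low-level: entities mentioned in the question -> graph neighborhoods
    for entity in extract(question).get("entities", []):
        with driver.session() as session:
            records = session.run(
                "MATCH (n:Entity {name: $name})-[r]-(m:Entity) "
                "RETURN n.description AS src, r.description AS rel, m.description AS dst "
                "LIMIT $k",
                name=entity["name"], k=top_k,
            )
            context += [f"{rec['src']} -[{rec['rel']}]- {rec['dst']}" for rec in records]

    # High-level: semantic neighbors of the whole question from Qdrant
    hits = qdrant.search(
        collection_name="lightrag_descriptions",
        query_vector=embed(question),
        limit=top_k,
    )
    context += [hit.payload["description"] for hit in hits]
    return context
```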
6. Response Generation
- Context data was passed to the Mistral-7B LLM for final answer generation.
- Citations and context were included in the response to support transparency.
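A sketch of the generation step, numbering the retrieved context so the model can cite it; the prompt wording is an assumption, not the production template (OLLAMA_URL is reused from the extraction sketch):

```python
def answer(question: str) -> str:
    # Pack retrieved context into a numbered list and ask for cited answers.
    context = retrieve(question)
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    prompt = (
        "Answer the question using only the numbered context below. "
        "Cite the supporting context numbers in square brackets.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )
    resp = requests.post(OLLAMA_URL, json={
        "model": "mistral:7b-instruct", "prompt": prompt, "stream": False,
    }, timeout=180)
    resp.raise_for_status()
    return resp.json()["response"]
```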
7. Deployment & Serving
- The entire system was containerized with Docker Compose and deployed to a local Proxmox-based server cluster.
- Inference ran via Ollama with both GPU and CPU support.
- Monitoring and metrics were tracked using Prometheus and Grafana dashboards.
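A minimal sketch of the serving layer: a FastAPI chat endpoint wrapping the answer() helper above and exporting a Prometheus latency histogram for the Grafana dashboards. Route and metric names are assumptions:

```python
# Serving sketch: FastAPI chat endpoint instrumented with a Prometheus
# histogram, mounted as a /metrics scrape target.
import time

from fastapi import FastAPI
from prometheus_client import Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrape target

QUERY_LATENCY = Histogram("chat_query_latency_seconds", "End-to-end chat latency")

@app.post("/chat")
def chat(question: str) -> dict:
    start = time.perf_counter()
    reply = answer(question)  # generation sketch from above
    QUERY_LATENCY.observe(time.perf_counter() - start)
    return {"answer": reply}
```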
Outcomes
| KPI | Before | After LightRAG | Δ |
|---|---|---|---|
| Avg. time to locate correct paragraph | 12 min (manual search) | 35 sec (chat) | −95% |
| Answer helpfulness (internal survey, 1–5) | 2.7 | 4.4 | +63% |
| Source‑attribution accuracy | n/a | 98% (spot‑checked) | n/a |
| Weekly update latency | 1–2 days | < 30 min (incremental ingest) | −97% |
What I Learned
- End‑to‑end ownership: stood up dev, test, and prod clusters solo—including RabbitMQ task queues, CI/CD, and Grafana + Prometheus dashboards for GPU & query‑latency monitoring.
- Graph pruning matters: quarterly merge‑duplicate‑entity jobs keep Neo4j < 10 GB despite rapid growth.
- LLM‑on‑prem equals trust: local Mistral‑7B (int8) satisfied both data‑protection and cost‑efficiency constraints—~€ 0 inference cost.
- Dual‑level retrieval UX: surfacing why an answer was chosen (entity vs. relation evidence) boosts consultant confidence far more than raw text snippets.
Tech Stack
- LLM & Serving: Mistral‑7B via Ollama (GPU + CPU fall‑back)
- Knowledge Graph: Neo4j 5
- Vector Store: Qdrant (HNSW, cosine)
- Backend: Python 3.11 (LLM orchestration, FastAPI chat backend)
- Deployment: Docker / Compose + Git‑based CI/CD
- Monitoring: Grafana + Prometheus