How GroundCite works

GroundCite is a Retrieval-Augmented Generation (RAG) engine. Unlike a generic chatbot, it answers only from a fixed library of scientific papers and traces every claim back to the exact source passage. When the library does not cover a question, it says so rather than inventing an answer.

The pipeline

1 · You ask

A question in plain language about the indexed research corpus.

2 · Hybrid retrieval

Dense vector search (pgvector) and keyword search (BM25) run in parallel, then fuse with Reciprocal Rank Fusion to surface the most relevant passages.

3 · Rerank

A language model re-orders the candidates by true relevance to the question, keeping only the best passages.

4 · Grounded answer

The model answers using only those passages, citing each claim with [n]. If the evidence is insufficient, it refuses instead of guessing.
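The fusion in step 2 can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not GroundCite's actual code: the function name, the passage ids, and the constant k = 60 (a commonly used default) are assumptions, since the page does not state them.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into a single ranking.

    Each passage scores 1 / (k + rank) for every list it appears in;
    k = 60 is an assumed default, not a value taken from GroundCite.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical result lists from the dense and the BM25 retrievers.
dense_hits = ["p3", "p1", "p7"]
bm25_hits = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Because the fusion only looks at ranks, the dense and lexical scores never need to be put on a common scale, which is one reason this method suits combining two very different retrievers.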

Example output

A real question, answered from the corpus. The markers link each statement to a source below.

Question

What problem does sparse attention address for long-context LLM inference?

Answer

Sparse attention tackles the high computational and memory-bandwidth cost of full attention during long-context inference, where the KV cache grows with sequence length and the selection step itself can stay quadratic [1]. Approaches reduce this by decoupling selection and overlapping CPU-to-GPU prefetch with computation [1], or by distilling document collections into reusable key-value caches so static context is not re-encoded on every query [2].

Sources

[1] SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference. arXiv:2606.04511 · 2026-06-04
[2] Cartridges at Scale: Training Modular KV Caches over Large Document Collections. arXiv:2606.04557 · 2026-06-04
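One way to verify that every marker in an answer points at a listed source is a small post-check like the one below. It is a sketch under assumptions: the function name and the report format are hypothetical, and the page does not say whether GroundCite runs such a check.

```python
import re


def check_citations(answer, num_sources):
    """Compare the [n] markers in an answer against sources numbered 1..num_sources."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    valid = set(range(1, num_sources + 1))
    return {
        "dangling": sorted(cited - valid),  # markers with no matching source
        "unused": sorted(valid - cited),    # sources that are never cited
    }


# Hypothetical answer text with two listed sources.
answer = "Sparse attention cuts memory-bandwidth cost [1]. Cached KV avoids re-encoding [2]."
report = check_citations(answer, num_sources=2)
print(report)
```

A non-empty "dangling" list would indicate a marker that cannot be traced to a source.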

How it is evaluated

Measured, not assumed. RAGAS-style metrics over a hand-curated question set:

Faithfulness: 0.95
Answer relevancy: 0.68
Retrieval hit@k: 0.67

Faithfulness measures how much of the answer is supported by the retrieved passages; answer relevancy, how well it addresses the question; retrieval hit@k, whether the right paper is retrieved. The retrieval score is optimistic by design (each question targets a known paper) and is disclosed as such.
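As an illustration of the retrieval metric, hit@k can be computed as the fraction of questions whose target paper appears among the top k retrieved results. The sketch below assumes that definition; the value of k, the function name, and the sample ids are hypothetical and not taken from GroundCite's evaluation.

```python
def hit_at_k(retrieved, expected, k):
    """Fraction of questions whose expected paper id is in the top-k retrieved ids."""
    if not expected:
        raise ValueError("need at least one question")
    hits = sum(
        1 for paper_ids, target in zip(retrieved, expected) if target in paper_ids[:k]
    )
    return hits / len(expected)


# Hypothetical evaluation set: one ranked result list and one target paper per question.
retrieved = [
    ["A", "B", "C"],  # target A is at rank 1: hit
    ["D", "E", "F"],  # target F is at rank 3: miss for k = 2
    ["G", "H", "I"],  # target H is at rank 2: hit
]
expected = ["A", "F", "H"]

score = hit_at_k(retrieved, expected, k=2)
print(f"hit@2 = {score:.2f}")
```

Since each question is written with one known paper in mind, a metric of this form rewards finding that paper and says nothing about other relevant papers, which is why the score reads as optimistic.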

Under the hood

Corpus: recent arXiv cs.CL papers (NLP / LLM research), ~10k papers indexed full-text.
Retrieval: Postgres + pgvector (dense) and BM25 (lexical), fused with Reciprocal Rank Fusion, then LLM reranking.
Models: embeddings via BGE-M3 (1024-dim, multilingual); the corpus is embedded offline on a local GPU, and each query is embedded at runtime through OpenRouter. Generation and reranking use DeepSeek V3.2, also via OpenRouter.
Stack: FastAPI backend, Next.js frontend, deployed with Docker behind Traefik.

Running live queries

This page is public, but running a live query needs an access code, which keeps the public demo's cost bounded. Contact me to get an access code, then enter it on the home page and ask away.