Sparse attention tackles the high computational and memory-bandwidth cost of full attention during long-context inference, where the KV cache grows with sequence length and the selection step itself can stay quadratic [1]. Approaches reduce this by decoupling selection and overlapping CPU-to-GPU prefetch with computation [1], or by distilling document collections into reusable key-value caches so static context is not re-encoded on every query [2].
How GroundCite works
GroundCite is a Retrieval-Augmented Generation (RAG) engine. Unlike a generic chatbot, it answers only from a fixed library of scientific papers and traces every claim back to the exact source passage. When the library does not cover a question, it says so rather than inventing an answer.
The pipeline
The input is a plain-language question about the indexed research corpus.
Dense vector search (pgvector) and keyword search (BM25) run in parallel, then fuse with Reciprocal Rank Fusion to surface the most relevant passages.
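The fusion step can be illustrated with a minimal sketch of Reciprocal Rank Fusion. It assumes each retriever returns an ordered list of passage ids; the ids, the function name, and the constant `k=60` (a commonly used default) are illustrative assumptions, not GroundCite's actual code or settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage ids into a single ranking.

    Each passage scores 1 / (k + rank) per list it appears in (rank starts at 1);
    k=60 is an assumed default, not a value taken from GroundCite.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical result lists from the two retrievers.
dense_hits = ["p3", "p1", "p7"]    # dense vector search
lexical_hits = ["p1", "p9", "p3"]  # BM25 keyword search

fused = reciprocal_rank_fusion([dense_hits, lexical_hits])
print(fused)  # ['p1', 'p3', 'p9', 'p7']
```

Passages that both retrievers return (`p1`, `p3`) rise above passages found by only one, which is the property that makes rank fusion useful without having to calibrate the two score scales against each other.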
A language model re-orders the candidates by true relevance to the question, keeping only the best passages.
The model answers using only those passages, citing each claim with [n]. If the evidence is insufficient, it refuses instead of guessing.
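One way to enforce the [n] citation convention is a post-generation check that every marker points at a passage that was actually supplied to the model. The sketch below is hypothetical: the page does not say whether GroundCite runs such a check, and the function names are made up for illustration.

```python
import re

# Matches citation markers of the form [1], [2], ...
CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer, num_passages):
    """Return (cited, invalid): in-range source numbers and out-of-range markers."""
    found = {int(n) for n in CITATION.findall(answer)}
    valid = {n for n in found if 1 <= n <= num_passages}
    return sorted(valid), sorted(found - valid)


def is_grounded(answer, num_passages):
    """True if the answer cites at least one passage and no nonexistent ones."""
    cited, invalid = check_citations(answer, num_passages)
    return bool(cited) and not invalid


# Hypothetical answers checked against two supplied passages.
print(check_citations("Claim A [1]. Claim B [2][4].", 2))  # ([1, 2], [4])
print(is_grounded("Claim A [1]. Claim B [2].", 2))         # True
print(is_grounded("An answer with no citations.", 2))      # False
```

A check like this only verifies that markers are well-formed and in range; whether the cited passage actually supports the claim is what the faithfulness metric is meant to capture.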
Example output
A real question, answered from the corpus. The markers link each statement to a source below.
What problem does sparse attention address for long-context LLM inference?
Sources
- [1] SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference. arXiv:2606.04511 · 2026-06-04
- [2] Cartridges at Scale: Training Modular KV Caches over Large Document Collections. arXiv:2606.04557 · 2026-06-04
How it is evaluated
Measured, not assumed. RAGAS-style metrics over a hand-curated question set:
Faithfulness measures how much of the answer is supported by the retrieved passages; answer relevancy, how well it addresses the question; retrieval hit@k, whether the right paper is retrieved. The retrieval score is optimistic by design (each question targets a known paper) and is disclosed as such.
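The retrieval metric is simple enough to sketch directly. The version below assumes each evaluation question is paired with one known target paper and that retrieval returns an ordered list of paper ids; the ids and function names are hypothetical, and this is not GroundCite's evaluation code.

```python
def hit_at_k(retrieved_ids, target_id, k):
    """1.0 if the target paper appears among the top-k retrieved ids, else 0.0."""
    return 1.0 if target_id in retrieved_ids[:k] else 0.0


def mean_hit_at_k(runs, k):
    """Average hit@k over (retrieved_ids, target_id) pairs, one pair per question."""
    return sum(hit_at_k(retrieved, target, k) for retrieved, target in runs) / len(runs)


# Hypothetical evaluation runs: (ranked paper ids, known target paper).
runs = [
    (["paper-A", "paper-B", "paper-C"], "paper-A"),  # target at rank 1
    (["paper-D", "paper-E", "paper-F"], "paper-F"),  # target at rank 3
]

print(mean_hit_at_k(runs, k=2))  # 0.5
print(mean_hit_at_k(runs, k=3))  # 1.0
```

Because every question is written with a target paper in mind, a score computed this way says little about questions the corpus cannot answer, which is why the figure is described as optimistic.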
Under the hood
- Corpus: recent arXiv cs.CL papers (NLP / LLM research), ~10k papers indexed full-text.
- Retrieval: Postgres + pgvector (dense) and BM25 (lexical), fused with Reciprocal Rank Fusion, then LLM reranking.
- Models: embeddings via BGE-M3 (1024-dim, multilingual). The corpus is embedded offline on a local GPU; queries are embedded at runtime through OpenRouter. Generation and reranking use DeepSeek V3.2, also via OpenRouter.
- Stack: FastAPI backend, Next.js frontend, deployed with Docker behind Traefik.
Running live queries
This page is public, but running a live query needs an access code, which keeps the public demo's cost bounded. Contact me to get an access code, then enter it on the home page and ask away.