MR

NLP Retrieval & RAG

An information-retrieval and retrieval-augmented question-answering system on Natural Questions.

View on GitHub
Problem
How well can a system answer natural-language questions over a document collection, and what does re-ranking add?
Built
TF-IDF, dense and hybrid retrieval baselines with FAISS, cross-encoder re-ranking and Flan-T5 generation.
Models / methods
TF-IDF, dense embeddings, a cross-encoder re-ranker, and Flan-T5.
Result
The dense + cross-encoder stack improved candidate quality over the TF-IDF baseline; grounding reduced unsupported answers.
Strength shown
IR fundamentals, retrieval evaluation, end-to-end RAG.
Links
GitHubCase Study
RAG pipeline — indexing, retrieval, re-ranking & grounded generation
RAG pipeline — indexing, retrieval, re-ranking & grounded generation
Retrieved documents & citations — gold passage recovered at rank 1
Retrieved documents & citations — gold passage recovered at rank 1
Query-to-passage similarity heatmap for a sample question
Query-to-passage similarity heatmap for a sample question

Charts and diagrams are real outputs and architecture from the project.

01Objective

Answer natural-language questions over a document collection, comparing retrieval strategies and measuring how much re-ranking and generation add.

02Dataset / input

The Natural Questions benchmark — real user questions paired with Wikipedia passages — used for both retrieval and answer evaluation.

03Model approach

  • TF-IDF, dense, and hybrid retrieval baselines
  • FAISS for efficient nearest-neighbour search over embeddings
  • A cross-encoder to re-rank the top candidates
  • Flan-T5 to generate grounded answers from retrieved context

04Results / metrics

Retrieval compared with standard IR metrics (recall@k, MRR); the dense + cross-encoder stack improved candidate quality over the TF-IDF baseline, and grounding generation in retrieved passages reduced unsupported answers. The similarity heatmap shows the top retrieved passages for a sample query.

05Deployment / reproducibility

Delivered as reproducible notebooks and scripts that run the full pipeline from indexing through evaluation.

06Limitations

  • Compute-bound to a subset of the full benchmark
  • Generation quality bounded by a small, open generator model

07Future improvements

  • Larger instruction-tuned generators and query rewriting
  • Answer-faithfulness scoring and citation checking

08Key takeaway

Demonstrates the retrieval fundamentals behind modern RAG — baselines, dense search, re-ranking and grounded generation — measured properly.