Case Study

NLP Retrieval & RAG

An information-retrieval and retrieval-augmented question-answering system on Natural Questions.

Python
FAISS
Transformers
Flan-T5
NLP

GitHub available
Retrieval demo
Re-ranking
Reproducible notebook

View on GitHub

At a glance

Problem: How well can a system answer natural-language questions over a document collection, and what does re-ranking add?
Built: TF-IDF, dense and hybrid retrieval baselines with FAISS, cross-encoder re-ranking and Flan-T5 generation.
Models / methods: TF-IDF, dense embeddings, a cross-encoder re-ranker, and Flan-T5.
Result: The dense + cross-encoder stack improved candidate quality over the TF-IDF baseline; grounding reduced unsupported answers.
Strength shown: IR fundamentals, retrieval evaluation, end-to-end RAG.
Links: GitHubCase Study

Visual proof

RAG pipeline — indexing, retrieval, re-ranking & grounded generation

Retrieved documents & citations — gold passage recovered at rank 1

Query-to-passage similarity heatmap for a sample question

Charts and diagrams are real outputs and architecture from the project.

01Objective

Answer natural-language questions over a document collection, comparing retrieval strategies and measuring how much re-ranking and generation add.

02Dataset / input

The Natural Questions benchmark — real user questions paired with Wikipedia passages — used for both retrieval and answer evaluation.

03Model approach

TF-IDF, dense, and hybrid retrieval baselines
FAISS for efficient nearest-neighbour search over embeddings
A cross-encoder to re-rank the top candidates
Flan-T5 to generate grounded answers from retrieved context

04Results / metrics

Retrieval compared with standard IR metrics (recall@k, MRR); the dense + cross-encoder stack improved candidate quality over the TF-IDF baseline, and grounding generation in retrieved passages reduced unsupported answers. The similarity heatmap shows the top retrieved passages for a sample query.

05Deployment / reproducibility

Delivered as reproducible notebooks and scripts that run the full pipeline from indexing through evaluation.

06Limitations

Compute-bound to a subset of the full benchmark
Generation quality bounded by a small, open generator model

07Future improvements

Larger instruction-tuned generators and query rewriting
Answer-faithfulness scoring and citation checking

08Key takeaway

Demonstrates the retrieval fundamentals behind modern RAG — baselines, dense search, re-ranking and grounded generation — measured properly.

Back to all projects