Building a Q&A System for Biopharma Deals with RAG

Published 2025-10-11 • Updated 2025-10-31

Financial analysts and consultants in the biopharma space often need to find specific, nuanced information buried within press releases and deal announcements. Answering questions like “What were the CVR milestones in the Roche-89bio deal?” or “Which financial advisors were most active last quarter?” can involve hours of manual searching. After working on a broader data pipeline, I wanted to explore a more focused solution for this kind of expert Q&A.

This project demonstrates a retrieval-augmented generation (RAG) system built to answer natural-language questions over a dataset of ~265 biopharma deal articles. The goal was to move beyond simple keyword search and build a tool that could provide concise, accurate answers grounded in the source text, complete with citations to build user trust.

The RAG Pipeline

The system follows the standard RAG pattern of indexing, retrieval, and generation. I used LangChain to orchestrate the components, with a FAISS vector store for efficient retrieval.

  1. Indexing: The first step is to process the source articles into a searchable index. The src/index_json.py script handles this. For each deal, it constructs a header with key metadata (deal title, date, companies, value, etc.) and prepends it to the full article text. This combined text is then split into smaller, overlapping chunks with a RecursiveCharacterTextSplitter, and an OpenAI embedding model turns each chunk into a vector, which is stored in a local FAISS index. (A sketch of this step follows the list.)

  2. Retrieval & Generation: When a user asks a question, the system embeds the query and uses FAISS to retrieve the k=4 most semantically similar chunks from the index. These chunks, along with the original question, are inserted into a prompt and sent to an LLM (gpt-4o-mini), which composes the final answer. (This step is also sketched below.)
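To make the indexing step concrete, here is a minimal sketch of what src/index_json.py does. The input file name, the deal field names, and the chunk sizes are illustrative assumptions rather than the repo's exact values, and the imports assume the current split LangChain packages (langchain-openai, langchain-community, langchain-text-splitters):

```python
# Minimal indexing sketch; field names, file paths, and chunk sizes are assumptions.
import json

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

with open("deals.json") as f:  # hypothetical input file of deal records
    deals = json.load(f)

texts = []
for deal in deals:
    # Prepend a metadata header so key deal facts travel with every chunk.
    header = (
        f"Deal: {deal['title']}\n"
        f"Date: {deal['date']}\n"
        f"Companies: {deal['companies']}\n"
        f"Value: {deal['value']}\n\n"
    )
    texts.append(header + deal["article_text"])

# Split into overlapping chunks so facts near boundaries aren't lost.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.create_documents(texts)

# Embed every chunk and persist a local FAISS index.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())
index.save_local("faiss_index")
```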
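The retrieval-and-generation step is similarly short. The prompt here is deliberately abbreviated; the full set of prompt constraints is discussed in the next section:

```python
# Retrieval + generation sketch built on the index saved above.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

index = FAISS.load_local(
    "faiss_index", OpenAIEmbeddings(), allow_dangerous_deserialization=True
)

def answer(question: str) -> str:
    # Embed the query and pull the k=4 most similar chunks.
    docs = index.similarity_search(question, k=4)
    context = "\n\n".join(d.page_content for d in docs)
    # Abbreviated prompt; the real one carries more constraints.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ChatOpenAI(model="gpt-4o-mini").invoke(prompt).content
```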

Key Challenges & Reflections

A fun challenge in this project was the prompt engineering. Getting reliable, well-formatted output from an LLM requires highly specific instructions. The final prompt directs the model to use only the provided context, answer in concise bullets, and append a citation after every claim it makes. Crucially, it’s instructed to respond with “I don’t know” if the context is insufficient, which is the main guardrail against hallucination and goes a long way toward earning user trust in the system’s answers.
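For illustration, a template along these lines captures those constraints. This is a paraphrase of the instructions described above, not the project’s exact prompt:

```python
# Illustrative paraphrase of the system prompt, not the project's exact wording.
PROMPT_TEMPLATE = """\
You are an analyst assistant for biopharma deal research.
Answer the question using ONLY the context below.
- Respond in concise bullet points.
- After every claim, append a citation to the source article, e.g. [deal title, date].
- If the context does not contain the answer, reply exactly: "I don't know."

Context:
{context}

Question: {question}
"""
```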

This project also reinforced for me how powerful RAG is as a pattern for enterprise applications. By grounding the LLM in a specific, up-to-date, and proprietary knowledge base, you can create a highly valuable assistant for domain experts. The demo notebook shows its capability on a range of realistic queries, from comparing CVR structures to identifying frequently mentioned legal advisors.

While this project is a successful demo, turning it into a production tool would involve a few next steps. First would be automating the data ingestion to keep the index current, then building a simple Streamlit or Gradio UI for easier access, and finally adding filters for date ranges or therapeutic areas.
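The UI step in particular would be little more than glue. As a hypothetical sketch, a Streamlit wrapper around the answer() function from the earlier retrieval sketch could look like this:

```python
# Hypothetical Streamlit front end around the RAG pipeline.
import streamlit as st

from qa import answer  # the answer() function sketched earlier (hypothetical module name)

st.title("Biopharma Deal Q&A")
question = st.text_input("Ask a question about the deal dataset:")
if question:
    with st.spinner("Retrieving and generating..."):
        st.markdown(answer(question))
```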

Conclusion

I learned a lot building this focused RAG application. It’s a practical and effective architecture for creating trustworthy, domain-specific Q&A systems. The ability to deliver not just an answer, but also the specific sources backing it up, is a crucial feature for any tool intended for expert users in a field like biopharma finance.