BioPharma Data Pipeline

Published 2025-02-01 • Updated 2025-11-11

During my internship at CTIC, I noticed a consistent pain point in the biopharma investment/startup space: consultants and investors spend hours searching for particular assets and asset types across a variety of sources. So I built a single pipeline to surface the relevant information automatically.

Pipeline snapshot

  1. Data sources
    • ClinicalTrials.gov
    • SEC 10-Ks
    • Deal reporting platforms
    • Exhibitions, showcases, and conferences
    • DuckDuckGo Search (DDGS) for company information
  2. ETL & NLP
    • ETL: BeautifulSoup, Selenium, gpt-4o for page summaries/blurbs
    • NLP: SentenceTransformers/SBERT, rank-bm25
    • Hybrid semantic + lexical search with top-k reranking using a cross-encoder (sketched after this list)
  3. S3 JSON Document Store
    • ETL runs are merged with previous runs into a single object
    • The SBERT embedding matrix is recomputed on every run and also stored on S3 (see the storage sketch below)
  4. Render deployment
    • Flask app with a Celery background worker (see the worker sketch below)
    • Pages for login, search, task progress, and result receipt
    • Results delivered as a downloadable Excel sheet
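To make the search step concrete, here is a minimal sketch of the hybrid retrieval plus cross-encoder reranking. The model names, the toy corpus, and the 50/50 score blend are illustrative assumptions rather than the pipeline's exact configuration.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

# Hypothetical corpus of asset/company blurbs produced by the ETL step.
docs = [
    "Phase 2 trial of an oral GLP-1 agonist for type 2 diabetes",
    "Preclinical antibody-drug conjugate targeting HER2-low breast cancer",
    "Series B biotech developing mRNA vaccines for solid tumors",
]

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                  # semantic retriever
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # reranker

doc_embeddings = bi_encoder.encode(docs, normalize_embeddings=True)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid_search(query: str, top_k: int = 10, alpha: float = 0.5):
    """Blend normalized semantic and lexical scores, then rerank the top k."""
    q_emb = bi_encoder.encode(query, normalize_embeddings=True)
    sem_scores = doc_embeddings @ q_emb                       # cosine similarity
    lex_scores = np.array(bm25.get_scores(query.lower().split()))
    if lex_scores.max() > 0:
        lex_scores = lex_scores / lex_scores.max()            # scale BM25 to [0, 1]
    combined = alpha * sem_scores + (1 - alpha) * lex_scores
    candidates = np.argsort(combined)[::-1][:top_k]
    # The cross-encoder scores each (query, document) pair jointly.
    rerank = cross_encoder.predict([(query, docs[i]) for i in candidates])
    order = np.argsort(rerank)[::-1]
    return [(docs[candidates[i]], float(rerank[i])) for i in order]

print(hybrid_search("HER2-targeted oncology asset", top_k=3))
```

The idea is that the bi-encoder and BM25 each rank the whole corpus cheaply, while the much slower cross-encoder only ever sees the shortlist.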
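The document store is deliberately simple: each run's records (keyed by ID) are overlaid onto the previously merged JSON object, and the embedding matrix is rebuilt over the full corpus. A rough sketch, assuming boto3 and placeholder bucket, key, and field names:

```python
import io
import json

import boto3
import numpy as np
from sentence_transformers import SentenceTransformer

s3 = boto3.client("s3")
BUCKET = "biopharma-pipeline"   # placeholder bucket name

def merge_and_store(new_records: dict) -> None:
    """Merge one ETL run into the master document and refresh embeddings."""
    try:
        body = s3.get_object(Bucket=BUCKET, Key="documents.json")["Body"].read()
        documents = json.loads(body)
    except s3.exceptions.NoSuchKey:
        documents = {}              # first run: nothing to merge with
    documents.update(new_records)   # new records overwrite stale ones by ID

    s3.put_object(Bucket=BUCKET, Key="documents.json",
                  Body=json.dumps(documents).encode("utf-8"))

    # Recompute the SBERT embedding matrix over the merged corpus.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([rec["blurb"] for rec in documents.values()],
                              normalize_embeddings=True)
    buffer = io.BytesIO()
    np.save(buffer, embeddings)
    s3.put_object(Bucket=BUCKET, Key="embeddings.npy", Body=buffer.getvalue())
```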
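The Flask/Celery split follows the standard pattern: the web process enqueues a search task and returns a task ID immediately, the worker runs the actual search, and the progress page polls the task's state. A stripped-down sketch with illustrative route and task names (on Render, the Redis broker URL would come from the environment rather than being hard-coded):

```python
from celery import Celery
from flask import Flask, jsonify, request

app = Flask(__name__)
celery = Celery(__name__,
                broker="redis://localhost:6379/0",
                backend="redis://localhost:6379/0")

@celery.task
def run_search(query: str) -> dict:
    # Stand-in for the hybrid search over the S3-backed corpus.
    return {"query": query, "results": []}

@app.route("/search", methods=["POST"])
def search():
    # Enqueue the heavy work and hand the task ID back to the browser.
    task = run_search.delay(request.json["query"])
    return jsonify({"task_id": task.id}), 202

@app.route("/status/<task_id>")
def status(task_id):
    # The task-progress page polls this endpoint until the worker finishes.
    result = celery.AsyncResult(task_id)
    return jsonify({"state": result.state,
                    "result": result.result if result.ready() else None})
```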

Key challenges

Looking back, the primary challenge was the sheer size of the task. There were so many sources to parse through, each needing a tailored approach, and so much I hadn't done before (semantic search, Render deployment, managing a background worker) that just getting something working took real effort.

I also over-built before checking in with the coworkers who would actually use the product. Although I enjoyed developing this and learning so many new things, I could have focused more on building what was actually needed rather than what I thought would be helpful. Next time, I'd keep the end users more involved throughout the process.

This was also one of the first projects I developed with the intention of sharing the codebase with someone else, which pushed me to be more thorough about documentation, clear code, and best practices in general. I found that this actually helped me think through each problem more intentionally, since I knew I would be sharing my approach with others.

Brief reflections

A fun challenge was getting company websites from a list of companies. I wanted to gather blurbs about each company for semantic searching down the line, but I generally don’t trust current LLMs’ web-searching capabilities (in my experience they’re particularly prone to hallucinations), especially for collecting information on lesser-known companies. My solution was to get the home and “about” pages for each company, and ask an LLM to use just those pages’ information to create the blurbs.
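Sketched roughly, the blurb step looks like this. I'm using the OpenAI Python client here; the prompt wording, model choice, and truncation limit are stand-ins rather than the pipeline's exact values.

```python
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def page_text(url: str) -> str:
    """Fetch a page and strip it down to its visible text."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

def company_blurb(home_url: str, about_url: str) -> str:
    """Summarize a company using only its home and 'about' pages."""
    context = page_text(home_url) + "\n\n" + page_text(about_url)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": ("Using ONLY the page text below, write a two-sentence "
                        "blurb describing what this company does. If the text "
                        "is insufficient, say so.\n\n" + context[:15000]),
        }],
    )
    return response.choices[0].message.content
```

Grounding the model in scraped text rather than letting it search on its own keeps the blurbs tied to what the company actually says about itself.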

I used DuckDuckGo Search (DDGS), DDG’s web-searching API, to find candidate websites, then filtered the results by fuzzy-matching against the company name and dropping links that looked like news or Wikipedia articles. I then appended common “about” URL paths (“/about-us”, “/company”, “/our-story”, etc.) and kept those that returned a successful status code. Surprisingly, this approach worked more than 90% of the time.
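A condensed sketch of that lookup, assuming the duckduckgo_search package; the fuzzy-match threshold, blocked domains, and “about” paths here are illustrative:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

import requests
from duckduckgo_search import DDGS

BLOCKED = ("wikipedia.org", "linkedin.com", "bloomberg.com", "crunchbase.com")
ABOUT_PATHS = ("/about", "/about-us", "/company", "/our-story")

def find_company_site(name: str) -> str | None:
    """Search DDG for the company and return the first plausible homepage."""
    for hit in DDGS().text(f"{name} official website", max_results=10):
        domain = urlparse(hit["href"]).netloc.lower()
        if any(blocked in domain for blocked in BLOCKED):
            continue  # skip news/encyclopedia/aggregator results
        # Fuzzy-match the company name against the root of the domain.
        root = domain.replace("www.", "").split(".")[0]
        score = SequenceMatcher(None, name.lower().replace(" ", ""), root).ratio()
        if score > 0.6:
            return f"https://{domain}"
    return None

def find_about_page(base_url: str) -> str | None:
    """Probe common 'about' paths and return the first one that resolves."""
    for path in ABOUT_PATHS:
        try:
            resp = requests.get(base_url.rstrip("/") + path, timeout=10)
            if resp.ok:
                return resp.url
        except requests.RequestException:
            continue
    return None
```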

I also discovered and enjoyed learning about semantic search and text embeddings through this project, and I hope to create a dedicated write-up about them at some point. Combining semantic search with lexical search turned out well: it balances the more “conceptual” matching of embeddings against the exact matching that’s often needed when searching for specific drug or company names.

Conclusion

I learned a lot through this project. On the technical side: parsing HTML/JS sites, building a search engine, and developing an app that offloads heavy tasks (notably the search) to a background worker. On the relational side: managing my own expectations and clearly understanding users’ needs.