clinical_trial_data_extractor

Pipelines built on LangChain and Hugging Face Transformers for document localization in PubMed Central (PMC) Open Access literature.

Introduction

This is a set of small pipelines written on top of the LangChain library and Hugging Face Transformers for unsupervised document localization. Document localization is the problem of finding the exact section of a long article that answers a specific query or need.

Process

PMC Document Extraction: Given a PMC ID, we fetch the PMC Open Access article through the BioC REST API. We convert it into a simplified JSON format that keeps the section headers, titles, and paragraphs as given by PMC.
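The extraction step can be sketched roughly as follows. This is a minimal illustration, not the repository's own parser: the function names `fetch_bioc` and `simplify_bioc` and the output keys (`section`, `title`, `paragraphs`) are hypothetical, and the code assumes the BioC JSON layout of `documents` containing `passages`, each with an `infons` dictionary (`type`, `section_type`) and a `text` field.

```python
import json
from urllib.request import urlopen

# BioC endpoint for PMC Open Access articles; the ID is given with its "PMC" prefix.
BIOC_URL = (
    "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/"
    "pmcoa.cgi/BioC_json/{pmcid}/unicode"
)


def fetch_bioc(pmcid):
    """Download the BioC JSON for one article (requires network access)."""
    with urlopen(BIOC_URL.format(pmcid=pmcid), timeout=30) as response:
        return json.load(response)


def simplify_bioc(bioc):
    """Group passages into sections of the form {section, title, paragraphs}.

    Assumption: passages whose infons "type" is "front" or starts with
    "title" act as headers; every other non-empty passage is a paragraph
    belonging to the most recent header.
    """
    # The service may return a single collection or a list of collections.
    collections = bioc if isinstance(bioc, list) else [bioc]
    sections = []
    for collection in collections:
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                text = passage.get("text", "").strip()
                if not text:
                    continue
                infons = passage.get("infons", {})
                kind = infons.get("type", "")
                section = infons.get("section_type", "")
                if kind == "front" or kind.startswith("title"):
                    sections.append(
                        {"section": section, "title": text, "paragraphs": []}
                    )
                else:
                    if not sections:
                        # Paragraph seen before any header: open an untitled section.
                        sections.append(
                            {"section": section, "title": "", "paragraphs": []}
                        )
                    sections[-1]["paragraphs"].append(text)
    return sections
```

A call such as `simplify_bioc(fetch_bioc("PMC" + pmc_id))` would then give one list of sections per article, ready to be written out as JSON.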

Document Feature Generation/Vector Store Creation: We embed each paragraph of the article with a sentence-transformers model and store the embeddings in a FAISS index through the LangChain library. Calling similarity_search on that vector store then returns the paragraphs most relevant to a query.
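To show the idea behind the vector store without any dependencies, here is a small standard-library sketch. It is a stand-in, not the pipeline itself: a bag-of-words count vector replaces the sentence-transformers embedding, and a brute-force cosine ranking replaces the FAISS index. The class name `ParagraphIndex` is hypothetical; its `similarity_search` method only mirrors the name of the LangChain call mentioned above.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ParagraphIndex:
    """Brute-force index with one vector per paragraph (stand-in for FAISS)."""

    def __init__(self, paragraphs):
        self.paragraphs = list(paragraphs)
        self.vectors = [embed(p) for p in self.paragraphs]

    def similarity_search(self, query, k=4):
        """Return the k paragraphs most similar to the query, best first."""
        query_vector = embed(query)
        ranked = sorted(
            range(len(self.paragraphs)),
            key=lambda i: cosine(query_vector, self.vectors[i]),
            reverse=True,
        )
        return [self.paragraphs[i] for i in ranked[:k]]
```

With a real sentence encoder the ranking reflects meaning rather than shared words, which matters for queries phrased differently from the article text.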

The aim is to feed these relevant passages to a large language model such as GPT-3 to produce JSON data for very specific requirements. For example, in this extractor my target is to find the arms, procedures, and frequency of procedures for a specific clinical trial. The localization steps above identify the relevant sections of a very large document, which are then given to GPT to extract the relevant arms, procedures, and frequencies.
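One way this last step might look is sketched below. Everything here is an assumption made for illustration: the prompt wording, the JSON shape (`arms` containing `procedures` with a `frequency`), and the function names are hypothetical, and the model is passed in as a plain callable `llm(prompt) -> str` so that no particular client library is implied.

```python
import json

# Hypothetical target shape for the extracted data.
SCHEMA_HINT = (
    '{"arms": [{"name": "...", "procedures": '
    '[{"name": "...", "frequency": "..."}]}]}'
)


def build_prompt(paragraphs):
    """Combine the localized paragraphs with an instruction to answer in JSON."""
    context = "\n\n".join(paragraphs)
    return (
        "Using only the clinical trial text below, list the trial arms, "
        "the procedures in each arm, and how often each procedure occurs.\n"
        "Answer with JSON in this shape: " + SCHEMA_HINT + "\n\n"
        "Text:\n" + context
    )


def parse_response(raw):
    """Pull the outermost JSON object out of the model's reply and check it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    data = json.loads(raw[start:end + 1])
    if not isinstance(data.get("arms"), list):
        raise ValueError("response is missing an 'arms' list")
    return data


def extract_trial_design(paragraphs, llm):
    """Run one extraction: build the prompt, call the model, parse the reply."""
    return parse_response(llm(build_prompt(paragraphs)))
```

Keeping the model behind a callable makes the step easy to try with a stub before spending API calls, and the parser tolerates replies that wrap the JSON in extra prose.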


chilicrabcakes/clinical_trial_data_extractor

Folders and files

Latest commit

History

Repository files navigation

clinical_trial_data_extractor

Introduction

Process

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages