research paper pdf ——> video (ai image + text-to-speech)

aaliyah-davy/resx

Res-X

Res-X (Research Explanation) is a project that turns research papers into videos. It uses three types of machine learning models: optical character recognition (Tesseract and LaTeX OCR), text-to-speech, and text-to-image (Stable Diffusion).
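As a rough illustration of the OCR stage, here is a minimal sketch that renders PDF pages to images with pdf2image and reads them with pytesseract. The function names `clean_ocr_text` and `pdf_to_text`, the 300 dpi setting, and the hyphen-rejoining cleanup are assumptions made for this example and are not taken from the Res-X code. pdf2image needs Poppler installed, and pytesseract needs the Tesseract binary.

```python
from __future__ import annotations

import re


def clean_ocr_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    return re.sub(r"\s+", " ", joined).strip()


def pdf_to_text(pdf_path: str, dpi: int = 300) -> list[str]:
    """Return one cleaned text string per page of the PDF (hypothetical helper)."""
    # Imported lazily so the text helper above works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return [clean_ocr_text(pytesseract.image_to_string(page)) for page in pages]
```

Calling `pdf_to_text("paper.pdf")` (a placeholder filename) would give text that could then be passed to the text-to-speech and text-to-image stages. Formulas would still need a separate pass with LaTeX OCR.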

Usage

Upload a PDF document and Res-X will generate a video. Works with papers that have LaTeX-generated formulas. (URL input options coming soon.)

It is made with researchers and students in mind: they are expected to keep up with a constant stream of new papers but rarely have time to read and parse everything. It may be particularly helpful for papers from a field the user is unfamiliar with, or for people who are visual learners.

Unlike other platforms, Res-X is tailored to research papers: it breaks the input PDF into its component parts and works for papers with LaTeX-generated formulas (such as math, physics, and computer science papers).

As a published researcher and visual learner, I definitely find Res-X useful.

Example with my paper's abstract:

resx_vid.mp4

Fu C, Davy A, Holmes S, Sun S, Yadav V, et al. (2021) Dynamic genome plasticity during unisexual reproduction in the human fungal pathogen Cryptococcus deneoformans. PLOS Genetics 17(11): e1009935. https://doi.org/10.1371/journal.pgen.1009935

Requirements

  • Command-line tools: wkhtmltopdf, ffmpeg
  • Python packages: scipy, torch, coqui-ai TTS, moviepy, diffusers (Hugging Face), requests, imgkit, pdf2image, fake_useragent, cv2 (OpenCV), imutils, numpy, regex, PIL (Pillow), pandas, pytesseract, pix2tex (LaTeX OCR), IPython, matplotlib, fontTools
  • Python standard library (no install needed): os, io, sys, re, math, string, statistics, itertools
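Before running the project, it can help to check which of these requirements are missing. This is a small sketch using only the standard library. The import names (for example `TTS` for coqui-ai TTS, `cv2` for OpenCV, `PIL` for Pillow) are the usual ones for these packages but are assumptions here, and `missing_requirements` is a hypothetical helper, not part of Res-X.

```python
import importlib.util
import shutil

# Top-level import names assumed for the third-party packages listed above.
PYTHON_MODULES = [
    "scipy", "torch", "TTS", "moviepy", "diffusers", "requests", "imgkit",
    "pdf2image", "fake_useragent", "cv2", "imutils", "numpy", "regex", "PIL",
    "pandas", "pytesseract", "pix2tex", "IPython", "matplotlib", "fontTools",
]
SYSTEM_TOOLS = ["wkhtmltopdf", "ffmpeg"]


def missing_requirements(modules, tools):
    """Return (modules that cannot be imported, tools not found on PATH)."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    modules, tools = missing_requirements(PYTHON_MODULES, SYSTEM_TOOLS)
    print("Missing Python modules:", ", ".join(modules) or "none")
    print("Missing command-line tools:", ", ".join(tools) or "none")
```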

Reflection

Some of the major roadblocks I faced ultimately determined my approach:

  • I started with web scraping, but it only worked on some sites, so I limited input to PDFs.
  • PDF parsing libraries were inconsistent and often required perfectly formatted PDFs, so I convert each page to an image and parse it with OpenCV (cv2) contours.
  • cv2 contours helped identify figures in the page image but missed modern, small, and medium-sized tables, so acquiring table bounding boxes was time-intensive.
  • Text-to-image required a GPU (which I don't have), so I relied on Google Colab's free TPU for that segment.
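To make the contour-based parsing concrete, here is a hedged sketch of how figure candidates might be found in a page image. It is not the Res-X implementation: the helper names, the Otsu threshold, and the area cutoffs (2% to 90% of the page) are all assumptions chosen for illustration. The idea is that text glyphs form many tiny connected components, while a figure with axes or a frame forms one large one.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def keep_figure_like(boxes: List[Box], page_w: int, page_h: int,
                     min_area_frac: float = 0.02,
                     max_area_frac: float = 0.9) -> List[Box]:
    """Keep boxes whose area is a plausible fraction of the page, top to bottom."""
    page_area = page_w * page_h
    kept = [b for b in boxes
            if min_area_frac <= (b[2] * b[3]) / page_area <= max_area_frac]
    return sorted(kept, key=lambda b: (b[1], b[0]))


def find_figure_boxes(image_path: str) -> List[Box]:
    """Return candidate figure bounding boxes for one page image."""
    import cv2  # imported lazily so the filter above can be used without OpenCV

    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Invert so that ink is white on black, which is what findContours expects.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3 and OpenCV 4.
    contours = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    boxes = [tuple(cv2.boundingRect(c)) for c in contours]
    page_h, page_w = gray.shape
    return keep_figure_like(boxes, page_w, page_h)
```

A borderless table has no single large connected component, so a filter like this would skip it, which is consistent with the table problem described in the list above.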
