Scrape URIs from Telegram channel transcripts in PDF files

gwu-libraries/uriscrape

uriscrape

Scrapes URIs from Telegram channel transcripts in PDF files. URIs typically look like the following examples:

https://t.me/joinchat/AAAAAEOs3wFD4Mv6SN4hlQ

(tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ)

https://drive.google.com/open?id=0B_3xyna6XV4GMHNPU0VVWHZKRXc

https://archive.org/details/Rumiyah13UR_201709
(https://archive.org/details/Rumiyah13UR_201709)

(tg://search_hashtag?hashtag=%D8%A6%DB%95%D9%84%DA%BE%D8%A7%D9%8A%D8%A7%D8%AA)
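As a rough illustration of how URIs of these shapes could be pulled out of transcript text, here is a minimal sketch. The pattern and the `find_uris` helper are assumptions made for this example and are not taken from `uriscrape.py`, which may match URIs differently.

```python
import re

# Hypothetical pattern (not from uriscrape.py): matches http(s):// and tg:// URIs
# up to the next whitespace, parenthesis, angle bracket, or square bracket.
URI_PATTERN = re.compile(r"(?:https?|tg)://[^\s()<>\[\]]+")


def find_uris(text):
    """Return URIs in order of appearance, with trailing punctuation trimmed."""
    return [m.group(0).rstrip(".,;:!?'\"") for m in URI_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = (
        "Join https://t.me/joinchat/AAAAAEOs3wFD4Mv6SN4hlQ "
        "(tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ)."
    )
    for uri in find_uris(sample):
        print(uri)
```

Stopping at parentheses keeps the wrapped `(tg://...)` form clean, at the cost of truncating URLs that legitimately contain parentheses.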

Running the program

usage: `python uriscrape.py transcript`

positional arguments:
  transcript         filepath to transcript pdf or directory

optional arguments:
  None yet...
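The usage text above could correspond to an argument parser along these lines. This is a sketch only: `build_parser` and `collect_pdfs` are hypothetical names, and the non-recursive directory scan is an assumption about how a directory argument is handled.

```python
import argparse
from pathlib import Path


def build_parser():
    """Parser with the single positional argument shown in the usage text."""
    parser = argparse.ArgumentParser(prog="uriscrape.py")
    parser.add_argument("transcript", help="filepath to transcript pdf or directory")
    return parser


def collect_pdfs(path):
    """Return PDF paths for a single file or a directory (assumed non-recursive)."""
    p = Path(path)
    if p.is_dir():
        return sorted(f for f in p.iterdir() if f.suffix.lower() == ".pdf")
    return [p]


if __name__ == "__main__":
    args = build_parser().parse_args()
    for pdf in collect_pdfs(args.transcript):
        print(pdf)
```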

Output file

urls.xlsx - All URIs found, with the following columns:

  • File: PDF file processed
  • Access_Date: Date/time the program was run. Useful for documenting when the program attempted to resolve the URIs
  • Post_Date: Date of the post, as derived from the date labels in the Telegram transcript
  • URL: URL as found
  • Site_Reached: True/False - whether the URI could be resolved
  • Unshortened URL: Expanded form of a shortened URL (e.g. https://youtu.be/lqXwyl89xU4 unshortens to https://www.youtube.com/watch?v=lqXwyl89xU4&feature=youtu.be)
  • Status: Error code, if an error was encountered in trying to access the URI
  • Type: Classification of the link
  • Hashtag: Hashtag, if the link is a Telegram hashtag link
  • Channel: Channel, if the link is a Telegram join link
  • Account: Account, if the link is a Telegram account link
  • Domain: Full server domain (e.g. www.youtube.com)
  • Primary_Secondary: Just the primary and secondary portions of the domain (e.g. youtube.com)
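To make a few of the columns above concrete, the following sketch shows one way the Type, Hashtag, Channel, Domain, and Primary_Secondary values might be derived from a URL using only the standard library. The `describe_link` helper and the Type labels (`telegram_hashtag`, `telegram_join`, `other`) are invented for this example; the actual classification values used by the program are not documented here.

```python
from urllib.parse import parse_qs, urlparse


def describe_link(url):
    """Sketch: derive a few output columns from one URL (labels are hypothetical)."""
    parsed = urlparse(url)
    row = {"URL": url, "Type": "other", "Hashtag": None, "Channel": None,
           "Domain": None, "Primary_Secondary": None}

    if parsed.scheme == "tg":
        # For tg:// links the action (join, search_hashtag) sits where a host would be.
        query = parse_qs(parsed.query)
        if parsed.netloc == "search_hashtag":
            row["Type"] = "telegram_hashtag"
            row["Hashtag"] = query.get("hashtag", [None])[0]
        elif parsed.netloc == "join":
            row["Type"] = "telegram_join"
            row["Channel"] = query.get("invite", [None])[0]
        return row

    host = parsed.hostname or ""
    row["Domain"] = host
    # Naive: keeps the last two labels, so it is wrong for suffixes such as co.uk.
    row["Primary_Secondary"] = ".".join(host.split(".")[-2:])
    if host == "t.me" and parsed.path.startswith("/joinchat/"):
        row["Type"] = "telegram_join"
        row["Channel"] = parsed.path[len("/joinchat/"):]
    return row


if __name__ == "__main__":
    print(describe_link("https://www.youtube.com/watch?v=lqXwyl89xU4&feature=youtu.be"))
    print(describe_link("tg://join?invite=AAAAAEOs3wFD4Mv6SN4hlQ"))
```

A production version would likely need a public-suffix list to compute Primary_Secondary correctly for multi-part suffixes.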