Dependencies:
- Python 3.7
- BeautifulSoup 4.5.1
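If the parser dependency is not already installed, it can typically be fetched with pip (the package is published as beautifulsoup4; the version pin is optional):

$ pip install beautifulsoup4==4.5.1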
Usage:
$ python crawler_usage.py <start-url-string> <number-of-documents-to-crawl> <results-directory-path>
Note:
<start-url-string> : Starting URL, used as the root of the BFS tree of article document URLs.
<number-of-documents-to-crawl> : Number of article documents to crawl.
<results-directory-path> : Path to the results directory, without a trailing "/".
The crawled documents are stored in this directory, one file per document, named:
DOC_<ID>.txt
The format of the retrieved article files:
Name:
DOC_<ID>.txt
Content:
URL
TITLE
META-KEYWORDS
DATE
DOC ID
CONTENT
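
The exact layout inside each file is not specified beyond the field list above. As a rough sketch of how such a file could be consumed downstream, the snippet below assumes each of the first five fields occupies a single line, in the order listed, with CONTENT filling the rest of the file; the read_doc helper is hypothetical and only for illustration.

def read_doc(path):
    """Parse one crawled DOC_<ID>.txt file into a dict of its fields."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # Assumption: the first five lines hold the single-line fields, in this order.
    fields = ["URL", "TITLE", "META-KEYWORDS", "DATE", "DOC ID"]
    doc = dict(zip(fields, lines[:5]))
    # Everything after the header lines is treated as the article body.
    doc["CONTENT"] = "\n".join(lines[5:])
    return doc

if __name__ == "__main__":
    import sys
    doc = read_doc(sys.argv[1])
    print(doc["DOC ID"], doc["TITLE"], doc["URL"])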
Improvements:
- Add a depth limit to the BFS routine (see the sketch after this list).
- Add more documentation.
- Add more functionality, such as crawling specific types of content (e.g. music, crime, politics).
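As a minimal sketch of the first improvement, a depth-limited BFS crawl could look roughly like the following. The crawl_bfs name, the max_depth parameter, the same-host filter, and the use of urllib are illustrative assumptions and do not reflect the project's actual routine.

from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

def crawl_bfs(start_url, max_docs, max_depth):
    """Breadth-first crawl from start_url, stopping at max_docs pages or max_depth levels."""
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    pages = []
    while queue and len(pages) < max_docs:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read()
        except OSError:
            continue  # skip unreachable pages
        soup = BeautifulSoup(html, "html.parser")
        pages.append((url, soup))
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            # Stay on the same host and avoid revisiting URLs.
            if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages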
License: MIT