Scraping News Sites by Date and Article Level #4

Open
MelvinSninkle opened this issue Oct 20, 2024 · 2 comments

@MelvinSninkle
Collaborator

MelvinSninkle commented Oct 20, 2024

Title: Implement Scraping for Fox, CNN, and MSNBC at Article Level
Description: Develop a web scraping solution to extract headlines and the related article metadata (URL, author, publication date) from Fox News, CNN, and MSNBC. Data should be collected by date and at the article level.
Leads To: #6
Tasks:
1. Select a web scraping tool (e.g., BeautifulSoup, Scrapy, Puppeteer) for efficient extraction.
2. Set up pipelines to collect article URLs, headlines, authors, and publication dates.
3. Ensure data collection is logged by date and stored at the article level.
4. Implement error handling and retries for failed attempts (up to three retries).
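
As a rough illustration of task 4, here is a minimal Python sketch of "up to three retries" with exponential backoff. Nothing in it comes from the project: `fetch_with_retries`, the `fetch` callable, and the delay values are hypothetical, and the chosen scraping tool may already ship its own retry mechanism that would be preferable.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("scraper")


def fetch_with_retries(
    fetch: Callable[[], T],
    max_retries: int = 3,          # "up to three retries" from task 4
    base_delay: float = 1.0,       # assumed starting delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on failure retry up to max_retries times, then re-raise."""
    attempt = 0
    while True:
        try:
            return fetch()
        except Exception as exc:
            if attempt >= max_retries:
                logger.error("giving up after %d retries: %s", attempt, exc)
                raise
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s with the defaults
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt + 1, exc, delay)
            sleep(delay)
            attempt += 1
```

In practice `fetch` would wrap whatever call downloads a page, and the broad `except Exception` would probably be narrowed to network and HTTP errors so that programming mistakes are not retried.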

Acceptance Criteria:
• Scraping functions correctly for Fox News, CNN, and MSNBC.
• Extracted data includes URLs, headlines, authors, and publication dates.
• Data is stored in a structured format with clear logging.

Priority: High
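
To make "stored at the article level, by date, in a structured format with clear logging" more concrete, one possible shape is a record per article appended to a JSON Lines file per source and publication date. This is only a sketch: the `Article` fields mirror the acceptance criteria, but the class name, the `store_article` helper, and the `<root>/<source>/<date>.jsonl` layout are assumptions, not decisions made in this issue.

```python
import json
import logging
from dataclasses import asdict, dataclass
from datetime import date
from pathlib import Path
from typing import List

logger = logging.getLogger("scraper.storage")


@dataclass(frozen=True)
class Article:
    source: str          # e.g. "cnn" (hypothetical source key)
    url: str
    headline: str
    authors: List[str]
    published: date


def store_article(root: Path, article: Article) -> Path:
    """Append one article as a JSON line to <root>/<source>/<YYYY-MM-DD>.jsonl."""
    path = root / article.source / f"{article.published.isoformat()}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = asdict(article)
    record["published"] = article.published.isoformat()  # dates are not JSON-native
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    logger.info("stored %s article %s in %s", article.source, article.url, path)
    return path
```

A database table with the same columns would satisfy the criteria equally well; flat files are used here only because they need no setup.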

Labels: Backend, Scraping, Data Collection, MVP

@chalcolith

I could start taking a look at this. I think I would start by writing a technical spec. What tech stack would be appropriate? Most of my backend experience is with .NET in C#, so that would be my first impulse, but if there is a preferred stack I could get up to speed on that.

@MelvinSninkle
Collaborator Author

I do not have a stack preference for this piece. My only concern would be thinking about how we scale this. It seems like this should be a durable pipeline that will work even as we start to manipulate the data differently. Please challenge me on that if I seem to be wrong.
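
One way to read "durable" here, offered as a suggestion rather than a settled design, is to keep the fetch step and the parse step separate: save the raw HTML once, keyed by fetch date and URL, and let any later parser re-run over the saved pages without scraping again. The sketch below is in Python only for illustration, and the function names and directory layout are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable, List


def raw_path(root: Path, url: str, fetched_on: str) -> Path:
    """Map a URL and fetch date (YYYY-MM-DD) to a stable file location."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return root / fetched_on / f"{digest}.html"


def save_raw(root: Path, url: str, fetched_on: str, html: str) -> Path:
    """Persist the page exactly as fetched, before any parsing happens."""
    path = raw_path(root, url, fetched_on)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


def reparse(root: Path, fetched_on: str, parse: Callable[[str], dict]) -> List[dict]:
    """Run any parser over the pages saved for one date, without re-fetching."""
    pages = sorted((root / fetched_on).glob("*.html"))
    return [parse(page.read_text(encoding="utf-8")) for page in pages]
```

With this split, a change in how the data is manipulated only means passing a different `parse` function to `reparse`; the stored pages stay untouched.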
