Scraping News Sites by Date and Article Level #4

MelvinSninkle · 2024-10-20T23:01:24Z · edited by Melvillian

Title: Implement Scraping for Fox, CNN, and MSNBC at Article Level
Description: Develop a web scraping solution to extract article data (URLs, headlines, authors, and publication dates) from Fox News, CNN, and MSNBC. Data should be collected by date and at the article level.
Leads To: #6
Tasks:
1. Select a web scraping tool (e.g., BeautifulSoup, Scrapy, Puppeteer) for efficient extraction.
2. Set up pipelines to collect article URLs, headlines, authors, and publication dates.
3. Ensure data collection is logged by date and stored at the article level.
4. Implement error handling and retries for failed attempts (up to three retries).
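
A minimal sketch of task 4, assuming Python is the chosen stack (the thread has not settled on one). The name `fetch_with_retries`, the exponential backoff, and the one-second base delay are illustrative assumptions; only the limit of three retries comes from the task list.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("scraper")

MAX_RETRIES = 3  # from task 4: up to three retries after the first attempt


def fetch_with_retries(
    fetch: Callable[[], T],
    base_delay: float = 1.0,  # assumed starting delay, not specified in the issue
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on failure, retry up to MAX_RETRIES times with backoff."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == MAX_RETRIES:
                logger.error("giving up after %d retries: %s", MAX_RETRIES, exc)
                raise
            # Double the wait after each failure: 1s, 2s, 4s by default.
            delay = base_delay * 2 ** attempt
            logger.warning(
                "attempt %d failed (%s); retrying in %.1fs", attempt + 1, exc, delay
            )
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing the fetch step in as a callable keeps the retry policy independent of whichever scraping tool is picked in task 1, and the injectable `sleep` makes the backoff testable without real waiting.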

Acceptance Criteria:
• Scraping functions correctly for Fox News, CNN, and MSNBC.
• Extracted data includes URLs, headlines, authors, and publication dates.
• Data is stored in a structured format with clear logging.

Priority: High

Labels: Backend, Scraping, Data Collection, MVP
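
One possible reading of "stored at the article level, by date", sketched in Python with only the standard library. The `Article` fields mirror the acceptance criteria; the `source` field, the JSON Lines format, and the `<root>/<source>/<YYYY-MM-DD>.jsonl` layout are assumptions for illustration, not decisions made in this issue.

```python
import json
import logging
from dataclasses import asdict, dataclass
from datetime import date
from pathlib import Path
from typing import Iterable

logger = logging.getLogger("scraper.storage")


@dataclass(frozen=True)
class Article:
    """One scraped article; fields follow the acceptance criteria."""
    source: str          # hypothetical site key, e.g. "cnn"
    url: str
    headline: str
    authors: list[str]
    published: date


def store_articles(articles: Iterable[Article], root: Path) -> dict[Path, int]:
    """Append each article as one JSON line to <root>/<source>/<date>.jsonl."""
    written: dict[Path, int] = {}
    for article in articles:
        path = root / article.source / f"{article.published.isoformat()}.jsonl"
        path.parent.mkdir(parents=True, exist_ok=True)
        record = asdict(article)
        record["published"] = article.published.isoformat()  # dates are not JSON-native
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        written[path] = written.get(path, 0) + 1
    # Log one line per file so each run leaves a per-date record of what was stored.
    for path, count in written.items():
        logger.info("wrote %d article(s) to %s", count, path)
    return written
```

One file per source per day keeps each record at the article level while making a single date easy to re-scrape or reprocess later.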

chalcolith · 2024-10-22T15:22:37Z

I could start taking a look at this. I think I would start by writing a technical spec. What tech stack would be appropriate? Most of my backend experience is with .NET in C#, so that would be my first impulse, but if there is a more preferred stack I could get up to speed on that.

MelvinSninkle · 2024-10-23T08:26:00Z

I do not have a stack preference for this piece. My only concern would be thinking about how we scale this. It seems like this should be a durable pipeline that will work even as we start to manipulate the data differently. Please challenge me on that if I seem to be wrong.

Melvillian mentioned this issue Nov 6, 2024

LLM Data Ingestion into Editable Table #6

Open
