# topnews

Universal scraping platform for news and comments

## Usage

Create an INI file in the `targets` directory with parameters following the template below, then simply run `topnews`. You can add or modify projects without writing any code by editing these text templates; the rules are XPath-based.
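For illustration only, here is a minimal sketch (not the actual topnews implementation) of how such a target file could be discovered and loaded with Python's standard `configparser`. The file name `targets/habr.ini` is just an example matching the template below:

```python
import configparser
import glob

# Collect every project definition from the "targets" directory.
for path in glob.glob("targets/*.ini"):
    # Disable interpolation: values such as
    # "page%page_num%/" contain raw '%' characters that would
    # otherwise trip configparser's basic interpolation.
    config = configparser.ConfigParser(interpolation=None)
    config.read(path, encoding="utf-8")

    common = config["COMMON"]
    rules = config["RULES"]

    # Skip projects that are switched off in the template.
    if common.getint("enabled", fallback=0) != 1:
        continue

    print(f"{common['name']}: scanning up to "
          f"{rules['scan_page_depth_max']} pages")
```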

## Examples

This is an example template for a scraping project:

```ini
[COMMON]

name = habr
description = top of week
enabled = 1
demo = 1
proxy = -1

[RULES]

scan_page_depth_max = 50
scan_article_score_min = 0
scan_comment_score_min = 0

block_pages_page = https://habr.com/ru/top/weekly/page%page_num%/
block_page_record = s.xpath('.//article[@class="post post_preview"]')

block_record_title = s.xpath('.//h2[@class="post__title"]/a[@class="post__title_link"]/text()')[0]
block_record_link = s.xpath('.//h2[@class="post__title"]//a[@class="post__title_link"]')[0].get('href')
block_record_score = int(s.xpath('.//span[@class="post-stats__result-counter voting-wjt__counter_positive "]//text()')[0])
block_record_date = s.xpath('.//span[@class="post__time"]//text()')[0]
block_record_author = s.xpath('.//span[@class="user-info__nickname user-info__nickname_small"]//text()')[0]
block_record_text = s.xpath('.//div[@class="post__body post__body_full"]')[0]
block_record_comments = s.xpath('.//div[@id="comments"]//div[@class="comment"]')

block_comment_title = None
block_comment_text = s.xpath('.//div[@class="comment__message  "]//text()')[0]
block_comment_link = s.xpath('.//li[@class="inline-list__item inline-list__item_comment-nav"]//a[@class="icon_comment-anchor"]')[0].get('href')
block_comment_score = int(s.xpath('.//span[@class="voting-wjt__counter voting-wjt__counter_positive  js-score"]//text()')[0])
block_comment_date = s.xpath('.//time[@class="comment__date-time comment__date-time_published"]//text()')[0]
block_comment_author = s.xpath('.//span[@class="user-info__nickname user-info__nickname_small user-info__nickname_comment"]//text()')[0]
```

In this file you define the content blocks and the parsing settings:

- `name`: the name of your project
- `description`: a description of your project
- `enabled`: if set to 1 the project is run, otherwise it is skipped
- `demo`: if set to 1 only one element of each target structure is processed (useful for debugging and testing rules)
- `proxy`: proxy usage (-1 = no proxy, 0 = Tor proxy, 1 = random proxy)
- `scan_page_depth_max`, `scan_article_score_min`, `scan_comment_score_min`: limits for scraping a resource (page depth, minimum article score, minimum comment score)
- `block_*`: block definitions as XPath expressions, organized in the following hierarchy (see the sketch after this list):
  `block_pages > block_page_record > (block_record_title, ..., block_record_text) > block_record_comments > (block_comment_title, ..., block_comment_author)`
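The `block_*` rules in the template are plain XPath calls evaluated against an element `s`. Below is a rough, hypothetical sketch of how a couple of the record-level rules from the template could be applied to a fetched page; it assumes `requests` and `lxml`, and the function name `parse_records` is made up for illustration, not part of topnews:

```python
import requests
from lxml import html

def parse_records(page_url):
    """Illustrative only: how block_* rules map onto lxml XPath calls."""
    tree = html.fromstring(requests.get(page_url).content)

    # block_page_record: one element per article preview on the page.
    for s in tree.xpath('.//article[@class="post post_preview"]'):
        # block_record_title and block_record_link, evaluated per record.
        title = s.xpath('.//h2[@class="post__title"]/a[@class="post__title_link"]/text()')[0]
        link = s.xpath('.//h2[@class="post__title"]//a[@class="post__title_link"]')[0].get('href')
        yield {"title": title, "link": link}
```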

## Remarks

You can create such a file for each of your projects and scrape all of them simultaneously. The results are written to the `output` directory.
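As a rough sketch of what running all projects in parallel could look like if you drove them yourself (this is an assumption about usage, not part of topnews; `run_project` is a hypothetical stand-in for whatever processes a single target file):

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor

def run_project(ini_path):
    # Hypothetical helper: parse the INI file, scrape the target
    # and write the results into the "output" directory.
    name = os.path.splitext(os.path.basename(ini_path))[0]
    return f"output/{name}.json"

# Run every target file from "targets" in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(run_project, glob.glob("targets/*.ini")):
        print("written:", result)
```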