sinhala_scrapy_project

A web scraper built with Python and Scrapy.

Setup

Settings reference: https://docs.scrapy.org/en/latest/topics/settings.html

1. Install Python 3.
2. Install pip for Python 3.
3. Install the requirements by running: pip install -r requirement.txt
4. Set the configurations below in the scrapy_IR/settings.py file as required.

### Configure a delay for requests for the same website (default: 0)
DOWNLOAD_DELAY = 9

### The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 2
CONCURRENT_REQUESTS_PER_IP = 1

### Disable cookies (enabled by default)
COOKIES_ENABLED = False
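With DOWNLOAD_DELAY = 9, a crawl of many pages takes a while. The sketch below gives a rough estimate only. It assumes requests to the site go out one at a time and ignores download time and Scrapy's default randomization of the delay (RANDOMIZE_DOWNLOAD_DELAY). The function name is hypothetical and not part of this project.

```python
# Value taken from the settings above.
DOWNLOAD_DELAY = 9  # seconds between requests to the same site


def estimate_crawl_seconds(num_requests, delay=DOWNLOAD_DELAY):
    """Rough estimate of the waiting time for a crawl.

    Assumes requests are sent one at a time and spaced by `delay`.
    Download time and delay randomization are ignored.
    """
    if num_requests < 0:
        raise ValueError("num_requests must not be negative")
    # The first request goes out immediately; each later one waits `delay`.
    return max(num_requests - 1, 0) * delay


print(estimate_crawl_seconds(50))  # 49 gaps of 9 seconds -> 441
```

Lowering DOWNLOAD_DELAY shortens the crawl but puts more load on the target site.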

To send data directly to Elasticsearch, use the following settings.

### Configure ScrapyElasticSearch

ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500
}

ELASTICSEARCH_SERVERS = ['http://127.0.0.1:9200']
ELASTICSEARCH_INDEX = 'sinhala_songs'              # index name
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'
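Since ELASTICSEARCH_UNIQ_KEY is set to 'url', each scraped item presumably needs a non-empty url field, and items that share a url would likely end up as the same document. The following is a minimal sketch for checking a batch of items before indexing. The helper name and the "title" field are hypothetical and not taken from this project.

```python
UNIQ_KEY = "url"  # matches ELASTICSEARCH_UNIQ_KEY above


def find_key_problems(items, key=UNIQ_KEY):
    """Return (indexes of items missing the key, set of duplicated key values)."""
    missing = []
    seen = set()
    duplicates = set()
    for index, item in enumerate(items):
        value = item.get(key)
        if not value:
            # Key absent or empty: the item has no usable unique value.
            missing.append(index)
        elif value in seen:
            duplicates.add(value)
        else:
            seen.add(value)
    return missing, duplicates


# Hypothetical sample items, for illustration only.
sample = [
    {"url": "https://example.com/song/1", "title": "A"},
    {"url": "https://example.com/song/1", "title": "A (copy)"},
    {"title": "No url"},
]
missing, duplicates = find_key_problems(sample)
print(missing, duplicates)
```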

Modify the start_urls range as needed in scrapy_IR/spiders/lyric_scrape_new.py. (Recommended: always keep the page range to 1 page.)
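A minimal sketch of how a start_urls list can be derived from a page range. The URL pattern and the helper name are hypothetical; the actual site and pattern are defined in lyric_scrape_new.py.

```python
# Hypothetical URL pattern, for illustration only.
BASE_URL = "https://example.com/lyrics?page={page}"


def build_start_urls(first_page, last_page):
    """Build one URL per page for the inclusive range first_page..last_page."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return [BASE_URL.format(page=page) for page in range(first_page, last_page + 1)]


# Following the recommendation above: a range of a single page.
start_urls = build_start_urls(1, 1)
print(start_urls)
```

Keeping the range to one page per run limits the number of requests sent to the site at a time.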

Run

Go to the scrapy_IR folder and run the following command: scrapy crawl <scraper_name> -o <filename>.json
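The .json file written by the -o option contains a JSON array of scraped items. The sketch below shows one way to load it for a quick check. The helper name and the file name in the usage comment are hypothetical.

```python
import json


def load_items(path):
    """Load the scraped items from a JSON feed file and return them as a list."""
    with open(path, encoding="utf-8") as feed_file:
        items = json.load(feed_file)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


# Usage (hypothetical file name):
# items = load_items("songs.json")
# print(len(items), "items scraped")
```

json.load decodes both raw UTF-8 text and \uXXXX escapes, so Sinhala text reads back correctly either way.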
