A simple web scraper built with Scrapy that scrapes Sinhala song data and metadata. The scraped data is then sent to Elasticsearch. Some of the scraped fields may be empty or inconsistently organized because the website uses different HTML formats across pages.
- Install Python 3.x and pip for Python 3.
- Install the required Python packages by running the following command in the project home directory.
pip install -r requirements.txt
Modify sinhalasongs/settings.py with the following configuration as required.
# Configure a delay
DOWNLOAD_DELAY = 9
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 2
CONCURRENT_REQUESTS_PER_IP = 1
# Cookies Disabled
COOKIES_ENABLED = False
See the Scrapy settings documentation (https://docs.scrapy.org/en/latest/topics/settings.html) for additional configuration options.
- Documents are sent to Elasticsearch using the following configuration.
# Configure ScrapyElasticSearch
ELASTICSEARCH_SERVERS = ['http://127.0.0.1:9200']
ELASTICSEARCH_INDEX = 'sinhala_songs'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'
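
Setting the unique key to `url` suggests that each song page should map to a single Elasticsearch document. The sketch below illustrates one way a stable document ID could be derived from that field, so that re-crawling the same page overwrites the existing document instead of creating a duplicate. The `document_id` helper and the example URL are hypothetical, and hashing with SHA-1 is an assumption made for illustration; it is not necessarily what the pipeline in this project does.

```python
import hashlib


def document_id(item: dict, unique_key: str = "url") -> str:
    """Return a stable ID derived from the item's unique-key field.

    Hypothetical helper: SHA-1 is an assumption chosen for illustration.
    """
    value = item[unique_key]
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


# Two crawls of the same page yield the same ID even if other fields change.
first = document_id({"url": "https://example.com/song/1", "views": 10})
second = document_id({"url": "https://example.com/song/1", "views": 25})
print(first == second)
```

Because the ID depends only on the URL, changes to counters such as `views` or `shares` would update the existing document rather than add a new one.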
Go to the sinhalasongs folder and run the following command.
scrapy crawl ssbscraper -o <filename>.json
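
Since some scraped fields may be empty, it can help to inspect the exported file after a crawl. The following is a minimal sketch, assuming the `-o` output is a JSON array of objects; `load_records` and `count_empty_fields` are hypothetical helper names, and the sample records are made up for illustration.

```python
import json
from collections import Counter


def load_records(path):
    """Load the JSON array written by the crawl command."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_empty_fields(records):
    """Count, per field, how many records hold None, '' or an empty list."""
    empty = Counter()
    for record in records:
        for field, value in record.items():
            if value is None or value == "" or value == []:
                empty[field] += 1
    return empty


# In-memory sample (made-up data) standing in for load_records("<filename>.json").
sample = [
    {"songName": "A", "lyric": "", "genre": []},
    {"songName": "B", "lyric": "line1\nline2", "genre": ["Pop"]},
]
summary = count_empty_fields(sample)
print(dict(summary))
```

A high count for a field may indicate pages whose HTML layout differs from what the spider expects.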
The crawler scrapes the following fields.
- songName: Name of the song
- artist: All the relevant artists
- genre: Genres of the song
- lyric: Lyrics with newline characters
- views: Views from the original page for the song
- lyricWriter: Writers of the song
- musicDirector: Music directors of the song
- beat: Beat of the song
- url: URL of the original lyric page
- key: Key of the song
- shares: Shares from the original page for the song
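
The fields listed above can be treated as a fixed schema when post-processing the output. The sketch below is one possible way to normalize a scraped record so that every field is present and blank strings become `None`. The `FIELDS` tuple mirrors the list above, while the `normalize` helper and the sample values are hypothetical and not part of the project.

```python
# Field names taken from the list above.
FIELDS = (
    "songName", "artist", "genre", "lyric", "views", "lyricWriter",
    "musicDirector", "beat", "url", "key", "shares",
)


def normalize(record: dict) -> dict:
    """Return a copy with every expected field present.

    Missing fields and blank strings become None; other values are kept.
    """
    cleaned = {}
    for field in FIELDS:
        value = record.get(field)
        if isinstance(value, str):
            # strip() trims only the ends, so newlines inside lyrics survive.
            value = value.strip() or None
        cleaned[field] = value
    return cleaned


row = normalize({
    "songName": "  Example  ",
    "lyric": "first line\nsecond line",
    "beat": "",
})
print(row["songName"], row["beat"], row["artist"])
```

Filling absent fields with `None` keeps downstream code from having to guard against missing keys.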