sinhala-song-data-scraper

A simple web scraper built with Scrapy that scrapes Sinhala song data and metadata, then sends the scraped data to Elasticsearch. Some fields in the scraped data may be empty or poorly organized because the website uses different HTML formats across its pages.

Getting Started

Install Python 3.x and pip for Python 3.
Install the required Python packages by running the following command in the project home directory.

pip install -r requirements.txt

Modify sinhalasongs/settings.py with the following configuration as required.

# Configure a delay between requests to the same website (in seconds)
DOWNLOAD_DELAY = 9

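# The download delay setting will honor only one of the following;
# a non-zero CONCURRENT_REQUESTS_PER_IP takes precedence over CONCURRENT_REQUESTS_PER_DOMAIN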
CONCURRENT_REQUESTS_PER_DOMAIN = 2
CONCURRENT_REQUESTS_PER_IP = 1

# Cookies Disabled
COOKIES_ENABLED = False
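
To get a feel for what a delay of this size means in practice, the arithmetic can be sketched as below. This is only a rough lower bound: it assumes requests go out one at a time with the full delay between them, and it ignores response time and Scrapy's default randomization of the delay. The function name and the page count are hypothetical and not part of this project.

```python
def estimated_crawl_seconds(num_pages, download_delay=9.0):
    """Rough lower bound on crawl time for num_pages sequential requests.

    Assumption: one request at a time, each separated by download_delay
    seconds; network latency and delay randomization are ignored.
    """
    if num_pages <= 0:
        return 0.0
    # The first request is sent immediately; every later one waits for the delay.
    return (num_pages - 1) * download_delay


# Hypothetical example: 1000 pages at DOWNLOAD_DELAY = 9
seconds = estimated_crawl_seconds(1000)
print(f"{seconds / 3600:.1f} hours")
```

Lowering DOWNLOAD_DELAY shortens the crawl but puts more load on the target website.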

See the Scrapy settings documentation for additional configuration options (https://docs.scrapy.org/en/latest/topics/settings.html).

Documents are sent to Elasticsearch using the following configuration.

# Configure ScrapyElasticSearch

ELASTICSEARCH_SERVERS = ['http://127.0.0.1:9200']
ELASTICSEARCH_INDEX = 'sinhala_songs'
ELASTICSEARCH_TYPE = 'items'
ELASTICSEARCH_UNIQ_KEY = 'url'
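
The point of ELASTICSEARCH_UNIQ_KEY is that the same song page always maps to the same document, so re-running the crawl updates documents instead of duplicating them. The sketch below illustrates that idea only. It assumes the document ID is a SHA-1 hash of the unique key value, which is not confirmed by this repository, and the function name and example URLs are hypothetical.

```python
import hashlib


def document_id(item, unique_key="url"):
    """Derive a stable document ID from the item's unique key field.

    Assumption: the ID is the SHA-1 hex digest of the key value.
    The hashing scheme is illustrative, not taken from this project.
    """
    value = item[unique_key]
    # A field that holds several values is joined into one string first.
    if isinstance(value, list):
        value = "-".join(value)
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


# The same URL always yields the same ID, so a re-crawl overwrites the old document.
first = document_id({"url": "https://example.com/song/1"})
again = document_id({"url": "https://example.com/song/1"})
print(first == again)
```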

Starting the scraping

Go to the sinhalasongs folder and run the following command, replacing <filename> with the name of the output file.

scrapy crawl ssbscraper -o <filename>.json
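
With a .json output file, Scrapy writes all items as a single JSON array, so the result can be read back with the standard library. A minimal sketch, assuming the crawl finished and the file is valid JSON; the function name is hypothetical.

```python
import json
from pathlib import Path


def load_items(path):
    """Load the list of scraped items from a Scrapy JSON output file."""
    with Path(path).open(encoding="utf-8") as f:
        items = json.load(f)
    # A JSON feed export is expected to be one array of item objects.
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items
```

For example, load_items("<filename>.json") returns a list of dictionaries, one per scraped song.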

Scraped data

The crawler scrapes the following fields.

songName: Name of the song
artist: All the relevant artists
genre: Genres of the song
lyric: Lyrics with newline characters
views: Views from original page for the song
lyricWriter: Writers of the song
musicDirector: Music directors of the song
beat: Beat of the song
url: URL of the original lyric page
key: Key of the song
shares: Shares from original page for the song
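
Because some fields may come back empty, it can help to check each scraped item against the field list above. The sketch below is one way to do that. It assumes an item is a plain dictionary keyed by these field names and treats None, blank strings, and empty lists as empty; the actual value types are not specified here, and the helper names are hypothetical.

```python
# Field names as listed above.
FIELDS = [
    "songName", "artist", "genre", "lyric", "views", "lyricWriter",
    "musicDirector", "beat", "url", "key", "shares",
]


def is_empty(value):
    """Treat None, blank strings, and empty lists as empty values."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, list):
        return len(value) == 0
    return False


def missing_fields(item):
    """Return the expected field names that are absent or empty in item."""
    return [name for name in FIELDS if is_empty(item.get(name))]


# Hypothetical item with only a few fields filled in.
sample = {"songName": "Example", "artist": [], "views": 0,
          "url": "https://example.com/song/1"}
print(missing_fields(sample))
```

A numeric value such as a view count of 0 is counted as present, since only None, blank strings, and empty lists are treated as empty.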

maduranga95/sinhala-song-data-scraper
