Wiki_task

Task description:

Extract the company website URL from each company's Wikipedia page listed in wikipedia_links.csv.

Libraries used:

  • Pandas - data handling.
  • urllib - downloading web pages (see the download sketch after this list).
  • lxml - extracting data from the downloaded pages.
  • BeautifulSoup (bs4) - extracting data from the downloaded pages.
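
As a rough illustration of the download step, here is a minimal sketch using urllib from the standard library. The explicit User-Agent header is an assumption on my part (some servers reject Python's default one), not something taken from the repository's scripts:

    from urllib.request import Request, urlopen

    def download_page(url):
        """Fetch the raw HTML of one Wikipedia page."""
        # Assumed: set an explicit User-Agent, since some servers
        # reject Python's default one.
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request) as response:
            return response.read()

    html = download_page("https://en.wikipedia.org/wiki/Python_(programming_language)")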

The repository contains two scripts (script_bs4.py, script_lxml.py).

Only the find_link function differs between them: in one script the data extraction is done with the lxml library, in the other with BeautifulSoup. The task could also be performed with the Scrapy scraping framework, but Scrapy is better suited to large production projects.

The 'task' folder contains the task statement. The 'data' folder contains all the data. After a script runs, the output data is saved both in the main folder and in the 'data' folder (in my opinion, the structure of a working project should be logically separated).
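
Putting the pieces together, a minimal run might look like the sketch below. The input column name "link" and the output file name "output.csv" are assumptions, not taken from the repository:

    import pandas as pd

    # download_page and find_link_bs4 are the hypothetical helpers
    # sketched above.
    df = pd.read_csv("data/wikipedia_links.csv")  # assumed input location
    df["website"] = [find_link_bs4(download_page(url)) for url in df["link"]]

    # Save the output twice, matching the layout described above:
    # once in the main folder and once in the 'data' folder.
    df.to_csv("output.csv", index=False)
    df.to_csv("data/output.csv", index=False)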
