Crawler-for-IMDB-Movie-based-on-IDs-from-MovieLens

This is a crawler that fetches movie information from IMDb movie pages. The fetching process needs initial data from MovieLens, because the MovieLens dataset includes the IMDb ID of each movie.

Structure

The project is a Java web crawler built on the HttpClient library. For each fetch request, it takes a movie's IMDb ID from the MovieLens dataset and constructs the fetch URL from that ID. After retrieving the movie's HTML page from IMDb, I use regular expressions to match the values I need (for example, director, producer, actor, actress) and save them to the database. I use MySQL as the database.
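The fetch-and-match steps described above can be sketched roughly as follows. This is an illustration, not the code in this repository: the class name `ImdbPageSketch`, its method names, and the director pattern are all hypothetical, and the HTML fragment the pattern expects is an assumption, since IMDb's real markup changes over time. The sketch uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) as a stand-in for whichever HttpClient library the project depends on, and it leaves out the MySQL step.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: build an IMDb title URL from a MovieLens-supplied ID,
// fetch the page, and pull one field out of the HTML with a regular expression.
class ImdbPageSketch {

    // Assumed markup: "Director:</h4> <a ...>Name</a>". The real page layout may differ.
    private static final Pattern DIRECTOR =
            Pattern.compile("Director:</h4>\\s*<a[^>]*>([^<]+)</a>");

    // MovieLens may store the IMDb ID without leading zeros, so pad to at least 7 digits.
    static String buildUrl(String imdbId) {
        long numericId = Long.parseLong(imdbId.trim());
        return String.format("https://www.imdb.com/title/tt%07d/", numericId);
    }

    // Download the page body as a string (performs a real network request).
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Return the first director name found, or an empty Optional if the pattern does not match.
    static Optional<String> extractDirector(String html) {
        Matcher matcher = DIRECTOR.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1).trim()) : Optional.empty();
    }
}
```

A caller would chain the three steps, for example `ImdbPageSketch.extractDirector(ImdbPageSketch.fetch(ImdbPageSketch.buildUrl("114709")))`, and write the result to the database only when a value is present. Returning `Optional` keeps a missed match explicit, so a change in the page layout shows up as an empty result, not as a wrong value in the database.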

Future work

We can expand this crawler framework into a distributed system with multi-task crawlers.

lpeixin/Crawler-for-IMDB-Movie-based-on-IDs-from-MovieLens

Folders and files

Latest commit

History

Repository files navigation

Crawler-for-IMDB-Movie-based-on-IDs-from-MovieLens

Structure

Future work

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages