lpeixin/Crawler-for-IMDB-Movie-based-on-IDs-from-MovieLens

Crawler-for-IMDB-Movie-based-on-IDs-from-MovieLens

A crawler for fetching movie information from IMDb movie pages. The crawl needs initial data from MovieLens, because the MovieLens dataset contains the IMDb ID of each movie.

Structure

The project is a Java web crawler built on the HttpClient library. For each request, it takes a movie's IMDb ID from the MovieLens dataset and constructs the URL to fetch from that ID. After retrieving the movie's HTML page from IMDb, I use regular expressions to match the values I need (for example, director, producer, actor, actress) and save them to the database. I use MySQL as the database.
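
The URL construction and regular-expression matching described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name `ImdbPageSketch`, the method names, the URL layout (`https://www.imdb.com/title/tt` followed by a 7-digit zero-padded ID), and the `director` markup pattern are all assumptions. Real IMDb markup differs and changes over time, so the sample HTML is made up and no page is fetched.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImdbPageSketch {

    // Hypothetical pattern: assumes markup like <span class="director">Name</span>.
    // The real page structure must be inspected before writing a pattern.
    private static final Pattern DIRECTOR =
            Pattern.compile("<span class=\"director\">([^<]+)</span>");

    // Builds the movie page URL from a MovieLens IMDb ID value.
    // Assumption: IDs shorter than 7 digits are zero-padded on the left.
    static String buildUrl(String imdbId) {
        long id = Long.parseLong(imdbId.trim());
        return String.format("https://www.imdb.com/title/tt%07d/", id);
    }

    // Returns the first captured director name, or empty if the pattern does not match.
    static Optional<String> extractDirector(String html) {
        Matcher m = DIRECTOR.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("114709"));

        // Made-up HTML standing in for a fetched page.
        String sampleHtml = "<div><span class=\"director\"> Jane Doe </span></div>";
        System.out.println(extractDirector(sampleHtml).orElse("(not found)"));
    }
}
```

Returning an `Optional` makes the "field not found" case explicit, which matters because a regular expression silently stops matching when the page layout changes.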

Future work

We can extend this crawler framework into a distributed system with multi-task crawlers.
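
As a rough idea of what the multi-task part could look like inside a single process, here is a sketch that runs one crawl task per movie ID on a fixed-size thread pool. It is not part of the project: `MultiTaskSketch`, `crawlOne`, and `crawlAll` are hypothetical names, `crawlOne` is a placeholder for the real fetch, match, and save steps, and distributing work across machines is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MultiTaskSketch {

    // Placeholder for the per-movie work (fetch the page, match fields, save them).
    static String crawlOne(String imdbId) {
        return "crawled tt" + imdbId;
    }

    // Submits one task per ID and collects the results in input order.
    static List<String> crawlAll(List<String> imdbIds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String id : imdbIds) {
                futures.add(pool.submit(() -> crawlOne(id)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get()); // blocks until this task has finished
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting new tasks once this batch is done
        }
    }

    public static void main(String[] args) {
        System.out.println(crawlAll(List.of("0114709", "0113497"), 2));
    }
}
```

A fixed-size pool also caps the number of simultaneous requests, which helps keep the load on the target site bounded.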
