Skip to content

This is an implementation of the page rank using MapReduce

Notifications You must be signed in to change notification settings

Jess0-0/Google-Page-Rank

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Google PageRank Project

Basic Theory Behind It:

  • More important websites are likely to receive more links from other websites (Quantity Hypothesis)
  • Website with higher PageRank will pass higher weight (Quality Hypothesis)

To implement the quantity hypothesis, to represent the directivity between pages, we use transition matrix.

To implement the quality hypothesis, to represent the importance of each page, we use pagerank matrix, denoted by PR0, initialize it to be evenly distributed, having the same weight.

We then calculate PR1, what the pagerank matrix will be like after taking the directivity into consideration, by multiplying the transition matrix and PR0. After that, we keep updating the pagerank matrix according to the changing directivity. After about 30 to 40 iterations, the pagerank matrix will finally converge. (The weight of important pages will approach 1 while the weights of unimportant pages will approach 0)

The input data comes from data generated by existing searching algorithms such as rank by title or by frequency.

Edge Cases:

Dead Ends:

Matrix will become all 0

Spider Traps:

That node will approach 1 while others kept approaching 0

Solve by:

Introduced a beta value that represent the probability the user will close current page and start a new page. Use this to simulate human behavior to make the calculation more accurate.

About

This is an implementation of the page rank using MapReduce

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages