Planetary defense (PD) Web crawler

Most open source Web crawlers (e.g. Apache Nutch) support focused crawling by relying on a keyword or document list compiled by subject matter experts, together with relevance measures such as cosine similarity or a Naïve Bayes classifier. This work extends Nutch with a semi-supervised method for building the keyword list and with relevance scoring that considers both text content and hyperlink structure. It was developed for the Planetary Defense Framework Gateway project, a NASA-funded effort to build a cyberinfrastructure for scientific collaboration across organizations. Please refer to the slides here for more detail.
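
To make the idea of combining text content with hyperlink structure more concrete, here is a minimal Python sketch. It is not the code used in this project (which extends Nutch, a Java application), and the function names, the averaging of parent-page scores, and the `text_weight` parameter are all assumptions chosen for illustration. It scores a page by the cosine similarity between its text and a keyword list, then blends that with the mean relevance of the pages that link to it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page or empty keyword list has no similarity
    return dot / (norm_a * norm_b)


def relevance(page_text, keywords, parent_scores, text_weight=0.7):
    """Blend text similarity with the mean relevance of linking pages.

    text_weight is a hypothetical tuning knob, not a value from the project.
    parent_scores holds the relevance scores of pages that link to this one.
    """
    text_score = cosine_similarity(tokenize(page_text), tokenize(" ".join(keywords)))
    link_score = sum(parent_scores) / len(parent_scores) if parent_scores else 0.0
    return text_weight * text_score + (1 - text_weight) * link_score


if __name__ == "__main__":
    keywords = ["asteroid", "impact", "mitigation"]
    page = "Asteroid impact mitigation strategies for near-Earth objects."
    print(relevance(page, keywords, parent_scores=[0.8, 0.6]))
```

In a focused crawler, a score like this would typically be used to decide which outgoing links to fetch next; how the actual extension weighs text against link structure is described in the project slides rather than here.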

Apache Nutch

For the latest information about Nutch, please visit our website at:

http://nutch.apache.org

and our wiki, at:

http://wiki.apache.org/nutch/

To get started using Nutch, read the tutorial:

http://wiki.apache.org/nutch/NutchTutorial