    Repositories list

    • ClueWeb22

      Public
      Python
      11310 Updated Dec 11, 2024
    • Lucindri

      Public
      Indri search implementation on top of the Lucene search engine
      Java
      53436 Updated Mar 12, 2024
    • mturk

      Public
      Java
      1000 Updated Apr 14, 2023
    • Python
      0100 Updated Feb 2, 2021
    • Unifies all code written for processing ClueWeb12++
      Clojure
      0200 Updated Aug 20, 2013
    • nopol

      Public
      Simple command line tool to export the ClueWeb dataset as HTML files.
      Java
      5510 Updated Jul 21, 2013
    • Tools for processing our Nabble crawl
      Clojure
      1000 Updated May 13, 2013
    • Processing the Yahoo Groups crawl (part of ClueWeb12++)
      Clojure
      0000 Updated May 7, 2013
    • Download the Reddit dataset for ClueWeb12++
      Clojure
      0000 Updated May 4, 2013
    • scrapers

      Public
      Collection of scrapers I am writing
      Clojure
      0000 Updated Apr 11, 2013
    • kba-tools

      Public
      Tools for processing the TREC KBA dataset
      Java
      0000 Updated Mar 16, 2013
    • Command line utilities for working with WARC files
      Java
      0200 Updated Mar 13, 2013
    • A collection of tools for running the crawler, including scripts and scrapers to collect seeds.
      Python
      0000 Updated Feb 28, 2013
    • Extension to the ClueWeb12 dataset
      Python
      0300 Updated Feb 14, 2013
    • Bindings for the Blekko search engine API
      Python
      2000 Updated Feb 10, 2013
    • Scala
      0000 Updated Jan 31, 2013
    • OAuth implementation in Racket (PLT Scheme)
      Racket
      1300 Updated Jan 7, 2013
    • Configuration files and scripts needed to run the Heritrix jobs
      1100 Updated Dec 16, 2012
    • heritrix3

      Public
      Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
      Java
      762000 Updated Dec 8, 2012
    • Language identification module in Racket
      Racket
      0000 Updated Dec 4, 2012
    • Code to sample forums and check stats on pages and their information content
      Racket
      0000 Updated Nov 20, 2012
    • Code and other stuff for discussion-forum seeds.
      Python
      Other
      1110 Updated Oct 29, 2012
    • warc

      Public
      Python library for reading and writing WARC files
      Python
      GNU General Public License v2.0
      114100 Updated May 10, 2012