oscar-project/ungoliant

Ungoliant

🕷️ Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl. 🕷️

It is currently the generation pipeline for the OSCAR corpus, which is built from CommonCrawl. Ungoliant replaces goclassy.

Installation

Installing/Compiling the binary

  • Via cargo: cargo install ungoliant
  • Via git: cargo install --git https://github.com/oscar-corpus/ungoliant

Ungoliant has numerous dependencies, which are compiled during installation. You may also need cmake and gcc, because the project uses fasttext-rs.

KenLM feature

The KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are not correct.

To enable it, install KenLM requirements:

apt install -y libboost-all-dev libeigen3-dev

Then run cargo install ungoliant --features kenlm, or cargo b --features kenlm if you're building from source.

Getting a language identification file (for fastText):

By default, ungoliant expects the lid.176.bin model from Meta. Use curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin to get it.

You can also use any other model: just point to its path using ungoliant download --lid-path <path to lid>.

Other options include:

Usage

The usual way of generating corpora is:

  1. Fetch the wet.paths.gz file from the latest CommonCrawl dump and decompress it.
  2. Download the files using the download command.
  3. Generate the corpus using the pipeline command (it may take some time).
  4. Head over to oscar-tools for the packaging steps.
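
To make step 1 more concrete: the decompressed wet.paths file is a plain list of relative paths, one per line, and CommonCrawl serves each of them under https://data.commoncrawl.org/. The sketch below only illustrates that mapping. It is not how ungoliant's download command is implemented, and the function name wet_urls and the sample path are hypothetical.

```rust
/// Base URL under which CommonCrawl serves the files listed in `wet.paths`.
const CC_BASE_URL: &str = "https://data.commoncrawl.org/";

/// Turn the contents of a decompressed `wet.paths` file into full URLs.
/// (Hypothetical helper, for illustration only.)
fn wet_urls(paths: &str) -> Vec<String> {
    paths
        .lines()
        .map(str::trim)
        // Skip blank lines, e.g. a trailing newline at the end of the file.
        .filter(|line| !line.is_empty())
        .map(|line| format!("{}{}", CC_BASE_URL, line))
        .collect()
}

fn main() {
    // Placeholder path: real entries name an actual crawl, segment and file.
    let sample = "crawl-data/CC-MAIN-YYYY-WW/segments/SEGMENT/wet/FILE.warc.wet.gz\n";
    for url in wet_urls(sample) {
        println!("{}", url);
    }
}
```

In practice the download command takes care of fetching these files, so you should not need to build the URLs yourself.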

You can find more information in each command's --help output.

ungoliant 2
corpus generation tool.

USAGE:
    ungoliant <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.

Documentation

Ungoliant is not yet on docs.rs: use cargo doc --bins --open to build the documentation locally and open it.

Head over to the OSCAR Documentation for more information about the project.