veld chain veld_code__wikipedia_nlp_preprocessing

This repo contains code velds that encapsulate downloading and extracting Wikipedia dumps from the official source.

requirements

  • git
  • docker compose (note: older docker compose versions require running docker-compose instead of docker compose)

how to use

A code veld may be integrated into a chain veld, or used directly by adapting the configuration within its yaml file and using the template folders provided in this repo. Open the respective veld yaml file for more information.

Run a veld with:

docker compose -f <VELD_NAME>.yaml up
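
Since the requirements note that older installations need `docker-compose` instead of `docker compose`, a small wrapper can pick whichever is available. This is only a sketch: `detect_compose`, `run_veld`, and the `COMPOSE` and `DRY_RUN` variables are hypothetical names introduced here for illustration and are not part of this repo.

```shell
#!/bin/sh
# Print the Compose invocation available on this machine:
# "docker compose" (plugin) or "docker-compose" (older standalone binary).
detect_compose() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    echo "no docker compose installation found" >&2
    return 1
  fi
}

# Hypothetical helper: run one veld yaml file.
# COMPOSE overrides detection; DRY_RUN=1 only prints the command.
run_veld() {
  veld_file="$1"
  compose="${COMPOSE:-$(detect_compose)}" || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$compose -f $veld_file up"
  else
    # Left unquoted on purpose so "docker compose" splits into two words.
    $compose -f "$veld_file" up
  fi
}
```

For example, `DRY_RUN=1 run_veld veld_download_and_extract.yaml` prints the command that would be run without starting any container.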

contained code velds

./veld_download_and_extract.yaml

Downloads an entire Wikipedia dump and extracts it into JSON files.

docker compose -f veld_download_and_extract.yaml up

./veld_transform_wiki_json_to_txt.yaml

Transforms the Wikipedia JSON files into a single txt file.

docker compose -f veld_transform_wiki_json_to_txt.yaml up
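
The two velds can presumably be run back to back, on the assumption that the transform step reads the JSON files produced by the download step (check the volume paths in each yaml file to confirm). The sketch below runs them in that order; `run_pipeline`, `VELDS`, `COMPOSE`, and `DRY_RUN` are hypothetical names used only for this example.

```shell
#!/bin/sh
# The two veld files from this repo, in the assumed execution order.
VELDS="veld_download_and_extract.yaml veld_transform_wiki_json_to_txt.yaml"

# Hypothetical helper: run each veld in sequence and stop at the first failure.
# Set COMPOSE="docker-compose" for older installations.
# With DRY_RUN=1 the commands are printed instead of executed.
run_pipeline() {
  for veld in $VELDS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "${COMPOSE:-docker compose} -f $veld up"
    else
      # Unquoted expansion so "docker compose" splits into two words.
      ${COMPOSE:-docker compose} -f "$veld" up || return 1
    fi
  done
}
```

Running `DRY_RUN=1 run_pipeline` first is a cheap way to check the order before starting a full dump download, which can take a long time.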
