_A logical, reasonably standardized, but flexible tasks-as-containers project structure for doing, sharing, and deploying data science work._
Inspired by cookiecutter-data-science.
├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── root_dir <- The root filesystem mapping directory. It is designed
│ │ to be swapped for a remote filesystem to scale the
│ │ application up to real-world data.
│ ├── models <- Trained and serialized models, model predictions, or model summaries
│ │ │ Naming convention: <model-type>-<param-desc>-<train-dates-hash>-<feature-hash>
│ │ └── predictions <- Predictions on full test datasets. Naming convention is a model
│ │ description and a '-' delimited date suffix, e.g. '<model-id>-2016-01-01'
│ └── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── deploy <- deployment configurations
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ │ the creator's initials, and a short `-` delimited description, e.g.
│ │ `1.0-jqp-initial-data-exploration`.
│ ├── exploratory <- Excluded from version control; use for quick drafts.
│ └── experiments <- One folder per experiment, each with an executable
│ pipeline.py and evaluation.py script (see the sketch below the tree).
│
├── references <- Publications, Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
│
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python package
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── dataset.py
│ │
│ ├── features <- All kinds of feature files
│ │ └── features.csv
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ ├── tasks <- Long-running or recurring luigi tasks
│ │
│ ├── tests <- Integration and unit tests
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
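
Each folder under `notebooks/experiments` is meant to be runnable end to end. As a minimal sketch (the task name, parameter, and output path below are invented for illustration, not part of the template), an experiment's `pipeline.py` could be a small luigi entry point:

```python
# notebooks/experiments/<name>/pipeline.py -- hypothetical sketch.
import luigi


class RunExperiment(luigi.Task):
    """Run one experiment configuration end to end."""
    learning_rate = luigi.FloatParameter(default=0.01)

    def output(self):
        # Persist results under root_dir, following the model naming convention.
        return luigi.LocalTarget(
            'root_dir/models/gbm-lr{}-2016-01-01-abc123.txt'.format(self.learning_rate))

    def run(self):
        # A real experiment would train a model here; we just record the params.
        with self.output().open('w') as f:
            f.write('learning_rate={}\n'.format(self.learning_rate))


if __name__ == '__main__':
    luigi.run()
```

`evaluation.py` would follow the same pattern, depending on the pipeline task via `requires()`.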
- Python >= 3.5
- Cookiecutter Python package >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter
or
$ conda config --add channels conda-forge
$ conda install cookiecutter
Make sure you have docker and docker-compose installed and the docker daemon running.
# Instantiate template
cookiecutter https://github.com/datarevenue-berlin/project-template.git
# cd into project
cd <repo_name>
# Build/Pull images and run example task
docker-compose -f docker-compose.yml run controller Example
If you see output similar to this, everything worked and you are ready to start developing:
Alans-MBP:test_project kayibal$ docker-compose run controller Example
Starting test_project_dask-scheduler ... done
Starting test_project_luigid ... done
Starting test_project_dask-worker_1 ... done
2018-05-30 18:53:39 INFO luigi-interface Informed scheduler that task Example_False_847ab9f492 has status PENDING
2018-05-30 18:53:39 INFO luigi-interface Informed scheduler that task ClientUpload__99914b932b has status DONE
2018-05-30 18:53:39 INFO luigi-interface Done scheduling tasks
2018-05-30 18:53:39 INFO luigi-interface Running Worker with 1 processes
2018-05-30 18:53:39 INFO luigi-interface [pid 1] Worker Worker(salt=869995943, workers=1, host=06c48c4fe678, username=root, pid=1) running Example(no_remove_finished=False)
2018-05-30 18:53:39 INFO root /home/drtools/test_project/root_dir/data/raw/clickstream.csv
|CONTAINER| __main__ 2018-05-30 18:53:42 INFO Connected to scheduler
|CONTAINER| __main__ 2018-05-30 18:53:42 INFO Clickstream loaded:
|CONTAINER| id page_id
|CONTAINER| 0 A page-1
|CONTAINER| 1 B page-2
|CONTAINER| 2 A page-2
|CONTAINER| 3 B page-1
2018-05-30 18:53:43 INFO luigi-interface [pid 1] Worker Worker(salt=869995943, workers=1, host=06c48c4fe678, username=root, pid=1) done Example(no_remove_finished=False)
2018-05-30 18:53:43 INFO luigi-interface Informed scheduler that task Example_False_847ab9f492 has status DONE
2018-05-30 18:53:43 INFO luigi-interface Worker Worker(salt=869995943, workers=1, host=06c48c4fe678, username=root, pid=1) was stopped. Shutting down Keep-Alive thread
2018-05-30 18:53:43 INFO luigi-interface
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 1 present dependencies were encountered:
- 1 ClientUpload()
* 1 ran successfully:
- 1 Example(no_remove_finished=False)
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Any data generated by your tasks will be persisted to root_dir.
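
Because outputs are addressed through root_dir, swapping the local directory for a remote filesystem is mostly a matter of changing the target class. A hedged sketch using luigi's S3 support (the bucket and key are made up, and `luigi.contrib.s3` requires boto3):

```python
# Hypothetical sketch: writing a prediction file to S3 instead of the local root_dir.
import luigi
from luigi.contrib.s3 import S3Target  # requires boto3


class UploadPredictions(luigi.Task):
    def output(self):
        # Locally this would be luigi.LocalTarget('root_dir/models/predictions/...')
        return S3Target('s3://my-bucket/root_dir/models/predictions/gbm-2016-01-01.csv')

    def run(self):
        with self.output().open('w') as f:
            f.write('id,prediction\n0,0.5\n')
```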
Code runs inside Docker containers, each of which is treated as a logical task unit in a machine learning pipeline.
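
The startup logs above imply four services: the controller, the luigid scheduler, and a dask scheduler/worker pair. A rough docker-compose sketch of that layout (image names and mount paths are assumptions, not the template's actual file):

```yaml
# Illustrative only -- the template ships its own docker-compose.yml.
version: '3'
services:
  luigid:
    image: my-project          # hypothetical image name
    command: luigid
  dask-scheduler:
    image: my-project
    command: dask-scheduler
  dask-worker:
    image: my-project
    command: dask-worker dask-scheduler:8786
  controller:
    image: my-project
    volumes:
      - ./root_dir:/app/root_dir   # persist task outputs on the host
    depends_on:
      - luigid
      - dask-scheduler
      - dask-worker
```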
- Deploy on a docker + compose machine with a single command
- No host Python installation needed thanks to the controller container
- Fast and simple image building process
- Expressive and short commands which avoid boilerplate code
- Improved file handling with the FileStructure class
- Fully configurable dask + luigi + logging
- Out of the box luigi email notifications (see the config sketch after this list)
- Ready-to-push package with git version control and versioneer-based versioning
- Clean and minimal logs: no more double logging
- Code mapped into the containers to avoid image rebuilds on each code change
- Plug-and-play replaceable root data directory that scales easily to more (possibly remote) data
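
For the email notifications mentioned above, luigi reads its settings from luigi.cfg. A minimal sketch of the relevant sections (all values are placeholders to adapt to your mail setup):

```ini
[email]
method=smtp
receiver=alerts@example.com
sender=luigi@example.com

[smtp]
host=smtp.example.com
port=587
username=luigi
password=change-me
```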
- Jupyter notebook
- AWS templates
- ECS integration
Some tests might require you to have a docker daemon running locally.
py.test tests
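
As a self-contained sketch of what a unit test under `src/tests` could look like (the task and values are invented for illustration; this particular test does not need a docker daemon):

```python
import luigi


class Double(luigi.Task):
    """Toy task used only to demonstrate the test pattern."""
    value = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget('root_dir/data/interim/double-{}.txt'.format(self.value))

    def run(self):
        with self.output().open('w') as f:
            f.write(str(self.value * 2))


def test_double_runs_locally(tmp_path, monkeypatch):
    monkeypatch.chdir(tmp_path)  # keep outputs inside pytest's temp dir
    assert luigi.build([Double(value=21)], local_scheduler=True)
    with Double(value=21).output().open('r') as f:
        assert f.read() == '42'
```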