_A logical, reasonably standardized, but flexible tasks-as-containers project structure for doing, sharing, and deploying data science work._
Inspired by cookiecutter-data-science.
├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── root_dir <- The root filesystem mapping directory. It is designed
│ │ to be swapped for a remote filesystem to scale the
│ │ application up to real-world data.
│ ├── models <- Trained and serialized models, model predictions, or model summaries
│ │ │ Naming convention: <model-type>-<param-desc>-<train-dates-hash>-<feature-hash>
│ │ └── predictions <- Predictions on full test datasets. Naming convention is a model
│ │ description and a '-' delimited date suffix, e.g. '<model-id>-2016-01-01'
│ └── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── deploy <- deployment configurations
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ │ the creator's initials, and a short `-` delimited description, e.g.
│ │ `1.0-jqp-initial-data-exploration`.
│ ├── exploratory <- Excluded from version control; use for quick drafts.
│ └── experiments <- One folder per experiment, each with an executable
│ pipeline.py and evaluation.py script (see the sketch below the tree).
│
├── references <- Publications, Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
│
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python package
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── dataset.py
│ │
│ ├── features <- All kinds of feature files
│ │ └── features.csv
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ ├── tasks <- Long-running or recurring luigi tasks
│ │
│ ├── tests <- Integration and unit tests
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
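
Each folder under `notebooks/experiments` is meant to be runnable end to end. As a minimal sketch (the task name, parameter, and output path below are invented for illustration, not part of the template), an experiment's `pipeline.py` could be a small luigi entry point:

```python
# notebooks/experiments/<name>/pipeline.py -- hypothetical sketch.
import luigi


class RunExperiment(luigi.Task):
    """Run one experiment configuration end to end."""
    learning_rate = luigi.FloatParameter(default=0.01)

    def output(self):
        # Persist results under root_dir, following the model naming convention.
        return luigi.LocalTarget(
            'root_dir/models/gbm-lr{}-2016-01-01-abc123.txt'.format(self.learning_rate))

    def run(self):
        # A real experiment would train a model here; we just record the params.
        with self.output().open('w') as f:
            f.write('learning_rate={}\n'.format(self.learning_rate))


if __name__ == '__main__':
    luigi.run()
```

`evaluation.py` would follow the same pattern, depending on the pipeline task via `requires()`.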
- Python >= 3.5
- Cookiecutter Python package >= 1.4.0: This can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter
or
$ conda config --add channels conda-forge
$ conda install cookiecutter
Make sure you have docker and docker-compose installed and the docker daemon running.
# Instantiate template
cookiecutter https://github.com/datarevenue-berlin/project-template.git
# cd into project
cd <repo_name>
# Build/Pull images and run example task
docker-compose -f docker-compose.yml run controller Example
If you see output similar to this, everything worked and you are ready to start developing:
Alans-MBP:test_project kayibal$ docker-compose run controller Example
Starting test_project_dask-scheduler ... done
Starting test_project_luigid ... done
Starting test_project_dask-worker_1 ... done
2018-05-30 18:53:39 INFO luigi-interface Informed scheduler that task Example_False_847ab9f492 has status PENDING
2018-05-30 18:53:39 INFO luigi-interface Informed scheduler that task ClientUpload__99914b932b has status DONE
2018-05-30 18:53:39 INFO luigi-interface Done scheduling tasks
2018-05-30 18:53:39 INFO luigi-interface Running Worker with 1 processes
2018-05-30 18:53:39 INFO luigi-interface [pid 1] Worker Worker(salt=869995943, workers=1, host=06c48c4fe678, username=root, pid=1) running Example(no_remove_finished=False)
2018-05-30 18:53:39 INFO root /home/drtools/test_project/root_dir/data/raw/clickstream.csv
|CONTAINER| __main__ 2018-05-30 18:53:42 INFO Connected to scheduler
|CONTAINER| __main__ 2018-05-30 18:53:42 INFO Clickstream loaded:
|CONTAINER| id page_id
|CONTAINER| 0 A page-1
|CONTAINER| 1 B page-2
|CONTAINER| 2 A page-2
|CONTAINER| 3 B page-1
2018-05-30 18:53:43 INFO luigi-interface [pid 1] Worker Worker(salt=869995943, workers=1, host=06c48c4fe678, username=root, pid=1) done Example(no_remove_finished=False)
2018-05-30 18:53:43 INFO luigi-interface Informed scheduler that task Example_False_847ab9f492 has status DONE
2018-05-30 18:53:43 INFO luigi-interface Worker Worker(salt=869995943, workers=1, host=06c48c4fe678, username=root, pid=1) was stopped. Shutting down Keep-Alive thread
2018-05-30 18:53:43 INFO luigi-interface
===== Luigi Execution Summary =====
Scheduled 2 tasks of which:
* 1 present dependencies were encountered:
- 1 ClientUpload()
* 1 ran successfully:
- 1 Example(no_remove_finished=False)
This progress looks :) because there were no failed tasks or missing external dependencies
===== Luigi Execution Summary =====
Any data generated by your tasks will be persisted to root_dir.
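
Because outputs are addressed through root_dir, swapping the local directory for a remote filesystem is mostly a matter of changing the target class. A hedged sketch using luigi's S3 support (the bucket and key are made up, and `luigi.contrib.s3` requires boto3):

```python
# Hypothetical sketch: writing a prediction file to S3 instead of the local root_dir.
import luigi
from luigi.contrib.s3 import S3Target  # requires boto3


class UploadPredictions(luigi.Task):
    def output(self):
        # Locally this would be luigi.LocalTarget('root_dir/models/predictions/...')
        return S3Target('s3://my-bucket/root_dir/models/predictions/gbm-2016-01-01.csv')

    def run(self):
        with self.output().open('w') as f:
            f.write('id,prediction\n0,0.5\n')
```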
Code runs inside Docker containers, each of which is treated as a logical task unit in a machine learning pipeline.
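
The startup logs above imply four services: the controller, the luigid scheduler, and a dask scheduler/worker pair. A rough docker-compose sketch of that layout (image names and mount paths are assumptions, not the template's actual file):

```yaml
# Illustrative only -- the template ships its own docker-compose.yml.
version: '3'
services:
  luigid:
    image: my-project          # hypothetical image name
    command: luigid
  dask-scheduler:
    image: my-project
    command: dask-scheduler
  dask-worker:
    image: my-project
    command: dask-worker dask-scheduler:8786
  controller:
    image: my-project
    volumes:
      - ./root_dir:/app/root_dir   # persist task outputs on the host
    depends_on:
      - luigid
      - dask-scheduler
      - dask-worker
```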
- Deploy on a docker + compose machine with a single command
- No host Python installation needed thanks to the controller container
- Fast and simple image building process
- Expressive and short commands which avoid boilerplate code
- Improved file handling with the FileStructure class
- Fully configurable dask + luigi + logging
- Out of the box luigi email notifications (see the config sketch after this list)
- Ready-to-push package with git version control and versioneer-based versioning
- Clean and minimal logs: no more double logging
- Code mapped into the containers to avoid image rebuilds on each code change
- Plug-and-play replaceable root data directory that scales easily to more (possibly remote) data
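
For the email notifications mentioned above, luigi reads its settings from luigi.cfg. A minimal sketch of the relevant sections (all values are placeholders to adapt to your mail setup):

```ini
[email]
method=smtp
receiver=alerts@example.com
sender=luigi@example.com

[smtp]
host=smtp.example.com
port=587
username=luigi
password=change-me
```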
- Jupyter notebook
- AWS templates
- ECS integration
Some tests might require you to have a docker daemon running locally.
py.test tests
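
As a self-contained sketch of what a unit test under `src/tests` could look like (the task and values are invented for illustration; this particular test does not need a docker daemon):

```python
import luigi


class Double(luigi.Task):
    """Toy task used only to demonstrate the test pattern."""
    value = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget('root_dir/data/interim/double-{}.txt'.format(self.value))

    def run(self):
        with self.output().open('w') as f:
            f.write(str(self.value * 2))


def test_double_runs_locally(tmp_path, monkeypatch):
    monkeypatch.chdir(tmp_path)  # keep outputs inside pytest's temp dir
    assert luigi.build([Double(value=21)], local_scheduler=True)
    with Double(value=21).output().open('r') as f:
        assert f.read() == '42'
```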