Update datasets.py (#86)
* Added `fetch_pose_data()` function

* Construct download registry from newest `metadata.yaml`

* Updated `datasets.py` module name, docstrings, and functions

* Added `from_lp_file` to `fetch_sample_data`

* Renamed `sample_dataset.py` and added test for `fetch_sample_data`

* Updated docs and docstrings

* Fixed `sample_data.fetch_sample_data()` to load data with correct FPS and renamed `POSE_DATA` download manager

* Cleaned up `pyproject.toml` dependencies and improved metadata-fetching logic in `sample_data.py`

* Removed hard-coded list of sample file names in `conftest.py`

* Minor cleanup of docs and docstrings

* Clarified "Adding New Data" instructions on `CONTRIBUTING.md`

Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>

* Small edit to `getting_started.md`

* Extended `fetch_sample_data_path()` to catch case in which file is not in the registry + added test for `list_sample_data()`

* Fetch metadata using `pooch.retrieve()`

* Fixed bug in `sample_data.py`

* Updated `fetch_metadata` function

* Refactored `test_sample_data` using a fixture

* Refactored and tested fetching of metadata

* Mentioned sample metadata more explicitly in the contributing guide

* Renamed `POSE_DATA` to `POSE_DATA_PATHS` in the testing suite, to be more explicit

---------

Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
b-peri and niksirbi authored Jan 10, 2024
1 parent 5249063 commit 03ddaf1
Showing 13 changed files with 386 additions and 144 deletions.
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -32,6 +32,8 @@ repos:
- types-setuptools
- pandas-stubs
- types-attrs
- types-PyYAML
- types-requests
- repo: https://github.com/mgedmin/check-manifest
rev: "0.49"
hooks:
28 changes: 14 additions & 14 deletions CONTRIBUTING.md
@@ -259,25 +259,26 @@ by the [German Neuroinformatics Node](https://www.g-node.org/).
GIN has a GitHub-like interface and git-like
[CLI](gin:G-Node/Info/wiki/GIN+CLI+Setup#quickstart) functionalities.

- Currently the data repository contains sample pose estimation data files
- stored in the `poses` folder. Each file name starts with either "DLC" or "SLEAP",
- depending on the pose estimation software used to generate the data.
+ Currently, the data repository contains sample pose estimation data files
+ stored in the `poses` folder. Metadata for these files, including information
+ about their provenance, is stored in the `poses_files_metadata.yaml` file.

### Fetching data
To fetch the data from GIN, we use the [pooch](https://www.fatiando.org/pooch/latest/index.html)
Python package, which can download data from pre-specified URLs and store them
locally for all subsequent uses. It also provides some nice utilities,
like verification of sha256 hashes and decompression of archives.

- The relevant functionality is implemented in the `movement.datasets.py` module.
+ The relevant functionality is implemented in the `movement.sample_data.py` module.
The most important parts of this module are:

- 1. The `POSE_DATA` download manager object, which contains a list of stored files and their known hashes.
- 2. The `list_pose_data()` function, which returns a list of the available files in the data repository.
- 3. The `fetch_pose_data_path()` function, which downloads a file (if not already cached locally) and returns the local path to it.
+ 1. The `SAMPLE_DATA` download manager object.
+ 2. The `list_sample_data()` function, which returns a list of the available files in the data repository.
+ 3. The `fetch_sample_data_path()` function, which downloads a file (if not already cached locally) and returns the local path to it.
+ 4. The `fetch_sample_data()` function, which downloads a file and loads it into movement directly, returning an `xarray.Dataset` object.

By default, the downloaded files are stored in the `~/.movement/data` folder.
- This can be changed by setting the `DATA_DIR` variable in the `movement.datasets.py` module.
+ This can be changed by setting the `DATA_DIR` variable in the `movement.sample_data.py` module.
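The caching behaviour described above (download once, verify the sha256 hash, reuse the local copy) can be sketched in plain Python. This is a simplified illustration of what the pooch-backed download manager does, not movement's actual code; all names here (`DATA_DIR`, `REGISTRY`, `fetch`) are hypothetical stand-ins.

```python
import hashlib
import os
import tempfile

# Illustrative stand-ins for the real cache directory and registry
# (movement uses ~/.movement/data and hashes from the metadata file).
DATA_DIR = tempfile.mkdtemp(prefix="movement_demo_")
REGISTRY = {}  # file name -> known sha256 hash


def sha256_of(path):
    """Hex sha256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(file_name, download):
    """Return a local path to `file_name`, downloading only if needed.

    `download` is a callable taking a destination path; it stands in
    for the HTTP request pooch would make to the GIN repository.
    """
    local_path = os.path.join(DATA_DIR, file_name)
    if not (os.path.exists(local_path)
            and sha256_of(local_path) == REGISTRY[file_name]):
        download(local_path)  # cache miss (or corrupted file): re-download
        if sha256_of(local_path) != REGISTRY[file_name]:
            raise ValueError(f"sha256 mismatch for {file_name}")
    return local_path
```

On a second call with the same file name, the hash check passes and no download happens, which is why repeated fetches of the same sample file are fast.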

### Adding new data
Only core movement developers may add new files to the external data repository.
@@ -287,9 +288,8 @@ To add a new file, you will need to:
2. Ask to be added as a collaborator on the [movement data repository](gin:neuroinformatics/movement-test-data) (if not already)
3. Download the [GIN CLI](gin:G-Node/Info/wiki/GIN+CLI+Setup#quickstart) and set it up with your GIN credentials, by running `gin login` in a terminal.
4. Clone the movement data repository to your local machine, by running `gin get neuroinformatics/movement-test-data` in a terminal.
- 5. Add your new files and commit them with `gin commit -m <message> <filename>`.
- 6. Upload the commited changes to the GIN repository, by running `gin upload`. Latest changes to the repository can be pulled via `gin download`. `gin sync` will synchronise the latest changes bidirectionally.
- 7. Determine the sha256 checksum hash of each new file, by running `sha256sum <filename>` in a terminal. Alternatively, you can use `pooch` to do this for you: `python -c "import pooch; pooch.file_hash('/path/to/file')"`. If you wish to generate a text file containing the hashes of all the files in a given folder, you can use `python -c "import pooch; pooch.make_registry('/path/to/folder', 'sha256_registry.txt')"`.
- 8. Update the `movement.datasets.py` module on the [movement GitHub repository](movement-github:) by adding the new files to the `POSE_DATA` registry. Make sure to include the correct sha256 hash, as determined in the previous step. Follow all the usual [guidelines for contributing code](#contributing-code). Make sure to test whether the new files can be fetched successfully (see [fetching data](#fetching-data) above) before submitting your pull request.
-
- You can also perform steps 3-6 via the GIN web interface, if you prefer to avoid using the CLI.
+ 5. Add your new files to `/movement-test-data/poses/`.
+ 6. Determine the sha256 checksum hash of each new file by running `sha256sum <filename>` in a terminal. Alternatively, you can use `pooch` to do this for you: `python -c "import pooch; hash = pooch.file_hash('/path/to/file'); print(hash)"`. If you wish to generate a text file containing the hashes of all the files in a given folder, you can use `python -c "import pooch; pooch.make_registry('/path/to/folder', 'sha256_registry.txt')"`.
+ 7. Add metadata for your new files to `poses_files_metadata.yaml`, including their sha256 hashes.
+ 8. Commit your changes using `gin commit -m <message> <filename>`.
+ 9. Upload the committed changes to the GIN repository by running `gin upload`. Latest changes to the repository can be pulled via `gin download`. `gin sync` will synchronise the latest changes bidirectionally.
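The registry-generation command in step 6 can be mimicked with the standard library alone. The sketch below approximates the `<name> <hash>` lines that `pooch.make_registry` writes out; it is an illustration of the idea, not the pooch implementation.

```python
import hashlib
import os


def make_registry(folder, output):
    """Write one '<relative path> <sha256>' line per file under `folder`,
    similar in spirit to what pooch.make_registry produces."""
    entries = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # Use forward slashes so the registry is platform-independent
            rel = os.path.relpath(path, folder).replace(os.sep, "/")
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries.append(f"{rel} {digest}")
    with open(output, "w") as out:
        out.write("\n".join(sorted(entries)) + "\n")
```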
11 changes: 6 additions & 5 deletions docs/source/api_index.rst
@@ -33,14 +33,15 @@ Input/Output
ValidPosesCSV
ValidPoseTracks

- Datasets
- --------
- .. currentmodule:: movement.datasets
+ Sample Data
+ -----------
+ .. currentmodule:: movement.sample_data
.. autosummary::
:toctree: api

-    list_pose_data
-    fetch_pose_data_path
+    list_sample_data
+    fetch_sample_data_path
+    fetch_sample_data

Logging
-------
25 changes: 17 additions & 8 deletions docs/source/getting_started.md
@@ -53,7 +53,7 @@ Please see the [contributing guide](target-contributing) for more information.

## Loading data
You can load predicted pose tracks from the pose estimation software packages
- [DeepLabCut](dlc:) or [SLEAP](sleap:).
+ [DeepLabCut](dlc:), [SLEAP](sleap:), or [LightningPose](lp:).

First import the `movement.io.load_poses` module:

@@ -114,27 +114,36 @@ You can also try movement out on some sample data included in the package.
You can view the available sample data files with:

```python
- from movement import datasets
+ from movement import sample_data

- file_names = datasets.list_pose_data()
+ file_names = sample_data.list_sample_data()
print(file_names)
```

This will print a list of file names containing sample pose data.
- The files are prefixed with the name of the pose estimation software package,
- either "DLC" or "SLEAP".
+ Each file is prefixed with the name of the pose estimation software package
+ that was used to generate it - either "DLC", "SLEAP", or "LP".

- To get the path to one of the sample files,
- you can use the `fetch_pose_data_path` function:
+ To get the path to one of the sample files,
+ you can use the `fetch_sample_data_path` function:

```python
- file_path = datasets.fetch_pose_data_path("DLC_two-mice.predictions.csv")
+ file_path = sample_data.fetch_sample_data_path("DLC_two-mice.predictions.csv")
```
The first time you call this function, it will download the corresponding file
to your local machine and save it in the `~/.movement/data` directory. On
subsequent calls, it will simply return the path to that local file.

- You can feed the path to the `from_dlc_file` or `from_sleap_file` functions
- and load the data, as shown above.
+ You can feed the path to the `from_dlc_file`, `from_sleap_file`, or
+ `from_lp_file` functions and load the data, as shown above.

Alternatively, you can skip the `fetch_sample_data_path()` step and load the
data directly using the `fetch_sample_data()` function:

```python
ds = sample_data.fetch_sample_data("DLC_two-mice.predictions.csv")
```

:::
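The `xarray.Dataset` returned by `fetch_sample_data()` bundles pose tracks and confidence scores with labelled coordinates. The toy dataset below is a rough, hand-built approximation of that structure; the variable, dimension, and coordinate names are assumptions based on the docs above, not the exact movement schema.

```python
import numpy as np
import xarray as xr

n_frames, fps = 100, 30
ds = xr.Dataset(
    data_vars={
        # (time, individuals, keypoints, space) array of x/y positions
        "pose_tracks": (
            ("time", "individuals", "keypoints", "space"),
            np.zeros((n_frames, 2, 3, 2)),
        ),
        # point-wise prediction confidence from the pose estimation model
        "confidence": (
            ("time", "individuals", "keypoints"),
            np.ones((n_frames, 2, 3)),
        ),
    },
    coords={
        "time": np.arange(n_frames) / fps,  # seconds, derived from FPS
        "individuals": ["individual_0", "individual_1"],
        "keypoints": ["snout", "centre", "tail_base"],
        "space": ["x", "y"],
    },
)
print(ds.sizes)
```

Labelled indexing then works as usual, e.g. `ds.pose_tracks.sel(individuals="individual_0", keypoints="snout")` returns the x/y trajectory of one keypoint for one animal.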

## Working with movement datasets
6 changes: 3 additions & 3 deletions examples/load_and_explore_poses.py
@@ -10,22 +10,22 @@
# -------
from matplotlib import pyplot as plt

- from movement import datasets
+ from movement import sample_data
from movement.io import load_poses

# %%
# Fetch an example dataset
# ------------------------
# Print a list of available datasets:

- for file_name in datasets.list_pose_data():
+ for file_name in sample_data.list_sample_data():
print(file_name)

# %%
# Fetch the path to an example dataset.
# Feel free to replace this with the path to your own dataset.
# e.g., ``file_path = "/path/to/my/data.h5"``)
- file_path = datasets.fetch_pose_data_path(
+ file_path = sample_data.fetch_sample_data_path(
"SLEAP_three-mice_Aeon_proofread.analysis.h5"
)

74 changes: 0 additions & 74 deletions movement/datasets.py

This file was deleted.

