Update datasets.py (#86)
* Added `fetch_pose_data()` function

* Construct download registry from newest `metadata.yaml`

* Updated `datasets.py` module name, docstrings, and functions

* Added `from_lp_file` to `fetch_sample_data`

* Renamed `sample_dataset.py` and added test for `fetch_sample_data`

* Updated docs and docstrings

* Fixed `sample_data.fetch_sample_data()` to load data with correct FPS and renamed `POSE_DATA` download manager

* Cleaned up `pyproject.toml` dependencies and improved metadata-fetching logic in `sample_data.py`

* Removed hard-coded list of sample file names in `conftest.py`

* Minor cleanup of docs and docstrings

* Clarified "Adding New Data" instructions on `CONTRIBUTING.md`

Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>

* Small edit to `getting_started.md`

* Extended `fetch_sample_data_path()` to catch case in which file is not in the registry + added test for `list_sample_data()`

* Fetch metadata using `pooch.retrieve()`

* Fixed bug in `sample_data.py`

* Updated `fetch_metadata` function

* Refactored `test_sample_data` using a fixture

* Refactored and tested fetching of metadata

* Mentioned sample metadata more explicitly in the contributing guide

* Renamed `POSE_DATA` to `POSE_DATA_PATHS` in the testing suite, to be more explicit

---------

Co-authored-by: Niko Sirmpilatze <niko.sirbiladze@gmail.com>
b-peri and niksirbi authored Jan 10, 2024
1 parent 5249063 commit 03ddaf1
Showing 13 changed files with 386 additions and 144 deletions.
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -32,6 +32,8 @@ repos:
- types-setuptools
- pandas-stubs
- types-attrs
- types-PyYAML
- types-requests
- repo: https://github.com/mgedmin/check-manifest
rev: "0.49"
hooks:
28 changes: 14 additions & 14 deletions CONTRIBUTING.md
@@ -259,25 +259,26 @@ by the [German Neuroinformatics Node](https://www.g-node.org/).
GIN has a GitHub-like interface and git-like
[CLI](gin:G-Node/Info/wiki/GIN+CLI+Setup#quickstart) functionalities.

- Currently the data repository contains sample pose estimation data files
- stored in the `poses` folder. Each file name starts with either "DLC" or "SLEAP",
- depending on the pose estimation software used to generate the data.
+ Currently, the data repository contains sample pose estimation data files
+ stored in the `poses` folder. Metadata for these files, including information
+ about their provenance, is stored in the `poses_files_metadata.yaml` file.

### Fetching data
To fetch the data from GIN, we use the [pooch](https://www.fatiando.org/pooch/latest/index.html)
Python package, which can download data from pre-specified URLs and store them
locally for all subsequent uses. It also provides some nice utilities,
like verification of sha256 hashes and decompression of archives.

- The relevant functionality is implemented in the `movement.datasets.py` module.
+ The relevant functionality is implemented in the `movement.sample_data.py` module.
The most important parts of this module are:

- 1. The `POSE_DATA` download manager object, which contains a list of stored files and their known hashes.
- 2. The `list_pose_data()` function, which returns a list of the available files in the data repository.
- 3. The `fetch_pose_data_path()` function, which downloads a file (if not already cached locally) and returns the local path to it.
+ 1. The `SAMPLE_DATA` download manager object.
+ 2. The `list_sample_data()` function, which returns a list of the available files in the data repository.
+ 3. The `fetch_sample_data_path()` function, which downloads a file (if not already cached locally) and returns the local path to it.
+ 4. The `fetch_sample_data()` function, which downloads a file and loads it into movement directly, returning an `xarray.Dataset` object.

By default, the downloaded files are stored in the `~/.movement/data` folder.
- This can be changed by setting the `DATA_DIR` variable in the `movement.datasets.py` module.
+ This can be changed by setting the `DATA_DIR` variable in the `movement.sample_data.py` module.
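The caching behaviour described above (download once, verify the sha256 hash, reuse the local copy) can be sketched in plain Python. This is a simplified illustration of what the pooch-backed download manager does, not movement's actual code; all names here (`DATA_DIR`, `REGISTRY`, `fetch`) are hypothetical stand-ins.

```python
import hashlib
import os
import tempfile

# Illustrative stand-ins for the real cache directory and registry
# (movement uses ~/.movement/data and hashes from the metadata file).
DATA_DIR = tempfile.mkdtemp(prefix="movement_demo_")
REGISTRY = {}  # file name -> known sha256 hash


def sha256_of(path):
    """Hex sha256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(file_name, download):
    """Return a local path to `file_name`, downloading only if needed.

    `download` is a callable taking a destination path; it stands in
    for the HTTP request pooch would make to the GIN repository.
    """
    local_path = os.path.join(DATA_DIR, file_name)
    if not (os.path.exists(local_path)
            and sha256_of(local_path) == REGISTRY[file_name]):
        download(local_path)  # cache miss (or corrupted file): re-download
        if sha256_of(local_path) != REGISTRY[file_name]:
            raise ValueError(f"sha256 mismatch for {file_name}")
    return local_path
```

On a second call with the same file name, the hash check passes and no download happens, which is why repeated fetches of the same sample file are fast.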

### Adding new data
Only core movement developers may add new files to the external data repository.
@@ -287,9 +288,8 @@ To add a new file, you will need to:
2. Ask to be added as a collaborator on the [movement data repository](gin:neuroinformatics/movement-test-data) (if not already)
3. Download the [GIN CLI](gin:G-Node/Info/wiki/GIN+CLI+Setup#quickstart) and set it up with your GIN credentials, by running `gin login` in a terminal.
4. Clone the movement data repository to your local machine, by running `gin get neuroinformatics/movement-test-data` in a terminal.
- 5. Add your new files and commit them with `gin commit -m <message> <filename>`.
- 6. Upload the commited changes to the GIN repository, by running `gin upload`. Latest changes to the repository can be pulled via `gin download`. `gin sync` will synchronise the latest changes bidirectionally.
- 7. Determine the sha256 checksum hash of each new file, by running `sha256sum <filename>` in a terminal. Alternatively, you can use `pooch` to do this for you: `python -c "import pooch; pooch.file_hash('/path/to/file')"`. If you wish to generate a text file containing the hashes of all the files in a given folder, you can use `python -c "import pooch; pooch.make_registry('/path/to/folder', 'sha256_registry.txt')"`.
- 8. Update the `movement.datasets.py` module on the [movement GitHub repository](movement-github:) by adding the new files to the `POSE_DATA` registry. Make sure to include the correct sha256 hash, as determined in the previous step. Follow all the usual [guidelines for contributing code](#contributing-code). Make sure to test whether the new files can be fetched successfully (see [fetching data](#fetching-data) above) before submitting your pull request.
-
- You can also perform steps 3-6 via the GIN web interface, if you prefer to avoid using the CLI.
+ 5. Add your new files to `/movement-test-data/poses/`.
+ 6. Determine the sha256 checksum hash of each new file by running `sha256sum <filename>` in a terminal. Alternatively, you can use `pooch` to do this for you: `python -c "import pooch; hash = pooch.file_hash('/path/to/file'); print(hash)"`. If you wish to generate a text file containing the hashes of all the files in a given folder, you can use `python -c "import pooch; pooch.make_registry('/path/to/folder', 'sha256_registry.txt')"`.
+ 7. Add metadata for your new files to `poses_files_metadata.yaml`, including their sha256 hashes.
+ 8. Commit your changes using `gin commit -m <message> <filename>`.
+ 9. Upload the committed changes to the GIN repository by running `gin upload`. Latest changes to the repository can be pulled via `gin download`. `gin sync` will synchronise the latest changes bidirectionally.
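The registry-generation command in step 6 can be mimicked with the standard library alone. The sketch below approximates the `<name> <hash>` lines that `pooch.make_registry` writes out; it is an illustration of the idea, not the pooch implementation.

```python
import hashlib
import os


def make_registry(folder, output):
    """Write one '<relative path> <sha256>' line per file under `folder`,
    similar in spirit to what pooch.make_registry produces."""
    entries = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            # Use forward slashes so the registry is platform-independent
            rel = os.path.relpath(path, folder).replace(os.sep, "/")
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries.append(f"{rel} {digest}")
    with open(output, "w") as out:
        out.write("\n".join(sorted(entries)) + "\n")
```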
11 changes: 6 additions & 5 deletions docs/source/api_index.rst
@@ -33,14 +33,15 @@ Input/Output
ValidPosesCSV
ValidPoseTracks

- Datasets
- --------
- .. currentmodule:: movement.datasets
+ Sample Data
+ -----------
+ .. currentmodule:: movement.sample_data
.. autosummary::
:toctree: api

-    list_pose_data
-    fetch_pose_data_path
+    list_sample_data
+    fetch_sample_data_path
+    fetch_sample_data

Logging
-------
25 changes: 17 additions & 8 deletions docs/source/getting_started.md
@@ -53,7 +53,7 @@ Please see the [contributing guide](target-contributing) for more information.

## Loading data
You can load predicted pose tracks from the pose estimation software packages
- [DeepLabCut](dlc:) or [SLEAP](sleap:).
+ [DeepLabCut](dlc:), [SLEAP](sleap:), or [LightningPose](lp:).

First import the `movement.io.load_poses` module:

@@ -114,27 +114,36 @@ You can also try movement out on some sample data included in the package.
You can view the available sample data files with:

```python
- from movement import datasets
+ from movement import sample_data

- file_names = datasets.list_pose_data()
+ file_names = sample_data.list_sample_data()
print(file_names)
```

This will print a list of file names containing sample pose data.
- The files are prefixed with the name of the pose estimation software package,
- either "DLC" or "SLEAP".
+ Each file is prefixed with the name of the pose estimation software package
+ that was used to generate it - either "DLC", "SLEAP", or "LP".

- To get the path to one of the sample files,
- you can use the `fetch_pose_data_path` function:
+ To get the path to one of the sample files,
+ you can use the `fetch_sample_data_path` function:

```python
- file_path = datasets.fetch_pose_data_path("DLC_two-mice.predictions.csv")
+ file_path = sample_data.fetch_sample_data_path("DLC_two-mice.predictions.csv")
```
The first time you call this function, it will download the corresponding file
to your local machine and save it in the `~/.movement/data` directory. On
subsequent calls, it will simply return the path to that local file.

- You can feed the path to the `from_dlc_file` or `from_sleap_file` functions
- and load the data, as shown above.
+ You can feed the path to the `from_dlc_file`, `from_sleap_file`, or
+ `from_lp_file` functions and load the data, as shown above.

Alternatively, you can skip the `fetch_sample_data_path()` step and load the
data directly using the `fetch_sample_data()` function:

```python
ds = sample_data.fetch_sample_data("DLC_two-mice.predictions.csv")
```

:::
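The `xarray.Dataset` returned by `fetch_sample_data()` bundles pose tracks and confidence scores with labelled coordinates. The toy dataset below is a rough, hand-built approximation of that structure; the variable, dimension, and coordinate names are assumptions based on the docs above, not the exact movement schema.

```python
import numpy as np
import xarray as xr

n_frames, fps = 100, 30
ds = xr.Dataset(
    data_vars={
        # (time, individuals, keypoints, space) array of x/y positions
        "pose_tracks": (
            ("time", "individuals", "keypoints", "space"),
            np.zeros((n_frames, 2, 3, 2)),
        ),
        # point-wise prediction confidence from the pose estimation model
        "confidence": (
            ("time", "individuals", "keypoints"),
            np.ones((n_frames, 2, 3)),
        ),
    },
    coords={
        "time": np.arange(n_frames) / fps,  # seconds, derived from FPS
        "individuals": ["individual_0", "individual_1"],
        "keypoints": ["snout", "centre", "tail_base"],
        "space": ["x", "y"],
    },
)
print(ds.sizes)
```

Labelled indexing then works as usual, e.g. `ds.pose_tracks.sel(individuals="individual_0", keypoints="snout")` returns the x/y trajectory of one keypoint for one animal.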

## Working with movement datasets
6 changes: 3 additions & 3 deletions examples/load_and_explore_poses.py
@@ -10,22 +10,22 @@
# -------
from matplotlib import pyplot as plt

- from movement import datasets
+ from movement import sample_data
from movement.io import load_poses

# %%
# Fetch an example dataset
# ------------------------
# Print a list of available datasets:

- for file_name in datasets.list_pose_data():
+ for file_name in sample_data.list_sample_data():
print(file_name)

# %%
# Fetch the path to an example dataset.
# Feel free to replace this with the path to your own dataset.
# e.g., ``file_path = "/path/to/my/data.h5"``)
- file_path = datasets.fetch_pose_data_path(
+ file_path = sample_data.fetch_sample_data_path(
"SLEAP_three-mice_Aeon_proofread.analysis.h5"
)

74 changes: 0 additions & 74 deletions movement/datasets.py

This file was deleted.

