Update datasets to use API #6126

anthayes92 · 2024-08-21T21:25:29Z

Context:
Datasets are currently downloaded by hitting the S3 bucket (via CloudFront) where the .h5 files are stored. This PR updates the way datasets are downloaded by directing download requests to the Software Cloud managed Datasets Service.
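As a rough illustration of what directing a download request to such a service could involve, here is a minimal sketch that builds a GraphQL-style JSON request body. It assumes the service accepts GraphQL (this PR touches a `graphql.py` module); the query text, the field names (`datasetClassId`, `parameters`, `downloadUrl`) and the helper `build_request_body` are hypothetical and are not taken from the PennyLane source or the real API schema.

```python
import json

# Hypothetical query text: the real schema is not shown in this PR thread.
QUERY = """
query GetDatasets($datasetClassId: String!, $parameters: [ParameterInput!]) {
  datasets(datasetClassId: $datasetClassId, parameters: $parameters) {
    id
    downloadUrl
  }
}
"""


def build_request_body(data_name, **params):
    """Format keyword parameters into a JSON request body (illustrative only)."""
    parameters = []
    for name, value in sorted(params.items()):
        # Accept either a single value or a list/tuple of values.
        values = value if isinstance(value, (list, tuple)) else [value]
        parameters.append({"name": name, "values": [str(v) for v in values]})

    return json.dumps(
        {
            "query": QUERY,
            "variables": {"datasetClassId": data_name, "parameters": parameters},
        }
    )
```

The body would then be sent as an HTTP POST to the service endpoint, which is why the change introduces a network dependency even before any file is downloaded.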

Description of the Change:
The qml.data.load function now queries the Datasets API in order to download datasets. Additionally:

Updates parameter formatting for the new API
Removes URL escaping functionality, since the download URLs retrieved from the API are already escaped
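To see why escaping an already-escaped URL is a problem, consider this small sketch using the standard library's `urllib.parse`. The URL is a made-up placeholder, not a real dataset location.

```python
from urllib.parse import quote, unquote

# Placeholder URL that is already percent-encoded (space -> %20).
already_escaped = "https://example.com/datasets/H2%20STO-3G.h5"

# Escaping a second time encodes the "%" itself as "%25",
# so the server would look for a file literally named "H2%20STO-3G.h5".
double_escaped = quote(already_escaped, safe=":/")

print(double_escaped)
```

Decoding the double-escaped URL once only recovers the singly escaped form, so the safest option when the API returns escaped URLs is to use them unchanged.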

Benefits:

Removes almost all dependency on the foldermap and data_struct files, alleviating the need to manually manage them.

⚠️ The list_datasets() function still depends on the foldermap; this dependency will likely be removed in 0.40.0

Facilitates tracking dataset downloads for analytics.

Possible Drawbacks:

Introduces network dependency for accessing the external API.

Related GitHub Issues:

codecov · 2024-08-26T19:58:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.38%. Comparing base (43cb979) to head (ef8c82f).
Report is 321 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6126      +/-   ##
==========================================
- Coverage   99.38%   99.38%   -0.01%     
==========================================
  Files         450      451       +1     
  Lines       42619    42670      +51     
==========================================
+ Hits        42359    42408      +49     
- Misses        260      262       +2


astralcai

What's the decision regarding the old list_datasets? If we're deprecating it, it should raise a deprecation warning, and both the changelog and docs/development/deprecations.rst should mention this deprecation.
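For reference, a generic Python sketch of the kind of deprecation warning the reviewer describes is shown below. It uses only the standard `warnings` module; the function name `old_list_datasets` and its empty return value are hypothetical placeholders, and this is not how PennyLane implements `list_datasets()`.

```python
import warnings


def old_list_datasets():
    """Hypothetical function that is being phased out."""
    warnings.warn(
        "old_list_datasets() is deprecated and will be removed in a future release.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this function
    )
    return {}
```

As the comment notes, a deprecation in this project would also need matching entries in the changelog and in docs/development/deprecations.rst.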

pennylane/data/data_manager/__init__.py

anthayes92 · 2024-10-22T17:01:24Z

> What's the decision regarding the old list_datasets? If we're deprecating it, it should raise a deprecation warning, and both the changelog and docs/development/deprecations.rst should mention this deprecation.

We'll be maintaining list_datasets() with its existing functionality in this release. We currently plan to revisit it in 0.40.0 to remove the dependency on the foldermap. Either way, list_datasets() will not be deprecated.

astralcai

LGTM!

DSGuala · 2024-10-23T15:39:54Z

Thanks for pushing the recent updates! Defaults are now working nicely 👍
Some additional QA feedback from testing:

- This code gives an obscure error message: [ds] = qml.data.load('qchem', molname='fail')
  ValueError: max_workers must be greater than 0
  Previously we would have this error message:
  ValueError: molname value(s) ['fail'] are not available. Available values are: ['BH3', 'BeH2', 'C2',
- ~~Not a high priority, but~~ there is some difference in how we store the files now (new folder structure and file names). Users may have to redownload their existing datasets.
  old: [image not preserved] vs new: [image not preserved]
- The 'full' keyword is no longer valid:
  [ds] = qml.data.load('qchem', molname='H2', basis='STO-3G', bondlength='full')
  TypeError: Object of type ParamArg is not JSON serializable

anthayes92 · 2024-10-23T19:36:27Z

Thanks for testing this out and leaving this useful feedback @DSGuala! To address your points:

> This code gives an obscure error message: [ds] = qml.data.load('qchem', molname='fail')
> ValueError: max_workers must be greater than 0
> Previously we would have this error message:
> ValueError: molname value(s) ['fail'] are not available. Available values are: ['BH3', 'BeH2', 'C2',

I've updated this to give a general error message:

ValueError: No datasets exist for the provided configuration.
Please check the available datasets by using the ``qml.data.list_datasets()`` function.

Since retrieving the specific values available for each parameter presents a similar issue to the problem with updating list_datasets(), an intermediate solution for now is to point the user to the existing list_datasets() function to check the available datasets.
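The original `max_workers must be greater than 0` message is what `concurrent.futures.ThreadPoolExecutor` raises when it is constructed with zero workers. A plausible cause, assumed here rather than confirmed in this thread, is that the worker count was derived from an empty list of matching datasets. The sketch below shows a guard that raises the friendlier message first; `fetch_all` and its arguments are hypothetical names, not PennyLane functions.

```python
from concurrent.futures import ThreadPoolExecutor

NO_DATASETS_MESSAGE = (
    "No datasets exist for the provided configuration.\n"
    "Please check the available datasets by using the "
    "``qml.data.list_datasets()`` function."
)


def fetch_all(urls, fetch, num_parallel=8):
    """Apply ``fetch`` to every URL in parallel (illustrative helper)."""
    if not urls:
        # Without this check, min(num_parallel, len(urls)) would be 0 and
        # ThreadPoolExecutor would raise the obscure max_workers error.
        raise ValueError(NO_DATASETS_MESSAGE)

    with ThreadPoolExecutor(max_workers=min(num_parallel, len(urls))) as pool:
        return list(pool.map(fetch, urls))
```

Checking for the empty case before creating the executor keeps the error close to its real cause, which is a query that matched nothing.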

> Not a high priority, but there is some difference in how we store the files now (new folder structure and file names). Users may have to redownload their existing datasets.

This could be quite involved given the new formatting of the download IDs. Users can also specify these paths manually if desired. Given that this is lower priority, and considering the current time constraints, it could be addressed in a follow-up if we still want it.

> The 'full' keyword is no longer valid:
> [ds] = qml.data.load('qchem', molname='H2', basis='STO-3G', bondlength='full')
> TypeError: Object of type ParamArg is not JSON serializable

Good catch, this was overlooked. A fix has been added now, thanks!
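The `TypeError` above is the standard behaviour of `json.dumps` when it meets a plain `enum.Enum` member. The sketch below reproduces it with a local stand-in enum and shows one possible way to make the parameters serializable. The stand-in `ParamArg` class, the helper `to_json_safe`, and the choice to treat `FULL` as "no filter" are assumptions for illustration; the thread does not show the actual fix.

```python
import enum
import json


class ParamArg(enum.Enum):
    """Local stand-in for a sentinel enum; not imported from PennyLane."""

    FULL = "full"
    DEFAULT = "default"


def to_json_safe(params):
    """Return a copy of ``params`` that json.dumps can serialize."""
    safe = {}
    for name, value in params.items():
        if value is ParamArg.FULL:
            # Assumption: "full" means "do not filter on this parameter".
            continue
        # Any other sentinel is replaced by its plain string value.
        safe[name] = value.value if isinstance(value, ParamArg) else value
    return safe


params = {"molname": "H2", "basis": "STO-3G", "bondlength": ParamArg.FULL}
print(json.dumps(to_json_safe(params)))
```

Converting sentinels before building the request body keeps the serialization step free of custom types.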

pennylane/data/data_manager/graphql.py

pennylane/data/data_manager/params.py

pennylane/data/data_manager/graphql.py

pennylane/data/data_manager/__init__.py

brownj85

Looks good!

DSGuala

Looks good to me! Thanks for addressing everything quickly 🙏

Co-authored-by: Paul Finlay <50180049+doctorperceptron@users.noreply.github.com>

anthayes92 added 7 commits August 21, 2024 17:24

add dataset queries

da38e02

update query

94c2cae

update url

29acd12

update list tests

b6f799e

fix existing tests

c5c89f9

fix tests

c253c6d

fmt

9eb725a

anthayes92 marked this pull request as ready for review August 26, 2024 19:20

DSGuala self-requested a review August 26, 2024 19:28

anthayes92 added 2 commits August 26, 2024 15:59

fmt

7eb42a4

appease codefactor

898766d

anthayes92 marked this pull request as draft August 26, 2024 20:49

anthayes92 and others added 9 commits August 26, 2024 16:50

fix docstrings

1bdd34f

update docstrings

2254fa4

fmt

0da7c2c

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

156268c

update interactive download, update tests

fa31d22

fmt

4b339da

fmt

8986ea0

fmt

20c9d09

fix val

56f8662

anthayes92 marked this pull request as ready for review September 13, 2024 21:44

anthayes92 and others added 7 commits September 13, 2024 18:15

fmt

b1f5ca3

fmt

361f7a2

add graphql tests

1234ff3

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

e2d329f

add edge case tests

b00fc0c

Merge branch 'sc-70918-pennylane-oss-uses-new-datasets-api' of github…

2b0b1c4

….com:PennyLaneAI/pennylane into sc-70918-pennylane-oss-uses-new-datasets-api

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

6b67f33

anthayes92 requested a review from astralcai October 22, 2024 13:30

anthayes92 added 2 commits October 22, 2024 10:01

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

a9070d0

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

2418cf9

astralcai reviewed Oct 22, 2024

pennylane/data/data_manager/__init__.py Outdated

anthayes92 and others added 3 commits October 22, 2024 12:54

fmt

a84d5e2

Merge branch 'sc-70918-pennylane-oss-uses-new-datasets-api' of github…

2e88015

….com:PennyLaneAI/pennylane into sc-70918-pennylane-oss-uses-new-datasets-api

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

5f6eaf8

anthayes92 requested a review from astralcai October 22, 2024 17:01

astralcai approved these changes Oct 22, 2024

anthayes92 and others added 4 commits October 22, 2024 16:05

remove url escaping

24d5f08

Merge branch 'sc-70918-pennylane-oss-uses-new-datasets-api' of github…

f74a3ef

….com:PennyLaneAI/pennylane into sc-70918-pennylane-oss-uses-new-datasets-api

handle default query params

a5700e6

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

8b4c8a7

anthayes92 added 2 commits October 23, 2024 15:19

add error message, handle full params

e4925dc

Merge branch 'sc-70918-pennylane-oss-uses-new-datasets-api' of github…

6e93c62

….com:PennyLaneAI/pennylane into sc-70918-pennylane-oss-uses-new-datasets-api

brownj85 requested changes Oct 23, 2024

add feedback

34bcd29

anthayes92 requested a review from brownj85 October 23, 2024 22:20

brownj85 approved these changes Oct 24, 2024

DSGuala approved these changes Oct 24, 2024

anthayes92 enabled auto-merge (squash) October 25, 2024 15:24

anthayes92 added 3 commits October 25, 2024 11:24

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

7467c0b

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

47dfc10

Merge branch 'master' into sc-70918-pennylane-oss-uses-new-datasets-api

ef8c82f

anthayes92 merged commit 630cf78 into master Oct 25, 2024
39 checks passed

anthayes92 deleted the sc-70918-pennylane-oss-uses-new-datasets-api branch October 25, 2024 18:19
