Update datasets to use API #6126

Merged
anthayes92 merged 66 commits into master from sc-70918-pennylane-oss-uses-new-datasets-api
Oct 25, 2024

@anthayes92 anthayes92 commented Aug 21, 2024

Context:
Datasets are currently downloaded directly from the S3 bucket (via CloudFront) where the .h5 files are stored. This PR instead directs download requests to the Datasets Service managed by Software Cloud.

Description of the Change:
The qml.data.load function now queries the Datasets API in order to download datasets. Additionally:

  • Updates parameter formatting for the new API
  • Removes the URL-escaping functionality, since the download URLs retrieved from the API are already escaped
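
A minimal sketch of why escaping an already-escaped URL is harmful, using only the standard library. The URL below is a made-up placeholder, not a real Datasets API response, and the snippet does not reproduce the code removed in this PR.

```python
from urllib.parse import quote, unquote

# Hypothetical download URL, already percent-escaped as an API might return it.
escaped_url = "https://example.com/datasets/H2%20STO-3G/data.h5"

# Escaping a second time re-encodes each "%" as "%25", which corrupts the path.
double_escaped = quote(escaped_url, safe=":/")

print(double_escaped)
# https://example.com/datasets/H2%2520STO-3G/data.h5

# Decoding once only gets back to the escaped form, not the intended file name.
print(unquote(double_escaped) == unquote(escaped_url))  # False
```

A server that decodes the doubly escaped path once would look for a file literally named `H2%20STO-3G`, so the request would likely fail.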

Benefits:

  • Removes almost all dependency on the foldermap and data_struct files, alleviating the need to manually manage them.

⚠️ There is a lingering foldermap dependency for the list_datasets() function, which will likely be removed in 0.40.0.

  • Facilitates tracking dataset downloads for analytics.

Possible Drawbacks:

  • Introduces a network dependency on the external API.

Related GitHub Issues:

@anthayes92 anthayes92 marked this pull request as ready for review August 26, 2024 19:20
@DSGuala DSGuala self-requested a review August 26, 2024 19:28

codecov bot commented Aug 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.38%. Comparing base (43cb979) to head (ef8c82f).
Report is 321 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6126      +/-   ##
==========================================
- Coverage   99.38%   99.38%   -0.01%     
==========================================
  Files         450      451       +1     
  Lines       42619    42670      +51     
==========================================
+ Hits        42359    42408      +49     
- Misses        260      262       +2     


@anthayes92 anthayes92 marked this pull request as draft August 26, 2024 20:49
@anthayes92 anthayes92 marked this pull request as ready for review September 13, 2024 21:44
@anthayes92 anthayes92 requested a review from astralcai October 22, 2024 13:30

@astralcai astralcai left a comment

What's the decision regarding the old list_datasets? If we're deprecating it, it should raise a deprecation warning, and both the changelog and docs/development/deprecations.rst should mention this deprecation.

pennylane/data/data_manager/__init__.py (outdated, resolved)
@anthayes92

> What's the decision regarding the old list_datasets? If we're deprecating it, it should raise a deprecation warning, and both the changelog and docs/development/deprecations.rst should mention this deprecation.

We'll be maintaining list_datasets() with its existing functionality in this release. We currently plan to revisit it in 0.40.0 to remove the dependency on the foldermap; either way, list_datasets() will not be deprecated.

@anthayes92 anthayes92 requested a review from astralcai October 22, 2024 17:01

@astralcai astralcai left a comment

LGTM!

@DSGuala

DSGuala commented Oct 23, 2024

Thanks for pushing the recent updates! Defaults are now working nicely 👍
Some additional QA feedback from testing:

  1. This code gives an obscure error message: [ds] = qml.data.load('qchem', molname='fail')
    ValueError: max_workers must be greater than 0
    Previously we would have this error message:
    ValueError: molname value(s) ['fail'] are not available. Available values are: ['BH3', 'BeH2', 'C2',
  2. Not a high priority, but there is some difference in how we store the files now (new folder structure and file names). Users may have to redownload their existing datasets.
    old: [screenshot omitted]
    vs new: [screenshot omitted]
  3. The 'full' keyword is no longer valid:
    [ds] = qml.data.load('qchem', molname='H2', basis='STO-3G', bondlength='full')
    TypeError: Object of type ParamArg is not JSON serializable

@anthayes92

Thanks for testing this out and leaving this useful feedback @DSGuala! To address your points:

> This code gives an obscure error message: [ds] = qml.data.load('qchem', molname='fail')
> ValueError: max_workers must be greater than 0
> Previously we would have this error message:
> ValueError: molname value(s) ['fail'] are not available. Available values are: ['BH3', 'BeH2', 'C2',

I've updated this to give a general error message:

ValueError: No datasets exist for the provided configuration.
Please check the available datasets by using the ``qml.data.list_datasets()`` function.

Retrieving the specific values available for each parameter runs into much the same problem as updating list_datasets(), so the interim solution is to point the user to the existing list_datasets() function to check the available datasets.
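
As a rough illustration, the obscure message is what `ThreadPoolExecutor` raises when asked for zero workers, which would happen if the worker count were derived from an empty list of matches. That cause is an assumption, not something confirmed in this thread. The sketch below guards against the empty case first; `fetch_all`, `fetch`, and the worker cap of 8 are hypothetical names and values, not the PennyLane implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# The obscure error: an executor cannot be created with zero workers.
try:
    ThreadPoolExecutor(max_workers=0)
except ValueError as exc:
    print(exc)  # max_workers must be greater than 0


def fetch_all(urls, fetch):
    """Fetch every URL concurrently; `fetch` is a hypothetical download callable."""
    if not urls:
        # Fail early with an actionable message instead of the executor's error.
        raise ValueError(
            "No datasets exist for the provided configuration.\n"
            "Please check the available datasets by using the "
            "``qml.data.list_datasets()`` function."
        )
    with ThreadPoolExecutor(max_workers=min(len(urls), 8)) as pool:
        return list(pool.map(fetch, urls))
```

Checking for the empty case before building the executor keeps the error message tied to the user's input rather than to an implementation detail.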

> Not a high priority, but there is some difference in how we store the files now (new folder structure and file names). Users may have to redownload their existing datasets.

Preserving compatibility with the old layout could be quite involved, given the new formatting of the download IDs. Users can also specify these paths manually if desired. Since this is lower priority and time is currently constrained, it could be addressed in a follow-up if we still want it.

> The 'full' keyword is no longer valid:
> [ds] = qml.data.load('qchem', molname='H2', basis='STO-3G', bondlength='full')
> TypeError: Object of type ParamArg is not JSON serializable

Good catch, this was overlooked. A fix has been added now, thanks!
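
For readers unfamiliar with this class of error, here is a small sketch of how an enum member ends up in a JSON payload and one way to handle it. The `ParamArg` class below is a stand-in written for this example, and `to_jsonable` is a hypothetical helper. The actual fix in this PR may differ, and whether the API expects the literal string "full" is an assumption.

```python
import enum
import json


class ParamArg(enum.Enum):
    """Stand-in for the special-argument enum named in the traceback."""

    FULL = "full"
    DEFAULT = "default"


def to_jsonable(params):
    """Replace enum members with their plain values before serializing."""
    return {k: (v.value if isinstance(v, enum.Enum) else v) for k, v in params.items()}


params = {"molname": "H2", "basis": "STO-3G", "bondlength": ParamArg.FULL}

# A plain Enum member is not handled by the default JSON encoder.
try:
    json.dumps(params)
except TypeError as exc:
    print(exc)  # Object of type ParamArg is not JSON serializable

print(json.dumps(to_jsonable(params)))
# {"molname": "H2", "basis": "STO-3G", "bondlength": "full"}
```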

pennylane/data/data_manager/graphql.py (outdated, resolved)
pennylane/data/data_manager/params.py (resolved)
pennylane/data/data_manager/graphql.py (outdated, resolved)
pennylane/data/data_manager/__init__.py (outdated, resolved)
pennylane/data/data_manager/__init__.py (outdated, resolved)
@anthayes92 anthayes92 requested a review from brownj85 October 23, 2024 22:20

@brownj85 brownj85 left a comment

Looks good!


@DSGuala DSGuala left a comment

Looks good to me! Thanks for addressing everything quickly 🙏

@anthayes92 anthayes92 enabled auto-merge (squash) October 25, 2024 15:24
@anthayes92 anthayes92 merged commit 630cf78 into master Oct 25, 2024
39 checks passed
@anthayes92 anthayes92 deleted the sc-70918-pennylane-oss-uses-new-datasets-api branch October 25, 2024 18:19
mudit2812 pushed a commit that referenced this pull request Nov 11, 2024
Co-authored-by: Paul Finlay <50180049+doctorperceptron@users.noreply.github.com>