ADLS Gen2 FileSystemClient.get_paths() returns only 5000 paths (1 page in PageIterator) #16531

Closed
siarblack opened this issue Feb 4, 2021 · 9 comments · Fixed by #16581
Assignees
Labels
bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. Data Lake Storage Gen2 needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team Service Attention Workflow: This issue is responsible by Azure service team.

Comments

siarblack commented Feb 4, 2021

  • Package Name: azure.storage.filedatalake
  • Package Version: 12.2.2
  • Operating System: Azure Databricks, Ubuntu 16.04.6LTS
  • Python Version: 3.7.3

Describe the bug
After obtaining a FileSystemClient for an ADLS Gen2 container that holds more than 5000 files & folders, I try to retrieve all paths from the container with the get_paths() method. The iterator it returns yields only 5000 PathProperties items, or only 1 page when I use the by_page() method.

To Reproduce
Steps to reproduce the behavior:

  1. Connect to the ADLS Gen2 storage account; in my case I used DataLakeServiceClient with the storage account key as the credential.
  2. Get the FileSystemClient for a container that holds >5000 files & folders using the get_file_system_client() method.
  3. Use the get_paths() method to get the PathProperties iterator.
  4. Check the number of retrieved paths after converting the iterator to a list.
  5. Optional: Check the number of pages of retrieved paths using the by_page() method.
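
The steps above can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual code: the helper name `count_paths_and_pages` is hypothetical, and the account URL, key, and container name in the commented usage are placeholders. The helper only assumes a client object whose `get_paths()` returns an iterable that also offers `by_page()`.

```python
from typing import Any, Tuple


def count_paths_and_pages(file_system_client: Any) -> Tuple[int, int]:
    """Return (number of paths, number of pages) reported by get_paths().

    Hypothetical helper for illustration; it accepts any object that
    behaves like FileSystemClient for the two calls used below.
    """
    # Steps 3 and 4: convert the item iterator to a list and count it.
    total_paths = len(list(file_system_client.get_paths()))

    # Step 5: request a fresh iterator and count its pages instead.
    total_pages = sum(1 for _ in file_system_client.get_paths().by_page())

    return total_paths, total_pages


# Assumed usage against a real account (all angle-bracket values are placeholders):
#
# from azure.storage.filedatalake import DataLakeServiceClient
#
# service = DataLakeServiceClient(
#     account_url="https://<account-name>.dfs.core.windows.net",
#     credential="<storage-account-key>",
# )
# fsc = service.get_file_system_client("<container-name>")
# print(count_paths_and_pages(fsc))
```

With the behavior reported here, a container holding more than 5000 paths would yield a count of 5000 paths and 1 page.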

Expected behavior
I expect to get an iterator of PathProperties with the correct number of items for the container (>5000).
Optional: I expect to get a PageIterator with the correct number of pages (>1) when the container holds >5000 paths.

Screenshots
image
image

@ghost

ghost commented Feb 4, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @sumantmehtams.

Author: siarblack
Assignees: xiafu-msft
Labels:

Client, Data Lake Storage Gen2, Service Attention, bug, customer-reported, needs-triage, question

Milestone: -

@xiangyan99
Member

Thanks for the feedback, we’ll investigate asap.

@tasherif-msft
Contributor

Hi @siarblack, the server returns/defaults to 5000 paths when you do a get_paths() operation.
To change this you can simply pass max_results, e.g.:
path_list = fsc.get_paths(max_results=6000)

Let me know if this works :)
Thanks.

@siarblack
Author

Hi @tasherif-msft !
I had already tried this.
The output is still the same:
image

I guess this is because the default value of max_results is None, according to the docs:
https://docs.microsoft.com/en-us/python/api/azure-storage-file-datalake/azure.storage.filedatalake.filesystemclient?view=azure-python#get-paths-path-none--recursive-true--max-results-none----kwargs-

@tasherif-msft
Contributor

Hi @siarblack, max_results can only be set up to 5000 per page (the maximum the server allows). The behavior you're encountering is indeed a bug: paging isn't being done correctly. I have been investigating this, and it appears to be caused by a recent migration in our generated layer.
We will work on a fix asap and will probably ship a patch release.

I will keep you updated. Sorry for the inconvenience!
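
Since max_results acts as a per-page cap rather than a total, a complete listing has to walk every page. A minimal sketch of that pattern, assuming a release in which paging works correctly (it does not work around the bug in 12.2.2); the helper name `iter_path_names` is hypothetical, and it relies only on `get_paths(max_results=...)`, `by_page()`, and the `name` attribute of each returned item:

```python
from typing import Any, Iterator


def iter_path_names(file_system_client: Any, page_size: int = 5000) -> Iterator[str]:
    """Yield the name of every path, fetching one page at a time.

    Hypothetical helper for illustration. page_size is passed as
    max_results, which per this thread the server caps at 5000 per page.
    """
    pages = file_system_client.get_paths(max_results=page_size).by_page()
    for page in pages:
        # Each page is itself an iterator over path items.
        for path in page:
            yield path.name


# Assumed usage, with fsc being a FileSystemClient:
# all_names = list(iter_path_names(fsc))
# print(len(all_names))
```

Iterating the object returned by get_paths() directly should give the same items; walking pages explicitly is mainly useful when the page boundaries matter, for example to report progress.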

@tasherif-msft
Contributor

Hi @siarblack, the fix got merged and we will be doing a patch release for this very soon. I will keep you updated.

@tasherif-msft
Contributor

Hi @siarblack, datalake 12.2.3 has been released!
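
To confirm that an environment has picked up the patch release, the installed version can be compared against 12.2.3. The sketch below is one possible check, with hypothetical helper names; it assumes only that 12.2.3 is the first release carrying the fix, as stated in this thread. Releases that predate the generated-layer migration may not be affected at all, so a False result does not by itself prove the bug is present.

```python
import re
from typing import Tuple

# First release containing the fix, per this thread.
FIX_RELEASE = (12, 2, 3)


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components of a version string."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_at_least_fix_release(version: str) -> bool:
    """Return True if the version is 12.2.3 or newer."""
    return parse_release(version) >= FIX_RELEASE


# Assumed usage on Python 3.8+ (importlib.metadata is not available on 3.7):
# from importlib.metadata import version
# print(is_at_least_fix_release(version("azure-storage-file-datalake")))
```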

@siarblack
Author

Thanks a lot @tasherif-msft for such a fast and effective response!
Now the get_paths() method works as expected:
image

The ItemPaged iterator's by_page() also works as expected:
image

@tasherif-msft
Contributor

@siarblack my pleasure :) sorry for the inconvenience!

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023