RF idea: deprecate same_names in favor of a more generic layout parameter #555

Open
yarikoptic opened this issue Mar 3, 2021 · 1 comment


@yarikoptic
Contributor

ATM CachingFileSystem has a single bool option same_names to switch the layout of cached files from /hash to /url-filename, and thus leaves no room for "improvement":

Under heavy use of the cache, a flat tree of files (/hash or /url-filename based) could lead to a very large directory, so the filesystem could become inefficient at listing that directory etc.

  • A common workaround (look under .git/objects; the same approach is used by git-annex, girder etc) is to establish leading directories, e.g. for a /hash the path to the file could be /hash[:2]/hash[2:4]/hash[4:], thus reducing the impact on the file system.
  • For a url-based path, it could simply be a path constructed from the URI components, e.g. the URL http://domain/p1/p2/filename could become the path http/domain/p1/p2/filename, thus allowing to disambiguate between file systems etc, and also avoiding conflicts for the same common filename (as I guess happens now with same_names=True).
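
To make the two path schemes above concrete, here is a minimal sketch. The function names hashtree_path and url_fullpath are hypothetical and not part of fsspec, and hashing the URL with SHA-256 is an assumption made only for illustration.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def hashtree_path(url: str) -> str:
    """Hypothetical helper: hash[:2]/hash[2:4]/hash[4:] layout."""
    # Assumption: the cache key is the SHA-256 hex digest of the URL.
    h = hashlib.sha256(url.encode()).hexdigest()
    # Two levels of leading directories keep each directory small.
    return posixpath.join(h[:2], h[2:4], h[4:])


def url_fullpath(url: str) -> str:
    """Hypothetical helper: scheme/host/path layout built from URI components."""
    parts = urlsplit(url)
    # Strip the leading slash so join() does not discard scheme and host.
    return posixpath.join(parts.scheme, parts.netloc, parts.path.lstrip("/"))


print(url_fullpath("http://domain/p1/p2/filename"))
print(hashtree_path("http://domain/p1/p2/filename"))
```

Query strings and fragments are ignored in this sketch; a real layout would need to decide how to encode them.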

With the above in mind, I think it would be nice if instead of same_names there were a layout={hash,hashtree,url_filename,url_fullpath} option or alike, thus allowing users to switch to the most appropriate layout depending on their use case.
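
One way the deprecation could look, as a rough sketch only: resolve_layout is a hypothetical helper, not existing fsspec API, and the mapping of same_names=True to url_filename and False to hash is my assumption based on the current behaviour described above.

```python
import warnings

# Layout names taken from the proposal; everything else here is hypothetical.
LAYOUTS = ("hash", "hashtree", "url_filename", "url_fullpath")


def resolve_layout(layout=None, same_names=None):
    """Translate the old bool option into the proposed layout name."""
    if same_names is not None:
        if layout is not None:
            raise ValueError("pass either layout or same_names, not both")
        warnings.warn(
            "same_names is deprecated, use layout=... instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumed mapping: True -> url_filename, False -> hash.
        layout = "url_filename" if same_names else "hash"
    if layout is None:
        layout = "hash"  # assumed default, matching same_names=False
    if layout not in LAYOUTS:
        raise ValueError(f"unknown layout {layout!r}, expected one of {LAYOUTS}")
    return layout
```

With this, existing callers passing same_names=True would keep working and get a DeprecationWarning, while new callers would pass e.g. layout="hashtree".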

@martindurant
Member

Agreed that this would be useful too. I would love for someone to implement it.
