Ability to rsync zipped directory with gcs ? #1765
-
Update: I was able to get this to work, although it works sequentially.
I used get_mapper to list all the files in GCS, so that I only transfer new files. Building on that, I used ThreadPoolExecutor to create multiple connections to the ZIP filesystem, since the bottleneck seemed to be the connection and not the number of files I was attempting to read simultaneously. I have something that works now, but I'd still be interested in seeing if rsync can apply to my use case.
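For anyone landing here later, the workaround described above can be sketched roughly as follows. This is a standard-library stand-in rather than the actual fsspec code: `sync_new_members`, `open_archive`, `existing`, and `write` are hypothetical names, an in-memory ZIP plays the role of the FTP-hosted archive, and a plain dict plays the role of the GCS bucket. In the real setup, `open_archive` would presumably create a fresh ZIP filesystem over the FTP file, and `existing` would come from the keys listed via get_mapper.

```python
import io
import threading
import zipfile
from concurrent.futures import ThreadPoolExecutor


def sync_new_members(open_archive, existing, write, max_workers=8):
    """Copy archive members not in `existing`, using one archive handle per thread.

    open_archive: zero-argument callable returning a zipfile.ZipFile-like object
                  (hypothetical stand-in for opening the remote ZIP filesystem).
    existing:     set of member names already present at the destination.
    write:        callable(name, data) that stores one file at the destination.
    """
    local = threading.local()
    handles, lock = [], threading.Lock()

    def archive():
        # Each worker thread lazily opens its own handle, so seek() state is never shared.
        if not hasattr(local, "zf"):
            local.zf = open_archive()
            with lock:
                handles.append(local.zf)
        return local.zf

    def copy_one(name):
        write(name, archive().read(name))
        return name

    # List the archive once and keep only the files the destination lacks.
    with open_archive() as listing:
        todo = [i.filename for i in listing.infolist()
                if not i.is_dir() and i.filename not in existing]

    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return sorted(pool.map(copy_one, todo))
    finally:
        for handle in handles:
            handle.close()


# Local stand-ins for the remote archive and the bucket, so the sketch runs anywhere.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(20):
        zf.writestr(f"data/file_{i}.txt", f"payload {i}")
archive_bytes = buf.getvalue()

destination = {"data/file_0.txt": b"payload 0"}  # pretend one file was synced earlier
copied = sync_new_members(
    lambda: zipfile.ZipFile(io.BytesIO(archive_bytes)),
    existing=set(destination),
    write=destination.__setitem__,
)
print(len(copied), "files copied")
```

The per-thread handle is the important part: each worker gets its own seek position, so concurrent reads never interfere with each other.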
-
Your intuition here is correct. rsync operates on ZipFS, which is not asynchronous, so its operations are performed serially even if the calls it makes on the inner filesystem end up being asynchronous. The reason is that ZipFS uses the built-in Python zipfile implementation, which requires a stateful file-like object to operate on: it calls seek() constantly to fetch parts. In your workflow, you are essentially cloning this file-like object to get a set of independent states. It would be possible to write an async ZipFS (if the inner filesystem is also async) that could rsync concurrently, since the offsets of the individual compressed component files are easily obtainable using the classes of zipfile. Some care would need to be taken with large (compressed) component files, since the deflate compression algorithm doesn't support random access or blocks, only streaming.
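To illustrate the point about offsets, here is a minimal sketch, using only the standard library, of pulling one member out of an archive with a single byte-range read. `read_member_by_range` is a hypothetical helper, not part of fsspec or zipfile, and slicing a bytes object stands in for a ranged request to the inner filesystem. It assumes unencrypted members that are either stored or deflated.

```python
import io
import struct
import zipfile
import zlib


def read_member_by_range(raw, info):
    """Decode one member from `raw` using only the offsets recorded in its ZipInfo."""
    start = info.header_offset  # position of the member's local file header
    # The local header has 30 fixed bytes; the name and extra-field lengths
    # are little-endian uint16 values at byte offsets 26 and 28.
    name_len, extra_len = struct.unpack("<HH", raw[start + 26:start + 30])
    data_start = start + 30 + name_len + extra_len
    payload = raw[data_start:data_start + info.compress_size]

    if info.compress_type == zipfile.ZIP_STORED:
        return payload
    if info.compress_type == zipfile.ZIP_DEFLATED:
        # ZIP stores raw deflate streams (no zlib header), hence wbits=-15.
        inflater = zlib.decompressobj(-15)
        return inflater.decompress(payload) + inflater.flush()
    raise NotImplementedError(f"unsupported compression type {info.compress_type}")


# Build a small archive in memory to stand in for the remote file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", "hello " * 100)
    zf.writestr("b/c.txt", "world")
raw = buf.getvalue()

# zipfile is used only to read the central directory, i.e. to learn the offsets.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    infos = [i for i in zf.infolist() if not i.is_dir()]

contents = {i.filename: read_member_by_range(raw, i) for i in infos}
```

Each member's range is independent of the others, which is what would let an async implementation fetch them concurrently. A deflated member still has to be inflated from its first byte, so a very large member cannot be split into parallel chunks this way.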
-
Thanks for the answer, I appreciate it, and I understand better why multithreading works well to speed things up in this case. I have a few other FTP servers I’m working with that host zip files containing only a dozen or so files, so rsync could be a good fit there to reduce the amount of code. However, using the syntax in my initial comment, I can’t get rsync to recognize a chained URL, since strip protocol fails, nor can I get rsync to use paths that I created using the generic file system. Did I make a mistake with the syntax for the generic file system? Basically, for a smaller-scale use case, is it feasible to use rsync between a remote zip source and a remote unzipped destination?
-
Hi,
Thanks for this great library.
I’m wondering if it’s possible to use rsync for my use case. So far, after trying a few different methods I have not been successful. I’ve also reviewed the Q&A and issues and wasn’t able to find a clear example of what I’m trying to do.
I would like to sync a zipped directory hosted on an FTP server, to an unzipped directory on GCS.
Rsync would decompress and copy files from ftp://user:pass@ftp.example.com/path/to/archive.zip to gs://my-bucket/targetdirectory with update condition set to never.
Is this supported? And if so, would it be possible to get an example of the syntax required? So far my efforts all result in a StartsWith error or an IsADirectoryError.
I’ve tried two approaches. First, I passed the FTP URL ending in .zip using URL chaining, as shown in the URL Chaining documentation, which resulted in a "protocol not recognized" error (zip::ftp). Second, I opened the FTP file with ZipFileSystem and set the source to (myzipfilesysteminstance, "") or "/" to represent the root, even though rsync needs a string.
Based on some issues I read, I also tried registering a generic file system like this
I noticed that url_to_fs does return the path to my GCS directory, but in the case of the zip the URL is shown as an empty string.
How do I get rsync to recognize my zip file system as a directory?
My zip file is quite large (several GB) and contains tens of thousands of files, so I’m looking for a clean and efficient solution that may also leverage concurrency to sync the directories, which rsync seems to provide.
I am somewhat new to programming, so I’m not sure if what I want is technically possible or, if not, how I could go about this using fsspec (maybe using copy with many files to a directory). Any help would be appreciated.