Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zlib Module #6069

Merged
merged 13 commits into from
Apr 6, 2022
Merged

zlib Module #6069

merged 13 commits into from
Apr 6, 2022

Conversation

gamblor21
Copy link
Member

Moving the MicroPython extmod/uzlib to the CircuitPython shared-bindings/module pattern.

Provides zlib decompression functionality.

Copy link
Collaborator

@dhalbert dhalbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I have not looked at the rest of it yet, but we want to call this zlib, not uzlib. If there are incompatible differences with the CPython API (that is, if it's not just a subset), then we would want to change them to match the CPython API.

So call everything zlib instead of uzlib. An example of this renaming without the shared-bindings/shared-module changes is re, which is ure in MicroPython.

@FoamyGuy
Copy link
Collaborator

Thanks for working on this @gamblor21!

I tested the functionality of this successfully with this:

import uzlib
data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03EPAn\xc2@\x0c\xbc\xe7\x15\xd6\x9e\tJ\x02\xa2Jn\xfd\x00\x87\xaa\\ZU\xc8\r.\xac\xbaI\xc0\xeb\xd0\xa2(\x7f\xafw\x13\xca\x9e<c\xcf\xd8\xb3C\x02`Zl\xc8T`\x98\xd0\x89\xd5z\x11X\x8f\xcd\xd9\xd9\xf6\xa8\x9da\x8c\xcc\xa5\'\xbe\x05\xa8@aC\xc2\xb6\xf6J\xbcGB)\x96\nk\xb1W\xdaybo"\xfd\xb1\x98\xc7\xf17e\xf2\xbd\x93 \xc93}\xda\x98\x9c\xd5\n\x1f\xc6\xf7{\x9e\xa3\x15D/x\xb1\xc7\x93\xc0\xb6\xfb1\xb3\xdf\x81|\xcd\xf6,\xb6k\xc3\xf0\xb6o>\x89\xa1\xfb\x82>\xce\xd7=3\xb5\xe2np\xb5\xde\x8a\x06\x01t\x0e\xb4$\xbf4\xff\x9b\x0f\x187O\t\x86{\x8e)\xc4>*;\x0e\xf7\x9aU\x96\x97e6E\x1a\x939\x96\x91N\xd0\xf9\xc7\x17\t~S\xbbG\t\x8a"+\x8a4+\xd2\xbc|\xcd\x9f\xaau^\x15\xeb\xe5f\xb5y3\xc9\xf8\x07\xbb\x92\xbdow\x01\x00\x00'
data_without_header = data[10:]
decompressed = uzlib.decompress(data_without_header, -31)
print(decompressed)

I'm still a bit unclear on how the wbits parameter works, but I found that removing the 10 byte gzip header from the data and then using any negative value (tested -1 to -100) allows the data in the example to decompress.

My interpretation of the micropython docs is it may be meant to work with positive values with data that does include the header. I wasn't able to get any positive values to work in the script above though with or without the header bytes in the data.

@gamblor21
Copy link
Member Author

So call everything zlib instead of uzlib. An example of this renaming without the shared-bindings/shared-module changes is re, which is ure in MicroPython.

Will do. Just a FYI for you (and others) the underlying library itself is called uzlib (it is separate from MP) so I won't change the naming within that but the CP visible calls I'll rename.

@gamblor21 gamblor21 changed the title uzlib Module zlib Module Feb 21, 2022
@gamblor21 gamblor21 marked this pull request as ready for review February 21, 2022 17:18
@FoamyGuy
Copy link
Collaborator

I tried out this branch again after newest commits on Feather TFT ESP32-S2. Confirmed that the name is now zlib and functionality still working as normal for unzipping data from the US Govt. analytics site.

Copy link
Member

@tannewt tannewt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @gamblor21!

It looks like the uzlib API isn't grounded in the real zlib API. The CPython API doesn't have an IO wrapper class at all. The decompress object doesn't really have the stream API.

However, gzip has a GzipFile class and a gzip.decompress() function. I think that's the best approach for making this CPython compatible. Rename to gzip and rework the API to match. https://docs.python.org/3/library/gzip.html

@gamblor21
Copy link
Member Author

I'll take a look a bit more in depth at the gzip API but I did not see anything that allows a stream just the file functionality. I think the stream functionality was required in the example @FoamyGuy had decompressing the web service request. Or can the file functionality be used for a data stream from HTTP?

zlib does have a decompressobj(...) function that states it is used to decompress streams of data (and returns a zlib.Decompress object). But may also take some work to change to fit.

Just as FYI did this as a quick change to hope the extmod could work so depending on the extent of rework this may take a while as it is low on my priority list at the moment.

@FoamyGuy
Copy link
Collaborator

My current code is access response.content which I originally did think was a stream. But the docs list bytes, so perhaps I misunderstood what it was.

I've only taken a quick look, but I didn't see anything in gzip that seems to have the stream functionality.

I do think being able to decompress a stream would be good functionality to have. I can see CPython not benefiting from it as much since RAM is closer to infinite it's less likely to encounter compressed data that doesn't fully fit into RAM with plenty left over. Maybe since the streaming functionality differs from the CPython API we could make a new module name for it? it also doesn't seem to exist inside of gzip that I can tell.

gzip.decompress() also seems odd to me in that it doesn't have the parameter for wbits. The details of that are definitely over my head, but I found that some values for that parameter do not work depending on the compressed data. I can't tell how this method would know what value to use for that when decompressing the data.

Or we could remove the streaming functionality entirely to match the CPython API closer? I think it seems nice to have, but maybe I overestimate it's usefulness.

As far as the zlib.decompress() function though, we are currently in alignment with the CPython API as far as I understand it.

This code runs and has the same output in both CPython and CircuitPython (built from this branch):

import zlib
data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03EPAn\xc2@\x0c\xbc\xe7\x15\xd6\x9e\tJ\x02\xa2Jn\xfd\x00\x87\xaa\\ZU\xc8\r.\xac\xbaI\xc0\xeb\xd0\xa2(\x7f\xafw\x13\xca\x9e<c\xcf\xd8\xb3C\x02`Zl\xc8T`\x98\xd0\x89\xd5z\x11X\x8f\xcd\xd9\xd9\xf6\xa8\x9da\x8c\xcc\xa5\'\xbe\x05\xa8@aC\xc2\xb6\xf6J\xbcGB)\x96\nk\xb1W\xdaybo"\xfd\xb1\x98\xc7\xf17e\xf2\xbd\x93 \xc93}\xda\x98\x9c\xd5\n\x1f\xc6\xf7{\x9e\xa3\x15D/x\xb1\xc7\x93\xc0\xb6\xfb1\xb3\xdf\x81|\xcd\xf6,\xb6k\xc3\xf0\xb6o>\x89\xa1\xfb\x82>\xce\xd7=3\xb5\xe2np\xb5\xde\x8a\x06\x01t\x0e\xb4$\xbf4\xff\x9b\x0f\x187O\t\x86{\x8e)\xc4>*;\x0e\xf7\x9aU\x96\x97e6E\x1a\x939\x96\x91N\xd0\xf9\xc7\x17\t~S\xbbG\t\x8a"+\x8a4+\xd2\xbc|\xcd\x9f\xaau^\x15\xeb\xe5f\xb5y3\xc9\xf8\x07\xbb\x92\xbdow\x01\x00\x00'
data_body = data[10:]
decompressed = zlib.decompress(data_body, -15)
print(decompressed)

@tannewt
Copy link
Member

tannewt commented Feb 22, 2022

Just as FYI did this as a quick change to hope the extmod could work so depending on the extent of rework this may take a while as it is low on my priority list at the moment.

Ya, totally get that. We do have a third option of picking a non-CPython name for the existing API. I'd avoid uzlib because it's not like zlib.

Can't the file API be used for streams? I wonder how the real requests library handles gzip.

@FoamyGuy
Copy link
Collaborator

If a new name were selected would it be for the stream API only and decompress() would stay under zlib since zlib.decompress() does seem to match the CPython API as-is. Or would decompress() move as well?

I tried poking around in https://github.com/psf/requests to figure out how it's handled in there, but I'm coming up empty. I tried some git greps for terms like gzip, zlib, and zip and the majority of the results I found seem to be in the documentation or the test files. This was the only code I found that wasn't in tests or docs that I could identify as doing anything related to zip:
https://github.com/psf/requests/blob/5e749546a26ce62e53d0a14ae1455f8b066fe19f/requests/utils.py#L244 but this does not really seem like the code that is responsible for the "auto-unzipping" that the library does.

I am a bit perplexed to not to find much with gzip because I would have assumed some portion of the code would be checking the content-encoding header for that value to determine that the data needs to be decompressed.

@dhalbert
Copy link
Collaborator

dhalbert commented Feb 23, 2022

@FoamyGuy Looks like psf/requests uses the zipfile module. It writes a tmp file and then unzips it, in the utils.py file you found.

Never mind, @Neradoc gave you the correct answer.

@tannewt
Copy link
Member

tannewt commented Feb 23, 2022

If a new name were selected would it be for the stream API only and decompress() would stay under zlib since zlib.decompress() does seem to match the CPython API as-is. Or would decompress() move as well?

I'd be ok if they were split into two modules.

@gamblor21
Copy link
Member Author

Did some quick research. First off from what has been mentioned I can have just decompress in zlib but any thoughts on a module name if we want/need DecompIO? Maybe just that decompio or maybe zlibio?

I also looked quickly into how CPython works and what it uses. The full zlib is almost 100K so that is a non-starter.

I think if things are set up and then future work could be done finding a smaller zlib compatible library (or zlib may have some smaller options already I'm not sure). At a later date then more of the CPython functionality could be included.

@tannewt
Copy link
Member

tannewt commented Mar 1, 2022

I believe DecompIO can be made into gzip.GzipFile(). The wrapped "stream" is the fileobj parameter because it can be a BytesIO or any other file-like object. Streams and files are basically the same.

GzipFile implements the io.BufferedIOBase interface that has read, readinto, and readline.

We're not sure what wbits is used for, so I'd just drop it and make it gzip.decompress().

So, I still think we want to make this gzip.

@gamblor21
Copy link
Member Author

I will take a look into the DecompIO being able to read files. Just need some time to figure out how that all works but I don't see it being too hard.

I did read some information on what wbits are and how they matter. It has to do with the type of compression used. I think for gzip we may not need them or they can be limited. I will do some more research on it as well.

One thing just to keep in mind (for anyone reading this in the future) the CPython gzip module uses the zlib module to do most of the work. The gzip.decompress function is just called the zlib.decompress function with certain parameters set. (As close as I can tell)

@gamblor21
Copy link
Member Author

I did find some information about wbits (putting it here for future reference if needed):

  • to (de-)compress deflate format, use wbits = -zlib.MAX_WBITS
  • to (de-)compress zlib format, use wbits = zlib.MAX_WBITS
  • to (de-)compress gzip format, use wbits = zlib.MAX_WBITS | 16

From https://stackoverflow.com/questions/3122145/zlib-error-error-3-while-decompressing-incorrect-header-check
(More information located there as well)

@zestyping
Copy link

Hi, I'm pretty interested in using decompression in CircuitPython. What remaining steps need to happen for this to make it into the next release? If there are things I can help with, let me know.

@gamblor21
Copy link
Member Author

Hi, I'm pretty interested in using decompression in CircuitPython. What remaining steps need to happen for this to make it into the next release? If there are things I can help with, let me know.

Are there any specific parts of this you are looking for? Short term I want to get at least zlib.decompress() working. Longer term is time permits getting the gzip module working would be ideal but takes some more work.

@zestyping
Copy link

zestyping commented Mar 27, 2022

Hi @gamblor21 — thanks for the update! I'm implementing over-the-air software installation for a MatrixPortal M4, and my hope is to be able to unpack a downloaded zip file full of Python files. I know that zipfile support might be a long way off, but even having basic decompression (such as zlib) would get me pretty far; perhaps I could implement a subset of tar format or a simple invented format in plain Python, for example. Happy to help you if needed!

@gamblor21 gamblor21 force-pushed the uzlib-module branch 2 times, most recently from 100a2bc to 1d470c2 Compare April 3, 2022 16:23
@dhalbert
Copy link
Collaborator

dhalbert commented Apr 5, 2022

@gamblor21 is this self-contained as it is? Should we go ahead and merge this to get part of the functionality?

If you could make an issue (or edit an existing one) with a task list (use [ ] in a list to allow checking off done tasks), that would be great.

@gamblor21
Copy link
Member Author

@gamblor21 is this self-contained as it is? Should we go ahead and merge this to get part of the functionality?

Yes this should be fully self-contained. Worse case if I missed something CIRCUITPY_ZLIB=0 will quickly turn it off.

If you could make an issue (or edit an existing one) with a task list (use [ ] in a list to allow checking off done tasks), that would be great.

Will do in the next day or so.

Copy link
Collaborator

@dhalbert dhalbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's merge what we have and I'll describe it as "preliminary" and/or "experimental". Thanks!

@dhalbert
Copy link
Collaborator

dhalbert commented Apr 6, 2022

@gamblor21 I re-read @tannewt's comments, and it sounds like he would like it all wrapped up in gzip, instead of exposing zlib? Do I understand correctly. Did you have further conversations with him in Discord (say), or elsewhere? I approved and then re-read, and now I'm a bit unsure.

@FoamyGuy
Copy link
Collaborator

FoamyGuy commented Apr 6, 2022

I'd be ok if they were split into two modules.

@dhalbert My understanding based on t Scott's comment from 2/3 (quoted) was that it was okay to split the zlib portion into it's own thing since it already matches the existing CPython API. I understood the later posts to mean that for the DecompIO / stream related APIs when those are made to match CPython since they don't today that they would become the gzip module.

It's possible that my interpretation is incorrect though. I know it was also discussed in the weeds a bit during one or two of the meetings but I'm not certain of which dates it was, and I think for at least one of them it was a week that Scott wasn't present.

@dhalbert
Copy link
Collaborator

dhalbert commented Apr 6, 2022

Got it, thanks, I was not following it that closely. I'll go ahead, since this is useful functionality, and gzip will end up using zlib anyway.

@dhalbert dhalbert dismissed tannewt’s stale review April 6, 2022 16:06

partial implementation now matches discussion, we believe

@dhalbert dhalbert merged commit 2693a4c into adafruit:main Apr 6, 2022
@gamblor21
Copy link
Member Author

@gamblor21 I re-read @tannewt's comments, and it sounds like he would like it all wrapped up in gzip, instead of exposing zlib? Do I understand correctly. Did you have further conversations with him in Discord (say), or elsewhere? I approved and then re-read, and now I'm a bit unsure.

Hi, as @FoamyGuy said we did discuss it during in the weeds a few weeks ago. I looked and gzip for CPython is implemented in pure python so I'm thinking if zlib can be expanded a little bit more then gzip can be done similar to CPython. I just haven't had a chance to look at file access but I do not see any reason why it cannot be implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants