
gh-129726: Break gzip.GzipFile reference loop #130055

Merged
merged 3 commits into from
Feb 28, 2025

Conversation

cmaloney
Contributor

@cmaloney cmaloney commented Feb 12, 2025

A reference loop was resulting in the fileobj held by the GzipFile being closed before the GzipFile.

The issue started with gh-89550 in 3.12, but was hidden in most cases until 3.13, when gh-62948 made it more visible. I think this needs backporting to 3.12.
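The shape of the fix can be sketched with a hypothetical pair of classes (`Outer`/`Inner` here are illustrative, not the actual gzip code): a strong back-reference would form a cycle that only the cyclic GC can collect, with no guaranteed finalization order between the two objects, while a weak back-reference lets plain reference counting free the outer object promptly on CPython.

```python
import weakref

class Inner:
    def __init__(self, outer):
        # A strong back-reference (self.outer = outer) would create a
        # reference cycle; a weak reference keeps Outer collectible by
        # ordinary reference counting, so finalization order stays sane.
        self.outer = weakref.ref(outer)

class Outer:
    def __init__(self):
        self.inner = Inner(self)

o = Outer()
inner = o.inner
del o                          # Outer is freed immediately: no cycle holds it
assert inner.outer() is None   # the weak back-reference is now dead
```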

Sample of test failing on main:

$ ./python -Wd -E -m test -uall -M8g test_gzip 
Using random seed: 195107833
0:00:00 load avg: 0.54 Run 1 test sequentially in a single process
0:00:00 load avg: 0.54 [1/1] test_gzip
test test_gzip failed -- Traceback (most recent call last):
  File "/<workdir>/python/cpython/Lib/test/test_gzip.py", line 872, in test_refloop_unraisable
    self.assertIsNone(cm.unraisable)
    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
AssertionError: UnraisableHookArgs(exc_type=<class 'ValueError'>, exc_value=ValueError('I/O operation on closed file.'), exc_traceback=<traceback object at 0x76afdad089b0>, err_msg='Exception ignored while finalizing file <gzip on 0x76afdad08910>', object=None) is not None

test_gzip failed (1 failure)

== Tests result: FAILURE ==

1 test failed:
    test_gzip

Total duration: 735 ms
Total tests: run=73 failures=1
Total test files: run=1/1 failed=1
Result: FAILURE

loop found by @danifus, bug found and explored by @xrmx

A reference loop was resulting in the `fileobj` held by the `GzipFile`
being closed before the `GzipFile`.

The issue started with pythongh-89550 in 3.12, but was hidden in most cases
until 3.13 when pythongh-62948 made it more visible.
@@ -10,6 +10,7 @@
import builtins
import io
import _compression
import weakref
Contributor Author

the ordering of imports here is painful to me, but resisted bringing to PEP8 because this fix needs backporting.

Member

Please reformat these imports for PEP 8 :-) I will handle merge conflicts on backports if needed.

Contributor

@xrmx xrmx left a comment

Tested with Python 3.13 and with Python 3.12 -X dev against their respective gzip module patched with this diff and cannot reproduce the issue anymore with my snippet. Thanks!

@danifus
Contributor

danifus commented Feb 13, 2025

Nice work. +1 on a backport as I think this can result in data loss if there's data waiting to be flushed from the buffer or compressor.

Should we add some docs saying that the GzipFile object should be closed explicitly instead of relying on finalizers to clean it up so as to not rely on GC heuristics?

@cmaloney
Contributor Author

@danifus 👍. It sounds like the production code from the report did call .close() (#129726 (comment)); that feels to me like another issue that could be resolved: close to me should mean all data is flushed and the GzipFile will not try to write again (and its write methods should error). Not sure what this last little bit of data is / why it was still around after .close() and tried getting written on __del__.

@danifus
Contributor

danifus commented Feb 14, 2025

Hmmm, I had missed that production example and it baffles me :p I don't see how more data could be written after .close() has been called once.

We have this check at the start of .close():

cpython/Lib/gzip.py

Lines 353 to 355 in 05e89c3

fileobj = self.fileobj
if fileobj is None or self._buffer.closed:
return

and this in the finally block at the end of that .close():

cpython/Lib/gzip.py

Lines 365 to 366 in 05e89c3

finally:
self.fileobj = None

Seems pretty watertight.

@xrmx it's possible you saw this issue in your code in the logger.error("dropping flushed data due to transport failure back-off") branch of the following if statement (the one that doesn't close the GzipFile). Let me know if that's not the case and I'll keep investigating.

        if not self.state.should_try():
            logger.error("dropping flushed data due to transport failure back-off")
        else:
            fileobj = buffer.fileobj  # get a reference to the fileobj before closing the gzip file
            buffer.close()

            data = fileobj.getbuffer()
            data = gzip.compress(buffer.getvalue(), compresslevel=self._compress_level)

@xrmx
Contributor

xrmx commented Feb 14, 2025

@xrmx it's possible you saw this issue in your code in the logger.error("dropping flushed data due to transport failure back-off") branch of the following if statement (the one that doesn't close the GzipFile). Let me know if that's not the case and I'll keep investigating.

        if not self.state.should_try():
            logger.error("dropping flushed data due to transport failure back-off")
        else:
            fileobj = buffer.fileobj  # get a reference to the fileobj before closing the gzip file
            buffer.close()

            data = fileobj.getbuffer()
            data = gzip.compress(buffer.getvalue(), compresslevel=self._compress_level)

I think I've only seen it from buffer.close(). Still, with the patched 3.13 (so +2 on line numbers), I have seen (at least) another occurrence of the same stacktrace:

Traceback (most recent call last):
  File "/usr/lib/python3.13/gzip.py", line 361, in close
    fileobj.write(self.compress.flush())
ValueError: I/O operation on closed file.

It comes from test_send_remote_failover_sync_non_transport_exception_error that on its own prints these:

Exception ignored in: <_io.BytesIO object at 0x7f54eae0a6b0>
BufferError: Existing exports of data: object cannot be re-sized
Exception ignored in: <_io.BytesIO object at 0x7f54eada7420>
BufferError: Existing exports of data: object cannot be re-size

And when run with another test it sometimes prints the traceback. I also see these BufferErrors when handle_transport_fail is called. But I'll need some time to understand what's going on and provide another reproducer.

@danifus
Contributor

danifus commented Feb 14, 2025

That stacktrace could appear in 2 ways:

  • calling .close() directly
  • .close() being called in the finalizer before being garbage collected (__del__()).

Both would give you that traceback location, but only the occurrence in the finalizer (as the object goes out of scope) will also give you the Exception ignored in: <gzip on 0x7fa4fd99c550> text (exceptions are ignored in finalizers).

BufferError: Existing exports of data: object cannot be re-sized is kind of related to this. You can't write to or close a BytesIO object while the buffer returned from BytesIO.getbuffer() is still held somewhere.

If you claim the buffer with BytesIO.getbuffer() before closing the GzipFile, then when the GzipFile is closed (or destroyed / finalized as it goes out of scope), .close() will attempt to flush data to the BytesIO object and you will get the BufferError: Existing exports of data: object cannot be re-sized exception. Try doing something like:

import io
import gzip

gzf = gzip.GzipFile(fileobj=io.BytesIO(), mode="wb")
fileobj = gzf.fileobj
gzf.close()
buf = fileobj.getbuffer()  # Grab the buffer after the GzipFile obj has been closed
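That interaction can be reproduced with a plain BytesIO, no gzip involved (a minimal sketch; the same BufferError is what GzipFile.close() trips over when it flushes into a BytesIO whose buffer is still exported):

```python
import io

bio = io.BytesIO(b"abc")
view = bio.getbuffer()    # export the underlying buffer
try:
    bio.write(b"wxyz")    # needs a resize while the export is live
except BufferError:
    pass                  # "Existing exports of data: object cannot be re-sized"
view.release()            # drop the export
bio.write(b"wxyz")        # now the write succeeds
```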

As for the original issue, are you able to see if code like this:

        if not self.state.should_try():
            buffer.close()  # close buffer in this branch of the if statement too
            logger.error("dropping flushed data due to transport failure back-off")
        else:
            ...

resolves your issue without this weak reference fix applied to your python?

@xrmx
Contributor

xrmx commented Feb 14, 2025

That stacktrace could appear in 2 ways:

* calling `.close()` directly

* `.close()` being called in the finalizer before being garbage collected (`__del__()`).

Both would give you that traceback location, but only the occurrence in the finalizer (as the object goes out of scope) will also give you the Exception ignored in: <gzip on 0x7fa4fd99c550> text (exceptions are ignored in finalizers).

BufferError: Existing exports of data: object cannot be re-sized is kind of related to this. You can't write to or close a BytesIO object while the buffer returned from BytesIO.getbuffer() is still held somewhere.

If you claim the buffer with BytesIO.getbuffer() before closing the GzipFile, then when the GzipFile is closed (or destroyed / finalized as it goes out of scope), .close() will attempt to flush data to the BytesIO object and you will get the BufferError: Existing exports of data: object cannot be re-sized exception. Try doing something like:

import io
import gzip

gzf = gzip.GzipFile(fileobj=io.BytesIO(), mode="wb")
fileobj = gzf.fileobj
gzf.close()
buf = fileobj.getbuffer()  # Grab the buffer after the GzipFile obj has been closed

Thanks for the explanation. Adding the explicit buffer.close() on the other path does not remove the Exception ignored in: <gzip on 0x7f551ebd2c80> that I'm still seeing with the patch applied. I need to dig more.

As for the original issue, are you able to see if code like this:

        if not self.state.should_try():
            buffer.close()  # close buffer in this branch of the if statement too
            logger.error("dropping flushed data due to transport failure back-off")
        else:
            ...

resolves your issue without this weak reference fix applied to your python?

Nope, adding buffer.close() does not remove all the gzip stacktraces I was seeing before.

@xrmx
Contributor

xrmx commented Feb 14, 2025

Oh the elephant in the room is that this code is running as a daemon thread and so it well may be that I got these when it gets killed at exit.

@xrmx
Contributor

xrmx commented Feb 14, 2025

Oh the elephant in the room is that this code is running as a daemon thread and so it well may be that I got these when it gets killed at exit.

That's not the problem, after poking a bit with the code the following quote from @danifus started to make sense to me:

If you claim the buffer with BytesIO.getbuffer() before closing the GzipFile, when the GzipFile is closed (or destroyed / finalized
as it goes out of scope), data will attempt to be flushed to the BytesIO object in .close() and you will get the BufferError: Existing
exports of data: object cannot be re-size exception.

Adding a data.release() after closing the gzip appears to have fixed the other warnings. Now I still have a single occurrence of the original stacktrace, but it's near a test that now raises ValueError: operation forbidden on released memoryview object because of the memoryview release, so maybe that test is poking at internals too.

BTW the BytesIO docs talk about a view that you should release, but they don't say how and don't link to the memoryview docs.

Thanks a lot @cmaloney and @danifus for the help!

@danifus
Contributor

danifus commented Feb 15, 2025

Glad to hear most of the exceptions are gone.

As for the last one, I think you'll just want to del the instance of the view you got from .getbuffer(). I think your current issue is that there are 2 references to the same memoryview and your .release() in one location is impacting the other.

Instead, just do something like:

import io

bio = io.BytesIO()
buf = bio.getbuffer()
del buf  # remove this instance and let the GC release it if no other references exist.

@xrmx
Contributor

xrmx commented Feb 17, 2025

Glad to hear most of the exceptions are gone.

As for the last one, I think you'll just want to del the instance of the view you got from .getbuffer(). I think your current issue is that there are 2 references to the same memoryview and your .release() in one location is impacting the other.

Instead, just do something like:

import io

bio = io.BytesIO()
buf = bio.getbuffer()
del buf  # remove this instance and let the GC release it if no other references exist.

The last one is from a test that mocks the function receiving the view as an argument and then tries to read it after it has already been released. If I change the code as you suggested, running this test gives EOFError: Compressed file ended before the end-of-stream marker was reached from /usr/lib/python3.13/gzip.py:616.

@danifus
Contributor

danifus commented Feb 18, 2025

Ah ok, that does sound a bit more involved. You're right, the .release() method is the right way to go.

As far as this issue is concerned, I think the .close() method doesn't have any issues and we should add some documentation saying that GzipFile should be closed explicitly to avoid a potential race condition during garbage collection (mitigated with this patch but may be subject to resurrection if the GC algorithm is changed or a different python implementation uses a different GC). What do you guys think?

@cmaloney
Contributor Author

I think it's still worthwhile to fix, since CPython shouldn't error if a GzipFile doesn't explicitly have .close() called. That's likely if an exception is thrown and code isn't using with / a context manager.

Some documentation improvements would definitely also be good, I think. .close() is important for getting correct behavior.
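As a concrete illustration of the with form: the context manager guarantees close() (and the final flush of compressed data) runs even if an exception interrupts the writes.

```python
import gzip
import io

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gzf:
    gzf.write(b"payload")  # close() runs on exit, even on an exception

# close() flushed the compressor, so buf holds a complete gzip stream.
assert gzip.decompress(buf.getvalue()) == b"payload"
```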

@xrmx
Contributor

xrmx commented Feb 19, 2025

As far as this issue is concerned, I think the .close() method doesn't have any issues and we should add some documentation saying that GzipFile should be closed explicitly to avoid a potential race condition during garbage collection (mitigated with this patch but may be subject to resurrection if the GC algorithm is changed or a different python implementation uses a different GC). What do you guys think?

Fyi without the gzip patch and with the explicit cleanup in my code I still see a ton of ValueError: I/O operation on closed file. warnings and it's not trivial to track them down 😓

@cmaloney
Contributor Author

cmaloney commented Feb 26, 2025

@vstinner I think this is ready for review. (Not sure why the issue-number bot didn't link it back to the original issue.) This is the exception-on-close issue that became visible with your change to log exceptions when close() fails (GH-12786, which was part of https://bugs.python.org/issue18748 / gh-62948).

Lib/gzip.py Outdated
@@ -226,7 +227,8 @@ def __init__(self, filename=None, mode=None,
0)
self._write_mtime = mtime
self._buffer_size = _WRITE_BUFFER_SIZE
self._buffer = io.BufferedWriter(_WriteBufferStream(self),
write_wrap = _WriteBufferStream(weakref.proxy(self))
Member

Would it be possible to modify _WriteBufferStream to use a simpler weakref.ref() to store gzip_file instead? This object is private so we can change its API.

Something like:

class _WriteBufferStream(io.RawIOBase):
    """Minimal object to pass WriteBuffer flushes into GzipFile"""
    def __init__(self, gzip_file):
        self.gzip_file = weakref.ref(gzip_file)

    def write(self, data):
        gzip_file = self.gzip_file()
        if gzip_file is None:
            raise RuntimeError("lost gzip_file")
        return gzip_file._write_raw(data)

Contributor Author

Updated to your suggestion here. I spent a bit of time trying to remove _WriteBufferStream and just use weakref.proxy, but arranging for GzipFile.write to be a different write function from the one the underlying self._buffer (a BufferedWriter) calls (which needs to use self._write_raw) led me in a lot of circles. Just keeping _WriteBufferStream was more understandable than the things I tried.
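For readers following the weakref.ref vs weakref.proxy distinction discussed here (illustrative only, not the gzip code): a ref must be called to obtain the target and returns None once it is gone, while a proxy stands in for the target and raises ReferenceError on use after the target dies.

```python
import weakref

class Target:
    pass

t = Target()
r = weakref.ref(t)    # must be called: r() returns the target, or None when dead
p = weakref.proxy(t)  # behaves like the target itself

assert r() is t
del t                 # only weak references remain, so Target is freed
assert r() is None    # a dead ref just returns None
try:
    p.anything        # a dead proxy raises ReferenceError on any use
except ReferenceError:
    pass
```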

Member

@vstinner vstinner left a comment

LGTM

@vstinner vstinner merged commit 7f39137 into python:main Feb 28, 2025
43 checks passed
@vstinner vstinner added the needs backport to 3.13 bugs and security fixes label Feb 28, 2025
@miss-islington-app

Thanks @cmaloney for the PR, and @vstinner for merging it 🌮🎉. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 28, 2025
A reference loop was resulting in the `fileobj` held by the `GzipFile`
being closed before the `GzipFile`.

The issue started with pythongh-89550 in 3.12, but was hidden in most cases
until 3.13 when pythongh-62948 made it more visible.
(cherry picked from commit 7f39137)

Co-authored-by: Cody Maloney <cmaloney@users.noreply.github.com>
@bedevere-app

bedevere-app bot commented Feb 28, 2025

GH-130669 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Feb 28, 2025
@vstinner vstinner added the needs backport to 3.12 bug and security fixes label Feb 28, 2025
@miss-islington-app

Thanks @cmaloney for the PR, and @vstinner for merging it 🌮🎉. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 28, 2025
A reference loop was resulting in the `fileobj` held by the `GzipFile`
being closed before the `GzipFile`.

The issue started with pythongh-89550 in 3.12, but was hidden in most cases
until 3.13 when pythongh-62948 made it more visible.
(cherry picked from commit 7f39137)

Co-authored-by: Cody Maloney <cmaloney@users.noreply.github.com>
@bedevere-app

bedevere-app bot commented Feb 28, 2025

GH-130670 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 bug and security fixes label Feb 28, 2025
@vstinner
Member

loop found by @danifus, bug found and explored by @xrmx

PR merged, thank you all! I backported the fix to 3.12 and 3.13.

vstinner pushed a commit that referenced this pull request Feb 28, 2025
…130670)

gh-129726: Break `gzip.GzipFile` reference loop (GH-130055)

A reference loop was resulting in the `fileobj` held by the `GzipFile`
being closed before the `GzipFile`.

The issue started with gh-89550 in 3.12, but was hidden in most cases
until 3.13 when gh-62948 made it more visible.
(cherry picked from commit 7f39137)

Co-authored-by: Cody Maloney <cmaloney@users.noreply.github.com>
vstinner pushed a commit that referenced this pull request Feb 28, 2025
…130669)

gh-129726: Break `gzip.GzipFile` reference loop (GH-130055)

A reference loop was resulting in the `fileobj` held by the `GzipFile`
being closed before the `GzipFile`.

The issue started with gh-89550 in 3.12, but was hidden in most cases
until 3.13 when gh-62948 made it more visible.
(cherry picked from commit 7f39137)

Co-authored-by: Cody Maloney <cmaloney@users.noreply.github.com>