
weird errors related to WINDOW_SIZE updates #121

Closed
MangoIV opened this issue Jun 10, 2024 · 39 comments · Fixed by #123

Comments

@MangoIV

MangoIV commented Jun 10, 2024

When updating from http2 4.1.4 to 5.2.2, I started encountering occasional failures with this error message:

StreamErrorIsSent FlowControlError 1 "window update for stream 1 is overflow"

There are some weird things in the log, where it tries to increase the upper bound of the window by 2^31 - 1.

I have tried troubleshooting it, but I haven't had any luck reproducing it with a small example. It looks like it might be related to an endianness bug.

I'm at a bit of a loss here. I was looking at whether it might have been nginx modifying http2 frames, but I'd be very surprised if that were the case.

Do you have any ideas on where I could start troubleshooting?
What code actually picks which size to update the window by?
What would happen if the window upper bound actually reached maxBound @Int? Is there any logic to reset the window size?

Thank you so much in advance for any pointers!
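[Editorial note: for context, RFC 9113 Section 6.9.1 caps every flow-control window at 2^31 - 1 octets, and a WINDOW_UPDATE that would push a window past that limit must be treated as a FlowControlError, which matches the error message above. A minimal sketch of that rule, with hypothetical names, not the http2 library's actual code:]

```haskell
-- Sketch of the RFC 9113 Section 6.9.1 window-update rule (illustrative
-- only; the names here are not from the http2 library).
maxWindowSize :: Int
maxWindowSize = 2 ^ (31 :: Int) - 1  -- 2147483647

-- Returns the new window, or Nothing if the update would push the window
-- past the limit, which the receiver must treat as a FlowControlError.
applyWindowUpdate :: Int -> Int -> Maybe Int
applyWindowUpdate window increment
  | window + increment > maxWindowSize = Nothing
  | otherwise                          = Just (window + increment)

main :: IO ()
main = do
  print (applyWindowUpdate 65535 1048576)      -- Just 1114111
  print (applyWindowUpdate 262144 2147418112)  -- Nothing (overflow)
```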

@MangoIV
Author

MangoIV commented Jun 10, 2024

I have already tried the patches submitted by @edsko in http-semantics and they unfortunately do not fix my issues.

@kazu-yamamoto
Owner

Where does this error occur? On the client side or the server side?
Also, please tell me the CPU architecture of that side.

@MangoIV
Author

MangoIV commented Jun 10, 2024

It is x86_64-linux - I will check again whether it only occurs on one side tomorrow and report here :)

@kazu-yamamoto
Owner

Please check that the size of Int is 8 on both sides:

ghci> import Foreign.Storable 
ghci> sizeOf (1 :: Int)
8

@MangoIV
Author

MangoIV commented Jun 11, 2024

It should be 8 on both the server and the client, as I can reproduce it even when they're running on the same machine. I think the error itself actually occurs on the receiving side, but only after the sender updates the window size to 2^31 - 1. Example:

increased window size by: 2147418112
current window size: 2147483645
current limit: 2147483647
current sent: 2
stream id: 0

increased window size by: 2147418111
current window size: 2147680253
current limit: 2147680255
current sent: 2
stream id: 1
 
updating peer settings:
old window size 262144
new window size 65536
difference -196608
 
increased window size by: -196608
current window size: 2147483645
current limit: 2147483647
current sent: 2
stream id: 1

-- here we would see the error showing up in the logs

The update of the peer settings happens in the sender. I don't know what exactly triggers the increase of the window size by 2^31 - 1 - currentSent, so it would be nice if you could point me at the code that calculates how much the sender is going to increase the window size by.

Thank you so much!
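[Editorial note: the "difference -196608" entry in the log above matches the RFC 9113 Section 6.9.2 rule: when a peer changes SETTINGS_INITIAL_WINDOW_SIZE, the sender must adjust every stream's flow-control window by the difference between the new and old values, which can even make a window negative. A sketch of that arithmetic, with hypothetical names, not the library's code:]

```haskell
-- RFC 9113 Section 6.9.2: when SETTINGS_INITIAL_WINDOW_SIZE changes, the
-- sender adjusts every stream's flow-control window by (new - old).
-- Illustrative sketch with hypothetical names, not the http2 library's code.
adjustStreamWindow :: Int -> Int -> Int -> Int
adjustStreamWindow oldInitial newInitial window =
  window + (newInitial - oldInitial)

main :: IO ()
main = do
  -- The values from the log above: 262144 -> 65536 gives a -196608 shift.
  print (65536 - 262144 :: Int)                   -- -196608
  print (adjustStreamWindow 262144 65536 262142)  -- 65534
```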

@kazu-yamamoto
Owner

First of all, you should read the documentation of https://hackage.haskell.org/package/network-control-0.1.0/docs/Network-Control.html

@kazu-yamamoto
Owner

I will try to implement a feature to transfer big data in util/h2-client and util/h3-client to reproduce this issue.

@kazu-yamamoto
Owner

adb6b61 in main provides this feature:

First, run h2c-server:

% h2c-server 127.0.0.1 8080

Then, run h2c-client:

% h2c-client 127.0.0.1 8080 -t 10000000000
Throughput 6070.49 Mbps (10000000000 bytes in 12568 msecs)

Unfortunately I cannot reproduce your problem.

@edsko
Collaborator

edsko commented Jun 14, 2024

FWIW, we are also hunting a bug: we're getting segfaults in a particular test (or sometimes a bus error), and we believe it has something to do with the write buffer. We'll continue digging and let you know. It may or may not be related to the window update problem; I'm unsure.

@MangoIV
Author

MangoIV commented Jun 14, 2024

@kazu-yamamoto I am pretty sure that it has nothing to do with sending a lot of data, but rather with concurrently doing window updates up to the max bound and the sender and receiver getting out of sync. It doesn't reproduce in all cases, only in some concurrent scenarios. We have a hunch about what it could be, but we can only confirm next week.

@kazu-yamamoto
Owner

kazu-yamamoto commented Jun 14, 2024

@MangoIV That sounds like network-control introduces critical sections. Uhhhm.

@kazu-yamamoto
Owner

51cfe54 introduces concurrency:

% h2c-client 127.0.0.1 80 -t 1000000000 / / / / / / / / / / / /
Throughput 797.14 Mbps (1000000000 bytes in 9571 msecs)
Throughput 797.05 Mbps (1000000000 bytes in 9572 msecs)
Throughput 797.05 Mbps (1000000000 bytes in 9572 msecs)
Throughput 796.97 Mbps (1000000000 bytes in 9573 msecs)
Throughput 796.97 Mbps (1000000000 bytes in 9573 msecs)
Throughput 796.97 Mbps (1000000000 bytes in 9573 msecs)
Throughput 796.89 Mbps (1000000000 bytes in 9574 msecs)
Throughput 796.89 Mbps (1000000000 bytes in 9574 msecs)
Throughput 796.89 Mbps (1000000000 bytes in 9574 msecs)
Throughput 796.89 Mbps (1000000000 bytes in 9574 msecs)
Throughput 796.89 Mbps (1000000000 bytes in 9574 msecs)
Throughput 796.89 Mbps (1000000000 bytes in 9574 msecs)

The paths are dummy arguments.
I cannot reproduce the issue yet.

@kazu-yamamoto
Owner

@MangoIV If you could identify which version introduced your issue, that would be very helpful.

@MangoIV
Author

MangoIV commented Jun 14, 2024

I can bisect on Monday if we’re hitting a dead end - thank you so much for your help :)

@MangoIV
Author

MangoIV commented Jun 18, 2024

So the outcome is as follows: I won't be able to work on this issue anymore (except maybe in my free time), as we have decided to prioritize removing our use of the http2 client. This is what I found in the meantime; the situation is as follows:

  • the http/2 client is http2, the http/2 server is nginx
  • we're talking via tls to nginx, the protocol negotiation happens via ALPN
  • it seems as though the following things then happen:
    • nginx immediately updates the window size to 2^31 - 1 by sending a WINDOW_UPDATE frame, assuming the default window size
    • the http2 client sends the client magic (connection preface)
    • the http2 client starts setting up its window size to its standard window size (the 2^18 constant in the http2 library) and after that processes the window update
    • the window update is too large, as the http2 client assumes 2^18 as the current window size
    • http2 throws the window size too large error
    • it immediately recovers but then it's already too late

As a reproducer, I would next try to send a WINDOW_UPDATE frame from the server before the client's magic and see if that alone triggers the issue.

In general, I think it occurs because the SETTINGS update and the WINDOW_UPDATE concurrently update the window size.

Sorry, this is all very vague. I will try to write an actual reproducer in my free time; I just wanted to share this with you :)
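[Editorial note: the failure sequence described above can be checked with plain arithmetic. nginx sizes its WINDOW_UPDATE against the default 65,535-byte window, so applying that same update on top of the client's own 2^18 window exceeds 2^31 - 1. A sketch using the numbers from this thread, not library code:]

```haskell
-- Reconstructing the overflow from the numbers in this thread (sketch).
maxWindow, defaultWindow, clientWindow, update :: Int
maxWindow     = 2 ^ (31 :: Int) - 1        -- RFC 9113 window limit
defaultWindow = 2 ^ (16 :: Int) - 1        -- 65535, the protocol default
clientWindow  = 2 ^ (18 :: Int)            -- http2's own initial window size
update        = maxWindow - defaultWindow  -- 2147418112, as in the log above

main :: IO ()
main = do
  -- What nginx intends: top up the default window exactly to the limit.
  print (defaultWindow + update <= maxWindow)  -- True
  -- What happens when the update lands on the client's 2^18 window instead.
  print (clientWindow + update <= maxWindow)   -- False: the overflow error
```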

@MangoIV MangoIV closed this as completed Jun 18, 2024
@MangoIV MangoIV reopened this Jun 18, 2024
@MangoIV MangoIV changed the title from "weird errors related to WINDOW_SIZE updates -- help wanted" to "weird errors related to WINDOW_SIZE updates" Jun 18, 2024
@MangoIV
Author

MangoIV commented Jun 18, 2024

Additionally, I'm very sorry for not being able to actually bisect; each rebuild takes ~1 hour. I think if I can write a reproducer, I will be able to do the bisection effectively.

@FinleyMcIlwaine
Collaborator

FWIW, the recent patch in http-semantics solved the segfault that @edsko mentioned above, but does not help with this issue according to @MangoIV.

@MangoIV
Author

MangoIV commented Jun 18, 2024

It’s possible that it may help, but I’m doubtful; it doesn’t seem very related. The traffic in Wireshark actually does look weird (in that the server starts sending before the http2 client has sent its hello), and http2 does disagree with what nginx and Wireshark are seeing (as in, Wireshark does seem to calculate correct window sizes when nginx sends the window update frame). I can try nonetheless.

@MangoIV
Author

MangoIV commented Jun 18, 2024

I have tried out a patch, together with some other long shots w.r.t. improving atomicity, but that didn't do it.

@kazu-yamamoto
Owner

@MangoIV Could you give me a pcap file (taken with tcpdump or similar) and its key log file?
I need to understand the details.

@kazu-yamamoto
Owner

I'm beginning to understand the situation.
@MangoIV If you specify the following settings to your client, does your issue disappear?

defaultClientConfig { settings = defaultSettings { initialWindowSize = 65535 } }

@MangoIV
Author

MangoIV commented Jun 19, 2024

I’m pretty sure it would, yes, though without having tested it. (I have now tested it; this does work.)

@kazu-yamamoto
Owner

  • nginx immediately updates the window size to 2^31 - 1 by sending a WINDOW_UPDATE frame, assuming the default window size

I guess that this is the connection TX for the client (sid == 0 is for the connection). This is good.

  • http2 client starts setting up the window size to its standard window size (the 2^18 constant in the http2 library) and after that processes the window update

This is stream RX for client.

  • the window update is too large as http2 client assumes 2^18 as the current window size

The stream RX for the client is now 2^18.
The stream TX for the client is 2^16 - 1 (if the server does not specify it in SETTINGS).
The WindowUpdate from the server should be added to the stream TX for the client. So, something is wrong.

@kazu-yamamoto
Copy link
Owner

kazu-yamamoto commented Jun 19, 2024

The initial connection TX for the client cannot be modified by the server at the beginning.
The initial value for the connection is fixed at 64K.
Only a WindowUpdate (sid == 0) from the server can change it.

@kazu-yamamoto
Owner

2^18 is the RX for the client.
The WindowUpdate (2^31 - 1 - 2^16 + 1) from the server is for the TX of the client.
They must not be added together.

@kazu-yamamoto
Owner

OK, I understand. My settings are also set as the peer's settings in the IORef in newContext. That's bad.
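[Editorial note: the shape of the bug described above can be sketched in a few lines: the settings governing the client's TX side must start from the protocol defaults and change only when the peer's SETTINGS frame arrives; seeding them from the client's own settings conflates RX and TX. All record and function names here are hypothetical, not the real http2 internals:]

```haskell
-- Illustrative sketch only; names are hypothetical, not the real http2 API.
newtype WindowSettings = WindowSettings { initialWindow :: Int }
  deriving Show

-- The protocol-default initial window, 2^16 - 1 = 65535.
protocolDefaults :: WindowSettings
protocolDefaults = WindowSettings { initialWindow = 2 ^ (16 :: Int) - 1 }

data Ctx = Ctx
  { mySettings   :: WindowSettings  -- what we announce (governs our RX)
  , peerSettings :: WindowSettings  -- what the peer announced (governs our TX)
  }

-- Buggy shape: the peer's settings are seeded from ours (e.g. 2^18), so a
-- peer WINDOW_UPDATE sized against 65535 overflows our TX accounting.
-- Fixed shape: the peer's settings start at the protocol defaults.
newCtxBuggy, newCtxFixed :: WindowSettings -> Ctx
newCtxBuggy mine = Ctx mine mine
newCtxFixed mine = Ctx mine protocolDefaults

main :: IO ()
main = do
  print (peerSettings (newCtxBuggy (WindowSettings 262144)))  -- 2^18: wrong
  print (peerSettings (newCtxFixed (WindowSettings 262144)))  -- 65535: right
```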

@MangoIV
Author

MangoIV commented Jun 19, 2024

Thank you so much for taking so much time for this; this is really, really nice :)

Should I add some more information? E.g., how Wireshark sees it when the thing crashes, or the logs I added to http2 before the crash?

@kazu-yamamoto
Owner

I will push one commit very soon.
It would be nice if you could test it.

@MangoIV
Author

MangoIV commented Jun 19, 2024

sure, I will do that! Thank you so much <3

@kazu-yamamoto
Owner

branch: https://github.com/kazu-yamamoto/http2/tree/peer-initial-window-size
commit: 80921ad

This should fix your issue.

@MangoIV
Author

MangoIV commented Jun 19, 2024

@kazu-yamamoto But this is just setting it to 64K, right? Does this fix the issue, or is it just a workaround? I.e., I have trouble understanding whether 200K would be a legal initialWindowSize.

Perhaps this is the issue, though: in my tests, the SETTINGS frame that would notify the server about the changed initial window size didn't show up. I thought this was just a timing issue, but if we never send a SETTINGS frame upon changing the setting, then we probably have the root of the issue? Does that sound correct?

@kazu-yamamoto
Owner

But this is just setting it to 64K, right? Does this fix the issue, or is it just a workaround? I.e., I have trouble understanding whether 200K would be a legal initialWindowSize.

This is a complete fix, not a workaround.
You can change initialWindowSize in Settings as you like, and "window update for stream 1 is overflow" should not happen.

@MangoIV
Author

MangoIV commented Jun 19, 2024

Alright, thank you; I guess I will try it out then. I have encountered a couple of what I suspect to be Chesterton's fences (e.g., for some atomicity things) and I would like to ask you questions about them; should I open an extra issue for that?

@kazu-yamamoto
Owner

kazu-yamamoto commented Jun 19, 2024

Perhaps this is the issue, though: in my tests, the SETTINGS frame that would notify the server about the changed initial window size didn't show up. I thought this was just a timing issue, but if we never send a SETTINGS frame upon changing the setting, then we probably have the root of the issue? Does that sound correct?

I confirmed that an initialWindowSize of 2^18 is announced by the client.

@kazu-yamamoto
Owner

Alright, thank you; I guess I will try it out then. I have encountered a couple of what I suspect to be Chesterton's fences (e.g., for some atomicity things) and I would like to ask you questions about them; should I open an extra issue for that?

Yes, please open new issues.
I will try to help you.

@MangoIV
Author

MangoIV commented Jun 19, 2024

lovely, thank you so much <3

@MangoIV
Author

MangoIV commented Jun 19, 2024

Thank you for the patch, I think that fixes it!

@MangoIV MangoIV closed this as completed Jun 19, 2024
@kazu-yamamoto
Owner

Thank you so much!

@kazu-yamamoto
Owner

A new version has been released.
