net: net_pkt_append: Take into account MSS/MTU when adding data to a packet #119
Conversation
@jukkar, @tbursztyka : Please review.
That's actually not a proper way to handle it.
As you noticed in your commit message, appending data might happen before the stack knows the actual header size (the protocol headers and the L2 reserve, which in the case of 15.4 can vary; see update_ll_reserve() in ipv6.c for instance).
The problem is that you can't really assume anything about that header size beforehand.
Imo, the right way is to fix it when it's time to actually send the net_pkt, depending on MTU, protocol header, data length...
One option would be to "slice" the data into as many net_pkts as necessary (all chained to the first one), but that piles on a lot of complexity and generates a lot of stress on the buffer management side (it would be really easy to get locked up). I don't think this is a good one.
Better: send would update the protocol header and such, all according to the MTU. net_context_send would then return the amount of data it could send, and the net_pkt would be updated so that the unsent data would still be there (aka allocated, etc.), ready to be sent as well by the user (or with some more data appended to it, etc.).
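The partial-send semantics suggested above could be sketched like this. This is plain C with stand-in names, not the Zephyr API: `net_send_partial` is a hypothetical placeholder for a `net_context_send` that returns how many bytes fit under the MTU budget, and the caller loops over the remainder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: returns how many bytes the stack could accept
 * under the per-packet MTU budget; unsent data stays with the caller. */
static size_t net_send_partial(const char *data, size_t len, size_t mtu_budget)
{
    (void)data;
    return len < mtu_budget ? len : mtu_budget;
}

/* Caller loop: keep sending until the whole payload has been accepted. */
static size_t send_all(const char *data, size_t len, size_t mtu_budget)
{
    size_t sent = 0;

    while (sent < len) {
        size_t n = net_send_partial(data + sent, len - sent, mtu_budget);

        if (n == 0) {
            break; /* no progress; would block or out of buffers */
        }
        sent += n;
    }
    return sent;
}
```

The point of this shape is that the system never has to slice the data itself; the short-write return value pushes that decision back to the user, POSIX-`send()`-style.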
Now, because this problem is not enough fun by itself: there are two "MTUs" to handle. The usual hardware one, but also the protocol one (like IPv6 over 6lo, for instance).
Also, because of this lack of MTU handling, buffer fragments always get the reserve applied. In the case of Ethernet's MTU, for instance, that means 1500/FRAG_SIZE fragments on each of which the Ethernet header size is "reserved", so you lose 14 bytes * (fragments - 1), the first fragment being the only one that actually needs this Ethernet header. Because MTU is not handled, fragmentation has been mixing up memory management with actual packet fragmentation in an MTU-ish way (it "works" for 15.4, but as soon as you mix different bearers in the same system, it's obviously under-optimized). And the last thing you want is to pile the proper MTU handling logic into the driver itself so it would "know" that, after the first frag, the reserve data of all the other frags needs to be skipped. (Well, it's what some Ethernet drivers do to overcome the current MTU handling limitation, but that should not be the way.)
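The per-fragment reserve waste described above can be illustrated with some arithmetic. The numbers here are illustrative assumptions for the example (a 128-byte fragment pool and the 14-byte Ethernet header mentioned in the comment), not values read from Zephyr headers.

```c
#include <assert.h>

/* Illustrative numbers: the Ethernet header space is reserved on every
 * fragment, although only the first fragment actually needs it. */
#define ETH_HDR_RESERVE 14
#define FRAG_SIZE       128

static int frags_for(int payload, int frag_payload)
{
    return (payload + frag_payload - 1) / frag_payload; /* ceiling division */
}

static int wasted_reserve_bytes(int payload)
{
    /* Usable payload per fragment shrinks by the reserve on each one. */
    int nfrags = frags_for(payload, FRAG_SIZE - ETH_HDR_RESERVE);

    /* Only the first fragment needed the reserve; the rest is waste. */
    return ETH_HDR_RESERVE * (nfrags - 1);
}
```

With these numbers, a full 1500-byte Ethernet payload needs 14 fragments of 114 usable bytes each, wasting 14 * 13 = 182 bytes of reserve space.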
So this problem needs to be addressed, but globally and properly so it would fix all the issues in one logic.
As a generic reply, well, that's why I keep throwing out "WIP" and "DRAFT" net patches (#81, #49, with more in the queue): because I don't know enough of the Z IP stack to come up with a proper solution, so I dump them as "hints" for you guys... In this case, it's clear that MTU matters should have been thought about from the start (like, when a packet is requested, all proto headers should be allocated before letting the user add data to it). Retrofitting MTU handling on top of what we have now leads to non-general, or ugly, results. Anyway, that's a generic "methodological" reply; let me ponder the options you mention, though intuitively they err on the "complex" side, and thus the actual implementation may err on the "ugly" side...
As @tbursztyka mentioned we have two MTU issues here:
For point 1 we have IPv6 fragmentation patches but nothing for IPv4 atm. This patch is about point 2. I like the solution, but as it is very difficult to know in advance how big the protocol headers are, the solution gets tricky.
What do you think?
Ok, so I did a moderately thorough pass through the things you talk about when preparing this patch; let's retrace these steps and see what I missed.
So, net_pkt_get_tx() eventually calls:
net_if_get_ll_reserve() takes an IPv6 address argument exactly because of funky stuff like 15.4, which has a variable-size reserve (not necessarily based on the address; it may identify peer state).
So, I noticed that the 15.4 reserve may vary, but the fact that it may need adjustment beyond the initial alloc is completely nuts, sorry ;-). Why don't we just conservatively allocate the largest possible reserve (and let 15.4 keep being an underused protocol, apparently due to all its funky features like that, which make it hard and inefficient to use)?
Sorry, that puts it all upside down. How the current Z native API works is: the user gets a packet from the system, the user fills it, the user asks the system to send it. The above involves the system in the "filling" part, which kills all the simplicity and clarity of the current approach, and adds an intermingled mix of stuff into the system. So yeah, I don't think that's viable. There's a much easier way: the user calls socket(), then the user calls send() on it. Voila! This patch works perfectly with the BSD Sockets approach, and I can keep it a part of the socket code. (Then the native API just stays broken in that respect.)
It could send or it has sent?
So, we're back to the "user munges packet data, but the system needs to munge it too" approach. This one is perhaps not as intermingled as the choice above, but still, why? Why, if currently the system just frees the sent packet and is done with it? Why, if there will be sockets which just work in the natural way everyone knows?
Sorry, but that's not the problem of this patch ;-). It only means that the Great Buffer Management Rework, started with your wonderful net_pkt refactor, should not stop there, but continue. Btw, is it really as you tell it? Because I thought the "user data" (which is not user data) is what's repeated in each and every net_buf, while the ll_reserve is applied just once, on the first net_buf. This reminds me what the next step of the G.B.M.R. should be: renaming user-data-which-is-not-user-data to system data or metadata.
Yeah, some researcher who surveys the design of the Zephyr IP stack may write "... given all the suboptimalities we discussed, one may wonder whether it made sense to go for a net-fragments-based design at all." But we're with it already, and it's at least different from how many other stacks do it, so it may as well be an innovation. But see GH-3547, point 2. I was really surprised to see it doing that. Like, we go out of our way for zero-copy reception, but with transmission, we just roll bytes around? That feature should be optional and off by default IMHO, unless of course 6lowpan mandates it (very unwise of them, then).
Well, not the things this patch tries to address. Let's try to sort it all out.
So, this patch addresses p.1 and only p.1, as all the other points aren't really related to MTU as it's defined for IP networking. So, the other points should be (and apparently are) handled elsewhere, while in the current design, net_pkt_append() is the only place to handle the IP MTU. And as argued above, changing that design would be a bigger problem than what we have now.
Intuitively, it should not be like that. For LL headers, the largest required size should be conservatively allocated. Otherwise, is it possible for a user to control the IP extension headers of each packet? If not, then IP-level header sizes should also be known at the time we allocate a new packet.
There is no true 0-copy thingy here. It was imo wrong marketing wording made on top of this stack. It's definitely improved over what existed before, but it's just not possible to make it 100% 0-copy, so I prefer not to use this word for the net stack. The fragment design is not a bad idea; the only current big problem is that instead of being just a memory management scheme, it got mixed up with packet fragmentation, as it did not take care of MTU, though the info is available on net_if. Surely net_pkt still needs work to address all of this; I could dive into it again as soon as I have time.
No, because on 15.4 for instance, with the smallest/most optimized fragment size, which is 128, you really don't want to always allocate the largest reserve, because it can be rather large on that tech; if I am not mistaken, the largest ever would reduce your payload size to 85 bytes, at best.
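The 15.4 numbers above work out as follows. This is illustrative arithmetic, not values from Zephyr headers: an 802.15.4 frame carries at most 127 bytes, and the 42-byte worst-case link-layer reserve is an assumption chosen to match the "85 bytes, at best" figure in the comment.

```c
#include <assert.h>

/* Illustrative 802.15.4 numbers (assumptions, not Zephyr constants). */
#define IEEE802154_MAX_FRAME  127
#define WORST_CASE_LL_RESERVE  42  /* assumed worst-case MAC header + security */

/* Payload left in a frame after reserving `reserve` bytes up front. */
static int payload_with_reserve(int reserve)
{
    return IEEE802154_MAX_FRAME - reserve;
}
```

So always allocating the worst-case reserve would cap every frame's payload at 85 bytes, even when the actual header turns out much smaller; that is the cost of the conservative-allocation approach on this bearer.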
As a final point to this discussion, we may look at how others do it. In lwIP, there are distinct APIs for UDP vs TCP. Z uses a consistent API, and that's one of its distinctive points: it's closer to POSIX in that respect. Of course, that comes at a price: TCP is inherently a "streaming" API, while Z forces an explicit packet-based model for both UDP and TCP. In lwIP, packet reception is the same packet-based affair, but on the send side, it tries to provide a more stream-based API. There's a …

So, again, Z's native API is both more low-level and more consistent. I would think those are its selling points and we should leave it that way. If we want more, then well, BSD Sockets are coming. But it's still a good idea to fix obvious issues in the current API, like the lack of handling of the IP MTU (other MTUs are handled elsewhere). That's exactly what this patch does.

If there are concerns that it doesn't leave enough room for extension of LL or IP proto headers, well, let's add another static "reserve" to subtract from the MTU. Say, we reserve the size of the LL and proto headers, and 20 (40, 80) bytes more for their extensions. That's not ideal, but it's very practical and keeps complexity at bay. (But I think that by the time we let the user add data to a packet, the size of the headers should already be known, and any extra reserve should be accounted for before that, e.g. by using the largest LL header size.)
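The "subtract a static reserve from the MTU" idea could look roughly like this. All names and sizes here are illustrative stand-ins, not the Zephyr API: the 20-byte extension slack is the assumption floated in the comment, and `append_capped` models only the length-capping decision that the patch adds to net_pkt_append().

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sizes (assumptions for the sketch). */
#define MTU             1500
#define IPV6_HDR_LEN      40
#define TCP_HDR_LEN       20
#define EXT_HDR_RESERVE   20  /* static slack for header extensions */

/* Returns how many of `want` bytes may actually be appended, given how
 * much payload is already in the packet. */
static size_t append_capped(size_t already_in_pkt, size_t want)
{
    size_t budget = MTU - IPV6_HDR_LEN - TCP_HDR_LEN - EXT_HDR_RESERVE;
    size_t room = already_in_pkt < budget ? budget - already_in_pkt : 0;

    return want < room ? want : room;
}
```

With these numbers the payload budget is 1420 bytes: a 10KB append request is silently capped, matching the commit message's goal of never building a packet the interface can't send.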
Well, IMHO, it's a good aim to strive for 0-copiedness in the native API; that's where it can differentiate itself from the Sockets API, which will be easier to use anyway.
Thanks. I think for 1.8 there was great progress, and further internal refactorings could be restarted in 1.9. But for 1.8 it would be really nice to bust externally-facing issues, like this one (not really critical; if we can't agree on a way to do it, well...), or really critical bugs in TCP handling (#137, etc.).
I have been pondering this problem and it looks like the only practical solution is to split the too large packet just before we are about to send it.
So, for example, in ipv6.c net_ipv6_finalize_raw() creates the IPv6 packet; it knows the IP header size at this point and can thus check if we overflow the MTU. If we do overflow it, then it is relatively easy to split the large packet into smaller pieces and clone the IPv6 header at this point. We should do the splitting before net_pkt_compact() is called; otherwise it is trickier to figure out where to split the pkt.
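The finalize-time split could be sketched as follows. Simplified flat buffers stand in for net_pkt fragment chains here, and the fixed 40-byte header and 1280-byte MTU are illustrative; the real code would walk fragments and rebuild per-piece IPv6 headers (fragment extension header included).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MTU        1280
#define IP_HDR_LEN   40  /* fixed-size IPv6 header, for illustration */

/* How many on-wire packets a `len`-byte payload needs after splitting. */
static int split_count(size_t len)
{
    size_t per_pkt = MTU - IP_HDR_LEN;

    return (int)((len + per_pkt - 1) / per_pkt);
}

/* Build one piece: clone the IP header, then copy as much payload as
 * fits under the MTU. Returns the number of payload bytes consumed. */
static size_t build_piece(unsigned char *out, const unsigned char *hdr,
                          const unsigned char *payload, size_t len)
{
    size_t take = len < (MTU - IP_HDR_LEN) ? len : (MTU - IP_HDR_LEN);

    memcpy(out, hdr, IP_HDR_LEN);            /* cloned IPv6 header */
    memcpy(out + IP_HDR_LEN, payload, take); /* this piece's payload */
    return take;
}
```

The caller would loop, calling build_piece until the payload is exhausted; doing this before net_pkt_compact() means fragment boundaries are still predictable, as noted above.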
I can actually investigate how easy this solution would be.
Yes, as a quick workaround, doing it in the finalize function could work, I think. However, as a grand final solution, it's not really going to be the end of the story. Let me detail.

We are actually messing up this MTU handling right now, and for the reasons we know: design issues, easy workarounds, etc. Plus, the MTU can be used to split an IP packet on Ethernet, but that's not true with 15.4 where 6lo is in place (there it's up to 15.4-6lo to split, i.e. fragment, the IPv6 packet). Now, to make things worse (heh, it always does, doesn't it?), 15.4 might need to reserve space after the payload, and not for the checksum but for the authentication data, when 15.4 is configured to be encrypted+authenticated. Currently, I handled this case in 15.4 with a hack: I just use an ll reserve of header + auth data size, and then when it's time to encrypt/auth I memmove here and there. It works... but it's really sub-optimized.

As you can see, there are many more cases than just a simple MTU-based IP split. So for now, imo, the IP split on *_finalize() is fine, but there is still room for design improvements to address all the issues above. I actually planned to do so for a while; it's just not the biggest priority right now, so I'll probably do it for 1.9.
Yes, I updated the title of this ticket to make it clear it's not a priority until the 1.9 work starts. Let's try to ponder the issue in the meantime, and later try to discuss the general approach before implementing anything more complex than this patch. Thanks.
I do not think this MTU work will be ready for 1.8 (even with some hacks). So if you have something already planned for 1.9, let's wait for that then.
What is the status of this?
As we didn't reach agreement how this should be handled on the level of native Zephyr networking API, this patch effectively went only into Sockets API. Closing then. |
recheck |
As far as I can tell, for a TX packet, net_pkt->appdatalen isn't used until packet construction is finished and the packet is ready to be sent. So, it can be used to cache the "remaining data size in the packet" (initialized in net_pkt_get_tx()).
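The cached-remaining-size idea could be modeled like this. `fake_pkt.remaining` is a stand-in for the proposed reuse of appdatalen, and the 1420-byte budget is illustrative; the point is that the MTU/header arithmetic happens once at allocation time, not on every append.

```c
#include <assert.h>

/* Stand-in for net_pkt; `remaining` models reusing the appdatalen
 * field as a "room left under the MTU" cache during TX construction. */
struct fake_pkt {
    unsigned int remaining;
};

/* Done once, in the TX allocation path (cf. net_pkt_get_tx()). */
static void pkt_init_tx(struct fake_pkt *pkt, unsigned int mtu_budget)
{
    pkt->remaining = mtu_budget;
}

/* Each append just consumes the cached budget; no header recomputation. */
static unsigned int pkt_append(struct fake_pkt *pkt, unsigned int want)
{
    unsigned int n = want < pkt->remaining ? want : pkt->remaining;

    pkt->remaining -= n;
    return n;
}
```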
(This is pending to be updated to use MSS for the TCP case, instead of just MTU). |
So, #1330 actually has:
I.e. it also doesn't do it exactly right. With all the progress we have on these issues, let's go for a proper fix. I started doing one in #4657, also included here to make the build pass. The original patch is updated to use send_mss for the TCP case.
If we were asked to add 10KB to a packet, adding it won't help - such a packet won't even be sent by the hardware on our side, and if it is, it will be dropped by the receiving side. So, make sure we never add more data than the MTU set for the owning interface. This actually gets a bit tricky, because we also need to account for protocol header space. Typically, when net_pkt_append() is called, protocol headers aren't even added to the packet yet (they are currently added in net_context_send()), so we have little choice but to assume the standard header length, without any extensions.

Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>
Now the check happens on the level of the core IP stack, in net_pkt_append(). Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>
@jukkar : This now should be ready for review/merge. @tbursztyka : Based on the mailing list discussion, and the fact that #4243 relies on this, can you please re-review this?