net: net_pkt_append: Take into account MSS/MTU when adding data to a packet #119

pfalcon · 2017-05-08T13:48:39Z

If we were asked to add 10KB to a packet, adding it won't help -
such packet won't be even sent by hardware on our side, and if
it is, it will be dropped by receiving side. So, make sure we
never add more data than MTU as set for the owning interface.
This actually gets a bit tricky, because we need also to account
for protocol header space. Typically, when net_pkt_append() is
called, protocol headers aren't even added to packet yet (they
are added in net_context_send() currently), so we have little
choice than to assume the standard header length, without any
extensions.

Signed-off-by: Paul Sokolovsky paul.sokolovsky@linaro.org

pfalcon · 2017-05-08T13:49:10Z

@jukkar, @tbursztyka : Please review.

tbursztyka

That's actually not a proper way to handle it.

As you noticed in your commit message, appending data might be made "before" it knows the actual header size (protocol and the l2 reserve, which in case of 15.4 can vary, see ipv6.c: update_ll_reserve() for instance).

The pb, is that you can't really assume anything about that header size before hand.

Imo, one of the right way to fix it when it's time to send the actually net_pkt: depending on MTU, protocol header, data length...
It would either "slice" the data in as many necessary net_pkt (all being chained to first one), but that piles a lot of complexity and generate a lot of stress on buffer management side (it would be really easy to get locked up). I don't think this is a good one.

Better: send would update protocol header and stuff, all according to MTU. net_context_send would then return the amount of data it could send, and the net_pkt would be updated so the un-sent data would be still there (aka: allocated, etc..) ready to be sent as well by the user (or with some more data appended to it etc...)

Now, because this problem is not enough fun by itself: there are 2 "MTUs" to handle. The usual hw one, but also the protocol one (like IPv6 in 6lo for instance).

Also, because of this lack of MTU handling, buffer fragments always get the reserve applied... which in case of ethernet's MTU for instance make 1500/FRAG_SIZE fragments, on which - for instance - the ethernet header size is "reserved". So you loose 14 bytes * (fragments - 1) -> 1 being the first one which need this ethernet header. Because MTU is not handled, fragment have been mixing up memory management and actually packet fragmentation in a mtu-way kind of thing (it "works" for 15.4, but as soon as you mix up different bearers in the same system, it's under-optimized obviously). And the last thing you want is to pile up the right MTU handling logic in the driver itself so it would "know" that, after first frag all other's frag reserve data would need to be skipped. (well it's what some ethernet driver do, to overcome current mtu handling limitation, but that's should not be the way).

So this problem needs to be addressed, but globally and properly so it would fix all the issues in one logic.

pfalcon · 2017-05-08T15:03:28Z

That's actually not a proper way to handle it.

As a generic reply, well, that's why I keep throwing "WIP" and "DRAFT" net patches (#81, #49, have more in queue) - because I don't know enough of Z IP stack to come up with a proper solution, so dump them as "hints" for you guys...

In this case, it's clear that MTU matters should have been thought about from the start (like, when packet is requested, all proto packets should be allocated before letting user add data to it). Retrofitting MTU handling on top of what we have now leads to non-general, or ugly result.

Anyway, that's a generic "methodological" reply, let me ponder about the options you mention, though intuitively, they err on "complex" side, and thus the actual implementation may err on "ugly" side...

jukkar · 2017-05-10T07:54:10Z

As @tbursztyka mentioned we have two MTU issues here:

Protocol MTU like IPv6 (1280 bytes)
HW MTU, like ethernet (1500 bytes) and 802.15.4 (128 bytes)

For the 1. we have a IPv6 fragmentation patches but nothing for IPv4 atm.

This patch is about the point 2. I like the solution but as it is very difficult to know in advance how big the protocol headers are, so the solution gets tricky.
So we have probably two options:

Partition the data while starting to send it like described in this patch. Would it be possible to figure out the number of bytes protocol headers occupy beforehand? So similar thing that update_ll_reserve() is doing i.e., set callbacks along the stack which would return amount of bytes that would be added by that layer.
Partition the data while sending it. This is actually what the IPv6 fragmentation patches are doing. This causes some more pressure to the buffer count but could probably be done.

What do you think?

pfalcon · 2017-05-10T12:55:50Z

@tbursztyka :

Ok, so I did a moderately thorough pass thru things you talk about when preparing this patch, let's retrace these steps and see what I missed.

and the l2 reserve, which in case of 15.4 can vary

So, net_pkt_get_tx() eventually calls:

net_pkt_get_reserve(slab, net_if_get_ll_reserve(iface, addr6),
                                  timeout);

net_if_get_ll_reserve() gets an ipv6 addr argument exactly because of funky stuff like 15.4 which has varsize reserve, (not necessarily based on the address, it may identify peer state).

see ipv6.c: update_ll_reserve() for instance

So, I noticed that 15.4 reserve may vary, but the fact that it may need adjustment beyond the initial alloc is completely nuts, sorry ;-). Why don't we just alloc conservatively largest possible reserve (and let 15.4 keep being an underused protocol, apparently due to all its funky features like that, which make it hard and inefficient to use)?

pfalcon · 2017-05-10T13:13:13Z

mo, one of the right way to fix it when it's time to send the actually net_pkt: depending on MTU, protocol header, data length... It would either "slice" the data in as many necessary net_pkt (all being chained to first one)

Sorry, it's putting it all upside down. How currently Z native API works is: user gets packet from system, user fills it, user ask system to send it. The above involves system in "filling" part, which kills all simplicity and clarity of the current approach, and adds intermingled mix of stuff into the system. So yeah, I don't think that's viable. There's much easier way: user calls socket(), then user calls send() on it. Voila! This patch works perfectly with BSD Sockets approach, and I can keep it a part of socket code. (Then native API just stays broken in that respect.)

Better: send would update protocol header and stuff, all according to MTU. net_context_send would then return the amount of data it could send

It could send or it has sent?

and the net_pkt would be updated so the un-sent data would be still there (aka: allocated, etc..) ready to be sent as well by the user (or with some more data appended to it etc...)

So, we're back to "user munges packet data, but system needs to munge it too" approach. This one perhaps not as intermingled as the choice above, but still, why? Why, if currently the system just frees the sent packet and is done with it. Why, if there will be sockets which just work in a natural way everyone knows.

pfalcon · 2017-05-10T13:17:39Z

Also, because of this lack of MTU handling, buffer fragments always get the reserve applied... which in case of ethernet's MTU for instance make 1500/FRAG_SIZE fragments, on which - for instance - the ethernet header size is "reserved". So you loose 14 bytes * (fragments - 1) -> 1 being the first one which need this ethernet header.

Sorry, but that's not the problem of this patch ;-). It only means that the Great Buffer Management Rework, started with your wonderful net_pkt refactor, should not stop there, but continue.

Btw, is it really like you tell it? Because I thought the "user data" (which is not user data) is what's being repeated in each and every net_buf, while ll_reserve is done just once, on the first net_buf?

This reminds me what should be next step of G.B.M.R. - renaming user-data-which-is-not-user-data to system data or meta data.

pfalcon · 2017-05-10T13:26:08Z

it "works" for 15.4, but as soon as you mix up different bearers in the same system, it's under-optimized obviously

Yeah, some researcher who will survey design of Zephyr IP stack may write "... given all the suboptimalities we discussed, one may wonder whether it made sense to go for net fragments based design at all." But we're with it already, and it's at least different from how many other stacks do it, so it may as well be an innovation.

But see GH-3547 , point 2. I was really surprised to see it doing that. Like, we go out of our way for zero-copy reception, but with transmission, just roll bytes around? That features should be optional and off by default IMHO, unless of course 6lowpan mandates it (very unwise of them then).

pfalcon · 2017-05-10T13:35:47Z

As @tbursztyka mentioned we have two MTU issues here:
Protocol MTU like IPv6 (1280 bytes)
HW MTU, like ethernet (1500 bytes) and 802.15.4 (128 bytes)

Well, not the things this patch tries to address. Let's try to sort it all out.

MTU is Maximum Transmission Unit, https://en.wikipedia.org/wiki/Maximum_transmission_unit .
Here we talk about MTU of "native" IP hardware, like Ethernet. Stuff like 15.4, BLE has its MTU too, but it's not native IP hardware. IP layering on it is handled by 6lowpan, consequently matching 15.4/BLE MTU with IP packets is 6lowpan responsibility, not IP stack's. We even should not call that thing MTU, better max frame size or something.
There's no "Protocol MTU like IPv6 (1280 bytes)". 1280 is the minimum required to be supported IPv6 datagram size. It doesn't mean that IPv6 IP packet can't be smaller for example. It means that any IP packet of up to that size will guaranteedly reach destination, or we have broken IPv6 network. IP packets longer that that, are not guaranteed to come thru.

So, this patch addresses p.1 and only p.1, as all other points aren't really related to MTU as it's defined for IP networking. So, other points should be (and apparently are) handled elsewhere, while in the current design, net_pkt_append() is the only place to handle IP MTU. And as argued above, changing that design will be bigger problem than what we have now.

pfalcon · 2017-05-10T13:39:27Z

I like the solution but as it is very difficult to know in advance how big the protocol headers are, so the solution gets tricky.

Intuitively, it should not be like that. For LL headers, the largest required size should be conservatively allocated. Otherwise, is it possible for a user to control IP extension headers of each packet? If no, then IP-level header sizes also should be known at the time we allocate a new packet.

tbursztyka · 2017-05-10T13:43:42Z

There is no true 0-copy thingy here. It was imo a wrong marketing wording made on top of this stack. It's definitely improved over what existed before, but it's just not possible to make it 100% 0-copy, so I prefer not to use this word in net stack.

fragment design is not a bad idea, the only current big problem is that instead of being just a memory management scheme, it got mixed up with packet fragmentation, as it did not took care of MTU, though the info is available on net_if.

Surely net_pkt still needs work to address all of it, I could dive again in it as soon as I have time.

[edit]

Intuitively, it should not be like that. For LL headers, the largest required size should be conservatively allocated.

No, because on 15.4 for instance, and on the smallest/most optimized fragment size which is 128, you really don't want to always allocate the largest because it can be rather large on that tech, if I am not mistaken the largest ever would reduce your payload size to 85 bytes, at best.

pfalcon · 2017-05-10T14:56:28Z

What do you think?

As a final point to this discussion, we may look how others do it. In lwIP, there're distinct APIs for UDP vs TCP. Z uses consistent API, and that's one of distinctive points - it's closer to POSIX in that respect. Of course, that comes at a price - TCP is inherently "streaming" API, while Z forces to use explicit packet-based model for both UDP and TCP. In lwIP, packet reception is the same packet-based, but on send side, it tries to provide more stream-based API. There's a tcp_write(void *buf, int len) call, you don't deal with packets explicitly. On lwIP's side, that's said to write to "send buffer", the implementation is black-boxed (this "send buf" can be as well a packet). However, it's still user responsibility to check how much it can write (tcp_sndbuf() call), it's error to write more. Voila, everything else is done by lwIP. This includes Nagle algorithm (won't send a handful of bytes written, will wait for more, unless timeout comes first), retransmissions, etc.

So, again, Z's native API is both more low-level and more consistent. I would think those are its selling points and we should leave it that way. If we want more, then well, BSD Sockets are coming. But it's still good idea to fix obvious issues in the current API, like lack of handling of IP MTU (other MTUs are handled elsewhere). That's exactly what this patch does. If there're concerns that it doesn't leave enough room for extension of LL or IP proto headers, well, let's add another static "reserve" to subtract from MTU. Say, we reserve size of LL, proto headers, and 20 (40, 80) bytes more for their extension. That's not very ideal, but very practical and keep complexity at bay. (But I think that to the time we let user add data to a packet, the size of headers should be already known, and any extra reserve should be accounted before that, e.g. use the largest size of LL header).

pfalcon · 2017-05-10T15:00:48Z

It was imo a wrong marketing wording made on top of this stack. It's definitely improved over what existed before, but it's just not possible to make it 100% 0-copy, so I prefer not to use this word in net stack.

Well, IMHO, it's a good aim to strive for 0-copiedness in the native API, that's where it can differentiate from Sockets API which will be easier to use anyway.

Surely net_pkt still needs work to address all of it, I could dive again in it as soon as I have time.

Thanks. I think for 1.8, there was a great progress, and further internal refactorings could be restarted in 1.9. But for 1.8 it would be really nice to bust externally-facing issues, like this (not really critical, if we can't agree on a way to do it, well...), or really critical bugs in TCP handling (#137, etc.).

jukkar

I have been pondering this problem and it looks like the only practical solution is to split the too large packet just before we are about to send it.
So for example in ipv6.c net_ipv6_finalize_raw() creates IPv6 packet and it knows the IP header size at this point and can thus check if we overflow the MTU. If we do overflow it, then it is relatively easy to split the large packet into smaller pieces and clone the IPv6 header at this point. We should do the splitting before the net_pkt_compact() is called, otherwise it is more tricky to figure out where to split the pkt.
I can actually investigate how easy this solution would be.

tbursztyka · 2017-05-16T06:18:56Z

Yes, as a quick workaround, doing it in finalize function could work I think.
If you investigation can be made before 1.8, maybe it should be good to go then.

However as a grand final solution, it's not really going to be the end of the story. Let me detail.

We are actually messing up about this MTU right now, and for the reason we know: design issue, easy workaround, etc...
For instance on ethernet, the real hardware MTU is not only the max size of the payload (1500), it is the header + max payload + checksum = 1518 bytes.
See the checksum at the end, until now, and for a good reason (hardware do that job for us) we skipped it totally. But who knows where zephyr could be used? As a communication chip firmware, it could have to deal with this checksum by itself, etc...

Plus, MTU can be used to split an ip packet with ethernet, but it's not true with 15.4 where 6lo is in place (and then it's up to 15.4-6lo to split: aka fragment the ipv6 packet).

Now to make things worse (heh, it always does, doesn't it?) 15.4 might need to reserve space, after the payload and not for the checksum: for the authentication data when 15.4 is configured to be encrypted+authentified.

Currently, I handled this case in 15.4 with a hack: I just use ll reserve space header + auth data size, and then when it's time to encrypt/auth I memmoves here and there. It works... but it's really sub-optimized.

As you can see, there are much more case than just simple mtu and ip split.

So for now, imo the IP split on *_finalize() is fine, but there is still rooms for design improvements to address all the issues above. I actually planned to do so for a while, it's just not the biggest priority right now so I'll probably do it for 1.9.

pfalcon · 2017-05-16T06:31:10Z

Yes, I updated the title of this ticket to make it's clear it's not a priority until 1.9 work start, let's try to ponder about the issue in the meantime and let's later try to discuss the general approach, before trying to implement anything more complex than this patch. Thanks.

jukkar · 2017-05-16T07:54:50Z

I do not think this MTU work will get ready for 1.8 (even with some hacks). So if you have something already planned to 1.9, lets wait for that then.

nashif · 2017-08-01T16:47:38Z

what is the status of this?

pfalcon · 2017-08-01T18:13:06Z

As we didn't reach agreement how this should be handled on the level of native Zephyr networking API, this patch effectively went only into Sockets API. Closing then.

pfalcon · 2017-10-25T12:29:24Z

recheck

pfalcon · 2017-10-25T15:06:33Z

As far as I can tell, for TX packet, net_pkt->appdatalen isn't used until packet construction is finished and ready to be sent. So, it can be used to cache "remaining data size in the packet" (initialized in net_pkt_get_tx()).

pfalcon · 2017-10-31T12:29:51Z

(This is pending to be updated to use MSS for the TCP case, instead of just MTU).

pfalcon · 2017-10-31T18:25:21Z

(This is pending to be updated to use MSS for the TCP case, instead of just MTU).

So, #1330 actually has:

+	/* FIXME: use mss for sending length instead of the recv one */
+	mss = net_tcp_get_recv_mss(ctx->tcp);

I.e. also doesn't do it exactly right. With all the progress we have on these issues, let's go for proper fix. I started doing one in #4657 . Also included here, to make build pass. The original patch is updated to use send_mss for TCP case.

If we were asked to add 10KB to a packet, adding it won't help - such packet won't be even sent by hardware on our side, and if it is, it will be dropped by receiving side. So, make sure we never add more data than MTU as set for the owning interface. This actually gets a bit tricky, because we need also to account for protocol header space. Typically, when net_pkt_append() is called, protocol headers aren't even added to packet yet (they are added in net_context_send() currently), so we have little choice than to assume the standard header length, without any extensions. Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>

Now the check happens on the level of the core IP stack, in net_pkt_append(). Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>

pfalcon · 2017-11-09T20:50:55Z

@jukkar : This now should be ready for review/merge.

@tbursztyka : Based on the mailing list discussion, and the fact that #4243 relies on this, can you please re-review this?

[convert.sh] Fixed convert.sh to be more reliable on different OS's

nashif requested review from jukkar and tbursztyka May 8, 2017 13:50

tbursztyka requested changes May 8, 2017

View reviewed changes

jukkar added the net label May 12, 2017

jukkar reviewed May 15, 2017

View reviewed changes

pfalcon changed the title ~~net: net_pkt_append: Take into account MTU when adding data to a packet~~ [DRAFT, 1.9] net: net_pkt_append: Take into account MTU when adding data to a packet May 15, 2017

pfalcon mentioned this pull request May 16, 2017

net: context: Call send_cb for TCP after ACK is received #172

Closed

nashif added this to the v1.9 milestone May 17, 2017

pfalcon closed this Aug 1, 2017

nashif modified the milestones: v1.9, v1.9.0 Oct 3, 2017

pfalcon removed this from the v1.9.0 milestone Oct 25, 2017

pfalcon force-pushed the net-send-mtu branch from 47a9848 to f6b0f6d Compare October 25, 2017 15:12

pfalcon mentioned this pull request Oct 25, 2017

[WIP] RPL border router sample application #980

Closed

pfalcon force-pushed the net-send-mtu branch from f6b0f6d to fcf3f3e Compare October 31, 2017 18:22

pfalcon changed the title ~~[RFC] net: net_pkt_append: Take into account MTU when adding data to a packet~~ [RFC] net: net_pkt_append: Take into account MSS/MTU when adding data to a packet Nov 2, 2017

pfalcon mentioned this pull request Nov 3, 2017

net: tcp: Parse and store send MSS #4657

Merged

pfalcon force-pushed the net-send-mtu branch 4 times, most recently from d697089 to 240f3c7 Compare November 9, 2017 15:25

pfalcon force-pushed the net-send-mtu branch from 240f3c7 to 8b47dfc Compare November 9, 2017 18:20

net: sockets: sendto: Remove adhoc handling of max pkt payload size

ba53b3b

Now the check happens on the level of the core IP stack, in net_pkt_append(). Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>

pfalcon force-pushed the net-send-mtu branch from 8b47dfc to ba53b3b Compare November 9, 2017 18:57

pfalcon added this to the v1.10.0 milestone Nov 9, 2017

pfalcon changed the title ~~[RFC] net: net_pkt_append: Take into account MSS/MTU when adding data to a packet~~ net: net_pkt_append: Take into account MSS/MTU when adding data to a packet Nov 9, 2017

tbursztyka approved these changes Nov 10, 2017

View reviewed changes

jukkar merged commit 91e810f into zephyrproject-rtos:master Nov 10, 2017

jukkar mentioned this pull request Nov 10, 2017

[RFC] Split large TCP segments #1330

Closed

nagineni pushed a commit to nagineni/zephyr that referenced this pull request Nov 20, 2017

Merge pull request zephyrproject-rtos#119 from jprestwo/convert_fix

a3f9f9c

[convert.sh] Fixed convert.sh to be more reliable on different OS's

pfalcon mentioned this pull request Nov 21, 2017

net: pkt: Compute TX payload data length #5038

Merged

pfalcon mentioned this pull request May 21, 2018

Core Network Buffer management rework #7578

Closed

4 tasks

RoyAnupam mentioned this pull request Oct 7, 2020

BLE Connection fails to establish between two nRF52840-USB Dongles with Zephyr controller #29008

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

net: net_pkt_append: Take into account MSS/MTU when adding data to a packet #119

net: net_pkt_append: Take into account MSS/MTU when adding data to a packet #119

pfalcon commented May 8, 2017

pfalcon commented May 8, 2017

tbursztyka left a comment

pfalcon commented May 8, 2017

jukkar commented May 10, 2017

pfalcon commented May 10, 2017

pfalcon commented May 10, 2017

pfalcon commented May 10, 2017 •

edited

Loading

pfalcon commented May 10, 2017 •

edited by zephyrbot

Loading

pfalcon commented May 10, 2017

pfalcon commented May 10, 2017

tbursztyka commented May 10, 2017 •

edited

Loading

pfalcon commented May 10, 2017

pfalcon commented May 10, 2017

jukkar left a comment •

edited

Loading

tbursztyka commented May 16, 2017

pfalcon commented May 16, 2017

jukkar commented May 16, 2017

nashif commented Aug 1, 2017

pfalcon commented Aug 1, 2017

pfalcon commented Oct 25, 2017

pfalcon commented Oct 25, 2017

pfalcon commented Oct 31, 2017

pfalcon commented Oct 31, 2017

pfalcon commented Nov 9, 2017

net: net_pkt_append: Take into account MSS/MTU when adding data to a packet #119

net: net_pkt_append: Take into account MSS/MTU when adding data to a packet #119

Conversation

pfalcon commented May 8, 2017

pfalcon commented May 8, 2017

tbursztyka left a comment

Choose a reason for hiding this comment

pfalcon commented May 8, 2017

jukkar commented May 10, 2017

pfalcon commented May 10, 2017

pfalcon commented May 10, 2017

pfalcon commented May 10, 2017 • edited Loading

pfalcon commented May 10, 2017 • edited by zephyrbot Loading

pfalcon commented May 10, 2017

pfalcon commented May 10, 2017

tbursztyka commented May 10, 2017 • edited Loading

pfalcon commented May 10, 2017

pfalcon commented May 10, 2017

jukkar left a comment • edited Loading

Choose a reason for hiding this comment

tbursztyka commented May 16, 2017

pfalcon commented May 16, 2017

jukkar commented May 16, 2017

nashif commented Aug 1, 2017

pfalcon commented Aug 1, 2017

pfalcon commented Oct 25, 2017

pfalcon commented Oct 25, 2017

pfalcon commented Oct 31, 2017

pfalcon commented Oct 31, 2017

pfalcon commented Nov 9, 2017

pfalcon commented May 10, 2017 •

edited

Loading

pfalcon commented May 10, 2017 •

edited by zephyrbot

Loading

tbursztyka commented May 10, 2017 •

edited

Loading

jukkar left a comment •

edited

Loading