IP stack can't recover from a packet overload #7571
Comments
Can you include a copy of the .config for the app you have loaded, as well as the buffer settings you've already tried raising? |
I tried bumping all values I could find in the networking area without any noticeable effect but I'm happy to try specific changes you'd like me to try. |
The key value for RX buffers would be: Also, enable the following debugs: Once added, you can use the following 2 commands for debugging in the console shell:
|
I think we're actually not freeing back up buffers on an error path, so they're staying allocated. |
I'll try that tomorrow in the office. |
I've put a full capture of the debugging log as well as the commands at the end here: https://gist.github.com/therealprof/6a7263c6633b461be79ad270636b0468 Increase of the RX buffers made no difference. |
Wow, there are packets in your log which are running > 1KB data size. Is the eth port in promiscuous mode? Can you run "net allocs" once you run out of buffers? |
No, just a regular office network.
It's at the bottom of the log. |
I think you are right here. Increasing the number of bufs does not really help but just postpones the inevitable out-of-bufs issue. |
Are you actively sending something to zephyr device, or is it mostly some multicast / broadcast traffic? |
According to the log, you are not running out of buffers. The "net allocs" command shows that all the buffers are freed just fine. Also according to the log, the device never fully ran out of buffers. |
My bad, there were out-of-buffers situations in the log: [net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():359: Failed to get free buffer but according to the log, the system recovered from that successfully. |
Well, yes, I'm trying to use LwM2M but most of the traffic is actually broadcasts in the network.
No it does not. The application (lwm2m-client) doesn't receive any traffic any more after one of those messages. Actually earlier, but when this message appears you know it has happened and it's time for a reboot. I am very convinced it is due to all the traffic in the network because the same device will run much longer in a different environment, e.g. my home. At the office it sometimes even fails to register at the lwm2m server, and at the latest 3 minutes after boot it is not communicating anymore. |
Hmm, RX is going lower and lower, and worse, RDATA is very close to 0 (which is why it is sometimes unable to get enough frags for a received packet). It looks like the host does not have enough time to consume received packets. Sounds like the RX queue may have too low a priority. @jukkar how can this be tweaked now? (We used to have one RX thread, but now it's slightly different as we have RX queue(s).) Also, frags are 128 bytes, which is quite small for Ethernet. Could you try raising the CONFIG_NET_BUF_DATA value, to 512 for instance? (You could reduce CONFIG_NET_BUF_RX_COUNT at the same time; you won't need that many anymore.) |
Note that, even if rx queue priority could fix your issue, there is no buffer bloat mitigation logic in Zephyr's net stack. So it is still possible to flood it (knowing that it runs on a board with a limited amount of RAM with a bearer like Ethernet) |
So the warnings disappear but the data flow still stops. Maybe there's some other foul play here. |
@tbursztyka I guess the core problem is that the ethernet driver can't know if the packet is wanted or unwanted until the fully formed(allocated) packet+buffers reaches the ip management / connection code ... which could certainly lead to flooding as you mentioned if the RX queue can't process the packets fast enough to free back up the net_pkt/net_bufs for more incoming packets. Examining the ethernet case: could we send the "destination MAC" from the very first buffer to a new L2 (ethernet specific) function which could determine if the entire frame could be tossed away instead of allocating any more buffers. The downside is that each ethernet driver would need to make that call prior to calling net_recv_data. |
@mike-scott What you describe is the promiscuous mode. Afaik most controllers, if not all in fact, filter packets on the dst MAC: if a packet is not for them, they drop it. That said, if the feature is not enabled or badly configured at driver level, then that won't work. Could be the case here, I don't know. @therealprof The frag size tweaking is just to use memory in a slightly better way (fewer frags, so less overhead from the net_buf struct), but there is still the issue of RX packet consumption. |
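To make the destination-MAC idea from the two comments above concrete, here is a minimal sketch in plain C. It is not Zephyr code: `frame_is_for_us`, `ETH_ADDR_LEN` and the `accept_multicast` flag are hypothetical names made up for illustration, and a real controller would normally do this check in hardware when promiscuous mode is off.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_LEN 6

/* Hypothetical helper (not a Zephyr API): look only at the destination MAC,
 * i.e. the first 6 bytes of the frame, and decide whether it is worth
 * allocating any more buffers for the rest of the frame. */
static bool frame_is_for_us(const uint8_t *frame, size_t len,
                            const uint8_t own_mac[ETH_ADDR_LEN],
                            bool accept_multicast)
{
    static const uint8_t broadcast[ETH_ADDR_LEN] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    };

    if (len < ETH_ADDR_LEN) {
        return false; /* too short to even carry a destination address */
    }
    if (memcmp(frame, own_mac, ETH_ADDR_LEN) == 0) {
        return true;  /* unicast addressed to this interface */
    }
    if (memcmp(frame, broadcast, ETH_ADDR_LEN) == 0) {
        return true;  /* broadcast must be kept (ARP and similar) */
    }
    if (frame[0] & 0x01) {
        /* Group bit set: multicast. Assumed policy flag, for illustration. */
        return accept_multicast;
    }
    return false;     /* unicast for some other station */
}
```

Note that broadcast frames still pass this check, so a filter like this would not help on a network where the load is mostly broadcast traffic, as in the report.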
@tbursztyka 1KB broadcast packets seem large. I guess we would need some debugging to make sure these packets are valid. |
I meant to say that from a memory point of view the system recovers from this low memory issue just fine (according to the provided logs). It is a totally different matter what happens in other parts of Zephyr (be it in the IP stack or the application). |
I would argue that the system should work better with 128-byte buffers than with 512. This depends on the network traffic of course, but with smaller buffers the network packets probably fit better into net_buf, so we do not run out of buffers as easily. Note that we have a samples/net/throughput_server sample app for measuring UDP traffic throughput; you could try that one too and see what happens with your board. See the readme file in that sample for usage details. |
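The 128 versus 512 byte trade-off discussed above comes down to simple arithmetic: fewer fragments per large packet versus more unused space per small packet. A small sketch in plain C (the function names are hypothetical, not Zephyr APIs, and per-fragment struct overhead is not modelled):

```c
#include <stddef.h>

/* Number of fixed-size fragments needed to hold pkt_len bytes
 * (ceiling division). frag_size must be non-zero. */
static size_t frags_needed(size_t pkt_len, size_t frag_size)
{
    return (pkt_len + frag_size - 1) / frag_size;
}

/* Bytes allocated but left unused in the last fragment. */
static size_t tail_waste(size_t pkt_len, size_t frag_size)
{
    return frags_needed(pkt_len, frag_size) * frag_size - pkt_len;
}
```

For example, a 1024-byte packet takes 8 fragments at 128 bytes but only 2 at 512 bytes, while a 60-byte frame leaves 68 bytes unused in a 128-byte fragment and 452 bytes unused in a 512-byte one. Which setting wins depends on the packet size mix on the network.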
We cannot prevent flooding but we should recover from it of course. For testing this one can use the throughput_server app I mentioned earlier. |
In order to get some idea what goes wrong and in what part of the system, could you enable some debugging options and post the log somewhere.
That setting will print a lot of stuff and will require a lot of flash, so we might need to tweak that a bit, but we can start with this one. |
No worries,
However I can only test next Wednesday (if I don't forget), won't be at the office before. |
Was that for lwm2m_client? When I compile for nucleo_f429zi I get
Only thing I added to prj.conf file in that sample was the CONFIG_NET_LOG_GLOBAL=y |
Instead of waiting forever for a network buffer, have a timeout when allocating net_buf. This way we cannot left hanging for a long time waiting for a buffer and possibly deadlock the system. This commit only adds checks to core IP stack in subsys/net/ip Fixes zephyrproject-rtos#7571 Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Instead of waiting forever for a network buffer, have a timeout when allocating net_buf. This way we cannot left hanging for a long time waiting for a buffer and possibly deadlock the system. This commit adds checks to L2 and network support libraries. Fixes zephyrproject-rtos#7571 Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Instead of waiting forever for a network buffer, have a timeout when allocating net_buf. This way we cannot left hanging for a long time waiting for a buffer and possibly deadlock the system. This commit adds checks to L2 and network support libraries. Fixes #7571 Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
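The idea in these commit messages, giving up after a bounded wait instead of blocking forever on an empty pool, can be illustrated with a tiny self-contained simulation. This is a sketch of the general pattern only, not the code from the commits: `buf_pool`, `pool_try_alloc`, `pool_alloc_timeout` and the `wait_one_tick` callback are all hypothetical names, and the "timeout" is modelled as a maximum number of attempts.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical fixed-size pool: one flag per buffer slot. */
struct buf_pool {
    bool in_use[POOL_SIZE];
};

/* Non-blocking attempt: returns a slot index, or -1 if the pool is empty. */
static int pool_try_alloc(struct buf_pool *p)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Return a slot to the pool; out-of-range indexes are ignored. */
static void pool_free(struct buf_pool *p, int idx)
{
    if (idx >= 0 && idx < POOL_SIZE) {
        p->in_use[idx] = false;
    }
}

/* Bounded wait: try at most max_tries times, calling the optional
 * wait_one_tick callback between attempts. Returns -1 on timeout so the
 * caller can drop the packet instead of hanging. */
static int pool_alloc_timeout(struct buf_pool *p, unsigned max_tries,
                              void (*wait_one_tick)(void *), void *arg)
{
    for (unsigned n = 0; n < max_tries; n++) {
        int idx = pool_try_alloc(p);

        if (idx >= 0) {
            return idx;
        }
        if (wait_one_tick != NULL) {
            wait_one_tick(arg);
        }
    }
    return -1;
}
```

The point of the pattern is that the failure path returns control to the caller, which can then drop the packet and release whatever it already holds, rather than staying blocked while holding other resources.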
It's not proven that 7c7cfdd fixes this, reopening. |
@therealprof: Are you back from vacation now? Can you please retest this to see if there're any improvements? (With my usual tests, I don't see any so far, so we should keep pulling on this thing.) |
@pfalcon Yes I am. Unfortunately I don't have the Nucleo-F429ZI here with me in the "hostile" environment and the FRDM-K64F build is broken, too. :( |
and the FRDM-K64F build is broken, too. :(
Can you elaborate on this?
|
@therealprof, Btw, if you still experience the issue, I recommend applying and enabling (CONFIG_SYS_LOG_DBG_ERR=y) my cute patch: #8769, and seeing if there's any correlation between the state it gets into and the logging printed. |
@pfalcon I tried on my FRDM_K64F in the office today (aka hostile networking environment) and it was running extremely unstable (lots of buffer problems and faults: "Imprecise data bus error") though I'm not convinced it is the same issue as before. I guess I need to crank up the logging and try again on Monday. |
Can you please give more details on how to reproduce it? I assume it's still the lwm2m_client sample. I have neither busy Ethernet framework nor even a LwM2M server here, and I don't get faults running from today's master. Perhaps the faults depend exactly on the network traffic you have? Any clarifications/steps to reproduce would be appreciated.
|
@pfalcon The new crashes I've been seeing seem to be mostly unrelated to general network traffic; at least a Nucleo board will continue to run for a whole day just fine. However, the FRDM will reliably crash all the time, and I can easily speed it up with LwM2M operations:
Note that there's a delay of a few seconds before the last abort; 0x15210 is sys_dlist_remove. It is a bit nasty to debug in normal mode, I'm only getting:
And of course once I set options for "better debugging experience" and fire it up in the debugger it chugs along just fine. |
I'm running this on 3b80998, BTW. |
I'm running 3b80998 too, as can be seen from log above. But for me, memcpy is at 0x5836. I wonder if going for that "reproducible build" stuff would be actually useful for us. You run SDK 0.9.3, don't you? |
(Just had my laptop battery crash, so will write many small msgs.) I also see that the faults happen with real LwM2M interaction, and I don't have a setup to reproduce it. But such kinds of faults look suspiciously stack-related, so I'd suggest bumping the stack size, and maybe not just main, but others too. |
Nope, never have, never will unless you're going to release it for macOS... |
Ok, so we're in a typical situation of non-reproducibility, on multiple levels ;-). I guess going as far as exchanging binaries won't do much good, so we should just assume that this issue is not fixed for 1.13, and keep it open. When you are able to test your original scenario with the Nucleo-F429ZI, please post the results. I'll approach the same issue from my side (e.g. #3132, #7831). |
I guess this will need lowering of priority to not serve as a release blocker. |
@pfalcon Oh sorry, as mentioned above my Nucleo seems to run stable now in a heavy network. |
Oops, might have read that by diagonal. Well, great news then, @jukkar's efforts weren't in vain! I can only suggest to close this ticket then, other issues like frdm_k64f faults can be pursued elsewhere. |
Sure, works for me. You might want to remove the LwM2M label too... |
I've connected my Nucleo-F429ZI to a reasonably busy network (i.e. a constant flow of broadcast messages which also end up at the network interface of the MCU), and it very soon ends up receiving more packets than it can handle. The main problem is that even when the load lightens, the network application no longer receives any packets, which renders the system useless.
After the first occurrence of this error message the application is dead:
I tried increasing buffers and stack but to no avail. Not sure which information I can provide to aid in debugging but it is 100% reproducible (and hence very annoying ;) ):