examples/ota: add documentation on how to deliver update via ethos #15
base: wip/rebase/ota_work_branch
Conversation
Answering here to your questions:
vagrant@vagrant:~$ ifconfig tap0
tap0 Link encap:Ethernet HWaddr d2:86:c8:f2:26:b3
inet6 addr: fe80::d086:c8ff:fef2:26b3/64 Scope:Link
inet6 addr: fe80::1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3 errors:0 dropped:0 overruns:0 frame:0
TX packets:15 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:186 (186.0 B) TX bytes:1642 (1.6 KB)
vagrant@vagrant:~$ route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.0.2.2 0.0.0.0 UG 0 0 0 enp0s3
10.0.2.0 0.0.0.0 255.255.255.0 U 0 0 0 enp0s3
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
Apparently I have no routes for IPv6 addresses. Did you run some of the tuntap scripts provided in RIOT? |
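(Note: route -n only lists the IPv4 routing table. The IPv6 routes can be listed with either of these standard Linux commands:)
route -A inet6 -n
ip -6 route show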
For completeness:
vagrant@vagrant:~/RIOT$ sudo sh dist/tools/ethos/start_network.sh /dev/ttyACM0 tap0 2001:db8::/64
net.ipv6.conf.tap0.forwarding = 1
net.ipv6.conf.tap0.accept_ra = 0
----> ethos: sending hello.
----> ethos: activating serial pass through.
----> ethos: hello reply received
----> ethos: hello received
main(): This is RIOT! (Version: 2018.07-devel-2297-g359ee-HEAD)
RIOT OTA update over CoAP example application
firmware: running from slot 1
Firmware magic_number: 0x544f4952
Firmware Version: 0x40920186
Firmware start address: 0x00001100
Firmware APPID: 0
Firmware Size: 66940
Firmware HASH: 52 43 83 1a d8 d4 d5 b0 29 db 39 c1 14 9d f0 d5 2f da 19 bf 77 f1 3d a5 14 f3 5d a3 cd 38 ed b4
Firmware chksum: 0x4749f0b9
Firmware signature: b9 1c 6b 32 84 fe e5 ce fb fd 86 a1 90 5e 31 c0 cf a1 9e 81 07 a5 c7 da d7 ea bb 12 b3 09 ce 48 fd 29 2d 60 d6 67 b4 fc ff 7f e5 9a 81 d8 be a9 96 05 33 87 50 3b 72 50 32 52 f5 3e e6 bf aa 06
Waiting for address autoconfiguration...
Configured network interfaces:
Iface 6 HWaddr: 25:A2 Channel: 26 Page: 0 NID: 0x23
Long HWaddr: 79:67:37:4E:0E:B8:A5:A2
TX-Power: 0dBm State: IDLE max. Retrans.: 3 CSMA Retries: 4
AUTOACK ACK_REQ CSMA MTU:1280 HL:64 IPHC
Source address length: 8
Link type: wireless
inet6 addr: fe80::7b67:374e:eb8:a5a2 scope: local VAL
inet6 group: ff02::1
Iface 7 HWaddr: 00:2E:CE:65:09:42
MTU:1500 HL:64 Source address length: 6
Link type: wired
inet6 addr: fe80::22e:ceff:fe65:942 scope: local
inet6 group: ff02::1 |
Sorry, for IPv6 the command is different:
vagrant@vagrant:~$ route -n6
Kernel IPv6 routing table
Destination Next Hop Flag Met Ref Use If
2001:db8::/64 fe80::2 UG 1024 0 0 tap0
fd00:dead:beef::1/128 :: !n 256 0 0 lo
fe80::/64 :: U 256 0 0 enp0s3
fe80::/64 :: U 256 1 5 tap0
::/0 :: !n -1 1 110 lo
::1/128 :: Un 0 3 23 lo
fd00:dead:beef::1/128 :: Un 0 1 0 lo
fe80::/128 :: Un 0 1 0 lo
fe80::1/128 :: Un 0 1 0 lo
fe80::a00:27ff:fed1:2804/128 :: Un 0 1 0 lo
fe80::d086:c8ff:fef2:26b3/128 :: Un 0 2 12 lo
ff00::/8 :: U 256 0 0 enp0s3
ff00::/8 :: U 256 1 3 tap0
::/0 :: !n -1 1 110 lo |
Hmm, this looks quite correct and similar to mine. The only important difference is that your device has 2 network interfaces (Iface 6 is the radio, Iface 7 the ethos). Could you try to temporarily disable interface 6? |
My setup:
Pc side:
|
Yep, that was the trick... However, we then need to be able to specify which interface is used for the tunneling; usually ethos is used to set up a border router, so the default route goes through the wireless interface. |
Sorry, my board (the saml21-xpro) does not have the radio, which is why I did not notice that. I will try to check whether I can find a way to force the interface. |
Yes, definitely. Back to the main topic, it seems something is not working...
vagrant@vagrant:~/RIOT$ sudo sh dist/tools/ethos/start_network.sh /dev/ttyACM0 tap0 2001:db8::/64
net.ipv6.conf.tap0.forwarding = 1
net.ipv6.conf.tap0.accept_ra = 0
----> ethos: sending hello.
----> ethos: activating serial pass through.
----> ethos: hello reply received
----> ethos: hello received
main(): This is RIOT! (Version: 2018.10-devel-849-g2e0d9-vagrant-HEAD)
RIOT OTA update over CoAP example application
firmware: running from slot 1
Firmware magic_number: 0x544f4952
Firmware Version: 0x41005359
Firmware start address: 0x00001100
Firmware APPID: 0
Firmware Size: 54752
Firmware HASH: 66 80 ca 04 67 5e 1c 2d 4a f1 e3 61 42 10 77 c5 7e 67 c2 c9 f6 5b 1c 09 ad 01 c6 b7 60 03 eb eb
Firmware chksum: 0x8fe042fb
Firmware signature: 58 b5 34 5c d9 72 c0 45 5f 30 95 42 22 af 38 63 33 4e 58 76 77 a8 b7 7d a8 86 1a 9b 03 d9 aa 3f c3 3b 66 b3 86 6a 4c 78 eb 8e d4 da 79 3f 6e d6 f1 2d 6a b3 46 f8 18 a2 15 8c 06 9e 42 12 d9 0c
Waiting for address autoconfiguration...
Configured network interfaces:
Iface 5 HWaddr: 00:2e:ce:65:09:42
MTU:1500 HL:64 Source address length: 6
Link type: wired
inet6 addr: fe80::22e:ceff:fe65:942 scope: local VAL
inet6 group: ff02::1
inet6 group: ff02::1:ff65:942
ota: received bytes 0-64
ota: initializing update to target slot 2
ota: received bytes 0-64
ota: initializing update to target slot 2
ota: received bytes 64-128
ota: received bytes 128-192
ota: received bytes 192-256
ota: verifying metadata ...
ota: verification successful
ota: received bytes 192-256 of 55008 (left=54752)
coap_ota_handler(): ignoring already received block
ota: received bytes 256-320 of 55008 (left=54688)
ota: processing bytes 256-319 page 521-0x20900
ota: received bytes 320-384 of 55008 (left=54624)
.
.
.
.
coap_ota_handler(): ignoring already received block
ota: received bytes 54912-54976 of 55008 (left=32)
coap_ota_handler(): ignoring already received block
ota: received bytes 54976-55008 of 55008 (left=0)
coap_ota_handler(): ignoring already received block
ota: image hash incorrect!
Do you also have lots of these "ignoring already received block" messages? |
Yes, I had many of those "ignoring" prints, but it always succeeded in the end. I supposed it was due to the slow link and UDP timeouts. |
I now have the same problems too, and after some research I noticed that only the first part of the flash gets written. For me it appeared to work because, by luck (actually bad luck), the flash already contained the image from an earlier manual run, and if you rebuild with only APP_VER modified you get essentially the same image by chance. If I completely erase the flash beforehand, I see the same problems you describe. |
The "ignoring already received block" is erroneous, actually the block is not "already received" but "out of sequence". Indeed if you look:
You see there is a packet (576 to 640) that gets lost! then after that it will not write anything to flash as it is out of sequence. But it is not "already received" but just out of sequence. Indeed the check in the code is:
We go in the "else" but not because we already received, but just because it is != because it is different, in this case block1.offset==640 while _state.writer.offset=576 So now the main problem is that a packet is sometimes lost. We could also of course do a better logging there. |
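For illustration, the check being described has roughly this shape (a sketch using the names quoted above; write_block() is a hypothetical helper and the real code in sys/net/application_layer/ota/coap.c differs in detail):

if (block1.offset == _state.writer.offset) {
    /* the expected next block: append it to flash */
    write_block(&_state.writer, payload, payload_len);
}
else {
    /* reached both for true duplicates (block1.offset < _state.writer.offset)
     * and for out-of-sequence blocks (block1.offset > _state.writer.offset,
     * i.e. an earlier packet was lost) */
    puts("coap_ota_handler(): ignoring already received block");
}

Splitting the else into the two cases (offset smaller vs. larger than expected) would at least make the log message accurate and would allow signalling the loss instead of silently dropping the rest of the transfer.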
Just an update: if you use a CoAP packet size (-b option) of 16 or 32 bytes, i.e. slow down the transfer, then it works correctly! (Can you confirm on your side, please?) I will now check the ethos/CoAP buffer sizes; the packet must be getting lost there. |
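For reference, with libcoap's coap-client the block size is set via -b and the payload file via -f; a transfer with 32-byte blocks looks roughly like this (the file name, the node address and the resource path are placeholders here, since the exact endpoint is not shown in this excerpt):

coap-client -m put -b 32 -f <firmware-image>.bin "coap://[<node-ipv6-address>]/<ota-resource>"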
I updated the doc with the smaller size on the command line. I'm still unsure where the packet gets lost: ethos has a 2 kB buffer, the STDIO UART a 64-byte one (I tried to make it bigger, that didn't change anything). Maybe it is indeed at a lower level. I'll keep investigating, since it may be a generic problem even if using a smaller CoAP packet size solves the issue here. I would still propose changing the coap_ota_handler's handling of erroneous packets a bit, as per my previous comment. |
@bergzand can you take a look at the coap_ota_handler? Are you using the same approach for block2 transfers? |
@fedepell unfortunately it didn't work with blocks of either 32 or 16 bytes :-( I'll try to completely erase my device and test again. You're testing this branch, aren't you? |
@kYc0o: very sorry for that :( I have no idea what's wrong then. We do have different boards, but I would still expect it to work.
|
I'm trying to debug the issue this morning. I created a fake firmware of all zeros, except that the first byte of each 64-byte block is a growing integer, so I can track the packets exactly. I added some logging code to the CoAP OTA handler.
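For illustration, such a test image can be produced with a small standalone helper along these lines (a sketch; the sizes are arbitrary and a real image would still need the firmware metadata header in front):

/* gen_fake_fw.c: all zeros, except the first byte of every 64-byte block,
 * which carries a growing counter so each block can be identified on the wire */
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 64
#define NUM_BLOCKS 1000   /* arbitrary image size: 64000 bytes */

int main(void)
{
    unsigned char block[BLOCK_SIZE];
    FILE *out = fopen("fake_firmware.bin", "wb");
    if (out == NULL) {
        return 1;
    }
    for (unsigned i = 0; i < NUM_BLOCKS; i++) {
        memset(block, 0, sizeof(block));
        block[0] = (unsigned char)i;   /* block counter (wraps at 256) */
        fwrite(block, 1, sizeof(block), out);
    }
    fclose(out);
    return 0;
}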
Now one very interesting thing: I tried to enable the debug in nanocoap.c to trace the lost packet with:
And this gives me much more information. But what is very interesting is... with debug enabled, no more packets are lost! (@kYc0o maybe you can confirm, out of curiosity, whether the same happens on your side?) Actually, just putting in a single line to trace:
in nanocoap.c just before:
will mask the problem. Now I will try to find out what changes without the debugging... |
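(For reference, RIOT sources normally gate their debug traces like this, so enabling the nanocoap output is just a matter of flipping the define near the top of nanocoap.c; the exact extra trace line mentioned above is not reproduced in this excerpt:)

#define ENABLE_DEBUG (1)   /* normally (0) */
#include "debug.h"
/* ... all DEBUG("...") statements in the file then become active ... */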
Another possibly interesting note: setting the serial speed to 9600 (you have to do it both in the Makefile of the ota application and in the start_network.sh script) will also mask the problem. |
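(On the application side, the baudrate is typically exposed as a Makefile variable; ETHOS_BAUDRATE is the name used by other ethos-based RIOT examples, so whether this example's Makefile already has it is an assumption -- and the same value then has to be used for the serial setup in start_network.sh:)

# in the ota application's Makefile
ETHOS_BAUDRATE ?= 9600
CFLAGS += -DETHOS_BAUDRATE=$(ETHOS_BAUDRATE)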
It seems the packet is corrupted on the serial/ethos side. A sane packet of the sequence (for 64 bytes of payload data) is 172 bytes long, for example (the 04 just before the long sequence of zeros is my growing counter):
While the packet that gets "lost" is shorter, sometimes by 1 byte, sometimes by more; for example, in this run it was 170 bytes:
This packet is then discarded and you get the out-of-sequence problem. Of course the main issue to address now is why this gets lost on the serial link, and I will check further (but not today anymore :( ). The fact that a slower baud rate / smaller packet size makes the problem disappear or become less visible also makes sense. But a general question: don't we address the possibility of a CoAP UDP packet being lost (as in this case) at all, and ask for it to be resent? I think this is important when developing an OTA update. Right now it seems to me that if a packet is lost the whole process collapses (actually it keeps going, discarding packets). |
Just as additional info, a screenshot from the Wireshark session. You can see that packet number 10 is the one that gets lost: there is no ACK/Continue for it (it jumps from 9 to 11), and there is no attempt to recover or, at that point, to simply close the whole session, which is useless anyway by then. |
Wow! Thank you so much for the detailed information, it's very useful. Now, that makes me think that there might be some bugs in the nanocoap block1 implementation, since it wasn't thoroughly tested. I'd like to have advice from @kaspar030 and @kb2ma who are more familiar with the code. Do you guys see something strange? I'll make more tests on my side taking into account your suggestions, and also try to capture some traces from wireshark. Thank you again for all the effort! |
Let me make sure I understand the problem. The coap-client (coap-cli?) process is pushing firmware blockwise (block1) to a nanocoap server implemented in sys/net/application_layer/ota/coap.c, right? How is coap-client implemented? One or more packets from coap-client are lost on the way to nanocoap. However, neither coap-client nor coap.c does anything about it. I see two problems.
Let me know if I've missed something, and we can take it from there. |
@kb2ma: Thanks a lot for the feedback. I'm in no way an expert in this, but I'll try to explain what I can see:
Now, if the packet is indeed "already received" (i.e. a duplicate), I agree it can simply be ignored; but as the code is written now, a future packet (meaning we are missing one, as in the case described) is also ignored. |
It's libcoap; in my case the develop branch at the latest commit. @kb2ma, I asked for your knowledge of whether something is going wrong on the coap or nanocoap side, since I'm aware you're not familiar with the code in ota/coap.c. Now that @fedepell has explained it, I guess the problem is in that part. Anyway, if you have any hints they would be highly appreciated! |
OK, just "@" me if you have specific questions. It sounds like @fedepell is on the right track. On the RIOT server side it looks like it's an application issue to track when it does not receive the next expected block -- or maybe there is some CRC/hash validation at the end. I'm confused on the client side though. libcoap recognizes when it does not get a response to the verification packet (due to the delay) and retries. However, it does not seem to recognize when it does not get a response to the missing packet (#10 in the screenshot), and doesn't retry. Maybe libcoap is not looking at the block number in the response. Also, I'm surprised the request/response block numbers don't match up in the Wireshark screenshot. I would expect to see the ACK for block #10 right after the CON for block #10. Maybe nanocoap also responds to all of the retries for the delayed verification block? I can't see enough of the Wireshark output to say. If you want to forward the Wireshark pcap I can take a closer look and compare to RFC 7959. |
@kb2ma: thanks a lot for your help! Please find attached the pcap. You will notice that at packet 3 there is no response for some time, so coap-client resends the packets; during that time RIOT is doing the signature verification, which is rather slow, so it is stalled while coap-client keeps resending. (I had to zip the pcapng file as GH does not support that extension.) Just a side note to help follow the packets: except for the first packet, which contains the update information, and the second, which contains the header data, all the others are filled with zeros except the first byte, which is an integer growing each time. I did this to simplify tracking lost data over the channel, so don't worry if the update looks quite "zeroish" ;) |
Also pinging @bergzand for any ideas since he implemented block2. |
Somehow this rings a bell. @bergzand didn't we see this before? |
Thanks for the pcap, @fedepell. Here are my reactions.
Pkt. 13 -- libcoap retries msgid 29549. A CoAP server should respond with an empty ACK if it can't send a piggybacked ACK with the full response in a timely manner. See 7252, sec 5.2.2. nanocoap (and gcoap) should do this, but don't. This also might help libcoap correlate the responses.
Pkt. 21 -- nanocoap ACKs msgid 29549 again. This is appropriate. 7252, sec 4.5: "The recipient SHOULD acknowledge each duplicate copy of a Confirmable message using the same Acknowledgement or Reset message but SHOULD process any request or response in the message only once."
Pkt. 22 -- libcoap sends msgid 29551 for block 5. This is wrong IMO. It has not received the ACK for 29550 yet. I wonder if it is misinterpreting Pkt. 21 by only matching on token (length is zero, so everything matches) rather than also verifying msgid. libcoap really should disregard Pkt. 21 and the other duplicate ACKs for 29549. See 7252, sec. 5.3.2 -- an ACK MUST match the message ID and token.
Ideas:
|
@kb2ma: Thanks for the detailed analysis! I did a session with high verbosity and your point 3 seems to be confirmed by the output here (my comments are on separate lines in [ ] brackets):
Hope things are readable with this ugly formatting! |
Let me catch up on the backlog here.
This is all still using only the block1 code, right? Did I miss something in reading this PR? |
Very familiar. @fedepell, if I remember correctly (it has been a few months since I worked on this), what can happen is that after the verification is done, a CoAP ACK is transmitted back to the coap-client. The coap handler then starts processing the next packet in the queue. This is again bytes 192-256, since one of the duplicates transmitted by coap-client is still in a buffer somewhere. This packet is of course discarded as a duplicate by the handler, and a response is transmitted back to the client to continue. At this point a "bug" occurs in libcoap where this continue is mismatched as the response to the newly transmitted bytes 256-320. Somehow this causes the coap handler on the node to never receive that packet (I don't remember exactly what happens here), while the coap-client application continues with bytes 320-384.
If my memory serves me right this is spot on! |
I think the RFC is also not completely clear on whether the client should match block numbers in requests and responses. |
Just to wrap up how the problem on the coap-client side is generated:
1. While RIOT is verifying the metadata it does not answer for a while, so coap-client retransmits the same block several times and RIOT later ACKs each duplicate, producing a few "extra" ACKs.
2. A packet is lost on the channel.
When a packet is lost (step 2), it is not retransmitted, because one of the extra ACKs from a previously retransmitted packet is "used" to match the lost packet, and the transfer simply continues without the retransmission. Now, the first point always happens, since RIOT/OTA is not responding while it is doing the verification (so usually, at least on my SAML, 3-4 extra ACKs are generated). The second case happens very often when we use ethos, as there seems to be some other underlying problem there (my feeling is that while flash is being written the UART buffer either gets full or something even nastier happens). Of course this case may happen in general: also over wireless or GPRS or similar links (OTA indeed) UDP packets may be delayed and lost. With ethos we just have a quite easy way to reproduce this delay+loss. |
I've had some concurrency issues with ethos before (RIOT-OS#9890), but there the whole ethos quits functioning. Maybe you could try that and check whether it solves some of the stability issues for you. Note that during the cryptographic operations the whole coap server on the node is unresponsive, but packets are still received (since that happens in different threads).
@fedepell, your wrap-up sounds right to me. I think there are two things to follow up on with CoAP.
|
Well, we considered that. The problem is that without serious refactoring of the code, this is gonna be difficult. From within the handler, nanocoap doesn't know its peer... |
You're right, I had not considered that. gcoap has the same limitation because it uses the same coap_handler_t callback for request handling.
As a hack/experiment, we could add a 'remote_client' sock_udp_ep_t* attribute to coap_pkt_t until we devise a general solution.
For the general solution, it might make sense to update the coap_handler_t callback to accept a more structured context object rather than the current resource context void*. A structured context object (call it coap_response_memo_t) could include the current resource.context as well as the remote peer sock_udp_ep_t*, and whatever else we need to add in the future.
In the other direction, gcoap uses the gcoap_request_memo_t for a client to remember what it sent to the server. In fact, RIOT-OS#9857 plans to add gcoap_request_memo_t to the gcoap response handler callback. |
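As a rough sketch of what such a structured context could look like (purely illustrative; apart from sock_udp_ep_t, coap_pkt_t and the existing resource context pointer, nothing here exists in nanocoap/gcoap -- coap_response_memo_t is just the name suggested above):

#include "net/sock/udp.h"

/* hypothetical structured context passed to request handlers */
typedef struct {
    void *context;          /* what coap_resource_t.context provides today */
    sock_udp_ep_t *remote;  /* remote peer, so a handler could later send a
                             * separate (confirmable) response on its own */
    /* room for whatever else is needed in the future */
} coap_response_memo_t;

/* a handler would then take the structured context instead of a bare void *:
 * ssize_t handler(coap_pkt_t *pkt, uint8_t *buf, size_t len, coap_response_memo_t *memo);
 */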
Thanks a lot @kb2ma, @kaspar030 and @bergzand for your insights! Let me know if I'm wrong, but for me (and for our current use case) the "problem" is that we pause the block1 transfer for too long and thus provoke undesired behaviour which leads to packet loss and sequence mismatching. Though I acknowledge the problems found, I think our first patch should be to process the first part of the binary (a.k.a. the metadata) separately from the firmware, e.g. in two separate transfers. Anyway, that is how SUIT does it, and I think we'll follow that direction by default. To summarize, how bad is it to make a CoAP block1 transfer wait for a "long time" before it's considered a timeout? I know that in our case it is not a timeout, since as @bergzand said the other threads continue to receive the packets; however, we still get the problems described here. |
Just one question. @kb2ma said:
If this were true, i.e. coap-client only continued with the next message ID once it had received the ACK for the previous one, then the whole process should work correctly, no? So my question is: isn't what @kb2ma mentions necessary? While I must admit I haven't read the RFC (yet), it sounds strange to me that we have message IDs if nobody matches them against the ACKs at all. |
In the pcap @fedepell sent, the ACK from nanocoap is stamped 21 seconds after the initial PUT from libcoap. During that time period libcoap has sent 3 retransmissions. IMO, it's a mistake to rely on the retransmit mechanism to recover for a delay of this length. The user expects a firmware update to be reliable, and this is flaky from the outset without even considering that the wireless link itself may be challenging.
I agree that this is a good choice. Let's take the time to implement separate confirmable responses and understand what's happening with libcoap. Then we can recombine these two steps again into a single, reliable transfer. |
I see two problems in the pcap you sent:
If either of these two shortcomings were addressed, the requests/responses should stay synchronized -- aside from the separate issue of RIOT not sending the ACK for one block (block 14 in the pcap you sent). |
@kb2ma : Thanks a lot for the info!
I would say (generally speaking) both should be fixed, so as to make both sides (libcoap on one side and nanocoap on the other) theoretically interoperable with any other CoAP counterpart, no? My question was actually aiming at this: fixing just one would suffice, at least for now, to continue, and it sounds to me that fixing libcoap could be easier than fixing nanocoap (maybe naive thinking :) ). As for block 14: that is the packet that RIOT never gets, due to data lost on the channel. That is what actually breaks the complete update (the Wireshark log was taken on the PC side); if that weren't missing, it would just keep gluing blocks together correctly despite the ACKs being out of sync. There, if everything worked, the communication should have stopped, the packet should have been resent, and so on. |
Thinking about this a little more. Even if there are two separate transfers, isn't there still an issue with the processing delay? Certainly it would be better if this occurred while preparing the ACK for the last packet of the first transfer. I would still expect libcoap to retransmit the last packet though while waiting for the ACK. [ just edited last sentence --libcoap, not nanocoap ] |
Yes, I think that fixing either one would be sufficient. I don't know which would be easier. I suggest sharing your pcap with the libcoap project. My experience is that @obgm is responsive to good questions. Copy me, and I'll jump in if necessary. :-) |
Done (obgm/libcoap#270) ! ;) |
@kYc0o if you want to test with libcoap/coap-client from obgm/libcoap#271 it works much better now. |
I'll try to take a look this weekend. |
Sorry for the long wait, I was very busy lately. The upstream changes were merged, so I'll try to test asap with the current upstream master of libcoap. |
@kYc0o Be aware that the libcoap changes are still in the develop branch, not master. |
Oh! Thanks @obgm, any ETA for the change being merged into master or for the next release? |
We are currently preparing 4.2.0-RC3, final release hopefully still in 2018. |
As the base branch for this PR is quite outdated, I guess there's maybe no point in merging it. Tests pass so far. @fedepell, are you still using something from here? |
@kYc0o: no problem, we can indeed discard this branch. Once OTA gets implemented in master I could add this doc, as I think it is very useful for fast testing (or for boards like my SAML21-XPRO without wireless/net), but for the time being there is not much to do, I guess. |
Contribution description
Add documentation on how the OTA procedure can be tested by delivering the update via the serial interface using the ethos driver.
Issues/PRs references
See comments in RIOT-OS#9969
Replaces #12 as that became a bit messy because of rebasing :)