Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLE commissioning fails after unauthenticated session changes #9738

Closed
bzbarsky-apple opened this issue Sep 15, 2021 · 8 comments · Fixed by #9792
Closed

BLE commissioning fails after unauthenticated session changes #9738

bzbarsky-apple opened this issue Sep 15, 2021 · 8 comments · Fixed by #9792
Milestone

Comments

@bzbarsky-apple
Copy link
Contributor

Problem

Using chip-tool to do BLE commissioning of an m5stack fails. On-network commissioning works, which is why CI is not affected.

Commissioner log:
�[0;32m[1631738075871] [54239:55735772] CHIP: [CTL] Read local id 0x000000000001B669, remote id 0x417EBD869B65B3AE�[0m
�[0;32m[1631738075872] [54239:55735772] CHIP: [CTL] Generating NOC�[0m
�[0;32m[1631738075872] [54239:55735772] CHIP: [CTL] Generating ICAC�[0m
�[0;34m[1631738075873] [54239:55735772] CHIP: [DL] _Init�[0m
�[0;34m[1631738075873] [54239:55735772] CHIP: [IN] UDP::Init bind&listen port=5541�[0m
�[0;34m[1631738075873] [54239:55735772] CHIP: [IN] UDP::Init bind&listen port=5541�[0m
�[0;34m[1631738075873] [54239:55735772] CHIP: [IN] TransportMgr initialized�[0m
�[0;34m[1631738075873] [54239:55735772] CHIP: [DIS] Init fabric pairing table with server storage�[0m
�[0;32m[1631738075873] [54239:55735772] CHIP: [ZCL] Using ZAP configuration...�[0m
�[0;32m[1631738075873] [54239:55735772] CHIP: [ZCL] deactivate report event�[0m
�[0;32m[1631738075873] [54239:55735772] CHIP: [DIS] Verifying the received credentials�[0m
�[0;32m[1631738075874] [54239:55735772] CHIP: [DIS] Added new fabric at index: 1, Initialized: 1�[0m
�[0;32m[1631738075874] [54239:55735772] CHIP: [DIS] Assigned compressed fabric ID: 0x68E9BD15864D25AB, node ID: 0x000000000001B669�[0m
�[0;32m[1631738075874] [54239:55735772] CHIP: [CTL] Joined the fabric at index 1. Compressed fabric ID is: 0x68E9BD15864D25AB�[0m
�[0;32m[1631738075874] [54239:55735775] CHIP: [CTL] Generated random node id: 0x5DA5575188D8D04C�[0m
�[0;32m[1631738075875] [54239:55735775] CHIP: [BLE] NewConnection�[0m
�[0;34m[1631738075880] [54239:55735777] CHIP: [BLE] CBManagerState: ON�[0m
�[0;34m[1631738075882] [54239:55735775] CHIP: [SC] Assigned local session key ID 11�[0m
�[0;32m[1631738075882] [54239:55735775] CHIP: [IN] Build plaintext message 0x7fc279009258 to 0x0000000000000000 of type 32 and protocolId 0 on exchange 31334 with MessageId 467623754.�[0m
�[0;32m[1631738075882] [54239:55735775] CHIP: [IN] Sending plaintext msg 0x7fc279009258 to 0x0000000000000000 at utc time: 7218694211 msec�[0m
�[0;32m[1631738075882] [54239:55735775] CHIP: [IN] Sending msg on generic transport�[0m
�[0;34m[1631738075882] [54239:55735775] CHIP: [IN] Message appended to BLE send queue�[0m
�[0;34m[1631738075882] [54239:55735775] CHIP: [SC] Sent PBKDF param request�[0m
�[0;32m[1631738075935] [54239:55735775] CHIP: [BLE] Connecting to device with discriminator: 3840�[0m
�[0;34m[1631738076364] [54239:55735775] CHIP: [IN] BleConnectionComplete: endPoint 0x10b285880�[0m
�[0;32m[1631738076533] [54239:55735777] CHIP: [BLE] subscribe complete, ep = 0x10b285880�[0m
�[0;32m[1631738076534] [54239:55735777] CHIP: [BLE] peripheral chose BTP version 3; central expected between 2 and 3�[0m
�[0;32m[1631738076534] [54239:55735777] CHIP: [BLE] using BTP fragment sizes rx 20 / tx 128.�[0m
�[0;32m[1631738076534] [54239:55735777] CHIP: [BLE] local and remote recv window size = 3�[0m
�[0;34m[1631738076534] [54239:55735777] CHIP: [IN] BLE EndPoint 0x10b285880 Connection Complete�[0m
�[0;32m[1631738077253] [54239:55735777] CHIP: [EM] Received message of type 0x21 with vendorId 0x0000 and protocolId 0x0000 on exchange 31334�[0m
�[1;31m[1631738077253] [54239:55735777] CHIP: [IN] Failed to decode EncryptedPacketBufferHandle header with error: ../../../examples/chip-tool/third_party/connectedhomeip/src/transport/raw/MessageHeader.cpp:154: CHIP Error 0x000000A7: Version mismatch�[0m
�[1;31m[1631738077253] [54239:55735777] CHIP: [EM] CHIP MsgId:1BDF5F4A not in RetransTable�[0m
�[1;31m[1631738080869] [54239:55735775] CHIP: [IN] Failed to decode EncryptedPacketBufferHandle header with error: ../../../examples/chip-tool/third_party/connectedhomeip/src/transport/raw/MessageHeader.cpp:154: CHIP Error 0x000000A7: Version mismatch�[0m
�[0;32m[1631738080869] [54239:55735775] CHIP: [IN] Sending plaintext msg 0x7fc279009258 to 0x0000000000000000 at utc time: 7218699197 msec�[0m
�[0;32m[1631738080869] [54239:55735775] CHIP: [IN] Sending msg on generic transport�[0m
�[0;34m[1631738080869] [54239:55735775] CHIP: [EM] Retransmit MsgId:00000000 Send Cnt 1�[0m
�[1;31m[1631738081124] [54239:55735777] CHIP: [IN] Failed to decode EncryptedPacketBufferHandle header with error: ../../../examples/chip-tool/third_party/connectedhomeip/src/transport/raw/MessageHeader.cpp:154: CHIP Error 0x000000A7: Version mismatch�[0m
�[0;32m[1631738081124] [54239:55735777] CHIP: [IN] Sending plaintext msg 0x7fc279009258 to 0x0000000000000000 at utc time: 7218699453 msec�[0m
�[0;32m[1631738081124] [54239:55735777] CHIP: [IN] Sending msg on generic transport�[0m
�[0;34m[1631738081124] [54239:55735777] CHIP: [EM] Retransmit MsgId:00000000 Send Cnt 2�[0m
�[1;31m[1631738081377] [54239:55735778] CHIP: [IN] Failed to decode EncryptedPacketBufferHandle header with error: ../../../examples/chip-tool/third_party/connectedhomeip/src/transport/raw/MessageHeader.cpp:154: CHIP Error 0x000000A7: Version mismatch�[0m
�[0;32m[1631738081377] [54239:55735778] CHIP: [IN] Sending plaintext msg 0x7fc279009258 to 0x0000000000000000 at utc time: 7218699706 msec�[0m
�[0;32m[1631738081377] [54239:55735778] CHIP: [IN] Sending msg on generic transport�[0m
�[0;34m[1631738081378] [54239:55735778] CHIP: [EM] Retransmit MsgId:00000000 Send Cnt 3�[0m
�[1;31m[1631738081637] [54239:55735777] CHIP: [IN] Failed to decode EncryptedPacketBufferHandle header with error: ../../../examples/chip-tool/third_party/connectedhomeip/src/transport/raw/MessageHeader.cpp:154: CHIP Error 0x000000A7: Version mismatch�[0m
�[1;31m[1631738081637] [54239:55735777] CHIP: [EM] Failed to Send CHIP MsgId:00000000 sendCount: 3 max retries: 3�[0m
�[0;32m[1631738105879] [54239:55735778] CHIP: [TOO] Pairing Failure: ../../../examples/chip-tool/third_party/connectedhomeip/src/controller/CHIPDeviceController.cpp:1522: CHIP Error 0x00000032: Timeout�[0m
�[1;31m[1631738105880] [54239:55735772] CHIP: [TOO] Run command failure: ../../../examples/chip-tool/third_party/connectedhomeip/src/controller/CHIPDeviceController.cpp:1522: CHIP Error 0x00000032: Timeout�[0m
�[0;34m[1631738105880] [54239:55735772] CHIP: [CTL] Shutting down the commissioner�[0m
�[0;34m[1631738105880] [54239:55735772] CHIP: [CTL] Shutting down the controller�[0m
�[1;31m[1631738105880] [54239:55735772] CHIP: [DL] Inet Layer shutdown�[0m
�[1;31m[1631738105880] [54239:55735772] CHIP: [DL] BLE layer shutdown�[0m
�[0;34m[1631738105880] [54239:55735772] CHIP: [IN] Clearing BLE pending packets.�[0m
�[0;32m[1631738105880] [54239:55735772] CHIP: [BLE] Auto-closing end point's BLE connection.�[0m
�[1;31m[1631738105880] [54239:55735772] CHIP: [DL] System Layer shutdown�[0m

On the commissionee side, it receives the PBKDF param request, sends PBKDF param response, then nothing good happens after that.

This was broken by #9269.

I stepped through what happens on the commissioner when it gets the PBKDF param response, on the merge changeset for #9269 (102c6e2). We find the right exchange, call ExchangeContext::HandleMessage, land in ExchangeMessageDispatch::OnMessageReceived, see there is a piggybacked ack, land in ReliableMessageMgr::CheckAndRemRetransTable and return false. That causes the caller to error out, and then everything errors out up the callstack.

The reason CheckAndRemRetransTable returns false is that in EncryptedPacketBufferHandle::GetMsgId we fail to parse the stored data to extract the message counter. And that happens because there place where the message data actually starts for *this is 5 bytes into the packet buffer (based on where the actual bytes for the message counter appear).

I tried modifying SecureSessionMgr::SendPreparedMessage like so:

--- a/src/transport/SecureSessionMgr.cpp
+++ b/src/transport/SecureSessionMgr.cpp
@@ -191,6 +191,7 @@ CHIP_ERROR SecureSessionMgr::SendPreparedMessage(SessionHandle session, const En
     PacketBufferHandle msgBuf = preparedMessage.CastToWritable();
     VerifyOrReturnError(!msgBuf.IsNull(), CHIP_ERROR_INVALID_ARGUMENT);
     VerifyOrReturnError(!msgBuf->HasChainedBuffer(), CHIP_ERROR_INVALID_MESSAGE_LENGTH);
+    msgBuf = msgBuf.CloneData();
 
     if (mTransportMgr != nullptr)
     {

and that makes the problem go away. So something after that point is adding bytes to the front of the buffer that then confuse us when we later try to parse the buffer. In the case I was looking at just now those bytes are: 0x0d 0x00 0x00 0x3d 0x00.

What confuses me is that at first glance the old code had the same basic structure as now: SessionEstablishmentExchangeDispatch::SendPreparedMessage just passed on the message to the transport. Now it calls mSessionMgr->SendPreparedMessage(), which does the same thing....

I can dig at this some more on Friday, but I am off tomorrow, so if someone else wants to pick this up in the meantime...

@pan-apple @kghost @tcarmelveilleux @turon @carol-apple @sagar-apple

Proposed Solution

Figure out what's going on.

@cecille
Copy link
Contributor

cecille commented Sep 16, 2021

This also appears to happen on IP commissioning.

@bzbarsky-apple
Copy link
Contributor Author

This also appears to happen on IP commissioning.

Interesting. I could not reproduce it with IP commissioning on ::1 (against the "linux" all-clusters-app), which makes sense because that configuration is tested in CI... I wonder what the difference between the successful and failing IP commissioning scenarios.

@kghost
Copy link
Contributor

kghost commented Sep 17, 2021

I haven't reproduced the problem, but I found some suspicious code in BtpEngine::HandleCharacteristicSend

characteristic = mTxBuf->Start();
characteristic -= header_size;
mTxBuf->SetStart(characteristic);

kTransferProtocolMaxHeaderSize is 5. So 5 bytes may have perpend to the packet.

0x0d 0x00 0x00 0x3d 0x00

  1. 0x0d: HeaderFlags kCommandMessage | kEndMessage | kStartMessage
  2. 0x00: GetAndRecordRxAckSeqNum
  3. 0x00: GetAndIncrementNextTxSeqNum
  4. 0x3d 0x00: Little endian encode of mTxLength = 61

@bzbarsky-apple
Copy link
Contributor Author

Oh, I know what's going on most immediately! We are doing MRP over BTP. We are not supposed to be doing that! And when we try to do it, it's broken as it has been all along (because that code path never got taken, because it's not supposed to be taken).

So the real question is why we are doing MRP over BTP.

@Damian-Nordic
Copy link
Contributor

Damian-Nordic commented Sep 17, 2021

I'll copy my finding from slack. I found that in https://github.com/project-chip/connectedhomeip/blob/master/src/messaging/ExchangeContext.cpp#L115 state is nullptr so the code can't figure out that the peer address is BLE and it assumes MRP is needed. Not sure how it worked before.

@bzbarsky-apple
Copy link
Contributor Author

Yep, that's what I just found. Before the state came back as non-null, which is how it worked. And that meshes with this being broken by #9269 somehow.

@franck-apple franck-apple added this to the Test Event 6 milestone Sep 17, 2021
@bzbarsky-apple
Copy link
Contributor Author

In other places in #9269 where GetPeerConnectionState was happening it looks like we added branches on IsSecure() and don't call GetPeerConnectionState when !IsSecure. But that's not the right thing to do here, because we do in fact want to do MRP for CASE setup, which is not secure, but can happen over UDP.

@bzbarsky-apple
Copy link
Contributor Author

Ah, I see how this might need to work. Patch coming up, I hope. But more generally, it seems like we need to audit all GetPeerConnectionState consumers? @kghost

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 17, 2021
We were incorrectly doing MRP over BTP, because there was no peer
connection state for our insecure session when doing a SPAKE
handshake.

The various header include additions were needed to fix header files
that used types without declaring them and hence depended on other
headers being included before them.

Fixes project-chip#9738
woody-apple pushed a commit that referenced this issue Sep 17, 2021
We were incorrectly doing MRP over BTP, because there was no peer
connection state for our insecure session when doing a SPAKE
handshake.

The various header include additions were needed to fix header files
that used types without declaring them and hence depended on other
headers being included before them.

Fixes #9738
mleisner pushed a commit to mleisner/connectedhomeip that referenced this issue Sep 20, 2021
We were incorrectly doing MRP over BTP, because there was no peer
connection state for our insecure session when doing a SPAKE
handshake.

The various header include additions were needed to fix header files
that used types without declaring them and hence depended on other
headers being included before them.

Fixes project-chip#9738
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants