Need help in figuring out why sometimes on/off command is not sent to all bind devices (TZ-1431) #517

Closed
theorlangur opened this issue Dec 31, 2024 · 9 comments

@theorlangur

Question

If it's possible I'd like to get help on the issue I'm having.

  • I have a presence sensor configured as a router device.
  • The presence sensor is bound directly to 1 light through its own on/off client cluster (and to the coordinator as well).
  • The presence sensor also has a server on/off cluster.
  • Under certain conditions the presence sensor gets an on command from the coordinator, which in turn triggers the sensor to issue an on command to its own binds (the aforementioned light and the coordinator itself).

The problem

most of the time it works and 2 commands are issued: to the light and to the coordinator.
However sometimes I can see that only 1 on command to a coordinator is issued, sometimes no commands at all.

Additional info:

  • the issue doesn't seem to appear when the on command is generated by internal conditions (one of its own sensors is active)
  • the command is sent with address_mode set to ESP_ZB_APS_ADDR_MODE_DST_ADDR_ENDP_NOT_PRESENT via the esp_zb_zcl_on_off_cmd_req API call
  • a callback for the command send status is registered via esp_zb_zcl_command_send_status_handler_register, and it gets called without errors.
  • another piece of code reacts to ESP_ZB_CORE_CMD_DEFAULT_RESP_CB_ID. This one doesn't receive the response from the lights (naturally)

I suspect it has something to do with how "busy" the lower level zigbee machinery is.
How could I troubleshoot it? Why mostly command is sent to all bound devices with no problems and sometimes it doesn't get sent at all or just partially?
Apparently lower level code is not able to send the command. Is there some dedicated callback to know when that happens? At the moment I'm catching this situation with a dedicated internal timer that times out.
Does it make sense to manually send dedicated commands to all relevant bound devices? (in other words not with ESP_ZB_APS_ADDR_MODE_DST_ADDR_ENDP_NOT_PRESENT but with e.g. ESP_ZB_APS_ADDR_MODE_16_ENDP_PRESENT)?

I'm attaching also a wireshark capture of the described situation.
The presence sensor has a short address of 0xce27
The target light in question has a short address of 0xe979
I was applying the filter _ws.col.def_src == "0xce27" || _ws.col.def_dst == "0xce27"
Packet 50 is the command from the coordinator to 0xce27 that is supposed to trigger an on command from the presence sensor 0xce27 to the light 0xe979.
In this case the command is sent to the coordinator in packet 87 with sequence number 191. I also see this sequence number via other means (I report such cases back to zigbee2mqtt via separate attributes), and I can see that I get the send status and am waiting for the response.
In packets 199, 200 and 201 one can see a repeated attempt after a timeout, which succeeds this time.
timeout_tsn191_ce27_to_e979.pcapng.zip

Thanks in advance!

Additional context.

No response

@theorlangur
Author

theorlangur commented Dec 31, 2024

an update
After changing the logic a bit (I was not properly processing the command send status before), I'm getting ESP_ERR_TIMEOUT in those cases when I attempt to send a command.
Is there something to research more about it or should I just accept the fact that sometimes sending command just fails?
Are there any possible signs/indicators I could use to judge whether it makes sense to send the command right now or wait (be notified?) until a better moment..?

@xieqinan
Contributor

xieqinan commented Jan 2, 2025

How could I troubleshoot it? Why mostly command is sent to all bound devices with no problems and sometimes it doesn't get sent at all or just partially?

I don’t think the low-level busy state will be triggered in your application since there are only three devices in your network. I suggest we debug this issue by analyzing the sniffer data first. However, I’m unable to parse the .pcap file without the network key. Could you please share the network key with me? Additionally, I would appreciate it if you could provide the complete .pcap file, capturing from the commissioning process to the point of failure.

Does it make sense to manually send dedicated commands to all relevant bound devices? (in other words not with ESP_ZB_APS_ADDR_MODE_DST_ADDR_ENDP_NOT_PRESENT but with e.g. ESP_ZB_APS_ADDR_MODE_16_ENDP_PRESENT)?

You can try testing it using the ESP_ZB_APS_ADDR_MODE_16_ENDP_PRESENT method; it makes sense.
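
As a rough illustration of the per-target approach discussed here, the sketch below loops over an application-kept list of bound targets, sends one unicast command to each, and records which sends failed so that only those need a retry. All names in it (`bound_target_t`, `send_on_fn`, `send_on_to_all`, `demo_send`) are hypothetical and not part of the SDK. On ESP Zigbee the send hook is where `esp_zb_zcl_on_off_cmd_req` would presumably be called with `ESP_ZB_APS_ADDR_MODE_16_ENDP_PRESENT`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-side record of one bound target. */
typedef struct {
    uint16_t short_addr; /* network short address of the target */
    uint8_t  endpoint;   /* destination endpoint on the target (placeholder) */
} bound_target_t;

/* Hypothetical hook: sends one unicast On command, returns 0 on success.
 * In a real ESP Zigbee application this is where the SDK send call would go. */
typedef int (*send_on_fn)(uint16_t short_addr, uint8_t endpoint, void *ctx);

/* Sends to every target and returns a bitmask with bit i set when the send
 * to targets[i] failed. Limited to 32 targets by the width of the mask. */
static uint32_t send_on_to_all(const bound_target_t *targets, size_t count,
                               send_on_fn send, void *ctx)
{
    assert(targets != NULL && send != NULL && count <= 32);
    uint32_t failed = 0;
    for (size_t i = 0; i < count; ++i) {
        if (send(targets[i].short_addr, targets[i].endpoint, ctx) != 0) {
            failed |= (uint32_t)1 << i;
        }
    }
    return failed;
}

/* Illustration-only stub: fails for the address that ctx points to. */
static int demo_send(uint16_t short_addr, uint8_t endpoint, void *ctx)
{
    (void)endpoint;
    const uint16_t *failing = (const uint16_t *)ctx;
    return (failing != NULL && *failing == short_addr) ? -1 : 0;
}
```

With this shape, a failed send to the light does not block the send to the coordinator, and the returned mask tells the caller which targets to retry.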

Is there something to research more about it or should I just accept the fact that sometimes sending command just fails?
Are there any possible signs/indicators I could use to judge whether it makes sense to send the command right now or wait (be notified?) until a better moment..?

I believe the SDK can fulfill your application requirements. However, let’s identify the root cause of the failure first.

By the way, is the light device a sleep device?

@theorlangur
Author

theorlangur commented Jan 2, 2025

@xieqinan , thanks for the reply
I realized I've forgotten to provide some possibly important info on the matter:

  • presence sensor in question is based on the esp32h2 chip
  • the network this device is operating in consists of 74 devices (34 routers, 40 end devices)
  • the light device that is controlled via a direct binding is a router device (IKEA's STOFTMOLN ceiling/wall lamp WW24, link)

low-level busy state

is there an official way to know about it? is there a way to be notified when the low-level is 'ready' again?

just to re-iterate a bit:
As I mentioned in my update post, I'm now getting a command send status of ESP_ERR_TIMEOUT, so I can see that the lower level was not able to send the command for whatever reason. That's fine in itself (I guess the network could be congested at times after all), but I'd like to rule out inefficiencies in how I work with the Zigbee stack and possibly react to such events in a quicker, more reliable way.

@theorlangur
Author

@xieqinan , so here's a complete pcap, starting with commissioning and running up to the first timed-out command. However, this time I can see from the log that the command is actually sent but not answered by the light. (There's also a 2nd, shorter attachment at the end that illustrates the original problem.)

Presence sensor: 0x53b5
Light that is being directly controlled: 0x063e (unlike in my original description that's a different one, but also from IKEA, also a router)

Attached sniff includes the following potential points of interest:

  • in packet 2950 there's a bind request to 0x53b5 to bind to 0x063e
  • in packet 3139 there's a bind request from 0x53b5 to 0x063e to bind it to 0x53b5 (in order to get reports about the on/off state of the lights)
  • example of normal flow: in packet 5148 there's an 'external trigger' signal from the coordinator; in packets 5179 and 5184 there are on commands to the coordinator and the 0x063e light, with responses to those commands in packets 5195 and 5206
  • in packet 16547 the coordinator sends the 'external trigger' signal to 0x53b5 in the form of an on command
  • in packets 16580 and 16634 the presence sensor sends an on command with TSN 107 and doesn't receive a reply to it
  • in packet 16967 it attempts to send it again (interestingly enough with the same TSN of 107), this time with success, as 0x063e responds in packet 16981.

I'm also attaching another, shorter one where the presence sensor manages to send the on command to the coordinator but not to 0x063e (only after a timeout). In packet 171 there's an 'external trigger' from the coordinator, in response to which the presence sensor issues its own on command, and we see that it's sent in packet 280 to the coordinator but only 0.5s later to 0x063e. I don't know what it means, but the on command with TSN 65 is sent only much later, in packet 779.

timeout_tsn65.pcapng.zip

timeout_tsn107_53b5_063e.pcapng.zip

@theorlangur
Author

@xieqinan , wanted to ask if you had a chance to look into this and if maybe you have some advice regarding a better command sending routine?
So far having a dedicated running timeout when sending commands and retrying on failure or timeout did the trick, but I wonder if there's a better way
Thx

@xieqinan
Contributor

xieqinan commented Jan 14, 2025

@theorlangur ,

Can I simplify your problem as follows?

In a large network containing 74 devices, a unicast frame is not acknowledged by the destination device initially. However, after approximately 3 seconds, the same frame is retransmitted and acknowledged.

If so, I believe this is normal behavior. Due to the limited bandwidth of IEEE 802.15.4, frames can occasionally be missed by devices in a busy network. Zigbee incorporates MAC and APS retransmission mechanisms to reduce the likelihood of missed frames. For RxWhenIdle devices, APS retransmits a frame that has not received an APS ACK after 3 seconds.
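
To make the timing concrete, the small sketch below works out when APS retransmissions would go out and when the sender would give up, assuming a fixed wait between attempts. The 3-second figure is the one quoted in this thread; the number of retransmissions is a parameter here because the thread does not state it, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Wait before an unacknowledged frame is retransmitted, as quoted in this
 * thread for RxWhenIdle devices. */
#define APS_ACK_WAIT_MS 3000u

/* Offset in ms from the first transmission at which retransmission number n
 * (1-based) goes out, assuming the same wait before every attempt. */
static uint32_t aps_retx_offset_ms(uint32_t n)
{
    assert(n >= 1);
    return n * APS_ACK_WAIT_MS;
}

/* Offset in ms after which the sender gives up when no ACK ever arrives,
 * assuming max_retries retransmissions (an assumed parameter). */
static uint32_t aps_give_up_ms(uint32_t max_retries)
{
    return (max_retries + 1u) * APS_ACK_WAIT_MS;
}
```

An application-level timeout shorter than `aps_retx_offset_ms(1)` would fire before the stack's own first retransmission, which may be worth keeping in mind when choosing retry intervals.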

@theorlangur
Author

@xieqinan , not in this case, but anecdotally it sometimes doesn't send the command at all. But that doesn't matter all that much. As I understand it, this may be normal, and my current approach with an explicit timeout works in a way I'm OK with.
Probably the last question, in response to your last comment: is there a way to configure this 3-second re-send interval somewhere? Or is it not accessible/configurable from the application level?
thx!

@xieqinan
Contributor

@theorlangur

is there a way to configure this 3-second re-send interval somewhere? Or is it not accessible/configurable from the application level?

The timeout duration is a constant value defined within the stack, so no application-level option is provided to modify it.

@theorlangur
Author

@xieqinan , then I think it concludes this topic.
Thank you for your help!
