Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add mutex for queue processing. #24

Merged
merged 2 commits into from
Mar 31, 2024

Conversation

Fabian-Schmidt
Copy link

As part of esphome/esphome#5373 I identified that the method AsyncEventSourceClient::_runQueue is not multi threading safe.
Multiple threads sending messages will result in the same message being transmitted multiple times to the client.
This PR adds a mutex to prevent this.

Copy link

@descipher descipher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with ESP32S, resolves the duplicate line issues.

@GelidusResearch
Copy link

Did some load testing and I have observed exceptions in the process. Used a WEB based ESP trace dump processor however it does not output anything. Not sure why. Will need to investigate further. Happens when the logger is under pressure but it could be a number of things such as the WEB server, HA API, uart or mqtt.

@Fabian-Schmidt
Copy link
Author

Did some load testing and I have observed exceptions in the process. Used a WEB based ESP trace dump processor however it does not output anything. Not sure why. Will need to investigate further. Happens when the logger is under pressure but it could be a number of things such as the WEB server, HA API, uart or mqtt.

Can you share the test code so I can reproduce it locally?
Do you think this issue was introduced by this change?

@GelidusResearch
Copy link

Can you share the test code so I can reproduce it locally? Do you think this issue was introduced by this change?

Observed with this code https://github.com/GelidusResearch/grps when the number component values are rapidly fired in the WEB UI it sends out significant logging and HA API publish events. The exception occurs at random points under this pressure. I don't think this change is the root cause but I have not ruled it out. Removing the log queuing suppressed the exception frequency but that's not and indication either since there are other issues present like API pressure. Really need to see the where it was in the stack dump but thats not working atm.

@descipher
Copy link

descipher commented Oct 23, 2023

Update: Tested this more, when the serial logging outputs to the local port and or the Async Web Server we do not observe any exceptions with a queue length of up to 24. At a queue length of 32 it will exception. I have not decoded the trace yet. The queue max is hit and a queue limit error is logged at 24. At 32 no limit error is logged.

@Fabian-Schmidt
Copy link
Author

Fabian-Schmidt commented Oct 23, 2023

I can see the log output ERROR: Too many messages queued when using a log queue of length 32. But I am not able reproduce an exception or crash. My simple config is as follows:

esphome:
  name: basic

esp32:
  board: esp32dev
  framework:
    type: arduino

logger:
  level: VERBOSE
  log_queue_length: 32

interval:
  - interval: 1s
    then: 
      lambda: |-
        int num_executions = 0;
        for (int i = 0; i < 50; i++) {
          ESP_LOGI("Test", "I am at execution number %d", num_executions);
          id(s01).publish_state(num_executions);
          id(s02).publish_state(num_executions);
          id(s03).publish_state(num_executions);
          num_executions++;
        }

number:
- platform: template
  id: s01
  name: s01
  optimistic: true
  min_value: 0
  max_value: 100
  step: 1
- platform: template
  id: s02
  name: s02
  optimistic: true
  min_value: 0
  max_value: 100
  step: 1
- platform: template
  id: s03
  name: s03
  optimistic: true
  min_value: 0
  max_value: 100
  step: 1

web_server:
  port: 80

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

When the ESP32 is crashing very different exceptions very time it is possible that it is an out of memory situation.
The log queue is using up to 33*512 byte of memory ~16.5kb (log_queue_length * tx_buffer_size). So lowering tx_buffer_size can shrink the memory consumption.

@descipher
Copy link

Do you have an HA instance available, It could be the api socket. I will try a test without and api socket.

@descipher
Copy link

Ok finally captured the stack dump.

PC died just after _runQueue, looks like the Q pointer is out of bounds to me.

Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.

Core  1 register dump:
PC      : 0x400e5869  PS      : 0x00060c30  A0      : 0x800e58cd  A1      : 0x3ffb2540
A2      : 0x3ffdb80c  A3      : 0x00000006  A4      : 0x3ffdb818  A5      : 0x3ffdd2a0  
A6      : 0x0a0d0a0d  A7      : 0x7d223035  A8      : 0x00000000  A9      : 0x00000000  
A10     : 0x3ffdf9dc  A11     : 0x00000006  A12     : 0x0000005a  A13     : 0x3ffdd098
A14     : 0x00000000  A15     : 0x000a0d0a  SAR     : 0x0000000a  EXCCAUSE: 0x0000001c
EXCVADDR: 0x0000000a  LBEG    : 0x40089344  LEND    : 0x4008935a  LCOUNT  : 0xffffffff  


Backtrace:0x400e5866:0x3ffb25400x400e58ca:0x3ffb2560 0x400e590b:0x3ffb2580 0x400e5965:0x3ffb25a0 0x400da15e:0x3ffb25e0 0x40167729:0x3ffb2620 0x400d7ead:0x3ffb2640 0x400d683d:0x3ffb2690 0x400d80ba:0x3ffb26c0 0x400d97f1:0x3ffb2700 0x400da7dd:0x3ffb2720 0x400da832:0x3ffb2770 0x40167579:0x3ffb2790 0x401675d5:0x3ffb27b0 0x400dee38:0x3ffb27d0 0x400e130a:0x3ffb2800 0x400ec3f1:0x3ffb2820 

0x400e5869: LinkedList::Iterator::operator++() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\.piolibdeps/presence-1/ESPAsyncWebServer-esphome/src\StringArray.h:60
0x400e5869: AsyncEventSourceClient::_runQueue() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\.piolibdeps/presence-1/ESPAsyncWebServer-esphome/src\AsyncEventSource.cpp:243
0x400e5866: LinkedList::Iterator::operator++() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\.piolibdeps/presence-1/ESPAsyncWebServer-esphome/src\StringArray.h:60
0x400e5866: AsyncEventSourceClient::_runQueue() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\.piolibdeps/presence-1/ESPAsyncWebServer-esphome/src\AsyncEventSource.cpp:243
0x400e58ca: AsyncEventSourceClient::_queueMessage(AsyncEventSourceMessage*) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\.piolibdeps/presence-1/ESPAsyncWebServer-esphome/src\AsyncEventSource.cpp:195
0x400e58ca: AsyncEventSourceClient::_queueMessage(AsyncEventSourceMessage*) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\.piolibdeps/presence-1/ESPAsyncWebServer-esphome/src\AsyncEventSource.cpp:181
0x400e590b: AsyncEventSourceClient::write(char const*, unsigned int) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\.piolibdeps/presence-1/ESPAsyncWebServer-esphome/src\AsyncEventSource.cpp:230
0x400e5965: AsyncEventSource::send(char const*, char const*, unsigned int, unsigned int) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\.piolibdeps/presence-1/ESPAsyncWebServer-esphome/src\AsyncEventSource.cpp:321
0x400da15e: esphome::web_server::WebServer::on_number_update(esphome::number::Number*, float) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/components/web_server\web_server.cpp:744
0x40167729: operator() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/core\controller.cpp:59
0x40167729: _M_invoke at c:\users\mlaspina.laspina\.platformio\packages\toolchain-xtensa-esp32\xtensa-esp32-elf\include\c++\8.4.0\bits\std_function.h:297
0x400d7ead: std::function::operator()(float) const at c:\users\mlaspina.laspina\.platformio\packages\toolchain-xtensa-esp32\xtensa-esp32-elf\include\c++\8.4.0\bits\std_function.h:687
0x400d7ead: esphome::CallbackManager::call(float) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/core\helpers.h:482
0x400d7ead: esphome::number::Number::publish_state(float) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/components/number\number.cpp:13
0x400d683d: esphome::ld2420::LD2420GateSelectNumber::control(float) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/components/ld2420/number\gate_config_number.cpp:31
0x400d80ba: esphome::number::NumberCall::perform() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/components/number\number_call.cpp:114
0x400d97f1: operator() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/components/web_server\web_server.cpp:768
0x400d97f1: _M_invoke at c:\users\mlaspina.laspina\.platformio\packages\toolchain-xtensa-esp32\xtensa-esp32-elf\include\c++\8.4.0\bits\std_function.h:297
0x400da7dd: std::function::operator()() const at c:\users\mlaspina.laspina\.platformio\packages\toolchain-xtensa-esp32\xtensa-esp32-elf\include\c++\8.4.0\bits\std_function.h:687
0x400da7dd: esphome::web_server::WebServer::loop() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/components/web_server\web_server.cpp:154
0x40167579: esphome::Component::call_loop() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/core\component.cpp:76
0x401675d5: esphome::Component::call() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/core\component.cpp:98
0x400dee38: esphome::Application::loop() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src/esphome/core\application.cpp:74
0x400e130a: loop() at X:\github\descipher\esphome.configs\.esphome\build\presence-1\src\main.cpp:592
0x400ec3f1: loopTask(void*) at X:\github\descipher\esphome.configs\.esphome\build\presence-1\c:/user/descipher/.platformio/packages/framework-arduinoespressif32/cores/esp32\main.cpp:50

@descipher
Copy link

My test is invalid, mutex was not in the code, must have been overwritten. Re-applying and retesting.

@descipher
Copy link

My bad, with the mutex back in place it's functional. Sorry.

@Fabian-Schmidt
Copy link
Author

Fabian-Schmidt commented Oct 23, 2023

I have the feeling the hole code is not perfect.
I can crash my ESP32 by just reloading the web site 5-6 times quickly.
So I think the goal currently is not perfect code...but just a bit better.

@descipher
Copy link

Yep, same page here. One small step at a time. Thanks for working on it.

@Fabian-Schmidt
Copy link
Author

Hi @jesserockz, do you have time to review this PR?

@mathieucarbou
Copy link

FYI I came across another implementation here (https://github.com/me-no-dev/ESPAsyncWebServer/pull/887/files), explained here (Aircoookie/WLED#3382 (comment))

@jesserockz jesserockz merged commit 9c2034d into esphome:master Mar 31, 2024
@Fabian-Schmidt Fabian-Schmidt deleted the MultiThreadingIssue branch April 8, 2024 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants