console_task should repeat command extraction if there are still commands in the buffer #243

hg42 · 2018-03-16T01:57:18Z

I had a lot of problems with the lpc1768 code, because of shutdowns.

Every variant I tried didn't work in one or another way.

usb serial (from cruwaller) -> shutdowns while printing (early)
uart serial with interrupt -> no connection at all (timeout)
uart with polling (console_check_input) -> timeout in connection or while sending config

after many debugging sessions it melted down to a very simple change in console_task:

instead of extracting a single command:

...
    int8_t ret = command_find_block(receive_buf, rpos, &pop_count);
    if (ret > 0)
        command_dispatch(receive_buf, pop_count);
    if (ret)
        console_pop_input(pop_count);
...

it should immediately retry if there is still a command in the buffer:

...
    int8_t ret = 1;
    while(ret) {
        ret = command_find_block(receive_buf, rpos, &pop_count);
        if (ret > 0)
            command_dispatch(receive_buf, pop_count);
        if (ret)
            console_pop_input(pop_count);
    }
...

This solved all my former problems, so all my code variants are working now (ahem...not really, usb is not tested yet).

Reasoning:
A polling time of 100ms (periodic_event, which seems to be the rate at which the console buffer is examined) is not enough when receiving messages at higher rates.
I think identify data and config data are sent in multiple chunks (without waiting?).
I am not sure why, but I think while printing there are situations when multiple messages add up in the receive buffer.
In both cases a single wait of 100ms can lead to a fatal overrun.
So, processing of bytes in the receive buffer should be as fast as possible.

I guess, some of the other issues with shutdowns and timing could also fade away when changing this code.

More background:
While testing, I also solved this either

by reducing the periodic_event time to 100us, or
by adding an unconditional sched_wake_tasks(); in console_task() (= continuous rescheduling)

Both lead to much faster processing of the receive buffer.

So, as an alternative there could be an extra queue for high priority tasks which could be executed at a higher rate. But I think this is not necessary right now and adds more complexity and probably unnecessary processing time.

I am adding this as issue, because it should probably be changed at all occurrences of console_task and my code base is very different.
Also, may be you have a better idea how to do this (as usual :-)).

Apart from that I have no clue, why the AVR doesn't seem to suffer from this, despite it has much less processing power.
From what I heard the SAM (Due) works, too. And it's code is very similar to the lpc code.
May be my use case is a bit more demanding...but with the lpc1768, it was totally impossible to print without this change.

The text was updated successfully, but these errors were encountered:

hg42 · 2018-03-16T02:01:07Z

I think, even if there are other possibilities, this simple code change is cheap and the fastest possible, so it seems to be the right thing to do...

KevinOConnor · 2018-03-16T13:02:58Z

On Fri, Mar 16, 2018 at 01:57:20AM +0000, Harald wrote: instead of extracting a single command:

[...]

it should immediately retry if there is still a command in the buffer:

I would not recommend that. It could lead to starvation of other tasks. In particular, it could (under high load) starve the watchdog task. All of the Klipper tasks are careful to issue a sched_wake_tasks() if they need to be run again. So, going back out and servicing any other tasks takes no appreciable time before we're back in the command dispatch code again. [...]

Apart from that I have no clue, why the AVR doesn't seem to suffer from this, despite it has much less processing power. From what I heard the SAM (Due) works, too. And it's code is very similar to the lpc code. May be my use case is a bit more demanding...but with the lpc1768, it was totally impossible to print without this change.

This isn't the place to report issues on code not in the Klipper repo. If you can reproduce an issue in the main Klipper code, I'd be happy to take a look at it. If it can't be reproduced on the main Klipper code then it's not appropriate for here. (Feel free to ask questions about how the main klipper code works - irc or the mailing list is probably more appropriate than a github issue though.)

…

-Kevin

hg42 · 2018-03-16T15:53:49Z

I thought of starvation, too. So I also tried other possibilities, e.g.:

if(receive_pos)
  sched_wake_tasks();

at the end of console_task() which should do what you describe.
While this improved the situation a bit, it wasn't fast enough.
I think if more messages are in the buffer, they cannot wait.

It seems to be a balancing act between reading the receive buffer before it fills up and the rate of periodic_event.
If the periodic_event rate is fast enough, the buffer processing can be slower.

All of the Klipper tasks are careful to issue a sched_wake_tasks() if they need to be run again

This code is taken from your original sources, e.g. from https://github.com/KevinOConnor/klipper/blob/master/src/sam3x8e/serial.c#L113 :

// Process any incoming commands
void
console_task(void)
{
    uint8_t pop_count;
    uint32_t rpos = readl(&receive_pos);
    int8_t ret = command_find_block(receive_buf, rpos, &pop_count);
    if (ret > 0)
        command_dispatch(receive_buf, pop_count);
    if (ret)
        console_pop_input(pop_count);
}
DECL_TASK(console_task);

The same code is in AVR. Only rpos is read from 8bit instead of 32bit.
This does not check if there is another message in the buffer, in which case it should probably call sched_wake_tasks().
In a worst case the other message has to wait until periodic_event is triggered again.
The lpc1768 code is very similar to the Arduino Due code which seems to work (but I don't have one, so I cannot test under same conditions).

I am not sure what happens exactly, may be this:
If MESSAGE_SYNC is received, the receive interrupt calls sched_wake_tasks(). So a message should be processed via task scheduling.
But if multiple messages are received before processing begins, then only one message will be processed and the others remain in the buffer.
I think this could be solved by using counters instead of a flag. It's more difficult, though, because each task can have multiple rescheduling requests. One run_tasks() processes all tasks. So simply counting the number of requests isn't working. Probably each task should have it's own counter.

I created this issue in your repo, because the code is the same. The difference is only in the timing. That means other implementations will probably get similar problems.

Btw. I tested the LPC USB code used by cruwaller today and it behaves exactly like before (shutdown after a few moves).
And for the serial UART code using polling, the while loop isn't sufficient, because the FIFO is only 14 bytes long. This only works with a frequent periodic_event or continuous scheduling. However the while loop definitely reduces the timing pressure.

FYI: with the UART code using interrupts + while loop or polling + continuous scheduling the LPC1768 can reach 1000 mm/s (naturally only with steppers, no mechanical parts). And at this high speed the acceleration still sounds smooth, so I think the timing is still correct. I tested this with a real model and with 100mm square movements.

hg42 · 2018-03-16T16:36:40Z

I just tested to limit the loop to a few iterations:

    for(int i = 0; i < 2; i++) {
        uint8_t pop_count = 0;
        int8_t ret = command_find_block(receive_buf, receive_pos, &pop_count);
        if (ret > 0)
            command_dispatch(receive_buf, pop_count);
        if (ret)
            console_pop_input(pop_count);
        else
            break;
    }

A single retry (i< 2) seems to work, even at high speed.

Apart from that, a loop probably doesn't prevent scheduling, because I assume the message receive time is less than the message processing time.

KevinOConnor · 2018-03-16T17:00:06Z

On Fri, Mar 16, 2018 at 08:53:51AM -0700, Harald wrote: > All of the Klipper tasks are careful to issue a sched_wake_tasks() if they need to be run again This code is taken from your original sources, e.g. from https://github.com/KevinOConnor/klipper/blob/master/src/sam3x8e/serial.c#L113 : ``` // Process any incoming commands void console_task(void) { uint8_t pop_count; uint32_t rpos = readl(&receive_pos); int8_t ret = command_find_block(receive_buf, rpos, &pop_count); if (ret > 0) command_dispatch(receive_buf, pop_count); if (ret) console_pop_input(pop_count); } DECL_TASK(console_task); ``` The same code is in AVR. Only rpos is read from 8bit instead of 32bit. This does not check if there is another message in the buffer, in which case it should probably call sched_wake_tasks().

It does check - the call to sched_wake_tasks() is in console_pop_input().

I created this issue in your repo, because the code is the same. The difference is only in the timing. That means other implementations will probably get similar problems.

I can see nothing wrong with the code, and you've repeatedly indicated the main Klipper code does not show the problem. All indicators point to an issue in external code. I'm really puzzled why you would report this as a defect here.

…

-Kevin

hg42 · 2018-03-17T14:32:04Z

I agree, that I didn't experience this issue with RAMPS.
Unfortunately I don't have any other CPU from your range, so I cannot test with e.g. SAM32. It works for @cruwaller.

You are also right, that console_pop_input will reschedule the task(s) when a command is removed from the buffer (which probably is the correct place).
The problem is

our code also includes this
and also the sched_wakeup_tasks() in the interrupt handler
and they are called (checked with LEDs)
the whole stategy parallels how you do it with the SAM32

but my observations were

unconditionally adding a sched_wakeup_tasks() to console_task does process additional messages from the buffer and prevents overflow
this seems to be illogical
this might eventually indicate some flaw in command_find_block (I guess)

I will recheck these observations, hopefully today or tomorrow...

So what to do...

I would prefer to keep this open for other developers porting klipper to other CPUs to see which experiences they have and remind them that there might be "something"
I'll happily close it, if it is proven wrong
if instead you want it to be closed, I would want to do it myself, so I can reopen it when I have more data

I understand that you tend to keep your issues clean.
However, they can also remind of something that is otherwise hidden to developers, even if you doubt it is true. Neither you nor me have a true proof currently, so I would keep it open.
This depends on how you use the issue list. The purpose could be signaled by attaching labels to it (but I didn't use them with github, yet, so I am not sure what is possible).
So you could add "obscure" to it :-)

hg42 · 2018-03-17T21:08:40Z

so...I re-visited all my tests and found what's going wrong.

I basically had these three test situations and tried to draw conclusions from them, but the reasons were different than I thought:

usb serial (from cruwaller) -> shutdowns while printing (early)
-> fails for other reasons (still not found)
uart serial with interrupt -> no connection at all (timeout)
-> fails because the console_sendf function doesn't work
this was the most misleading one
uart with polling (console_check_input) -> timeout in connection or while sending config
-> doesn't work because the FIFO is too short for 100ms polling

Additionally I was misinterpreting a part of console_pop_input (thinking wrong way because of the variable name needcopy, you only see what you already think :-) ).
In reality the need to copy also means that there are more (eventually incomplete) messages in the buffer. Only in this case the sched_wake_tasks() is called.
That's why the (LED) pattern was unexpected for me, because I assumed it would be called for every message.

I apologize for taking your time and will close this issue.

hg42 changed the title ~~console_task should repeat command extrcation if the buffer is still full~~ console_task should repeat command extraction if there are still commands in the buffer Mar 16, 2018

hg42 closed this as completed Mar 17, 2018

github-actions bot locked and limited conversation to collaborators Dec 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

console_task should repeat command extraction if there are still commands in the buffer #243

console_task should repeat command extraction if there are still commands in the buffer #243

hg42 commented Mar 16, 2018

hg42 commented Mar 16, 2018 •

edited

Loading

KevinOConnor commented Mar 16, 2018 via email

hg42 commented Mar 16, 2018

hg42 commented Mar 16, 2018

KevinOConnor commented Mar 16, 2018 via email

hg42 commented Mar 17, 2018 •

edited

Loading

hg42 commented Mar 17, 2018

console_task should repeat command extraction if there are still commands in the buffer #243

console_task should repeat command extraction if there are still commands in the buffer #243

Comments

hg42 commented Mar 16, 2018

hg42 commented Mar 16, 2018 • edited Loading

KevinOConnor commented Mar 16, 2018 via email

hg42 commented Mar 16, 2018

hg42 commented Mar 16, 2018

KevinOConnor commented Mar 16, 2018 via email

hg42 commented Mar 17, 2018 • edited Loading

hg42 commented Mar 17, 2018

hg42 commented Mar 16, 2018 •

edited

Loading

hg42 commented Mar 17, 2018 •

edited

Loading