Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

console_task should repeat command extraction if there are still commands in the buffer #243

Closed
hg42 opened this issue Mar 16, 2018 · 7 comments

Comments

@hg42
Copy link

hg42 commented Mar 16, 2018

I had a lot of problems with the lpc1768 code, because of shutdowns.

Every variant I tried didn't work in one or another way.

  • usb serial (from cruwaller) -> shutdowns while printing (early)
  • uart serial with interrupt -> no connection at all (timeout)
  • uart with polling (console_check_input) -> timeout in connection or while sending config

after many debugging sessions it melted down to a very simple change in console_task:

instead of extracting a single command:

...
    int8_t ret = command_find_block(receive_buf, rpos, &pop_count);
    if (ret > 0)
        command_dispatch(receive_buf, pop_count);
    if (ret)
        console_pop_input(pop_count);
...

it should immediately retry if there is still a command in the buffer:

...
    int8_t ret = 1;
    while(ret) {
        ret = command_find_block(receive_buf, rpos, &pop_count);
        if (ret > 0)
            command_dispatch(receive_buf, pop_count);
        if (ret)
            console_pop_input(pop_count);
    }
...

This solved all my former problems, so all my code variants are working now (ahem...not really, usb is not tested yet).

Reasoning:
A polling time of 100ms (periodic_event, which seems to be the rate at which the console buffer is examined) is not enough when receiving messages at higher rates.
I think identify data and config data are sent in multiple chunks (without waiting?).
I am not sure why, but I think while printing there are situations when multiple messages add up in the receive buffer.
In both cases a single wait of 100ms can lead to a fatal overrun.
So, processing of bytes in the receive buffer should be as fast as possible.

I guess, some of the other issues with shutdowns and timing could also fade away when changing this code.

More background:
While testing, I also solved this either

  • by reducing the periodic_event time to 100us, or
  • by adding an unconditional sched_wake_tasks(); in console_task() (= continuous rescheduling)

Both lead to much faster processing of the receive buffer.

So, as an alternative there could be an extra queue for high priority tasks which could be executed at a higher rate. But I think this is not necessary right now and adds more complexity and probably unnecessary processing time.

I am adding this as issue, because it should probably be changed at all occurrences of console_task and my code base is very different.
Also, may be you have a better idea how to do this (as usual :-)).

Apart from that I have no clue, why the AVR doesn't seem to suffer from this, despite it has much less processing power.
From what I heard the SAM (Due) works, too. And it's code is very similar to the lpc code.
May be my use case is a bit more demanding...but with the lpc1768, it was totally impossible to print without this change.

@hg42 hg42 changed the title console_task should repeat command extrcation if the buffer is still full console_task should repeat command extraction if there are still commands in the buffer Mar 16, 2018
@hg42
Copy link
Author

hg42 commented Mar 16, 2018

I think, even if there are other possibilities, this simple code change is cheap and the fastest possible, so it seems to be the right thing to do...

@KevinOConnor
Copy link
Collaborator

KevinOConnor commented Mar 16, 2018 via email

@hg42
Copy link
Author

hg42 commented Mar 16, 2018

I thought of starvation, too. So I also tried other possibilities, e.g.:

if(receive_pos)
  sched_wake_tasks();

at the end of console_task() which should do what you describe.
While this improved the situation a bit, it wasn't fast enough.
I think if more messages are in the buffer, they cannot wait.

It seems to be a balancing act between reading the receive buffer before it fills up and the rate of periodic_event.
If the periodic_event rate is fast enough, the buffer processing can be slower.

All of the Klipper tasks are careful to issue a sched_wake_tasks() if they need to be run again

This code is taken from your original sources, e.g. from https://github.com/KevinOConnor/klipper/blob/master/src/sam3x8e/serial.c#L113 :

// Process any incoming commands
void
console_task(void)
{
    uint8_t pop_count;
    uint32_t rpos = readl(&receive_pos);
    int8_t ret = command_find_block(receive_buf, rpos, &pop_count);
    if (ret > 0)
        command_dispatch(receive_buf, pop_count);
    if (ret)
        console_pop_input(pop_count);
}
DECL_TASK(console_task);

The same code is in AVR. Only rpos is read from 8bit instead of 32bit.
This does not check if there is another message in the buffer, in which case it should probably call sched_wake_tasks().
In a worst case the other message has to wait until periodic_event is triggered again.
The lpc1768 code is very similar to the Arduino Due code which seems to work (but I don't have one, so I cannot test under same conditions).

I am not sure what happens exactly, may be this:
If MESSAGE_SYNC is received, the receive interrupt calls sched_wake_tasks(). So a message should be processed via task scheduling.
But if multiple messages are received before processing begins, then only one message will be processed and the others remain in the buffer.
I think this could be solved by using counters instead of a flag. It's more difficult, though, because each task can have multiple rescheduling requests. One run_tasks() processes all tasks. So simply counting the number of requests isn't working. Probably each task should have it's own counter.

I created this issue in your repo, because the code is the same. The difference is only in the timing. That means other implementations will probably get similar problems.

Btw. I tested the LPC USB code used by cruwaller today and it behaves exactly like before (shutdown after a few moves).
And for the serial UART code using polling, the while loop isn't sufficient, because the FIFO is only 14 bytes long. This only works with a frequent periodic_event or continuous scheduling. However the while loop definitely reduces the timing pressure.

FYI: with the UART code using interrupts + while loop or polling + continuous scheduling the LPC1768 can reach 1000 mm/s (naturally only with steppers, no mechanical parts). And at this high speed the acceleration still sounds smooth, so I think the timing is still correct. I tested this with a real model and with 100mm square movements.

@hg42
Copy link
Author

hg42 commented Mar 16, 2018

I just tested to limit the loop to a few iterations:

    for(int i = 0; i < 2; i++) {
        uint8_t pop_count = 0;
        int8_t ret = command_find_block(receive_buf, receive_pos, &pop_count);
        if (ret > 0)
            command_dispatch(receive_buf, pop_count);
        if (ret)
            console_pop_input(pop_count);
        else
            break;
    }

A single retry (i< 2) seems to work, even at high speed.

Apart from that, a loop probably doesn't prevent scheduling, because I assume the message receive time is less than the message processing time.

@KevinOConnor
Copy link
Collaborator

KevinOConnor commented Mar 16, 2018 via email

@hg42
Copy link
Author

hg42 commented Mar 17, 2018

I agree, that I didn't experience this issue with RAMPS.
Unfortunately I don't have any other CPU from your range, so I cannot test with e.g. SAM32. It works for @cruwaller.

You are also right, that console_pop_input will reschedule the task(s) when a command is removed from the buffer (which probably is the correct place).
The problem is

  • our code also includes this
  • and also the sched_wakeup_tasks() in the interrupt handler
  • and they are called (checked with LEDs)
  • the whole stategy parallels how you do it with the SAM32

but my observations were

  • unconditionally adding a sched_wakeup_tasks() to console_task does process additional messages from the buffer and prevents overflow
  • this seems to be illogical
  • this might eventually indicate some flaw in command_find_block (I guess)

I will recheck these observations, hopefully today or tomorrow...

So what to do...

  • I would prefer to keep this open for other developers porting klipper to other CPUs to see which experiences they have and remind them that there might be "something"
  • I'll happily close it, if it is proven wrong
  • if instead you want it to be closed, I would want to do it myself, so I can reopen it when I have more data

I understand that you tend to keep your issues clean.
However, they can also remind of something that is otherwise hidden to developers, even if you doubt it is true. Neither you nor me have a true proof currently, so I would keep it open.
This depends on how you use the issue list. The purpose could be signaled by attaching labels to it (but I didn't use them with github, yet, so I am not sure what is possible).
So you could add "obscure" to it :-)

@hg42
Copy link
Author

hg42 commented Mar 17, 2018

so...I re-visited all my tests and found what's going wrong.

I basically had these three test situations and tried to draw conclusions from them, but the reasons were different than I thought:

  • usb serial (from cruwaller) -> shutdowns while printing (early)
    -> fails for other reasons (still not found)

  • uart serial with interrupt -> no connection at all (timeout)
    -> fails because the console_sendf function doesn't work
    this was the most misleading one

  • uart with polling (console_check_input) -> timeout in connection or while sending config
    -> doesn't work because the FIFO is too short for 100ms polling

Additionally I was misinterpreting a part of console_pop_input (thinking wrong way because of the variable name needcopy, you only see what you already think :-) ).
In reality the need to copy also means that there are more (eventually incomplete) messages in the buffer. Only in this case the sched_wake_tasks() is called.
That's why the (LED) pattern was unexpected for me, because I assumed it would be called for every message.

I apologize for taking your time and will close this issue.

@hg42 hg42 closed this as completed Mar 17, 2018
@github-actions github-actions bot locked and limited conversation to collaborators Dec 24, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants