Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CircuitPython 4.x Performance issue with the pygamer vs. MicroPython #1933

Closed
shazz opened this issue Jun 9, 2019 · 14 comments
Closed

CircuitPython 4.x Performance issue with the pygamer vs. MicroPython #1933

shazz opened this issue Jun 9, 2019 · 14 comments
Milestone

Comments

@shazz
Copy link

shazz commented Jun 9, 2019

Hi,

I did a simple loop test in python to compare with my STM32F4 based MicroPython board and the results are quite unexpected. CircuitPython 4.x is 16x slower than MicroPython for an equivalent CPU frequency.

Details available here:
https://forums.adafruit.com/viewtopic.php?f=60&t=152372&p=753830#p753830

Test code:

def test():
    s = time.monotonic
    count = 0
    r = range(0, 10000000)

    startTime = s()
    for count in r:
        pass
    endTime = s()

    duration = endTime - startTime

    return duration

Results:

- 18 766ms on Meowbit/MP at 56 MHz (using pyb.millis)
- 12 506ms on my Meowbit/MP at 84 MHz (using pyb.millis)
- 138 390ms on my Pygamer/CP at 120 MHz (using time.monotonic)
- 86 000ms on my Pygamer/CP at 200 MHz (using time.monotonic)

C module reference:

#define NOP asm volatile(" nop \n\t")

    uint64_t start = ticks_ms;  
    uint32_t count = 0;  
    uint64_t end = 0; 

    for(count=0; count<10000000; count++) { NOP; }

    end = ticks_ms;  

    uint32_t delta = end - start;
    printf("Delta: %lu\n", delta);

Results:

- 715ms at 56MHz on Meowbit/MP
- 477ms at 84MHz on Meowbit/MP
- 334ms at 120MHz on pygamer/CP
- 200ms at 200MHz on pygamer/CP

Real issue or I did something wrong ?

@tannewt
Copy link
Member

tannewt commented Jun 11, 2019

This doesn't surprise me because performance hasn't been our focus. I'll probably take a glance for simple improvements in the next couple weeks but we'd totally take PRs that improve things.

@dhalbert
Copy link
Collaborator

dhalbert commented Jun 11, 2019

I did some research on this. I looked at the stm32 port settings in MicroPython. In our ports, we use MICROPY_VM_HOOK_LOOP and MICROPY_VM_HOOK_RETURN to call run_background_tasks() very frequently. This happens after every jump opcode and every return. We do this to service USB and other background tasks that need to use polling instead of interrupts. The stm32 port does not do this.

If do not define the HOOK macros to call run_background_tasks() , your timing test and a similar one done with explicit increment and test code run run 3-4 times faster.

I could only test this in CircuitPython 3.1.2. It works at least for a short while without calling background tasks frequently, but eventually the REPL connection goes away. CircuitPython 4.0.1 doesn't work at all if I eliminate the background task call.

That accounts for a large fraction of your performance difference, but not all, and further research would uncover more.

@shazz
Copy link
Author

shazz commented Jun 11, 2019

Thanks Dan! Interesting findings! "This happens after every jump opcode and every return" looks pretty scary :)

so everything happens here, ans especially usb_background(), right?

#define MICROPY_VM_HOOK_LOOP run_background_tasks();
#define MICROPY_VM_HOOK_RETURN run_background_tasks();

void run_background_tasks(void) {
    // Don't call ourselves recursively.
    if (running_background_tasks) {
        return;
    }
    assert_heap_ok();
    running_background_tasks = true;

    #if (defined(SAMD21) && defined(PIN_PA02)) || defined(SAMD51)
    audio_dma_background();
    #endif
    #if CIRCUITPY_DISPLAYIO
    displayio_refresh_displays();
    #endif

    #if CIRCUITPY_NETWORK
    network_module_background();
    #endif
    filesystem_background();
    usb_background();
    running_background_tasks = false;
    assert_heap_ok();

    last_finished_tick = ticks_ms;
}

And yep MICROPY_VM_HOOK_* are called in a lot of places...

@dhalbert
Copy link
Collaborator

Further testing: I have turned on MICROPY_OPT_COMPUTED_GOTO which is a performance enhancement in the VM interpreter (adds about 1700 bytes ti firmware). This further speeds up the basic timing loop by a factor of 5. Together these seem to account for pretty much all the speed differences.

The basic test I am using (to avoid any issues about how range() is implemented:

import time
def test():
    i = 100000
    startTime = time.monotonic()
    while i > 0:
        i -= 1
    endTime = time.monotonic()
    return endTime - startTime

print(test())

@shazz
Copy link
Author

shazz commented Jun 11, 2019

Interesting ! I'll try tonight on my pygamer to enable MICROPY_OPT_COMPUTED_GOTO and disable CIRCUITPY_DISPLAYIO. I'll post my results here.

@shazz
Copy link
Author

shazz commented Jun 12, 2019

@dhalbert, with the last CircuitPython source code, when I add #define MICROPY_OPT_COMPUTED_GOTO (1) I got this error at compilation time:

In file included from ../../py/vm.c:129:0:
../../py/vmentrytable.h:27:5: error: "__clang__" is not defined, evaluates to 0 [-Werror=undef]
 #if __clang__
     ^~~~~~~~~
../../py/vmentrytable.h:116:5: error: "__clang__" is not defined, evaluates to 0 [-Werror=undef]
 #if __clang__
     ^~~~~~~~~
cc1: all warnings being treated as errors

make: *** [../../py/mkrules.mk:82: build-pygamer/genhdr/qstr.i.last] Error 123

Anything I should do ?

@dhalbert
Copy link
Collaborator

Change both #if __clang__ to #ifdef __clang__. The original is a bit sloppy and we turn on warnings on that.

@dhalbert
Copy link
Collaborator

Note the PR: #1934

@shazz
Copy link
Author

shazz commented Jun 12, 2019

Thanks!

So I ran again the same test with MICROPY_OPT_COMPUTED_GOTO enabled and CIRCUITPY_DISPLAYIO disabled:

Results: 35.3ms

So a huge improvement but still some room for optimization to reach the ~10ms on an equivalent 120MHz STM32F4 board with MicroPython.

Good job @dhalbert !

@dhalbert
Copy link
Collaborator

Thanks! The next step is more work, which is to figure out how to call the background tasks only as often as necessary. That will get the next 2-3x speedup, I think.

@dhalbert
Copy link
Collaborator

We really appreciate your taking the time to do this comparison and point out the big disparity.

@shazz
Copy link
Author

shazz commented Jun 12, 2019

Just tried at 200Mhz: 21.2 ms.

Ah ah, no issue :) It forces me to go into the details to get a better understanding of CircuitPython, I like that. But a beer in the Walking City downtown would be appreciated too :)

@dhalbert
Copy link
Collaborator

Closing in favor of #2142.

@ladyada
Copy link
Member

ladyada commented Dec 4, 2019

trying with
Adafruit CircuitPython 5.0.0-beta.0-109-g15886b150 on 2019-12-03; Adafruit Feather M4 Express with samd51j19

got 18.765 seconds @ 120 MHz, so a 7.4x improvement over original 138.39 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants