Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Heap Trace] Perf: use hash map to speed up leaks mode 34x (IDFGH-9425) #10793

Conversation

chipweinberger
Copy link
Contributor

@chipweinberger chipweinberger commented Feb 17, 2023

Previous PR: #10521

This finishes the perf work on heap trace standalone. Leaks mode is now fast enough to be enabled in production, or during development!

Adds esp_err_t heap_trace_set_hashmap(heap_trace_hashmap_entry_t *entries_buffer, size_t num_entries);

Motivation:

  • leaks mode should be fast enough to be enabled in production
  • users can leave this mode enabled after boot & print leaks automatically (or save a dump to disk) when a low ram threshold is hit
  • if enabled before boot, leaks mode is a simple tool to find ways to save heap memory. This PR makes that use case fast.

Testing:

  • tested on an esp32s3.

Perf:

Most importantly! using a map makes record add & remove O(1), so we don't see a performance cliff when an old allocation is freed. 34x faster!

SPIRAM (NUMTABLE 2500, NUMITER 2, NUMENTRY 6000, NUMREC 3000)
total 41ms // no trace
total 3401ms // standard trace
total 109ms // with map

For internal ram and a more typical NUMTABLE 500, we are 1.8x faster with the hashmap.

internal memory:
total 665ms // no trace
total 1814ms // standard trace
total 1065ms // with map

For SPIRAM, also NUMTABLE 500, I'm surprised its not as dramatic. But still ~1.4x faster.

SPIRAM:
total 683ms // no trace
total 2118ms // standard trace
total 1491ms // with map

    const int NUMREC = 600;
    const int NUMENTRY = 600;
    const int NUMTABLE = 500;
    const int NUMITER = 100;

    uint32_t caps = MALLOC_CAP_INTERNAL;
    heap_trace_record_t* r = heap_caps_calloc(NUMREC, sizeof(heap_trace_record_t), caps);
    heap_trace_hashmap_entry_t* e = heap_caps_calloc(NUMENTRY, sizeof(heap_trace_hashmap_entry_t), caps);
    void** table = (void**) calloc(NUMTABLE, sizeof(void*));

    for (int i = 0; i < 3; i++) {

        int64_t t0 = esp_timer_get_time();

        if (i == 1 || i == 2) {
            heap_trace_init_standalone(r, NUMREC);
            if (i == 2) {heap_trace_set_hashmap(e, NUMENTRY);}
            heap_trace_start(HEAP_TRACE_LEAKS);
        }

        for(int j = 0; j < NUMITER; j++) {
            for (int k = 0; k < NUMTABLE; k++) {table[k] = heap_caps_malloc(1000,caps);}
            for (int k = 0; k < NUMTABLE; k++) {free(table[k]);}
        }

        int64_t t1 = esp_timer_get_time();
        printf("total %llims\n", (t1 - t0) / 1000);
        if (i == 1 || i == 2) {
            heap_trace_stop();
            heap_trace_dump();
        }
    }

@chipweinberger chipweinberger force-pushed the user/chip/heap-trace-hashmap branch from aaa21ca to 9feb0b4 Compare February 17, 2023 22:34
@chipweinberger
Copy link
Contributor Author

@SoucheSouche =)

@espressif-bot espressif-bot added the Status: Opened Issue is new label Feb 17, 2023
@github-actions github-actions bot changed the title [Heap Trace] Perf: use hash map to speed up leaks mode [Heap Trace] Perf: use hash map to speed up leaks mode (IDFGH-9425) Feb 17, 2023
@chipweinberger chipweinberger force-pushed the user/chip/heap-trace-hashmap branch 5 times, most recently from 166973e to cbc5a59 Compare February 18, 2023 01:00
@chipweinberger chipweinberger changed the title [Heap Trace] Perf: use hash map to speed up leaks mode (IDFGH-9425) [Heap Trace] Perf: use hash map to speed up leaks mode 34x (IDFGH-9425) Feb 18, 2023
@chipweinberger chipweinberger force-pushed the user/chip/heap-trace-hashmap branch 2 times, most recently from ef0ba89 to b255bfd Compare February 18, 2023 01:20
@chipweinberger
Copy link
Contributor Author

chipweinberger commented Feb 18, 2023

I've noticed a weird issue, when NUMTABLE is > 1919, total_allocations in the summary does not go past 1919.

This happens with and without the hashmap, and regardless of SPIRAM vs INTERNAL. It seems worse when using INTERNAL. Could there be a locking issue on master? Maybe the lock we are using does not always work?

I wont have time to dig into this further. I don't think it is caused by this PR, and doesn't seem like the linked list PR would cause it either. We did not change the lock, and the total_allocations++ logic has not changed.

@chipweinberger chipweinberger force-pushed the user/chip/heap-trace-hashmap branch from b255bfd to 58d5fd2 Compare February 18, 2023 02:44
@espressif-bot espressif-bot added Status: Selected for Development Issue is selected for development and removed Status: Opened Issue is new labels Feb 22, 2023
@SoucheSouche
Copy link
Collaborator

Hi @chipweinberger, I will get to this PR today :) I will investigate the 1919 total allocs issue as I will review and test the feature.

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Selected for Development Issue is selected for development labels Feb 23, 2023
@chipweinberger
Copy link
Contributor Author

chipweinberger commented Feb 23, 2023

Very curious to know what you find! Hopefully you reproduce it.

@chipweinberger
Copy link
Contributor Author

This is my sdkconfig:
sdkconfig.txt

Comment on lines 186 to 225
// copy to destination
if (r_get) {
memcpy(r_out, r_get, sizeof(heap_trace_record_t));
} else {
// this should not happen since we already
// checked that index < records.count,
// but could be indicative of memory corruption
result = ESP_ERR_INVALID_STATE;
memset(r_out, 0, sizeof(heap_trace_record_t));
}
// We already checked that index < records.count,
// This could be indicative of memory corruption.
assert(rget != NULL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it normal that the copy of r_get to r_out was removed here ?

Copy link
Contributor Author

@chipweinberger chipweinberger Feb 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good spot. That is a bug.

I only intended to remove the error code because I changed my mind about my original code - asserting on memory corruption is probably more appropriate here.

@SoucheSouche
Copy link
Collaborator

I see that you renamed a lot of variables. The thing is that we are using snake case for variable names is esp-idf so it would be nice if you could update them. Otherwise, I will have to commit on top of your commits to change the name again.

I leave a patch with the fixed variable names below feel free to use it :) I hope I didn't miss any.
heap_trace.patch

@SoucheSouche
Copy link
Collaborator

SoucheSouche commented Feb 27, 2023

map_add, map_remove, map_find and subsequently the static variables and function they utilize can be placed in flash by the compiler but are called from function that are specifically placed in IRAM. Everything should be placed in RAM but this means that - in the state of your PR - regardless of whether the hashmap is used or not, a substantial amount of RAM memory will be used to support the feature.

I think it would be wise to wrap the whole hashmap related logic under a config so that if the hashmap is not used, the impact of the heap trace on the RAM will remain the same.

If you incorporate the comments left above I can add a commit to handle the placement of the functions.

@chipweinberger
Copy link
Contributor Author

chipweinberger commented Feb 27, 2023

@SoucheSouche Thanks for the comments

variable names

Yes I think "rget" is still considered valid snake case. "fflush", "printf", etc all use that format. Personally I find the "r_" prefix hurts readability here. If "rget" is not suitable, I would then prefer "get" to keep the variable name short. This is of course personal preference. Do you prefer the "r_" prefixes? I'll go with your choice.

large amount of IRAM

Yes, that's a good point. A kconfig makes sense. 👍🏻

@SoucheSouche
Copy link
Collaborator

I'd keep the r_ everywhere to be consistent across the file.

@chipweinberger
Copy link
Contributor Author

Okay. Updated the PR.

@chipweinberger chipweinberger force-pushed the user/chip/heap-trace-hashmap branch from 58d5fd2 to 75a4fc4 Compare February 28, 2023 07:29
@SoucheSouche
Copy link
Collaborator

SoucheSouche commented Feb 28, 2023

Concerning the 1919 allocations behavior, could it be that you simply run out of available heap?

I tried the code on esp32s2 (randomly picked board), I added a heap_caps_get_free_size(caps) after the table calloc and got ~217000 bytes free. With a NUMTABLE set to 1000, it means that during the first iteration, I will try to allocate 1000000 bytes.

This is what I get:

------- free size : 217656 ---------------

====== Heap Trace: 0 records (600 capacity) ======
====== Heap Trace Summary ======
Mode: Heap Trace Leaks
0 bytes 'leaked' in trace (0 allocations)
records: 0 (600 capacity, 215 high water mark)
hashmap: 600 capacity (21500 hits, 0 misses)
total allocations: 21500
total frees: 21500

Which means that 215000 bytes were successfully allocated. Given that for each allocation, there are a few overhead bytes related to the memory allocator algorithm, I would say that it adds up.

@SoucheSouche
Copy link
Collaborator

SoucheSouche commented Feb 28, 2023

if I reduce the allocation size to 16 bytes, Then I get the expected value of total allocations. It makes sense since only 16Kbytes were allocated during the first iteration which is way below 217KB

------- free size : 217656 ---------------

====== Heap Trace: 0 records (600 capacity) ======
====== Heap Trace Summary ======
Mode: Heap Trace Leaks
0 bytes 'leaked' in trace (0 allocations)
records: 0 (600 capacity, 600 high water mark)
hashmap: 0 capacity (0 hits, 0 misses)
total allocations: 100000
total frees: 100000
(NB: Internal Buffer has overflowed, so trace data is incomplete.)
================================

====== Heap Trace: 0 records (600 capacity) ======
====== Heap Trace Summary ======
Mode: Heap Trace Leaks
0 bytes 'leaked' in trace (0 allocations)
records: 0 (600 capacity, 600 high water mark)
hashmap: 600 capacity (60000 hits, 40000 misses)
total allocations: 100000
total frees: 100000
(NB: Internal Buffer has overflowed, so trace data is incomplete.)

@chipweinberger
Copy link
Contributor Author

could it be that you simply run out of available heap?

of course! 🤦🏻‍♂️ Thanks for investigating it!

@chipweinberger
Copy link
Contributor Author

@SoucheSouche latest status?

@chipweinberger
Copy link
Contributor Author

@SoucheSouche bump.

hoping this makes it into 5.1

@SoucheSouche
Copy link
Collaborator

Hi @chipweinberger, sorry for the late reply, I was away from work. However, your PR was reviewed while I was away and a suggestion was made for a more efficient implementation. I will have a look at it today and get back at you to let you know if we will go with the suggested implementation or if we will stick to yours.

@chipweinberger
Copy link
Contributor Author

chipweinberger commented Mar 27, 2023

no problem I hope your time away was enjoyable!

more efficient implementation

😀 nice! Hopefully not too much more work for you. Hard to imagine what could be faster.

@SoucheSouche
Copy link
Collaborator

SoucheSouche commented Apr 3, 2023

Hi @chipweinberger!
So I modified your hash map implementation as follow:

  • hash_map array is now an array of linked list.
  • when adding an entry to the hash map, we compute the hash of the entry and add it to the corresponding list in hash_map. So entries with the same hash will just be added to the same linked list at the given index in the hash_map array instead of trying to find an available spot in the array and potentially ending up not adding the hash in the array.

This is removing collisions in the hash_map array and gives better performances for less RAM used overall.
In the case when HEAP_TRACE_ALL is used, it gives even better results :

For 1000000 allocations and free, the code with no tracing executes in 5779ms, The code tracing enabled and no hash map takes 40262ms, the code with your implementation of the has map (using 1000 entries) takes 20632ms and the code with the hash map (using 50 entries) using the linked list as entries executes in 8129ms (7381ms with 1000 entries in the hash_map).

@chipweinberger
Copy link
Contributor Author

chipweinberger commented Apr 3, 2023

good approach!

Yes in my design the hash map must be big to remain fast.

Your approach seems like a nice simplification for the user.

Good work. I imagine that was tricky to code!

@SoucheSouche
Copy link
Collaborator

SoucheSouche commented Apr 3, 2023

The update was fairly straightforward and sat just well in your design. I mostly had to update the map_x functions.

The new implementation has been approved by 1 reviewer out of 2 so hopefully we should see your feature in the master branch soon. I will keep you updated as usual.

Also, unrelated update: I will start working on your console PRs for the duration of April, after what I will be out of work for an extended period.

@chipweinberger
Copy link
Contributor Author

chipweinberger commented Apr 4, 2023

Glad to hear it!

I'm glad Espressif has taken an interest in the Console PRs. Hopefully not too much work - they're fully functional but of course "polish" takes awhile too.

On that topic, It would be great to solve this issue: #9366 This issue started about CDC OTG console but in reality USB Serial/JTAG is what we need. I should probably open a separate issue. We just need an official way to determine if the USB Serial JTAG is connected. I created this PR as a workaround (#10097), but in reality that PR should be denied in favor of a way that uses the hardware registers. Someone did suggest some code, but it unclear if it is actually correct.

@espressif-bot espressif-bot added Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: In Progress Work is in progress Resolution: NA Issue resolution is unavailable labels Apr 5, 2023
@chipweinberger
Copy link
Contributor Author

merged: b699033

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants