
main: use hash list for fds instead of fixed array #428

Closed
sreimers wants to merge 3 commits from the main_refactor_fhs branch

Conversation

@sreimers (Member) commented Jul 9, 2022

If an application uses multiple (maybe thousands of) libre event loop threads, the unused pre-allocated memory for fhs can be quite high. This PR uses a dynamic hash list for storing fhs structs. This way EPOLL and KQUEUE have no hard fd limit anymore (re->maxfds only limits the number of active event handlers returned per fd_poll). The fd limit for SELECT stays the same.

I did some benchmarks (between this PR and main) and the performance differences for fd_listen and fd_close are negligible if the hash bucket size is well chosen (it depends on re->maxfds, which can be set with fd_setsize() like before).
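A minimal usage sketch, assuming the existing public libre_init(), libre_close() and fd_setsize() APIs (the value 4096 is just an example): sizing the loop before any fd_listen() call keeps the hash lookup in its fast path.

#include <re.h>

int main(void)
{
	int err = libre_init();
	if (err)
		return err;

	/* hint the expected number of fds; with this PR the same value
	   also determines the hash bucket sizing (re->maxfds) */
	err = fd_setsize(4096);
	if (err)
		goto out;

	/* ... fd_listen()/re_main() as usual ... */

 out:
	libre_close();
	return err;
}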

@sreimers force-pushed the main_refactor_fhs branch 2 times, most recently from b0954f8 to 9e9c8ef on July 10, 2022
@sreimers changed the title from "main: use hash list for fds instead of fixed array - WIP" to "main: use hash list for fds instead of fixed array" on Jul 10, 2022
@sreimers force-pushed the main_refactor_fhs branch from 7409fa0 to 535659d on July 10, 2022
@sreimers marked this pull request as ready for review July 10, 2022
@Lastique (Contributor)

For some reason I was not notified of this PR. I haven't looked closely yet, but at first glance the maxfds limit should not apply to the poll backend.

@sreimers marked this pull request as draft July 11, 2022
@sreimers (Member Author)

I have now refactored the poll backend with an independent index value.

@sreimers force-pushed the main_refactor_fhs branch from 4a7152c to dff8575 on July 14, 2022
@sreimers marked this pull request as ready for review July 14, 2022
@sreimers (Member Author)

If this PR does not meet your multithreading needs, I think it's better to close this one too (it keeps libre simple), and you can maintain your own patch more easily.

@sreimers closed this Jul 17, 2022
@Lastique (Contributor) commented Jul 17, 2022

Just to be clear, I think that with some work this could be made an acceptable solution. In particular, I don't object to a hash table in principle; I just noted that it should be used with care.

@sreimers reopened this Jul 17, 2022
@sreimers force-pushed the main_refactor_fhs branch from 99e93bf to 880869c on July 18, 2022
@sreimers added this to the v2.7.0 milestone Jul 21, 2022
@sreimers marked this pull request as draft July 24, 2022
@sreimers force-pushed the main_refactor_fhs branch 2 times, most recently from 7c235d6 to 970aeca on July 25, 2022
@sreimers force-pushed the main_refactor_fhs branch 3 times, most recently from 91959bc to 7ff2086 on July 26, 2022
@sreimers force-pushed the main_refactor_fhs branch from cb1bc55 to 87bf668 on August 4, 2022
@sreimers marked this pull request as ready for review August 4, 2022
@alfredh (Contributor) commented Aug 14, 2022

in general I am worried about such a large change in main

I think we can extend the Windows mode to use binary search, and make it generic so that two modes can be selected (fd lookup and binary search).

in any case we have to test that it works with a very large set of file descriptors, on the order of 100K, and verify correctness and performance.

@sreimers (Member Author)

I tried to keep the changes as minimal and fast as possible, but yes, we should test this very well. This PR tries to fix two primary issues.

@Lastique do you see any remaining ABA problems? If you are worried about fhs_cache memory usage, I can make this optional.

I did some performance tests, but will do some more (with restund). The hashing overhead is negligible if maxfds is well chosen (like fd_setsize(-1)):

static uint32_t fhs_hash(struct re *re, re_sock_t fd)
{
	if (fd < re->maxfds)
		return (uint32_t)fd;
...

To avoid mem_zalloc (the old behavior with the allocated fhs array), the fhs_cache list can optionally be pre-allocated (not implemented yet).
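A hedged sketch of what such a pre-allocation could look like (treating fhs_cache as a struct list on struct re is an assumption based on the name above; mem_zalloc() and list_append() are existing libre calls):

static int fhs_cache_prealloc(struct re *re, uint32_t n)
{
	for (uint32_t i = 0; i < n; i++) {
		struct fhs *fhs = mem_zalloc(sizeof(*fhs), NULL);
		if (!fhs)
			return ENOMEM;

		/* park the entry in the cache list for later reuse */
		list_append(&re->fhs_cache, &fhs->le, fhs);
	}

	return 0;
}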

@alfredh (Contributor) commented Jan 21, 2023

a good way to pack struct members is to put the largest elements first, for example:

32 bytes
16 bytes
8 bytes
4 bytes
2 bytes

that way the compiler can achieve both small size and good performance

the "bool active" flag can be moved to "int flags"

is it an option to use a large sorted array and bsearch for lookup:

https://man7.org/linux/man-pages/man3/bsearch.3.html
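A hedged sketch of that alternative, assuming an array of struct fhs kept sorted by fd (fhs_cmp and fhs_find are illustrative names):

#include <stdlib.h>  /* bsearch */

/* bsearch comparator: first argument is the key, second the element */
static int fhs_cmp(const void *key, const void *elem)
{
	const re_sock_t *fd   = key;
	const struct fhs *fhs = elem;

	return (*fd > fhs->fd) - (*fd < fhs->fd);
}

/* O(log n) lookup; insertions/removals must keep the array sorted */
static struct fhs *fhs_find(struct fhs *arr, size_t n, re_sock_t fd)
{
	return bsearch(&fd, arr, n, sizeof(*arr), fhs_cmp);
}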

@sreimers force-pushed the main_refactor_fhs branch 3 times, most recently from 52e9a22 to c168893 on January 26, 2023
@sreimers (Member Author) commented Jan 28, 2023

is it an option to use a large sorted array and bsearch for lookup

I see no real benefit here; the fhs_hash() lookup is O(1) if fd < maxfds:

static uint32_t fhs_hash(struct re *re, re_sock_t fd)
{
	if (fd < re->maxfds)
		return (uint32_t)fd;
...

This should be the common case, since on Windows, too, fd values are incremented evenly (most of the time). An optimized fast murmur hash guarantees an even distribution for the other scenarios (higher fd numbers with a lower maxfds setting). So the worst case is maybe two or three probes per lookup if nfds > maxfds. As in the past, this can be tuned with fd_setsize(n).
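For illustration, a hedged sketch of what the elided fallback branch ("...") could look like; hash_fast() stands in for the murmur-style helper mentioned above, and the actual PR code may differ:

static uint32_t fhs_hash(struct re *re, re_sock_t fd)
{
	if (fd < re->maxfds)
		return (uint32_t)fd;  /* common case: identity mapping, O(1) */

	/* rare case: fold high fd values evenly into the bucket range */
	return hash_fast((const char *)&fd, sizeof(fd)) % re->maxfds;
}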

Here are some performance measurements for a TCP Echo Server (maxfds=1024, active=50, idle=64):

- libuv	                93835 req/s
- libre main epoll      94462 req/s
- libre fhs_hash epoll  94875 req/s
- libre main select     115626 req/s
- libre fhs_hash select 115429 req/s

And some simple HTTP (no keepalive):

./wrk -c 512 -d 60 http://localhost:8888/

libre main epoll (maxfds=1024)

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    47.70ms   81.04ms 552.19ms   88.41%
    Req/Sec    10.44k     6.23k   16.79k    71.39%
  875516 requests in 1.00m, 53.44MB read
Requests/sec:  14574.85

User time (seconds): 1.57
System time (seconds): 17.96
Maximum resident set size (kbytes): 5788

libre main select (maxfds=1024)

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    56.11ms   96.04ms 556.63ms   87.37%
    Req/Sec    11.12k     6.04k   16.99k    75.99%
  873267 requests in 1.00m, 53.30MB read
Requests/sec:  14513.07

User time (seconds): 2.65
System time (seconds): 22.81
Maximum resident set size (kbytes): 5816

libre fhs_hash epoll (maxfds=1024)

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    48.77ms   81.55ms 529.53ms   88.27%
    Req/Sec    10.43k     5.74k   16.47k    74.01%
  875026 requests in 1.00m, 53.41MB read
Requests/sec:  14557.75

User time (seconds): 1.80
System time (seconds): 18.74
Maximum resident set size (kbytes): 5812

libre fhs_hash select (maxfds=1024)

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    55.48ms   95.58ms 567.17ms   87.46%
    Req/Sec    11.19k     6.18k   17.35k    75.27%
  884055 requests in 1.00m, 53.96MB read
Requests/sec:  14725.99

User time (seconds): 2.50
System time (seconds): 23.20
Maximum resident set size (kbytes): 5852

libre fhs_hash epoll (maxfds=64)

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    48.32ms   81.00ms 527.51ms   88.38%
    Req/Sec    10.48k     5.74k   17.78k    74.53%
  883962 requests in 1.01m, 53.95MB read
Requests/sec:  14623.82

User time (seconds): 1.86
System time (seconds): 18.99
Maximum resident set size (kbytes): 5844

libre fhs_hash select (maxfds=64)

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    55.81ms   95.89ms 566.70ms   87.40%
    Req/Sec    11.12k     6.21k   17.15k    74.93%
  871336 requests in 1.00m, 53.18MB read
Requests/sec:  14503.00

User time (seconds): 3.09
System time (seconds): 23.04
Maximum resident set size (kbytes): 5844

https://gist.github.com/sreimers/337c9ee783cbe6298d7352b338ed8c3c

Linux Kernel 6.1.7 - AMD EPYC 3251 8-Core Processor

a good way to pack struct members is to put the largest elements first, example

Right, but _Bool should mostly be 1 byte.

$ pahole -Cfhs build/libre.a
struct fhs {
        struct le                  le;                   /*     0    32 */
        re_sock_t                  fd;                   /*    32     4 */
        int                        flags;                /*    36     4 */
        _Bool                      active;               /*    40     1 */

        /* XXX 7 bytes hole, try to pack */

        fd_h *                     fh;                   /*    48     8 */
        void *                     arg;                  /*    56     8 */

        /* size: 64, cachelines: 1, members: 6 */
        /* sum members: 57, holes: 1, sum holes: 7 */
};

@sreimers force-pushed the main_refactor_fhs branch 3 times, most recently from 8473afd to 4d34ec1 on January 30, 2023
@Lastique (Contributor)

a good way to pack struct members is to put the largest elements first, example

Right, but _Bool should be mostly 1 Byte.

I think the suggestion was to use one bit of flags for the active flag. This would remove the _Bool and the padding that follows it, 8 bytes in total, from the fhs size.

But in practice that wouldn't save any memory. On x86-64, dynamic memory allocations are 16-byte aligned, which means the real amount of memory reserved for one allocation must be divisible by 16. You're using mem_zalloc to allocate instances of fhs, which adds a mem header (16 bytes in non-debug mode). So the allocation you would be making would be 16 + 56 = 72 bytes, which the system memory allocator will round up to 80.

PS: Again, because the alignment is only 16 bytes, pahole is wrong about "cachelines: 1". Not that it matters in this case, as the real allocation will be larger than a cache line anyway. Just keep in mind that the allocation could get split between two cache lines due to misalignment.

@alfredh (Contributor) commented Feb 5, 2023

The polling method can currently be changed at runtime. Is this an important use case?

If we change the polling method from runtime to compile-time config, it would simplify some of the code. For example, rebuild_fds would become obsolete.

in my simplistic view, the typical method would be (sketched after the list below):

  • select: Windows
  • epoll: Linux and Android
  • kqueue: Mac, BSD
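A hedged sketch of such a compile-time default (the RE_POLL_METHOD macro is hypothetical; METHOD_SELECT/METHOD_EPOLL/METHOD_KQUEUE are libre's existing enum poll_method values, and the platform macros are illustrative):

#if defined(WIN32)
#define RE_POLL_METHOD METHOD_SELECT
#elif defined(__linux__) || defined(__ANDROID__)
#define RE_POLL_METHOD METHOD_EPOLL
#elif defined(__APPLE__) || defined(__FreeBSD__) || defined(__OpenBSD__)
#define RE_POLL_METHOD METHOD_KQUEUE
#else
#define RE_POLL_METHOD METHOD_SELECT  /* portable fallback */
#endif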

@sreimers (Member Author) commented Feb 5, 2023

If we change the polling method from runtime to compile-time config, it would simplify
some of the code.

Or maybe it is enough to allow setting it once, before poll_setup, and disallow runtime changes after setup.
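A hedged sketch of such a guard (poll_method_set() is the existing libre entry point, but the re_get() accessor and the polling flag are assumptions):

int poll_method_set(enum poll_method method)
{
	struct re *re = re_get();  /* hypothetical per-thread accessor */

	/* disallow changes once the poll set has been built */
	if (re->polling)           /* assumed "already set up" flag */
		return EALREADY;

	re->method = method;
	return 0;
}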

@sreimers modified the milestones: v2.12.0, v2.13.0 Feb 6, 2023
re_sock_t fd;    /**< File Descriptor                    */
int flags;       /**< Polling flags (Read, Write, etc.)  */
fd_h *fh;        /**< Event handler                      */
void *arg;       /**< Handler argument                   */
bool active;     /**< In-use                             */
Contributor

It might be possible to move the storage of this struct to the application. In that case there would be no need for an array/hash.

@sreimers (Member Author) Feb 12, 2023

Maybe I don't fully understand your suggestion, but wouldn't this break all applications badly? Handling all the event details is not easy to implement, so we would have to implement it twice (retest and baresip)?

I can think of some memory size optimizations for fhs (see the sketch after this list):

  • Instead of mem_zalloc we can use malloc/memset/free directly; this saves 16 bytes per fhs entry.
  • struct le can be replaced by a void *next pointer for a simple linked list and a custom hash lookup solution. This saves 24 bytes per fhs entry and 8 bytes per bucket; the fhs pointer could be recalculated with offsetof.
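A self-contained sketch of the second idea (types simplified; in libre the fd would be re_sock_t and the handler an fd_h*). With next as the first member, the offsetof is zero, so recovering the entry pointer from a link is trivial:

#include <stdint.h>

struct fhs_min {
	struct fhs_min *next;  /* 8 bytes vs. 32 bytes for struct le */
	int fd;
	int flags;
	void (*fh)(int flags, void *arg);
	void *arg;
};

/* bucket count must be a power of two for the mask to work */
static struct fhs_min *bucket_lookup(struct fhs_min *const *buckets,
				     uint32_t bsize, int fd)
{
	struct fhs_min *e = buckets[(uint32_t)fd & (bsize - 1)];

	while (e && e->fd != fd)
		e = e->next;

	return e;
}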

Contributor

If you look at struct tmr, the object is stored by the user.

The idea was to do something similar, for example:

int   fd_listen(struct fhs *fhs, re_sock_t fd, int flags, fd_h *fh, void *arg);
void  fd_close(struct fhs *fhs);
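A hedged usage sketch of that API shape, by analogy with struct tmr and tmr_start() (struct my_conn and read_handler are hypothetical; FD_READ is libre's existing flag):

struct my_conn {
	struct fhs fhs;   /* embedded; libre would allocate nothing */
	re_sock_t sock;
};

static void read_handler(int flags, void *arg)
{
	struct my_conn *c = arg;
	(void)flags;
	(void)c;
	/* read from c->sock ... */
}

static int conn_start(struct my_conn *c)
{
	return fd_listen(&c->fhs, c->sock, FD_READ, read_handler, c);
}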

@sreimers (Member Author)

Ah, I see, but I think this is problematic for multithreading: if the fhs is destroyed by the application while fd_poll is waiting, a dangling fhs is used. But maybe we can simply mem_ref the fhs to avoid this. I will think about this approach.
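A hedged two-line sketch of that idea (mem_ref()/mem_deref() are libre's existing refcount calls; where exactly they would sit inside fd_poll is an open question):

/* before blocking: keep the entry alive even if the app derefs it */
struct fhs *guard = mem_ref(fhs);

/* ... fd_poll wait + handler dispatch ... */

mem_deref(guard);  /* drop the guard after the handlers have run */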

@sreimers (Member Author)

I just gave this approach a try: #805 (not ready yet)

@alfredh (Contributor) commented Feb 11, 2023

In this case, how many entries do we need in the epoll/kqueue struct?

#ifdef HAVE_EPOLL
	struct epoll_event *events;  /**< Event set for epoll()             */
	int epfd;                    /**< epoll control file descriptor     */
#endif
#ifdef HAVE_KQUEUE
	struct kevent *evlist;
	int kqfd;
#endif

@sreimers (Member Author)

in this case, how many entries do we need in the epoll/kqueue struct ?

We can control this size with fd_setsize() and use DEFAULT_MAXFDS as the default, but for select we need an array lookup for fhs.
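A hedged sketch of that sizing (the events member is shown in the snippet above; mem_zalloc() is the existing libre allocator, and the error handling is illustrative):

#ifdef HAVE_EPOLL
	/* size the returned-events array from re->maxfds */
	re->events = mem_zalloc(re->maxfds * sizeof(struct epoll_event),
				NULL);
	if (!re->events)
		return ENOMEM;
#endif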

@sreimers marked this pull request as draft February 14, 2023
@alfredh removed this from the v2.13.0 milestone Mar 5, 2023
@ahuj9 commented Apr 11, 2023

Just a thought, could a library like wepoll (BSD-2-clause) be integrated into re for Windows to emulate HAVE_EPOLL? I have been using libre privately with this method for some time now.

@Lastique (Contributor)

I think this PR is taking way too long as it is. IMHO, the working refactored version should be merged first, and further optimizations and extensions can be worked on later in separate PRs.

@alfredh (Contributor) commented May 15, 2023

hi Lastique,

if you are interested in this work, perhaps you can help to test it?

I am not sure what is next regarding this work.

Personally I don't need it for any of my projects.

If we are not sure what to do, we could close it down and revisit it later...

Alfred

@Lastique (Contributor) commented May 15, 2023

if you are interested in this work, perhaps you can help to test it?

If we are not sure what to do, we could close it down and revisit it later...

I am interested in a less limited and more efficient implementation of the main loop. Once this work is finished and released, I would try to use it instead of our own patch, which we are currently using.

Regarding what to do with the PR, I think it should be merged when it is finished. IMHO, the scope of this work should not be conflated with further optimizations - those can come later in separate PRs. Just get this thing to a finished working state and then merge.

@sreimers (Member Author)

Closed in favor of #805.

@sreimers closed this Aug 18, 2023
@sreimers deleted the main_refactor_fhs branch October 4, 2023 11:25