iocp: fix crash, GetQueuedCompletionStatus() write freed WSAOVERLAPPED memory #4136

Merged (28 commits) on Feb 5, 2025

Conversation

jimying
Contributor

@jimying jimying commented Nov 7, 2024

Try to fix issue #985. The idea is to call CancelIoEx() for the unregistering socket/key to cancel all pending operations of the key. However, as CancelIoEx() is basically asynchronous, this also makes the key unregistration asynchronous, so here are some consequences:

  • The memory for the operation key must not be freed until the cancellation has completed, so instead of using the app-supplied operation key, the IOCP will use an internal copy.
  • A key must always have a group lock (which can be supplied by the app); the group lock handles the release of the key's resources.
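
The ownership rule in the first bullet can be sketched in portable C. This is only an illustration under stated assumptions: `pending_op`, `sketch_key`, `op_start()`, `key_unregister()` and `op_complete()` are hypothetical names, the `overlapped` array is a placeholder for the real WSAOVERLAPPED, and setting the `cancelled` flag stands in for the point where the real code would call CancelIoEx(). It does not reproduce the code in ioqueue_winnt.c.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the app-supplied operation key. */
typedef struct app_op_key {
    void *user_data;
} app_op_key;

/* Hypothetical internal copy owned by the ioqueue. In the real code this
 * would hold the WSAOVERLAPPED that the kernel writes to on completion. */
typedef struct pending_op {
    char        overlapped[32]; /* placeholder for WSAOVERLAPPED */
    app_op_key *app_key;        /* only handed back to callbacks */
    int         cancelled;
} pending_op;

typedef struct sketch_key {
    pending_op *ops[8];
    int         op_cnt;
    int         closing;
} sketch_key;

/* Start an operation: allocate an internal copy instead of using app memory. */
static pending_op *op_start(sketch_key *key, app_op_key *app_key)
{
    pending_op *op;
    if (key->closing || key->op_cnt == 8)
        return NULL;
    op = calloc(1, sizeof(*op));
    if (!op)
        return NULL;
    op->app_key = app_key;
    key->ops[key->op_cnt++] = op;
    return op;
}

/* Unregister: only request cancellation. Nothing is freed here, because
 * the completion for each cancelled operation has not arrived yet. */
static void key_unregister(sketch_key *key)
{
    int i;
    key->closing = 1;
    for (i = 0; i < key->op_cnt; ++i)
        key->ops[i]->cancelled = 1;
}

/* A completion (normal or cancelled) was dequeued: only now free the op.
 * Returns the number of operations still pending on the key. */
static int op_complete(sketch_key *key, pending_op *op)
{
    int i;
    for (i = 0; i < key->op_cnt; ++i) {
        if (key->ops[i] == op) {
            key->ops[i] = key->ops[--key->op_cnt];
            free(op);
            break;
        }
    }
    return key->op_cnt;
}
```

The key point is that `key_unregister()` never calls `free()`; the memory goes away only in `op_complete()`, after the completion has been delivered.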

@nanangizz
Member

Great! I assume you've tested this and it does fix the issue :)

I think we also need to run this under some kind of stress test, e.g. the ioqueue test in pjlib-test, to make sure all memory pools (of pending ops) are properly released. Note that after an ioqueue key is unregistered, the key is put into the closing-key-list and soon afterwards into the free-key-list, to be reused by another socket. We need to make sure that all pending ops have been freed before the key is freed and reused.

Next, perhaps we can apply a small optimization: instead of a mem-pool for each pending-op, use a mem-pool per ioqueue-key to avoid repeated alloc+free for multiple pending-ops, using the same mechanism as the ioqueue key (an additional list that keeps unused pending-op instances to be reused later).
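
The reuse list suggested above could look roughly like the following. This is a minimal sketch, not the PR's implementation: `op_cache`, `op_acquire()`, `op_release()` and `op_cache_destroy()` are hypothetical names, and plain `malloc()`/`free()` stand in for the per-key memory pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct pending_op {
    struct pending_op *next;   /* link while the op sits on the free list */
    char payload[64];          /* placeholder for the overlapped data */
} pending_op;

/* Hypothetical per-key cache of unused pending-op instances. */
typedef struct op_cache {
    pending_op *free_list;     /* unused instances kept for reuse */
    unsigned    alloc_count;   /* how often the allocator was really hit */
} op_cache;

/* Take an op from the free list, allocating only when the list is empty. */
static pending_op *op_acquire(op_cache *c)
{
    pending_op *op = c->free_list;
    if (op) {
        c->free_list = op->next;
    } else {
        op = malloc(sizeof(*op));
        if (!op)
            return NULL;
        c->alloc_count++;
    }
    op->next = NULL;
    return op;
}

/* Return a finished op to the free list instead of freeing it. */
static void op_release(op_cache *c, pending_op *op)
{
    op->next = c->free_list;
    c->free_list = op;
}

/* Release everything when the key itself goes away. */
static void op_cache_destroy(op_cache *c)
{
    while (c->free_list) {
        pending_op *op = c->free_list;
        c->free_list = op->next;
        free(op);
    }
}
```

With this shape, a key that receives a steady stream of packets allocates only as many ops as are ever pending at the same time, rather than one per operation.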

@jimying
Contributor Author

jimying commented Nov 7, 2024

Note:
this only fixes the issue when PJ_IOQUEUE_HAS_SAFE_UNREG=1;
with PJ_IOQUEUE_HAS_SAFE_UNREG=0 there is still a memory error

@nanangizz
Member

> Note: this only fixes the issue when PJ_IOQUEUE_HAS_SAFE_UNREG=1; with PJ_IOQUEUE_HAS_SAFE_UNREG=0 there is still a memory error

When PJ_IOQUEUE_HAS_SAFE_UNREG==1, the key won't be reused for 500ms (configurable via PJ_IOQUEUE_KEY_FREE_DELAY). That is actually a bit risky: under high CPU load, for example, the cancellation may take longer than that.

@nanangizz
Member

nanangizz commented Nov 14, 2024

I ran pjlib-test with this PR's patch on VS2005 with pool debugging enabled (PJ_POOL_DEBUG set to 1 in config_site.h, so the pool uses plain malloc()/free()) and got an assertion:

 	_wassert(const wchar_t * expr=0x004d1e90, const wchar_t * filename=0x004d1cf0, unsigned int lineno=588) Line 212	C
 	pj_ioqueue_register_sock2(pj_pool_t * pool=0x023a29b8, pj_ioqueue_t * ioqueue=0x023a2a2c, long sock=388, pj_grp_lock_t * grp_lock=0x00000000, void * user_data=0x00000002, const pj_ioqueue_callback * cb=0x0019fb54, pj_ioqueue_key_t * * key=0x0019fbec) Line 588 + 0x2c bytes	C
 	pj_ioqueue_register_sock(pj_pool_t * pool=0x023a29b8, pj_ioqueue_t * ioqueue=0x023a2a2c, long sock=388, void * user_data=0x00000002, const pj_ioqueue_callback * cb=0x0019fb54, pj_ioqueue_key_t * * key=0x0019fbec) Line 665 + 0x1f bytes	C
 	unregister_test(const pj_ioqueue_cfg * cfg=0x0019fd34) Line 554 + 0x23 bytes	C
 	udp_ioqueue_test_imp(const pj_ioqueue_cfg * cfg=0x0019fd34) Line 1190 + 0x9 bytes	C
 	udp_ioqueue_test() Line 1255 + 0x9 bytes	C
 	test_inner() Line 171 + 0x2a bytes	C
 	test_main() Line 245 + 0x5 bytes	C

Not sure if this is the same issue, but this assertion does not happen when using ioqueue select.

@jimying
Contributor Author

jimying commented Nov 14, 2024

@nanangizz Without this patch, does this assert still occur?

@nanangizz
Member

> @nanangizz Without this patch, does this assert still occur?

Yes, same assert without this patch.

@jimying
Contributor Author

jimying commented Nov 17, 2024

> Tried to run pjlib-test with this PR patch on VS2005 and pool debugging enabled [...] I got an assertion: [...]
>
> Not sure if this is the same issue, but this assertion does not happen when using ioqueue select.

I found the cause: the key was unregistered twice.
Fix: when key->closing is already 1, skip the unregistration.
The test now passes.
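
A guard of this kind can be sketched as follows. The struct and the `unreg_count` field are hypothetical and exist only to make the behaviour observable; in real multi-threaded code the check and the assignment would have to happen under the key's lock, which this single-threaded sketch omits.

```c
#include <assert.h>

typedef struct sketch_key {
    int closing;      /* set once unregistration has started */
    int unreg_count;  /* how often teardown really ran (demo only) */
} sketch_key;

/* Returns 0 when teardown was started, -1 when the key was already
 * closing and the call was ignored. */
static int key_unregister(sketch_key *key)
{
    if (key->closing)
        return -1;          /* second call: do nothing */
    key->closing = 1;
    key->unreg_count++;     /* stand-in for closing the socket, cancelling ops */
    return 0;
}
```

Calling `key_unregister()` twice on the same key runs the teardown once; the second call returns -1 without touching the key.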

@jimying jimying marked this pull request as ready for review November 22, 2024 02:52
@nanangizz
Member

Thanks @jimying .

Honestly, I haven't had a chance to reproduce the original issue and test the proposed solution. I believe you are using this ioqueue in the real world, have experienced the issue, and have found that this solution works; is that correct?

Next, here are a few notes about the proposed solution:

  • As you've mentioned, the approach requires PJ_IOQUEUE_HAS_SAFE_UNREG to work, so I think this mode has to be enforced somehow.
  • As I've also mentioned before, the safe key unregistration relies on a timer (default 500ms), so there is a risk when CPU load is high. Increasing the timeout is not ideal, as the risk is still there, only reduced to some degree. Perhaps the unregistration should rely on zero pending operations instead, or on some combination of the two.
  • The memory pool per operation can be optimized. Note that a single ioqueue key can have many operations in rapid succession, e.g. when receiving RTP packets, so creating and releasing a pool for each operation does not sound very efficient. One idea is to have a pool of pending_op instances (just like the pool of keys), owned either by the ioqueue or by the key.

Also, this ioqueue has been disabled for quite some time, and some improvements in the ioqueue area may not have been integrated into it, e.g. the group lock for keys. So please understand that there may still be some steps required to enable this ioqueue again :)

@jimying
Contributor Author

jimying commented Nov 23, 2024

@nanangizz I wrote a simple demo that reproduces the crash in MSYS2: #4172

I have tested it: with the old code, it reproduces the crash 100% of the time.

To test the new code, we can git cherry-pick the demo patch onto this branch.

@nanangizz
Member

Thanks @jimying.

@nanangizz nanangizz modified the milestones: release-2.15, release-2.16 Dec 3, 2024
@jimying
Contributor Author

jimying commented Dec 13, 2024

enforcing PJ_IOQUEUE_HAS_SAFE_UNREG and zero pending-op

The new commits do the following:

  1. Remove the macro PJ_IOQUEUE_HAS_SAFE_UNREG and keep only the PJ_IOQUEUE_HAS_SAFE_UNREG=1 logic.
  2. Remove closing_list; when ref_count reaches 0, the key is returned to the freelist.
    To achieve this, a reference is added when a pending-op is allocated and released when it is freed.

The pool is owned by the key/socket instead of by the ioqueue, to avoid possibly unbounded memory growth in the ioqueue.
Update ioq_stress_test not to use the global group lock for key registration, as otherwise the keys won't be released until the global group lock is destroyed (i.e. after the ioqueue is destroyed).
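
The reference-counting rule described in this comment (one reference per pending-op, key returned to the freelist at zero) can be sketched like this. All names are hypothetical, the counter is a plain int rather than an atomic or a group lock, and the sketch assumes that registration itself holds one reference, which the comment does not state.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_key {
    struct sketch_key *next_free; /* link in the ioqueue free list */
    int ref_count;                /* assumed: 1 for registration + 1 per op */
    int closing;                  /* set by unregister */
} sketch_key;

typedef struct sketch_ioqueue {
    sketch_key *free_list;
} sketch_ioqueue;

/* Take a key from the free list; registration holds one reference. */
static sketch_key *key_register(sketch_ioqueue *ioq)
{
    sketch_key *key = ioq->free_list;
    if (!key)
        return NULL;
    ioq->free_list = key->next_free;
    key->next_free = NULL;
    key->closing = 0;
    key->ref_count = 1;
    return key;
}

/* Called when a pending-op is allocated for the key. */
static void key_add_ref(sketch_key *key)
{
    key->ref_count++;
}

/* Called when a pending-op is freed, and once by unregister. The key goes
 * back to the free list only when the last reference is dropped. */
static void key_dec_ref(sketch_ioqueue *ioq, sketch_key *key)
{
    assert(key->ref_count > 0);
    if (--key->ref_count == 0) {
        key->next_free = ioq->free_list;
        ioq->free_list = key;
    }
}

/* Drop the registration reference; pending ops keep the key alive. */
static void key_unregister(sketch_ioqueue *ioq, sketch_key *key)
{
    if (key->closing)
        return;
    key->closing = 1;
    key_dec_ref(ioq, key);
}
```

Because the key becomes reusable exactly when its last pending-op is freed, no timer-based delay is needed to decide when reuse is safe.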
@nanangizz
Member

I think this is ready for review @jimying , @sauwming , @trengginas.

- added info to clarify codes
- added copyright for test code
- minors.
@nanangizz
Member

The last commit should cover all review comments above.

@jimying, re: copyright text, feel free to change the name :)
Btw, as mentioned in the top of the file, the test can repro the issue on Win 10 & MSVC2005, but cannot on Windows 11 & MSVC 2022. Also the test will be run for all ioqueue types (originally for iocp only), just in case.

@@ -0,0 +1,208 @@
/*
* Copyright (C) 2024 jimying at github dot com.
* Copyright (C) 2024 Teluu Inc. (http://www.teluu.com)
Member

If I'm not mistaken, Teluu is usually put above, for two reasons: 1. to signify that the original author has agreed to contribute it to Teluu (as per CLA), 2. to make it easier to update copyright year (i.e. only the latest/first copyright info will get updated, the rest will remain the same).

* operations must be cancelled. As cancelling ops is asynchronous,
* IOCP destroy may need to wait for the maximum time specified here.
*/
#define TIMEOUT_CANCEL_OP 5000
Contributor Author

The TIMEOUT_CANCEL_OP macro is unused. Is WAIT_KEY_MS (in pj_ioqueue_destroy()) meant to have the same value?

Member

Actually WAIT_KEY_MS should have been replaced by TIMEOUT_CANCEL_OP, so WAIT_KEY_MS is unused (and undefined).
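
The bounded wait that this timeout controls can be sketched as below. Only the `TIMEOUT_CANCEL_OP` value comes from the diff excerpt; `wait_ctx`, `poll_once()` and `wait_pending_keys()` are hypothetical, and a fake millisecond clock replaces the real tick counter so the loop can be exercised without an ioqueue.

```c
#include <assert.h>

#define TIMEOUT_CANCEL_OP 5000  /* ms, value taken from the diff excerpt */

/* Hypothetical state so the loop can run without a real ioqueue. */
typedef struct wait_ctx {
    unsigned now_ms;        /* fake clock */
    unsigned pending_keys;  /* keys still waiting for cancelled ops */
    unsigned drain_every;   /* one key finishes per this many ms; 0 = never */
} wait_ctx;

/* Stand-in for one poll iteration: advance the clock by 10 ms and let
 * one key finish whenever the drain interval is reached. */
static void poll_once(wait_ctx *c)
{
    c->now_ms += 10;
    if (c->drain_every && c->pending_keys &&
        c->now_ms % c->drain_every == 0)
    {
        c->pending_keys--;
    }
}

/* Poll until no key is pending or the deadline passes. Returns the
 * number of keys still pending (0 means everything was released). */
static unsigned wait_pending_keys(wait_ctx *c, unsigned timeout_ms)
{
    unsigned stop = c->now_ms + timeout_ms;
    while (c->pending_keys > 0 && c->now_ms < stop)
        poll_once(c);
    return c->pending_keys;
}
```

A non-zero return value corresponds to the case where the destroy path would log a timeout warning and continue.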


pj_list_push_back(&ioqueue->free_list, key);
}
#endif
ioqueue->max_fd = pj_list_size(&ioqueue->free_list); // max_fd;
Contributor Author

Why compute this again with pj_list_size()? Better to revert to ioqueue->max_fd = max_fd.

Member

oops right, forgot to revert.

pj_gettickcount(&timeout);
if (PJ_TIME_VAL_GTE(timeout, stop)) {
PJ_LOG(3, (THIS_FILE, "Warning, IOCP destroy timeout in waiting "
"for cancelling ops, after %dms, pending keys=%d",
Contributor Author

minor build warning: format '%d' expects argument of type 'int', but argument 4 has type 'pj_size_t'

Member

fixed
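
The thread does not show how the warning was fixed. One common way to silence this kind of mismatch is to cast the size value to a type that matches the conversion specifier, sketched here with standard `snprintf()`. The function name is hypothetical, `size_t` stands in for `pj_size_t`, and the message text is shortened.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Formats a shortened version of the warning. size_t stands in for
 * pj_size_t here, which is an assumption of this sketch. */
static int format_timeout_warning(char *buf, size_t buflen,
                                  int timeout_ms, size_t pending_keys)
{
    /* %d matches the int; the size value is cast to unsigned long so
     * that it matches %lu on every platform. */
    return snprintf(buf, buflen,
                    "IOCP destroy timeout after %dms, pending keys=%lu",
                    timeout_ms, (unsigned long)pending_keys);
}
```

A cast with `%lu` is used here rather than `%zu` because older C runtimes may not support `%zu`.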

@nanangizz nanangizz merged commit cbfbbc4 into pjsip:master Feb 5, 2025
40 of 41 checks passed
trengginas added a commit that referenced this pull request Feb 13, 2025
commit 1e2f121
Author: sauwming <ming@teluu.com>
Date:   Thu Feb 13 10:30:58 2025 +0800

    Fixed CI Mac failure (#4304)

commit 10b4d30
Author: sauwming <ming@teluu.com>
Date:   Thu Feb 13 09:06:54 2025 +0800

    Audio and video stream refactoring (#4300)

commit 6cbf0e6
Author: Riza Sulistyo <trengginas@users.noreply.github.com>
Date:   Wed Feb 12 15:23:37 2025 +0700

    Check if ice strans is valid before using it to send (#4301)

commit fdd4041
Author: Maciej Lisowski <39798354+MaciejDromin@users.noreply.github.com>
Date:   Tue Feb 11 12:48:17 2025 +0100

    Add missing OnTimerParam import in Android Example (#4299)

commit 4ded10f
Author: sauwming <ming@teluu.com>
Date:   Fri Feb 7 13:27:00 2025 +0800

    Fixed msg_data assertion in pjsua_acc_send_request() API (#4298)

commit 65c4bc9
Author: sauwming <ming@teluu.com>
Date:   Fri Feb 7 07:18:22 2025 +0800

    Fix pjsua sample app user agent (#4296)

commit c53ace9
Author: sauwming <ming@teluu.com>
Date:   Fri Feb 7 07:18:09 2025 +0800

    Fixed Java make clean error (#4297)

commit e6196ad
Author: LeonidGoltsblat <138720759+LeonidGoltsblat@users.noreply.github.com>
Date:   Fri Feb 7 02:15:10 2025 +0300

    Aligned memory allocation (#4277)

    * aligned memory allocaion

    * Fix alt API implementations (PJ_HAS_POOL_ALT_API)

    * pool test: add testing for bug in pj_pool_allocate_find with big alignment, and refactor to use unit test API

    * misc fixes on code review

    * pool_dbg alignment support + incompatible tests disabled for PJ_HAS_POOL_ALT_API

    ---------

    Co-authored-by: bennylp <bennylp@pjsip.org>

commit 2fff775
Author: sauwming <ming@teluu.com>
Date:   Thu Feb 6 13:11:01 2025 +0800

    Add API to register custom SDP comparison callback (#4286)

commit 99b4d1e
Author: sauwming <ming@teluu.com>
Date:   Thu Feb 6 13:10:38 2025 +0800

    Fixed issue with SDP version when reoffer is rejected (#4289)

commit 0252152
Author: Benny Prijono <bennylp@pjsip.org>
Date:   Thu Feb 6 11:09:20 2025 +0700

    Use cirunner to capture and analyze GitHub action CI crash (#4288)

    * Windows runner implementation

    * Set timeout

    * Remove initial implementation of ci-runner here (it is on separate repo now)

    * Remove crash handling (-n) in main.c of unit tests

    * Install cirunner to CI workflows

    * Adding crash to timestamp test

    * Fix missing cirunner in one of the job

    * Reinstall core_pattern on Linux

    * Removed intentional crash in timestamp_test()

    * Upload program and core dump on crash

    * Add crash code in uri_test.c

    * Removed injected crash in uri_test. Disable stdout/stderr buffering for unit tests

    * Minor: remove space left out by previous clean up

commit 3fcce51
Author: sauwming <ming@teluu.com>
Date:   Wed Feb 5 11:43:27 2025 +0800

    Fixed OpenSSL log error reading cert (#4291)

commit cbfbbc4
Author: jimying <yingqw.js@gmail.com>
Date:   Wed Feb 5 11:03:53 2025 +0800

    iocp: fix crash, GetQueuedCompletionStatus() write freed WSAOVERLAPPED memory (#4136)

commit 205baf0
Author: sauwming <ming@teluu.com>
Date:   Tue Feb 4 17:04:25 2025 +0800

    Fixed warnings in sip auth client (#4287)

commit abffe0d
Author: sauwming <ming@teluu.com>
Date:   Tue Feb 4 08:38:55 2025 +0800

    Fixed CI test failure (#4284)

commit 986fc78
Author: Johannes <johannes.westhuis@gmail.com>
Date:   Mon Feb 3 08:31:28 2025 +0100

    Share an auth session between multiple dialogs/regc (#4262)

commit 46111c4
Author: Nanang Izzuddin <nanang@teluu.com>
Date:   Mon Feb 3 11:36:54 2025 +0700

    Best effort avoid crash when media transport adapter not using group lock (#4281)

commit f986ad8
Author: Benny Prijono <bennylp@pjsip.org>
Date:   Fri Jan 31 09:45:19 2025 +0700

    Add link to coding style documentation (#4280)

commit 727ee32
Author: Nanang Izzuddin <nanang@teluu.com>
Date:   Fri Jan 31 09:08:02 2025 +0700

    Fix build error when PJ_LOG_MAX_LEVEL is zero (#4279)

    The `pj_log_get_log_func()` is not defined when PJ_LOG_MAX_LEVEL is set to zero.

    Thanks to Giorgio Alfarano for the report.

commit dae52f6
Author: Perry Ismangil <perry@teluu.com>
Date:   Thu Jan 30 08:43:54 2025 +0000

    Fixing typo (#4274)

    Acoustic

commit 1a4cd67
Author: sauwming <ming@teluu.com>
Date:   Thu Jan 30 15:21:49 2025 +0800

    Modify iOS sample apps dev team ID (#4278)

commit dfcfa13
Author: Tarteszeus <37761609+Tarteszeus@users.noreply.github.com>
Date:   Thu Jan 30 02:42:36 2025 +0100

    Add queried names to server address record, and add the address record in parameter for on_verify_cb callback (#4256)

commit f9e56d8
Author: Jan Tojnar <jtojnar@gmail.com>
Date:   Wed Jan 29 07:42:18 2025 +0100

    Fix duplicate function name in 100rel docs (#4275)

commit 960597e
Author: Nanang Izzuddin <nanang@teluu.com>
Date:   Wed Jan 29 13:36:34 2025 +0700

    Various works on SWIG Java (#4273)

    * Various works on SWIG Java

    1. Fix type mapping (SWIGTYPE_*):
       a. Map C "void*" & "void**" to Java long (was SWIGTYPE_p_void & SWIGTYPE_p_p_void which are not really usable), this should fix #4242.
       b. Map pjmedia_aud_dev_index to int.
       c. Map unsigned char[20] for SslCertInfo.serialNo to Java "short array"

       This also updates pjsua.i, e.g: tab->space, reorder things.

    2. Update swig_java_pjsua2.vcxproj:
       a. Rename config "Debug" & "Release" to "Debug-Dynamic" & "Release-Dynamic" in , as the project actually builds dynamic libs. Also fix the property sheet dependencies from *-static to *-dynamic.
       b. Update other settings, e.g: built tool version from 140 to 143.

    3. Update symbols.lst: added missing new types, tab->space, reorder alphabetically.

    * Update ci-win.yml
    * Add sample code for passing user data using utilTimerSchedule()
    * Add sample for cancelling timer

commit c36ed2c
Author: Benny Prijono <bennylp@pjsip.org>
Date:   Tue Jan 28 16:58:55 2025 +0700

    Minor modifications to Android build and samples to match new documentation (#4271)

    * To streamline the command, also clean swig and pjsua jni output directories when make distclean and realclean is called

    * Kotlin sample: add account, modify video size and bandwidth, and audio codec priorities to use AMR-WB

    * Android CLI app: fix armeabi hardcoded arch and also copy stdc++.so

commit bab33d6
Author: Noel Morgan <noel@vwci.com>
Date:   Tue Jan 28 01:51:50 2025 -0600

    Added support for updated RFC7866 content-type sub type with XML extension (#4270)

commit a89917e
Author: sauwming <ming@teluu.com>
Date:   Fri Jan 24 14:49:59 2025 +0800

    OpenSSL: Set ciphersuites only if not using BoringSSL (#4269)

commit 377a80c
Author: Nanang Izzuddin <nanang@teluu.com>
Date:   Fri Jan 24 13:48:31 2025 +0700

    Fix various compile errors & warnings in MSVC2005 (#4268)

commit de3f2e1
Author: sauwming <ming@teluu.com>
Date:   Fri Jan 24 10:20:28 2025 +0800

    Set CI vars in GH workflow file (#4263)

commit cdb1294
Author: sauwming <ming@teluu.com>
Date:   Thu Jan 23 11:01:27 2025 +0800

    Various fixes for Apple SSL backend (#4257)
3 participants