Freeze fixes and v1 kludges #2545

kolyshkin · 2024-12-13T01:40:59Z

1. freeze_processes: fix logic

There are a few issues with the freeze_processes logic:

1. Commit 9fae23f grossly (by 1000x) miscalculated the number of
   attempts required. As a result, we are seeing something like this:

   (00.000340) freezing processes: 100000 attempts with 100 ms steps
   (00.000351) freezer.state=THAWED
   (00.000358) freezer.state=FREEZING
   (00.100446) freezer.state=FREEZING
   ...close to 100 lines skipped...
   (09.915110) freezer.state=FREEZING
   (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
   (10.000563) freezer.state=FREEZING

   For 10s with 100ms steps we only need 100 attempts, not 100000.

2. When the timeout is hit, the "failed to freeze cgroup" error is not
   printed, and log_unfrozen_stacks is not called either.

3. The nanosleep at the last iteration is useless (this was hidden by
   issue 1 above, as the timeout was hit first).

Fix all these.

While at it,

4. Amend the error message with the number of attempts, sleep duration,
   and timeout.

5. Modify the "freezing cgroup" debug message to be in sync with the
   above error.

Was:

freezing processes: 100000 attempts with 100 ms steps

Now:

freezing cgroup some/name: 100 x 100ms attempts, timeout: 10s
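
To illustrate the loop shape described above, here is a minimal, self-contained sketch. It is not the code from criu/seize.c: the names `nr_attempts_for`, `wait_frozen`, `fake_cgroup` and `fake_is_frozen`, the callback-based design, and the exact error text are all assumptions made for the example. It shows three things: the attempt count is the timeout divided by the step, the sleep after the final check is skipped, and the failure is reported when the attempts run out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Number of polling attempts that fit into the timeout (both in ms).
 * E.g. a 10 s timeout with 100 ms steps gives 100 attempts. */
static unsigned long nr_attempts_for(unsigned long timeout_ms, unsigned long step_ms)
{
	unsigned long n;

	assert(step_ms > 0);
	n = timeout_ms / step_ms;
	return n ? n : 1; /* always check at least once */
}

/* Hypothetical callback: returns true once the cgroup reports FROZEN. */
typedef bool (*is_frozen_fn)(void *arg);

/* Poll until frozen or until the attempts run out.
 * Returns 0 on success, -1 on timeout. If sleeps is not NULL, it is
 * incremented once per nanosleep call (used only for demonstration). */
static int wait_frozen(is_frozen_fn is_frozen, void *arg, unsigned long timeout_ms,
		       unsigned long step_ms, unsigned long *sleeps)
{
	unsigned long i, nr = nr_attempts_for(timeout_ms, step_ms);
	struct timespec req = {
		.tv_sec = step_ms / 1000,
		.tv_nsec = (long)(step_ms % 1000) * 1000000L,
	};

	for (i = 0; i < nr; i++) {
		if (is_frozen(arg))
			return 0;
		if (i + 1 == nr)
			break; /* sleeping after the last check is useless */
		nanosleep(&req, NULL);
		if (sleeps)
			(*sleeps)++;
	}

	/* Illustrative message only; the real wording lives in criu. */
	fprintf(stderr, "failed to freeze cgroup: %lu x %lums attempts, timeout: %lums\n",
		nr, step_ms, timeout_ms);
	return -1;
}

/* Fake cgroup for the demo: freezes after a given number of failed
 * checks, or never if the counter is negative. */
struct fake_cgroup {
	int checks_until_frozen;
};

static bool fake_is_frozen(void *arg)
{
	struct fake_cgroup *cg = arg;

	if (cg->checks_until_frozen < 0)
		return false;
	if (cg->checks_until_frozen == 0)
		return true;
	cg->checks_until_frozen--;
	return false;
}
```

With a cgroup that never freezes, `wait_frozen(fake_is_frozen, &cg, 50, 10, &sleeps)` makes 5 checks and 4 sleeps, then reports the failure instead of silently falling through.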

2. freeze_processes: implement kludges for cgroup v1

Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.

In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).

When criu is used, it fails to freeze a cgroup from time to time
(see [1], [2]). Let's try adding kludges similar to the ones in runc.

Alas, I have absolutely no way to test this, so please review carefully.

[1]: opencontainers/runc#4273
[2]: opencontainers/runc#4457
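
For readers unfamiliar with this kind of workaround, the sketch below shows the general shape such a kludge can take: re-request FROZEN on every attempt, and occasionally thaw the cgroup before trying again. This is an assumption-laden illustration, not the change made to criu/seize.c nor a copy of runc's code. The `freezer_ops` abstraction, `freeze_v1_with_kludges`, the `THAW_EVERY` constant, and the fake cgroup that stays stuck in FREEZING until it is thawed once are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fstate { THAWED, FREEZING, FROZEN };

/* Hypothetical abstraction over reading and writing freezer.state. */
struct freezer_ops {
	int (*write_state)(void *cg, enum fstate want);
	enum fstate (*read_state)(void *cg);
	void (*pause)(void *cg); /* short sleep between attempts */
};

#define THAW_EVERY 50 /* illustrative value, not taken from criu */

/* Returns 0 once the cgroup is FROZEN, -1 on error or when attempts run out. */
static int freeze_v1_with_kludges(const struct freezer_ops *ops, void *cg,
				  unsigned int nr_attempts)
{
	unsigned int i;

	assert(ops != NULL);

	for (i = 0; i < nr_attempts; i++) {
		/* Kludge: an occasional thaw may unstick a cgroup
		 * that lingers in FREEZING. */
		if (i % THAW_EVERY == THAW_EVERY - 1) {
			if (ops->write_state(cg, THAWED) < 0)
				return -1;
			ops->pause(cg);
		}

		/* Kludge: ask for FROZEN on every attempt, not just once. */
		if (ops->write_state(cg, FROZEN) < 0)
			return -1;
		if (ops->read_state(cg) == FROZEN)
			return 0;
		ops->pause(cg);
	}
	return -1;
}

/* Fake v1 cgroup for the demo: stuck in FREEZING until thawed once. */
struct fake_v1_cgroup {
	enum fstate state;
	bool stuck;
	unsigned int thaws;
};

static int fake_write(void *arg, enum fstate want)
{
	struct fake_v1_cgroup *cg = arg;

	if (want == THAWED) {
		cg->state = THAWED;
		cg->stuck = false;
		cg->thaws++;
	} else {
		cg->state = cg->stuck ? FREEZING : FROZEN;
	}
	return 0;
}

static enum fstate fake_read(void *arg)
{
	return ((struct fake_v1_cgroup *)arg)->state;
}

static void fake_pause(void *arg)
{
	(void)arg; /* no real sleeping in the demo */
}

static const struct freezer_ops fake_ops = {
	.write_state = fake_write,
	.read_state = fake_read,
	.pause = fake_pause,
};
```

With the fake cgroup starting out stuck, 100 attempts succeed after exactly one thaw, while 10 attempts (fewer than `THAW_EVERY`) fail. A real implementation would read and write freezer.state and pick its own intervals.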

kolyshkin · 2024-12-16T00:17:59Z

Addressed review comments; rebased.

kolyshkin · 2024-12-17T00:58:00Z

Refactored to fix some more issues.

kolyshkin · 2024-12-17T01:02:22Z

> Commit 9fae23f grossly (by 1000x) miscalculated the number of attempts required
> <...>
>
> When the timeout is hit, the "failed to freeze cgroup" error is not printed, and the log_unfrozen_stacks is not called either.

As this is now fixed, we may find additional issues with log_unfrozen_stacks (which, I guess, was never called before due to the nr_attempts miscalculation).

kolyshkin · 2024-12-17T07:18:58Z

Testing this in opencontainers/runc#4559, no luck so far (can't reproduce the issue).

kolyshkin · 2024-12-18T23:50:42Z

> Testing this in opencontainers/runc#4559

I have concluded the testing: these kludges definitely help (i.e. I can easily reproduce the issue without the fix, and cannot reproduce it with the fix). See opencontainers/runc#4559 for more details.

kolyshkin · 2025-01-07T01:25:01Z

@avagin PTAL

This was referenced Dec 13, 2024

flaky tests: TestUsernsCheckpoint, TestCheckpoint opencontainers/runc#4273 (Open)

libct/int: retry Checkpoint for cgroup v1 opencontainers/runc#4486 (Closed)

rst0git reviewed Dec 14, 2024

criu/seize.c

rst0git reviewed Dec 14, 2024

criu/seize.c (outdated)

avagin reviewed Dec 15, 2024

criu/seize.c (outdated)

kolyshkin force-pushed the freeze-kludges branch from 2d3fb7b to 1f23a7e on December 16, 2024 00:17

kolyshkin force-pushed the freeze-kludges branch from 1f23a7e to 2e5b4b5 on December 16, 2024 00:22

avagin reviewed Dec 16, 2024

criu/seize.c (outdated)

kolyshkin force-pushed the freeze-kludges branch from 2e5b4b5 to 3bf115c on December 16, 2024 22:14

kolyshkin added 3 commits December 16, 2024 16:42

criu/seize.c: clang-format it

deda13b

Done using clang-format 19.1.5 with .clang-format obtained via scripts/fetch-clang-format.sh. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>

kolyshkin force-pushed the freeze-kludges branch from 3bf115c to 868e9fa on December 17, 2024 00:56

kolyshkin mentioned this pull request Dec 17, 2024

[test/DNM] Checking if criu cgroup v1 kludges help opencontainers/runc#4559 (Closed)

kolyshkin mentioned this pull request Dec 18, 2024

page-xfer error during TestUsernsCheckpoint in runc CI #2551 (Open)

avagin merged commit 7c66617 into checkpoint-restore:criu-dev Jan 7, 2025
36 of 42 checks passed
