Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Freeze fixes and v1 kludges #2545

Merged
merged 3 commits into from
Jan 7, 2025

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Dec 13, 2024

1. freeze_processes: fix logic

There are a few issues with the freeze_processes logic:

  1. Commit 9fae23f grossly (by 1000x) miscalculated the number of
    attempts required, as a result, we are seeing something like this:

(00.000340) freezing processes: 100000 attempts with 100 ms steps
(00.000351) freezer.state=THAWED
(00.000358) freezer.state=FREEZING
(00.100446) freezer.state=FREEZING
...close to 100 lines skipped...
(09.915110) freezer.state=FREEZING
(10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
(10.000563) freezer.state=FREEZING

For 10s with 100ms steps we only need 100 attempts, not 100000.

  1. When the timeout is hit, the "failed to freeze cgroup" error is not
    printed, and the log_unfrozen_stacks is not called either.

  2. The nanosleep at the last iteration is useless (this was hidden by
    issue 1 above, as the timeout was hit first).

Fix all these.

While at it,

  1. Amend the error message with the number of attempts, sleep duration,
    and timeout.

  2. Modify the "freezing cgroup" debug message to be in sync with the
    above error.

    Was:

    freezing processes: 100000 attempts with 100 ms steps

    Now:

    freezing cgroup some/name: 100 x 100ms attempts, timeout: 10s

2. freeze_processes: implement kludges for cgroup v1

Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.

In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).

When criu is used, it fails to freeze a cgroup from time to time
(see 1, 2). Let's try adding kludges similar to ones in runc.

Alas, I have absolutely no way to test this, so please review carefully.

criu/seize.c Show resolved Hide resolved
criu/seize.c Outdated Show resolved Hide resolved
criu/seize.c Outdated Show resolved Hide resolved
@kolyshkin
Copy link
Contributor Author

Addressed review comments; rebased.

criu/seize.c Outdated Show resolved Hide resolved
Done using clang-format 19.1.5 with .clang-format obtained via
scripts/fetch-clang-format.sh.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There are a few issues with the freeze_processes logic:

1. Commit 9fae23f grossly (by 1000x) miscalculated the number of
   attempts required, as a result, we are seeing something like this:

> (00.000340) freezing processes: 100000 attempts with 100 ms steps
> (00.000351) freezer.state=THAWED
> (00.000358) freezer.state=FREEZING
> (00.100446) freezer.state=FREEZING
> ...close to 100 lines skipped...
> (09.915110) freezer.state=FREEZING
> (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
> (10.000563) freezer.state=FREEZING

   For 10s with 100ms steps we only need 100 attempts, not 100000.

2. When the timeout is hit, the "failed to freeze cgroup" error is not
   printed, and the log_unfrozen_stacks is not called either.

3. The nanosleep at the last iteration is useless (this was hidden by
   issue 1 above, as the timeout was hit first).

Fix all these.

While at it,

4. Amend the error message with the number of attempts, sleep duration,
   and timeout.

5. Modify the "freezing cgroup" debug message to be in sync with the
   above error.

   Was:

   > freezing processes: 100000 attempts with 100 ms steps

   Now:

   > freezing cgroup some/name: 100 x 100ms attempts, timeout: 10s

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.

In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).

When criu is used, it fails to freeze a cgroup from time to time
(see [1], [2]). Let's try adding kludges similar to ones in runc.

Alas, I have absolutely no way to test this, so please review carefully.

[1]: opencontainers/runc#4273
[2]: opencontainers/runc#4457

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

Refactored to fix some more issues.

@kolyshkin
Copy link
Contributor Author

kolyshkin commented Dec 17, 2024

  1. Commit 9fae23f grossly (by 1000x) miscalculated the number of attempts required
    <...>
  2. When the timeout is hit, the "failed to freeze cgroup" error is not printed, and the log_unfrozen_stacks is not called either.

As this is now fixed, we may find additional issues with the log_unfrozen_stacks (which, I guess, was never called before due to nr_attempts miscalculation).

@kolyshkin
Copy link
Contributor Author

Testing this in opencontainers/runc#4559, no luck so far (can't reproduce the issue).

@kolyshkin
Copy link
Contributor Author

Testing this in opencontainers/runc#4559

I have concluded the testing, these kludges definitely help (i.e. can easily reproduce the issue without the fix, and can not reproduce it with the fix). See opencontainers/runc#4559 for more details.

@kolyshkin
Copy link
Contributor Author

@avagin PTAL

@avagin avagin merged commit 7c66617 into checkpoint-restore:criu-dev Jan 7, 2025
36 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants