AMDGPU: add parallel restore of BO content to accelerate restore #2527

Open · wweewrwer wants to merge 7 commits into criu-dev from parallel_restore
Conversation

@wweewrwer

TL;DR:

This pull request extends CRIU to restore AMDGPU buffer object content in parallel with other restore operations, accelerating the overall restoration.

The target issue:

In the current restore procedure for AMDGPU applications, the content of AMDGPU buffer objects (BOs) is restored synchronously in CR_PLUGIN_HOOK__RESTORE_EXT_FILE. This procedure usually takes a significant amount of time, and during this time the target process cannot perform any other restore operations. However, this restoration has no logical dependencies on other restore operations. Parallelizing this part with other restore operations can speed up the restoration.

The parallel restore approach in this PR:

The core idea of this patch series is to offload the restore of the BO content from the target process to the main CRIU process (the main CRIU process refers to the parent process, and the target process refers to the child process created during the fork). To achieve this, we introduce a new hook, CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS, in the main CRIU process. For the AMDGPU plugin, the target process no longer restores BO contents in CR_PLUGIN_HOOK__RESTORE_EXT_FILE and instead sends the relevant BOs to the main CRIU process. The main CRIU process receives the corresponding BOs in CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS and begins the restoration. Meanwhile, the target process can continue with other parts of the restoration without being blocked by the BO content restoration. The full design is described in the ACM SoCC'24 paper: On-demand and Parallel Checkpoint/Restore for GPU Applications.
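For orientation, CRIU plugins attach callbacks to hooks via CR_PLUGIN_REGISTER_HOOK. A minimal sketch of the split described above might look as follows; parallel_sock and the open/send/recv/restore helpers are illustrative placeholders, not the exact patch code:

#include "criu-plugin.h"

/* Target process: recreate the device state as before, but describe the
 * BOs to the main CRIU process instead of copying their contents here. */
int amdgpu_plugin_restore_file(int id)
{
	int fd = open_kfd_device();			/* illustrative helper */

	if (send_bo_restore_cmd(parallel_sock, id))	/* illustrative helper */
		return -1;
	return fd;	/* BO contents are restored by the main process */
}
CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESTORE_EXT_FILE, amdgpu_plugin_restore_file)

/* Main CRIU process: receive the descriptions and restore BO contents
 * while the child continues with the rest of its restore. */
int amdgpu_plugin_restore_async(void)
{
	int id;

	while (recv_bo_restore_cmd(parallel_sock, &id) > 0)	/* illustrative */
		restore_bo_contents(id);	/* read image, write to the GPU */
	return 0;
}
CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS, amdgpu_plugin_restore_async)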

Tests:

We evaluated the performance with the following settings. The results show that parallel restore reduces restore time by 34.3% when the images are cached in the page cache, and by 7.6% when restoring from disk.

Results:

| | From disk | From page cache |
| --- | --- | --- |
| Sequential restore | 1728 ms | 254 ms |
| Parallel restore | 1596 ms | 167 ms |
| Speed-up | 7.6% | 34.3% |

Settings:

CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz

Memory: DDR4, 2x8GB

GPU: AMD MI50

Disk: 512GB, Samsung SSD 860

Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1

Example program:

example.py: a ResNet18 application. Enter 'y' to exit; any other input runs one inference pass.

import time
import sys
import torch
import torchvision.models as models
import torchvision.transforms as transforms

torch.set_grad_enabled(False)  # inference only

device = "cuda:0"  # ROCm exposes AMD GPUs through the CUDA device API

# Load a pretrained ResNet18 and move it to the GPU
model = models.resnet18(weights='DEFAULT')
model = model.to(device)
model.eval()

# Build a random, normalized 1x3x224x224 input batch
batch_size = 1
channels = 3
height = 224
width = 224
input_tensor = torch.randn(batch_size, channels, height, width)
preprocess = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
input_tensor = preprocess(input_tensor)

# Enter 'y' to exit; any other input runs one timed inference
while input() != "y":
    st = time.time()
    input_tensor = input_tensor.to(device)
    output = model(input_tensor)
    output = output.to("cpu")
    _, predicted_idx = torch.max(output, 1)
    torch.cuda.synchronize()
    ed = time.time()
    print("test time:", ed - st)
    sys.stdout.flush()

Steps:

  1. Install CRIU

    Follow the standard CRIU installation process. Ensure the environment variable CRIU_LIBS_DIR points to the plugins/amdgpu path (e.g., `export CRIU_LIBS_DIR=/path/to/criu/plugins/amdgpu`) so that CRIU can find the AMDGPU plugin.

  2. Dump checkpoint image

    # In one shell
    python3 example.py
    # In another shell
    mkdir -p /tmp/criu-dump
    criu dump -t $(pgrep python3) -D /tmp/criu-dump -j --file-locks
    
  3. Restore from disk

    Test for sequential restore:

    # Clear the page cache
    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
    criu restore -D /tmp/criu-dump -j --file-locks
    cat /tmp/criu-dump/stats-restore | crit decode --pretty | grep restore_time
    

    Test for parallel restore:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat /tmp/criu-dump/stats-restore | crit decode --pretty | grep restore_time
    
  4. Restore from page cache

    Install vmtouch for caching images:

    sudo apt install vmtouch
    

    Test:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
    # Cache the image in memory
    vmtouch -l /tmp/criu-dump
    # Warm up the environment
    criu restore -D /tmp/criu-dump -j --file-locks
    # Begin the test
    criu restore -D /tmp/criu-dump -j --file-locks
    cat /tmp/criu-dump/stats-restore | crit decode --pretty | grep restore_time
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat /tmp/criu-dump/stats-restore | crit decode --pretty | grep restore_time
    

@Ddnirvana

Thanks for the above comments @avagin @rst0git; we are fixing and polishing the PR. Will update ASAP.

@rst0git (Member) commented Nov 25, 2024

@Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes:

@Ddnirvana

> @Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes:

@rst0git No problem. We will add proper description in the next version.

@dayatsin-amd (Contributor) left a comment

Thank you @wweewrwer. Some minor nit picks, but overall the code looks good to me.

@wweewrwer (Author)

@rst0git @avagin @dayatsin-amd Hi maintainers, thanks for your prior reviews and comments. We have fixed all the issues, as follows:

  1. Use the proper APIs to allocate memory (xmalloc, etc.)
  2. Enable the optimizations by default
  3. Change the name of the hook
  4. Fix the issues with running in Podman containers
  5. Other fixes (line width, comments, etc.)
  6. Add descriptions to the README to explain the optimizations

Please let us know if you have any further comments

@dayatsin-amd (Contributor) left a comment

Thank you @wweewrwer

@rst0git (Member) commented Nov 28, 2024

@wweewrwer Would you be able to merge the fixup commits into the previous commits using git rebase?
https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md#submit-your-work-upstream

@wweewrwer (Author) commented Nov 29, 2024

> @wweewrwer Would you be able to merge the fixup commits into the previous commits using git rebase? https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md#submit-your-work-upstream

@rst0git Thanks for your comment! I have merged the fixup commits into the previous commits using git rebase. Please let me know if you have any further comments.

@wweewrwer force-pushed the parallel_restore branch 2 times, most recently from cb6b91d to 37e3813 on December 5, 2024
@wweewrwer (Author)

@rst0git @avagin
Dear maintainers,

We have pushed the V4 version of the PR, addressing all the issues raised since the last version. Specifically, we: (1) support multiple commands (from a single process), (2) support restoring multiple processes, and (3) fix the other minor issues mentioned.

Details:

  • Replaced UDP with TCP to distinguish messages between different processes and commands.
  • Multiple-command support: Instead of receiving a command only once, the hook function now launches a dedicated thread that receives commands indefinitely until all tasks finish their restore stage (see the sketch after this list). The main thread in this hook uses restore_wait_inprogress_tasks to determine when the tasks have finished; once they have, it sends an exit command to the parallel restore thread to stop receiving commands.
  • Multi-process support: Multiple processes are restored in parallel (as separate processes) by default, so they would not benefit from the parallel optimization. Therefore, we introduce a flag (called parallel_disabled) that enables the optimization only for the single-process case (the common case) as a fast path, and falls back to the original restore otherwise.
  • Multi-GPU parallel restore support: In the original restore, when a process has multiple GPUs, the content on each GPU is restored in parallel. This version supports multi-GPU parallel restore by reusing that original design.
  • Other issues: Big thanks to Andrei and Radostin for the other issues and suggestions, all of which have been fixed accordingly.
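A minimal sketch of the receiver loop described in the second item, assuming a fixed-size command record and an explicit exit command (all names here are illustrative, not the patch's actual definitions):

#include <stdint.h>
#include <unistd.h>

enum { PARALLEL_CMD_RESTORE_BO, PARALLEL_CMD_EXIT };	/* illustrative */

struct parallel_restore_cmd {
	uint32_t type;	/* PARALLEL_CMD_* */
	uint64_t arg;	/* command-specific payload */
};

/* Runs in the main CRIU process; handles commands from restored tasks
 * until the main thread (after restore_wait_inprogress_tasks()) sends
 * PARALLEL_CMD_EXIT over the same socket. */
static void *parallel_receiver(void *arg)
{
	int sk = *(int *)arg;
	struct parallel_restore_cmd cmd;

	while (read(sk, &cmd, sizeof(cmd)) == sizeof(cmd)) {
		if (cmd.type == PARALLEL_CMD_EXIT)
			break;
		handle_restore_cmd(&cmd);	/* illustrative helper */
	}
	return NULL;
}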

We have performed all the tests with the above changes. The PR still brings a 31% decrease in restore latency in the single-process case, and achieves the same results as before in multi-process scenarios.

Please let me know if you have any further comments.

@wweewrwer (Author)

@rst0git @avagin Just a friendly reminder about the updates in this PR (in case the prior notifications were missed).

@wweewrwer force-pushed the parallel_restore branch 2 times, most recently from 1f48cd3 to c7ca1b3 on December 16, 2024
@avagin (Member) commented Dec 16, 2024

Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore the BOs asynchronously in the context of their process. In this case, two BOs would be restored concurrently.

@wweewrwer (Author)

> Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore the BOs asynchronously in the context of their process. In this case, two BOs would be restored concurrently.

Yes, we have investigated the approach of forking a thread in the background, but it does not work, as it conflicts with the restore logic of CRIU.

Specifically, when CRIU tries to restore its memory state, it unmaps all old mappings. However, some mappings may be needed by the background thread to restore BOs. Therefore, a thread can only run in parallel with shorter procedures (up to the point of entering the restorer blob), whereas offloading the BO content restore to another process (as in this PR) can run in parallel with almost the entire restore procedure.

The figure below shows why the BO restore must finish before the CPU memory-state restore:

[figure: restore timeline]

@wweewrwer (Author)

@rst0git @avagin Dear maintainers/reviewers, we would just like to know whether there are any further issues or concerns about the latest version.

@dayatsin-amd (Contributor)

I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression?

Thank you for this patch!

@wweewrwer (Author)

> I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression?
>
> Thank you for this patch!

Sure. Thanks!

@Ddnirvana

> I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression?
>
> Thank you for this patch!

Dear David @dayatsin-amd, just wondering whether there is any progress or results from the internal regression test. Thank you again for the assistance, and happy new year btw :)

@avagin closed this Jan 7, 2025
@avagin reopened this Jan 7, 2025
@avagin (Member) commented Jan 7, 2025

> Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore the BOs asynchronously in the context of their process. In this case, two BOs would be restored concurrently.
>
> Yes, we have investigated the approach of forking a thread in the background, but it does not work, as it conflicts with the restore logic of CRIU.
>
> Specifically, when CRIU tries to restore its memory state, it unmaps all old mappings. However, some mappings may be needed by the background thread to restore BOs. Therefore, a thread can only run in parallel with shorter procedures (up to the point of entering the restorer blob), whereas offloading the BO content restore to another process (as in this PR) can run in parallel with almost the entire restore procedure.

Everything that happens in the restore blob should be fast. All mappings are restored before switching into the restore blob; there, the restored mappings are just remapped to their proper addresses. I am still not convinced that restoring buffer objects from the main process is really what we need here. I may be missing something, but I want to see a clear explanation, with numbers, of why the proposed solution is valuable.

Additionally, I see two potential issues:

  • Sequential Restoration: This change seems to introduce a new bottleneck by restoring buffer objects sequentially. Could this cause performance problems for workloads with many buffer objects across multiple processes? It would be helpful to understand how this approach scales.
  • Plugin Hook Execution: Running the plugin hook in the main CRIU process for an extended period and making it dependent on other processes is problematic. This deviates from the expectation that multiple plugins should operate independently with equal capabilities.

@wweewrwer (Author)


Thank you for your comment.

> Everything that happens in the restore blob should be fast. All mappings are restored before switching into the restore blob; there, the restored mappings are just remapped to their proper addresses.

In the case of GPU apps (AMD GPUs in our case), not everything in the restore blob is fast, and not all mappings are restored before switching into the restore blob. Some mappings are still restored within the restore blob (e.g., criu/pie/restorer.c:1807), and these operations take a significant amount of time.

In our tests, the restore blob even takes longer than the GPU buffer-object restore (this can easily be reproduced using the same test application and approach as in this pull request). Here is the breakdown of our test:

[figure: restore-time breakdown]

This is the motivation for the PR; we believe parallelizing the restoration of GPU buffer objects with other CRIU restore operations is valuable for reducing restore latency.

> It would be helpful to understand how this approach scales.

This PR does not introduce a sequential-restoration issue. The optimization targets the single-process situation (the common case), as shown in the following table. In other scenarios, it falls back to the original method. This is achieved with the new parallel_disabled flag.

| | Our method | Original method |
| --- | --- | --- |
| Single process | ✓ | |
| Multiple processes | | ✓ |

And for a single process, its restore logic is the same as the original method except that it is offloaded to the main CRIU process.

> Running the plugin hook in the main CRIU process for an extended period and making it dependent on other processes is problematic.

It does not introduce a new dependency: the main CRIU process already must wait for the other processes to complete their memory-state restoration before proceeding with additional restoration logic. Besides, most of the code lives in separate, AMD-GPU-specific files and will not break anything in the main path.

Please let me know if you have any further comments.

@avagin (Member) commented Jan 13, 2025

> In the case of GPU apps (AMD GPUs in our case), not everything in the restore blob is fast, and not all mappings are restored before switching into the restore blob. Some mappings are still restored within the restore blob (e.g., criu/pie/restorer.c:1807), and these operations take a significant amount of time.

I forgot that we moved the restore of anon VMAs into the restorer (91388fc). Now I understand the problem. Thanks.

> It does not introduce a new dependency: the main CRIU process already must wait for the other processes to complete their memory-state restoration before proceeding with additional restoration logic. Besides, most of the code lives in separate, AMD-GPU-specific files and will not break anything in the main path.

I think you misunderstood me here; I don't mean code dependencies. The problem is that the amdgpu plugin completely occupies the main CRIU process. It means that if there is ever another plugin with a similar post-fork hook, these plugins will not work together. If you want to introduce a hook handling events from child processes, it should run in the context of a separate thread.

P.S. I like all these detailed comments, but why have none of these details been explained in the commit messages? It would speed up the review process, and in the future it would help others understand the ideas behind the implementation.

@wweewrwer (Author)

Hi Andrei,

> I think you misunderstood me here; I don't mean code dependencies. The problem is that the amdgpu plugin completely occupies the main CRIU process. It means that if there is ever another plugin with a similar post-fork hook, these plugins will not work together. If you want to introduce a hook handling events from child processes, it should run in the context of a separate thread.

Thanks for the clarification! We understand the concern now.

We can launch a separate thread in the POST_FORKING hook (without blocking) and wait for it in a later hook, RESUME_DEVICES_LATE, so that neither the main CRIU process nor other plugins are held up in POST_FORKING.

[figure: hook/thread timeline]

Specifically, to avoid occupying the main CRIU process, we only launch a thread in POST_FORKING without waiting for it to complete. Instead, we wait for this thread in the later hook, RESUME_DEVICES_LATE. By the time RESUME_DEVICES_LATE runs, every child process has completed its restore logic, so no new commands will be sent to the thread. Therefore, we can immediately send a stop command to the thread and wait for it to finish processing and terminate.

This design won't add extra waiting time to the main CRIU process: even when GPU data is restored in the child processes, the main CRIU process must wait for the children to finish before calling the RESUME_DEVICES_LATE hook.

This is not a big change, but it makes the whole design cleaner and avoids the occupation issue.
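A minimal sketch of this hook pairing, with assumed names (parallel_receiver, send_exit_cmd, parallel_sock); note that RESUME_DEVICES_LATE is invoked per task, so a real implementation must stop and join the thread only once:

#include <pthread.h>

static pthread_t parallel_thread;
static int parallel_sock;	/* set up earlier, e.g. fetched via fdstore */

/* POST_FORKING: spawn the receiver and return immediately, so neither
 * the main CRIU process nor other plugins are blocked. */
int amdgpu_plugin_post_forking(void)
{
	return pthread_create(&parallel_thread, NULL,
			      parallel_receiver, &parallel_sock);
}

/* RESUME_DEVICES_LATE: children have finished their restore logic, so
 * no new commands can arrive; tell the thread to stop and join it. */
int amdgpu_plugin_resume_devices_late(int target_pid)
{
	if (send_exit_cmd(parallel_sock))	/* illustrative helper */
		return -1;
	return pthread_join(parallel_thread, NULL);
}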

We are making the change and will post updates once done. Please feel free to let us know if you have any further comments/suggestions on this.

> P.S. I like all these detailed comments, but why have none of these details been explained in the commit messages? It would speed up the review process, and in the future it would help others understand the ideas behind the implementation.

Big thanks for the suggestion. We will add the key information from the discussion to the commit messages in the next version, and will also follow this principle in future contributions :)

Currently, in the target process, device-related restore operations and
other restore operations run almost entirely sequentially. While the
target process executes the corresponding CRIU hook functions, it cannot
perform other restore operations. However, for GPU applications, some
device restore operations have no logical dependencies on other common
restore operations and can be parallelized with them to speed up the
process.

Instead of launching a thread in child processes for parallelization,
this patch chooses to add a new hook, `POST_FORKING`, in the main CRIU
process to handle these restore operations. This is because the
restoration of memory state in the restore blob is one of the most
time-consuming parts of all restore logic. The main CRIU process can
easily parallelize these operations, whereas parallelizing in threads
within child processes is challenging.

- POST_FORKING

*POST_FORKING: Hook to enable the main CRIU process to perform some
plugin restore operations.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not yet
initialized. However, during the plugin restore procedure, some file
descriptors may need to be shared across multiple hooks. This patch
moves `cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init`
to use `fdstore` to stash such descriptors.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
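A minimal sketch of the described reordering in the restore path (assumed shape of the call sites; the actual change lives in CRIU's restore setup code):

/* Before this patch, cr_plugin_init() ran while fdstore was unusable. */
if (fdstore_init())
	return -1;
/* After the move, plugin init may stash fds in fdstore for later hooks. */
if (cr_plugin_init(CR_PLUGIN_STAGE__RESTORE))
	return -1;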
Currently, parallel restore only focuses on the single-process
situation. Therefore, it needs an interface to know whether there is
only one process to restore. This patch adds a `has_children` function
to `pstree.h` and replaces some existing open-coded checks with this
function.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
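One plausible shape for the helper, assuming CRIU's struct pstree_item keeps its children on a list_head (a sketch, not the exact patch):

#include <stdbool.h>

/* pstree.h: true if the task has at least one child in the process tree. */
static inline bool has_children(struct pstree_item *item)
{
	return !list_empty(&item->children);
}

/* Illustrative use: take the parallel fast path only for a lone task. */
if (has_children(root_item))
	parallel_disabled = true;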
When parallel restore is enabled, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain stream (TCP-style) socket and
stores it in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
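A sketch of one way to set this up, assuming a socketpair()-based stream channel whose task-side end is parked in fdstore (fdstore_add()/fdstore_get() are existing CRIU helpers; the patch may instead bind and listen on a named socket):

#include <sys/socket.h>
#include <unistd.h>

static int parallel_sk_id;	/* fdstore id visible to restored tasks */

int parallel_socket_init(void)
{
	int sk[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sk))
		return -1;

	/* Tasks later fetch their end with fdstore_get(parallel_sk_id). */
	parallel_sk_id = fdstore_add(sk[1]);
	close(sk[1]);	/* fdstore keeps its own reference */
	if (parallel_sk_id < 0) {
		close(sk[0]);
		return -1;
	}
	return sk[0];	/* main-process end, handed to the receiver thread */
}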
Currently, restoring buffer object content consumes a significant amount
of time. However, this part has no logical dependencies on other restore
operations. This patch introduces structures and helper functions for
the target process to offload this task to the main CRIU process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
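An illustrative shape for the offload command and its sender; the field and function names are assumptions, not the patch's actual definitions:

#include <stdint.h>
#include <unistd.h>

/* Enough for the main CRIU process to locate the BO content in the
 * checkpoint image and the device/BO it must be written back to. */
struct parallel_restore_bo_cmd {
	uint32_t gpu_id;	/* target device */
	uint64_t handle;	/* BO identifier within the image */
	uint64_t offset;	/* offset of the content in the image file */
	uint64_t size;		/* bytes of content to restore */
};

/* Target process: queue one BO for the main CRIU process to restore. */
static int send_bo_cmd(int sk, const struct parallel_restore_bo_cmd *cmd)
{
	return write(sk, cmd, sizeof(*cmd)) == sizeof(*cmd) ? 0 : -1;
}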
This patch implements the entire logic to enable the offloading of
buffer object content restoration.

The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.

It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.

This optimization only targets the single-process situation (the common
case). In other scenarios, it falls back to the original method. This is
achieved with the new `parallel_disabled` flag.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
@wweewrwer (Author)

The previous suggestions have been addressed, and both the regression test and the related tests have been completed. Please let us know if you have any further comments. :)
