AMDGPU: add parallel restore of BO content to accelerate restore #2527

Open · wweewrwer wants to merge 7 commits into criu-dev from parallel_restore
Conversation

@wweewrwer

TL;DR:

This pull request extends CRIU to restore AMDGPU buffer object content in parallel with other restore operations, accelerating the overall restoration.

The target issue:

In the current restore procedure for AMDGPU applications, the content of AMDGPU buffer objects (BOs) is restored synchronously in CR_PLUGIN_HOOK__RESTORE_EXT_FILE. This procedure usually takes a significant amount of time, and during this time the target process cannot perform any other restore operations. However, this restoration has no logical dependencies on other restore operations. Parallelizing this part with other restore operations can speed up the restoration.

The parallel restore approach in this PR:

The core idea of this patch series is to offload the restore of the BO content from the target process to the main CRIU process (the main CRIU process refers to the parent process, and the target process refers to the child process created during the fork). To achieve this, we introduce a new hook, CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS, in the main CRIU process. For the AMDGPU plugin, the target process no longer restores BO contents in CR_PLUGIN_HOOK__RESTORE_EXT_FILE and instead sends the relevant BOs to the main CRIU process. The main CRIU process receives the corresponding BOs in CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS and begins the restoration. Meanwhile, the target process can continue with other parts of the restoration without being blocked by the BO content restoration. The full design is described in the ACM SoCC'24 paper: On-demand and Parallel Checkpoint/Restore for GPU Applications.
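For orientation, CRIU plugins attach callbacks to hooks via CR_PLUGIN_REGISTER_HOOK. A minimal sketch of the split described above might look as follows; parallel_sock and the open/send/recv/restore helpers are illustrative placeholders, not the exact patch code:

#include "criu-plugin.h"

/* Target process: recreate the device state as before, but describe the
 * BOs to the main CRIU process instead of copying their contents here. */
int amdgpu_plugin_restore_file(int id)
{
	int fd = open_kfd_device();			/* illustrative helper */

	if (send_bo_restore_cmd(parallel_sock, id))	/* illustrative helper */
		return -1;
	return fd;	/* BO contents are restored by the main process */
}
CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESTORE_EXT_FILE, amdgpu_plugin_restore_file)

/* Main CRIU process: receive the descriptions and restore BO contents
 * while the child continues with the rest of its restore. */
int amdgpu_plugin_restore_async(void)
{
	int id;

	while (recv_bo_restore_cmd(parallel_sock, &id) > 0)	/* illustrative */
		restore_bo_contents(id);	/* read image, write to the GPU */
	return 0;
}
CR_PLUGIN_REGISTER_HOOK(CR_PLUGIN_HOOK__RESTORE_ASYNCHRONOUS, amdgpu_plugin_restore_async)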

Tests:

We evaluated the performance with the following settings. The results show that parallel restore reduces restore time by 34.3% when the images are cached in the page cache, and by 7.6% when restoring from disk.

Results:

| | From disk | From page cache |
| --- | --- | --- |
| Sequential restore | 1728 ms | 254 ms |
| Parallel restore | 1596 ms | 167 ms |
| Speed-up | 7.6% | 34.3% |

Settings:

CPU: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz

Memory: DDR4, 2x8GB

GPU: AMD MI50

Disk: 512GB, Samsung SSD 860

Docker image: rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_1.12.1

Example program:

example.py: a ResNet18 application. Enter 'y' to exit; any other input runs one inference pass.

import time
import sys
import torch
import torchvision.models as models
import torchvision.transforms as transforms

torch.set_grad_enabled(False)  # inference only

device = "cuda:0"  # ROCm exposes AMD GPUs through the CUDA device API

# Load a pretrained ResNet18 and move it to the GPU
model = models.resnet18(weights='DEFAULT')
model = model.to(device)
model.eval()

# Build a random, normalized 1x3x224x224 input batch
batch_size = 1
channels = 3
height = 224
width = 224
input_tensor = torch.randn(batch_size, channels, height, width)
preprocess = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
input_tensor = preprocess(input_tensor)

# Enter 'y' to exit; any other input runs one timed inference
while input() != "y":
    st = time.time()
    input_tensor = input_tensor.to(device)
    output = model(input_tensor)
    output = output.to("cpu")
    _, predicted_idx = torch.max(output, 1)
    torch.cuda.synchronize()
    ed = time.time()
    print("test time:", ed - st)
    sys.stdout.flush()

Steps:

  1. Install CRIU

    Follow the standard CRIU installation process. Ensure the environment variable CRIU_LIBS_DIR points to the plugins/amdgpu path (e.g., `export CRIU_LIBS_DIR=/path/to/criu/plugins/amdgpu`) so that CRIU can find the AMDGPU plugin.

  2. Dump checkpoint image

    # In one shell
    python3 example.py
    # In another shell
    mkdir -p /tmp/criu-dump
    criu dump -t $(pgrep python3) -D /tmp/criu-dump -j --file-locks
    
  3. Restore from disk

    Test for sequential restore:

    # Clear the page cache
    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
    criu restore -D /tmp/criu-dump -j --file-locks
    cat /tmp/criu-dump/stats-restore | crit decode --pretty | grep restore_time
    

    Test for parallel restore:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat /tmp/criu-dump/stats-restore | crit decode --pretty | grep restore_time
    
  4. Restore from page cache

    Install vmtouch for caching images:

    sudo apt install vmtouch
    

    Test:

    sync; sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
    # Cache the image in memory
    vmtouch -l /tmp/criu-dump
    # Warm up the environment
    criu restore -D /tmp/criu-dump -j --file-locks
    # Begin the test
    criu restore -D /tmp/criu-dump -j --file-locks
    cat /tmp/criu-dump/stats-restore | crit decode --pretty | grep restore_time
    criu restore -D /tmp/criu-dump -j --file-locks --parallel
    cat /tmp/criu-dump/stats-restore | crit decode --pretty | grep restore_time
    

@Ddnirvana

Thanks for the above comments @avagin @rst0git; we are fixing and polishing the PR. Will update ASAP.

@rst0git (Member) commented Nov 25, 2024

@Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes:

@Ddnirvana

> @Ddnirvana @wweewrwer Thank you for your contributions! It might be good to also update the content of the following files to reflect these changes:

@rst0git No problem. We will add proper description in the next version.

@dayatsin-amd (Contributor) left a comment

Thank you @wweewrwer. Some minor nit picks, but overall the code looks good to me.

@wweewrwer (Author)

@rst0git @avagin @dayatsin-amd Hi maintainers, thanks for your prior reviews and comments. We have fixed all the issues, as follows:

  1. Use the proper APIs to allocate memory (xmalloc, etc.)
  2. Enable the optimizations by default
  3. Change the name of the hook
  4. Fix the issues with running in Podman containers
  5. Other fixes (line width, comments, etc.)
  6. Add descriptions to the README to explain the optimizations

Please let us know if you have any further comments

@dayatsin-amd (Contributor) left a comment

Thank you @wweewrwer

@rst0git (Member) commented Nov 28, 2024

@wweewrwer Would you be able to merge the fixup commits into the previous commits using git rebase?
https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md#submit-your-work-upstream

@wweewrwer (Author) commented Nov 29, 2024

> @wweewrwer Would you be able to merge the fixup commits into the previous commits using git rebase? https://github.com/checkpoint-restore/criu/blob/criu-dev/CONTRIBUTING.md#submit-your-work-upstream

@rst0git Thanks for your comment! I have merged the fixup commits into the previous commits using git rebase. Please let me know if you have any further comments.

@wweewrwer force-pushed the parallel_restore branch 2 times, most recently from cb6b91d to 37e3813 on December 5, 2024
@wweewrwer (Author)

@rst0git @avagin
Dear maintainers,

We have pushed the V4 version of the PR, addressing all the issues raised since the last version. Specifically, we: (1) support multiple commands (from a single process), (2) support restoring multiple processes, and (3) fix the other minor issues mentioned.

Details:

  • Replaced UDP with TCP to distinguish messages between different processes and commands.
  • Multiple-command support: Instead of receiving a command only once, the hook function now launches a dedicated thread that receives commands indefinitely until all tasks finish their restore stage (see the sketch after this list). The main thread in this hook uses restore_wait_inprogress_tasks to determine when the tasks have finished; once they have, it sends an exit command to the parallel restore thread to stop receiving commands.
  • Multi-process support: Multiple processes are restored in parallel (as separate processes) by default, so they would not benefit from the parallel optimization. Therefore, we introduce a flag (called parallel_disabled) that enables the optimization only for the single-process case (the common case) as a fast path, and falls back to the original restore otherwise.
  • Multi-GPU parallel restore support: In the original restore, when a process has multiple GPUs, the content on each GPU is restored in parallel. This version supports multi-GPU parallel restore by reusing that original design.
  • Other issues: Big thanks to Andrei and Radostin for the other issues and suggestions, all of which have been fixed accordingly.
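A minimal sketch of the receiver loop described in the second item, assuming a fixed-size command record and an explicit exit command (all names here are illustrative, not the patch's actual definitions):

#include <stdint.h>
#include <unistd.h>

enum { PARALLEL_CMD_RESTORE_BO, PARALLEL_CMD_EXIT };	/* illustrative */

struct parallel_restore_cmd {
	uint32_t type;	/* PARALLEL_CMD_* */
	uint64_t arg;	/* command-specific payload */
};

/* Runs in the main CRIU process; handles commands from restored tasks
 * until the main thread (after restore_wait_inprogress_tasks()) sends
 * PARALLEL_CMD_EXIT over the same socket. */
static void *parallel_receiver(void *arg)
{
	int sk = *(int *)arg;
	struct parallel_restore_cmd cmd;

	while (read(sk, &cmd, sizeof(cmd)) == sizeof(cmd)) {
		if (cmd.type == PARALLEL_CMD_EXIT)
			break;
		handle_restore_cmd(&cmd);	/* illustrative helper */
	}
	return NULL;
}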

We have performed all the tests with the above changes. The PR still brings a 31% decrease in restore latency in the single-process case, and achieves the same results as before in multi-process scenarios.

Please let me know if you have any further comments.

@wweewrwer (Author)

@rst0git @avagin Just a friendly reminder about the updates in this PR (in case the prior notifications were missed).

@wweewrwer force-pushed the parallel_restore branch 2 times, most recently from 1f48cd3 to c7ca1b3 on December 16, 2024
@avagin (Member) commented Dec 16, 2024

Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore the BOs asynchronously in the context of their process. In this case, two BOs would be restored concurrently.

@wweewrwer (Author)

> Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore the BOs asynchronously in the context of their process. In this case, two BOs would be restored concurrently.

Yes, we have investigated the approach of forking a thread in the background, but it does not work, as it conflicts with the restore logic of CRIU.

Specifically, when CRIU tries to restore its memory state, it unmaps all old mappings. However, some mappings may be needed by the background thread to restore BOs. Therefore, a thread can only run in parallel with shorter procedures (up to the point of entering the restorer blob), whereas offloading the BO content restore to another process (as in this PR) can run in parallel with almost the entire restore procedure.

The figure below shows why the BO restore must finish before the CPU memory-state restore:

[figure: restore timeline]

@wweewrwer (Author)

@rst0git @avagin Dear maintainers/reviewers, we would just like to know whether there are any further issues or concerns about the latest version.

@dayatsin-amd (Contributor)

I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression?

Thank you for this patch!

@wweewrwer (Author)

> I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression?
>
> Thank you for this patch!

Sure. Thanks!

@Ddnirvana

> I have requested this PR to be validated on a multi-GPU set-up internally at AMD. Can you give us a few days to confirm there is no regression?
>
> Thank you for this patch!

Dear David @dayatsin-amd, just wondering whether there is any progress or results from the internal regression test. Thank you again for the assistance, and happy new year btw :)

@avagin closed this Jan 7, 2025
@avagin reopened this Jan 7, 2025
@avagin (Member) commented Jan 7, 2025

> Have you investigated other approaches to restoring BOs in parallel? For example, it is possible to fork a thread and restore the BOs asynchronously in the context of their process. In this case, two BOs would be restored concurrently.
>
> Yes, we have investigated the approach of forking a thread in the background, but it does not work, as it conflicts with the restore logic of CRIU.
>
> Specifically, when CRIU tries to restore its memory state, it unmaps all old mappings. However, some mappings may be needed by the background thread to restore BOs. Therefore, a thread can only run in parallel with shorter procedures (up to the point of entering the restorer blob), whereas offloading the BO content restore to another process (as in this PR) can run in parallel with almost the entire restore procedure.

Everything that happens in the restore blob should be fast. All mappings are restored before switching into the restore blob; there, the restored mappings are just remapped to their proper addresses. I am still not convinced that restoring buffer objects from the main process is really what we need here. I may be missing something, but I want to see a clear explanation, with numbers, of why the proposed solution is valuable.

Additionally, I see two potential issues:

  • Sequential Restoration: This change seems to introduce a new bottleneck by restoring buffer objects sequentially. Could this cause performance problems for workloads with many buffer objects across multiple processes? It would be helpful to understand how this approach scales.
  • Plugin Hook Execution: Running the plugin hook in the main CRIU process for an extended period and making it dependent on other processes is problematic. This deviates from the expectation that multiple plugins should operate independently with equal capabilities.

@wweewrwer (Author)


Thank you for your comment.

> Everything that happens in the restore blob should be fast. All mappings are restored before switching into the restore blob; there, the restored mappings are just remapped to their proper addresses.

In the case of GPU apps (AMD GPUs in our case), not everything in the restore blob is fast, and not all mappings are restored before switching into the restore blob. Some mappings are still restored within the restore blob (e.g., criu/pie/restorer.c:1807), and these operations take a significant amount of time.

In our tests, the restore blob even takes longer than the GPU buffer-object restore (this can easily be reproduced using the same test application and approach as in this pull request). Here is the breakdown of our test:

[figure: restore-time breakdown]

This is the motivation for the PR; we believe parallelizing the restoration of GPU buffer objects with other CRIU restore operations is valuable for reducing restore latency.

> It would be helpful to understand how this approach scales.

This PR does not introduce a sequential-restoration issue. The optimization targets the single-process situation (the common case), as shown in the following table. In other scenarios, it falls back to the original method. This is achieved with the new parallel_disabled flag.

| | Our method | Original method |
| --- | --- | --- |
| Single process | ✓ | |
| Multiple processes | | ✓ |

And for a single process, its restore logic is the same as the original method except that it is offloaded to the main CRIU process.

> Running the plugin hook in the main CRIU process for an extended period and making it dependent on other processes is problematic.

It does not introduce a new dependency: the main CRIU process already must wait for the other processes to complete their memory-state restoration before proceeding with additional restoration logic. Besides, most of the code lives in separate, AMD-GPU-specific files and will not break anything in the main path.

Please let me know if you have any further comments.

@avagin (Member) commented Jan 13, 2025

> In the case of GPU apps (AMD GPUs in our case), not everything in the restore blob is fast, and not all mappings are restored before switching into the restore blob. Some mappings are still restored within the restore blob (e.g., criu/pie/restorer.c:1807), and these operations take a significant amount of time.

I forgot that we moved the restore of anon VMAs into the restorer (91388fc). Now I understand the problem. Thanks.

> It does not introduce a new dependency: the main CRIU process already must wait for the other processes to complete their memory-state restoration before proceeding with additional restoration logic. Besides, most of the code lives in separate, AMD-GPU-specific files and will not break anything in the main path.

I think you misunderstood me here; I don't mean code dependencies. The problem is that the amdgpu plugin completely occupies the main CRIU process. It means that if there is ever another plugin with a similar post-fork hook, these plugins will not work together. If you want to introduce a hook handling events from child processes, it should run in the context of a separate thread.

P.S. I like all these detailed comments, but why have none of these details been explained in the commit messages? It would speed up the review process, and in the future it would help others understand the ideas behind the implementation.

@wweewrwer (Author)

Hi Andrei,

> I think you misunderstood me here; I don't mean code dependencies. The problem is that the amdgpu plugin completely occupies the main CRIU process. It means that if there is ever another plugin with a similar post-fork hook, these plugins will not work together. If you want to introduce a hook handling events from child processes, it should run in the context of a separate thread.

Thanks for the clarification! We understand the concern now.

We can launch a separate thread in the POST_FORKING hook (without blocking) and wait for it in a later hook, RESUME_DEVICES_LATE, so that neither the main CRIU process nor other plugins are held up in POST_FORKING.

[figure: hook/thread timeline]

Specifically, to avoid occupying the main CRIU process, we only launch a thread in POST_FORKING without waiting for it to complete. Instead, we wait for this thread in the later hook, RESUME_DEVICES_LATE. By the time RESUME_DEVICES_LATE runs, every child process has completed its restore logic, so no new commands will be sent to the thread. Therefore, we can immediately send a stop command to the thread and wait for it to finish processing and terminate.

This design won't add extra waiting time to the main CRIU process: even when GPU data is restored in the child processes, the main CRIU process must wait for the children to finish before calling the RESUME_DEVICES_LATE hook.

This is not a big change, but it makes the whole design cleaner and avoids the occupation issue.
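A minimal sketch of this hook pairing, with assumed names (parallel_receiver, send_exit_cmd, parallel_sock); note that RESUME_DEVICES_LATE is invoked per task, so a real implementation must stop and join the thread only once:

#include <pthread.h>

static pthread_t parallel_thread;
static int parallel_sock;	/* set up earlier, e.g. fetched via fdstore */

/* POST_FORKING: spawn the receiver and return immediately, so neither
 * the main CRIU process nor other plugins are blocked. */
int amdgpu_plugin_post_forking(void)
{
	return pthread_create(&parallel_thread, NULL,
			      parallel_receiver, &parallel_sock);
}

/* RESUME_DEVICES_LATE: children have finished their restore logic, so
 * no new commands can arrive; tell the thread to stop and join it. */
int amdgpu_plugin_resume_devices_late(int target_pid)
{
	if (send_exit_cmd(parallel_sock))	/* illustrative helper */
		return -1;
	return pthread_join(parallel_thread, NULL);
}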

We are making the change and will post updates once done. Please feel free to let us know if you have any further comments/suggestions on this.

> P.S. I like all these detailed comments, but why have none of these details been explained in the commit messages? It would speed up the review process, and in the future it would help others understand the ideas behind the implementation.

Big thanks for the suggestion. We will add the key information from the discussion to the commit messages in the next version, and will also follow this principle in future contributions :)

Currently, in the target process, device-related restore operations and
other restore operations run almost entirely sequentially. While the
target process executes the corresponding CRIU hook functions, it cannot
perform other restore operations. However, for GPU applications, some
device restore operations have no logical dependencies on other common
restore operations and can be parallelized with them to speed up the
process.

Instead of launching a thread in child processes for parallelization,
this patch chooses to add a new hook, `POST_FORKING`, in the main CRIU
process to handle these restore operations. This is because the
restoration of memory state in the restore blob is one of the most
time-consuming parts of all restore logic. The main CRIU process can
easily parallelize these operations, whereas parallelizing in threads
within child processes is challenging.

- POST_FORKING

*POST_FORKING: Hook to enable the main CRIU process to perform some
plugin restore operations.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Currently, when CRIU calls `cr_plugin_init`, `fdstore` is not yet
initialized. However, during the plugin restore procedure, some file
descriptors may need to be shared across multiple hooks. This patch
moves `cr_plugin_init` after `fdstore_init`, allowing `cr_plugin_init`
to use `fdstore` to stash such descriptors.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
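A minimal sketch of the described reordering in the restore path (assumed shape of the call sites; the actual change lives in CRIU's restore setup code):

/* Before this patch, cr_plugin_init() ran while fdstore was unusable. */
if (fdstore_init())
	return -1;
/* After the move, plugin init may stash fds in fdstore for later hooks. */
if (cr_plugin_init(CR_PLUGIN_STAGE__RESTORE))
	return -1;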
Currently, parallel restore only focuses on the single-process
situation. Therefore, it needs an interface to know whether there is
only one process to restore. This patch adds a `has_children` function
to `pstree.h` and replaces some existing open-coded checks with this
function.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
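One plausible shape for the helper, assuming CRIU's struct pstree_item keeps its children on a list_head (a sketch, not the exact patch):

#include <stdbool.h>

/* pstree.h: true if the task has at least one child in the process tree. */
static inline bool has_children(struct pstree_item *item)
{
	return !list_empty(&item->children);
}

/* Illustrative use: take the parallel fast path only for a lone task. */
if (has_children(root_item))
	parallel_disabled = true;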
When parallel restore is enabled, the target process and the main CRIU
process need an IPC interface to communicate and transfer restore
commands. This patch adds a Unix domain stream (TCP-style) socket and
stores it in `fdstore`.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
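A sketch of one way to set this up, assuming a socketpair()-based stream channel whose task-side end is parked in fdstore (fdstore_add()/fdstore_get() are existing CRIU helpers; the patch may instead bind and listen on a named socket):

#include <sys/socket.h>
#include <unistd.h>

static int parallel_sk_id;	/* fdstore id visible to restored tasks */

int parallel_socket_init(void)
{
	int sk[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sk))
		return -1;

	/* Tasks later fetch their end with fdstore_get(parallel_sk_id). */
	parallel_sk_id = fdstore_add(sk[1]);
	close(sk[1]);	/* fdstore keeps its own reference */
	if (parallel_sk_id < 0) {
		close(sk[0]);
		return -1;
	}
	return sk[0];	/* main-process end, handed to the receiver thread */
}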
Currently, restoring buffer object content consumes a significant amount
of time. However, this part has no logical dependencies on other restore
operations. This patch introduces structures and helper functions for
the target process to offload this task to the main CRIU process.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
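An illustrative shape for the offload command and its sender; the field and function names are assumptions, not the patch's actual definitions:

#include <stdint.h>
#include <unistd.h>

/* Enough for the main CRIU process to locate the BO content in the
 * checkpoint image and the device/BO it must be written back to. */
struct parallel_restore_bo_cmd {
	uint32_t gpu_id;	/* target device */
	uint64_t handle;	/* BO identifier within the image */
	uint64_t offset;	/* offset of the content in the image file */
	uint64_t size;		/* bytes of content to restore */
};

/* Target process: queue one BO for the main CRIU process to restore. */
static int send_bo_cmd(int sk, const struct parallel_restore_bo_cmd *cmd)
{
	return write(sk, cmd, sizeof(*cmd)) == sizeof(*cmd) ? 0 : -1;
}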
This patch implements the entire logic to enable the offloading of
buffer object content restoration.

The goal of this patch is to offload the buffer object content
restoration to the main CRIU process so that this restoration can occur
in parallel with other restoration logic (mainly the restoration of
memory state in the restore blob, which is time-consuming) to speed up
the restore phase. The restoration of buffer object content usually
takes a significant amount of time for GPU applications, so
parallelizing it with other operations can reduce the overall restore
time.

It has three parts: the first replaces the restoration of buffer objects
in the target process by sending a parallel restore command to the main
CRIU process; the second implements the POST_FORKING hook in the amdgpu
plugin to enable buffer object content restoration in the main CRIU
process; the third stops the parallel thread in the RESUME_DEVICES_LATE
hook.

This optimization only targets the single-process situation (the common
case). In other scenarios, it falls back to the original method. This is
achieved with the new `parallel_disabled` flag.

Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
Signed-off-by: Yanning Yang <yangyanning@sjtu.edu.cn>
@wweewrwer (Author)

The previous suggestions have been addressed, and both the regression test and the related tests have been completed. Please let us know if you have any further comments. :)
