cuDevicePrimaryCtxGetState() returns error 3 (CUDA_ERROR_NOT_INITIALIZED) in a resumed snapshot under certain circumstances #15
How does your snapshot tool generate the snapshot and restore it after the cuda-checkpoint toggle, if you can describe the process? If I'm understanding your flow correctly, the failure happens after the app is toggled to the checkpointed state, snapshotted, resumed from the snapshot, and toggled back to the running state, after which a subsequent call to cuDevicePrimaryCtxGetState() returns not initialized. Is the snapshot resume done from a cold start or with the currently running app (similar to CRIU --leave-running)? One thing to note is that NVML support isn't available just yet, so the checkpoint will leave some stale references to /dev/nvidiactl and /dev/nvidia0...N. That's unlikely to be the issue in your snapshot tool, but just in case, there may be some entries in /proc/pid/maps that aren't handled properly. Usually, though, PyTorch apps only use NVML at the start to query for information and then don't touch it again. I'll try to test this internally, replacing the snapshot tool with CRIU, and see if I can replicate it.
I tested this with internal NVML support and the CRIU+CUDA plugin, and it looks like it was able to checkpoint/restore properly. I removed the getenv() check and the second dosnap check, and I let the application run until the first "waiting for dosnap". I then issued a criu dump on the process with the cuda plugin, which dumped and exited. I then ran the criu restore from the dump, did touch dosnap, and let it resume and exit. Here's the produced output just in case, but I got the same results running with and without the CRIU dump. By the way, which driver version are you running?
We have our snapshot/resume software co-resident in the process with the running application. Our software is what is executed from the command line; we in turn load the application into the process's memory and transfer control to it. We intercept application system calls (this is a simplification) and remember "important" information from those calls (such as open fds, created threads, mapped memory regions, and probably others I don't remember). Snapshot requests can be made from the application via our system call, or a snapshot can be requested externally using a pipe. For the case under consideration, the application initiates the snapshot.
The snapshot request results in the generation of a Linux core file, which is in ELF format. We add additional note types to remember the information needed to reconstruct the application's kernel state: open files, active threads, auxv, and the other stuff I don't remember. We do not include information about our snapshot software in the core file, apart from version information so that we don't try to load a newer-version snapshot with an older version of the software that wouldn't understand it. One thing we do not track and save is kernel state set up with ioctl() requests; we are hoping that cuda-checkpoint saves the GPU-related kernel state created through ioctl(). We don't support snapshotting families of related processes, only a single process.
To resume a snapshot, our software is invoked from the command line with the core file containing the snapshot supplied as the name of the application executable. We recognize that it is a snapshot and restore open files, active threads, and the application's memory regions and their contents. Note that we do not restore the process IDs and thread IDs. Instead, we give out "normalized" thread IDs (from 1 to N) and intercept and alter the file paths of open calls that go after things like "/proc/XXXX/maps". We do not have full coverage of the /proc namespace for these games.
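For illustration only: a minimal sketch (not the snapshot tool's code) of enumerating the PT_NOTE entries in a little-endian ELF64 core file, which is where the extra, tool-specific note types described above would sit alongside the standard CORE/LINUX notes. The file name in the usage comment is hypothetical.

```python
# Minimal sketch: walk the PT_NOTE segments of a little-endian ELF64 core file
# and yield (owner, type, descriptor) for each note. Custom owners/types would
# carry the tool-specific restore metadata described above.
import struct

PT_NOTE = 4

def iter_core_notes(path):
    with open(path, "rb") as f:
        ident = f.read(16)
        assert ident[:4] == b"\x7fELF" and ident[4] == 2, "expects ELF64"
        # Fields after e_ident: type, machine, version, entry, phoff, shoff,
        # flags, ehsize, phentsize, phnum, shentsize, shnum, shstrndx
        hdr = struct.unpack("<HHIQQQIHHHHHH", f.read(48))
        e_phoff, e_phentsize, e_phnum = hdr[4], hdr[8], hdr[9]
        for i in range(e_phnum):
            f.seek(e_phoff + i * e_phentsize)
            p_type, _flags, p_offset, _va, _pa, p_filesz = struct.unpack("<IIQQQQ", f.read(40))
            if p_type != PT_NOTE:
                continue
            f.seek(p_offset)
            data, off = f.read(p_filesz), 0
            while off + 12 <= len(data):
                namesz, descsz, ntype = struct.unpack_from("<III", data, off)
                off += 12
                owner = data[off:off + namesz].rstrip(b"\0").decode(errors="replace")
                off += (namesz + 3) & ~3          # note names are 4-byte aligned
                desc = data[off:off + descsz]
                off += (descsz + 3) & ~3          # descriptors too
                yield owner, ntype, desc

# Usage (hypothetical core file name):
# for owner, ntype, desc in iter_core_notes("core.1234"):
#     print(owner, hex(ntype), len(desc))
```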
Snapshot resume is done from a cold start in this case; the snapped process is terminated when the snapshot is successfully created.
I have a question about using cuda-checkpoint with the NVIDIA CUPTI library. Should cuda-checkpoint work when the CUPTI library is being used by the application?
One more question: can you tell me why cuDevicePrimaryCtxGetState() fails with the error that the interface is not initialized when we know there is a call to cuInit() preceding the call to cuDevicePrimaryCtxGetState()? I was hoping to use CUPTI call tracing to see if something had deinitialized between cuInit() and cuDevicePrimaryCtxGetState(), but, as mentioned above, CUPTI caused the toggle operation to fail. Are there plans to allow CUPTI and cuda-checkpoint to coexist?
CUDA checkpoint functionality currently doesn't work with devtools like CUPTI. I do eventually want to enable support for that, but it hasn't been hugely requested just yet. The way cuda-checkpoint works is that all the GPU state should be tracked in host memory, so as long as all those parts are restored properly and then the restore and unlock actions are executed, it should work. I can't quite think of a good reason why cuDevicePrimaryCtxGetState() would return not initialized, other than some CPU state of the process not being properly saved/restored. The only other thing I can think of is that the current release doesn't have support for NVML, which PyTorch apps usually use, so there may be stale references to /dev/nvidiactl and /dev/nvidia{0..N} in /proc/pid/maps and open fds. This shouldn't lead to the issue you're seeing, but it's something to point out.
I can outline the way our integration with CRIU works in case that provides some insight. Once all processes in the tree have successfully been placed into the lock state and the whole process tree is frozen, we go through each process, unfreeze the restore thread (the --get-restore-tid flag will return the thread id for this), issue a checkpoint action to save the state, then re-freeze the restore thread. On the restore side, once all memory contents have been placed back into memory but before we resume the process, we traverse the process tree issuing restore actions (starting up the CUDA restore thread early) and unlock actions to release the APIs for use.
NVML support is coming soon, though. I also tested your example app with our CRIU integration and was able to checkpoint and restore from a CRIU dump just fine, if that's an option for you down the line.
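For reference, here is that per-process sequence as a minimal orchestration sketch, not the actual CRIU plugin. It assumes the individual steps are exposed by the cuda-checkpoint CLI as --action lock|checkpoint|restore|unlock next to the --get-restore-tid flag mentioned above (only --toggle appears elsewhere in this thread, so verify against cuda-checkpoint --help for your driver), and it leaves the actual freezing and thawing of threads to CRIU.

```python
# Sketch of the checkpoint/restore ordering described above; flags other than
# --toggle and --get-restore-tid are assumptions about the cuda-checkpoint CLI.
import subprocess

def cuda_ckpt(*args):
    return subprocess.run(["cuda-checkpoint", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

def checkpoint_process(pid):
    # 1. Quiesce the CUDA APIs; once every process in the tree is locked,
    #    CRIU freezes the whole tree.
    cuda_ckpt("--action", "lock", "--pid", str(pid))
    restore_tid = int(cuda_ckpt("--get-restore-tid", "--pid", str(pid)))
    # 2. CRIU unfreezes only restore_tid, then the driver moves GPU state
    #    into host memory so it is captured with the process dump...
    cuda_ckpt("--action", "checkpoint", "--pid", str(pid))
    # 3. ...and restore_tid is re-frozen before the memory dump is taken.
    return restore_tid

def restore_process(pid):
    # After memory contents are back in place but before the process resumes:
    cuda_ckpt("--action", "restore", "--pid", str(pid))  # recreate GPU state early
    cuda_ckpt("--action", "unlock", "--pid", str(pid))   # release the APIs for use
```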
I found out what is wrong here. Apparently cuInit() remembers the PID of the process that was snapshotted. I found the method used by CRIU to get back the same PID for the snapshotted process. When I added code to do this before the snapshot is resumed, the problem went away.
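For anyone else who hits this: the technique CRIU uses on older kernels to give a restored task back its original PID is to seed /proc/sys/kernel/ns_last_pid just before forking (newer kernels can use clone3() with set_tid instead). A minimal sketch, assuming CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE and accepting that it is racy if anything else forks in the PID namespace at the same moment:

```python
# Minimal sketch of the ns_last_pid trick; requires privileges and is racy.
import os

def fork_with_pid(target_pid):
    # The kernel hands out target_pid to the next task created after
    # ns_last_pid is set to target_pid - 1.
    with open("/proc/sys/kernel/ns_last_pid", "w") as f:
        f.write(str(target_pid - 1))
    pid = os.fork()
    if pid == 0:
        # Child: this is where the resumed application would continue,
        # now running under its pre-snapshot PID.
        assert os.getpid() == target_pid, "lost the race for the PID"
        os._exit(0)
    return pid  # parent: should equal target_pid
```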
Good catch. Yeah, the process expects to retain its PID; otherwise we assume a fork() has happened. Closing this out then.
We are using cuda-checkpoint to save GPU context into a process's memory and then using our own
snapshot facilities to create an executable that will resume execution at the point of the snapshot.
The current version of our snapshot tool is not open source.
The application we are using is PyTorch. Briefly, this is the Python program we run:
============================================================================================
============================================================================================
Place MAKE_SNAPSHOT=1 into the environment.
Remove the file "dosnap" before running the python program.
Run the python program using our snapshot generator/resume code.
Once you get the "waiting for dosnap to exist" message, run "cuda-checkpoint --toggle --pid xxxx"
Then run "touch dosnap", a snapshot will be generated.
Remove the file "dorun" before resuming the snapshot.
After we resume one of our snapshots, wait for the "waiting for dorun to exist" prompt,
then run "cuda-checkpoint --toggle --pid xxxx",
and then run "touch dorun"; the snapshot will run.
The following error will happen.
If you look at the PyTorch source, you find the following in aten/src/ATen/cuda/detail/CUDAHooks.cpp:
Line 67 is the call to cuDevicePrimaryCtxGetState().
Placing a breakpoint after cuDevicePrimaryCtxGetState() returns, we find that the value 3 is returned.
That is CUDA_ERROR_NOT_INITIALIZED; the description of that error says cuInit() has not been called.
I placed a breakpoint on cuInit() and reran the test. I found that cuInit() had been called before the
failing call to cuDevicePrimaryCtxGetState().
I wondered if there had been other calls to cuDevicePrimaryCtxGetState(), so I placed a breakpoint on
cuDevicePrimaryCtxGetState() and resumed the snapshot again. I found there had been approximately 50 successful
calls. The call to cuInit() was made and then the failing call to cuDevicePrimaryCtxGetState() was made.
If we perform a single inference before running cuda-checkpoint, then take a snapshot, then resume the snapshot,
run cuda-checkpoint, and then perform another inference, the inference following snapshot resume works.
There is no failure return from cuDevicePrimaryCtxGetState().
If you remove MAKE_SNAPSHOT from the environment and run the test again (no snapshot is generated, no snapshot to resume),
the program works. The call to cuDevicePrimaryCtxGetState() does not fail.
Can you help me understand what is wrong?