new video reading API crash #5419

Open
prabhat00155 opened this issue Feb 14, 2022 · 11 comments

@prabhat00155
Contributor

prabhat00155 commented Feb 14, 2022

🐛 Describe the bug

I get malloc(): memory corruption when running the following code with a video file.

import torchvision

# Path to the video whose metadata is listed below
path = "/data/home/prabhatroy/data/output.mp4"
reader = torchvision.io.VideoReader(path, num_threads=1)
# Read and print the first four frames
for _ in range(4):
    data = next(reader)
    print(data)

Video metadata:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/data/home/prabhatroy/data/output.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.45.100
  Duration: 00:01:02.00, start: 0.000000, bitrate: 838 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 360x360 [SAR 1:1 DAR 1:1], 694 kb/s, 60 fps, 60 tbr, 16k tbn, 2k tbc (default)
    Metadata:
      handler_name    : VideoHandler
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 127 kb/s (default)
    Metadata:
      handler_name    : SoundHandler

On debugging, it points at this line as the culprit:

outFrame = torch::zeros({outHeight, outWidth, numChannels}, torch::kByte);

Versions

Collecting environment information...
PyTorch version: 1.11.0.dev20220203+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.20.4
Libc version: glibc-2.27

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-1051-aws-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] torch==1.11.0.dev20220203+cu111
[pip3] torchvision==0.12.0a0+22f8dc4
[conda] numpy 1.22.2 pypi_0 pypi
[conda] torch 1.11.0.dev20220203+cu111 pypi_0 pypi
[conda] torchvision 0.12.0a0+22f8dc4 dev_0

@prabhat00155
Contributor Author

prabhat00155 commented Feb 22, 2022

On kinetics dataset: kinetics/070618/train_avi-480p/knitting/q2sqyxhLiDU_000031_000041.avi, I get the following crash:

python: malloc.c:2401: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
Aborted (core dumped)

@prabhat00155
Contributor Author

On the Kinetics dataset file kinetics/070618/600/train/visiting_the_zoo/KgVpMUp4KmM_000000_000010.mp4, I get the following crashes (on different runs):

malloc(): memory corruption
Aborted (core dumped)

and

free(): invalid next size (fast)
Aborted (core dumped)

@bjuncek
Contributor

bjuncek commented Mar 3, 2022

Documenting current debugging:

  • verify if the segfaults are caught in the test suite -- Feb 28th
  • update test suite to check for the problematic video ([wip] address issue 5419 #5543) -- March 2nd

Note:
Upon env re-creation and re-install (caused by a server crash due to power outage), I can't reproduce this error.
Having said that, the video is producing wrong results in both video_reader and VideoReader APIs.

Just a quick edit:
Still can't reproduce this with a fully updated driver and environment. Will dive into nvcodec versions at some point next week.

@v-iashin

v-iashin commented May 8, 2022

@prabhat00155 do you still have this error?

Could you share the version of ffmpeg in your environment?

@v-iashin

v-iashin commented May 10, 2022

@bjuncek

I could reproduce the problem as follows:

  1. Install pytorch using the conda line from the website.
  2. Update ffmpeg with conda install -c conda-forge ffmpeg.

Pytorch installs ffmpeg==4.2, while conda-forge updates it to 4.3.2.
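
To see which ffmpeg binary an environment actually resolves, one option is to parse the first line of `ffmpeg -version`. The sketch below is only an illustration: the helper names `parse_ffmpeg_version` and `installed_ffmpeg_version` are made up for this example, and it assumes the banner has the usual `ffmpeg version X.Y[.Z] ...` form. The version of the binary on PATH is not necessarily the version of the libraries torchvision was built against.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Matches banners such as "ffmpeg version 4.3.2 Copyright ..." (assumed format).
_VERSION_RE = re.compile(r"ffmpeg version \D*?(\d+)\.(\d+)(?:\.(\d+))?")


def parse_ffmpeg_version(banner: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor[, patch]) from the output of `ffmpeg -version`."""
    match = _VERSION_RE.search(banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups() if part is not None)


def installed_ffmpeg_version() -> Optional[Tuple[int, ...]]:
    """Return the version of the ffmpeg binary on PATH, or None if unavailable."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-version"], capture_output=True, text=True, check=False
    )
    return parse_ffmpeg_version(result.stdout)


if __name__ == "__main__":
    print(installed_ffmpeg_version())
```

Running this before and after the conda-forge update should show whether the binary moved from 4.2 to 4.3.2 in a given environment.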

@alexnwang

@v-iashin
Does this error still occur if you use ffmpeg==4.2?

@v-iashin

v-iashin commented Jun 30, 2022

I think torch has changed the ffmpeg version that it installs with torchvision (4.2 -> 4.3).

When I try to call VideoReader, it throws: RuntimeError: Not compiled with video_reader support, to enable video_reader support, please install ffmpeg (version 4.2 is currently supported) and build torchvision from source.

I tried to install ffmpeg=4.2 from conda-forge or just with conda install ffmpeg=4.2. It did install it but the error on VideoReader call persists.

Reproduce:

  1. conda create -n issue5419
  2. conda install python (python=3.10.4)
  3. conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge (torch=1.12, torchvision=0.13.0, cudatoolkit=11.6.0, ffmpeg=4.3)
  4. >>> import torchvision; torchvision.io.VideoReader('path-to-mp4.mp4') --> RuntimeError
  5. conda install [-c conda-forge] ffmpeg=4.2
  6. >>> import torchvision; torchvision.io.VideoReader('path-to-mp4.mp4') --> RuntimeError

@alexnwang

It seems the error occurs at torchvision.io._read_video_from_memory and is affected by the decoded resolution.
The larger the target resolution, the more malloc errors occur. I've managed to avoid them altogether by setting the argument video_min_dimension=224.

Suppose this makes sense as it is some sort of memory error.
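
To make the size effect concrete, the arithmetic can be sketched as below. This is only an estimate under stated assumptions: frames are decoded to 3-channel uint8 (matching the torch::zeros({outHeight, outWidth, numChannels}, torch::kByte) allocation quoted earlier in the thread), and video_min_dimension rescales the shorter edge to the given value while keeping the aspect ratio. The helper names are hypothetical, and the rounding may differ from what the decoder does.

```python
from typing import Tuple


def resize_shorter_edge(height: int, width: int, min_dimension: int) -> Tuple[int, int]:
    """Scale (height, width) so the shorter edge equals min_dimension."""
    scale = min_dimension / min(height, width)
    return round(height * scale), round(width * scale)


def frame_bytes(height: int, width: int, channels: int = 3) -> int:
    """Bytes needed for one uint8 frame of shape (height, width, channels)."""
    return height * width * channels


if __name__ == "__main__":
    # Example source resolutions (height, width); 360x360 is the video from the report.
    for h, w in [(360, 360), (480, 640), (1080, 1920)]:
        new_h, new_w = resize_shorter_edge(h, w, 224)
        print(
            f"{h}x{w}: {frame_bytes(h, w)} B -> "
            f"{new_h}x{new_w}: {frame_bytes(new_h, new_w)} B"
        )
```

A smaller per-frame buffer does not explain the corruption by itself; it only shows how much less memory each decoded frame touches at the reduced size.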

@v-iashin

v-iashin commented Jul 22, 2022

Suppose this makes sense as it is some sort of memory error.

It is not consistent with my findings. Previously, when I compared ffmpeg=4.2 and 4.3.2, torch could load the same video into memory with one version and failed with the other.

I think the same error is being caused by two different reasons: a lack of RAM (your case) and a version mismatch (at least in my case).

@bjuncek
Contributor

bjuncek commented Jul 22, 2022

Hi all,
So I've dug quite a bit into this for the past two weeks (and am continuing to do so), and there are a few confusing factors.
The error, as far as my tracebacks show, comes from the fact that the data returned from FFMPEG is larger than the allocated tensor (whose size should be guaranteed by the headers), so there is some sort of mismatch.
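
The kind of mismatch described here can be illustrated with a small guard that compares the decoded payload against the size implied by the header before copying. This is a Python sketch of the idea only, not the torchvision decoder code (which is C++); the function name `copy_into_frame` and the 3-channel uint8 layout are assumptions for the example.

```python
def copy_into_frame(decoded: bytes, height: int, width: int, channels: int = 3) -> bytearray:
    """Copy decoded bytes into a frame buffer sized from header dimensions.

    Raises ValueError instead of writing past the buffer when sizes disagree.
    """
    expected = height * width * channels
    if len(decoded) != expected:
        raise ValueError(
            f"decoded frame has {len(decoded)} bytes, "
            f"but the header implies {expected} ({height}x{width}x{channels})"
        )
    frame = bytearray(expected)
    frame[:] = decoded
    return frame


if __name__ == "__main__":
    ok = copy_into_frame(bytes(2 * 2 * 3), 2, 2)
    print(len(ok))
    try:
        # Payload for a 2x3 frame while the header claims 2x2.
        copy_into_frame(bytes(2 * 3 * 3), 2, 2)
    except ValueError as err:
        print("rejected:", err)
```

In native code the same check would turn a silent heap overwrite into an explicit error, which is easier to trace than a later malloc or free abort.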

What I can't seem to figure out is why that is happening. I've tried hi-res videos and didn't have an issue, but then a video from #6204 does. The codec looks the same as in some other videos, and it passes ffprobe without an issue.

I've been getting some help from colleagues at QS, so hopefully I will be able to get to the bottom of this.

@hmaarrfk
Contributor

We've had to disable ffmpeg support at conda-forge.

We can reliably recreate a segfault that seems to occur during the video read tests.

Curiously, it doesn't occur on python 3.9.

This occurs for CPU builds too, not just GPU.

Build logs can be followed at conda-forge/torchvision-feedstock#60
