new video reading API crash #5419

prabhat00155 · 2022-02-14T13:19:15Z

🐛 Describe the bug

I get malloc(): memory corruption when running the following code with a video file.

import torchvision

path = "/data/home/prabhatroy/data/output.mp4"  # file described under "Video metadata" below
reader = torchvision.io.VideoReader(path, num_threads=1)
# Read and print the first four frames.
for _ in range(4):
    data = next(reader)
    print(data)

Video metadata:

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from '/data/home/prabhatroy/data/output.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2avc1mp41
    encoder         : Lavf58.45.100
  Duration: 00:01:02.00, start: 0.000000, bitrate: 838 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 360x360 [SAR 1:1 DAR 1:1], 694 kb/s, 60 fps, 60 tbr, 16k tbn, 2k tbc (default)
    Metadata:
      handler_name    : VideoHandler
    Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 127 kb/s (default)
    Metadata:
      handler_name    : SoundHandler

On debugging, it points at this line as the culprit:

vision/torchvision/csrc/io/video/video.cpp

Line 314 in 0db67d8

outFrame = torch::zeros({outHeight, outWidth, numChannels}, torch::kByte);

Versions

Collecting environment information...
PyTorch version: 1.11.0.dev20220203+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.20.4
Libc version: glibc-2.27

Python version: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-1051-aws-x86_64-with-glibc2.17
Is CUDA available: False
CUDA runtime version: 11.1.105
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/local/cuda-10.1/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] torch==1.11.0.dev20220203+cu111
[pip3] torchvision==0.12.0a0+22f8dc4
[conda] numpy 1.22.2 pypi_0 pypi
[conda] torch 1.11.0.dev20220203+cu111 pypi_0 pypi
[conda] torchvision 0.12.0a0+22f8dc4 dev_0


prabhat00155 · 2022-02-22T00:07:42Z

On kinetics dataset: kinetics/070618/train_avi-480p/knitting/q2sqyxhLiDU_000031_000041.avi, I get the following crash:

python: malloc.c:2401: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
Aborted (core dumped)

prabhat00155 · 2022-02-22T00:10:06Z

On kinetics dataset: kinetics/070618/600/train/visiting_the_zoo/KgVpMUp4KmM_000000_000010.mp4, I get the following crashes (on different runs):

malloc(): memory corruption
Aborted (core dumped)

and

free(): invalid next size (fast)
Aborted (core dumped)

bjuncek · 2022-03-03T22:20:32Z

Documenting current debugging:

verify if the segfaults are caught in the test suite -- Feb 28th
update test suite to check for the problematic video ([wip] address issue 5419 #5543) -- March 2nd
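
One way to check whether a crash like this is caught by a test suite is to run the decoding code in a child process, so that an abort raised by malloc kills only the child and not the test runner. The following is a minimal sketch, not torchvision's actual test code. `run_isolated` is a hypothetical helper, and the "killed by signal" branch assumes a POSIX system, where a negative return code is the signal number.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0):
    """Run `code` in a fresh interpreter and report how it ended.

    Returns (ok, detail); ok is True only when the child exited with status 0.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return True, "exited cleanly"
    if proc.returncode < 0:
        # POSIX only: a negative return code is the signal that killed the child.
        return False, "killed by " + signal.Signals(-proc.returncode).name
    return False, "exit status %d" % proc.returncode
```

Passing the reproduction snippet from the first comment as `code` would then report a failure instead of taking the whole test process down with it.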

Note:
Upon env re-creation and re-install (caused by a server crash due to power outage), I can't reproduce this error.
Having said that, the video is producing wrong results in both video_reader and VideoReader APIs.

Just a quick edit:
Still can't reproduce this with a fully updated driver and environment. Will dive into nvcodec versions at some point next week.

v-iashin · 2022-05-08T12:25:56Z

@prabhat00155 do you still have this error?

Could you share the version of ffmpeg in your environment?
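
For anyone answering this question, a small sketch of how the ffmpeg version could be collected from Python. Both function names are hypothetical. Note that this reports the `ffmpeg` binary found on PATH, which is not necessarily the same build as the libraries torchvision was linked against.

```python
import re
import shutil
import subprocess


def parse_ffmpeg_version(banner: str):
    """Extract the version token from the first line of `ffmpeg -version` output."""
    match = re.match(r"ffmpeg version (\S+)", banner)
    return match.group(1) if match else None


def installed_ffmpeg_version():
    """Return the version of the ffmpeg binary on PATH, or None if there is none."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    out = subprocess.run([exe, "-version"], capture_output=True, text=True)
    return parse_ffmpeg_version(out.stdout)
```

In a conda environment, `conda list ffmpeg` gives the installed package version directly.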

v-iashin · 2022-05-10T07:06:11Z

@bjuncek

I could reproduce the problem as follows:

Install PyTorch using the conda command from the website.
Update ffmpeg with conda install -c conda-forge ffmpeg.

Pytorch installs ffmpeg==4.2, while conda-forge updates it to 4.3.2.

alexnwang · 2022-06-30T14:44:57Z

@v-iashin
Does this error still occur if you use ffmpeg==4.2?

v-iashin · 2022-06-30T20:16:54Z

I think torch has changed the ffmpeg version that it installs with torchvision (4.2 -> 4.3).

When I try to call VideoReader, it throws: RuntimeError: Not compiled with video_reader support, to enable video_reader support, please install ffmpeg (version 4.2 is currently supported) and build torchvision from source.

I tried to install ffmpeg=4.2 from conda-forge or just with conda install ffmpeg=4.2. It did install it but the error on VideoReader call persists.

Reproduce:

conda create -n issue5419
conda install python (python=3.10.4)
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge (torch=1.12, torchvision=0.13.0, cudatoolkit=11.6.0, ffmpeg=4.3)
>>> import torchvision; torchvision.io.VideoReader('path-to-mp4.mp4') --> RuntimeError
conda install [-c conda-forge] ffmpeg=4.2
>>> import torchvision; torchvision.io.VideoReader('path-to-mp4.mp4') --> RuntimeError
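
The RuntimeError in the steps above is an ordinary Python exception, so it can be captured and logged rather than ending the session. A minimal sketch follows. `try_open` is a hypothetical helper that takes the constructor as an argument, so it does not itself depend on torchvision. It cannot catch the malloc aborts reported earlier in this issue, because those kill the interpreter.

```python
def try_open(path, opener):
    """Call opener(path).

    Returns (reader, None) on success, or (None, message) if a
    RuntimeError such as "Not compiled with video_reader support" is raised.
    """
    try:
        return opener(path), None
    except RuntimeError as err:
        return None, str(err)
```

Usage would look like `reader, err = try_open('path-to-mp4.mp4', torchvision.io.VideoReader)`, followed by printing `err` when `reader` is None.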

alexnwang · 2022-07-21T17:19:04Z

It seems the error occurs at torchvision.io._read_video_from_memory and is affected by the decoded resolution.
The larger the target resolution, the more malloc errors occur. I've managed to avoid them altogether by setting the argument video_min_dimension=224.

Suppose this makes sense as it is some sort of memory error.
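
To give a sense of how much smaller the decoded frames become with this workaround, here is a rough sketch. It assumes that video_min_dimension rescales the shorter edge to the given value while keeping the aspect ratio, which is my reading of the torchvision docstring. The rounding rule and both function names are assumptions. The 360x360 size is taken from the video metadata in the first comment, not from the videos used here.

```python
def decoded_size(width, height, min_dimension=0):
    """Frame size after scaling the shorter edge to min_dimension (0 means no resize)."""
    if min_dimension <= 0:
        return width, height
    shorter = min(width, height)
    # Assumed rounding; the decoder may round differently.
    new_w = max(1, round(width * min_dimension / shorter))
    new_h = max(1, round(height * min_dimension / shorter))
    return new_w, new_h


def frame_bytes(width, height, channels=3):
    """Bytes needed for one uint8 frame of shape (height, width, channels)."""
    return width * height * channels


full = frame_bytes(*decoded_size(360, 360))        # 388800 bytes
small = frame_bytes(*decoded_size(360, 360, 224))  # 150528 bytes
```

Smaller buffers would make an overrun less likely to land on allocator metadata, but that does not explain the underlying cause.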

v-iashin · 2022-07-22T04:29:09Z

> Suppose this makes sense as it is some sort of memory error.

It is not consistent with my findings. Previously, when I compared ffmpeg=4.2 and 4.3.2, torch could load a video into memory with one version and failed on the same video with the other.

I think the same error is being caused by two different reasons: a lack of RAM (your case) and a version mismatch (at least in my case).

bjuncek · 2022-07-22T13:09:48Z

Hi all,
So I've dug quite a bit into this for the past two weeks (and am continuing to do so), and there are a few confusing factors.
The error, as far as my tracebacks show, comes from the fact that the data returned from FFMPEG is larger than the allocated tensor. The tensor size should be guaranteed by the headers, but there is some sort of mismatch.

What I can't seem to figure out is why that is happening. I've tried hi-res videos and didn't have an issue, but a video from #6204 does. The codec looks the same as in some other videos, and it passes ffprobe without an issue.

I've been getting some help from colleagues at QS, so hopefully I will be able to get to the bottom of this.
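
The kind of mismatch described above can be illustrated with a bounds check made before decoded bytes are copied into a preallocated buffer. This is only a Python sketch of the idea, with a hypothetical `copy_frame` helper. It is not the C++ code in video.cpp and not a proposed fix.

```python
def copy_frame(payload: bytes, height: int, width: int, channels: int = 3) -> bytearray:
    """Copy decoded bytes into a buffer sized from the header dimensions.

    Raises ValueError instead of writing past the end of the buffer when
    the decoder returns more data than the headers promised.
    """
    expected = height * width * channels
    if len(payload) > expected:
        raise ValueError(
            "decoded frame has %d bytes, buffer holds %d" % (len(payload), expected)
        )
    frame = bytearray(expected)  # zero-filled, like torch::zeros in the original line
    frame[: len(payload)] = payload
    return frame
```

In C++ the unchecked version of this copy would write past the allocation and corrupt the heap, which matches the malloc and free messages reported earlier in this issue.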

hmaarrfk · 2022-07-25T02:22:13Z

We've had to disable ffmpeg support at conda-forge.

We can reliably recreate a segfault that seems to occur during the video read tests.

Curiously, it doesn't occur on python 3.9.

This occurs for CPU builds too, not just GPU.

Build logs can be followed at conda-forge/torchvision-feedstock#60

prabhat00155 added bug module: video module: io labels Feb 14, 2022

bjuncek mentioned this issue Apr 1, 2022

2022: state of video IO in torchvision #5720

Open

18 tasks

datumbox mentioned this issue Jun 30, 2022

video_reader core dumps on specific video #6204

Open

datumbox assigned bjuncek Jul 1, 2022

pseudoyim mentioned this issue Aug 23, 2022

torchvision 0.11.3 AnacondaRecipes/torchvision-feedstock#4

Merged

hungdtrn mentioned this issue Sep 6, 2022

Memory Error while loading Kinetic Dataset facebookresearch/SlowFast#551

Open
