[Config Support]: Whole Machine Crashing - looking for some tips #8470

Open
madasus opened this issue Nov 5, 2023 · 135 comments
Labels
beta (related to the current beta version of Frigate), support, triage

Comments

@madasus

madasus commented Nov 5, 2023

Describe the problem you are having

I have two Docker hosts, and both have a Coral. Frigate seems to cause the whole host to freeze completely (the console is not responsive) at frequent intervals: right now I would say every 48 hours on average, but it's not consistent. I've moved the Docker container to the other host and cleared out all the other containers, and the freeze follows Frigate.

It's likely that Frigate is pushing the hosts much harder than any other container and is perhaps exposing a bug somewhere in the hardware or OS. The devices are BeeLink devices running the latest Ubuntu.

Looking for some advice - has anyone seen this sort of behavior and identified the cause?

This has been happening for many months, so it is not related to the Frigate beta or any particular Frigate version (and this is likely NOT a Frigate bug).

Version

0.13 Beta 3

Frigate config file

database:
  path: /db/frigate.db

mqtt:
  host: 10.2.1.171
  user: mqtt
  password: xxx

ffmpeg:
#  hwaccel_args: -c:v h264_qsv
#  hwaccel_args: preset-intel-qsv-h264
  hwaccel_args: preset-vaapi

logger:
  # Optional: default log level (default: shown below)
  default: warning
  # Optional: module by module log level configuration
  logs:
    frigate.mqtt: error

detectors:
  coral:
    type: edgetpu
    device: usb

motion:
  # Optional: The threshold passed to cv2.threshold to determine if a pixel is different enough to be counted as motion. (default: shown below)
  # Increasing this value will make motion detection less sensitive and decreasing it will make motion detection more sensitive.
  # The value should be between 1 and 255.
  threshold: 40
  contour_area: 20
  lightning_threshold: 0.7

detect:
  max_disappeared: 500
  width: 1280
  # Optional: height of the frame for the input with the detect role (default: shown below)
  height: 720


timestamp_style:
  # Optional: Position of the timestamp (default: shown below)
  #           "tl" (top left), "tr" (top right), "bl" (bottom left), "br" (bottom right)
  position: tl
  # Optional: Format specifier conform to the Python package "datetime" (default: shown below)
  #           Additional Examples:
  #             german: "%d.%m.%Y %H:%M:%S"
  format: '%m/%d/%Y %H:%M:%S'
  # Optional: Color of font
  color:
    # All Required when color is specified (default: shown below)
    red: 255
    green: 255
    blue: 255
  # Optional: Line thickness of font (default: shown below)
  thickness: 1
  # Optional: Effect of lettering (default: shown below)
  #           None (No effect),
  #           "solid" (solid background in inverse color of font)
  #           "shadow" (shadow for font)
  effect: solid


birdseye:
  # Optional: Enable birdseye view (default: shown below)
  enabled: true
  # Optional: Width of the output resolution (default: shown below)
  width: 1280
  # Optional: Height of the output resolution (default: shown below)
  height: 720
  # Optional: Encoding quality of the mpeg1 feed (default: shown below)
  # 1 is the highest quality, and 31 is the lowest. Lower quality feeds utilize less CPU resources.
  quality: 8
  # Optional: Mode of the view. Available options are: objects, motion, and continuous
  #   objects - cameras are included if they have had a tracked object within the last 30 seconds
  #   motion - cameras are included if motion was detected in the last 30 seconds
  #   continuous - all cameras are included always
  mode: objects
  restream: true

objects:
  track:
  - person
  - cat


record:
  enabled: true
  events:
    retain:
      default: 10
      mode: active_objects
    pre_capture: 5
    post_capture: 15

  sync_on_startup: true
  expire_interval: 60

# Optional: Configuration for the jpg snapshots written to the clips directory for each event
# NOTE: Can be overridden at the camera level
snapshots:
  # Optional: Enable writing jpg snapshot to /media/frigate/clips (default: shown below)
  enabled: true
  # Optional: save a clean PNG copy of the snapshot image (default: shown below)
  clean_copy: true
  # Optional: print a timestamp on the snapshots (default: shown below)
  timestamp: false
  # Optional: draw bounding box on the snapshots (default: shown below)
  bounding_box: false
  # Optional: crop the snapshot (default: shown below)
  crop: false
  # Optional: height to resize the snapshot to (default: original size)
  height: 175
  # Optional: Restrict snapshots to objects that entered any of the listed zones (default: no required zones)
  required_zones: []
  # Optional: Camera override for retention settings (default: global values)
  retain:
    # Required: Default retention days (default: shown below)
    default: 10
    # Optional: Per object retention days
    objects:
      person: 15
  # Optional: quality of the encoded jpeg, 0-100 (default: shown below)
  quality: 70


ui:
  # Optional: Set the default live mode for cameras in the UI (default: shown below)
  live_mode: mse
  # Optional: Set a timezone to use in the UI (default: use browser local time)
  timezone: America/New_York
  # Optional: Use an experimental recordings / camera view UI (default: shown below)
  use_experimental: false
  # Optional: Set the time format used.
  # Options are browser, 12hour, or 24hour (default: shown below)
  time_format: 12hour
  # Optional: Set the date style for a specified length.
  # Options are: full, long, medium, short
  # Examples:
  #    short: 2/11/23
  #    medium: Feb 11, 2023
  #    full: Saturday, February 11, 2023
  # (default: shown below).
  date_style: full
  # Optional: Set the time style for a specified length.
  # Options are: full, long, medium, short
  # Examples:
  #    short: 8:14 PM
  #    medium: 8:15:22 PM
  #    full: 8:15:22 PM Mountain Standard Time
  # (default: shown below).
  time_style: medium
  # Optional: Ability to manually override the date / time styling to use strftime format
  # https://www.gnu.org/software/libc/manual/html_node/Formatting-Calendar-Time.html
  # possible values are shown above (default: not set)
  strftime_fmt: '%Y/%m/%d %H:%M'

telemetry:
  # Optional: Enabled network interfaces for bandwidth stats monitoring (default: shown below)
  #network_interfaces:
  #  - eth
  #  - enp
  #  - eno
  #  - ens
  #  - wl
  #  - lo
  # Optional: Configure system stats
  stats:
    # Enable AMD GPU stats (default: shown below)
   # amd_gpu_stats: True
    # Enable Intel GPU stats (default: shown below)
    intel_gpu_stats: true
    # Enable network bandwidth stats monitoring for camera ffmpeg processes, go2rtc, and object detectors. (default: shown below)
    network_bandwidth: false
  # Optional: Enable the latest version outbound check (default: shown below)
  # NOTE: If you use the HomeAssistant integration, disabling this will prevent it from reporting new versions
  version_check: true

cameras:

REMOVED - but I have about 15


I also wanted to include my docker compose for ideas

version: "3"
services:
  frigate:
    image: ghcr.io/blakeblackshear/frigate:0.13.0-beta3
#    image: ghcr.io/blakeblackshear/frigate:dev-c743dfd
    shm_size: "2048mb"
    container_name: frigate
    privileged: true
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - /disk1/docker/frigate/config:/config
#      - /disk1/docker/frigate/db:/db
#      - /disk1/docker/frigate/media:/media/frigate
      - /etc/localtime:/etc/localtime:ro
      - /dev/bus/usb:/dev/bus/usb
    environment:
     - PUID=0
     - PGID=0
     - TZ=America/New_York
     - FRIGATE_RTSP_PASSWORD="xxx"
     - PLUS_API_KEY=xxx
    restart: unless-stopped

Relevant log output

None that I can find relevant.

Frigate stats

No response

Operating system

Other

Install method

Docker Compose

Coral version

USB

Any other information that may be helpful

No response

@NickM-27 added the beta label Nov 5, 2023
@NickM-27
Collaborator

NickM-27 commented Nov 5, 2023

There's no info provided here so there is nothing to go off of. You first need to figure out why the machine is actually freezing (is it memory issue, kernel panic, etc.)

@madasus
Author

madasus commented Nov 5, 2023

Nothing is shown on the console. I'll check to see if there is anything in syslog. The last time I checked there was nothing; the whole machine was just frozen.

@NickM-27
Collaborator

NickM-27 commented Nov 5, 2023

It can happen for many different reasons. If no information can be provided, there's not really much that can be done on the Frigate side. There are plenty of options, like having a log written to a file so the cause can be seen in the logs after restarting the machine.

Also, you can try putting a memory limit on the frigate container
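One way to act on the "log written to a file" suggestion is a small sampler that appends memory figures to a file on disk and forces each line out with fsync, so the last samples before a hard freeze are still readable after the reboot. This is only a minimal sketch, assuming a Linux host that exposes /proc/meminfo; the log path, the interval, and all function names are made up for illustration and are not part of Frigate.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(meminfo, now):
    """Build one log line with the fields most useful after a freeze."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    fields = ("MemTotal", "MemAvailable", "SwapFree")
    body = " ".join(f"{key}={meminfo.get(key, -1)}kB" for key in fields)
    return f"{stamp} {body}"


def append_durably(path, line):
    """Append a line and fsync it so it is on disk before a hard hang."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def sample_forever(path="/var/log/memwatch.log", interval_s=10):
    """Hypothetical entry point; the path and interval are assumptions."""
    while True:
        with open("/proc/meminfo", encoding="utf-8") as source:
            info = parse_meminfo(source.read())
        append_durably(path, format_sample(info, time.time()))
        time.sleep(interval_s)
```

If MemAvailable trends toward zero in the last lines before a freeze, memory pressure becomes a plausible cause and the container memory limit is worth trying; if it stays flat, the freeze is more likely somewhere else.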

@blakeblackshear
Owner

The next steps would be to back down frigate to a bare minimum config and slowly add parts back until you can see what is causing the issue.
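The "bare minimum, then add parts back" approach can be planned up front. The following is a minimal sketch, assuming the config has already been loaded into a plain dict; which sections are treated as required (here mqtt, detectors, cameras) and the order in which the optional ones come back are assumptions chosen for illustration, not Frigate's rules.

```python
def staged_configs(full_config, required, add_back_order):
    """Yield (label, config) pairs: the minimal config first, then one
    optional section restored per stage, cumulatively."""
    current = {key: full_config[key] for key in required if key in full_config}
    yield "minimal", dict(current)
    for key in add_back_order:
        if key in full_config:
            current[key] = full_config[key]
            yield f"+{key}", dict(current)


# Section names taken from the config in this thread (values trimmed).
full = {
    "mqtt": {"host": "10.2.1.171"},
    "detectors": {"coral": {"type": "edgetpu", "device": "usb"}},
    "cameras": {},
    "ffmpeg": {"hwaccel_args": "preset-vaapi"},
    "birdseye": {"enabled": True},
    "record": {"enabled": True},
}
stages = list(
    staged_configs(
        full,
        required=["mqtt", "detectors", "cameras"],
        add_back_order=["record", "birdseye", "ffmpeg"],
    )
)
for label, config in stages:
    print(label, sorted(config))
```

Each stage would need to run for longer than the usual time-to-freeze (about 48 hours in this report) before moving on; the first stage that freezes points at the section that was just added.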

@madasus
Author

madasus commented Nov 7, 2023

Thanks, I'm following the other thread also. I also added some more debugging to Ubuntu to see if I can capture anything in the logs before the freeze.

Do you have any suggestions on where to start removing parts of the config? Is the hwaccel param a place to start?

ffmpeg:
  hwaccel_args: preset-vaapi

@antipesto93

antipesto93 commented Nov 8, 2023

This comment is not very helpful but I had the exact issue running the containers on kubernetes (microk8s on ubuntu).
Host would crash, have to power cycle. No useful information in logs or kernel log.

I ended up removing my Coral (M.2) and switching to CPU/VAAPI detection for now. It's been a few weeks without issue.

It's a long shot, but it could be worth trying the same to rule it out. I have not gone back to the Coral, as I have only 1 camera doing detection and CPU usage is not high.

@ggidofalvy-tc

ggidofalvy-tc commented Nov 11, 2023

I have a similar issue, running an i5-6500T, no external accelerator, and so far I've been able to ascertain the following:

  • VAAPI hardware accelerated video decoding/encoding, CPU detector -> no crash
  • Software decoding/encoding, OpenVINO detector running -> no crash
  • VAAPI hardware accelerated video decoding/encoding, OpenVINO detector running -> machine hangs after 16-24 hours

Here's my config using three random camera feeds from the Internet that I use for debugging, currently the hardware acceleration for decoding/encoding is commented out:

mqtt:
  enabled: false

go2rtc:
  streams:
#    test_camera_1_main:
#      - rtsp://admin:xxxxxxxxxxx@192.168.10.17:554/h265Preview_01_main
#    test_camera_1_sub:
#      - rtsp://admin:xxxxxxxxxxx@192.168.10.17:554/h264Preview_01_sub
    test1:
   #  - ffmpeg:http://78.31.82.246/mjpg/video.mjpg#video=h264#hardware
     - ffmpeg:http://78.31.82.246/mjpg/video.mjpg#video=h264

    test2:
   #  - ffmpeg:http://webcam.zvnoordwijk.nl:82/mjpg/video.mjpg#video=h264#hardware
     - ffmpeg:http://webcam.zvnoordwijk.nl:82/mjpg/video.mjpg#video=h264
    test3:
   #  - ffmpeg:http://tacocam.tacoma.uw.edu/mjpg/video.mjpg#video=h264#hardware
     - ffmpeg:http://tacocam.tacoma.uw.edu/mjpg/video.mjpg#video=h264

cameras:
#  test_camera_1: # <------ Name the camera
#    ffmpeg:
#      output_args:
#        record: preset-record-generic-audio-copy
#      inputs:
#        - path: rtsp://127.0.0.1:8554/test_camera_1_sub # <----- The stream you want to use for detection
#          input_args: preset-rtsp-restream
#          hwaccel_args: preset-vaapi
#          roles:
#            - detect
#        - path: rtsp://127.0.0.1:8554/test_camera_1_main # <----- The stream you want to use for recording
#          input_args: preset-rtsp-restream
#          hwaccel_args: preset-vaapi
#          roles:
#            - record
#    record:
#      enabled: True
#    detect:
#      enabled: True # <---- disable detection until you have a working camera feed
#      width: 640 # <---- update for your camera's resolution
#      height: 480 # <---- update for your camera's resolution
#    live:
#      stream_name: test_camera_1_main
  test1: # <------ Name the camera
    ffmpeg:
      inputs:
      - path: rtsp://127.0.0.1:8554/test1   # <----- The stream you want to use for detection
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        roles:
        - detect
      - path: rtsp://127.0.0.1:8554/test1   # <----- The stream you want to use for recording
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        roles:
        - record
    record:
      enabled: true
    detect:
      enabled: true # <---- disable detection until you have a working camera feed
      width: 1280 # <---- update for your camera's resolution
      height: 720 # <---- update for your camera's resolution
    live:
      stream_name: test1
    objects:
      track:
      - person
      - car
  test2: # <------ Name the camera
    ffmpeg:
      inputs:
      - path: rtsp://127.0.0.1:8554/test2   # <----- The stream you want to use for detection
        input_args: preset-rtsp-restream
        hwaccel_args: preset-vaapi
        roles:
        - detect
      - path: rtsp://127.0.0.1:8554/test2   # <----- The stream you want to use for recording
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        roles:
        - record
    record:
      enabled: true
    detect:
      enabled: true # <---- disable detection until you have a working camera feed
      width: 1280 # <---- update for your camera's resolution
      height: 720 # <---- update for your camera's resolution
    live:
      stream_name: test2
    objects:
      track:
      - person
  test3: # <------ Name the camera
    ffmpeg:
      inputs:
      - path: rtsp://127.0.0.1:8554/test3   # <----- The stream you want to use for detection
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        roles:
        - detect
      - path: rtsp://127.0.0.1:8554/test3   # <----- The stream you want to use for recording
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        roles:
        - record
    record:
      enabled: true
    detect:
      enabled: true # <---- disable detection until you have a working camera feed
      width: 1920 # <---- update for your camera's resolution
      height: 1080 # <---- update for your camera's resolution
    live:
      stream_name: test3
    objects:
      track:
      - person

    motion:
      mask:
      - 716,0,723,359,126,378,129,0
      - 1920,0,1920,0,1920,731,1869,783,1804,823,1722,860,1604,855,1480,838,1314,778,1299,729,1188,676,1123,667,1061,683,1010,642,978,598,961,516,850,496,755,464,674,447,603,306,582,0
record:
  retain:
    days: 0
    mode: all
  events:
    retain:
      default: 14
      mode: motion
      objects:
        person: 30

detectors:
  ov:
    type: openvino
    device: AUTO
    model:
      path: /openvino-model/ssdlite_mobilenet_v2.xml

model:
  width: 300
  height: 300
  input_tensor: nhwc
  input_pixel_format: bgr
  labelmap_path: /openvino-model/coco_91cl_bkgr.txt

# Include all cameras by default in Birdseye view
birdseye:
  enabled: true

I tried to grab kernel crashdump via kdump, and also tried out kernel netconsole (dmesg) logging to another server running on the same network, but neither resulted in any output, which makes me think it's a driver issue that affects the CPU itself, not even a kernel crash.

I'm running the beta2 image in docker-compose; the beta3 image has an issue with go2rtc failing to parse the camera feed URLs.

If you have any ideas for any further troubleshooting I could do, please do let me know.

@Pingbo

Pingbo commented Nov 11, 2023

@ggidofalvy-tc
I have the same issue on Unraid: #8461

After roughly 12-24 hours the host crashes when using OpenVINO.

I tried different drivers on the host, but that didn't help.

Currently thinking about buying a Coral...

@audiophonicz

I have the same issue on K3s on Debian with an i3-6100U. VAAPI HW encode/decode + OpenVINO setup in config.

With object detection turned off it's rock solid. If I turn on object detection for a single object on a single camera, the whole node hangs within 3 days.

Those of us using the official Helm chart can't update go2rtc or ffmpeg with custom versions.

@ggidofalvy-tc

Adding onto my previous comment:

Running Ubuntu 22.04, tried both the GA (5.15) and HWE (6.2) kernels, both exhibited the same crash behaviour.

@NickM-27
Collaborator

#8338 (comment) may be relevant with a couple suggestions (and other linked issue)

@kevin-david
Contributor

kevin-david commented Nov 13, 2023

@ggidofalvy-tc @madasus Especially if your Frigate machine is headless, I would recommend removing the often-default quiet kernel command-line parameter and adding debug. That's what helped in my case (linked above) to at least narrow the issue down to the GPU, though I have made limited progress beyond that, as described in the issue NickM linked. My errors only showed up on the physical console, due to the hang.

It's certainly suspicious that what I reported in #8338 also involves an i7-6600(U) / Skylake GPU, the same generation as both of yours. I wonder if there is a driver bug or hardware quirk, absent in other generations, that the i915 driver isn't handling.

@madasus
Author

madasus commented Nov 14, 2023

@kevin-david My host is headless, so I'll give this a try. Will the debug output then be written to syslog? How are you grabbing it?

Can you point me in the direction of where you made this change in your Linux distro? (I'm using Ubuntu.)

I'm glad I opened this thread, as it appears this is not an isolated problem. While it's likely not a Frigate issue, it's probably something that Frigate exposes through the load it puts on the underlying hardware/software.

Thanks

M

@kevin-david
Contributor

kevin-david commented Nov 14, 2023

@madasus Sure. I am using Proxmox, so it should be similar. In my case the message never appeared in syslog, only on the physically connected screen; I guess that because the machine was hung, it couldn't be written to syslog. This might mean you need to temporarily connect a monitor to the machine.

To do what I was talking about, you'll want to change GRUB_CMDLINE_LINUX_DEFAULT in the /etc/default/grub file and run update-grub to regenerate the configuration file, and reboot.

This describes it a little more: https://askubuntu.com/a/19487. Again in my case, I removed quiet which resulted in messages logged to the console, and added debug (which I'm not sure makes a huge difference, but isn't super noisy either)
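The edit to GRUB_CMDLINE_LINUX_DEFAULT can be sketched as a plain text transformation. The function name adjust_cmdline is hypothetical, and the sketch assumes the usual quoted form of the line in /etc/default/grub; it only rewrites a string and does not touch the file, run update-grub, or reboot.

```python
import shlex


def adjust_cmdline(line, remove=("quiet",), add=("debug",)):
    """Return the GRUB_CMDLINE_LINUX_DEFAULT line with flags removed/added.

    Lines that do not set GRUB_CMDLINE_LINUX_DEFAULT are returned unchanged.
    The rewritten line is returned without a trailing newline.
    """
    key = "GRUB_CMDLINE_LINUX_DEFAULT="
    stripped = line.strip()
    if not stripped.startswith(key):
        return line
    # shlex removes the surrounding quotes used in /etc/default/grub.
    value = shlex.split(stripped[len(key):])
    flags = value[0].split() if value else []
    flags = [flag for flag in flags if flag not in remove]
    for flag in add:
        if flag not in flags:
            flags.append(flag)
    return f'{key}"{" ".join(flags)}"'


print(adjust_cmdline('GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"'))
```

After writing the changed line back to /etc/default/grub, the remaining steps are the ones described above: run update-grub and reboot.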

@ggidofalvy-tc

I gave echo 0 | sudo tee /sys/class/drm/card0/engine/rcs0/preempt_timeout_ms a spin, but no luck in preventing/prolonging the crash.

This is what I got in dmesg:

[525594.184400] [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110

I'll keep a lookout for more messages at the netconsole destination now that I've rebooted again and set the debug kernel command-line flag. I won't be applying the "fix" this time around.
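For reading messages like the one above: the ret = -110 is most likely a negated Linux errno, and 110 is ETIMEDOUT on Linux, which would mean the GuC reset timed out. A small sketch for pulling the pieces out of such a line follows; the regular expressions are written against the single dmesg line quoted here, and the errno table is a hand-picked subset, so treat both as assumptions.

```python
import re

# Linux errno values for a few codes that show up in driver messages.
# Hard-coded on purpose so the result does not depend on the platform
# running this script.
LINUX_ERRNO = {5: "EIO", 12: "ENOMEM", 16: "EBUSY", 110: "ETIMEDOUT"}

DRM_ERROR = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"\[drm:(?P<func>\w+) \[(?P<driver>\w+)\]\] \*ERROR\* (?P<msg>.*)"
)
RET_CODE = re.compile(r"ret = (-?\d+)")


def parse_drm_error(line):
    """Split a dmesg DRM error line into uptime, function, driver, message."""
    match = DRM_ERROR.search(line)
    if not match:
        return None
    result = match.groupdict()
    result["uptime"] = float(result["uptime"])
    code = RET_CODE.search(result["msg"])
    if code:
        number = abs(int(code.group(1)))
        result["errno"] = LINUX_ERRNO.get(number, str(number))
    return result


line = "[525594.184400] [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110"
info = parse_drm_error(line)
print(info["driver"], info["func"], info["errno"])
print(f"uptime at error: {info['uptime'] / 86400:.1f} days")
```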

@Pingbo

Pingbo commented Nov 18, 2023

@ggidofalvy-tc @madasus

I probably found a solution... I've been running a yolov8s model for some days and it has now been stable for more than 48 hours without any crash. Perhaps you can try this as well?

@ggidofalvy-tc

ggidofalvy-tc commented Nov 20, 2023

@Pingbo can you share your model and detector config.yml snippets? Sorry for the mild derail, I would like to see if this might be a model-specific issue, not an OpenVINO-related one. Running the beta2 branch, since beta3 has issues with go2rtc with my config.

I've been trying to get yolov8n/yolov8s running on my setup based on the notebook linked in this comment: #5184 (comment)

But I keep getting an error when the detector starts up:

2023-11-20 12:55:12.267683201  Traceback (most recent call last):
2023-11-20 12:55:12.267747530    File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
2023-11-20 12:55:12.267749391      self.run()
2023-11-20 12:55:12.267806449    File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
2023-11-20 12:55:12.267808281      self._target(*self._args, **self._kwargs)
2023-11-20 12:55:12.267855642    File "/opt/frigate/frigate/object_detection.py", line 102, in run_detector
2023-11-20 12:55:12.267857527      object_detector = LocalObjectDetector(detector_config=detector_config)
2023-11-20 12:55:12.267898126    File "/opt/frigate/frigate/object_detection.py", line 53, in __init__
2023-11-20 12:55:12.267899858      self.detect_api = create_detector(detector_config)
2023-11-20 12:55:12.267941613    File "/opt/frigate/frigate/detectors/__init__.py", line 18, in create_detector
2023-11-20 12:55:12.267943162      return api(detector_config)
2023-11-20 12:55:12.267986312    File "/opt/frigate/frigate/detectors/plugins/openvino.py", line 26, in __init__
2023-11-20 12:55:12.267988059      self.ov_model = self.ov_core.read_model(detector_config.model.path)
2023-11-20 12:55:12.268047582  RuntimeError: Check 'false' failed at src/frontends/common/src/frontend.cpp:53:
2023-11-20 12:55:12.268049096  Converting input model
2023-11-20 12:55:12.268050592  Cannot create Interpolate layer /model.10/Resize id:164 from unsupported opset: opset11

My config.yml bits, attempting to run the yolov8n model:

detectors:
  ov:
    type: openvino
    device: AUTO
    model:
      path: /config/openvino-model/yolov8n.xml

model:
  width: 416
  height: 416
  input_tensor: nhwc
  input_pixel_format: bgr
  model_type: yolov8
  labelmap_path: /openvino-model/coco_91cl_bkgr.txt

(All 3 output files are mounted at /config/openvino-model; I'm reusing the labelmap from the original MobileNet SSD model.)

@madasus
Author

madasus commented Nov 21, 2023

When mine froze this time, I managed to check the console, and no messages at all had been written to it before the crash.

@Pingbo Can you elaborate on how to use the model you are suggesting? Is it being used instead of the Coral?

@Pingbo

Pingbo commented Nov 22, 2023

@ggidofalvy-tc

That's how I did it:

  1. Generate a yolo model with https://colab.research.google.com/drive/1G05mESOhDdM1HpinKMJZWpI_jxNq_qIO?usp=sharing#scrollTo=rKnUE62F925P
  2. Download the following labelmap: https://github.com/openvinotoolkit/open_model_zoo/blob/master/data/dataset_classes/coco_80cl.txt
  3. My Config:
detectors:
  ov:
    type: openvino
    device: GPU
    model:
      # path: /openvino-model/ssdlite_mobilenet_v2.xml
      path: /config/yolov8s/yolov8s.xml

model:
  # width: 300
  # height: 300
  width: 416
  height: 416
  input_tensor: nchw  # nhwc
  input_pixel_format: bgr
  model_type: yolov8
  labelmap_path: /config/coco_80cl.txt  # /openvino-model/coco_91cl_bkgr.txt

@madasus
Yes, this uses OpenVINO as the detector, not the Coral. As far as I know, you cannot use the Coral and YOLO models together.

@audiophonicz

audiophonicz commented Dec 6, 2023

Thank you for the detail @Pingbo

The only things I would clarify for others: (1) put all 3 files from the .zip produced by the YOLO model generation into the model folder, and (2) the files generated for me were named yolo8n.xml, so make sure your file path is correct. Hopefully this is the fix.

Edit: 2 weeks running with solid person detection using the yolov8n model on a single camera. Looks like CPU usage dropped significantly for me. Enabling it on the rest of my cameras now.

@ggidofalvy-tc

@Pingbo Thank you for the help and the detailed instructions! I've been using yolov8n for nearly two weeks now without any crashing on beta2.

I think the issue might indeed be caused by the combination of the bundled ssdlite_mobilenet_v2 model and Skylake-gen OpenVINO -- is this perhaps worth documenting somewhere?

@FeatherKing

FeatherKing commented Dec 8, 2023

Wanted to chime in here. I'm a new Frigate user as of about two weeks ago. My hardware is an i7-7700 (Kaby Lake). I am running Frigate and wyze-bridge together. Wyze-bridge correctly uses Intel QSV with ffmpeg, and Frigate uses it fine with ffmpeg as well. However, whenever I tried to use any OpenVINO detector, it would crash the container every time. If I set the detector to cpu (not OpenVINO CPU), the container would start and detect fine.

Today I followed these steps by @Pingbo and my OpenVINO detector finally starts with GPU selected. My inference speed went from 45 ms (CPU) to 15 ms (OpenVINO GPU).

The only error I could make out from the container was:
RuntimeError: The input blob size is not equal to the network input size: got 307200 expecting 270000
I tried spinning up various Python OpenVINO demos inside the container and got similar errors, such as:
Resulting shape '{1,3,300,3}' after preprocessing is not aligned with original parameter's shape: {1,300,300,3}, input parameter: image_tensor.
This led me to believe that something about the bundled Frigate OpenVINO model and Kaby Lake was not working out.
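The two numbers in that RuntimeError can be checked with simple arithmetic. The expected 270000 matches a 300x300 input with 3 channels, which is the size of the bundled model's input. Which shape produced 307200 is not stated anywhere in this thread; the sketch below only lists shapes from a hand-picked set of common sizes (an assumption) whose element count matches.

```python
def tensor_elements(width, height, channels=3):
    """Number of elements in a width x height image tensor."""
    return width * height * channels


def candidate_shapes(total, channels=(1, 3), sizes=(300, 320, 416, 480, 640)):
    """List (width, height, channels) combos from `sizes` whose product is `total`."""
    found = []
    for ch in channels:
        for w in sizes:
            for h in sizes:
                if w >= h and w * h * ch == total:
                    found.append((w, h, ch))
    return found


print(tensor_elements(300, 300))   # size the network expects
print(candidate_shapes(270000))    # shapes matching "expecting 270000"
print(candidate_shapes(307200))    # shapes matching "got 307200"
```

Both 320x320 with 3 channels and 640x480 with 1 channel give 307200, so the error alone cannot say which input was actually passed; it only shows that the frame handed to the detector was not 300x300x3.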

Anyway, the yolov8 model from the above comment seems to have resolved my issue for now. I've been stable for a few hours (where previously I was unable to even start the containers). I will continue to monitor. (Thanks @Pingbo!!)

Edit: I am on Frigate version 0.12.1-367D724

@bean72

bean72 commented Dec 13, 2023

Followed the advice of @Pingbo for using the yolov8 model as well. Been running for a couple weeks without any issues. My detections are more reliable as well, so that is an added bonus. Thanks @Pingbo

@ivanjx

ivanjx commented Sep 9, 2024

@vista- your config crashes my i5-6500T a couple hours later.
i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:8ed97ff2, in frigate.detecto [942217]

@matew00

matew00 commented Sep 9, 2024

Hello,

Similar issue here, introduced in 0.14.
With ffmpeg decoding on an AMD iGPU, first the streams/cameras become unavailable, then the whole container becomes unresponsive and the iGPU crashes as well.
The automatic reset of the iGPU does not succeed.

If I use the GPU statistics plugin, it also makes the GUI unresponsive.

AMD Ryzen 7900, 64 GB RAM, Coral TPU (USB).

@NickM-27
Collaborator

NickM-27 commented Sep 9, 2024

Without logs, dmesg, etc there's not much to go off of here. I have two Frigate instances running using my AMD 5700G iGPU without any issues.

@matew00

matew00 commented Sep 9, 2024

> Without logs, dmesg, etc there's not much to go off of here. I have two Frigate instances running using my AMD 5700G iGPU without any issues.

Here, from unraid:

Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:24 vmid:4 pasid:32771, for process ffmpeg pid 15489 thread ffmpeg:cs0 pid 15658)
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x00008001072fb000 from client 0x12 (VMC)
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00403831
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: VCN (0x1c)
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x1
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:24 vmid:4 pasid:32771, for process ffmpeg pid 15489 thread ffmpeg:cs0 pid 15658)
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu:   in page starting at address 0x00008001072fb000 from client 0x12 (VMC)
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00000000
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 Faulty UTCL2 client ID: MP0 (0x0)
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 MORE_FAULTS: 0x0
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 WALKER_ERROR: 0x0
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
Aug 30 21:20:29 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: 	 RW: 0x0
Aug 30 21:20:39 unraid kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec_0 timeout, signaled seq=18111395, emitted seq=18111397
Aug 30 21:20:39 unraid kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ffmpeg pid 15489 thread ffmpeg:cs0 pid 15658
Aug 30 21:20:39 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: GPU reset begin!
Aug 30 21:20:39 unraid kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Aug 30 21:20:40 unraid kernel: [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x00000360 != 0x00000320
Aug 30 21:20:40 unraid kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: free PSP TMR buffer
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: MODE2 reset
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: GPU reset succeeded, trying to resume
Aug 30 21:20:40 unraid kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F41FC00000).
Aug 30 21:20:40 unraid kernel: [drm] PSP is resuming...
Aug 30 21:20:40 unraid kernel: [drm] reserve 0xa00000 from 0xf41e000000 for PSP TMR
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: RAS: optional ras ta ucode is not available
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: RAP: optional rap ta ucode is not available
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: SMU is resuming...
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: smu driver if version = 0x00000004, smu fw if version = 0x00000005, smu fw program = 0, smu fw version = 0x00544fde (84.79.222)
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: SMU driver if version not matched
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: SMU is resumed successfully!
Aug 30 21:20:40 unraid kernel: [drm] DMUB hardware initialized: version=0x05000C00
Aug 30 21:20:40 unraid kernel: [drm] kiq ring mec 2 pipe 1 q 0
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring vcn_dec_0 test failed (-110)
Aug 30 21:20:40 unraid kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vcn_v3_0> failed -110
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: GPU reset(1) failed
Aug 30 21:20:40 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: GPU reset end with ret = -110
Aug 30 21:20:40 unraid kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
Aug 30 21:20:41 unraid kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Aug 30 21:20:41 unraid kernel: [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x00000010 != 0x00000000
Aug 30 21:20:41 unraid kernel: [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000002
Aug 30 21:20:43 unraid kernel: [drm] Fence fallback timer expired on ring sdma0
### [PREVIOUS LINE REPEATED 3 TIMES] ###
Aug 30 21:20:51 unraid kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring vcn_dec_0 timeout, signaled seq=18111397, emitted seq=18111397
Aug 30 21:20:51 unraid kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ffmpeg pid 15489 thread ffmpeg:cs0 pid 15658
Aug 30 21:20:51 unraid kernel: amdgpu 0000:0e:00.0: amdgpu: GPU reset begin!
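
For anyone comparing logs like the one above between Frigate versions, here is a minimal Python sketch (not part of Frigate) that counts ring timeouts, the process the kernel blames, and failed GPU resets. The function name and the regexes are assumptions modelled only on the message formats visible in this log; other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the amdgpu messages in the log above (assumption:
# other kernels may phrase these differently).
RING_TIMEOUT = re.compile(r"\*ERROR\* ring (\S+) timeout")
BLAMED_PROCESS = re.compile(r"Process information: process (\S+) pid (\d+)")
RESET_FAILED = re.compile(r"GPU reset\(\d+\) failed|GPU Recovery Failed")


def summarize_amdgpu_log(lines):
    """Count ring timeouts, blamed processes and failed resets in kernel log lines."""
    summary = {"ring_timeouts": Counter(), "processes": Counter(), "failed_resets": 0}
    for line in lines:
        if m := RING_TIMEOUT.search(line):
            summary["ring_timeouts"][m.group(1)] += 1
        if m := BLAMED_PROCESS.search(line):
            summary["processes"][m.group(1)] += 1
        if RESET_FAILED.search(line):
            summary["failed_resets"] += 1
    return summary
```

Feeding it the lines of a saved syslog (for example `summarize_amdgpu_log(open("syslog.txt"))`, where the file name is a placeholder) gives a quick per-ring and per-process tally to compare between a 0.13 run and a 0.14 run.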

@NickM-27
Collaborator

NickM-27 commented Sep 9, 2024

Yeah, this looks like an issue on the host. Since yours is newer, perhaps the older kernel that Unraid uses has problems with it; Unraid 7 with the newer kernel may improve things.

@matew00

matew00 commented Sep 9, 2024

Yes, issues on the host, but triggered by Frigate 0.14. 0.13 worked flawlessly in the same setup.
Some people run the same version (6.12.10) with an AMD iGPU and don't experience these issues.
I am not saying that Unraid 7 won't fix it; maybe it will, with the newer kernel etc. But still, I would either have to revert to 0.13 or find another solution.

@NickM-27
Collaborator

NickM-27 commented Sep 9, 2024

Not sure how 0.14 would be the cause, as it uses the same AMD GPU driver version and the same ffmpeg version as 0.13.

You could try updating to a newer ffmpeg version and see if that helps. https://docs.frigate.video/configuration/advanced#custom-ffmpeg-build

@matew00

matew00 commented Sep 11, 2024

Thank you for your feedback. I will try a few steps (including replacing ffmpeg with a newer version).
In the meantime I have spun up a second Frigate container with 0.13; so far so good, it is stable.
Then I will grab host and container logs from 0.14 and 0.13 to compare.

@johnmarksilly
Contributor

So strange. I just stumbled upon this thread. I get an OS hang around once a month.

I am using a Dell Optiplex full form factor. Only thing running is frigate in docker (latest beta). My measurement tools don't show anything erratic on CPU/GPU/MEM just before the hang, but I haven't yet grabbed any kernel logs. I'll have to do that next. The machine is running the latest Debian.

I'll be trying to set quicksync instead of vaapi. Definitely annoying not knowing the cause. Latest crash I was only at around 75% memory usage. CPU is below 12% all the time. Doesn't ever deviate outside of 8-12%. It definitely seems like a GPU/CPU kernel thing.

What should my SHM size be for 6 4k cameras? I have it set at 6gb I think. I'll have to double check.

@NickM-27
Collaborator

That's fine for 0.14. 0.15 doesn't need as much (it could be just 1 GB).
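
As a rough sanity check on the SHM question, here is a small sketch of the per-camera estimate as I understand it from the Frigate installation docs: one YUV420 frame (width * height * 1.5 bytes) times a number of buffered frames, plus a small fixed overhead, computed from the detect resolution rather than the record resolution. The number of buffered frames has changed between releases, so `frames` is deliberately a required argument, and the constants are assumptions to check against the docs for your version.

```python
def shm_estimate_mb(width, height, frames, overhead_bytes=270480):
    """Estimate shared memory (MiB) for one camera at the given detect resolution.

    Assumptions: each buffered frame is YUV420 (1.5 bytes per pixel) and
    `overhead_bytes` is the fixed per-camera overhead; verify both against
    the Frigate docs for the version you run.
    """
    return (width * height * 1.5 * frames + overhead_bytes) / 1048576


def total_shm_mb(cameras, frames, extra_mb=0):
    """Sum the per-camera estimates for (width, height) pairs.

    `extra_mb` is a hypothetical allowance for anything else kept in shm,
    such as logs.
    """
    return sum(shm_estimate_mb(w, h, frames) for w, h in cameras) + extra_mb
```

For example, `total_shm_mb([(1280, 720)] * 6, frames=20)` would estimate six cameras detecting at 720p; plugging in your own detect resolutions shows how far a given `shm_size` is from the estimate.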

@ivanjx

ivanjx commented Sep 21, 2024

In my case I had to switch OpenVINO from GPU to CPU and use QSV. No crashes so far for 1 week.

@NickM-27
Collaborator

Are you on 0.14 or dev? Dev has a newer driver and newer openvino that may help

@ivanjx

ivanjx commented Sep 21, 2024

Dev, but I think I'm fine with the current config if it is stable.

@djcrawleravp

Any luck with a Raspberry Pi 4? I am not able to run many tests because it is an office server.

@ivanjx

ivanjx commented Oct 12, 2024

this is stable for me with latest dev:

detectors:
  ov:
    type: openvino
    device: GPU

ffmpeg:
  hwaccel_args: ' '

@audiophonicz

audiophonicz commented Oct 13, 2024

Side Question for all of you: I have been running the yolov8s since Dec '23 without issue, however I have since upgraded and am no longer running on Skylake. One thing I've noticed with the yolo is that cars and pickups are detected fine, but box trucks or vans are not picked up. Do those running the yolov8n/s models have this behavior as well?

@NickM-27
Collaborator

When you say yolo what do you mean? UPS and FedEx are only supported with frigate+ model. If you are referring to the yolonas frigate+ model then I have no issues detecting these labels

@hua-bing

Side Question for all of you: I have been running the yolov8s since Dec '23 without issue

Are you running 0.13 or 0.12?

@ekaszubski

ekaszubski commented Nov 5, 2024

I just spent a week debugging the amdgpu crash issue on my 6900HX-based system and finally have a setup that seems stable for more than 12 hours at a time. The key seems to be to install a recent version of mesa-libgallium in the container (and possibly upgrade mesa-va-drivers); I'm at 24.2.4-1~bpo12+1 for both. My specific working setup is:

  • Minisforum UM690S 64G
  • Host running Proxmox 8.2.7 (Linux vm2 6.8.12-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-3 (2024-10-23T11:41Z) x86_64 GNU/Linux)
    • I updated to the latest available kernel but I don't think this had any impact since nothing was stable until the libgallium change
  • LXC container with debian 12 standard (12.7.1)
  • Started with Frigate 0.15.0-beta1 and added bookworm-backports, then installed libgallium and upgraded mesa-va-drivers
  • I'm using a wide variety of cameras:
    • 2x Eufy E340 floodlight cam
    • 1x Eufy S350 Indoor Cam
    • 1x Eufy E220 Indoor Cam
    • 1x Tapo C225
    • 1x Reolink Duo 3 PoE
    • 1x Reolink TrackMix PoE

I started with the same LXC container setup (debian 12 standard), with vanilla Frigate 0.14.1, and things would work great for about 12 hours before I'd get the amdgpu crash.

The same would happen running everything through a VM (also with debian 12 standard) with GPU passthrough, but the amdgpu crash would happen on the VM instead of the host.

I tried all kinds of kernel cmdline parameters, disabling the IOMMU, changing the iGPU VRAM setting in the BIOS, upgrading the PVE kernel, etc. The system was actually less stable with all of those changes. The single thing that seems to have brought consistent stability is the libgallium install with mesa-va-drivers upgrade.

I suspect that this same tweak in docker in the VM would have fixed the issue there too. Also, side note, using an LXC container and passing through the USB coral is getting about 2x better inference speed (7.5 ms vs 15 ms) for me compared to running in a VM and passing through the port, as recommended by the docs.

Anyway, I'm going to test a clean install to evaluate the minimum required change for stability, but if it ends up being an install of libgallium and an upgrade of mesa-va-drivers, is that something that you guys would be willing to push to the docker config?
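
In case it helps anyone reproduce this, here is a minimal sketch for checking from inside the container which builds of the two packages are actually installed. It shells out to `dpkg-query`; the helper names are made up for illustration, and the `~bpo` check simply relies on Debian backports versions carrying that marker (as in 24.2.4-1~bpo12+1 above).

```python
import subprocess

PACKAGES = ["mesa-libgallium", "mesa-va-drivers"]


def parse_versions(dpkg_output):
    """Parse 'package version' lines into a dict, skipping malformed lines."""
    versions = {}
    for line in dpkg_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def is_backport(version):
    """True if the version string carries the Debian backports marker."""
    return "~bpo" in version


def installed_versions(packages=PACKAGES):
    """Ask dpkg-query for installed versions (Debian-based systems only).

    Packages that are not installed are simply missing from the result.
    """
    result = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${Package} ${Version}\\n", *packages],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_versions(result.stdout)
```

Running `installed_versions()` in the container and passing each value to `is_backport` shows whether the container picked up the backports build or is still on the stock bookworm one.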

@NickM-27
Collaborator

NickM-27 commented Nov 5, 2024

The problem with updating mesa drivers is it breaks older hardware, so we'd have to consider what that would look like

@ekaszubski

@NickM-27 That makes sense; what about a variant of the container with a layer that just applies tweaks for Rembrandt and newer AMD GPUs that benefit from libgallium? I'm barely docker literate but can't we just slap on a layer that makes those tweaks on top of the standard image (i.e. it would be fairly trivial to maintain)?

@NickM-27
Collaborator

NickM-27 commented Nov 6, 2024

That's more effort for us to maintain, test, and deploy before every release.

@audiophonicz

Side Question for all of you: I have been running the yolov8s since Dec '23 without issue

Are you running 0.13 or 0.12?

Still on 0.12. I've been playing around with trying to get yolov11 converted to OpenVINO. I think my issue is just the COCO models.

Still confused as to why "car is listed twice because truck has been renamed to car by default." and what counts as a "truck", but that's probably a different discussion.

@T0m112233

T0m112233 commented Nov 15, 2024

Hi folks, just wanted to share an update: I had many freezes / crashes for a long, long time.

I recently upgraded my Intel NUC from 4GB to 12GB RAM and all the freeze problems disappeared.
After doing so - also managed to change the hwaccel_args parameter back to default (using GPU) with no problem.

@ChirpyTurnip

I have migrated my whole Unraid-based container configuration (2x Frigate, 2x HDD, 2x TPU) to a new AMD-based PC (moving from a Core i7 14700K Intel box). I'm happy to report that it is suddenly a lot more stable (no crashes after 4 days versus crashing within hours).

I wonder how many people reporting issues were on Core i5/i7/i9 Intel platforms versus AMD or other Intel variants (such as the NUC-based CPUs).

@luckyycode

luckyycode commented Dec 17, 2024

I got k3s on debian 6.15 with Intel Arc A380, got the whole node crash after 30-60 minutes using version 0.14. Xeon server, not arm64

It's not ffmpeg's fault and I tried both hwaccel and cpu, I tried both yolo-nas and ov models, still getting the whole node crash. Can't catch a kernel log because it's headless, even with kvm i can't see any log before the crash. When I turn off the detector I get no server freeze. Arc is being used as shown in Frigate system metrics (privileged container with sys_*). Shm size is 512 megs, 128 gb on host so it's not oom.

update: this is still happening on 0.15.0-beta3

My config:

mqtt:
  host: 172.14.0.3
  port: 1883
  user: 'abcd'
  password: 'abcd'

ffmpeg:
  input_args:
    - -rtsp_transport
    - tcp
    - -r
    - '25'
  hwaccel_args:
    - -hwaccel
    - vaapi
    - -hwaccel_device
    - /dev/dri/renderD128
    - -hwaccel_output_format
    - yuv420p

detectors:
  ov:
    type: openvino
    device: GPU

model:
  model_type: yolonas
  width: 640 
  height: 640
  input_tensor: nchw
  input_pixel_format: bgr
  path: /config/yolo_nas_s_fp16.onnx
  labelmap_path: /config/coco_80cl.txt

go2rtc:
  streams:
    front_garage_camera_stream_hd:
      - rtsp://admin:abcd@a.b.c.d:2554/cam/realmonitor?channel=1&subtype=0
    left_side_camera_stream_hd:
      - rtsp://admin:abcd@a.b.c.d:554/cam/realmonitor?channel=2&subtype=0
    right_side_camera_stream_hd:
      - rtsp://admin:abcd@a.b.c.d:554/cam/realmonitor?channel=3&subtype=0
    entrance_camera_stream_hd:
      - rtsp://admin:abcd@a.b.c.d:554/cam/realmonitor?channel=4&subtype=0

cameras:
  front_garage_camera:
    enabled: true
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/front_garage_camera_stream_hd
          input_args: preset-rtsp-restream
          roles:
            - detect
    detect:
      enabled: true
      width: 1920
      height: 1080
    live:
      stream_name: front_garage_camera_stream_hd
  left_side_camera:
    enabled: true
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/left_side_camera_stream_hd
          input_args: preset-rtsp-restream
          roles:
            - detect
    detect:
      enabled: true
      width: 1920
      height: 1080
    live:
      stream_name: left_side_camera_stream_hd
  right_side_camera:
    enabled: true
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/right_side_camera_stream_hd
          input_args: preset-rtsp-restream
          roles:
            - detect
    detect:
      enabled: true
      width: 1920
      height: 1080
    live:
      stream_name: right_side_camera_stream_hd
  entrance_camera:
    enabled: true
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/entrance_camera_stream_hd
          input_args: preset-rtsp-restream
          roles:
            - detect
    detect:
      enabled: true
      width: 1920
      height: 1080
    live:
      stream_name: entrance_camera_stream_hd

version: 0.14

using stable image, and talking about tips - maybe this will be at least penny-valuable

@audiophonicz

I got k3s on debian 6.15 with Intel Arc A380, got the whole node crash after 30-60 minutes using version 0.14. Xeon server, not arm64

I'm not sure what Debian 6.15 is. Debian 12 with a 6.15 kernel? I think only 6.10 or 6.11 is supported.
Why does your ffmpeg config look weird? You have some things double-dashed.
Should you still be using vaapi with Arc? More likely preset-intel-qsv-h264.
Is that first camera supposed to have a different port than the rest?
Your model is set to 640x640, but your detect is set to 1920x1080 in your camera config. Big gap.
You're using your main stream to detect at full resolution and framerate? You should probably try a substream with a lower resolution and framerate to see if that makes a difference.
I don't see what you're detecting either, but that could just be that you left it out.

You might want to go over your config. I can run 15b3 stably in K3s 1.30 on a NUC with the config below.
(No, I'm not using Arc; I'm using 11th gen UHD integrated graphics, but I'm pretty sure Intel is Intel.)

    detectors:
      ov_0:
        type: openvino
        device: GPU
      ov_1:
        type: openvino
        device: GPU

    model:
      model_type: yolonas
      width: 640
      height: 640
      input_pixel_format: bgr
      input_tensor: nchw
      path: /config/yolo_nas_s.onnx
      labelmap_path: /labelmap/coco-80.txt

    ffmpeg:
      global_args: -hide_banner -loglevel warning
      hwaccel_args: preset-intel-qsv-h264
      output_args:
        detect: -f rawvideo -pix_fmt yuv420p
        record: preset-record-generic-audio-aac

    detect:
      width: 640
      height: 480
      fps: 5
      enabled: true
      max_disappeared: 25
      stationary:
        interval: 25
        threshold: 50

    go2rtc:
      streams:
        Entrance:
          - rtsp://user:pass@Entrancel.example.com:554/cam/realmonitor?channel=1&subtype=0
        Entrance_sub:
          - rtsp://user:pass@Entrancel.example.com:554/cam/realmonitor?channel=1&subtype=1

    cameras:
      Entrance:
        enabled: True
        ffmpeg:
          inputs:
            - path: rtsp://127.0.0.1:8554/Entrance_sub
              input_args: preset-rtsp-restream
              roles:
                - audio
                - detect
            - path: rtsp://127.0.0.1:8554/Entrance
              input_args: preset-rtsp-restream
              roles:
                - record
        objects:
          track:
            - person

@jacco-hass

Hi all, I'm also dealing with this issue, and I'm running out of solutions to try.. Any other ideas?

My machine is an Intel J5005 with integrated graphics, running Debian 12 with 6.1 kernel and a PCIe Coral TPU. Frigate is running as Home Assistant add-on. The only way to get my system stable is by removing my cameras from frigate.yaml; as soon as I add one, the crashes start occurring every 6 to 24 hours.

Things I've tried so far:

  • Docker image instead of Home Assistant add-on
  • Run Docker privileged
  • Increased shm_size to 1024mb in Docker
  • 6.11 (backported) kernel (seemed to worsen the crashes)
  • Frigate Full Access
  • Frigate Beta (15.4)
  • CPU detector
  • Coral (PCIe) detector
  • hwaccel_args: " "
  • hwaccel_args: preset-vaapi
  • hwaccel_args: preset-intel-qsv-h264
  • i965 driver (with vaapi HW acceleration)

Still on my list to try are an older Frigate version (0.13?) and a yolo model for detection. Any other tips to try are very welcome!

@jacco-hass

I managed to "fix" it by installing an additional 8GB of RAM. I sometimes noticed a spike (+2GB) in RAM usage, which probably caused the crashes, but I have no clue yet what the underlying source of the problem is. I hope this helps somebody.
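
Since a RAM upgrade also fixed things for someone else earlier in this thread, it may be worth logging memory before the next freeze. A minimal sketch, assuming a Linux host with `/proc/meminfo`; the log path and interval are placeholders. Each sample is flushed and fsynced so the last lines have a chance of surviving a hard hang.

```python
import os
import time


def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def log_memory(path="/var/log/mem_samples.log", interval_s=10):
    """Append timestamped MemAvailable/SwapFree samples until interrupted."""
    with open(path, "a") as out:
        while True:
            with open("/proc/meminfo") as f:
                info = parse_meminfo(f.read())
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write(
                f"{stamp} MemAvailable={info.get('MemAvailable')}kB "
                f"SwapFree={info.get('SwapFree')}kB\n"
            )
            out.flush()
            os.fsync(out.fileno())  # push the sample to disk before a possible freeze
            time.sleep(interval_s)
```

After a crash, the tail of the log shows whether available memory was collapsing in the minutes before the hang or whether the freeze came with plenty of headroom.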
