[ci] [R-package] r-sanitizers jobs segfaulting while installing packages #6367

Closed
jameslamb opened this issue Mar 17, 2024 · 5 comments · Fixed by #6374

Comments

@jameslamb
Collaborator

Description

CI jobs that run the R package tests under various sanitizers are failing. Installation of {lightgbm}'s dependencies is failing with segmentation faults.

Example:

/__w/_temp/a7680c7b-d1d0-46c4-88d6-205a89e1f205.sh: line 1:   486 Segmentation fault      (core dumped) RDscriptcsan -e "install.packages(c('R6', 'data.table', 'jsonlite', 'knitr', 'markdown', 'Matrix', 'RhpcBLASctl', 'testthat'), repos = 'https://cran.rstudio.com/', Ncpus = parallel::detectCores())"
Error: Process completed with exit code 139.
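
For readers unfamiliar with that exit code: shells report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV), which matches the "Segmentation fault" message in the log. A minimal sketch of that arithmetic follows; the function name `signal_from_exit_code` is made up for illustration and is not part of the CI setup.

```shell
# Hypothetical helper: map a shell exit status to the signal that caused it.
# Shells report "killed by signal N" as exit status 128 + N.
signal_from_exit_code() {
    code="$1"
    if [ "$code" -gt 128 ]; then
        echo $((code - 128))
    else
        # Not a signal-induced exit status.
        echo 0
    fi
}

# Exit code taken from the CI log above.
signal_from_exit_code 139
```

Passing the resulting number to `kill -l` prints the signal's name, which is a quick way to confirm which signal ended the process.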

Reproducible example

This is happening across all PRs, over at least the last 2 days. For example: https://github.com/microsoft/LightGBM/actions/runs/8311885124/job/22746169482?pr=6357

Additional Comments

Those jobs use the wch1/r-debug container image from DockerHub.

container: wch1/r-debug

That project says its images are rebuilt daily (https://github.com/wch/r-debug), but it looks like the last push to DockerHub was 2 months ago.

(screenshot of the DockerHub tags page)

https://hub.docker.com/r/wch1/r-debug/tags

So I don't think the root cause is "a new image was just pushed" 🤔

@jameslamb
Collaborator Author

#6357 made these jobs optional... they still run on every commit, but merging won't be blocked if they're failing.

Let's keep this issue open until those tests are working and required again.

@jameslamb
Collaborator Author

jameslamb commented Mar 18, 2024

I was not able to reproduce this locally on my mac.

docker run \
  --rm \
  -it wch1/r-debug \
  bash

RDscriptcsan -e "install.packages(c('R6', 'data.table', 'jsonlite', 'knitr', 'markdown', 'Matrix', 'RhpcBLASctl', 'testthat'), repos = 'https://cran.r-project.org')"

environment info

output of 'docker info':
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.8.2)
  compose: Docker Compose (Docker Inc., v2.6.1)
  extension: Manages Docker extensions (Docker Inc., v0.2.7)
  sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc version: v1.1.2-0-ga916309
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.104-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 6.789GiB
 Name: docker-desktop
 ID: ZGVM:EN6E:NNAJ:JUJK:3ZGA:AOVX:GEKA:J2OX:5EZR:AQPW:GFSD:37Z7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false

@jameslamb
Collaborator Author

the gcc one just passed on #6368

https://github.com/microsoft/LightGBM/actions/runs/8353249363/job/22864673691?pr=6368

I really think the root cause here might end up being "the installation process is using too much memory".
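
One way to probe that hypothesis locally might be to rerun the container with a memory cap and with the same parallel install setting the CI job uses (the local attempt above did not pass `Ncpus`). The sketch below only assembles and prints such a command. The `4g` limit and the function name `build_repro_command` are assumptions chosen for illustration, not values taken from the CI runners, and it assumes `RDscriptcsan` can be invoked directly as the container command.

```shell
# Hypothetical helper: print a 'docker run' command that caps container memory
# and installs the same packages in parallel, as the CI job does.
# The default memory limit (4g) is an arbitrary illustrative value.
build_repro_command() {
    mem_limit="${1:-4g}"
    r_code="install.packages(c('R6', 'data.table', 'jsonlite', 'knitr', 'markdown', 'Matrix', 'RhpcBLASctl', 'testthat'), repos = 'https://cran.r-project.org', Ncpus = parallel::detectCores())"
    # Setting --memory-swap equal to --memory prevents the container from using swap.
    printf 'docker run --rm --memory=%s --memory-swap=%s wch1/r-debug RDscriptcsan -e "%s"\n' \
        "$mem_limit" "$mem_limit" "$r_code"
}

# Print the command for inspection; paste it into a shell to actually run it.
build_repro_command 4g
```

If the install only fails once the cap is lowered, that would be consistent with the memory explanation, though it would not prove it.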

@jameslamb
Copy link
Collaborator Author

I'm still unsure what happened, but based on the conversation in wch/r-debug#35, there was some problem with the automated builds that led to them not being published... and they're now working again... and as of #6374, so are the r-sanitizers jobs!

Maybe these things are all related. Maybe there was a bug in a specific commit of R-devel 2 months ago that led to a lot more memory being used during package installation, and maybe that eventually broke the image publishing, and maybe it's since been fixed in R-devel and now everything is fine?

I'm not sure but it's all working now so 🤷🏻
