Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intermittent issue w/ fcase in 1.16 #6448

Closed
ethanbsmith opened this issue Aug 31, 2024 · 13 comments
Closed

intermittent issue w/ fcase in 1.16 #6448

ethanbsmith opened this issue Aug 31, 2024 · 13 comments

Comments

@ethanbsmith
Copy link
Contributor

im getting a frequent, but intermittent issue with fcase with 1.16

The offending code has not changed in weeks
The issue only started after upgrading to 1.16
The issues does not surface after reverting to 1.15.4
The issue is intermittent. it cannot reliably be reproduced with the same input
The problem code does not use data.table, but uses fcase and fcoalesce

Im am trying to isolate the issue to produce an MRE, but due to the intermittent nature, i cant even reliably trap it in the debugger start investigating

@MichaelChirico
Copy link
Member

if you can get some code that reliably produces the error (even if it's only intermittent), that will help a lot. There was some change to memory management in fcase to support vectorized default=, so it's most likely that's the issue. but without example code, it's hard to isolate the issue & fix it.

@ethanbsmith
Copy link
Contributor Author

ethanbsmith commented Aug 31, 2024

this consistently eventually produces the error for me.

library(xts)

data <- matrix(runif(4000, 0, 100), ncol = 4)
data[5,4] <- NA
data[10,1] <- NA
data[15,c(1,4)] <- NA
data <- xts(data, order.by = rep(Sys.Date(), 1000))
repeat {
fcase(is.na(data[,1]) & is.na(data[,4]), data[,1] + data[,4],
      is.na(data[,1]), data[,4], 
      is.na(data[,4]), data[,1], 
	  data[,1] > data[,4], data[,1] - data[,4], 
	  data[,1] < data[,4], data[,4] - data[,1]
	  )
}
#Error in fcase(is.na(data[, 1]) & is.na(data[, 4]), data[, 1] + data[,  : 
#  Argument #8 has different class than argument #2, Please make sure all output values have the same class.

> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/Denver
tzcode source: internal

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] jsonlite_1.8.8    readxl_1.4.3      xml2_1.3.6        curl_5.2.2        Rcpp_1.0.13       matrixStats_1.3.0 data.table_1.16.0 quantmod_0.4.26.1 TTR_0.24.4        doFuture_1.0.1    future_1.34.0     doParallel_1.0.17 iterators_1.0.14  foreach_1.5.2    
[15] xts_0.14.1        zoo_1.8-12        plotrix_3.8-4    

loaded via a namespace (and not attached):
 [1] future.apply_1.11.2 compiler_4.4.1      stringr_1.5.1       systemfonts_1.1.0   globals_0.16.3      scales_1.3.0        fastmap_1.2.0       lattice_0.22-6      R6_2.5.1            knitr_1.48          kableExtra_1.4.0    munsell_0.5.1       svglite_2.1.3      
[14] rlang_1.1.4         stringi_1.8.4       xfun_0.47           viridisLite_0.4.2   cli_3.6.3           magrittr_2.0.3      digest_0.6.37       grid_4.4.1          rstudioapi_0.16.0   lifecycle_1.0.4     evaluate_0.24.0     glue_1.7.0          cellranger_1.1.0   
[27] listenv_0.9.1       codetools_0.2-20    colorspace_2.1-1    parallelly_1.38.0   rmarkdown_2.28      tools_4.4.1         htmltools_0.5.8.1  

@MichaelChirico MichaelChirico added this to the 1.16.2 milestone Aug 31, 2024
@r2evans
Copy link
Contributor

r2evans commented Aug 31, 2024

Odd. I just ran that code on my machine (with 256GB of ram) and got this:

 data <- matrix(runif(4000, 0, 100), ncol = 4)                                                                                                                                                                                                                                                    
 data[5,4] <- NA                                                                                                                                                                                                                                                                                    
 data[10,1] <- NA                                                                                                                                                                                                                                                                                   
 data[15,c(1,4)] <- NA                                                                                                                                                                                                                                                                              
 data <- xts::xts(data, order.by = rep(Sys.Date(), 1000))                                                                                                                                                                                                                                           
 repeat {                                                                                                                                                                                                                                                                                           
 fcase(is.na(data[,1]) & is.na(data[,4]), data[,1] + data[,4],                                                                                                                                                                                                                                      
       is.na(data[,1]), data[,4],                                                                                                                                                                                                                                                                   
       is.na(data[,4]), data[,1],                                                                                                                                                                                                                                                                   
           data[,1] > data[,4], data[,1] - data[,4],                                                                                                                                                                                                                                                
           data[,1] < data[,4], data[,4] - data[,1]                                                                                                                                                                                                                                                 
           )                                                                                                                                                                                                                                                                                        
 }

  *** caught segfault ***
 address 0x30, cause 'memory not mapped'

 Traceback:
  1: fcase(is.na(data[, 1]) & is.na(data[, 4]), data[, 1] + data[,     4], is.na(data[, 1]), data[, 4], is.na(data[, 4]), data[,     1], data[, 1] > data[, 4], data[, 1] - data[, 4], data[,     1] < data[, 4], data[, 4] - data[, 1])

 Possible actions:
 1: abort (with core dump, if enabled)
 2: normal R exit
 3: exit R without saving workspace
 4: exit R saving workspace
 Selection: 

I restarted R, loaded data.table and xts, and it gave me the first error,

Error in fcase(is.na(data[, 1]) & is.na(data[, 4]), data[, 1] + data[,  :
  Argument #8 has different class than argument #2, Please make sure all output values have the same class.

And then (without restarting R) ran the same code again and got the segfault again. Sounds like there's a memory allocation problem. (The system is not under any load.)

But then, since it is a fair (purpose-wise) comparison, I modified it to use dplyr::case_when instead, and it changed error:

data <- matrix(runif(4000, 0, 100), ncol = 4)                                                                                                                                                                                                                                                    
 data[5,4] <- NA                                                                                                                                                                                                                                                                                    
 data[10,1] <- NA                                                                                                                                                                                                                                                                                   
 data[15,c(1,4)] <- NA                                                                                                                                                                                                                                                                              
 data <- xts::xts(data, order.by = rep(Sys.Date(), 1000))                                                                                                                                                                                                                                           
 repeat {                                                                                                                                                                                                                                                                                           
   dplyr::case_when(                                                                                                                                                                                                                                                                                
     is.na(data[,1]) & is.na(data[,4]) ~ data[,1] + data[,4],                                                                                                                                                                                                                                       
     is.na(data[,1]) ~ data[,4],                                                                                                                                                                                                                                                                    
     is.na(data[,4]) ~ data[,1],                                                                                                                                                                                                                                                                    
     data[,1] > data[,4] ~ data[,1] - data[,4],                                                                                                                                                                                                                                                     
     data[,1] < data[,4] ~ data[,4] - data[,1]                                                                                                                                                                                                                                                      
   )                                                                                                                                                                                                                                                                                                
 }

 Error in c.xts(NA_real_, 91.7918950784951, 68.4573939070106, c(73.4160375548527,  :
   zero-length vectors with non-zero-length index are not allowed

(I do not know if the c.xts error is a help or red herring.)

R-4.3.2, data.table_1.16.0, xts_0.14.0

@r2evans
Copy link
Contributor

r2evans commented Aug 31, 2024

Thinking it might be an issue relating specifically to xts, I ran it without the xts(data, ...) call ... and after a few minutes, I have neither erred nor segfaulted.

@ethanbsmith
Copy link
Contributor Author

yes. xts is a matrix w/ an index attribute. subsetting the matrix requites subsetting the index, which is prolly a bit much for fcase to get involved with. in this scenario, fcase works, since i make sure all the inputs are already aligned and dont need to be aligned via the index. so its simple enough to just have fcase deal with the raw data vectors and then convert back to xts after. this is a fairly common practice... i didnt think of it as it used to work under 1.15.4. I just put in a wrapper to handle this and it works fine.

fcase <- function(..., default = NA) {
  d <- lapply(list(...), FUN = as.vector)
  d["default"] <- as.vector(default)
  do.call(data.table::fcase, d)
}

@r2evans
Copy link
Contributor

r2evans commented Sep 1, 2024

I wasn't suggesting that you should find a way to not use xts here, I was just attempting to narrow down further search. Whether or not fcase knows how to subset those indices is one thing, I don't know if the segfault is crashing solely due to xts or fcase itself.

@tdhock
Copy link
Member

tdhock commented Sep 1, 2024

valgrind does not report any issues

(base) tdhock@tdhock-MacBook:~/R$ R -d valgrind --vanilla < data.table-issue6448.R
==62322== Memcheck, a memory error detector
==62322== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==62322== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==62322== Command: /home/tdhock/lib/R/bin/exec/R --vanilla
==62322== 

R version 4.4.1 (2024-06-14) -- "Race for Your Life"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R est un logiciel libre livré sans AUCUNE GARANTIE.
Vous pouvez le redistribuer sous certaines conditions.
Tapez 'license()' ou 'licence()' pour plus de détails.

R est un projet collaboratif avec de nombreux contributeurs.
Tapez 'contributors()' pour plus d'information et
'citation()' pour la façon de le citer dans les publications.

Tapez 'demo()' pour des démonstrations, 'help()' pour l'aide
en ligne ou 'help.start()' pour obtenir l'aide au format HTML.
Tapez 'q()' pour quitter R.

> library(xts)
Le chargement a nécessité le package : zoo

Attachement du package : ‘zoo’

Les objets suivants sont masqués depuis ‘package:base’:

    as.Date, as.Date.numeric

> library(data.table)

Attachement du package : ‘data.table’

Les objets suivants sont masqués depuis ‘package:xts’:

    first, last

Les objets suivants sont masqués depuis ‘package:zoo’:

    yearmon, yearqtr

> data <- matrix(runif(4000, 0, 100), ncol = 4)
> data[5,4] <- NA
> data[10,1] <- NA
> data[15,c(1,4)] <- NA
> data <- xts(data, order.by = rep(Sys.Date(), 1000))
> ##repeat {
> fcase(is.na(data[,1]) & is.na(data[,4]), data[,1] + data[,4],
+       is.na(data[,1]), data[,4], 
+       is.na(data[,4]), data[,1], 
+ 	  data[,1] > data[,4], data[,1] - data[,4], 
+ 	  data[,1] < data[,4], data[,4] - data[,1]
+ 	  )
           m.c.seq.row..seq.n...seq.col..drop...FALSE.
2024-08-31                                  39.9787008
2024-08-31                                  14.4910687
2024-08-31                                  11.4113385
2024-08-31                                  77.0791226
2024-08-31                                  33.6603878
2024-08-31                                  14.5883231
2024-08-31                                  11.4812065
2024-08-31                                  20.0252484
2024-08-31                                  13.0387685
2024-08-31                                  73.0278527
       ...                                            
2024-08-31                                   4.9903308
2024-08-31                                  15.8005020
2024-08-31                                  75.6947036
2024-08-31                                   0.9207879
2024-08-31                                   7.9627162
2024-08-31                                  11.2869202
2024-08-31                                   0.2669216
2024-08-31                                   5.0148688
2024-08-31                                  47.0472454
2024-08-31                                  43.8645759
> #}
> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.16.0 xts_0.13.2        zoo_1.8-12       

loaded via a namespace (and not attached):
[1] compiler_4.4.1 grid_4.4.1     lattice_0.22-6
> 
==62322== 
==62322== HEAP SUMMARY:
==62322==     in use at exit: 60,400,121 bytes in 11,600 blocks
==62322==   total heap usage: 33,740 allocs, 22,140 frees, 94,405,324 bytes allocated
==62322== 
==62322== LEAK SUMMARY:
==62322==    definitely lost: 0 bytes in 0 blocks
==62322==    indirectly lost: 0 bytes in 0 blocks
==62322==      possibly lost: 0 bytes in 0 blocks
==62322==    still reachable: 60,400,121 bytes in 11,600 blocks
==62322==                       of which reachable via heuristic:
==62322==                         newarray           : 4,264 bytes in 1 blocks
==62322==         suppressed: 0 bytes in 0 blocks
==62322== Rerun with --leak-check=full to see details of leaked memory
==62322== 
==62322== For lists of detected and suppressed errors, rerun with: -s
==62322== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@ethanbsmith
Copy link
Contributor Author

there are lots of scenarios where one needs to operate on the underlying data, then convert back to the original class. this applies not just to xts, but to underlying zoo as well. its so common, that xts::reclass exists for this purpose and its easy as can be to use. just FYI, i do this all the time w/ matrixstats::rowXXX functions, which return plain vectors

lets decompose this into underlying issues:

  1. Is this an issue on xts ojects, or a general problem w/ fcase?
    my original use case is very stressful (big, computationally intensive, run in parallel). it is working fine w/ the wrapper, so I think the issue is the xts inputs. hard to prove this, but based on how stressful my process is, thats my assumption at this point
  2. Handling xts (and zoo) inputs:
    i SUSPECT fcase is just operating on the ordered inputs. its not really handling time series input alignment. doing xts/zoo input alignment on their index is doable, but non-trivial. merge.xts and merge.zoo will handle the bulk of the work, but they convert everything to a matrix. its not just a matter of splitting the matrix back into column vectors, u would have to convert the odd columns back to logical.
    IMO, index alignment this should be left to the caller, and fcase should really only operate on inputs assuming they are order aligned
    in fact, if the inputs are matrix columns, fcase returns a vector, not a 1 x n matrix, so its not really honoring input type to begin with
  3. crashing/bad error on xts/zoo inputs:
    this could probably be handled better. took me a while to track this one down as the worker process would either work fine, or crash, with the same input

FWIW, calling as.vector on the inputs inside fcase might solve a lot of these issues, and not just for xts input

@aitap
Copy link
Contributor

aitap commented Sep 1, 2024

This example crashes for me (R-devel, data.table from Git) with a null pointer dereference in getAttrib:

Thread 1 "R" received signal SIGSEGV, Segmentation fault.
getAttrib0 (vec=vec@entry=0x555559657fc0, name=0x5555559c4be0) at ../../../src/main/attrib.c:153
153             if (TAG(s) == name) {
(gdb) bt
#0  getAttrib0 (vec=vec@entry=0x555559657fc0, name=0x5555559c4be0) at ../../../src/main/attrib.c:153
#1  0x00005555555ef182 in Rf_getAttrib (vec=vec@entry=0x555559657fc0, name=<optimized out>) at ../../../src/main/attrib.c:189
#2  0x00007ffff42d3ee9 in fcaseR (rho=0x5555559fc058, args=0x555556dcc048) at fifelse.c:263
(gdb) p s
$5 = (SEXP) 0x0

At this point, ATTRIB(vec) looks very damaged:

(rr) p vec
$5 = (SEXP) 0x55d3e56c7f50
(rr) p *vec
$6 = {sxpinfo = {type = 0, scalar = 1, obj = 1, alt = 1, gp = 44440, mark = 1, debug = 0, trace = 1, spare = 0, gcgen = 0, gccls = 7,
    named = 21971, extra = 0}, attrib = 0x55d3e56cbe40, gengc_next_node = 0x0, gengc_prev_node = 0x0, u = {primsxp = {offset = 1000}, symsxp = {
      pname = 0x3e8, value = 0x0, internal = 0x40644f5c313c0000}, listsxp = {carval = 0x3e8, cdrval = 0x0, tagval = 0x40644f5c313c0000},
    envsxp = {frame = 0x3e8, enclos = 0x0, hashtab = 0x40644f5c313c0000}, closxp = {formals = 0x3e8, body = 0x0, env = 0x40644f5c313c0000},
    promsxp = {value = 0x3e8, expr = 0x0, env = 0x40644f5c313c0000}}}
(rr) p *((SEXP) 0x55d3e56cbe40)
$7 = {sxpinfo = {type = 0, scalar = 0, obj = 0, alt = 0, gp = 19456, mark = 1, debug = 1, trace = 0, spare = 1, gcgen = 0, gccls = 2,
    named = 49472, extra = 16433}, attrib = 0x6e61, gengc_next_node = 0x55d3e56c7f40, gengc_prev_node = 0x55d3e56daae0, u = {primsxp = {
      offset = 0}, symsxp = {pname = 0x0, value = 0x0, internal = 0x3e8}, listsxp = {carval = 0x0, cdrval = 0x0, tagval = 0x3e8}, envsxp = {
      frame = 0x0, enclos = 0x0, hashtab = 0x3e8}, closxp = {formals = 0x0, body = 0x0, env = 0x3e8}, promsxp = {value = 0x0, expr = 0x0,
      env = 0x3e8}}}

If says type = 0 (NILSXP) but it's not equal to R_NilValue, and CAR() and CDR() are both null pointers, which is invalid for an R value. Where does this come from? Looks like vec had been freed:

(rr) rc
Continuing.

Hardware watchpoint 1: ((SEXP) 0x55d3e56c7f50)->attrib

Old value = (struct SEXPREC *) 0x55d3e56cbe40
New value = (struct SEXPREC *) 0x7fa101ff1cc0 <main_arena+96>
_int_free (av=0x7fa101ff1c60 <main_arena>, p=0x55d3e56cbe40, have_lock=<optimized out>, have_lock@entry=0) at ./malloc/malloc.c:4635
(rr) bt
#0  _int_free (av=0x7fa101ff1c60 <main_arena>, p=0x55d3e56cbe40, have_lock=<optimized out>, have_lock@entry=0) at ./malloc/malloc.c:4635
#1  0x00007fa101eb7e8f in __GI___libc_free (mem=mem@entry=0x55d3e56d0d30) at ./malloc/malloc.c:3385
#2  0x000055d3e0ba3aea in ReleaseLargeFreeVectors () at ../../../src/main/memory.c:1167
#3  RunGenCollect (size_needed=0) at ../../../src/main/memory.c:1951
#4  R_gc_internal (size_needed=size_needed@entry=0) at ../../../src/main/memory.c:3232
#5  0x000055d3e0ba5e30 in Rf_cons (car=0x55d3e3ee1840, car@entry=0x55d3e40a33a0, cdr=0x55d3e2167400) at ../../../src/main/memory.c:2483
...
#19 0x000055d3e0beda18 in do_relop (call=0x55d3e55ff060, op=0x55d3e2171ac0, args=0x55d3e5648368, env=<optimized out>)
    at ../../../src/main/relop.c:66
#20 0x000055d3e0b63bec in Rf_eval (e=0x55d3e55ff060, rho=rho@entry=0x55d3e219e498) at ../../../src/main/eval.c:1269
#21 0x00007fa0fead3dfc in fcaseR (rho=0x55d3e219e498, args=0x55d3e34e5888) at fifelse.c:224

Should value0 have been PROTECTed in the i == 0 branch independently from thens?

(rr) frame 21
#21 0x00007fa0fead3dfc in fcaseR (rho=0x55d3e219e498, args=0x55d3e34e5888) at fifelse.c:224
224         REPROTECT(whens = eval(SEXPPTR_RO(args)[2*i], rho), Iwhens);
(rr) p value0
$11 = (SEXP) 0x55d3e56c7f50
(rr) info watch
Num     Type           Disp Enb Address            What
1       hw watchpoint  keep y                      ((SEXP) 0x55d3e56c7f50)->attrib
(rr) info locals
i = 3
nprotect = 4
len0 = 1000
len2 = 997
ans = 0x55d3e56c9ed0
value0 = 0x55d3e56c7f50
Iwhens = 7
Ithens = 8
type0 = 14
imask = false
naout = false
idefault = false
p = 0x55d3e5ae46c0
n = 6

@r2evans
Copy link
Contributor

r2evans commented Sep 1, 2024

valgrind does not report any issues

@tdhock, I appreciate your insight and taking the steps for that test. I would not assume that a problem others reproduce using repeat{...} would always show the problem in valgrind when you run it just once. Did you run it multiple times? Are you able to repeat it say 100 times (completely arbitrary on my part) to make sure that there are no unexpected "booms"?

@MichaelChirico
Copy link
Member

I confirm the issue on our codespace, and also that the issue is not present in 1.15.4.

@MichaelChirico
Copy link
Member

Great sleuthing @aitap indeed, REPROTECT() on the next iteration will drop value0 from protection. AFAICT simply PROTECT(thens) when assigning value0 fixes the issue.

I was able to reproduce the segfault on master even with this version that strips attributes:

data <- matrix(runif(4000, 0, 100), ncol = 4)
data[5,4] <- NA
data[10,1] <- NA
data[15,c(1,4)] <- NA
data <- xts(data, order.by = rep(Sys.Date(), 1000))
i <- 0L
repeat {
  message(i <- i+1L, " ", appendLF=i %% 100 == 1L)
  fcase(as.logical(is.na(data[,1]) & is.na(data[,4])),
          as.double(data[,1] + data[,4]),
        as.logical(is.na(data[,1])),
          as.double(data[,4]), 
        as.logical(is.na(data[,4])),
          as.double(data[,1]), 
        as.logical(data[,1] > data[,4]),
          as.double(data[,1] - data[,4]), 
        as.logical(data[,1] < data[,4]),
          as.double(data[,4] - data[,1])
      )
}

I reckon it's just that having a thens with attributes makes potentially-dangerous getAttrib() calls much more likely (I haven't debugged which line triggers the segfault with attributes stripped).

@tdhock I think gctorture would have done a better job at finding this issue? Since an object was removed from PROTECT() too early, IIUC gctorture would have smoked that out more reliably.

@MichaelChirico
Copy link
Member

Closed by #6452 + #6451. Cherrypicked to the patch release branch: 65dd334

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants