Segfault during sorting #2707

Closed
KyleDKavanagh opened this issue Mar 26, 2018 · 17 comments

@KyleDKavanagh

With v1.10.5 installed from source, I'm newly seeing a segfault during standard group-by/assign operations on very large data.tables:

address 0x7fc8862fa354, cause 'memory not mapped'

Traceback:
 1: fsort(as.numeric(irows))
 2: `[.data.table`(dt, query %in% c("1", "2"), `:=`(position = as.numeric(1:.N)),     by = grpvar)
 3: dt[query %in% c("1", "2"), `:=`(position = as.numeric(1:.N)),     by = grpvar]
An irrecoverable exception occurred. R is aborting now ...

@mattdowle
Member

mattdowle commented Mar 26, 2018

Thanks for reporting. Any chance of a way to reproduce that dataset? It might be specific to the column type and query. I can see the query from the output, but what is query column and grpvar ? And does position exist already or is it being added?

@shrektan
Member

The table seems to contain strings. Is it possible that it's related to the encoding issue that I've reported?

@mattdowle
Member

mattdowle commented Mar 27, 2018

Anything is possible. @KyleDKavanagh - it's much harder (and perhaps impossible) if you make us guess.

My guess is that it's to do with the %in% optimization added recently in dev. There's a line there that converts an integer vector to numeric and passes that to fsort, for speed. From the traceback shown above, it feels like that to me. Perhaps something to do with the bit pattern of whole integers in IEEE754 that tickles out an edge condition in fsort. Or that there's an NA in that integer vector, which fsort doesn't handle. Or, fsort on very small input, like 2 items in this case.

@KyleDKavanagh
Author

Apologies for the delayed response - Both query and grpvar are strings. Unfortunately, I can't provide the exact dataset I've been working on as it's proprietary data for my employer.

@mattdowle
Member

Ok no problem. Can you provide str(DT) and perhaps edit the strings in that summary if they are proprietary. Also the output of sapply(DT, uniqueN) would be useful, again, obfuscating the column names manually as appropriate. The output with verbose=TRUE as well please. This should all help and we can go from there. Do "1" and "2" exist in the column, or are they missing? Any info like that would be useful.

@mattdowle
Member

mattdowle commented Mar 27, 2018

My guess is that one of "1" or "2" isn't in the data. The fact that fsort doesn't handle negatives is caught, but as.numeric(NA_integer_) is not, it seems.

 > fsort(c(4,5), verbose=TRUE)
nth=8, nBatch=16
Range = [4,5]
maxBit=50; MSBNbits=16; shift=35; MSBsize=65536
counts is 0MB (128 pages per nBatch=1, batchSize=1024, lastBatchSize=2)
Top 5 MSB counts: 1 1 0 0 0 
Reduced MSBsize from 65536 to 0 by excluding 0 and 1 counts
1: 0.000 ( 0.2%)
2: 0.008 (46.3%)
3: 0.000 ( 0.7%)
4: 0.003 (15.6%)
5: 0.002 (12.5%)
6: 0.002 (14.8%)
7: 0.002 ( 9.9%)
[1] 4 5

> fsort(c(4,NA), verbose=TRUE)
nth=8, nBatch=16
Range = [4,4]
maxBit=-2147483648; MSBNbits=-2147483647; shift=0; MSBsize=2
counts is 0MB (0 pages per nBatch=1, batchSize=1024, lastBatchSize=2)

 *** caught segfault ***
address (nil), cause 'unknown'

Traceback:
 1: fsort(c(4, NA), verbose = TRUE)

Hopefully @KyleDKavanagh can provide some verbose output similar to the above. But in the meantime this has to be fixed anyway. Kyle, since you can't provide the data, you could type debug(data.table:::fsort) then run the query. It will stop when it enters fsort. Then you type x and tell us what fsort is being passed. This is onerous but it would enable progress without sending us your data.
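Because printing x floods the console on a vector this long, a compact summary may be easier to paste back into the issue. The following is a minimal sketch in base R; `summarise_sort_input` is a hypothetical helper name, not part of data.table, and its checks simply mirror the guesses made so far (NA values, negatives, very small input).

```r
# Hypothetical helper (not part of data.table): summarise a vector that is
# about to be sorted, without printing every element.
summarise_sort_input <- function(x) {
  finite <- x[!is.na(x)]
  list(
    type         = typeof(x),
    length       = length(x),
    any_na       = anyNA(x),
    any_negative = any(finite < 0),
    range        = if (length(finite)) range(finite) else c(NA_real_, NA_real_)
  )
}

# as.numeric() keeps the missing value: NA_integer_ becomes NA_real_.
irows <- c(4L, NA_integer_)
str(summarise_sort_input(as.numeric(irows)))
```

If the helper is defined in the global environment before running the query, calling `summarise_sort_input(x)` at the `Browse[2]>` prompt should give the same information in a few lines.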

mattdowle added this to the v1.10.6 milestone Mar 27, 2018

@KyleDKavanagh
Author

Everything below was run on v1.10.4-3 after rolling back.

> dt[query %in% c("1", "2"), `:=`(grpIdx=as.numeric(1:.N)),by=grpvar, verbose=T];
Using existing index 'query'
Starting bmerge ...done in 0 secs
i clause present and columns used in by detected, only these subset: grpvar
Detected that j uses these columns: <none> 
Finding groups using forderv ... 0.094 sec
Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
Getting back original order ... 0.006 sec
lapply optimization is on, j unchanged as 'list(as.numeric(1:.N))'
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... 
  collecting discontiguous groups took 0.130s for 289239 groups
  eval(j) took 4.950s for 289239 calls

Results from str()

Classes ‘data.table’ and 'data.frame':	13539020 obs. of  4 variables:
 $ ts_ns             : chr  "1522156437655270332" "1522156437655279062" "1522156437655286082" "1522156437655293372" ...
 $ groupvar      : chr  "" "" "" "" ...
 $ query     : chr  "1" "1" "1" "1" ...
 $ SBE.SourceSequence: chr  "0" "0" "0" "0" ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "sorted")= chr "SBE.SourceSequence"

Results from uniqueN()

             ts_ns          groupvar                query SBE.SourceSequence 
          13539020           12173006                  4           12735508 

Values for query:

   query     num
1:     1  595317
2:     0 6193535
3:     2  758743
4:     5 5991425

fsort debug - Not sure how helpful this is because of the number of rows

Browse[2]> x
   [1]     1     2     3     4     5     6     7     8     9    10    11    12    14    16    20    21    22    24    26    27    28    30    31    32
  [25]    34    36    37    38    40    41    42    44    45    46    48    49    50    52    53    54    56    57    58    61    62    63    68    69
  [49]    70    73    75    77    78    79    80    82    84    86    88    93    94    95    97    98    99   100   103   104   105   107   109   112
  [73]   113   114   119   121   123   127   129   139   145   146   147   149   150   151   153   154   155   157   158   159   160   162   164   166
  [97]   168   170   174   175   176   178   179   180   182   184   186   187   188   265   266   267   271   272   273   294   295   296   305   306
 [121]   307   308   309   310   326   327   328   333   334   335   348   349   350   382   383   384   385   386   387   419   420   421   488   489
 [145]   490   491   492   493   539   540   541   700   701   702   703   704   705   899   900   901   902   903   904   905   906   907   934   935
 [169]   936  1023  1024  1025  1026  1027  1028  1029  1030  1031  1032  1033  1034  1035  1036  1037  1038  1039  1040  1041  1042  1043  1130  1131
 [193]  1132  1133  1134  1135  1288  1289  1290  1291  1292  1293  1521  1522  1523  1591  1592  1593  1618  1619  1620  1621  1622  1623  1624  1625
 [217]  1626  2051  2052  2053  2083  2084  2085  2094  2095  2096  2117  2118  2119  2150  2151  2152  2173  2174  2175  2193  2194  2195  2226  2227
 [241]  2228  2298  2299  2300  2317  2318  2319  2337  2338  2339  2342  2343  2344  2367  2368  2369  2393  2394  2395  2462  2463  2464  2490  2491
 [265]  2492  2618  2619  2620  2641  2642  2643  2654  2655  2656  2692  2693  2694  2706  2707  2708  2716  2717  2718  2749  2750  2751  2763  2764
 [289]  2765  2766  2767  2768  2797  2798  2799  2837  2838  2839  2869  2870  2871  2913  2914  2915  2938  2939  2940  2969  2970  2971  3034  3035
 [313]  3036  3044  3045  3046  3053  3054  3055  3070  3071  3072  3076  3077  3078  3082  3083  3084  3126  3127  3128  3174  3175  3176  3184  3185
 [337]  3186  3212  3213  3214  3314  3315  3316  3392  3393  3394  3439  3440  3441  3446  3447  3448  3481  3482  3483  3493  3494  3495  3501  3502
 [361]  3503  3527  3528  3529  3542  3543  3544  3585  3586  3587  3712  3713  3714  3750  3751  3752  3784  3785  3786  3828  3829  3830  3863  3864
 [385]  3865  3904  3905  3906  3915  3916  3917  3978  3979  3980  3993  3994  3995  4008  4009  4010  4015  4016  4017  4029  4030  4031  4099  4100
 [409]  4101  4115  4116  4117  4136  4137  4138  4146  4147  4148  4159  4160  4161  4168  4169  4170  4216  4217  4218  4265  4266  4267  4270  4271
 [433]  4272  4283  4284  4285  4338  4339  4340  4374  4375  4376  4400  4401  4402  4515  4516  4517  4740  4741  4742  4743  4744  4745  4784  4785
 [457]  4786  4811  4812  4813  4873  4874  4875  4908  4909  4910  4932  4933  4934  4951  4952  4953  4957  4958  4959  5054  5055  5056  5107  5108
 [481]  5109  5127  5128  5129  5159  5160  5161  5209  5210  5211  5267  5268  5269  5318  5319  5320  5347  5348  5349  5378  5379  5380  5423  5424
 [505]  5425  5465  5466  5467  5642  5643  5644  5786  5787  5788  5797  5798  5799  5827  5828  5829  5840  5841  5842  6001  6002  6003  6019  6020
 [529]  6021  6122  6123  6124  6198  6199  6200  6219  6220  6221  6233  6234  6235  6254  6255  6256  6302  6303  6304  6314  6315  6316  6345  6346
 [553]  6347  6400  6401  6402  6428  6429  6430  6561  6562  6563  6596  6597  6598  6602  6603  6604  6666  6667  6668  6693  6694  6695  6703  6704
 [577]  6705  6739  6740  6741  6765  6766  6767  6923  6924  6925  6960  6961  6962  6978  6979  6980  7149  7150  7151  7193  7194  7195  7210  7211
 [601]  7212  7380  7381  7382  7529  7530  7531  7609  7610  7611  7664  7665  7666  7682  7683  7684  7686  7687  7688  7707  7708  7709  7797  7798
 [625]  7799  7852  7853  7854  7868  7869  7870  7887  7888  7889  7935  7936  7937  7941  7942  7943  7964  7965  7966  7979  7980  7981  8045  8046
 [649]  8047  8051  8052  8053  8085  8086  8087  8099  8100  8101  8208  8209  8210  8345  8346  8347  8360  8361  8362  8375  8376  8377  8463  8464
 [673]  8465  8498  8499  8500  8534  8535  8536  8543  8544  8545  8591  8592  8593  8657  8658  8659  8661  8662  8663  8699  8700  8701  8714  8715
 [697]  8716  8750  8751  8752  8773  8774  8775  8785  8786  8787  8946  8947  8948  9128  9129  9130  9155  9156  9157  9244  9245  9246  9252  9253
 [721]  9254  9274  9275  9276  9290  9291  9292  9293  9294  9295  9367  9368  9369  9447  9448  9449  9477  9478  9479  9484  9485  9486  9495  9496
 [745]  9497  9559  9560  9561  9695  9696  9697  9763  9764  9765  9786  9787  9788  9862  9863  9864 10013 10014 10015 10130 10131 10132 10197 10198
 [769] 10199 10235 10236 10237 10260 10261 10262 10266 10267 10268 10295 10296 10297 10313 10314 10315 10392 10393 10394 10413 10414 10415 10455 10456
 [793] 10457 10470 10471 10472 10502 10503 10504 10550 10551 10552 10560 10561 10562 10615 10616 10617 10677 10678 10679 10723 10724 10725 10751 10752
 [817] 10753 10754 10755 10756 10808 10809 10810 10867 10868 10869 10879 10880 10881 10893 10894 10895 10943 10944 10945 10980 10981 10982 11056 11057
 [841] 11058 11082 11083 11084 11106 11107 11108 11116 11117 11118 11127 11128 11129 11183 11184 11185 11203 11204 11205 11253 11254 11255 11259 11260
 [865] 11261 11297 11298 11299 11372 11373 11374 11404 11405 11406 11428 11429 11430 11444 11445 11446 11461 11462 11463 11504 11505 11506 11514 11515
 [889] 11516 11696 11697 11698 11816 11817 11818 11849 11850 11851 11897 11898 11899 11908 11909 11910 11962 11963 11964 11986 11987 11988 12036 12037
 [913] 12038 12058 12059 12060 12109 12110 12111 12141 12142 12143 12159 12160 12161 12166 12167 12168 12236 12237 12238 12307 12308 12309 12329 12330
 [937] 12331 12343 12344 12345 12365 12366 12367 12391 12392 12393 12421 12422 12423 12434 12435 12436 12442 12443 12444 12456 12457 12458 12472 12473
 [961] 12474 12488 12489 12490 12513 12514 12515 12562 12563 12564 12575 12576 12577 12610 12611 12612 12685 12686 12687 12696 12697 12698 12700 12701
 [985] 12702 12703 12704 12705 12709 12710 12711 12712 12713 12714 12989 13037 13041 13045 13056 13100
 [ reached getOption("max.print") -- omitted 1353060 entries ]

@mattdowle
Member

Thanks! That's odd as I didn't think v1.10.4-3 would call fsort in this case. I thought usage of fsort was a recent internal addition.

@mattdowle
Member

The query contains by=grpvar but the column name is groupvar. Is that correct? Could it possibly be picking up grpvar in calling scope? Maybe just a copy and paste typo or artifact of obfuscation.

@KyleDKavanagh
Author

Artifact of obfuscating by hand. Confirmed that it's not picking up any wider-scoped variables.

@mattdowle
Member

Have been looking through, and there is a call to fsort in 1.10.4-3. But it doesn't invoke the new parallel numeric sort, which is likely where the problem lies. That call has changed in dev to do so.
Could you save that x to file? (the 1353060 vector).

save(list="x", file="/tmp/x.Rdata")

In a fresh R session, load data.table dev 1.10.5 and then do:

load("/tmp/x.Rdata")
fsort(x)

Does it crash? If anyNA(x) is TRUE then that's fine and I know why. If not, then could you send me that file? It's just a vector of row numbers; nothing proprietary. If integers, approx. 5MB. If double, approx. 10MB. So try email first, otherwise please place online somewhere, or attach here in GitHub might work. Thanks.
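Those size estimates follow from R's storage widths: 4 bytes per integer and 8 bytes per double. A rough sketch, taking the length of x as the 1000 printed entries plus the 1353060 omitted ones from the earlier output, and using a temporary file in place of /tmp/x.Rdata (the stand-in vector is an assumption, not the real data):

```r
# Length of x: 1000 printed entries plus the 1353060 omitted by max.print.
n <- 1000 + 1353060

# Payload size in MB, ignoring the small object header.
size_mb <- c(integer = n * 4, double = n * 8) / 1e6
print(round(size_mb, 1))

# Round-trip a stand-in vector through save()/load() to confirm the file
# restores exactly what was written. The stand-in is not the real data.
x <- as.numeric(seq_len(n))
path <- tempfile(fileext = ".Rdata")
save(list = "x", file = path)

restored <- new.env()
load(path, envir = restored)
print(identical(x, restored$x))
print(anyNA(restored$x))
```

Loading into a separate environment avoids overwriting any existing `x` in the session being checked.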

@KyleDKavanagh
Author

KyleDKavanagh commented Mar 28, 2018

Interestingly, the segfault only seems to occur with setDTthreads(0) (rather than the default of 16). I ran four tests: two with unlimited threads (both crashed) and two with 16 threads (both worked). x was confirmed to be identical for all four tests.

x.Rdata attached as a tgz

> anyNA(x)
[1] FALSE

xRdata.gz

@mattdowle
Member

Perfect! Many thanks. I haven't seen the crash locally yet, but the info that it depends on nThreads helps a lot as that determines the chunk size. Could you run those variations with verbose=TRUE passed to fsort please and paste the full output here. I now have a good chance anyway, but the extra output might shed light earlier, especially the verbose output before the crash. Thanks.
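A minimal sketch of those variations, using data.table's `setDTthreads()`, `getDTthreads()` and `fsort(..., verbose = TRUE)`. The thread counts 0 and 16 come from the report above. To keep the block self-contained it generates a synthetic stand-in for x, which is an assumption and is not expected to reproduce the crash; swap in `load("/tmp/x.Rdata")` to test the real vector.

```r
# Stand-in for the real x: sorted, NA-free row numbers stored as doubles.
# Replace with load("/tmp/x.Rdata") to test the actual vector.
set.seed(1)
x <- as.numeric(sort(sample.int(13539020L, 1354060L)))

results <- list()
if (requireNamespace("data.table", quietly = TRUE)) {
  old <- data.table::getDTthreads()
  for (n in c(0L, 16L)) {  # 0 = all logical CPUs, 16 = the other reported setting
    data.table::setDTthreads(n)
    cat("requested", n, "-> using", data.table::getDTthreads(), "threads\n")
    results[[as.character(n)]] <- data.table::fsort(x, verbose = TRUE)
  }
  data.table::setDTthreads(old)  # restore the previous thread setting
}
```

Running each thread setting in its own fresh R session, rather than in one loop, would isolate the runs more cleanly if a crash takes the session down.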

@mattdowle
Member

@KyleDKavanagh Could you retry latest dev please, now that PR is merged. I'd put the chance that fixes it at 70%.

@KyleDKavanagh
Author

No luck...

@mattdowle
Member

mattdowle commented Mar 29, 2018

@KyleDKavanagh Ok. Can you provide the verbose=TRUE output please (see above). And your sessionInfo(). Please provide full output from a fresh R session that shows you loading x and running fsort on it. Are you sure you have closed all R sessions, re-installed the latest dev version and loaded it properly? We need to see the startup message to confirm this. If you are on Windows, for example, we have seen dll problems cause improper upgrades. This is one reason we ask for sessionInfo() up front. Please also run test.data.table() and post the full output.

@mattdowle
Member

Did manage to reproduce a segfault in this area and fixed it 12 days ago, but Kyle reported it didn't work for them. Haven't heard from Kyle in 11 days and I haven't got the information I asked for above. Other memory faults recently fixed in dev could have been at play.
Closing as not reproducible and I can't think of any other guesses I can make.
@KyleDKavanagh Please retest with latest dev and open a new issue if you still have problems.
