Somewhat related to #2267, although not precisely the same; #2263 would probably also solve my issue, and #2727 is related as well.
I have CSV data compiled from different sources. Some chunks of the data are bad in that they apparently end their lines with an additional separator (a comma in this case).
Good line: AB,1234,9,9,7
Bad line: CD,1234,6,6,7,
If I read it with test <- fread("/home/user/onebigfile.csv", fill=TRUE), I would expect fread() to realize that some lines have one column too many and simply add an extra column (e.g. "V1") which I could then throw away. Instead, fread() stops on the first bad line and does not continue (there are good lines again afterwards).
If I interpret the verbose output correctly, fread() "realizes" that there are bad lines, but somehow does not bump its column-count guess from 56 to 57; instead it assumes the sample jump landed awkwardly and carries on.
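For reference, a synthetic file with the same shape can be built like this (a minimal sketch, not my real data; on a file this small fread samples every row, so the early stop may not reproduce exactly, but it shows the structure of the problem):

```r
library(data.table)

# Small file in the same shape: a header, consistent 5-field rows,
# and one row that ends in an extra separator (6 fields).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Var1,Var2,Var3,Var4,Var5",
  "AB,1234,9,9,7",
  "CD,1234,6,6,7,",   # bad line: trailing comma makes 6 fields
  "EF,1234,5,5,5"
), tmp)

# Expected: fill=TRUE widens the table by one (NA-filled) column.
# Observed on the real file: the read stops at the first bad line.
test <- fread(tmp, fill = TRUE, verbose = TRUE)
```

Here is the full verbose output from the real file: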
> test <- fread("/home/user/onebigfile.csv", fill=TRUE)
omp_get_max_threads() = 12
omp_get_thread_limit() = 2147483647
DTthreads = 4
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=12, nth=4)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file /home/user/onebigfile.csv
File opened, size = 188.9MB (198091496 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Var1,Var2,Var3,Var4,Var5,>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 56 fields using quote rule 0
Detected 56 columns on line 1. This line is either column names or first data row. Line starts as: <<Var1,Var2,Var3,Var4,Var5,>>
Quote rule picked = 0
fill=true and the most number of columns found is 56
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because (198091495 bytes from row 1 to eof) / (2 * 15166 jump0size) == 6530
Type codes (jump 000) : 5A557525252555555555552525252525252555555555555555555555 Quote rule 0
Type codes (jump 004) : 5A557525252555555555555555555555555555555555555555555555 Quote rule 0
Type codes (jump 005) : 5A557575757555555555555555555555555555555555555555555555 Quote rule 0
A line with too-many fields (56/56) was found on line 1 of sample jump 6. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 7. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 10. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 11. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 12. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 13. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 14. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 15. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 017) : 5A657575757555555555555555555555555555556565655555556555 Quote rule 0
A line with too-many fields (56/56) was found on line 1 of sample jump 23. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 24. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 25. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 26. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 27. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 28. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 29. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 33. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 34. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 35. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 36. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 37. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 38. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 39. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 40. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 41. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 42. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 43. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 44. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 45. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 46. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 47. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 48. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 49. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 50. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 51. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 52. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 53. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 54. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 55. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 56. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 57. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 58. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 59. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 60. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 61. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 62. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 63. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 64. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 70. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 71. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 72. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 83. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 84. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 100) : 5A657575757555555555555555555555555555556565655555556555 Quote rule 0
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 4846 sample rows
=====
Sampled 4846 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 2 to the end of last row: 198091109
Line length: mean=157.30 sd=9.73 min=135 max=196
Estimated number of rows: 198091109 / 157.30 = 1259284
Initial alloc = 1437041 rows (1259284 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5A657575757555555555555555555555555555556565655555556555
[10] Allocate memory for the datatable
Allocating 56 column slots (56 - 0 dropped) with 1437041 rows
[11] Read the data
jumps=[0..188), chunk_size=1053676, total_size=198091109
Restarting team from jump 11. nSwept==0 quoteRule==1
jumps=[11..188), chunk_size=1053676, total_size=198091109
Restarting team from jump 11. nSwept==0 quoteRule==2
jumps=[11..188), chunk_size=1053676, total_size=198091109
Restarting team from jump 11. nSwept==0 quoteRule==3
jumps=[11..188), chunk_size=1053676, total_size=198091109
Read 74936 rows x 56 columns from 188.9MB (198091496 bytes) file in 00:00.199 wall clock time
[12] Finalizing the datatable
Type counts:
46 : int32 '5'
5 : int64 '6'
4 : float64 '7'
1 : string 'A'
=============================
0.001s ( 0%) Memory map 0.184GB file
0.177s ( 89%) sep=',' ncol=56 and header detection
0.000s ( 0%) Column type detection using 4846 sample rows
0.002s ( 1%) Allocation of 1437041 rows x 56 cols (0.353GB) of which 74936 ( 5%) rows used
0.019s ( 9%) Reading 188 chunks (0 swept) of 1.005MB (each chunk 398 rows) using 4 threads
+ 0.013s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 0.004s ( 2%) Transpose
+ 0.002s ( 1%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.199s Total
Warning message:
In fread("/home/user/onebigfile.csv", :
Stopped early on line 74938. Expected 56 fields but found 57. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1234,AB,666660,44444,100.111,1,,-1,,-1,,,1,1,2000,1,1,1,7,1,1,1,,-7,,-6,,-2,,-6,,-3,,-2,,-2,1,1,1,1,,-2,99999,1,999888,1,1,1,11,1,1,1,,-9,10,10,>>
Unfortunately I cannot attach a minimal reproducible example since my data is sensitive, sorry about that. If more information is needed on the issue I will try to provide it!
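In case it helps others hitting the same thing, a possible workaround (my own sketch, not something fread documents for this case) is to strip the trailing separator before fread parses the lines:

```r
library(data.table)

# Option 1: drop a single trailing comma per line via a shell command
# (assumes a Unix-like system with sed on the PATH).
test <- fread(cmd = "sed 's/,$//' /home/user/onebigfile.csv")

# Option 2: pure R; reads the file twice but needs no shell.
lines <- readLines("/home/user/onebigfile.csv")
test  <- fread(text = sub(",$", "", lines))
```

The sub() call only removes a separator at the very end of a line, so embedded commas are untouched; note it will also eat a legitimately empty last field, which is exactly the ambiguity fread itself faces here.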