Somewhat related to #2267, although not precisely the same; #2263 would probably also solve my issue, and #2727 is related as well.
I have CSV data compiled from different sources. Some chunks of the data are bad in that they apparently end their lines with an additional separator (a comma in this case).
Good line: AB,1234,9,9,7
Bad line: CD,1234,6,6,7,
If I read it with test <- fread("/home/user/onebigfile.csv", fill=TRUE), I would expect fread() to realize that some lines have one column too many and simply add an extra column (e.g. "V1") which I could then throw away. Instead, fread() stops on the first bad line and does not continue (there are good lines again afterwards).
If I interpret the verbose output correctly, fread() "realizes" that there are bad lines, but somehow does not bump its column-count guess from 56 to 57; instead it assumes the sample jump landed awkwardly and carries on.
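For reference, a synthetic file with the same shape can be built like this (a minimal sketch, not my real data; on a file this small fread samples every row, so the early stop may not reproduce exactly, but it shows the structure of the problem):

```r
library(data.table)

# Small file in the same shape: a header, consistent 5-field rows,
# and one row that ends in an extra separator (6 fields).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Var1,Var2,Var3,Var4,Var5",
  "AB,1234,9,9,7",
  "CD,1234,6,6,7,",   # bad line: trailing comma makes 6 fields
  "EF,1234,5,5,5"
), tmp)

# Expected: fill=TRUE widens the table by one (NA-filled) column.
# Observed on the real file: the read stops at the first bad line.
test <- fread(tmp, fill = TRUE, verbose = TRUE)
```

Here is the full verbose output from the real file: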
> test <- fread("/home/user/onebigfile.csv", fill=TRUE)
omp_get_max_threads() = 12
omp_get_thread_limit() = 2147483647
DTthreads = 4
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=12, nth=4)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file /home/user/onebigfile.csv
File opened, size = 188.9MB (198091496 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Var1,Var2,Var3,Var4,Var5,>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 56 fields using quote rule 0
Detected 56 columns on line 1. This line is either column names or first data row. Line starts as: <<Var1,Var2,Var3,Var4,Var5,>>
Quote rule picked = 0
fill=true and the most number of columns found is 56
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because (198091495 bytes from row 1 to eof) / (2 * 15166 jump0size) == 6530
Type codes (jump 000) : 5A557525252555555555552525252525252555555555555555555555 Quote rule 0
Type codes (jump 004) : 5A557525252555555555555555555555555555555555555555555555 Quote rule 0
Type codes (jump 005) : 5A557575757555555555555555555555555555555555555555555555 Quote rule 0
A line with too-many fields (56/56) was found on line 1 of sample jump 6. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 7. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 10. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 11. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 12. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 13. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 14. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 15. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 017) : 5A657575757555555555555555555555555555556565655555556555 Quote rule 0
A line with too-many fields (56/56) was found on line 1 of sample jump 23. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 24. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 25. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 26. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 27. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 28. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 29. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 33. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 34. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 35. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 36. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 37. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 38. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 39. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 40. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 41. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 42. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 43. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 44. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 45. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 46. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 47. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 48. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 49. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 50. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 51. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 52. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 53. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 54. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 55. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 56. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 57. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 58. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 59. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 60. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 61. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 62. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 63. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 64. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 70. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 71. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 72. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 83. Most likely this jump landed awkwardly so type bumps here will be skipped.
A line with too-many fields (56/56) was found on line 1 of sample jump 84. Most likely this jump landed awkwardly so type bumps here will be skipped.
Type codes (jump 100) : 5A657575757555555555555555555555555555556565655555556555 Quote rule 0
'header' determined to be true due to column 1 containing a string on row 1 and a lower type (int32) in the rest of the 4846 sample rows
=====
Sampled 4846 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 2 to the end of last row: 198091109
Line length: mean=157.30 sd=9.73 min=135 max=196
Estimated number of rows: 198091109 / 157.30 = 1259284
Initial alloc = 1437041 rows (1259284 + 14%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : 5A657575757555555555555555555555555555556565655555556555
[10] Allocate memory for the datatable
Allocating 56 column slots (56 - 0 dropped) with 1437041 rows
[11] Read the data
jumps=[0..188), chunk_size=1053676, total_size=198091109
Restarting team from jump 11. nSwept==0 quoteRule==1
jumps=[11..188), chunk_size=1053676, total_size=198091109
Restarting team from jump 11. nSwept==0 quoteRule==2
jumps=[11..188), chunk_size=1053676, total_size=198091109
Restarting team from jump 11. nSwept==0 quoteRule==3
jumps=[11..188), chunk_size=1053676, total_size=198091109
Read 74936 rows x 56 columns from 188.9MB (198091496 bytes) file in 00:00.199 wall clock time
[12] Finalizing the datatable
Type counts:
46 : int32 '5'
5 : int64 '6'
4 : float64 '7'
1 : string 'A'
=============================
0.001s ( 0%) Memory map 0.184GB file
0.177s ( 89%) sep=',' ncol=56 and header detection
0.000s ( 0%) Column type detection using 4846 sample rows
0.002s ( 1%) Allocation of 1437041 rows x 56 cols (0.353GB) of which 74936 ( 5%) rows used
0.019s ( 9%) Reading 188 chunks (0 swept) of 1.005MB (each chunk 398 rows) using 4 threads
+ 0.013s ( 6%) Parse to row-major thread buffers (grown 0 times)
+ 0.004s ( 2%) Transpose
+ 0.002s ( 1%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
0.199s Total
Warning message:
In fread("/home/user/onebigfile.csv", :
Stopped early on line 74938. Expected 56 fields but found 57. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<1234,AB,666660,44444,100.111,1,,-1,,-1,,,1,1,2000,1,1,1,7,1,1,1,,-7,,-6,,-2,,-6,,-3,,-2,,-2,1,1,1,1,,-2,99999,1,999888,1,1,1,11,1,1,1,,-9,10,10,>>
Unfortunately I cannot attach a minimal reproducible example since my data is sensitive, sorry about that. If more information is needed on the issue I will try to provide it!
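In case it helps others hitting the same thing, a possible workaround (my own sketch, not something fread documents for this case) is to strip the trailing separator before fread parses the lines:

```r
library(data.table)

# Option 1: drop a single trailing comma per line via a shell command
# (assumes a Unix-like system with sed on the PATH).
test <- fread(cmd = "sed 's/,$//' /home/user/onebigfile.csv")

# Option 2: pure R; reads the file twice but needs no shell.
lines <- readLines("/home/user/onebigfile.csv")
test  <- fread(text = sub(",$", "", lines))
```

The sub() call only removes a separator at the very end of a line, so embedded commas are untouched; note it will also eat a legitimately empty last field, which is exactly the ambiguity fread itself faces here.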