fread error reading Latin-1 file containing NUL byte <0x00> #2435

Closed
adamaltmejd opened this issue Oct 23, 2017 · 4 comments · Fixed by #3505

@adamaltmejd

Copying from my question on SO:

I'm having trouble creating a reproducible example and can't share the data, but I think I've stumbled upon a bug in fread(). Trying to read my 1.658 GB TSV file encoded in Latin-1 produces the following error:

Error in fread("POANG.txt", header = TRUE, sep = "\t", sep2 = NULL, encoding = "Latin-1",  :
Jump 949 did not finish counting rows exactly where jump 950 found its first good line start: prevEnd(0x14e51d6dc)<<>> != thisStart(prevEnd+180966)<<4908565	01	0	1	0	1999	1	TNMAT		NMAC09	015	015	15.>>

The problematic line is line 11129896, which contains a NUL byte (shown as <0x00> in Sublime Text and ^@ in Vim; I can't copy it here). If I set skip = 11129895, fread throws the same error but now on "jump 0"; if I set skip = 11129896 it works, but nrows = 11129895 still throws the same error. After removing the character, the file reads as it should. Maybe fread() isn't supposed to support files with these encoding issues, but it would at least be great if the error were more informative. It took me quite a while to understand what was going on and to find the offending line.
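For anyone else trying to pin down such a byte, a rough way to do it from R might look like the lines below (this loads the whole file as raw, so it needs several GB of RAM for a file this size; "POANG.txt" is just my file name from above):

## read the file as raw bytes and list the offsets of every embedded NUL
raw_bytes   <- readBin("POANG.txt", what = "raw", n = file.size("POANG.txt"))
nul_offsets <- grepRaw(as.raw(0L), raw_bytes, fixed = TRUE, all = TRUE)
nul_offsets                                    # 1-based byte positions of each <0x00>

## line number of the first NUL = newlines before it + 1
sum(head(raw_bytes, nul_offsets[1] - 1) == as.raw(0x0a)) + 1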

The verbose output of fread() is:

> dt <- fread(
...         "POANG.txt",
...         header = TRUE,
...         sep = "\t",
...         sep2 = NULL,
...         encoding = "Latin-1", #same as ISO-8859-1
...         na.strings = NULL,
...         check.names = TRUE,
...         verbose = TRUE,
...     )
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 4 threads (omp_get_max_threads()=4, nth=4)
  No NAstrings provided.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file POANG.txt
  File opened, size = 1.658GB (1780385879 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<LopNr	AterPNr	SenPNr	foralder	>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep '\t'
  sep=0x9  with 100 lines of 15 fields using quote rule 0
  Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<LopNr	AterPNr	SenPNr	foralder	>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 15
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (1780385877 bytes from row 1 to eof) / (2 * 9642 jump0size) == 92324
  Type codes (jump 000)    : 5111115510101055710  Quote rule 0
  Type codes (jump 100)    : 5111115510101055710  Quote rule 0
  =====
  Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 1780385877
  Line length: mean=88.42 sd=14.68 min=58 max=176
  Estimated number of rows: 1780385877 / 88.42 = 20135934
  Initial alloc = 30148798 rows (20135934 + 49%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 5111115510101055710
[10] Allocate memory for the datatable
  Allocating 15 column slots (15 - 0 dropped) with 30148798 rows
[11] Read the data
  jumps=[0..1700), chunk_size=1047285, total_size=1780385770
Read 55%. ETA 00:00
[12] Finalizing the datatable
Read 11179334 rows x 15 columns from 1.658GB (1780385879 bytes) file in 00:10.121 wall clock time
Thread buffers were grown 0 times (if all 4 threads each grew once, this figure would be 4)
Final type counts
         0 : drop
         5 : bool8
         0 : bool8
         0 : bool8
         0 : bool8
         5 : int32
         0 : int64
         1 : float64
         0 : float64
         0 : float64
         4 : string
Error in fread("POANG.txt", header = TRUE, sep = "\t", sep2 = NULL, encoding = "Latin-1",  :
  Jump 949 did not finish counting rows exactly where jump 950 found its first good line start: prevEnd(0x14e51d6dc)<<>> != thisStart(prevEnd+180966)<<4908565	01	0	1	0	1999	1	TNMAT		NMAC09	015	015	15.>>

And sessionInfo():

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin17.0.0 (64-bit)
Running under: macOS High Sierra 10.13

Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.2.20/lib/libopenblasp-r0.2.20.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] dtplyr_0.0.2      data.table_1.10.5 dplyr_0.7.4       purrr_0.2.4
 [5] readr_1.1.1       tidyr_0.7.2       tibble_1.3.4      ggplot2_2.2.1
 [9] tidyverse_1.1.1   colorout_1.1-2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13     cellranger_1.1.0 compiler_3.4.2   plyr_1.8.4
 [5] bindr_0.1        forcats_0.2.0    tools_3.4.2      lubridate_1.6.0
 [9] jsonlite_1.5     nlme_3.1-131     gtable_0.2.0     lattice_0.20-35
[13] pkgconfig_2.0.1  rlang_0.1.2      psych_1.7.8      parallel_3.4.2
[17] haven_1.1.0      bindrcpp_0.2     xml2_1.1.1       stringr_1.2.0
[21] httr_1.3.1       hms_0.3          grid_3.4.2       glue_1.1.1
[25] R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.1
[29] reshape2_1.4.2   magrittr_1.5     scales_0.5.0     rvest_0.3.2
[33] assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2 stringi_1.1.5
[37] lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2
@djbirke

djbirke commented Apr 25, 2018

I am also affected by this. For large files, replacing <0x00> is often infeasible or too slow. I assume the <0x00> bytes are created by a database export.

readr::read_delim throws a warning for those bytes but does not stop parsing (see the readr issue history).

Would a similar behaviour be possible for fread as well?
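For what it's worth, a readr call along these lines keeps going past the NULs and only warns (file name and encoding are just illustrative here):

library(readr)
dt <- read_delim("input.txt", delim = "\t",
                 locale = locale(encoding = "latin1"))
problems(dt)   # parsing problems readr recorded while reading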

@mattdowle
Member

@adamaltmejd and @djbirke, would you mind trying your files again with 1.12.3 from GitHub, please? Embedded NUL bytes should be fixed now.
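(One way to get the development version, assuming the remotes package is available; building data.table from source may require a compiler toolchain:)

install.packages("remotes")                      # if not already installed
remotes::install_github("Rdatatable/data.table")
packageVersion("data.table")                     # should report 1.12.3 or later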

@djbirke

djbirke commented Apr 17, 2019

@mattdowle That's great news, thanks a lot! We decided to bite the bullet and manually replace any <0x00> bytes in our files. I no longer have access to the original version of the files, so unfortunately I cannot test the bugfix on them.
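For reference, the replacement can be streamed so the whole file never has to sit in memory; a rough sketch (paths and chunk size are placeholders):

## drop embedded NUL bytes from a large file in fixed-size chunks
in_con  <- file("input.txt", "rb")
out_con <- file("input_clean.txt", "wb")
repeat {
  chunk <- readBin(in_con, what = "raw", n = 64 * 1024^2)  # 64 MB per chunk
  if (length(chunk) == 0L) break
  writeBin(chunk[chunk != as.raw(0L)], out_con)            # keep everything except <0x00>
}
close(in_con)
close(out_con)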

@adamaltmejd
Author

Thanks for fixing this! In my case the file is on a secure server where I cannot update data.table myself, but I'm pretty sure it will work if it does in other test cases :). Thanks!
