fread error reading Latin-1 file containing NUL byte <0x00> #2435

Closed
adamaltmejd opened this issue Oct 23, 2017 · 4 comments · Fixed by #3505

@adamaltmejd

Copying from my question on SO:

I'm having trouble creating a reproducible example and can't share the data, but I think I've stumbled upon a bug in fread(). Trying to read my 1.658 GB TSV file encoded in Latin-1 produces the following error:

Error in fread("POANG.txt", header = TRUE, sep = "\t", sep2 = NULL, encoding = "Latin-1",  :
Jump 949 did not finish counting rows exactly where jump 950 found its first good line start: prevEnd(0x14e51d6dc)<<>> != thisStart(prevEnd+180966)<<4908565	01	0	1	0	1999	1	TNMAT		NMAC09	015	015	15.>>

The problematic line is line 11129896, which contains a NUL byte (shown as <0x00> in Sublime Text and ^@ in Vim; I can't copy it here). If I set skip = 11129895, fread throws the same error but now on "jump 0"; if I set skip = 11129896 it works, but nrows = 11129895 still throws the same error. After removing the character, the file reads as it should. Maybe fread() isn't supposed to support files with these encoding issues, but it would at least be great if the error were more informative. It took me quite a while to understand what was going on and to find the offending line.
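For anyone else trying to pin down such a byte, a rough way to do it from R might look like the lines below (this loads the whole file as raw, so it needs several GB of RAM for a file this size; "POANG.txt" is just my file name from above):

## read the file as raw bytes and list the offsets of every embedded NUL
raw_bytes   <- readBin("POANG.txt", what = "raw", n = file.size("POANG.txt"))
nul_offsets <- grepRaw(as.raw(0L), raw_bytes, fixed = TRUE, all = TRUE)
nul_offsets                                    # 1-based byte positions of each <0x00>

## line number of the first NUL = newlines before it + 1
sum(head(raw_bytes, nul_offsets[1] - 1) == as.raw(0x0a)) + 1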

The verbose output of fread() is:

> dt <- fread(
...         "POANG.txt",
...         header = TRUE,
...         sep = "\t",
...         sep2 = NULL,
...         encoding = "Latin-1", #same as ISO-8859-1
...         na.strings = NULL,
...         check.names = TRUE,
...         verbose = TRUE,
...     )
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 4 threads (omp_get_max_threads()=4, nth=4)
  No NAstrings provided.
  show progress = 1
  0/1 column will be read as boolean
[02] Opening the file
  Opening file POANG.txt
  File opened, size = 1.658GB (1780385879 bytes).
  Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<LopNr	AterPNr	SenPNr	foralder	>>
[06] Detect separator, quoting rule, and ncolumns
  Using supplied sep '\t'
  sep=0x9  with 100 lines of 15 fields using quote rule 0
  Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<LopNr	AterPNr	SenPNr	foralder	>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 15
[07] Detect column types, good nrow estimate and whether first row is column names
  'header' changed by user from 'auto' to true
  Number of sampling jump points = 101 because (1780385877 bytes from row 1 to eof) / (2 * 9642 jump0size) == 92324
  Type codes (jump 000)    : 5111115510101055710  Quote rule 0
  Type codes (jump 100)    : 5111115510101055710  Quote rule 0
  =====
  Sampled 10054 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 1 to the end of last row: 1780385877
  Line length: mean=88.42 sd=14.68 min=58 max=176
  Estimated number of rows: 1780385877 / 88.42 = 20135934
  Initial alloc = 30148798 rows (20135934 + 49%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : 5111115510101055710
[10] Allocate memory for the datatable
  Allocating 15 column slots (15 - 0 dropped) with 30148798 rows
[11] Read the data
  jumps=[0..1700), chunk_size=1047285, total_size=1780385770
Read 55%. ETA 00:00
[12] Finalizing the datatable
Read 11179334 rows x 15 columns from 1.658GB (1780385879 bytes) file in 00:10.121 wall clock time
Thread buffers were grown 0 times (if all 4 threads each grew once, this figure would be 4)
Final type counts
         0 : drop
         5 : bool8
         0 : bool8
         0 : bool8
         0 : bool8
         5 : int32
         0 : int64
         1 : float64
         0 : float64
         0 : float64
         4 : string
Error in fread("POANG.txt", header = TRUE, sep = "\t", sep2 = NULL, encoding = "Latin-1",  :
  Jump 949 did not finish counting rows exactly where jump 950 found its first good line start: prevEnd(0x14e51d6dc)<<>> != thisStart(prevEnd+180966)<<4908565	01	0	1	0	1999	1	TNMAT		NMAC09	015	015	15.>>

And sessionInfo():

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin17.0.0 (64-bit)
Running under: macOS High Sierra 10.13

Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.2.20/lib/libopenblasp-r0.2.20.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] dtplyr_0.0.2      data.table_1.10.5 dplyr_0.7.4       purrr_0.2.4
 [5] readr_1.1.1       tidyr_0.7.2       tibble_1.3.4      ggplot2_2.2.1
 [9] tidyverse_1.1.1   colorout_1.1-2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13     cellranger_1.1.0 compiler_3.4.2   plyr_1.8.4
 [5] bindr_0.1        forcats_0.2.0    tools_3.4.2      lubridate_1.6.0
 [9] jsonlite_1.5     nlme_3.1-131     gtable_0.2.0     lattice_0.20-35
[13] pkgconfig_2.0.1  rlang_0.1.2      psych_1.7.8      parallel_3.4.2
[17] haven_1.1.0      bindrcpp_0.2     xml2_1.1.1       stringr_1.2.0
[21] httr_1.3.1       hms_0.3          grid_3.4.2       glue_1.1.1
[25] R6_2.2.2         readxl_1.0.0     foreign_0.8-69   modelr_0.1.1
[29] reshape2_1.4.2   magrittr_1.5     scales_0.5.0     rvest_0.3.2
[33] assertthat_0.2.0 mnormt_1.5-5     colorspace_1.3-2 stringi_1.1.5
[37] lazyeval_0.2.0   munsell_0.4.3    broom_0.4.2
@djbirke

djbirke commented Apr 25, 2018

I am also affected by this. For large files, replacing <0x00> is often infeasible or too slow. I assume the <0x00> bytes are created by a database export.

readr::read_delim throws a warning for those bytes but does not stop parsing (see the readr issue history).

Would a similar behaviour be possible for fread as well?
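For what it's worth, a readr call along these lines keeps going past the NULs and only warns (file name and encoding are just illustrative here):

library(readr)
dt <- read_delim("input.txt", delim = "\t",
                 locale = locale(encoding = "latin1"))
problems(dt)   # parsing problems readr recorded while reading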

@mattdowle
Member

@adamaltmejd and @djbirke, would you mind trying your files again with 1.12.3 from GitHub, please? Embedded NUL bytes should be fixed now.
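(One way to get the development version, assuming the remotes package is available; building data.table from source may require a compiler toolchain:)

install.packages("remotes")                      # if not already installed
remotes::install_github("Rdatatable/data.table")
packageVersion("data.table")                     # should report 1.12.3 or later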

@djbirke

djbirke commented Apr 17, 2019

@mattdowle That's great news, thanks a lot! We decided to bite the bullet and manually replace any <0x00> bytes in our files. I no longer have access to the original version of the files, so unfortunately I cannot test the bugfix on them.
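For reference, the replacement can be streamed so the whole file never has to sit in memory; a rough sketch (paths and chunk size are placeholders):

## drop embedded NUL bytes from a large file in fixed-size chunks
in_con  <- file("input.txt", "rb")
out_con <- file("input_clean.txt", "wb")
repeat {
  chunk <- readBin(in_con, what = "raw", n = 64 * 1024^2)  # 64 MB per chunk
  if (length(chunk) == 0L) break
  writeBin(chunk[chunk != as.raw(0L)], out_con)            # keep everything except <0x00>
}
close(in_con)
close(out_con)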

@adamaltmejd
Author

Thanks for fixing this! In my case the file is on a secure server where I cannot update data.table myself, but I'm pretty sure it will work if it does in other test cases :). Thanks!
