Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to DataFrames v0.11, CSV 0.2.1, RData 0.3 #48

Merged
merged 8 commits into from
Jan 11, 2018

Conversation

alyst
Copy link
Member

@alyst alyst commented Sep 19, 2017

A part of JuliaData/DataFrames.jl#1232.
Still have to update .yml so that CI loads master versions of DataFrames, CSV and RData.

@randyzwitch
Copy link
Contributor

Thanks @alyst. When this is ready for merging, let me know

@alyst
Copy link
Member Author

alyst commented Sep 19, 2017

@quinnj I've switched from readtable() to CSV.read(), but now I've got errors like:

DataStreams.Data.NullException("encountered a null value for a non-null column type on row = 6, col = 1")
 Stacktrace:
   [1] (::CSV.##4#5)(::Int64, ::Int64) at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:124
   [2] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:233 [inlined]
   [3] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:315 [inlined]
   [4] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:130 [inlined] (repeats 2 times)
   [5] streamfrom at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:205 [inlined]
   [6] macro expansion at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:272 [inlined]
   [7] stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrameStream{Tuple{CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}},WeakRefStrings.WeakRefStringArray{Union{Nulls.Null, WeakRefString{UInt8}},1}}}, ::DataStreams.Data.Schema{true,Tuple{CategoricalArrays.CategoricalValue{String,UInt32},Union{Nulls.Null, WeakRefString{UInt8}}}}, ::Int64, ::Tuple{Base.#identity,Base.#identity}, ::DataStreams.Data.##15#16, ::Array{Any,1}) at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:344
   [8] #stream!#17(::Bool, ::Dict{Int64,Function}, ::Function, ::Array{Any,1}, ::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataFrames.DataFrame}) at /fs/home/<>/.julia/v0.6/DataStreams/src/DataStreams.jl:222
   [9] (::DataStreams.Data.#kw##stream!)(::Array{Any,1}, ::DataStreams.Data.#stream!, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Nulls.Null}, ::Type{DataFrames.DataFrame}) at ./<missing>:0
   [10] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:311
   [11] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
   [12] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...

or

CSV.ParsingException("Unexpected start of quote (34), use \"9234\" to type \"34\"")
  Stacktrace:
   [1] readsplitline!(::Array{CSV.RawField,1}, ::Base.AbstractIOBuffer{Array{UInt8,1}}, ::UInt8, ::UInt8, ::UInt8, ::Base.AbstractIOBuffer{Array{UInt8,1}}) at /fs/home/<>/.julia/v0.6/CSV/src/io.jl:98
   [2] #Source#22(::String, ::CSV.Options{Nulls.Null}, ::Int64, ::Int64, ::Array{Type,1}, ::Nulls.Null, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:99
   [3] (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
   [4] #Source#21(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{Type,1}, ::Nulls.Null, ::Nulls.Null, ::UInt8, ::String, ::String, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T, ::String) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:27
   [5] CSV.Source(::String) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:24
   [6] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:310
   [7] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
   [8] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...

or

ArgumentError: Symbol name may not contain \0
 Stacktrace:
   [1] macro expansion at ./broadcast.jl:153 [inlined]
   [2] macro expansion at ./simdloop.jl:73 [inlined]
   [3] macro expansion at ./broadcast.jl:147 [inlined]
   [4] _broadcast! at ./broadcast.jl:139 [inlined]
   [5] broadcast_t at ./broadcast.jl:268 [inlined]
   [6] broadcast_c at ./broadcast.jl:314 [inlined]
   [7] broadcast at ./broadcast.jl:434 [inlined]
   [8] close!(::DataFrames.DataFrameStream{Tuple{CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}}}}) at /fs/home/<>/.julia/v0.6/DataFrames/src/abstractdataframe/io.jl:302
   [9] #read#39(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:312
   [10] dataset(::String, ::String) at /fs/home/<>/.julia/v0.6/RDatasets/src/dataset.jl:13
   [11] macro expansion at /fs/gpfs03/lv06/pool/pool-innate-bioinfo2/<>/.julia/v0.6/RDatasets/test/dataset.jl:15 [inlined]
...

Maybe you have an easy fix/explanations to those.

@alyst
Copy link
Member Author

alyst commented Sep 21, 2017

I've failed to resist the temptation of caching the datasets() and packages() tables in global variables. I'm not sure, though, the way I implemented it follows the recommended code pattern.

@quinnj
Copy link
Member

quinnj commented Sep 21, 2017

@alyst, can you list the individual files you saw the errors on? Happy to take a look.

@alyst
Copy link
Member Author

alyst commented Sep 21, 2017

The csv.gz issues listed above are resolved. By default I'm using 200 rows for type detection now, and there are some datasets that need more rows to properly infer types.
12 datasets are failing, but mostly because null are encountered beyond 200-th row.
So far the only dataset that requires fixing JuliaData/CSV.jl#64 is datasets/attenu.csv.gz, the inferred type of Station column is CategoricalArrays.CategoricalValue{String,UInt32}:

DataStreams.Data.NullException("encountered a null value for a non-null column type on row = 79, col = 3")
  Stacktrace:
   [1] (::CSV.##4#5)(::Int64, ::Int64) at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:124
   [2] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:235 [inlined]
   [3] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:317 [inlined]
   [4] parsefield at /fs/home/<>/.julia/v0.6/CSV/src/parsefields.jl:131 [inlined] (repeats 2 times)
   [5] streamfrom at /fs/home/<>/.julia/v0.6/CSV/src/Source.jl:216 [inlined]

(the CSV.jl line numbers could be off, because I've made some changes to support all-nulls columns):

@alyst
Copy link
Member Author

alyst commented Nov 27, 2017

Some datasets need JuliaData/CSV.jl#115 to load properly

@tlnagy
Copy link
Contributor

tlnagy commented Dec 6, 2017

Thanks for working on this. I believe RDatasets is only package that needs to be updated before we can patch up Gadfly to work on top of Dataframes v0.11

@alyst
Copy link
Member Author

alyst commented Dec 6, 2017

I'm waiting for JuliaData/CSV.jl#115 to remove [WIP] and consider merging. I hope that PR would be reviewed in a few days.

@randyzwitch
Copy link
Contributor

Just ping me when this is ready for merging

@Nosferican
Copy link

@alyst JuliaData/CSV.jl#115 has been merged.

@randyzwitch
Copy link
Contributor

Even with the PR being merged, it still needs to be tagged. And that tag should be the minimum version in REQUIRES here.

@Nosferican
Copy link

@quinnj could you tag a minor release? It would help with updating RDatasets and Gadfly.

@alyst
Copy link
Member Author

alyst commented Dec 24, 2017

@Nosferican Thanks! Actually, I haven't forgotten this PR at all :) Besides JuliaData/CSV.jl#115 there are a few other outstanding issues (notably JuliaData/CSV.jl#132), which we are trying to resolve. Once it's done and CI is green for DataFrames and CSV, I'll update this PR and it's ok to merge.

@quinnj
Copy link
Member

quinnj commented Jan 10, 2018

A new tag of CSV was just merged.

@randyzwitch randyzwitch reopened this Jan 10, 2018
@randyzwitch
Copy link
Contributor

@alyst Travis fails overall because it still mentions 0.4 and 0.5, but 0.6 passes. If you could please modify .travis.yml to just test 0.6 and master, then we should be able to tag a new version of RDatasets

@alyst
Copy link
Member Author

alyst commented Jan 10, 2018

@randyzwitch I hope to have a look tomorrow.

StoneCypher added a commit to StoneCypher/RDatasets.jl that referenced this pull request Jan 10, 2018
randyzwitch pushed a commit that referenced this pull request Jan 10, 2018
@randyzwitch randyzwitch reopened this Jan 10, 2018
@randyzwitch
Copy link
Contributor

Thanks to @StoneCypher, this is now passing on 0.6 as well. Up to you @alyst if there are more updates you want to do, or if I should merge this in and tag a new release.

@tlnagy
Copy link
Contributor

tlnagy commented Jan 11, 2018

It would be great to have this merged and a new release tagged as soon as possible. Thanks everyone for getting this up and running. (esp @alyst!)

alyst added 8 commits January 11, 2018 16:04
- decompress csv.gz and feed the stream to CSV using CodecZlib

- raise row detection limit for csv files to 15000
  as vcd::Bundesliga needs to read upto row #10804 to detect
  that 7th column can contain missing values
so that test results for non-problematic packages are collapsed into
a single row in the report
allow digits in the first position to make psych::cubits pass,
but disallow whitespace at any other position
use 20 rows by default and define Dataset_typedetect_rows dictionary
that stores the exceptions
@alyst alyst changed the title [WIP] Update to Julia v0.6, DataFrames v0.11 Update to DataFrames v0.11, CSV 0.2.1, RData 0.3 Jan 11, 2018
@alyst
Copy link
Member Author

alyst commented Jan 11, 2018

I've updated REQUIRE, removed enclosing modules from tests (now all test.jl files are executed in their own testset) and finally removed [WIP].

@randyzwitch randyzwitch merged commit a3e584c into JuliaStats:master Jan 11, 2018
@randyzwitch
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants