POC/WIP/RFC: faster integer parsing; define parse semantics #16981
To me, it seems a bit intrusive to always return a Nullable from |
parse
semantics
What about |
This would also allow fallbacks like this:

    # Erroring version, defined in terms of the nullable version.
    function parse{T<:Number}(::Type{T}, str::AbstractString)
        n = parse(Nullable{T}, str)
        return !isnull(n) ? get(n) : throw(ParseError("invalid format for $T: $(repr(str))"))
    end

    # Nullable version, defined in terms of the erroring version.
    function parse{T<:Number}(::Type{Nullable{T}}, str::AbstractString)
        return try
            Nullable{T}(parse(T, str))
        catch err
            Nullable{T}()
        end
    end

Mutual fallbacks seem a bit weird, but this would allow defining either the erroring version or the nullable version. However, I think that making the |
The key to this seems to be the idea that parsing has three cases: valid data, invalid data, or missing data. I think there are only two cases: you either have valid integer data or you don't. The reason is that there is no canonical notion of what a missing integer looks like. I can also say that Nullable or Maybe types are not necessarily tied to a notion of missing data; they're a general utility for cases where you might not have a value to return.
This speedup is amazing. Looking at our current code, I'm guessing the call to |
Maybe distinguishing these various cases makes sense:

    parse(Int, str::String)                   # valid integer syntax, otherwise error
    parse(Nullable{Int}, str::String)         # valid integer syntax or null syntax, otherwise error
    parse(Result{Int}, str::String)           # return Result wrapping either an Int or an error
    parse(Result{Nullable{Int}}, str::String) # return Result wrapping either an (Int or a null) or an error

Of course that relies on there being some standard way of representing a null, which there isn't.
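The `Result` idea above can be sketched roughly as follows. This is written against current Julia (1.x) rather than the syntax used in this thread, and `ParseResult`, `ok`, and `parse_result` are hypothetical names chosen for illustration; nothing like them is defined in Base.

```julia
# Hypothetical result wrapper: holds either a parsed value or the error raised.
struct ParseResult{T}
    value::Union{T, Nothing}
    error::Union{Exception, Nothing}
end

# True when parsing succeeded.
ok(r::ParseResult) = r.error === nothing

# Wrap Base.parse so that invalid syntax is returned instead of thrown.
function parse_result(::Type{T}, str::AbstractString) where {T<:Integer}
    try
        return ParseResult{T}(parse(T, str), nothing)
    catch err
        # Only syntax errors are captured; anything else propagates.
        err isa ArgumentError || rethrow()
        return ParseResult{T}(nothing, err)
    end
end

r = parse_result(Int, "4x2")
ok(r) || println("could not parse: ", r.error)
```

The caller decides what to do with the error, so no exception handling is needed at the call site.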
@StefanKarpinski See previous discussion at #9487 about @quinnj In practice, I guess your parsing code has to deal with both |
I don't think we're that far from a "standard null representation". Currently we have a simple rule that leading/trailing whitespace (which I think we should more strictly define as space or tab) is allowed, which means nulls can only be

    ""   # empty string or immediate eof(io)
    " "  # any length of space/tab characters

In a further iteration, I've had ideas around extending the Options type:

    immutable Options{B}
        null::String
    end

where a user could specify their own null representation to be parsed as such, i.e.

On the other hand, having a notion of "parsing missing values" is much stronger, IMO, when interacting with databases, delimited files, website APIs, any data format really.

@JeffBezanson, I'll admit I too was a bit surprised by just how much of a speedup was available. Here's a shortlist of what I think makes the difference, though I'm unsure just how much each contributes to the overall speedup:
@nalimilan, see my comments above about eventually allowing the user to specify a custom null representation to parse.
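A minimal sketch of what a user-specified null representation could look like, assuming current Julia (1.x) where `missing` plays the role that `Nullable` plays in this thread. `NullOptions` and `parse_with_null` are hypothetical names, not the PR's `Options` type or any Base API.

```julia
# Hypothetical options type; `null` is the text to treat as a missing value.
struct NullOptions
    null::String
end

function parse_with_null(::Type{T}, str::AbstractString, opts::NullOptions) where {T<:Integer}
    # Leading/trailing space and tab are ignored, as in the rule described above.
    s = strip(str, [' ', '\t'])
    # Empty input or the user's null marker means "no value".
    (isempty(s) || s == opts.null) && return missing
    # Anything else must be valid integer syntax; Base.parse throws otherwise.
    return parse(T, s)
end

opts = NullOptions("NA")
parse_with_null(Int, " 12\t", opts)  # an Int
parse_with_null(Int, "NA", opts)     # missing
```

Invalid input such as `"abc"` still raises, so the null marker and a malformed field stay distinguishable.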
it doesn't generalize very well, because you really need to implement a parser to verify if the number will parse. For example, |
        msg::String
    end

    immutable Options{base}
This should be an ImmutableDict. This type will cause issues for compilation.
Can you elaborate? I didn't even realize ImmutableDict exists, so it would be interesting and relevant for me to know when I should use it ;)
What motivates changing the argument order from |
The idea was making it consistent with |
If anything, I'd change |
I like that ^
+1, but see also discussion at #14412.
Personally I think the main motivation to make |
We should rank our conventions and then apply that rank consistently in the standard library.
We should determine if we want to do this, and if so what the minimal breaking change needed to go into 1.0 is – or explicitly decide that we're not doing this.
As an intermediate step, would there be any way to extract only the performance improvements while not changing any behaviour?
We have addressed part of this, by returning But I'll definitely second the request for the performance improvements :) |
Funny, I just started picking this up again the other day. You can catch all the latest improvements over here. |
Great! Is the semantic issue resolved sufficiently for 1.0? |
Meh, I think it's fine for now. We're basically just providing convenience for

    try
        result = parse(T, str)
    catch
        result = nothing
    end

which is fine, though doesn't seem like a very common practice to do (i.e. we don't have

In Parsing.jl, I have ideas around creating various parsing "layers", similar to the approach in TextParse.jl, where you basically have a |
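For comparison, here is the try/catch pattern from the comment above next to `tryparse`, as a sketch on current Julia (1.x), where `tryparse` returns `nothing` on invalid input. `parse_or_nothing` is a hypothetical helper name used only for this illustration.

```julia
# Explicit control flow around invalid input, as in the snippet above.
function parse_or_nothing(::Type{T}, str::AbstractString) where {T}
    try
        return parse(T, str)
    catch
        return nothing
    end
end

# Base.tryparse gives the same answers for these inputs without raising.
for s in ("123", "12a")
    println(repr(s), " => ", repr(parse_or_nothing(Int, s)), " / ", repr(tryparse(Int, s)))
end
```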
I would say so, yes. I agree something like |
I agree too. Maybe we should have some of those others. For a shorter name, we could also perhaps call them I think it's also not uncommon to want |
Now that I've drawn you in with my carrot of a headline, I also want to discuss what I consider a fundamental design flaw in our current parsing functions. Currently, we have Exhibit A:
The problem, as I see it, is we're utilizing Nullable not to represent nullability, but as a form of control flow; basically as a way to avoid having to do a try-catch block as a safeguard against invalid inputs.
This conflicts with a fundamental principle of parsing, which in my book could be stated as:
I think this important distinction can result in subtle bugs in a user's current code, Exhibit B:
The new semantics I propose are this, Exhibit C:
If you need to sanitize or control flow around invalid inputs, then sanitize or use control flow; nullability should be left to its job of missingness.
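The exhibits referenced above are not reproduced here, so the following is only a sketch of the distinction being argued for, written against current Julia (1.x) with `missing` standing in for a null. `parse_field` is a hypothetical helper, and the blank-field rule (only spaces and tabs) is taken from the discussion in this thread.

```julia
# Hypothetical helper implementing the proposed split:
#   valid syntax -> value, blank field -> missing, anything else -> error.
function parse_field(::Type{T}, str::AbstractString) where {T<:Integer}
    all(c -> c == ' ' || c == '\t', str) && return missing
    return parse(T, str)  # throws ArgumentError on invalid syntax
end

fields = ["1", "", "x3"]

# Control-flow-as-null: the typo "x3" is silently treated like the blank field.
collapsed = [tryparse(Int, f) for f in fields]

# Proposed split: the blank field is missing, while the typo is reported.
for f in fields
    try
        println(repr(f), " => ", repr(parse_field(Int, f)))
    catch err
        println(repr(f), " => error: ", typeof(err))
    end
end
```

With the first approach a malformed value and an absent value both come back as "no value", which is the kind of subtle bug the description warns about.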
This PR is half-baked right now because I wanted a chance to throw the change in semantics out there and get feedback before plowing ahead to re-organize float parsing similarly (and move it from C to julia).
I'll also note the drastically simpler (i.e. smaller and more legible), yet 4x-10x faster net implementation of integer parsing as we define it to be based on the new semantic and on an IO instead of string.
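To illustrate the general shape of IO-based integer parsing, here is a minimal sketch in current Julia (1.x). It is not the PR's implementation: `parse_int` is a hypothetical name, only base 10 is handled, whitespace is not skipped, and the whole stream must consist of the number.

```julia
# Read a base-10 Int from an IO, one byte at a time.
function parse_int(io::IO)
    eof(io) && throw(ArgumentError("no input"))
    b = read(io, UInt8)
    negative = b == UInt8('-')
    if negative || b == UInt8('+')
        eof(io) && throw(ArgumentError("sign without digits"))
        b = read(io, UInt8)
    end
    n = 0
    while true
        UInt8('0') <= b <= UInt8('9') ||
            throw(ArgumentError("invalid digit: $(repr(Char(b)))"))
        # Accumulate the negated magnitude so that typemin(Int) is representable;
        # the checked operations throw OverflowError instead of wrapping.
        n = Base.Checked.checked_sub(Base.Checked.checked_mul(n, 10), Int(b - UInt8('0')))
        eof(io) && break
        b = read(io, UInt8)
    end
    return negative ? n : Base.Checked.checked_neg(n)
end

parse_int(IOBuffer("-45"))
```

Working on bytes from the stream avoids building intermediate strings, which is one plausible source of the kind of speedup discussed here, though the sketch makes no claim about the actual numbers.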
I'm viewing this as a solid step one to making Base Julia parsing machinery world-class in functionality and performance. Future iterations will involve incorporating more type richness (Dates/DateTimes), greater parsing machinery (utilizing the Options type for specifying delimiters and such), and eventually the excision of datafmt.jl, which, according to my forecasts, could be re-implemented in about 20 lines of code with the right parsing machinery in place.
Some benchmarks for dessert:
Warmed-up results: