Provide an efficient way to parse Integers (and Floats)? #113
Comments
I think it makes sense, but shouldn't we support things besides base 10? I'd expect to be able to provide a base (I think) |
The way I was seeing it, these would be a different format name, e.g. with the first syntax:
But maybe that's not flexible enough? The issue is that for this to perform, we need to invoke a C function with a pointer, so argument passing is a bit more complex, I think? |
I plan on using the changes from #89 in HexaPDF (thanks @tenderlove) once they are released. This proposal feels quite similar and from a user's perspective I would probably go with the method names The |
I like Do we need the pattern argument ( |
I think so, because Ruby's
>> "1_234".to_i
=> 1234
>> Integer("0x80")
=> 128
So if we're going to use something like Alternatively we can not take a pattern argument, but we shouldn't allow things like What do you think? |
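To make the leniency in that irb session concrete, here is a small sketch using only core Ruby conversions. It does not show any strscan API; the variable names are just for illustration.

```ruby
# Core Ruby conversions only, to illustrate how lenient they are.
underscore = "1_234".to_i    # underscores between digits are accepted
hex_kernel = Integer("0x80") # Kernel#Integer honors radix prefixes
hex_to_i   = "0x80".to_i     # String#to_i (base 10) stops at the "x"
hex_base16 = "0x80".to_i(16) # with an explicit base 16 the prefix is accepted
strict     = Integer("12abc", exception: false) # nil instead of raising

p underscore, hex_kernel, hex_to_i, hex_base16, strict
```

A scanner method that simply delegated to one of these would inherit the same leniency, which is why the accepted syntax has to be pinned down one way or another.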
We can avoid the I have some concerns about mandatory pattern argument:
Can we add |
For the HexaPDF use case there would need to be support for two things:
For the integer use case I think this is what most integers look like. For the float use case the PDF syntax is probably special because a digit before or after the period is not necessary, as shown in the above examples. I'm not sure if only providing a scan pattern would work because the method itself would still need to know how to deal with that situation. So maybe a keyword argument for |
How would you determine whether the digits at the current scan position are actually an integer and not something else? For example, is Would it make sense to specify a "separator" pattern or "separator" characters? |
True, if we could avoid matching a regexp it would be preferable.
That indeed sounds like an easy mistake to make.
In my opinion, underscores should either not be supported or be opt-in, not opt-out, as it's a common thing in Ruby and YAML, but fairly rare otherwise, and I'm not even sure Ruby and YAML agree exactly on how underscores can be used. So I think we shouldn't even support underscores, or at least not initially.
I think it's not that uncommon: Ruby, JavaScript, Python and likely many others support that, one notable exception I can think of being JSON. So the decision is of course @kou's call, but at that point I think |
I'm OK with this.
How about implementing |
We may implement |
Sounds good. I'll work on this today. |
Fix: ruby#113 This allows parsing an Integer directly from a String without first allocating a substring. Notes: The implementation is limited by design; it's meant as a first step, and only the most straightforward base-10 integers are supported.
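As a rough usage sketch of what that commit enables: `read_integer` below is a hypothetical helper (not part of strscan) that calls `StringScanner#scan_integer` when the installed strscan provides it, and otherwise falls back to the allocating `scan` + `to_i` path. The fallback regexp assumes an optional sign followed by base-10 digits.

```ruby
require "strscan"

# Hypothetical helper: prefer scan_integer when available, otherwise fall
# back to scan + to_i (which allocates a temporary substring).
def read_integer(scanner)
  if scanner.respond_to?(:scan_integer)
    scanner.scan_integer
  else
    scanner.scan(/[+-]?\d+/)&.to_i
  end
end

scanner = StringScanner.new("42 -7 abc")
first = read_integer(scanner)  # 42
scanner.skip(/\s+/)
second = read_integer(scanner) # -7
scanner.skip(/\s+/)
third = read_integer(scanner)  # nil, and the scanner does not advance
rest = scanner.rest            # "abc"
```

Both branches return nil without moving the scan position when no integer is found, so callers can try another token type next.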
@casperisfine I have been using the code from your pull request to benchmark HexaPDF with an adapted implementation that uses
The adjusted implementation looks like this:

    # Parses the number (integer or real) at the current position.
    #
    # See: PDF2.0 s7.3.3
    def parse_number
      prepare_string_scanner(20)
      pos = self.pos
      if (tmp = @ss.scan_integer)
        if @ss.eos? || @ss.match?(WHITESPACE_OR_DELIMITER_RE)
          # Handle object references, see PDF2.0 s7.3.10
          prepare_string_scanner(10)
          if @ss.scan(REFERENCE_RE)
            tmp = if tmp > 0
                    Reference.new(tmp, @ss[1].to_i)
                  else
                    maybe_raise("Invalid indirect object reference (#{tmp},#{@ss[1].to_i})")
                    nil
                  end
          end
          return tmp
        else
          self.pos = pos
        end
      end
      val = scan_until(WHITESPACE_OR_DELIMITER_RE) || @ss.scan(/.*/)
      if val.match?(/\A[+-]?(?:\d+\.\d*|\.\d+)\z/)
        val << '0' if val.getbyte(-1) == 46 # dot '.'
        Float(val)
      else
        TOKEN_CACHE[val] # val is keyword
      end
    end
    end

Would it be possible to provide a "separator" argument to |
@gettalong Could you open a new issue for your use case? Let's discuss it in a new issue, not in a closed PR. |
(ruby/strscan#115) Fix: ruby/strscan#113 This allows to directly parse an Integer from a String without needing to first allocate a sub string. Notes: The implementation is limited by design, it's meant as a first step, only the most straightforward, based 10 integers are supported. ruby/strscan@6a3c74b4c8
Done: #119. |
Previous discussion: https://bugs.ruby-lang.org/issues/20394
Context
When trying to write pure Ruby gems that are competitive in terms of performance with C extensions, a very common bottleneck is parsing text-based protocols and formats, such as the Redis RESP protocol, or even the PDF format (FYI @gettalong).
As a result, the most efficient way to parse integers in a string in Ruby is currently to reimplement atoi using String#getbyte, which is a bit ridiculous.
Otherwise, if you create a substring with String#slice or StringScanner#scan and then call to_i or Integer, instantiating the substring and copying the bytes really tanks performance.
Proposal
Given that StringScanner is a default gem, is often involved in string parsing, and already acts as a "pointer into a String", I think it's well positioned to offer an efficient way to parse an Integer without instantiating a useless temporary string.
Basically, an optimized way to do scanner.scan(/\d+/).to_i.
The API could be any of:
scanner.scan(/\d+/, :to_i)
scanner.scan(/\d+/, Integer)
scanner.scan_integer(/\d+/)
Logically, the two supported types would be Integer and Float, but perhaps others would be helpful for other protocols?
@kou as maintainer of strscan, do you have any opinion? I'm happy to put in the work on this, but I'd need to know if the feature is desired, and which API would be deemed acceptable.
Also cc @tenderlove @mame from previous discussions.
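For readers unfamiliar with the workaround mentioned under Context, this is roughly what reimplementing atoi with String#getbyte looks like. `atoi_at` is a hypothetical helper written for this illustration (unsigned base-10 only, no overflow concerns since Ruby Integers are arbitrary precision); it is not taken from any gem.

```ruby
require "strscan"

# Hypothetical helper: parses unsigned base-10 digits starting at byte
# offset `pos` without allocating a substring.
# Returns [value, new_pos], or nil if there is no digit at `pos`.
def atoi_at(string, pos)
  value = 0
  start = pos
  while (byte = string.getbyte(pos)) && byte >= 48 && byte <= 57 # '0'..'9'
    value = value * 10 + (byte - 48)
    pos += 1
  end
  pos == start ? nil : [value, pos]
end

# RESP-style bulk string header: "$", a length, then CRLF.
header = "$1234\r\n"
parsed = atoi_at(header, 1) # => [1234, 5]

# The allocating alternative the proposal wants to optimize.
scanner = StringScanner.new(header)
scanner.skip(/\$/)
via_scan = scanner.scan(/\d+/).to_i # => 1234, via a temporary "1234" string
```

The two paths produce the same value; the difference is only that the second one instantiates and copies a temporary substring first.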