-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regexp/syntax: document limit of 1000 in {n,m} quantifier forms #7252
Labels
Milestone
Comments
For reference, other popular regular expression engines have significantly higher limits, all close to maxima of the integer types used internally: Perl: "n and m are limited to non-negative integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms." (http://search.cpan.org/~rjbs/perl-5.18.2/pod/perlre.pod) PCRE: "All values in repeating quantifiers must be less than 65536." (http://man7.org/linux/man-pages/man3/pcrelimits.3.html) Python: no documented limit, architecture dependent. Observed limits include 2^16-1 (https://gist.github.com/athomason/7432497b3f6f8ba73c26) and 2^32-1 (https://5621528620171264-dot-shared-playground.appspot.com/?set_access_key_cookie=z1iuou3sszu4y3q9ntq03n81d). Ruby: undocumented limit of 100,000 (https://github.com/ruby/ruby/blob/trunk/include/ruby/oniguruma.h#L341) Java: undocumented effective limit of 2^31-1, the maximum value of an int. (http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/regex/Pattern.java?av=f#3067) Javascript: no documented limit; Chrome 33 and Firefox 27 accept m >= 2^31 (in dev console/firebug, `/x{1,2147483646}/.test("x")` ==> true), IE 11 accepts m <= 2^31. |
I have a program that accepts a regexp as a way to filter its line-based data stream. I had a need to filter for lines with a minimum length of 1800 bytes, but /.{1800,}/ was rejected as a pattern. Obviously there are much better ways to do this, but this would have been one that didn't require a code change, rebuild, and deployment. I would argue that this is a "[reasonable]" desire as long as there's no downside to supporting it. Note that strings.Repeat(".", 1800) is accepted as a pattern, so there seems little sense in prohibiting the equivalent quantified version. Since the current value appears entirely arbitrary, it might as well be an equally arbitrary value that is more permissive. I would guess that the check was originally added to prevent runaway memory use and compilation time, which is fair to a point. But with the check removed from tip, I can compile /.{100000,}/ in 22MB in 23ms, which isn't at all crazy. Both values appear to increase about linearly; compiling /.{10000000,}/ takes a couple seconds and consumes ~2.5GB. Regardless of what magic value strikes the right balance, it ought to be documented. I'm happy to submit patch(es) to change the value and/or add docs. |
I don't think a regular expression is the correct tool for checking line length when there's a much better performing* built-in len(). IOW, this isn't IMO a compelling reason to extend the limit. Also, I suggest to intentionally leave the limit undocumented. Otherwise some code can start to rely on this limit, which it should not. (*): In your case in several orders of magnitude, both in space and time. |
> I don't think a regular expression is the correct tool for checking line length when there's a much better performing* built-in len(). That's obvious, and ignores that the ability still would have been useful in at least one real case. It's already allowed with worse syntax and identical performance. Clearly there's not a policy prohibiting sub-optimal implementations. > IOW, this isn't IMO a compelling reason to extend the limit. That's not a reason to not extend it. No thoughtful basis for the current value has been offered. Is your position that inertia alone is a reason to reject a change that would be helpful for some people? > Also, I suggest to intentionally leave the limit undocumented. Otherwise some code can start to rely on this limit, which it should not. Do you really prefer the current situation where the only explanation available is in uncommented source code? |
> That's obvious, and ignores that the ability still would have been > useful in at least one real case. It's already allowed with worse > syntax and identical performance. Clearly there's not a policy > prohibiting sub-optimal implementations. Multiplication is useful. Even when someone does it by repetitive addition. But there's no reason to encourage the incorrect approach by providing support of it. > That's not a reason to not extend it. No thoughtful basis for the > current value has been offered. Is your position that inertia alone > is a reason to reject a change that would be helpful for some people? That's backward. There should be a reason to extend it. > Do you really prefer the current situation where the only explanation > available is in uncommented source code? Yes, I really do and I even wrote why. |
Disagreement is not a problem, is it? ;-) Correct/proper use of the regexp package does/should not run into the limitation. If it would, then there would be a reason to raise the limit. That's IMO why the limit is already so impractically high. I'd guess ~100 would suffice 99% of legitimate cases. |
The limit is there because - unlike all the other implementations you mentioned - package regexp implements x{1000} by making 1000 copies of x. There has to be a limit. The other implementations use a counter, but using a counter precludes them from providing a useful runtime guarantee. We chose predictable performance; they chose an algorithm that allows input+regexp combinations that will run until the sun burns out. But having paide that cost, they can support larger counts quite "cheaply". See swtch.com/~rsc/regexp/regexp1.html for more details. Will fix docs for Go 1.3. Labels changed: added release-go1.3, documentation. Status changed to Accepted. |
Comment 11 by adam@fastly.com: Thanks for the note and link. A doc fix seems appropriate given the implementation. |
For the benefit of anyone coming across this issue in the future, the trivial workaround is to split the quantified expression into repeated copies of the desired match. See http://play.golang.org/p/XjdKuTKx3t for an example for /x{n}/. /x{n,}/ as originally encountered is not much more complicated. |
This issue was closed by revision 929ee59. Status changed to Fixed. |
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The text was updated successfully, but these errors were encountered: