Better regex library #30
Both options are good. Making (or finding) one that exactly mimics Emacs would be ideal, as long as it truly matches, while PCRE is also a strong choice. Another way to look at this is to match how grep works. There is POSIX grep, GNU grep and PCRE-style grep (using the -P option).

Emacs
Implementing a similar regex engine from scratch, mimicking Emacs' behavior and features, is pretty cool. And maybe someone has done it and shared?

POSIX Grep
The older systems use POSIX basic regex (BRE). Just like termios.h, regex.h is part of the POSIX.1-2001 standard, which also covers the modern Extended POSIX Regex (ERE). "Both BREs and EREs are supported by the Regular Expression Matching interface in the System Interfaces volume of IEEE Std 1003.1-2001" ~ The Open Group Base Specifications. OSX and BSD also use POSIX grep.

PCRE Grep
There is a program called pcregrep. But you can also use PCRE in normal (GNU) grep by using the -P option. So PCRE is not default grep. And it would be great to stick to the idea of no extra libraries and stay compatible with older POSIX systems.

GNU Grep
GNU grep does not use the regex.h provided by the system. Instead, it uses the GNU C Library (glibc) regex library, which provides additional features and extensions beyond the standard POSIX regular expressions. You can set POSIXLY_CORRECT and GNU grep will use POSIX. But the two are very similar anyway, see this REGEX comparison chart which goes into details about the differences between GNU ERE and POSIX ERE.

My Vote
In the end my vote goes to POSIX Extended Regex (ERE). So then it is simply:
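A minimal sketch of what searching with POSIX ERE through the system regex.h could look like. This is an illustration, not code from emsys: the helper name `ere_search` and its signature are hypothetical, and it assumes the caller keeps `start` within the length of `text`.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of emsys): find the first match of the
 * POSIX ERE `pattern` in `text`, searching from byte offset `start`.
 * Returns 1 and fills *match_start / *match_end (byte offsets into
 * `text`) on a match, 0 on no match, -1 if the pattern does not compile.
 * Assumption: start <= strlen(text). */
int ere_search(const char *pattern, const char *text, size_t start,
               size_t *match_start, size_t *match_end)
{
    regex_t re;
    regmatch_t m;

    /* REG_EXTENDED selects ERE; without it regcomp() uses BRE. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* When resuming mid-line, tell regexec() that `text + start` is not
     * the beginning of a line, so `^` does not match there. */
    int eflags = (start > 0) ? REG_NOTBOL : 0;
    int rc = regexec(&re, text + start, 1, &m, eflags);
    regfree(&re);

    if (rc != 0)
        return 0;
    *match_start = start + (size_t)m.rm_so;
    *match_end = start + (size_t)m.rm_eo;
    return 1;
}
```

For an incremental search, an editor would probably compile the pattern once per keystroke and reuse the `regex_t` across lines rather than recompiling on every call as this sketch does.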
I just tried it out on MSYS2 / mingw64. Seems to work fine. I got regex.h from
Flaw in my thinking here: to think that a POSIX 2001 system's regex.h would work with your (very well implemented) bundled UTF-8 without a crazy amount of frigging around. Would need to bundle it.

Thinking more about it, GNU regex would theoretically be the best choice cos it supports most of the same constructs as Emacs regex. And you could have a config.def.h option to enable POSIXLY_CORRECT. Only problem is your project would become GPL. GNU grep version 2.6 introduced UTF-8 support in 2010 (took their time :-o). It's in file regex_internal.h|c. It conditionally includes localcharset.h which is part of the GNU charset library. Seems like a lot of work.

I would be inclined to go the other way and plan to, some time in the future, forsake POSIX 2001 for POSIX 2008, replace the bundled/custom UTF-8 management with system UTF-8 (locale, wchar, wctype) and then be able to use the system regex.h. Then emsys would have the advantage of minimizing its source code size, having fewer dependencies on projects that might not get maintained, and significantly reducing its custom code footprint, making it more familiar and simpler for people wanting to hack on it.
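To make the "system UTF-8 plus system regex.h" idea a bit more concrete, here is a rough sketch, assuming a POSIX system with langinfo.h. Both function names are hypothetical and nothing here is taken from emsys. The first helper adopts the user's locale and checks its codeset; the second probes how many bytes a single `.` consumes, which is one way to see whether the libc regex engine is treating text as multi-byte characters.

```c
#include <langinfo.h>
#include <locale.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: adopt the user's locale for character handling
 * and report whether its codeset is UTF-8 (1) or not (0). */
int use_system_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0; /* the locale named by the environment is unavailable */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}

/* Hypothetical probe: how many bytes does a single ERE `.` consume at
 * the start of `text`? Returns -1 on compile failure or no match. */
long dot_match_bytes(const char *text)
{
    regex_t re;
    regmatch_t m;
    long len = -1;

    if (regcomp(&re, "^.", REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0)
        len = (long)(m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

The expectation is that, after `use_system_utf8()` succeeds, the probe reports more than one byte for a non-ASCII character, but that depends on the libc and on which locales are installed, so it is worth checking on each target platform.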
Not only does re.h have very few commands but it is also drastically slower than regex. I have done some wall clock time benchmarking of re.h versus the regex.h (ldd (Ubuntu GLIBC 2.35-0ubuntu3.8) 2.35).

Pattern: \bword\b
Pattern: ^line
Pattern: [A-Z][a-z]+
Pattern: \b\w{6,}\b
Pattern: func\w+(
Pattern: \b[A-Z][A-Z0-9_]*\b
Pattern: \b(int|float|char)\b
Pattern: //.*$
Pattern: /*[\s\S]*?*/

On the other hand, I never really expected much from a lightweight little editor. I mean, I would just use grep/sed/awk. So this issue is just for incremental regex search and replace right?
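For anyone wanting to repeat this kind of measurement, a minimal sketch of a wall clock timing loop around regex.h follows. It is not the harness used above: the function names are hypothetical, it only covers the regex.h side, and it uses C11 `timespec_get`, which reads the real-time clock rather than a monotonic one.

```c
#include <regex.h>
#include <stddef.h>
#include <time.h>

/* Count non-overlapping matches of the compiled pattern in `text`. */
size_t count_matches(const regex_t *re, const char *text)
{
    size_t count = 0;
    regmatch_t m;
    int eflags = 0;

    while (*text != '\0' && regexec(re, text, 1, &m, eflags) == 0) {
        count++;
        /* Step past the match; move at least one byte on an empty
         * match at offset 0 so the loop always makes progress. */
        text += (m.rm_eo > 0) ? m.rm_eo : 1;
        eflags = REG_NOTBOL; /* no longer at the start of the line */
    }
    return count;
}

/* Hypothetical harness: run the ERE `pattern` over `text` `iterations`
 * times. Returns elapsed seconds, or -1.0 if the pattern does not
 * compile; *matches receives the match count of the last pass. */
double time_pattern(const char *pattern, const char *text,
                    int iterations, size_t *matches)
{
    regex_t re;
    struct timespec t0, t1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1.0;

    *matches = 0;
    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < iterations; i++)
        *matches = count_matches(&re, text);
    timespec_get(&t1, TIME_UTC);
    regfree(&re);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

One caveat: because each pass resumes from a fresh pointer, regexec cannot see the character before the resume point, so patterns using word boundaries may be counted slightly differently than a single-pass engine would count them. Compilation is kept outside the timed region so that only matching is measured.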
For now, yeah. I'm not really planning on using it for much else than basic interactive usage.
tiny-regex-c just barely does what we want it to, and is not really actively maintained upstream (iirc the maintainer answers issues but doesn't merge PRs or do active development). Basically the regex support here is "well, it's good enough to do very basic tasks", not nearly as good as real Emacs, and not extensible because I cba to hop into a hostile new codebase to maintain a regex library (regexes are useful, but I have no interest in learning how they work).
It'd be better if we could either:
I believe vim and other text editors use forks of a dog-eared old regex.h that was going around Usenet at the time; we may be able to work based on that, if I can find it again.