Better regex library #30

japanoise · 2024-07-18T13:15:39Z

tiny-regex-c just barely does what we want it to, and is not really actively maintained upstream (iirc the maintainer answers issues but doesn't merge PRs or do active development). Basically the regex support here is "well, it's good enough to do very basic tasks", not nearly as good as real Emacs, and not extensible because I cba to hop into a hostile new codebase to maintain a regex library (regexes are useful, but I have no interest in learning how they work).

It'd be better if we could either:

- Replace it with our own regex library (doesn't have to be complex, just have some logic in place to play more nicely with the data structures here)
- Use some common regex library - PCRE is required basically everywhere, so it wouldn't be a huge ask, but there's something really nice about having everything inside the one repository (actually, that's half the reason emsys exists).

I believe vim and other text editors use forks of a dog-eared old regex.h that was going around Usenet at the time; we may be able to base our work on that, if I can find it again.

nicholascarroll · 2024-07-19T06:51:22Z

Both options are good. Making one (or finding an existing one) that exactly mimics Emacs would be ideal, as long as it truly matches, while PCRE is also a strong choice.

Another way to look at this is to match how grep works. There is POSIX grep, GNU grep and PCRE-style grep (using the -P option).

Emacs

Implementing a similar regex engine from scratch, mimicking Emacs' behavior and features, would be pretty cool. And maybe someone has already done it and shared it?

POSIX Grep

Older systems use POSIX Basic Regular Expressions (BRE). Just like termios.h, regex.h is part of the POSIX.1-2001 standard, and it also covers the more modern POSIX Extended Regular Expressions (ERE): "Both BREs and EREs are supported by the Regular Expression Matching interface in the System Interfaces volume of IEEE Std 1003.1-2001" ~ The Open Group Base Specifications. OSX and BSD also use POSIX grep.

PCRE Grep

There is a program called pcregrep. But you can use PCRE in normal grep by using the -P option. So PCRE is not default grep. And it would be great to stick to the idea of no extra libraries and stay compatible with older POSIX systems.

GNU Grep

GNU grep does not use the regex.h provided by the system. Instead, it uses the GNU C Library (glibc) regex library, which provides additional features and extensions beyond the standard POSIX regular expressions. You can set POSIXLY_CORRECT and GNU grep will use POSIX. But the two are very similar anyway; see this REGEX comparison chart, which goes into detail about the differences between GNU ERE and POSIX ERE.

My Vote

In the end my vote goes to POSIX Extended Regex (ERE).
I am a fan of POSIX 2001 compliance because I actually intend to use this toy for real work on legacy servers. I have already got CFLAGS+=-std=c99 -D_POSIX_C_SOURCE=200112L enabled in my windowScroll branch (I only needed to use a custom stringdup instead of strdup).

So then it is simply:

#include <regex.h>

regcomp(&regex, pattern, REG_EXTENDED)
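To make the two lines above concrete, here is a minimal sketch of a compile-and-match round trip with the system regex.h. The helper name ere_matches is made up for illustration and is not part of emsys; only regcomp, regexec and regfree come from POSIX.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not from emsys): returns 1 if `text` matches the
 * POSIX ERE `pattern`, 0 if it does not, and -1 if the pattern fails to
 * compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t regex;
    int rc;

    /* REG_NOSUB: we only want a yes/no answer, not match offsets. */
    if (regcomp(&regex, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&regex, text, 0, NULL, 0);
    regfree(&regex);
    return rc == 0 ? 1 : 0;
}
```

For an editor you would probably compile the pattern once per search and reuse the regex_t across lines, rather than recompiling on every call as this sketch does.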

nicholascarroll · 2024-07-19T20:26:13Z

I just tried it out on MSYS2 / mingw64. Seems to work fine. I got regex.h from
$ pacman -S mingw-w64-x86_64-libsystre
regcomp(&regex, pattern, REG_EXTENDED)

nicholascarroll · 2024-07-20T15:07:20Z

Flaw in my thinking here: I assumed that a POSIX 2001 system's regex.h would work with your (very well implemented) bundled UTF-8 without a crazy amount of frigging around. Would need to bundle it.

Thinking more about it, GNU regex would theoretically be the best choice cos it supports most of the same constructs as Emacs regex. And you could have a config.def.h option to enable POSIXLY_CORRECT. Only problem is your project would become GPL.

GNU grep version 2.6 introduced UTF-8 support in 2010 (took their time :-o). It's in file regex_internal.h|c. It conditionally includes localcharset.h which is part of the GNU charset library.

Seems like a lot of work.

I would be inclined to go the other way: plan to forsake POSIX 2001 for POSIX 2008 at some point in the future, replace the bundled/custom UTF-8 management with system UTF-8 (locale, wchar, wctype), and then use the system regex.h. emsys would then have a smaller source tree, fewer dependencies on projects that might not get maintained, and a significantly smaller custom code footprint, making it more familiar and simpler for people wanting to hack on it.
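As a rough idea of what "system UTF-8" means in practice, here is a small sketch that lets the C library decode multibyte text through the locale machinery instead of custom code. The function names init_locale and count_chars are invented for this example; setlocale and mbrtowc are standard. It assumes the user's environment selects a UTF-8 locale, which is not guaranteed on every legacy box.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical startup hook: adopt the user's locale (e.g. en_US.UTF-8).
 * Returns 1 on success, 0 if the environment's locale is unavailable. */
int init_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Counts characters (not bytes) in `s` according to the current locale's
 * encoding. Returns (size_t)-1 on an invalid or truncated sequence. */
size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t n = 0;
    size_t len = strlen(s);

    memset(&st, 0, sizeof st);
    while (len > 0) {
        size_t k = mbrtowc(NULL, s, len, &st);
        if (k == (size_t)-1 || k == (size_t)-2)
            return (size_t)-1;
        if (k == 0)          /* embedded NUL: cannot happen before len runs out */
            break;
        s += k;
        len -= k;
        n++;
    }
    return n;
}
```

Once the locale is set this way, the system regexec interprets multibyte input using the same encoding, which is the point of the proposal; the cost is that behavior now depends on the environment rather than on bundled code.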

nicholascarroll · 2024-07-21T19:58:45Z

Not only does re.h have very few commands, but it is also drastically slower than regex.h. I have done some wall-clock benchmarking of re.h versus the glibc regex.h (ldd reports Ubuntu GLIBC 2.35-0ubuntu3.8).

Pattern: \bword\b
re.h time: 7.587977 seconds
regex.h time: 1.227302 seconds
re.h is 518.26% slower than regex.h

Pattern: ^line
re.h time: 0.000871 seconds
regex.h time: 0.038799 seconds
re.h is 97.76% faster than regex.h

Pattern: [A-Z][a-z]+
re.h time: 16.740947 seconds
regex.h time: 0.720991 seconds
re.h is 2221.94% slower than regex.h

Pattern: \b\w{6,}\b
re.h time: 7.742236 seconds
regex.h time: 0.082319 seconds
re.h is 9305.16% slower than regex.h

Pattern: func\w+(
re.h time: 8.228889 seconds
regex.h time: 1.619219 seconds
re.h is 408.20% slower than regex.h

Pattern: \b[A-Z][A-Z0-9_]*\b
re.h time: 7.806865 seconds
regex.h time: 0.719874 seconds
re.h is 984.48% slower than regex.h

Pattern: \b(int|float|char)\b
re.h time: 7.588203 seconds
regex.h time: 0.045561 seconds
re.h is 16555.04% slower than regex.h

Pattern: //.*$
re.h time: 7.351127 seconds
regex.h time: 0.719812 seconds
re.h is 921.26% slower than regex.h

Pattern: /*[\s\S]*?*/
re.h time: 7.353695 seconds
regex.h time: 0.723316 seconds
re.h is 916.66% slower than regex.h
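For anyone wanting to reproduce this kind of measurement, here is a rough sketch of how the regex.h side could be timed and how the percentages are derived. It is not the harness used for the numbers above (that code was not posted); the names percent_slower and time_regex_h are made up, and the re.h side would need an equivalent loop around tiny-regex-c's own API.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for clock_gettime */
#endif
#include <regex.h>
#include <stddef.h>
#include <time.h>

/* How much slower `slow` is than `fast`, as a percentage of `fast`. */
double percent_slower(double slow, double fast)
{
    return (slow - fast) / fast * 100.0;
}

/* Runs the POSIX ERE `pattern` over every line, `repeats` times.
 * Stores the number of matching lines seen in *matches and returns the
 * elapsed wall-clock seconds, or -1.0 if the pattern does not compile.
 * Compilation time is deliberately excluded from the measurement. */
double time_regex_h(const char *pattern, const char *const *lines,
                    size_t nlines, int repeats, long *matches)
{
    regex_t regex;
    struct timespec t0, t1;
    long hits = 0;
    size_t i;
    int r;

    if (regcomp(&regex, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (r = 0; r < repeats; r++)
        for (i = 0; i < nlines; i++)
            if (regexec(&regex, lines[i], 0, NULL, 0) == 0)
                hits++;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    regfree(&regex);
    *matches = hits;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

With the first result above, percent_slower(7.587977, 1.227302) comes out at roughly 518.26, matching the reported figure. Note that \b, \w and \s are GNU extensions rather than POSIX ERE, so a portable benchmark would stick to patterns like [A-Z][a-z]+.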

On the other hand, I never really expected much from a lightweight little editor. I mean, I would just use grep/sed/awk.

So this issue is just for incremental regex search and replace right?

japanoise · 2024-07-22T00:00:33Z

> So this issue is just for incremental regex search and replace right?

For now, yeah. I'm not really planning on using it for much else than basic interactive usage.
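For that interactive use case, the core operation is "find the next match at or after the cursor". A minimal sketch with the system regex.h might look like the following; find_next is a hypothetical name, and it assumes a line is available as a NUL-terminated byte string, which may not be how emsys stores rows.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the first match of `re` in `line` at or after
 * byte offset `start`. On success returns 1 and stores byte offsets
 * relative to the start of `line`; returns 0 when there is no match.
 * `re` must have been compiled without REG_NOSUB. */
int find_next(const regex_t *re, const char *line, size_t start,
              size_t *match_start, size_t *match_end)
{
    regmatch_t m;
    /* When resuming mid-line, tell regexec that `^` must not match here. */
    int flags = start > 0 ? REG_NOTBOL : 0;

    if (start > strlen(line))
        return 0;
    if (regexec(re, line + start, 1, &m, flags) != 0)
        return 0;

    *match_start = start + (size_t)m.rm_so;
    *match_end = start + (size_t)m.rm_eo;
    return 1;
}
```

Calling it again with start set to the previous match_end walks through all matches on a line. A caller would need to advance by at least one byte after an empty match (for example with a pattern like a*) to avoid looping forever.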

japanoise added the "enhancement" label (New feature or request) on Jul 18, 2024

japanoise added the "code quality" label (Non-user-facing refactors, code QoL, etc.) on Jul 24, 2024
