Better regex library #30

Open
japanoise opened this issue Jul 18, 2024 · 5 comments
Labels
code quality Non-user-facing refactors, code QoL, etc. enhancement New feature or request

Comments

@japanoise
Owner

tiny-regex-c just barely does what we want it to, and is not really actively maintained upstream (iirc the maintainer answers issues but doesn't merge PRs or do active development). Basically the regex support here is "well, it's good enough to do very basic tasks", not nearly as good as real Emacs, and not extensible because I cba to hop into a hostile new codebase to maintain a regex library (regexes are useful, but I have no interest in learning how they work).

It'd be better if we could either:

  • Replace it with our own regex library (doesn't have to be complex, just have some logic in place to play more nicely with the data structures here)
  • Use some common regex library - PCRE is required basically everywhere, so it wouldn't be a huge ask, but there's something really nice about having everything inside the one repository (actually, that's half the reason emsys exists).

I believe vim and other text editors use forks of a dogeared old regex.h that was going around usenet at the time; we may be able to work based on that, if I can find it again.

@japanoise japanoise added the enhancement New feature or request label Jul 18, 2024
@nicholascarroll
Contributor

nicholascarroll commented Jul 19, 2024

Both options are good. Making (or finding) one that exactly mimics Emacs would be ideal, as long as it truly matches; PCRE is also a strong choice.

Another way to look at this is to match how grep works. There is POSIX grep, GNU grep, and PCRE-style grep (using the -P option).

Emacs

Implementing a similar regex engine from scratch, mimicking Emacs' behavior and features, would be pretty cool. Maybe someone has already done it and shared it?

POSIX Grep

Older systems use POSIX Basic Regex ('BRE'); the modern flavor is POSIX Extended Regex ('ERE'). Just like termios.h, regex.h is part of the POSIX.1-2001 standard, and it covers both flavors: "Both BREs and EREs are supported by the Regular Expression Matching interface in the System Interfaces volume of IEEE Std 1003.1-2001" ~ The Open Group Base Specifications. OSX and BSD also use POSIX grep.

PCRE Grep

There is a program called pcregrep, but you can also use PCRE in normal grep with the -P option. So PCRE is not grep's default. And it would be great to stick to the idea of no extra libraries and stay compatible with older POSIX systems.

GNU Grep

GNU grep does not use the regex.h provided by the system. Instead, it uses the GNU C Library (glibc) regex library, which provides additional features and extensions beyond the standard POSIX regular expressions. You can set POSIXLY_CORRECT and GNU grep will use POSIX. But the two are very similar anyway; see this REGEX comparison chart, which goes into detail about the differences between GNU ERE and POSIX ERE.

My Vote

In the end my vote goes to POSIX Extended Regex (ERE)
I am a fan of POSIX 2001 compliance because I actually intend to use this toy for real work on legacy servers. I have already got the
CFLAGS+=-std=c99 -D_POSIX_C_SOURCE=200112L enabled in my windowScroll branch (only needed to use a custom stringdup instead of strdup).
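For context on the custom stringdup: as far as I know, strdup was still an XSI extension in POSIX.1-2001 and only moved into the base standard in 2008, so with -std=c99 and _POSIX_C_SOURCE=200112L it is typically not declared. A minimal sketch of such a replacement follows; the name `stringdup` comes from the comment above, but the body is an assumption and not the code from the windowScroll branch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for strdup(): returns a malloc'd copy of `s`,
 * or NULL if allocation fails. The caller frees the result. */
char *stringdup(const char *s)
{
    size_t n;
    char *copy;

    assert(s != NULL);
    n = strlen(s) + 1;          /* include the terminating NUL */
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}
```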

So then it is simply:

#include <regex.h>

regcomp(&regex, pattern, REG_EXTENDED)
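To make the one-liner above concrete, here is a rough sketch of a full compile/search/free cycle with the POSIX regex.h interface. The helper name `ere_find` and its return convention are made up for illustration; only regcomp, regexec, regfree and REG_EXTENDED come from POSIX.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: find the first match of the ERE `pattern` in `text`.
 * Returns 1 on a match (byte offsets stored in *start and *end),
 * 0 if regexec() reports no match or fails, and -1 if the pattern
 * does not compile. */
int ere_find(const char *pattern, const char *text, size_t *start, size_t *end)
{
    regex_t regex;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&regex, text, 1, m, 0);
    regfree(&regex);            /* always release the compiled pattern */
    if (rc != 0)
        return 0;

    *start = (size_t)m[0].rm_so;
    *end = (size_t)m[0].rm_eo;
    return 1;
}
```

For incremental search in an editor, the compiled `regex_t` would more likely be kept around between keystrokes rather than recompiled on every call; the sketch recompiles only to stay short.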

@nicholascarroll
Contributor

I just tried it out on MSYS2 / mingw64. Seems to work fine. I got regex.h from
$ pacman -S mingw-w64-x86_64-libsystre
regcomp(&regex, pattern, REG_EXTENDED)

@nicholascarroll
Contributor

nicholascarroll commented Jul 20, 2024

Flaw in my thinking here: I assumed that a POSIX 2001 system's regex.h would work with your (very well implemented) bundled UTF-8 without a crazy amount of frigging around. It wouldn't, so a regex library would need to be bundled.

Thinking more about it, GNU regex would theoretically be the best choice cos it supports most of the same constructs as Emacs regex. And you could have a config.def.h option to enable POSIXLY_CORRECT. Only problem is your project would become GPL.

GNU grep version 2.6 introduced UTF-8 support in 2010 (took their time :-o). It's in file regex_internal.h|c. It conditionally includes localcharset.h which is part of the GNU charset library.

Seems like a lot of work.

I would be inclined to go the other way: plan to forsake POSIX 2001 for POSIX 2008 some time in the future, replace the bundled/custom UTF-8 management with system UTF-8 (locale, wchar, wctype), and then be able to use the system regex.h. emsys would then have less source code, fewer dependencies on projects that might not get maintained, and a significantly smaller custom code footprint, making it more familiar and simpler for people wanting to hack on it.
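As a rough idea of what "system UTF-8" could look like, the sketch below counts characters with the standard multibyte functions once a UTF-8 locale is selected. The helper names are hypothetical, and the locale names tried ("C.UTF-8", "en_US.UTF-8") are assumptions that vary between systems; this is not how emsys currently handles UTF-8.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try to select a UTF-8 LC_CTYPE locale.
 * Returns 1 on success, 0 if neither candidate locale is installed. */
int use_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: count characters (not bytes) in `s` according to
 * the current locale's multibyte encoding. Returns -1 on an invalid or
 * truncated sequence. */
long mb_char_count(const char *s)
{
    mbstate_t st;
    size_t left;
    long count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);  /* start in the initial shift state */
    left = strlen(s);
    while (left > 0) {
        size_t n = mbrlen(s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;          /* invalid or incomplete sequence */
        if (n == 0)
            break;              /* NUL character; not expected here */
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

With the locale set this way, the system regcomp/regexec would also interpret patterns and text in that encoding, which is the point of the proposal above.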

@nicholascarroll
Contributor

nicholascarroll commented Jul 21, 2024

Not only does re.h have very few commands, it is also drastically slower than regex.h. I have done some wall-clock benchmarking of re.h versus the system regex.h (ldd (Ubuntu GLIBC 2.35-0ubuntu3.8) 2.35).

Pattern: \bword\b
re.h time: 7.587977 seconds
regex.h time: 1.227302 seconds
re.h is 518.26% slower than regex.h

Pattern: ^line
re.h time: 0.000871 seconds
regex.h time: 0.038799 seconds
re.h is 97.76% faster than regex.h

Pattern: [A-Z][a-z]+
re.h time: 16.740947 seconds
regex.h time: 0.720991 seconds
re.h is 2221.94% slower than regex.h

Pattern: \b\w{6,}\b
re.h time: 7.742236 seconds
regex.h time: 0.082319 seconds
re.h is 9305.16% slower than regex.h

Pattern: func\w+(
re.h time: 8.228889 seconds
regex.h time: 1.619219 seconds
re.h is 408.20% slower than regex.h

Pattern: \b[A-Z][A-Z0-9_]*\b
re.h time: 7.806865 seconds
regex.h time: 0.719874 seconds
re.h is 984.48% slower than regex.h

Pattern: \b(int|float|char)\b
re.h time: 7.588203 seconds
regex.h time: 0.045561 seconds
re.h is 16555.04% slower than regex.h

Pattern: //.*$
re.h time: 7.351127 seconds
regex.h time: 0.719812 seconds
re.h is 921.26% slower than regex.h

Pattern: /*[\s\S]*?*/
re.h time: 7.353695 seconds
regex.h time: 0.723316 seconds
re.h is 916.66% slower than regex.h

On the other hand, I never really expected much from a lightweight little editor. I mean, I would just use grep/sed/awk.

So this issue is just for incremental regex search and replace right?

@japanoise
Owner Author

So this issue is just for incremental regex search and replace right?

For now, yeah. I'm not really planning on using it for much else than basic interactive usage.

@japanoise japanoise added the code quality Non-user-facing refactors, code QoL, etc. label Jul 24, 2024