Strutils new sets #18193
Used set unions to define most sets. Added 'Vowels' and 'Consonants', and functions called 'isVowelAscii' and 'isConsonantAscii'. Added set 'Punctuation' for punctuation characters alongside with 'isPunctAscii' and 'removePunctAscii'. Added set 'Special' for all non-alphanumerical characters with 'isSpecialAscii' and 'removeSpecialAscii' functions
lib/pure/strutils.nim
## The set of characters a newline terminator can start with (carriage
## return, line feed).

Punctuation* = {'!', ',', '.', ':', ';', '?'}
is there an official source for that? can you add it as a comment in code?
(and do other languages like python, C++ etc implement it the same)
This is just a set I made up, without a standard... which is a terrible idea now that you mention it...
Haskell does have an isPunctuation function, however, that works properly on all Unicode characters.
I will change it to have all ASCII characters that are categorized as punctuation in Haskell.
python3 -c 'import string; print(string.punctuation)'
!"#$%&'()*+,-./:;<=>?@[]^_`{|}~
please also do some research for other popular languages; if there's consensus among those, then adding it is a no-brainer, if not, then we need to think a bit
Yeah, Haskell lists characters a little differently...
I changed it to that temporarily, will look up some other languages
At first it looked like C and Python were off by 1 char, but they agree: I was missing the \ because python3 -c 'import string; print(string.punctuation)' was interpreting \ on the shell, so it didn't show up in the output.
=> so the correct definition should be:
const Punctuation = {'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'}
C and Python being in agreement seems like a good enough standard
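proc ispunct(a: cint): cint {.importc, header: "<ctype.h>".}

proc main() =
  # Collect every char that C's ispunct accepts (default "C" locale).
  var s: set[char]
  for c in char.low..char.high:
    if c.cint.ispunct != 0: # ispunct returns non-zero for punctuation
      s.incl c
  echo s
  # Collect the characters of Python's string.punctuation.
  var s2: set[char]
  for c in """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""":
    s2.incl c
  echo s2
  echo s == s2

main()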
{'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'}
{'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'}
true
The comment that closed the other PR (#16708 (comment)) would be justified if there was no definition or agreement on punctuation, but the fact that both C and python agree (assuming default c locale) is a strong argument.
There's actually a definition of punctuation:
The standard "C" locale considers punctuation characters all graphic characters (as in isgraph) that are not alphanumeric (as in isalnum).
Applications can always choose to use a different set for their needs, but having an accepted standard makes sense.
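The quoted definition can be cross-checked from the Python side as well. The following is only a sketch, not part of the PR: it builds the ASCII graphic range by hand (an assumption standing in for isgraph in the default "C" locale), subtracts the alphanumerics, and compares the result with Python's string.punctuation. All variable names are illustrative.

```python
import string

# ASCII graphic characters: everything from '!' (0x21) to '~' (0x7E).
# Assumption: this hand-built range stands in for C's isgraph in the "C" locale.
graphic = {chr(code) for code in range(0x21, 0x7F)}

# ASCII alphanumerics, built from the string module constants.
alphanumeric = set(string.ascii_letters + string.digits)

# Punctuation per the quoted definition: graphic minus alphanumeric.
punctuation = graphic - alphanumeric

# repr() keeps the backslash visible regardless of how the shell treats it.
print(repr("".join(sorted(punctuation))))
print(punctuation == set(string.punctuation))
```

Printing through repr() avoids the situation described earlier in the thread, where the backslash did not show up in the output.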
The standard "C" locale considers punctuation characters all graphic characters (as in isgraph) that are not alphanumeric (as in isalnum).
Then there is no need to add a Punctuation set that someone will have to look up how it is specified, because Printable - AlphaNumeric is much clearer and "self-documenting".
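Whether a plain set difference reproduces the punctuation set depends on which base set is used. As a hedged illustration in Python (not the PR's Nim code), assuming "printable" follows C's isprint (space through '~') and "graphic" follows isgraph (the same range without the space), the two differences disagree by exactly one character:

```python
import string

# Assumption: "printable" follows C's isprint, i.e. space (0x20) through '~' (0x7E).
printable = {chr(code) for code in range(0x20, 0x7F)}
# "Graphic" follows C's isgraph: printable without the space.
graphic = printable - {" "}
alphanumeric = set(string.ascii_letters + string.digits)

# Starting from the printable set leaves one extra element: the space.
print((printable - alphanumeric) - set(string.punctuation))
# Starting from the graphic set matches Python's punctuation constant.
print(graphic - alphanumeric == set(string.punctuation))
```

Under these assumptions the set-difference spelling works as long as it starts from the graphic set, or the space is removed explicitly.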
@@ -84,31 +84,42 @@ from std/private/strimpl import cmpIgnoreStyleImpl, cmpIgnoreCaseImpl, startsWit

 const
-  Whitespace* = {' ', '\t', '\v', '\r', '\l', '\f'}
+  Whitespace* = {' ', '\t', '\v', '\r', '\n', '\f'}
also, a separate PR would be welcome to change all instances of \l to \n in the nim repo (and it'd only do that plus maybe related changes)
Co-authored-by: Timothee Cour <timothee.cour2@gmail.com>
…into strutils-new-sets
Should new additions go into
for CI failures, try:
when nimvm: discard
else:
  ... # things with importc
(but test locally before pushing ;-) )
Oof, most tests fail with JS... If I understand it correctly, they fail on the set checks, where we check if they align with the C stdlib...
this works:
from stdtest/testutils import disableVm
=>
from stdtest/testutils import disableVm, whenVMorJs
and then:
whenVMorJs: discard
do:
  block: # Whitespace
    proc isspace(c: cint): cint {.importc, header: "<ctype.h>".}
    ...
(EDIT: I pushed it in your branch)
The refactorings are fine and would be accepted but I consider
Python and C actually agree on all those definitions, as noted in the review comments. Vowels and Consonants (in a previous version of this PR) have been removed because they lack a standardized definition. This PR follows a widely accepted standard, the POSIX character classes, and lots of languages follow this standard:
etc
they won't change those definitions, that'd break code; if needed they'll simply add another character class
ditto, we'll just add a new character class if ever needed
ASCII is stable and has been around for a while; it's not a moving target. Nim should follow widely accepted standards and not make client code re-invent the wheel (inconsistently)
Ok, so then we "only" need to add more of these classes for consistency with Posix. IMHO these "character classes" exist because the respective environments lack Nim's set construct. The classes feel archaic in the days of Unicode. Who cares if the character is an
input validation/filtering is a typical use case. D and python3 also have such ascii modules (https://dlang.org/phobos/std_ascii.html, https://docs.python.org/3/library/curses.ascii.html)
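To make the filtering use case concrete, here is a small Python sketch. The helper names remove_punct_ascii and is_punct_ascii are hypothetical; they only mirror the removePunctAscii and isPunctAscii names from the PR description and are not an existing API in either language.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def remove_punct_ascii(text: str) -> str:
    """Return text with ASCII punctuation removed (hypothetical helper)."""
    return text.translate(_DELETE_PUNCT)

def is_punct_ascii(ch: str) -> bool:
    """True if ch is a single ASCII punctuation character (hypothetical helper)."""
    return len(ch) == 1 and ch in string.punctuation

print(remove_punct_ascii("Hello, world! (v1.0)"))  # prints: Hello world v10
```

Both helpers rely on one shared definition of the punctuation set, which is the consistency argument made in the comment above.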
It should be
you mean refs: #17722 (comment)
(separate topic but stdx/chains instead of std/chains would be nice too)
No, I mean
not sure what we gain by multiplying the number of top-level prefixes:
import std/strbasics
import experimental/diff
import packages/docutils/rstgen
import stdx/syntaxes # still being debated, see also `extensions` (https://github.com/nim-lang/Nim/pull/17722#issuecomment-820601135)
import dist/ascii
not to mention all the ones where std prefix is optional, eg:
A module-level metadata, top-level doc comment, or separate index would all be better suited than categorizing modules based on a subjective notion of fitness. After all we also already have
That just invites parallel sub-categories, eg:
std/js/foo
dist/js/foo
which is a bad thing, for the same reason parallel APIs are bad. Most other languages' standard libraries manage to do with a single top-level prefix, which avoids scope pollution.
But ok, we can also develop the standard library with rigor instead. "Here, I found X to be useful so I added it and C++ and Python also have it" is pretty far away from what I consider to be acceptable. Especially since these languages all lack Nim's set construct. And it's even worse, I cannot bring up the argument that "we don't need std/asciitables because Python also lacks std/asciitables" as you simply don't care when it's the other way round. Everything should always be added to "std", immediately, giving us the superset of things that are in C++ and Python plus all the things that happen to be useful inside the Nim compiler. It's terrible.
For the sake of moving forward with this, I'm ok with creating
right now it's in so, ok to move forward with this PR after modifying it so that new sets go to
Let's go with
ok, ping @kintrix007 :) (EDIT: ie, introduce
First of all, sorry for being away for "a short while"... Do we want to move all string sets to
good that you raise this point. I think the cleanest solution is as follows:
I've been wanting such a module ( The scope of benefits
more generally, I find the addition of lib/std/private truly liberating, much better than the convoluted ways that preceded it with system / include files @Araq wdyt ?
To me it would make sense to add all sets there. Since if it does not export all, then there is gonna be that awkwardness of having to also implement I am not misunderstanding something, right?
there is no double-declaration if it's the same symbol (eg via import + export).
# main
import t12425b, t12425c, t12425d
echo A1
# t12425b
const A1* = {1,2,3} ## some comment
# t12425c
import t12425b
export A1
# t12425d
import t12425b
export A1
ping @kintrix007
I'm sorry, I do not really have time to do this around now, but will get to it when I have time and will.
I will not come back to this, sorry. If this is still needed, which I kind of doubt, someone else is probably gonna do it.
Made it more consistent how the sets are defined, also changed them to use set unions wherever possible.
Added Vowels and Consonants, and functions called isVowelAscii and isConsonantAscii.
Added set Punctuation for punctuation characters along with isPunctAscii and removePunctAscii.
Added set Special for all non-alphanumerical characters with isSpecialAscii and removeSpecialAscii functions.
Added sets ControlChars, GraphicChars, PrintableChars, and Punctuation.
Just some simple additions that come in handy when dealing with text.
future work