Strutils new sets #18193

kintrix007 · 2021-06-05T21:22:12Z

Made it more consistent how the sets are defined, also changed them to use set unions wherever possible.

~~Added Vowels and Consonants, and functions called isVowelAscii and isConsonantAscii.~~
~~Added set Punctuation for punctuation characters alongside with isPunctAscii and removePunctAscii.~~
~~Added set Special for all non-alphanumerical characters with isSpecialAscii and removeSpecialAscii functions.~~

Added sets ControlChars, GraphicChars, PrintableChars, and Punctuation.

Just some simple additions that come in handy when dealing with text.

future work

ControlChars, PrintableChars, GraphicChars (see comments)

Used set unions to define most sets. Added 'Vowels' and 'Consonants', and functions called 'isVowelAscii' and 'isConsonantAscii'. Added set 'Punctuation' for punctuation characters alongside with 'isPunctAscii' and 'removePunctAscii'. Added set 'Special' for all non-alphanumerical characters with 'isSpecialAscii' and 'removeSpecialAscii' functions

lib/pure/strutils.nim

timotheecour · 2021-06-05T21:41:27Z

lib/pure/strutils.nim

    ## The set of characters a newline terminator can start with (carriage
    ## return, line feed).

+  Punctuation* = {'!', ',', '.', ':', ';', '?'}


is there an official source for that? can you add it as a comment in code?
(and do other languages like python, C++ etc implement it the same)

This is just a set I made up, without a standard... which is a terrible idea now that you mention it...
Haskell does have an isPunctuation function, however, that works properly on all unicode characters.
I will change it to have all ASCII characters that are categorized as punctuation in Haskell.

python3 -c 'import string; print(string.punctuation)'
!"#$%&'()*+,-./:;<=>?@[]^_`{|}~

please also do some research for other popular languages; if there's consensus among those, then adding it is a no-brainer, if not, then we need to think a bit

Yeah, Haskell lists characters a little differently...
I changed it to that temporarily, will look up some other languages

~~looks like C and python are off by 1 char:~~

C and python agree (i was missing the \ as
python3 -c 'import string; print(string.punctuation)'` was interpretting \ on the shell so didn't show up in output)

=> so the correct definition should be:

const Punctuation = {'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'}

C and python in agreement seems like good enough of a standard

proc ispunct(a: cint): cint {.importc, header: "<ctype.h>".} proc main()= var s: set[char] for c in char.low..char.high: if c.cint.ispunct > 0: s.incl c echo s var s2: set[char] for c in """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""": s2.incl c echo s2 echo s == s2 main()

{'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'} {'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'} true

The comment that closed the other PR (#16708 (comment)) would be justified if there was no definition or agreement on punctuation, but the fact that both C and python agree (assuming default c locale) is a strong argument.

There's actually a definition of punctuation:

The standard "C" locale considers punctuation characters all graphic characters (as in isgraph) that are not alphanumeric (as in isalnum).

Applications can always choose to use a different set for their needs, but having an accepted standard makes sense.

The standard "C" locale considers punctuation characters all graphic characters (as in isgraph) that are not alphanumeric (as in isalnum).

Then there is no need to add a Punctuation set that someone will have to look up.how it is specified because Printable - AlphaNumeric is much clearer and "self-documenting".

lib/pure/strutils.nim

timotheecour · 2021-06-05T21:53:54Z

lib/pure/strutils.nim

@@ -84,31 +84,42 @@ from std/private/strimpl import cmpIgnoreStyleImpl, cmpIgnoreCaseImpl, startsWit


 const
-  Whitespace* = {' ', '\t', '\v', '\r', '\l', '\f'}
+  Whitespace* = {' ', '\t', '\v', '\r', '\n', '\f'}


also, a separate PR would be welcome to change all instances of \l to \n in nim repo (and it'd only do that plus maybe related changes)

lib/pure/strutils.nim

Co-authored-by: Timothee Cour <timothee.cour2@gmail.com>

…into strutils-new-sets

lib/pure/strutils.nim

juancarlospaco · 2021-06-06T12:18:15Z

Should new additions go into strbasics ?.

lib/pure/strutils.nim

timotheecour · 2021-06-06T19:21:48Z

for CI failures, try:

when nimvm: discard
else:
  ... # things with importc

(but test locally before pushing ;-) )

Co-authored-by: Timothee Cour <timothee.cour2@gmail.com>

kintrix007 · 2021-06-06T21:19:29Z

Oof, most tests fail with JS... If I understand it correctly, they fail on the set checks, where we check if they align with the C stdlib...

timotheecour · 2021-06-07T01:13:11Z

Oof, most tests fail with JS... If I understand it correctly, they fail on the set checks, where we check if they align with the C stdlib...

this works:

from stdtest/testutils import disableVm
=>
from stdtest/testutils import disableVm, whenVMorJs

and then:

  whenVMorJs: discard
  do:
    block: # Whitespace
      proc isspace(c: cint): cint {.importc, header: "<ctype.h>".}
    ...

(EDIT: i pushed it in your branch)

Araq · 2021-06-07T09:25:31Z

The refactorings are fine and would be accepted but I consider GraphicChars and others to be a misdesign:

It's clear that these are not universal definitions so people coming from C disagree with people coming from Python. Or maybe an upcoming C23 standard changes these definitions because of locals or unicode, it's hitting a moving target.
So then we have our own defintion what GraphicChars means and we cannot change it ever without breaking code. Suppose I am to implement spec X. Spec X also uses a notion of GraphicChars and defines what it means. Is it the same definition as Nim's? If so, that would be accidentical so I better write my own set definition anyway. If it differs slightly I have no chance anyway but to write my own definition. Either way we're better off not offering this at all, as the good code doesn't use it anyway. We lure people into the wrong solution by making strutils.GraphicGhars available. Much like we encourge bugs by strutils.escape's very existence.

timotheecour · 2021-06-07T19:53:43Z

It's clear that these are not universal definitions so people coming from C disagree with people coming from Python

python and C actually agree on all those definitions, as noted in review comments. Vowels and Consonants (in a previous version of this PR) have been removed because of lack of standardized definition.

this PR follows a widely accepted standard, the posix character class, and lots of languages follow this standard:

D: https://dlang.org/phobos/std_ascii.html (see implementation, i checked it gives same results)
python3: see https://docs.python.org/3/library/string.html (printable, punctuation etc) and https://docs.python.org/3/library/curses.ascii.html (isgraph, ispunct etc)
C: ctype.h: http://www.cplusplus.com/reference/cctype/
C++: <cctype>: std::ispunct etc https://en.cppreference.com/w/cpp/string/byte/ispunct
regular expression: see [:punct:] from https://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes, and also, [:graph:] , [:print:], [:cntrl:] and the rest

etc

Or maybe an upcoming C23 standard changes these definitions because of locals or unicode, it's hitting a moving target.

they won't change those definitions, that'd break code; if needed they'll simply add another character class

So then we have our own defintion what GraphicChars means and we cannot change it ever without breaking code.

ditto, we'll just add a new character class if ever needed

ascii is stable and has been around for a while, it's not a moving target. Nim should follow widely accepted standards and not make client code re-invent the wheel (inconsistently)

Araq · 2021-06-08T15:40:59Z

they won't change those definitions, that'd break code; if needed they'll simply add another character class

Ok, so then we "only" need to add more of these classes for consistency with Posix. IMHO these "character classes" exist because the respective environments lack Nim's set construct. The classes feel archaic in the days of Unicode. Who cares if the character is an
ASCII control character -- what are the use cases? "Other languages copied anachronisms from each other" is not a use case.

timotheecour · 2021-06-08T17:08:46Z

input validation/filtering is a typical use case.
I agree that std/strutils is bloated though so I'd be ok with either moving those new sets to std/strmisc (2nd choice) or adding std/ascii (1st choice, can grow in future if needed yet avoids scope creep as it's right there in the title)

D and python3 also have such an ascii modules (https://dlang.org/phobos/std_ascii.html, https://docs.python.org/3/library/curses.ascii.html)

Araq · 2021-06-08T17:11:54Z

I agree that std/strutils is bloated though so I'd be ok with either moving those new sets to std/strmisc (2nd choice) or adding std/ascii (1st choice, can grow in future if needed yet avoids scope creep as it's right there in the title)

It should be dist/ascii. We should use std for std-worthy stuff. Yes, I know this is a change in policy.

timotheecour · 2021-06-08T17:15:10Z

It should be dist/ascii.

you mean stdx/ascii (in $nim/lib/stdx/ascii.nim) ? if so I'm all for it

refs: #17722 (comment)

Ok, let's use "stdx" then.

(separate topic but stdx/chains instead of std/chains would be nice too)

Araq · 2021-06-09T06:02:53Z

you mean stdx/ascii (in $nim/lib/stdx/ascii.nim) ? if so I'm all for it

No, I mean dist/ascii. Or some similar name that implies "it's simply a module we ship with Nim". stdx still has "standard" in its name implying a level of rigor that we lack.

timotheecour · 2021-06-09T07:09:08Z

not sure what we gain by multiplying the number of top-level prefixes:

import std/strbasics
import experimental/diff
import packages/docutils/rstgen
import stdx/syntaxes # still being debated, see also `extensions` (https://github.com/nim-lang/Nim/pull/17722#issuecomment-820601135)
import dist/ascii

not to mention all the ones where std prefix is optional, eg:
import strutils

A module-level metadata, top-level doc comment, or separate index would all be better suited than categorizing modules based on a subjective notion of fitness. After all we also already have std/chains and other modules which don't quite fit the bill.

That just invites parallel sub-categories, eg:

std/js/foo
dist/js/foo

which is a bad thing, for the same reason parallel APIs are bad.

Most other language's standard libraries manage to do with a single top-level prefix which avoids scope pollution.

Araq · 2021-06-09T09:23:23Z

Most other language's standard libraries manage to do with a single top-level prefix which avoids scope pollution.

https://docs.python.org/3/library/curses.ascii.html

But ok, we can also develop the standard library with rigor instead. "Here, I found X to be useful so I added it and C++ and Python also have it" is pretty far away from what I consider to be acceptable. Especially since these languages all lack Nim's set construct.

And it's even worse, I cannot bring up the argument that "we don't need std/asciitables because Python also lacks std/asciitables" as you simply don't care when it's the other way round. Everything should always be added to "std", immediately, giving us the superset of things that are in C++ and Python plus all the things that happen to be useful inside the Nim compiler. It's terrible.

timotheecour · 2021-06-09T18:42:45Z

For the sake of moving forward with this, I'm ok with creating lib/dist/ and adding lib/dist/ascii.nim so that modules needing it can import it as import dist/ascii (although I would've preferred import std/ascii or import stdx/ascii)

std/asciitables

right now it's in lib/std/private/asciitables.nim, which means it's internal and client code isn't supposed to use it unless they know what they're doing and are ok with breaking changes, but I happen to find asciitables actually quite useful and std-worthy for future work, I often use it in my own code. Whether it shall be moved to std/asciitables, stdx/asciitables or dist/asciitables is rather secondary so long it's somewhere accessible under lib (honored by --lib flag)

so, ok to move forward with this PR after modifying it so that new sets go to lib/dist/ascii.nim and also keeping the other improvements from this PR? (the new tests should be moved, maybe under tests/stdlib/tascii for simplicity, instead of tests/dist/tascii, but I don't care too much which of those 2 files is used)

Araq · 2021-06-10T10:08:08Z

Let's go with import dist/asciisets

timotheecour · 2021-06-10T10:18:06Z

ok, ping @kintrix007 :)

(EDIT: ie, introduce lib/dist and lib/dist/asciisets.nim)

kintrix007 · 2021-06-16T18:21:27Z

ok, ping @kintrix007 :)

(EDIT: ie, introduce lib/dist and lib/dist/asciisets.nim)

First of all, sorry for being away for "a short while"...
But I'm kind of confused what we want to do now.

Do we want to move all string sets to lib/dist/asciisets.nim? If so, that would cause some back-wards incompatibility.
However, if we want to only move the newly created sets, then wouldn't it be awkward to have to import both dist/asciisets and std/strutils to have all sets to work with? As dist/asciisets would only have the sets like PrintableChars and so

timotheecour · 2021-06-16T18:59:43Z

Do we want to move all string sets to lib/dist/asciisets.nim? If so, that would cause some back-wards incompatibility. However, if we want to only move the newly created sets, then wouldn't it be awkward to have to import both dist/asciisets and std/strutils to have all sets to work with? As dist/asciisets would only have the sets like PrintableChars and so

good that you raise this point.

I think the cleanest solution is as follows:

create lib/std/private/constants.nim and add all the sets there
make std/strutils import it and re-export the ones it already defined
make dist/asciisets import it and re-export the ones that make sense there
other modules that also copy-paste the definition of a few of those sets can now import/re-export those instead

I've been wanting such a module (lib/std/private/constants.nim) for other reasons (including some things shared between compiler and stdlib, refs timotheecour#616), to refactor some duplicate definitions, so it could be reused for that purpose too. Note that lib/std/private/constants.nim is an implementation detail (it's right there in the name private) so it's fine to do so.

The scope of lib/std/private/constants.nim could be restricted further if needed, but "useful constants" is good enough for now; again it's an implementation detail

benefits

single source of truth / documentation
lib/std/private/constants.nim is cheap to semcheck, it's only constants and these are not code-gen'd
no duplicate identifier error (it's the same symbol), no backward compatibility issue, no need to rename the sets to avoid clashes (renaming is bad IMO)
only question is what docs would look like, but I think it's either fine, or docgen can be taught to handle re-exports in whatever way we want

more generally, I find the addition of lib/std/private truly liberating, much better than the convoluted ways that preceded it with system / include files

@Araq wdyt ?

kintrix007 · 2021-06-18T09:03:08Z

* make dist/asciisets import it and re-export the ones that make sense there

To me it would make sense to add all sets there. Since if it does not export all, then there is gonna be that awkwardness of having to also implement strutils just for a set it's missing. However, If we define all of the sets there as well, then if we import both strutils and asciisets then some of it would be double-declared... And so would force you to write strutils.Letters or asciisets.Letters while really they are the same.

I am not misunderstanding something, right?

timotheecour · 2021-06-18T16:44:45Z

if we import both strutils and asciisets then some of it would be double-declared... And so would force you to write strutils.Letters or asciisets.Letters while really they are the same.

there is no double-declaration if it's the same symbol (eg via import + export).

# main
import t12425b, t12425c, t12425d
echo A1

# t12425b
const A1* = {1,2,3} ## some comment

# t12425c
import t12425b
export A1

# t12425d
import t12425b
export A1

nim r main works

timotheecour · 2021-07-06T19:24:50Z

ping @kintrix007

see recommended changes

kintrix007 · 2021-07-08T23:54:17Z

I'm sorry, I do not really have time to do this around now, but will get to it when I have time and will.

kintrix007 · 2021-09-16T19:48:24Z

I will not come back to this, sorry. If this is still needed, which I kind of doubt, someone else is probably gonna do it.

kintrix007 added 3 commits June 5, 2021 22:05

fixed incorrect runnableExample

2c33c6a

quickfix

815abe5

timotheecour reviewed Jun 5, 2021

View reviewed changes