Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strutils new sets #18193

Closed
wants to merge 18 commits into from
Closed

Conversation

kintrix007
Copy link

@kintrix007 kintrix007 commented Jun 5, 2021

Made it more consistent how the sets are defined, also changed them to use set unions wherever possible.

Added Vowels and Consonants, and functions called isVowelAscii and isConsonantAscii.
Added set Punctuation for punctuation characters alongside with isPunctAscii and removePunctAscii.
Added set Special for all non-alphanumerical characters with isSpecialAscii and removeSpecialAscii functions.

Added sets ControlChars, GraphicChars, PrintableChars, and Punctuation.

Just some simple additions that come in handy when dealing with text.

future work

  • ControlChars, PrintableChars, GraphicChars (see comments)

Used set unions to define most sets. Added 'Vowels' and 'Consonants', and functions called 'isVowelAscii' and 'isConsonantAscii'. Added set 'Punctuation' for punctuation characters alongside with 'isPunctAscii' and 'removePunctAscii'. Added set 'Special' for all non-alphanumerical characters with 'isSpecialAscii' and 'removeSpecialAscii' functions
lib/pure/strutils.nim Outdated Show resolved Hide resolved
## The set of characters a newline terminator can start with (carriage
## return, line feed).

Punctuation* = {'!', ',', '.', ':', ';', '?'}
Copy link
Member

@timotheecour timotheecour Jun 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there an official source for that? can you add it as a comment in code?
(and do other languages like python, C++ etc implement it the same)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a set I made up, without a standard... which is a terrible idea now that you mention it...
Haskell does have an isPunctuation function, however, that works properly on all unicode characters.
I will change it to have all ASCII characters that are categorized as punctuation in Haskell.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python3 -c 'import string; print(string.punctuation)'
!"#$%&'()*+,-./:;<=>?@[]^_`{|}~

please also do some research for other popular languages; if there's consensus among those, then adding it is a no-brainer, if not, then we need to think a bit

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Haskell lists characters a little differently...
I changed it to that temporarily, will look up some other languages

Copy link
Member

@timotheecour timotheecour Jun 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like C and python are off by 1 char:

C and python agree (i was missing the \ as
python3 -c 'import string; print(string.punctuation)'` was interpretting \ on the shell so didn't show up in output)

=> so the correct definition should be:

const Punctuation = {'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'}

C and python in agreement seems like good enough of a standard

proc ispunct(a: cint): cint {.importc, header: "<ctype.h>".}
proc main()=
  var s: set[char]
  for c in char.low..char.high:
    if c.cint.ispunct > 0:
      s.incl c
  echo s
  var s2: set[char]
  for c in """!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""":
    s2.incl c
  echo s2
  echo s == s2
main()
{'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'}
{'!', '\"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'}
true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment that closed the other PR (#16708 (comment)) would be justified if there was no definition or agreement on punctuation, but the fact that both C and python agree (assuming default c locale) is a strong argument.

There's actually a definition of punctuation:

The standard "C" locale considers punctuation characters all graphic characters (as in isgraph) that are not alphanumeric (as in isalnum).

Applications can always choose to use a different set for their needs, but having an accepted standard makes sense.

Copy link
Contributor

@Clyybber Clyybber Jun 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard "C" locale considers punctuation characters all graphic characters (as in isgraph) that are not alphanumeric (as in isalnum).

Then there is no need to add a Punctuation set that someone will have to look up.how it is specified because Printable - AlphaNumeric is much clearer and "self-documenting".

lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
@@ -84,31 +84,42 @@ from std/private/strimpl import cmpIgnoreStyleImpl, cmpIgnoreCaseImpl, startsWit


const
Whitespace* = {' ', '\t', '\v', '\r', '\l', '\f'}
Whitespace* = {' ', '\t', '\v', '\r', '\n', '\f'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, a separate PR would be welcome to change all instances of \l to \n in nim repo (and it'd only do that plus maybe related changes)

lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
@juancarlospaco
Copy link
Collaborator

Should new additions go into strbasics ?.

lib/pure/strutils.nim Outdated Show resolved Hide resolved
lib/pure/strutils.nim Outdated Show resolved Hide resolved
@timotheecour
Copy link
Member

timotheecour commented Jun 6, 2021

for CI failures, try:

when nimvm: discard
else:
  ... # things with importc

(but test locally before pushing ;-) )

Co-authored-by: Timothee Cour <timothee.cour2@gmail.com>
@kintrix007
Copy link
Author

Oof, most tests fail with JS... If I understand it correctly, they fail on the set checks, where we check if they align with the C stdlib...

@timotheecour
Copy link
Member

timotheecour commented Jun 7, 2021

Oof, most tests fail with JS... If I understand it correctly, they fail on the set checks, where we check if they align with the C stdlib...

this works:

from stdtest/testutils import disableVm
=>
from stdtest/testutils import disableVm, whenVMorJs

and then:

  whenVMorJs: discard
  do:
    block: # Whitespace
      proc isspace(c: cint): cint {.importc, header: "<ctype.h>".}
    ...

(EDIT: i pushed it in your branch)

@Araq
Copy link
Member

Araq commented Jun 7, 2021

The refactorings are fine and would be accepted but I consider GraphicChars and others to be a misdesign:

  • It's clear that these are not universal definitions so people coming from C disagree with people coming from Python. Or maybe an upcoming C23 standard changes these definitions because of locals or unicode, it's hitting a moving target.
  • So then we have our own defintion what GraphicChars means and we cannot change it ever without breaking code. Suppose I am to implement spec X. Spec X also uses a notion of GraphicChars and defines what it means. Is it the same definition as Nim's? If so, that would be accidentical so I better write my own set definition anyway. If it differs slightly I have no chance anyway but to write my own definition. Either way we're better off not offering this at all, as the good code doesn't use it anyway. We lure people into the wrong solution by making strutils.GraphicGhars available. Much like we encourge bugs by strutils.escape's very existence.

@timotheecour
Copy link
Member

timotheecour commented Jun 7, 2021

It's clear that these are not universal definitions so people coming from C disagree with people coming from Python

python and C actually agree on all those definitions, as noted in review comments. Vowels and Consonants (in a previous version of this PR) have been removed because of lack of standardized definition.

this PR follows a widely accepted standard, the posix character class, and lots of languages follow this standard:

etc

Or maybe an upcoming C23 standard changes these definitions because of locals or unicode, it's hitting a moving target.

they won't change those definitions, that'd break code; if needed they'll simply add another character class

So then we have our own defintion what GraphicChars means and we cannot change it ever without breaking code.

ditto, we'll just add a new character class if ever needed

ascii is stable and has been around for a while, it's not a moving target. Nim should follow widely accepted standards and not make client code re-invent the wheel (inconsistently)

@Araq
Copy link
Member

Araq commented Jun 8, 2021

they won't change those definitions, that'd break code; if needed they'll simply add another character class

Ok, so then we "only" need to add more of these classes for consistency with Posix. IMHO these "character classes" exist because the respective environments lack Nim's set construct. The classes feel archaic in the days of Unicode. Who cares if the character is an
ASCII control character -- what are the use cases? "Other languages copied anachronisms from each other" is not a use case.

@timotheecour
Copy link
Member

timotheecour commented Jun 8, 2021

input validation/filtering is a typical use case.
I agree that std/strutils is bloated though so I'd be ok with either moving those new sets to std/strmisc (2nd choice) or adding std/ascii (1st choice, can grow in future if needed yet avoids scope creep as it's right there in the title)

D and python3 also have such an ascii modules (https://dlang.org/phobos/std_ascii.html, https://docs.python.org/3/library/curses.ascii.html)

@Araq
Copy link
Member

Araq commented Jun 8, 2021

I agree that std/strutils is bloated though so I'd be ok with either moving those new sets to std/strmisc (2nd choice) or adding std/ascii (1st choice, can grow in future if needed yet avoids scope creep as it's right there in the title)

It should be dist/ascii. We should use std for std-worthy stuff. Yes, I know this is a change in policy.

@timotheecour
Copy link
Member

timotheecour commented Jun 8, 2021

It should be dist/ascii.

you mean stdx/ascii (in $nim/lib/stdx/ascii.nim) ? if so I'm all for it

refs: #17722 (comment)

Ok, let's use "stdx" then.

(separate topic but stdx/chains instead of std/chains would be nice too)

@Araq
Copy link
Member

Araq commented Jun 9, 2021

you mean stdx/ascii (in $nim/lib/stdx/ascii.nim) ? if so I'm all for it

No, I mean dist/ascii. Or some similar name that implies "it's simply a module we ship with Nim". stdx still has "standard" in its name implying a level of rigor that we lack.

@timotheecour
Copy link
Member

timotheecour commented Jun 9, 2021

not sure what we gain by multiplying the number of top-level prefixes:

import std/strbasics
import experimental/diff
import packages/docutils/rstgen
import stdx/syntaxes # still being debated, see also `extensions` (https://github.com/nim-lang/Nim/pull/17722#issuecomment-820601135)
import dist/ascii

not to mention all the ones where std prefix is optional, eg:
import strutils

A module-level metadata, top-level doc comment, or separate index would all be better suited than categorizing modules based on a subjective notion of fitness. After all we also already have std/chains and other modules which don't quite fit the bill.

That just invites parallel sub-categories, eg:

std/js/foo
dist/js/foo

which is a bad thing, for the same reason parallel APIs are bad.

Most other language's standard libraries manage to do with a single top-level prefix which avoids scope pollution.

@Araq
Copy link
Member

Araq commented Jun 9, 2021

Most other language's standard libraries manage to do with a single top-level prefix which avoids scope pollution.

https://docs.python.org/3/library/curses.ascii.html

But ok, we can also develop the standard library with rigor instead. "Here, I found X to be useful so I added it and C++ and Python also have it" is pretty far away from what I consider to be acceptable. Especially since these languages all lack Nim's set construct.

And it's even worse, I cannot bring up the argument that "we don't need std/asciitables because Python also lacks std/asciitables" as you simply don't care when it's the other way round. Everything should always be added to "std", immediately, giving us the superset of things that are in C++ and Python plus all the things that happen to be useful inside the Nim compiler. It's terrible.

@timotheecour
Copy link
Member

timotheecour commented Jun 9, 2021

For the sake of moving forward with this, I'm ok with creating lib/dist/ and adding lib/dist/ascii.nim so that modules needing it can import it as import dist/ascii (although I would've preferred import std/ascii or import stdx/ascii)

std/asciitables

right now it's in lib/std/private/asciitables.nim, which means it's internal and client code isn't supposed to use it unless they know what they're doing and are ok with breaking changes, but I happen to find asciitables actually quite useful and std-worthy for future work, I often use it in my own code. Whether it shall be moved to std/asciitables, stdx/asciitables or dist/asciitables is rather secondary so long it's somewhere accessible under lib (honored by --lib flag)

so, ok to move forward with this PR after modifying it so that new sets go to lib/dist/ascii.nim and also keeping the other improvements from this PR? (the new tests should be moved, maybe under tests/stdlib/tascii for simplicity, instead of tests/dist/tascii, but I don't care too much which of those 2 files is used)

@Araq
Copy link
Member

Araq commented Jun 10, 2021

Let's go with import dist/asciisets

@timotheecour
Copy link
Member

timotheecour commented Jun 10, 2021

ok, ping @kintrix007 :)

(EDIT: ie, introduce lib/dist and lib/dist/asciisets.nim)

@kintrix007
Copy link
Author

kintrix007 commented Jun 16, 2021

ok, ping @kintrix007 :)

(EDIT: ie, introduce lib/dist and lib/dist/asciisets.nim)

First of all, sorry for being away for "a short while"...
But I'm kind of confused what we want to do now.

Do we want to move all string sets to lib/dist/asciisets.nim? If so, that would cause some back-wards incompatibility.
However, if we want to only move the newly created sets, then wouldn't it be awkward to have to import both dist/asciisets and std/strutils to have all sets to work with? As dist/asciisets would only have the sets like PrintableChars and so

@timotheecour
Copy link
Member

timotheecour commented Jun 16, 2021

Do we want to move all string sets to lib/dist/asciisets.nim? If so, that would cause some back-wards incompatibility. However, if we want to only move the newly created sets, then wouldn't it be awkward to have to import both dist/asciisets and std/strutils to have all sets to work with? As dist/asciisets would only have the sets like PrintableChars and so

good that you raise this point.

I think the cleanest solution is as follows:

  • create lib/std/private/constants.nim and add all the sets there
  • make std/strutils import it and re-export the ones it already defined
  • make dist/asciisets import it and re-export the ones that make sense there
  • other modules that also copy-paste the definition of a few of those sets can now import/re-export those instead

I've been wanting such a module (lib/std/private/constants.nim) for other reasons (including some things shared between compiler and stdlib, refs timotheecour#616), to refactor some duplicate definitions, so it could be reused for that purpose too. Note that lib/std/private/constants.nim is an implementation detail (it's right there in the name private) so it's fine to do so.

The scope of lib/std/private/constants.nim could be restricted further if needed, but "useful constants" is good enough for now; again it's an implementation detail

benefits

  • single source of truth / documentation
  • lib/std/private/constants.nim is cheap to semcheck, it's only constants and these are not code-gen'd
  • no duplicate identifier error (it's the same symbol), no backward compatibility issue, no need to rename the sets to avoid clashes (renaming is bad IMO)
  • only question is what docs would look like, but I think it's either fine, or docgen can be taught to handle re-exports in whatever way we want

more generally, I find the addition of lib/std/private truly liberating, much better than the convoluted ways that preceded it with system / include files

@Araq wdyt ?

@kintrix007
Copy link
Author

kintrix007 commented Jun 18, 2021

* make dist/asciisets import it and re-export the ones that make sense there

To me it would make sense to add all sets there. Since if it does not export all, then there is gonna be that awkwardness of having to also implement strutils just for a set it's missing. However, If we define all of the sets there as well, then if we import both strutils and asciisets then some of it would be double-declared... And so would force you to write strutils.Letters or asciisets.Letters while really they are the same.

I am not misunderstanding something, right?

@timotheecour
Copy link
Member

timotheecour commented Jun 18, 2021

if we import both strutils and asciisets then some of it would be double-declared... And so would force you to write strutils.Letters or asciisets.Letters while really they are the same.

there is no double-declaration if it's the same symbol (eg via import + export).

# main
import t12425b, t12425c, t12425d
echo A1

# t12425b
const A1* = {1,2,3} ## some comment

# t12425c
import t12425b
export A1

# t12425d
import t12425b
export A1

nim r main works

@timotheecour
Copy link
Member

ping @kintrix007

@timotheecour timotheecour dismissed their stale review July 6, 2021 19:54

see recommended changes

@kintrix007
Copy link
Author

I'm sorry, I do not really have time to do this around now, but will get to it when I have time and will.

@kintrix007
Copy link
Author

I will not come back to this, sorry. If this is still needed, which I kind of doubt, someone else is probably gonna do it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants