highlite: fix #17890 - tokenize Nim escape seq-s #17919

Merged (2 commits) on May 3, 2021

Changes from all commits
52 changes: 33 additions & 19 deletions lib/packages/docutils/highlite.nim
@@ -190,31 +190,33 @@ proc nimNextToken(g: var GeneralTokenizer, keywords: openArray[string] = @[]) =
   var pos = g.pos
   g.start = g.pos
   if g.state == gtStringLit:
-    g.kind = gtStringLit
-    while true:
-      case g.buf[pos]
-      of '\\':
-        g.kind = gtEscapeSequence
-        inc(pos)
-        case g.buf[pos]
-        of 'x', 'X':
-          inc(pos)
-          if g.buf[pos] in hexChars: inc(pos)
-          if g.buf[pos] in hexChars: inc(pos)
-        of '0'..'9':
-          while g.buf[pos] in {'0'..'9'}: inc(pos)
-        of '\0':
-          g.state = gtNone
-        else: inc(pos)
-        break
-      of '\0', '\r', '\n':
-        g.state = gtNone
-        break
-      of '\"':
-        inc(pos)
-        g.state = gtNone
-        break
-      else: inc(pos)
+    if g.buf[pos] == '\\':
+      g.kind = gtEscapeSequence
+      inc(pos)
+      case g.buf[pos]
+      of 'x', 'X':
@timotheecour (Member) commented on May 1, 2021:

proc main*()=
  ## \x123
  ## `\x123`
  runnableExamples:
    echo "abc\x1234"
    echo "abc\x1"
    echo "abc\uabcd3" # bug with this one
    echo "abc\u{12345}def" # and this one
    echo "abc\e"
main()

Is there a possibility to reuse code (possibly via a lib/std/private/lexerutils) to avoid duplication with what the compiler already does?

(A plausible alternative is to not highlight things inside string literals, but see also #17722, which would specify the language with which to highlight.)

Supporting \u etc. can also be deferred to future work, since it's more rare.

The PR author (Contributor) replied:

I just enabled the existing code by moving the case branch for the leading \ into the surrounding if statement; GitHub shows more changes than were actually made.

I can hardly guess why highlite.nim was written with its own separate lexer.

@Araq ?

A Member replied:
Lexer: Transforms "\n" into \10. Skips comments and whitespace (more or less).
Highlighter: Highlights "\n". Highlights comments and whitespace.

Different code for different things.

+        inc(pos)
+        if g.buf[pos] in hexChars: inc(pos)
+        if g.buf[pos] in hexChars: inc(pos)
+      of '0'..'9':
+        while g.buf[pos] in {'0'..'9'}: inc(pos)
+      of '\0':
+        g.state = gtNone
+      else: inc(pos)
+    else:
+      g.kind = gtStringLit
+      while true:
+        case g.buf[pos]
+        of '\\':
+          break
+        of '\0', '\r', '\n':
+          g.state = gtNone
+          break
+        of '\"':
+          inc(pos)
+          g.state = gtNone
+          break
+        else: inc(pos)
   else:
     case g.buf[pos]
     of ' ', '\t'..'\r':
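
In short: while in string-literal state, a leading backslash now opens its own gtEscapeSequence token, and plain characters are consumed as gtStringLit until the next backslash, closing quote, or line end. Below is a minimal standalone sketch of that splitting idea (an illustration, not the library code itself; it handles only single-character escapes, not the \x and \u forms above):

  type TokKind = enum tkStringLit, tkEscapeSequence

  iterator scanStringBody(buf: string): (TokKind, string) =
    ## Yields alternating literal runs and escape sequences.
    var i = 0
    while i < buf.len:
      if buf[i] == '\\':
        let start = i
        i = min(i + 2, buf.len)  # a backslash plus one escaped character
        yield (tkEscapeSequence, buf[start ..< i])
      else:
        let start = i
        while i < buf.len and buf[i] != '\\': inc i
        yield (tkStringLit, buf[start ..< i])

  when isMainModule:
    for (kind, s) in scanStringBody(r"ok1\nok2\nok3"):
      echo kind, ": ", s  # tkStringLit: ok1, then tkEscapeSequence: \n, ...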
@@ -985,6 +987,18 @@ proc getNextToken*(g: var GeneralTokenizer, lang: SourceLanguage) =
   of langPython: pythonNextToken(g)
   of langCmd: cmdNextToken(g)
 
+proc tokenize*(text: string, lang: SourceLanguage): seq[(string, TokenClass)] =
+  var g: GeneralTokenizer
+  initGeneralTokenizer(g, text)
+  var prevPos = 0
+  while true:
+    getNextToken(g, lang)
+    if g.kind == gtEof:
+      break
+    var s = text[prevPos ..< g.pos]
+    result.add (s, g.kind)
+    prevPos = g.pos
+
 when isMainModule:
   var keywords: seq[string]
   # Try to work running in both the subdir or at the root.
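For orientation, a minimal usage sketch of the new tokenize proc; the import path mirrors the relative layout used by the test file below, and the expected output pairs are taken from those tests:

  import ../../lib/packages/docutils/highlite  # path assumed; adjust to your checkout

  for (s, kind) in tokenize("\"ok1\\nok2\"", langNim):
    echo kind, ": ", s
  # Per the tests below, this prints:
  #   gtStringLit: "ok1
  #   gtEscapeSequence: \n
  #   gtStringLit: ok2"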
13 changes: 13 additions & 0 deletions tests/stdlib/thighlite.nim
@@ -0,0 +1,13 @@
+
+import unittest
+import ../../lib/packages/docutils/highlite
+
+block: # Nim tokenizing
+  test "string literals and escape seq":
+    check("\"ok1\\nok2\\nok3\"".tokenize(langNim) ==
+      @[("\"ok1", gtStringLit), ("\\n", gtEscapeSequence), ("ok2", gtStringLit),
+        ("\\n", gtEscapeSequence), ("ok3\"", gtStringLit)
+      ])
+    check("\"\"\"ok1\\nok2\\nok3\"\"\"".tokenize(langNim) ==
+      @[("\"\"\"ok1\\nok2\\nok3\"\"\"", gtLongStringLit)
+      ])
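
Since the import is relative to the test file, compiling it directly should work, e.g. nim c -r tests/stdlib/thighlite.nim from the repository root (invocation assumed; the repository's CI may drive it differently).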