docgen: sort symbols (fix #17910) #18560

a-mr · 2021-07-22T21:15:58Z

Fixes #17910 : nim doc sorts all symbols alphabetically now both in text and left bar TOC.
Example for os.nim:

zetashift · 2021-07-22T21:22:46Z

Ah, very kind for you to tag me here but I did not have the time to delve deep, however I just want to thank you for the PR! 🚀

Araq · 2021-07-23T07:29:48Z

compiler/docgen.nim

+    let overloadableNames = toSeq(keys(d.tocTable[kind]))
+    for plainName in overloadableNames.sorted():
+      var overloadChoices = d.tocTable[kind][plainName]
+      overloadChoices.sort(cmp)


This looks quite slow, can we do something about it?

no it does not look quite slow, calling N sorts where N = nb overloaded symbols will be dwarfed by other costs (parsing, semcheck, etc); although this could be optimized, there is no point in premature optimization at expense of simplicity.

and indeed:
nim doc --project --doccmd:skip lib/pure/times.nim takes 1.5 seconds (excluding runnableExamples)
the cost for this block of the whole block in if groupedToc: is 0.0035 (lumping other code too, this is an upperbound), ie 1/445 of total cost for a total of 1936 overloads. There are many optimizations worth considering before considering this one.

(0.0035069999999999824, 0.0000020000000000575113, 1936, 1, "D20210723T093627.total")

diff --git a/compiler/docgen.nim b/compiler/docgen.nim index 842f95ce7..dde9ba0bb 100644 --- a/compiler/docgen.nim +++ b/compiler/docgen.nim @@ -1266,7 +1266,7 @@ proc generateTags*(d: PDoc, n: PNode, r: var string) = if not checkForFalse(n[0][0]): generateTags(d, lastSon(n[0]), r) else: discard - +import std/times except getDateStr, getClockStr proc genSection(d: PDoc, kind: TSymKind, groupedToc = false) = const sectionNames: array[skModule..skField, string] = [ "Imports", "Types", "Vars", "Lets", "Consts", "Vars", "Procs", "Funcs", @@ -1280,10 +1280,16 @@ proc genSection(d: PDoc, kind: TSymKind, groupedToc = false) = proc cmp(x, y: TocItem): int = cmpIgnoreCase(x.sortName, y.sortName) if groupedToc: + var dt0 {.global.}: typeof(cpuTime() - cpuTime()) + var tot0 {.global.}: int + let t = cpuTime() + var tot = 0 let overloadableNames = toSeq(keys(d.tocTable[kind])) for plainName in overloadableNames.sorted(): var overloadChoices = d.tocTable[kind][plainName] + # echo "D20210723T093748: ", $overloadChoices.len, " " , plainName overloadChoices.sort(cmp) + tot += overloadChoices.len var content: string for item in overloadChoices: content.add item.content @@ -1291,6 +1297,11 @@ proc genSection(d: PDoc, kind: TSymKind, groupedToc = false) = "sectionid", $ord(kind), "sectionTitle", title, "sectionTitleID", $(ord(kind) + 50), "content", content, "plainName", plainName] + let t2 = cpuTime() + let dt = t2 - t + dt0 += dt + tot0 += tot + echo (dt0, dt, tot0, tot, "D20210723T093627.total") else: for item in d.tocSimple[kind].sorted(cmp): d.toc2[kind].add item.content

There are many optimizations worth considering before considering this one.

I hoped somebody would be prove me wrong by the numbers, however:

Code is easier to optimize when it's still fresh in the developer's head.

Things add up. Today's compiler is significantly slower than a couple of years ago. And the things that made it slower are hard to optimize. At the same time, I still get no help whatsoever in the one architectural change (= IC) that would imply we can stop caring about performance in most of the compiler's code.

I agree with 1., but I'm skeptical of 2.; big-O notation and Amdahl's law is what matters in the end and if the compiler is slower today than years ago, it's more likely to be caused by a small number of bottlenecks (that should be identified and fixed) than to a multitude of small things adding up (eg better error messages, static analysis etc).

On the same vein, optimizing PNode (eg removing comment node) will have a bigger impact on memory/performance than optimizing PSym since much more PNode's are allocated.

As for IC, I'm still convinced this is the wrong approach, and cling shows that IC can be be done with a compiler as a service (CAS) approach even for a language like C++. Even if nim's current IC approach would end up being faster than current nim full re-compilation (which isn't the case today), the overheads involved would still make CAS much faster.

Not only that, but you'd also get a REPL for free in the process, unlike with IC.

"compiler as a service (CAS) approach" is inherently non-scalable wrt memory consumption. It's also completely not required, C# and Delphi (Delphi in 1990 btw) have proven beyond doubt that "save to disk" is fast enough for everything, REPL included. Plus IC forces us to eliminate phase-ordering bugs which plague the Nim compiler.

timotheecour · 2021-07-23T18:25:38Z

how about using postprocessLexical defined below, to sort names involving numbers in a more intuitive way:

import std/[parseutils, strformat, algorithm, strutils, sequtils]

proc postprocessLexical(a: string): string =
  var i = 0
  while i<a.len:
    let c = a[i]
    if c in {'0'..'9'}:
      var num = 0
      i += parseInt(a, num, i)
      formatValue(result, num, "0>4")
    else:
      result.add c
      i.inc

proc main()=
  type A = object
    name, sortName: string
  let names = "uint8 uint16 uint32 uint64".split.mapIt(A(name: it, sortName: it.postprocessLexical))
  doAssert names.sortedByIt(it.sortName).mapIt(it.name) == @["uint8", "uint16", "uint32", "uint64"]
main()

timotheecour

LGTM

the ordering suggested in docgen: sort symbols (fix #17910) #18560 (comment) can be done either here or in future work
I believe the performance concern was addressed in https://github.com/nim-lang/Nim/pull/18560/files#r675704843 and that a slowdown upper-bounded by 1/445 is not a concern

a-mr · 2021-07-24T21:28:39Z

@timotheecour ,
I agree that wrong decimal sorting like uint16 < uint8 can be quite annoying.
I added a function cmpDecimalsIgnoreCase for handling of such cases, I think it may be a bit faster/more robust.

timotheecour · 2021-07-24T23:45:26Z

compiler/docgen.nim

+  ## For sorting with correct handling of cases like 'uint8' and 'uint16'.
+  ## Also handles leading zeroes well.
+  runnableExamples:
+    doAssert cmpDecimalsIgnoreCase("uint8", "uint16") < 0


what about foo08 vs foo8?
right now it gives 0, which is not intuitive because the strings are not equal.
how about instead considering 08 is longer than 8 and having:

doAssert cmpDecimalsIgnoreCase("ab8", "ab08") < 0 doAssert cmpDecimalsIgnoreCase("ab8de", "ab08c") < 0 # sanity check

agree, fixed.

timotheecour · 2021-07-25T15:53:17Z

btw, after a more careful timing so that it only measures the sort, not the other context in the block, i now get:
(0.0004810000000006198, 0.0000020000000000575113, 1313, 1, "D20210723T093627.total")
ie, 1/3000 of the total time for nim doc --project --doccmd:skip lib/pure/times.nim

so, nothing to "fix" here performance wise; not that it's a surprise, given that the cost is:

sum_ni ni * log ni <= n * log n # with n = nb_symbol

diff --git a/compiler/docgen.nim b/compiler/docgen.nim
index 3e5c1c912..2a3f6e6d7 100644
--- a/compiler/docgen.nim
+++ b/compiler/docgen.nim
@@ -1312,6 +1312,8 @@ proc generateTags*(d: PDoc, n: PNode, r: var string) =
       generateTags(d, lastSon(n[0]), r)
   else: discard

+import std/times except getDateStr, getClockStr
+
 proc genSection(d: PDoc, kind: TSymKind, groupedToc = false) =
   const sectionNames: array[skModule..skField, string] = [
     "Imports", "Types", "Vars", "Lets", "Consts", "Vars", "Procs", "Funcs",
@@ -1325,8 +1327,19 @@ proc genSection(d: PDoc, kind: TSymKind, groupedToc = false) =

   proc cmp(x, y: TocItem): int = cmpDecimalsIgnoreCase(x.sortName, y.sortName)
   if groupedToc:
+    var dt0 {.global.}: typeof(cpuTime() - cpuTime())
+    var tot0 {.global.}: int
+    var tot = 0
     let overloadableNames = toSeq(keys(d.tocTable[kind]))
-    for plainName in overloadableNames.sorted(cmpDecimalsIgnoreCase):
+    let t = cpuTime()
+    let names2 = overloadableNames.sorted(cmpDecimalsIgnoreCase)
+    let t2 = cpuTime()
+    tot += names2.len
+    let dt = t2 - t
+    dt0 += dt
+    tot0 += tot
+    echo (dt0, dt, tot0, tot, "D20210723T093627.total")
+    for plainName in names2:
       var overloadChoices = d.tocTable[kind][plainName]
       overloadChoices.sort(cmp)
       var content: string

* more compact names for non-overloaded symbols * more predictable Ascii sort (e.g. `<` instead of `<`)

* docgen: sort symbols (fix nim-lang#17910) * add workaround + change naming * switch to a dedicated sort comparator * fix numbers with unequal string lengths * dedicated `sortName` instead of `plainNameEsc`: * more compact names for non-overloaded symbols * more predictable Ascii sort (e.g. `<` instead of `<`)

docgen: sort symbols (fix nim-lang#17910)

c24e61b

add workaround + change naming

208862a

Araq reviewed Jul 23, 2021

View reviewed changes

timotheecour approved these changes Jul 23, 2021

View reviewed changes

switch to a dedicated sort comparator

cfdd633

timotheecour reviewed Jul 24, 2021

View reviewed changes

fix numbers with unequal string lengths

11f1bb6

timotheecour approved these changes Jul 25, 2021

View reviewed changes

dedicated sortName instead of plainNameEsc:

76599e8

* more compact names for non-overloaded symbols * more predictable Ascii sort (e.g. `<` instead of `<`)

Araq merged commit 10da888 into nim-lang:devel Jul 25, 2021

timotheecour mentioned this pull request Jul 26, 2021

simpler doc links: getFilePermissions <#getFilePermissions,string>_ => $getFilePermissions nim-lang/RFCs#125

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docgen: sort symbols (fix #17910) #18560

docgen: sort symbols (fix #17910) #18560

a-mr commented Jul 22, 2021 •

edited

Loading

zetashift commented Jul 22, 2021

Araq Jul 23, 2021

timotheecour Jul 23, 2021 •

edited

Loading

Araq Jul 23, 2021

timotheecour Jul 23, 2021 •

edited

Loading

Araq Jul 23, 2021

timotheecour commented Jul 23, 2021 •

edited

Loading

timotheecour left a comment

a-mr commented Jul 24, 2021

timotheecour Jul 24, 2021 •

edited

Loading

a-mr Jul 25, 2021

timotheecour commented Jul 25, 2021 •

edited

Loading

docgen: sort symbols (fix #17910) #18560

docgen: sort symbols (fix #17910) #18560

Conversation

a-mr commented Jul 22, 2021 • edited Loading

zetashift commented Jul 22, 2021

Araq Jul 23, 2021

Choose a reason for hiding this comment

timotheecour Jul 23, 2021 • edited Loading

Choose a reason for hiding this comment

Araq Jul 23, 2021

Choose a reason for hiding this comment

timotheecour Jul 23, 2021 • edited Loading

Choose a reason for hiding this comment

Araq Jul 23, 2021

Choose a reason for hiding this comment

timotheecour commented Jul 23, 2021 • edited Loading

timotheecour left a comment

Choose a reason for hiding this comment

a-mr commented Jul 24, 2021

timotheecour Jul 24, 2021 • edited Loading

Choose a reason for hiding this comment

a-mr Jul 25, 2021

Choose a reason for hiding this comment

timotheecour commented Jul 25, 2021 • edited Loading

a-mr commented Jul 22, 2021 •

edited

Loading

timotheecour Jul 23, 2021 •

edited

Loading

timotheecour Jul 23, 2021 •

edited

Loading

timotheecour commented Jul 23, 2021 •

edited

Loading

timotheecour Jul 24, 2021 •

edited

Loading

timotheecour commented Jul 25, 2021 •

edited

Loading