Skip to content
This repository has been archived by the owner on Feb 28, 2022. It is now read-only.

Add support for anchors on headings #26

Closed
ghost opened this issue Jul 16, 2018 · 19 comments · Fixed by #240 or #273
Closed

Add support for anchors on headings #26

ghost opened this issue Jul 16, 2018 · 19 comments · Fixed by #240 or #273
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers released

Comments

@ghost
Copy link

ghost commented Jul 16, 2018

Anchors are nicely managed on github.com: when rendering the markdown, github adds an anchor before each heading and shows an icon that gives the link to have a direct access to the corresponding heading.
While working on our Helix documentation, I realised this would be a really useful feature, especially to handle the navigation: you can have a long md file but create a nav that give direct access to a "section" of content below in the page. Without the anchors, I had to split the long md file into various small md files to give direct access.

The pipeline could automatically generate an anchor (<a> tag with an id) for each "section" and "heading". Showing or not an icon would be delegated to the project and could be done with CSS.

This issue was moved by trieloff from adobe/project-helix#154.

@kptdobe kptdobe added enhancement New feature or request good first issue Good for newcomers labels Oct 2, 2018
@kptdobe kptdobe changed the title Anchors not properly managed Support for anchors Oct 2, 2018
@rofe
Copy link
Contributor

rofe commented Dec 12, 2018

No need for an extra <a> tag, you could add the id attribute directly to the heading tag.

@koraa koraa self-assigned this Feb 14, 2019
@koraa koraa removed their assignment Mar 11, 2019
@ramboz
Copy link
Contributor

ramboz commented Mar 29, 2019

Looking into this, a few questions come to mind:

How do we want to sanitize the title so the identifier is both user-friendly and still valid according to the html spec?
My first idea was to just replace any non-alphanumerical character with a dash and turn everything to lowercase.
What’s next? would be transformed to what-s-next-

Do we want to keep sanitized characters at the start/end of the identifier?
¿que pasa? => -que-pasa- or que-pasa?

Do we want to keep repeating dashed in the identifier?
Helix - test => helix---test or helix-test?

Identifiers have to be unique, so how would we want to handle 2 same labeled H2s in different H1s?
I would propose to prefix with the parent H1 identifier. Something like:

h1(id="foo") Foo
  h2(id="foo--baz") Baz
h1(id="bar") Bar
  h2(id="bar--baz") Baz

Are we targeting HTML 4 or HTML4.1+?
The former was quite strict about identifiers (only [a-zA-Z][a-zA-Z0-9_-.:]+ if I remember), while the latter supports anything that is not a space (\s+).
I'd be tempted to go with an HTML4 compatible id to be backward compatible, but would also be ok to relax the rule

@kptdobe
Copy link
Contributor

kptdobe commented Apr 4, 2019

Good questions ;)
I think we should just do it exactly like Github does it. I assume this will answer all the questions, assuming they support each cases. By default, I see they do not prefix the h2 with h1 but could not quickly find a duplicate case.

Also, github.com adds a <a> tag with the id mainly to be able to have it visual and be able to copy/paste the link easily. But default, this link could be hidden.

@ramboz
Copy link
Contributor

ramboz commented Apr 5, 2019

@kptdobe, @rofe: so GitHub only uses the <a> because they actually make a visible link on the header so you can easily copy-paste. While this is usually a great pattern for documentation-related websites and wikis, I've rarely seen this used anywhere else (i.e. product pages, news articles, blogs, etc.).

This also is an opinionated way on how to handle anchors that I'm not sure I would like to impose on our customers. And I think this is also in line with @rofe's earlier comment.

I can look into generating the same identifiers that GitHub would do, but I would just leave these as ids on the headings that project teams can adjust according to their needs either in the pre.js or post-load via JS (which is what I've seen most of the time). What do you think?

# Foo

=> <h1 id="foo">Foo</h1> <- the default
=> <h1><a id="foo" href="#foo">link</a>Foo</h1> <- project version
=> <a id="foo" href="#foo">link</a><h1>Foo</h1> <- project version
etc.

We would propose the first by default, and the project team can decide what to do with it.

@ramboz
Copy link
Contributor

ramboz commented Apr 5, 2019

Good idea to do it like GitHub. They actually use some powerful advanced regex in python that properly supports all unicode blocks for foreign languages[0], but we don't have an easy JS equivalent at the moment. I see 2 possible approaches here:

  • use XRegExp[1] and its support for unicode categories to simplifiy the regexp, but we introduce a depdendency
  • have an exhaustive regexp in vanilla JS using the unicode blocks that represent valid characters in each script (language). Just to give an idea, the regex for just the Chinese characters is given in [2]

I'm preparing a gist with as many examples as I can think of as test cases in [3].

@ramboz ramboz changed the title Support for anchors Add support for anchors on headings Apr 5, 2019
ramboz added a commit that referenced this issue Apr 5, 2019
Adapting existing tests to use the new heading ids

fix #26
@tripodsan
Copy link
Contributor

why do you want to keep the unicode letters in the anchor name?

also see https://mathiasbynens.be/notes/es6-unicode-regex

i think that replace(/\W+/ug, “”) would be enough

@ramboz
Copy link
Contributor

ramboz commented Apr 6, 2019

why do you want to keep the unicode letters in the anchor name?

The issue was about making it work like the github anchors, so if that is the actual requirement, I don't see how to achieve that result without a regex that fully supports all accented letters, but more importantly all foreign scripts as well. You can check the gist linked above for some examples.

replace(/\W+/ug, “”)

I tried that, as well as various combinations of \p{Script=…} and it didn't work for me. Just simple accented characters are not even properly recognized by \W and just stripped.

ES2018 is supposed to get proper unicode support in regex according to https://github.com/tc39/proposal-regexp-unicode-property-escapes, but until then it seems we are quite limited.

@tripodsan
Copy link
Contributor

i think keeping the non ascii characters is overrated :-)
but ok. if this is what we want :/)

@ramboz
Copy link
Contributor

ramboz commented Apr 6, 2019

I had an ascii-only version earlier on (1st commit actually)… I'm ok with any approach, really.

Now if we stick to the HTML5 spec, we should probably support unicode identifiers at some point, especially when we start supporting proper localization. Can't really have Chinese markdowns with Chinese headings and expect to transliterate to ascii identifiers I guess.

@tripodsan
Copy link
Contributor

i agree. you still need to ensure that the anchors are unique.

@ramboz
Copy link
Contributor

ramboz commented Apr 6, 2019

The algorithm should ensure they are, even have unit tests for it.
If there is a collision, I append -1, -2, etc. like github does it, and if you have a collision between the 2nd "Foo" (foo-1) and another header "Foo-1", then the latter becomes foo-1-1 as github does it

@koraa
Copy link
Contributor

koraa commented Apr 6, 2019

Sorry to interrupt, but why not just use lodash's kebabCase? https://lodash.com/docs/4.17.11#kebabCase

EDIT: It does have proper unicode support…

@ramboz
Copy link
Contributor

ramboz commented Apr 6, 2019

@koraa That is an excellent suggestion, didn't think about it. I tried playing with it on the gist I referenced and it actually performs really well for our needs.

Here is a comparative table of the various approaches.
(I highlighted the differences from GitHub with ⚠️)

Heading GitHub XRegExp KebabCase Slugger
Latin script latin-script latin-script latin-script latin-script
simpleAscii simpleascii simpleascii ⚠️simple-ascii ⚠️simple-ascii
fo0 fo0 fo0 ⚠️fo-0 fo0
bar bar bar bar bar
baz baz baz baz baz
qux qux qux qux qux
baz baz-1 baz-1 baz-1 baz-1
0corge 0corge 0corge ⚠️0-corge 0corge
baz-1 baz-1-1 baz-1-1 baz-1-1 baz-1-1
Accented characters accented-characters accented-characters accented-characters accented-characters
Fòô fòô fòô ⚠️foo fòô
Bår bår bår ⚠️bar-1 bår
Bãz bãz bãz ⚠️baz-2 bãz
Qúx qúx qúx ⚠️qux-1 qúx
Çorge çorge çorge ⚠️corge çorge
Special#characters specialcharacters specialcharacters ⚠️special-characters ⚠️special-characters
Punctuation - signs punctuation---signs punctuation---signs ⚠️punctuation-signs punctuation---signs
Foo!! foo foo ⚠️foo-1 foo
¿¿Bar?? bar-1 bar-1 ⚠️bar-2 ⚠️¿¿bar
{Baz} baz-2 baz-2 ⚠️baz-3 baz-2
(Qux) qux-1 qux-1 ⚠️qux-2 qux-1
<Corge> corge corge ⚠️corge-1 corge
-–—Grault—–- -grault- -grault- ⚠️grault -grault-
---GARPLY--- ---garply--- ---garply--- ⚠️garply ---garply---
:;<=>?@
Foreign_Scripts foreign_scripts foreign_scripts ⚠️foreign-scripts foreign_scripts
汉字 漢字 汉字-漢字 汉字-漢字 汉字-漢字 汉字-漢字
**العربية العربية العربية ** العربية iالعربية
देवनागरी देवनागरी ⚠️दवनगर देवनागरी देवनागरी
বাংলা-অসমীয়া বাংলা-অসমীয়া ⚠️বল-অসময বাংলা-অসমীয়া বাংলা-অসমীয়া
Кириллица кириллица кириллица кириллица кириллица
かな カナ かな-カナ かな-カナ かな-カナ かな-カナ
한글 조선글 한글-조선글 한글-조선글 한글-조선글 한글-조선글
తెలుగు తెలుగు ⚠️తలగ తెలుగు తెలుగు
தமிழ் தமிழ் ⚠️தமழ தமிழ் தமிழ்
ગુજરાતી ગુજરાતી ⚠️ગજરત ગુજરાતી ગુજરાતી
ಕನ್ನಡ ಕನ್ನಡ ⚠️ಕನನಡ ಕನ್ನಡ ಕನ್ನಡ
မြန်မာ မြန်မာ ⚠️မနမ မြန်မာ မြန်မာ
മലയാളം മലയാളം ⚠️മലയള മലയാളം മലയാളം
ไทย ไทย ไทย ไทย ไทย
ਗੁਰਮੁਖੀ ਗੁਰਮੁਖੀ ⚠️ਗਰਮਖ ਗੁਰਮੁਖੀ ਗੁਰਮੁਖੀ
ລາວ ລາວ ລາວ ລາວ ລາວ
ଉତ୍କଳ ଉତ୍କଳ ⚠️ଉତକଳ ଉତ୍କଳ ଉତ୍କଳ
Ελληνικό ελληνικό ελληνικό ελληνικό ελληνικό
Ελληνικό ελληνικό-1 ελληνικό-1 ελληνικό-1 ελληνικό-1
Maths maths maths maths maths
E ≠ mc² emc ⚠️e--mc² ⚠️e-≠-mc ⚠️e-≠-mc²
∇² = ±φ --φ ⚠️²--φ ⚠️∇-φ ⚠️∇²--±φ

While lodash's kebabCase does not exactly behave like GitHub, it does offer an easy way to cover the needs of an identifier. It even adds the capability to transform camel case headings into a --separated identifier. Also offers the benefit of not being something we have to maintain ourselves. So unless we actually have a real need to stay as close as possible to the GitHub behavior, I'd be tempted to go with @koraa's suggestion

ramboz added a commit that referenced this issue Apr 6, 2019
Leverage lodash's kebabCase instead of XRegExp dependecy

fix #26
@ramboz
Copy link
Contributor

ramboz commented Apr 6, 2019

I also notice that the :;<=>?@ heading generates no identifier in any of the solutions.
Should I go ahead and introduce a fallback to heading-1 or similar?

ramboz added a commit that referenced this issue Apr 6, 2019
@koraa
Copy link
Contributor

koraa commented Apr 6, 2019

I would be interested in meeting the person that uses such a heading, so I say we leave it in case anyone wants to report that bug ;)
But great edge case search! I am impressed claps

@ramboz
Copy link
Contributor

ramboz commented Apr 6, 2019

@koraa True… I initially thought I could provide a few examples from a mathematician's or physicist's point of view that would write some mathematical formula as heading title (E=mc², ∇² = φ, etc.) but even these actually work with kebabcase (all greek and maths symbols are supported)… so the only edge case I can think of that remains is just a gibberish sequence of special characters which you probably don't need to target anyway with an anchor.

ramboz added a commit that referenced this issue Apr 7, 2019
Remove package-lock.json modifications

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
…r instead

Replacing the lodash kebabcase and custom id collision escpaing with the github slugger module that
achieves boths

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
Refactor the getTextContent method in the heading handler to a static method of the class

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
Adjust the generate id algorithm so it matches what GitHub uses and
extract it to separate file for easier testing

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
ramboz added a commit that referenced this issue Apr 15, 2019
Adapting existing tests to use the new heading ids

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
Adapting existing tests to use the new heading ids

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
Leverage lodash's kebabCase instead of XRegExp dependecy

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
ramboz added a commit that referenced this issue Apr 15, 2019
Remove package-lock.json modifications

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
…r instead

Replacing the lodash kebabcase and custom id collision escpaing with the github slugger module that
achieves boths

fix #26
ramboz added a commit that referenced this issue Apr 15, 2019
Refactor the getTextContent method in the heading handler to a static method of the class

fix #26
adobe-bot pushed a commit that referenced this issue Apr 16, 2019
# [1.5.0](v1.4.0...v1.5.0) (2019-04-16)

### Features

* **html pipe:** add support for anchors on headings ([f52f768](f52f768)), closes [#26](#26)
@adobe-bot
Copy link

🎉 This issue has been resolved in version 1.5.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

trieloff added a commit to adobe/helix-cli that referenced this issue Apr 16, 2019
Fixes the ID attributes so that the new output from pipeline-1.5 works

see adobe/helix-pipeline#26
ramboz added a commit that referenced this issue Apr 19, 2019
Properly support complex headers that include child elements (strong, em, code, etc.) when
automatically generating the heading identifiers

fix #26
ramboz pushed a commit that referenced this issue Apr 19, 2019
# [1.5.0](v1.4.0...v1.5.0) (2019-04-16)

### Features

* **html pipe:** add support for anchors on headings ([f52f768](f52f768)), closes [#26](#26)
tripodsan pushed a commit that referenced this issue Apr 22, 2019
Properly support complex headers that include child elements (strong, em, code, etc.) when
automatically generating the heading identifiers

fix #26
adobe-bot pushed a commit that referenced this issue Apr 22, 2019
## [1.5.3](v1.5.2...v1.5.3) (2019-04-22)

### Bug Fixes

* **html pipe:** Generate proper identifiers for complex headings ([#273](#273)) ([8c1b447](8c1b447)), closes [#26](#26)
@adobe-bot
Copy link

🎉 This issue has been resolved in version 1.5.3 🎉

The release is available on:

Your semantic-release bot 📦🚀

ramboz added a commit that referenced this issue May 3, 2019
adobe-bot pushed a commit that referenced this issue May 29, 2019
# [2.2.0](v2.1.1...v2.2.0) (2019-05-29)

### Bug Fixes

* **html pipe:** Fix merge conflict ([6abb218](6abb218)), closes [#253](#253)
* **html pipe:** Sanitize generated markdown to avoid XSS attacks ([e2d7963](e2d7963)), closes [#253](#253)
* **html pipe:** Sanitize generated markdown to avoid XSS attacks ([8c55d0d](8c55d0d)), closes [#253](#253)

### Features

* **html pipe:** add support for anchors on headings ([65430d4](65430d4)), closes [#26](#26)
* **html pipe:** Allow custom elements and attributes in markdown ([247a6b9](247a6b9)), closes [#253](#253)
@adobe-bot
Copy link

🎉 This issue has been resolved in version 2.2.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request good first issue Good for newcomers released
Projects
None yet
6 participants