-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide access to a parse tree as an alternative to HTML output #1086
Comments
This idea is hanging in the air for a while. It's never been a priority as it lacked real use-cases, but now as you mentioned those two it makes very real sense. Another problem is that in one or two attempts at doing this it was done as a part of bigger (and controversial) refactorings of the whole core library, so it haven't got merged. I will probably get to it myself eventually, but if you'd fancy giving it a try, I could give some guidance. |
I'm having difficulties in understanding the inner workings of highlight.js, both because there is not a lot of documentation/comments and because the code is written in a really imperative jump-around way. Also, it seems like HTML generation is really baked into the core of highlight.js. |
Yes, it was deliberately done so in the past to squeeze the code size as much as possible. Currently this "optimization target" is much less desirable and that's why I'm all for refactoring this. I'll do a short writeup about the architecture of the core to make it more comprehensible. |
This sounds great, especially since there seems to be no render-agnostic snytax highlighter in JavaScript land right now. |
A status update on this… Before writing anything I decided to first do a pretty significant upgrade to the parser that I was planning to do since a long time ago anyway but a few bugs came up lately bumped the importance of this, and incidentally I've finally came up with an actual plan on how to implement it. And there'd be no point in documenting the current core as it will change immediately after that. (If interested, the change in question is one I described in this blog post, under the "Complex modes" section.) |
Author of slap text editor here. I was going to open a new ticket for this but looks like @derhuerst phrased it very well. highlight.js has many other uses than highlighting in the browser. It would be great if parsing was decoupled from HTML generation. See also slap-editor/editor-widget#134. We are considering porting our current highlighting implementation using highlight.js, in combination with HTML manipulation with cheerio, to another option that would not require any HTML manipulation such as lowlight (which itself depends on highlight.js but does not seem to perform any HTML manipulation). Glad to help out in any way if this feature request moves forward! |
I'm using https://github.com/wooorm/lowlight and writing my RTF formatter ATM so I'd be interested in support for AST out of the box. I think this would allow adding different formatters as pygment does. |
I think our HTML stuff has some unique benefits, but if there were two pieces people could always choice to use the one we bundled or replace it with a different once. I've been digging into the parser a lot, so I'll keep this in mind. @derhuerst Any idea what that parsed but not HTML format might look like? |
Usually you annotate Abstract Syntax Trees (ASTs) generated by a parser with additional fields, e.g. The specific mechanism to add this info of course depends on the specific AST format being used. The most well-known one is ESTree used by Mozilla and the Babel AST format (an extension of ESTree). There's also the All of these AST formats have dozens or even hundreds of libraries built around them, supporting all kinds of use cases, from parsing to transformations to formatting. |
If we add this support to highlight.js natively, do we kill the whole lowlight project? |
Yes, probably! |
Well sure, we wouldn't change our API. We'd likely add a new method or two to return some sort of node tree instead of HTML... and maybe some method to turn a node tree into HTML... then we'd just wrap ourselves by glueing the two together with the old api. |
@wooorm Before building your own parser that I assume is derived from ours did you try to first add the functionality to our source? If so was there a reason you gave up on that approach and went a different way? |
I needed hast, and hast was new (my invention). If I recall correctly, people weren't interested in ASTs as much back then so I rolled my own instead of raising an issue requesting my own format to be supported |
Well I meant the core functionality. Once you had a list of parsed tokens you could turn it 8nto any format pretty easily I’d think. |
The core functionality is the same for highlight.js and lowlight, except in code style, and in one creating HTML and the other a syntax tree. I think my above comment clarifies that no, I did not raise an issue, and gives reasoning for why not? Here is a brief history for lowlight. |
Except it's not: All 3 of these will likely bite you eventually unless you are watching closely and porting the same changes to your rewrite... if I were you I wouldn't want to keep up with things like that... that's one of the disadvantages of rewriting vs reusing the core engine. |
Yes, there are differences, and for the last almost-four-years I’ve worked on porting them over. You assume that I want to maintain lowlight like this for years to come. I’d prefer for highlight to support syntax trees. You already asked this and I responded:
|
Just making sure you were aware. :-) I might play around with this idea just to see what the hook in points are... I don' think this would be a difficult thing to do, just need to figure out how to do it nicely. Our existing system isn't very modular. :) |
And again, all this is considered private API and might stay so... it's mostly to make advanced integrations with stuff like lowlight possible and MUCH simpler. Instead of forking our whole library like you used to now you should be able to do the same thing with 50-100 lines of code. If we made an API public public we might start with the tiny builder API... so we'd support:
Now you only implement (or whatever we named them):
Done. :-) And our internal Tree data structure handles all the messy stack details/close all nodes, etc... |
This issue is about creating a public API for that. My suggestions aren‘t for how things work internally, but for what is exposed |
So question to you: Can you not build lowlight on the builder API alone? |
The layering is about how low-level you want to be. The fastest way to get output is a Builder, that would be what I'd steer most non-advanced implementers to, and I think it makes sense as a simple 3 method public API... the most complex and customizable way is an Emitter. Builder is trivially simple, Emitter requires more effort on the part of the implementer. The way you're talking it sounds like you really want to write a Builder not an Emitter because you don't want the emitter responsibilities. Builders are dumb, emitters have to have brains and be able to have an intelligent dialog with the parser. Such as "hey, add a keyword to the output for me". The parser itself WANTS to deal with an emitter, not a builder. So that it doesn't concern itself with the output concerns. |
I was able to get up to a point, but there are/were other issues which I raised in other issues.
You may do that! My suggestion is to support the bare minimum that we need now, in a low-level way that should be powerful enough to do most things. I posted HTML and hast making above, I can see how emphasize could also be made to work for ANSI coloring. I don’t know about PDFs so I can’t make suggestions for that. Re adapter / builder / emitter: I didn’t study CS; I don’t understand the subtleties you intend between these terms. |
No subtly there, those terms are mostly made up (they may have some meanings I dunno, I did try to give them meaningful names) - but it's just what I decided to call them. The separation of concerns and different layers of abstraction are the real things to take away from what I'm talking about.
Registration of aliases and getting a new instance are solvable - no objections there. :-) Though I worry about regular users making instances they don't need, but I guess that's a documentation issue?
I'd think a semi-private API is far better than forking the whole project. You update to the semi-private API and then pin to the latest compatible version of HLJS. Even if it remains private there are no huge plans to rejigger this stuff all the time. And now updating 90% of the time is just a matter of bumping the version and running your test suite. No more porting over deep internal parser bug fixes/changes. That sounds worlds better than what you are doing now - treating the whole codebase as a HUGE private api. :-) |
Now we have a chicken in egg problem because I don't feel these things are best designed/used in the dark. I'd want a few people to actually use the system before we ever declared it a public API and locked ourselves into it for at least a major revision cycle and perhaps MUCH longer. So I think the cycle has to start with advanced integrators actual using the new stuff in the real world - even if it's not blessed for general public usage and not 100% API stable (from one version to the next). That's what NPM version pinning is more IMHO. |
But even if no one ever uses it I still think the internals of the parser have been much improved by these changes... EVEN if we dropped the token tree in the future and went back to pure strings as of right I'd still use the same API for the emitter interface. (I use interface here in the TypeScript/API sense). |
So question to you: Can you not build lowlight on the builder/walk API alone? My guess it would be a lot less code and a lot closer to what you're asking for. |
Also there are stability/debugging issues as well related to how these concerns are separated. The plain reading of this issues title (I didn't go back and read the whole thread) was requesting a tree output alternative to string output. That's what
Vs an emitter, which is running in parallel with the parsing process. It'd be very easy to turn the parse tree into JSON and pass it around, render it later, etc... so the idea of writing an Emitter (that the parser is talking to real-time) vs writing a Builder (that the Tree walker can use after the fact to generate another form of presentation) are different conceptually also - not just in their complexity of implementation. |
hljs now has an For context: 53% of hljs downloads on npm come from lowlight, it’s used a lot, I don’t want to provide a potentially bad user experience by depending on private APIs. And I think if lowlight would implement this api, “I'd want a few people to actually use the system before we ever declared” is pretty much solved. You are the maintainer of hljs and you are free to implement these features the way you want! |
If I've offended I have not meant to. I'm just discussing and explaining my thinking.
I meant use/integrate with the API - I don't mean end users, I mean integrators. I assume before and after the end user experience for your library will remain 100% unchanged - assuming the new functionality on our end actually works, so those really aren't useful data points. It's not the end user experience of your library users we're talking about, it's the integrator experience - ie, your experience. And at this point you and I are merely two individual data points as having implemented working Emitters. :-) If you can I still urge you to try writing a builder and see how that feels since you're working at a much higher level. It would be helpful to have feedback on that.
This is the problem with features without docs - and truly sorry if I steered ya in the wrong direction originally. :-) It's a fine way to go, but I just don't think it was clear that you're working at the lowest-possible level available (short of forking the library). So the API is indeed a bit more complex at that level. The builder API is a higher level API and therefore much simpler. If I'd had the hindsight of this whole discussion I may have suggested the builder from the get-go, but hindsight is 20/20. :-) |
Do you publish on NPM? Can you not pin your library against a "known good" version of ours? I guess I fail to see how forking the whole library is better/easier than pinning to a specific version of HLJS that provides a specific "known good" API - whether private or not. Again, isn't forking a whole project kind of the "worse case" scenario - in that case you're treating the whole project as a giant private API... if we change a ton of stuff, you're "stuck" on the prior version until you manually update. How is that any different than using a semi-private API - other than the fact that the amount of code you have to maintain is now reduced by an order of magnitude? If we change a ton of stuff in general, you now get all those changes for free. If we change a ton of stuff specifically related to the integration API... you're "stuck" on the prior version until you manually update. Seems like option 2 is better than option 1, or at least no worse... no? Am I missing something? What is the big advantage of "vendoring it" (having a static/forked copy to integrate with) vs having a static versioned copy via npm? I'm not sure the goal here was ever to make a 100% public API that we have to support to the same level of stability that we support our other public APIs. My goal was to simplify the parser, separate the concerns better, and while doing so make it so that integrators (like yourself) wouldn't necessarily have to maintain a whole custom copy of the library just to get between the Even if you insist on maintaining a fork now (for whatever reason), I'm guessing that fork should be 95% easier to maintain. Hooking in your custom code is now a 5 line patch to the core library (assuming we go ahead and add
Where do you find the time? :-) Then surely you understand my hesitation to add and document a public and stable API for a tiny, tiny subset or our users (advanced integrators) and then be locked into that API for a year (or longer). A year is the pace I currently imagine for major releases where we allow ourselves to truly "break" things. |
It's entirely possible you know more about parsers than me if you maintain a lot of them - I'm just doing the best I can with the knowledge and experience I have. :-) I have been programming for professionally for 25+ years, so hopefully that's worth something... :-)
So perhaps the confusion is over "what do we expose". I see emitter as a lower-level implementation concern. It's the interface the parser needs so that it can focus on parsing and not presentation. I made it swappable because I feel that's good design (clean separation) plus at the time I imagined we might have to ALSO ship a pure text emitter in additional to the tree emitter... until I did a few tests and the memory/CPU didn't seem to be all that different with the tree version. A side benefit of that (for someone like yourself) is that now instead of maintaining a port of the whole project - you can just track the emitter/builder API - even if the internals aren't 100% API stable. So when we release a new version you now rewrite 100 lines instead of potentially having to rewrite the whole library. :-) I'm not interested in changing this API all the time for no good reason. I expect it to be pretty stable, but that's different than making a promise to everyone that it's 100% stable.
FYI: I do agree with the thinking here - and that was originally my concern also, but it didn't seem to matter enough to outweigh what I see as a big benefit of having the tree preserved as a potential output. And if someone is using this to constantly to parse millions of lines of code and finding memory/CPU with the new implementation a real pain point, then it's only 50 LOC or so for them to entirely replace the emitter - and problem solved. And I'd perhaps even be fine with that emitter living in the core library (then we could debate the merits of each and which was a better default) but then we'd also need a way to run all the tests against both emitters, and that wasn't a high priority of mine to figure out that part. |
You do communicate a lot, which is mostly a great thing, but it’s many messages with many edits, and for me it’s hard to keep up! I feel like I’m out of depth (I don’t know, or necessarily want to know, much about highlightjs or how it is maintained); I don’t want my opinion to hold too much weight. My way of dealing with many messages is going in reverse and sending short responses back to keep the conversation manageable for me.
The fork is from 4+ years ago. We’re already pinning hljs to feature releases, because over the last 4 years it hasn’t adhered to semver, where breaking changes occurred in feature released. I’m personally strongly against lockfiles and completely locking versions.
Other that initially building lowlight, I have probably spent on 20 hours over 4 years on the 18 releases and it’s been very stable. Pretty sure I’m about to hit those hours discussing this issue. I want a working version that follows a contract where hljs provides something that lowlight can safely work with: it could look like anything, such as how it currently is (but public) or maybe a bit more simple
ANYTHING you expose, is public. This emitter is public. The With the rest of the paragraph: I understand! And I’m 👍 with whatever you decide to implement. You are free to create a tree! That’s what was originally asked, and what you delivered. But the original question was a tree instead of HTML. We now have both. And there are 4 ways to integrate. For a tree, you could go with something that is well defined and highly used. Such as hast, but it was decided to go with a tree. That means that anyone now has to adapt the custom hljs tree to what they want. Be it HTML or hast or ANSI or PDF. And you’re doing both a tree and HTML! I think a simpler solution would be: return whatever is returned by that What I suggest is less public api surface than currently in v10. P.S. lowlight (and thus hljs) is used in many static site generators, such as Gatsby or Next.js, and it’s running on some blogs with thousands of articles: performance is really important, bazillions of lines are being highlighted |
Blessing and a curse. :) One of your messages just sounded a bit frustrated so I was trying to defuse that if possible.
No worries. I try not to let any single voice hold too much weight. :)
If we do that now, let me know, but we're trying not to. Though technically it might depend on what you consider a "breaking change". When you fork the library and have to hand or hand merge things back in I'd imagine far too many things are "breaking changes". On the other hand technically no releases actually break you since you're a fork... :-)
I can understand that I suppose. I was suggested it as a way to "hedge your bets".
Interesting and impressive, I'd have guessed more.
Well if it's up to me then I declare it "good" already, and we're left discussion "public". :-) I am considering simpler method names for the builder though...
That's the hope. :-)
I don't mean "public" in the purely technical sense. Way too much in JS is public in that sense (although I'm aware there are ways to make things more private)... I'm instating referring to "stable" and "documented" vs "beta" and "undocumented"... sometimes I hate technical talk and if the confusion here is my fault, I'm sorry. For example: function highlight(languageName, code, ignore_illegals, continuation) { Someone calling this from v9 to v10 should get almost exactly (or exactly) the same output keys (well we added a few) - other than grammar improvements, bug fixes, or actual regressions (which we'd fix), etc. That API is documented and very stable. I plan to deprecate it by v11, but it'll probably still continue to work past until v12... this despite the fact that probably over 50% of the internals have been rejiggered from v9 to v10. That's the type of documented stability/public that I'm talking about when I say stable/public. A guarantee that sites can upgrade from v10 to v10.1 to v10.2 to v10.283 and everything should "just work" with 0 changes by end users. If you merely want a technically "public" API - defined as "not hidden inside a closure" then you've already got that as of v10 and I have no plans to take it away. But I also don't consider it stable, yet. v10.3 might break it, if it was necessary or useful for us to do so.
Yes, use whichever you prefer. I think for many use cases this does no harm. If you can show otherwise concretely I'm all ears.
LOL. I was speaking to you very technically as someone whose been forking the project for YEARS of what you could do INSTEAD of that. If someone with no context asked me the same question "how do I use the tree" I'd say the easy/stable way was "write a builder", not give them those 4 choices I gave you. :-)
Yes, it was an intentional choice - I wanted the parser and emitter isolated from the presentation layer. Hast (at a quick glance) is awful close to HTML semantics, that seems to be it's key intention:
It also looks a fair bit more complex than the simpler low-level solution I implemented.
Well no, you get HTML for free... :-) but... Yes, again this is mostly intentional. I'm trying to build foundational layers, allow easy access for plugins and extendability - not solve (and then have to maintain!) everything directly. Previously getting any type of tree from the parser was nigh on impossible because of the tight coupling - you decided the easiest way to do it was to fork the whole project and maintain that fork. (I wonder if just parsing the HTML back into a tree would have been simpler.)
Again, this could change with time - IF it proves to be an issue in reality.
I agree completely - you are suggesting a simpler solution, I just think that is at odds with the fact this is a low-level API. I designed it purposely for integration with the parser. That's it's primary mission - help simplify and make the parser happy. If that means you have to write an And if we provided a subclass to extend from then (and assuming addKeyword is really always open/text/close, an assumption I question) you'd have exactly the 4 method API you want.
Again, try writing a builder (3 methods only) and then tell me your thoughts. I'm not saying that's the best approach for
If you had a benchmark suite or server setup to measure this stuff more precisely that'd be nice. Or even if you had a large project that you could build with say the built-in rendered plus a custom text only emitter and show numbers from both, that could prove useful to know. Right now we just use the test suite itself as a rough benchmark - since I run it all the time and would get annoyed if it were any slower. |
Again: (pretty sure I said it earlier) If the tree based emitter is shown to be significantly worse/slower (a significant regression compared to a text only emitter) I think we'd definitely accept a PR to make a text only emitter the default option - if it solved the problem of how to run the test suite against both emitters. But so far I'm unconvinced the low-level API needs any real changes - other than perhaps a small subclass to prevent writing the NOOPs by hand. (I will think about the naming... |
And give me a few days to think/dream on things too... sometimes I do come around a few days later... :-) Though you do have me debating the necessity of That might remove one method from the interface IF you were happy with exposing your raw data object as an attribute on your emitter vs a method or getter. IE: +++ b/src/highlight.js
- result = emitter.toHTML();
+ result = emitter.output;
+++ b/src/lib/token_tree.js
- toHTML() {
+ get output() { I'll give it more thought. Again that whole method being there bothers me since it tightly couples the emitter to the renderer, which isn't good IMHO. This would be far better: result = new options.__renderer({tree: emitter.output }).render() The tree emitter knows nothing about rendering and it's output is merely a tree which HTMLRenderer knows how to render. But then that means you need both a |
I think at some point you said you expected a different API for trees vs HTML, and I don't completely disagree but I think that's something we'd solve in the future with a better API overall for highlight (if we decide to make emitters a more public facing feature): var result = highlight(code, {language: "pascal", emitter: Lowlight.Emitter })
// result.value is HAST tree generated by lowlight |
Three example cases for
I think your making some assumptions about the output stream (that it's HTML or an AST of some kind) and what's important to preserve that I simply am not making, and hence the Actually at a later point if all you had were nodes there would be no trivial way to distinguish a keyword from a named node... |
It would be nice to have a function which returns just the tree which highlight function uses inside, this separates tree creation and html generation for cases when it's not needed. Something like: // returns just the tree
function buildTree(language, code) {}
// uses this function inside and producess html
function highlight(language, code) {
const tree = buildTree(language, code)
return { value: produceHtml(tree) }
} I slightly read the above conversation and I believe this was one of wishes. I need same thing, just the tree without html stuff, a public function for doing just that. |
The current behavior is intended more for advanced integrators - so I'd love to hear more about your actual use case - how and why you're using Highlight.js, etc... Often more is learned by hearing the exact problem itself than the desired solution - because often there are much better solutions to be found. I'd also love to see some evidence that the HTML generation itself is actual a measurable computational expense. I'd wager the parsing/tree building is where MOST of the expense is. If you ONLY need the tree always you could have it with a one line patch to One other choice (if we prove that there is a real cost here) would just be to make Simply providing your own emitter is another choice for only 30-60 LOC for simple cases. |
Until I talk to more people who are using this or trying to I'm hesitant to lock down an official public API and what the "tree" is exactly... |
I think it's been written before but if not my driving motivation here was to clean up the Highlight.js codebase and make it more modular - separating the output concern from the parsing concern. Previously to do anything like this you needed to fork the whole project and then maintain that fork across time, etc. What a pain. Now you can build a custom emitter with only 50-60 LOC or less - using the standard distribution via NPM, because it exposes The end goal was never necessarily to build a Related, my thread on building a plug-in culture rather than doing everything in the core library: The fact that we ship an internal parse tree was perhaps [or not] my getting a bit carried away. Though I did want to make sure the API I built could easily support a tree based system. In hindsight there is some question of whether we should just ship a simpler string emitter... And we may still end up doing that again in the future. The major winning point is that now advanced integrations are FAR easier - and the code is more modular. And that will remain no matter how the data is represented internally. |
My use case is actually simple, I have a sql query and want to highlight it to show in console with colors. For console I don't need any html generation, just parsed tree to iterate and add colors. |
Sounds like a perfect opportunity to me - 3rd party ANSI highlighter plugin. :-) That might actually be super helpful for a CLI use of HLJS. |
@vladshcherbin if you're in need of something right now, than maybe https://github.com/wooorm/emphasize can help |
@wooorm Figures that you already went and did it. 🥇 |
@wooorm yeah, I found it from your discussion above and already tried (it works). Thought that since there's tree generation in highlight itself now, you can probably try to drop lowlight dependency. However, I don't know the details so it's just a thought 🤔 Thank you |
Lowlight wants a very specific kind of tree. :-) But that can be good thing for those who want to integrate with it. I think once Lowlight switches to the new API (hopefully?) it might be the perfect place to point anyone who is asking "just give me the tree" as you were originally. @wooorm Did you not create a separate issue for getting a new instance of HLJS, I thought you had - but I can't find it. |
I've just had a quick look at hightlight.js and didn't any discussion on this. So forgive me if this is not the right place or has already been discussed.
I'd like to propose to make highlight.js more generally usable, or at least pull out the core highlighting logic to make it independent from the output format. Just like the famous Pygments has formatters, the HTML generation would be done by a separate module/package.
highlight.js seems to be the de factor standard from syntax highlighting, even among porject not related to HTML/browsers. See the slap editor and hicat for examples: Both use it and later on manually walk/manipulate the HTML generated by highlight.js.
In the world of lightweight do-one-thing NPM packages, it would make sense to split highlight.js into two parts, one generating an treeish object representation (possibly streaming), and the second generating HTML from it.
The text was updated successfully, but these errors were encountered: