I reckon that Documenter has the same problem. Does anyone have experience adding better Julia support to a service like highlight.js, which is used in both cases?
Maybe something can be ported from services like Juno or the VSCode extension?
I have the feeling that the community of bloggers for whom this would be quite beneficial is growing. Documenter and maybe some other services would benefit from it as well.
Documenter uses highlight.js as well, as far as I recall, so it'll give the same results.
Incidentally, I’m currently looking into updating Highlights.jl to work by loading grammars and themes directly from vscode definitions, which, if it actually works, should give us native highlighting on par with vscode with very little work since we won’t have to port hundreds of grammars. It’s a little ways off though.
Right, so speed is irrelevant here: highlighting is easy, and a snippet is typically at most a few hundred lines of code, which any decently coded highlighter will process in under 1 ms.
The key, IMO, is whether there's an up-to-date Julia language specification for the highlighting. That's why Mike is discussing a system that ports the specs from vscode, which makes a lot of sense: we as a community would only have to maintain the specs in one format, and possibly write code that translates them into a format readable by chroma, prism, highlight.js, or whatever people use.
I have been thinking of using the built-in Julia parser (written in FemtoLisp, not Julia) for parsing. That is to say, given a piece of source code (either from stdin, or a file) like
```julia
abstract type Animal end

struct Lizard <: Animal
    name::String
end

race(l::Lizard, r::Rabbit) = "$(l.name) wins in wall climbing"
```
emit a stream of strings annotated as S-expressions, e.g. something like
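For illustration, the annotated stream for the first line above might look something like this (the tag names and layout here are invented; the actual format would be whatever we settle on):

```
((keyword "abstract") (whitespace " ") (keyword "type") (whitespace " ")
 (identifier "Animal") (whitespace " ") (keyword "end") (newline "\n"))
```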
that would allow reconstruction of the original verbatim, but with annotation information.
The advantages would be that:

- it is always valid and up to date,
- all it would need is the Julia binary (provided this functionality, written in FemtoLisp, is added to Julia; otherwise an external file would be needed),
- just using the built-in FemtoLisp is fast, with very little startup time (or one could use it directly).
A trivial piece of shim code could then convert this format to whatever an IDE prefers for lexed data.

The disadvantage is that someone would have to write it.
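To give a sense of how little shim code that would take: assuming a hypothetical stream of `(tag, text)` pairs like the one sketched above (none of these names are an existing API), converting it to, say, HTML spans is a few lines of Julia:

```julia
# Hypothetical shim: convert an annotated token stream to HTML.
# The (tag, text) pair format and the CSS class names are assumptions.
const CSS_CLASS = Dict(:keyword => "hljs-keyword", :identifier => "hljs-name")

function to_html(tokens)
    io = IOBuffer()
    for (tag, text) in tokens
        if haskey(CSS_CLASS, tag)
            print(io, "<span class=\"", CSS_CLASS[tag], "\">", text, "</span>")
        else
            print(io, text)  # whitespace, punctuation, etc. pass through verbatim
        end
    end
    String(take!(io))  # real code would also HTML-escape the token text
end

to_html([(:keyword, "abstract"), (:whitespace, " "), (:keyword, "type"),
         (:whitespace, " "), (:identifier, "Animal"), (:whitespace, " "),
         (:keyword, "end")])
```

A shim for another target (ANSI terminal colors, LaTeX, etc.) would just swap out the table and the `print` calls.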
Please correct me if I’m wrong, but I don’t think something like this is possible with the current design of the parser. It does have a concept of tokens, which is probably what you are looking for, but they are read one by one from an IO stream and don’t have any context attached to them. It would be possible to check whether something is a keyword, for example, using the parser infrastructure, but I believe you would have to factor out a lot of the logic there to make it work for this use case, which kind of negates a lot of the benefits from interfacing Julia’s parser in the first place. A better approach might be to use CSTParser.jl which seems a lot closer to what you are asking for and has even been discussed as an eventual replacement for the Scheme parser.
I think they do (LineNumberNode, obtained with flisp’s input-port-line), and there is a similar facility for columns (input-port-column), but currently the latter is only used in some special cases (error messages etc). AFAICT the stream keeps track of these, so they could be attached to the token information if required.
Yes, CSTParser.jl could be a viable approach too; that’s another great alternative. The important decision may just be whether we want to implement syntax highlighting within the Julia ecosystem, with the associated benefits (it will be done right, full control, no need to work around the limitations of lexers not designed for Julia, and the programming can be done mostly in Julia) and costs (possibly larger startup time, though this can be mitigated); the actual library can then be chosen from among many alternatives.
In any case, I think it is worth pursuing this. Julia’s syntax is quite difficult to parse with general parsers — we run into this with the Emacs julia-mode all the time.
What I meant by context was less the line number and column, but rather whether a token acts just as a variable name, as a keyword, an infix operator… Currently those get figured out during the construction of the s-expressions, so it’s not as straightforward to retrieve just this information alongside the tokens.
tree-sitter may actually be the simplest, and most useful path to go down here. It gives you fast parsers that produce CSTs that give you all the token info you need to highlight stuff, plus an actual syntax tree that you can do other cool things with.
Even though Atom (which is where tree-sitter originated, I believe) looks like it’ll eventually be abandoned, tree-sitter seems to have found a second life in neovim, so it’s probably not going to be abandoned itself and should accumulate parsers for plenty of languages in the long term. Writing new parsers also looks relatively straightforward compared to the regex nightmares of tmLanguage.
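For anyone who wants to poke at it, the tree-sitter CLI makes it easy to see the kind of CST it produces. A rough session might look like this (assuming the CLI is installed; the file path is illustrative):

```
git clone https://github.com/tree-sitter/tree-sitter-julia
cd tree-sitter-julia
tree-sitter generate          # build the parser from grammar.js
tree-sitter parse animals.jl  # prints the CST as an S-expression
```

The printed tree carries node kinds and byte/row/column ranges for every token, which is exactly the information a highlighter needs.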
Regarding implementing something based off of textmate/vscode grammars that I discussed above: we’d need to wrap the oniguruma regex lib, since there appear to be subtle differences in some regex syntax compared to Julia’s PCRE that can’t really be glossed over. I went down a deep rabbit hole getting oniguruma to compile and trying to wrap it – more effort than it’s worth.
So I’ll probably start to wrap tree-sitter grammars and integrate them with Highlights.jl in the near future since, as far as I can tell, basing our highlighting off of that seems reasonably future-proof and a good return on investment of time.