
Improve precision of tag analysis #40

Merged
merged 31 commits into from
Sep 16, 2018

Conversation

sergv
Collaborator

@sergv sergv commented Sep 16, 2018

This PR contains many small things that I've been collecting for some time now. Some errors are no longer detected, and a few new things (like type families or GADT-like newtypes (I was surprised, too)) now are. Some functions I just tried to optimise. I'm not sure how to concisely summarise what was done; the best way is probably to look at the new test cases.

One significant improvement has to do with tracking of quasiquoters. The lexer is now smarter and tries not to confuse list comprehensions like [foo|foo<-xs] with quasiquoters: it checks whether the part of the file after [foo| contains a closing quasiquoter bracket |]. If it doesn't, we report a list comprehension, and on some files we then actually start detecting the rest of the definitions.
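The lookahead described above can be sketched roughly as follows (an illustrative simplification, not the actual lexer code; the function name and string-based interface are assumptions):

```haskell
import Data.List (isInfixOf)

-- Hypothetical sketch of the disambiguation described above: after seeing an
-- opening "[name|", treat it as a quasiquoter only if a closing "|]" occurs
-- somewhere in the rest of the file; otherwise it is more likely a list
-- comprehension such as [foo | foo <- xs].
looksLikeQuasiQuote :: String -> Bool
looksLikeQuasiQuote restOfFile = "|]" `isInfixOf` restOfFile
```

The real lexer works on its own token stream rather than raw strings, but the decision rule is the same: no closing bracket anywhere ahead means it cannot be a quasiquote.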

One thing that does not fall under any of the improvements mentioned in the previous paragraphs, but which I should nonetheless mention, is the addition of a parent field to each tag. It does not appear in the output, so it should be safe w.r.t. backwards compatibility, but it is very useful when using the fast-tags package as a library to index sources. I reckon it would not slow anything down, since it's just a text field that gets populated when we know an entity (e.g. a class, type declaration or type family) may have some related entities (children).
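Conceptually, the parent link amounts to something like the record below (a hypothetical sketch; field and type names are illustrative and not fast-tags' actual API, and the real field is a text value rather than String):

```haskell
-- Illustrative shape of a tag carrying a parent link.
data TagKind = Class | Type | TypeFamily | Function | Constructor
  deriving (Show, Eq)

data Tag = Tag
  { tagName   :: String
  , tagKind   :: TagKind
  , tagLine   :: Int
  , tagParent :: Maybe String  -- enclosing class/type/family name, if any
  } deriving (Show, Eq)

-- A library consumer can then recover the children of a class or type
-- declaration by grouping on the parent field.
childrenOf :: String -> [Tag] -> [Tag]
childrenOf p = filter ((== Just p) . tagParent)
```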

sergv added 29 commits August 27, 2018 11:47
…ter starts that could be confused with list comprehensions with greater care
…present. Support more crazy bracket-based layouts when expanding semicolons
…ine strings with 0 indentation. Simplify string lexing rules a bit
I'm just as surprised to find out that they do actually exist as you
currently are.
@elaforge
Owner

Nice work! The tests for all the fixes are much appreciated. I tested against my codebase, and it's gotten a bit more accurate. Speed seems to be roughly the same.

It looks like CI is unhappy because in src/FastTags/LexerTypes.hs you need to explicitly import mempty for base with ghc 7.8.4. I think you can just put that in, maybe behind a version guard to avoid redundant import warnings. After that I think you should be able to merge yourself, so go right ahead. Or I could push the button, it's all the same to me :)
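Such a guard might look like the fragment below (a sketch; the exact version bound is the point to double-check, and MIN_VERSION_base is only defined when building with Cabal):

```haskell
{-# LANGUAGE CPP #-}
-- base < 4.8 (GHC 7.8.x) does not re-export mempty from the Prelude,
-- so import it explicitly there; skip the import on newer base to
-- avoid a redundant-import warning.
#if !MIN_VERSION_base(4,8,0)
import Data.Monoid (mempty)
#endif
```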

Unless you have some other changes in mind, I'll then bump the version and upload to hackage.

@elaforge
Owner

Just out of curiosity, what do you use the parent field for?

@sergv
Collaborator Author

sergv commented Sep 16, 2018

I've fixed the 7.8 build. Please push the button :)

I'm using the parent field for an alternative tagging approach where a server sits in the background, tokenises all files, and tracks all module headers/exports/imports/etc. Search queries thus get analysed in a specific import context, which hopefully reduces the number of ambiguous results (e.g. trying to look up insert will surely produce a lot of candidates, yet GHC knows how to resolve that, and so, hopefully, would the server).
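The import-context filtering amounts to something like this (an illustrative sketch, not the haskell-tags-server code; the function name, index shape, and plain-string module names are all assumptions):

```haskell
import qualified Data.Map.Strict as M

type ModuleName = String

-- Given the modules imported by the querying file and a global index from
-- symbol name to its definition sites, keep only the candidates that live
-- in one of the imported modules.
resolveInContext
  :: [ModuleName]                       -- modules imported by the current file
  -> M.Map String [(ModuleName, Int)]   -- symbol -> (defining module, line)
  -> String                             -- symbol being looked up
  -> [(ModuleName, Int)]
resolveInContext imports index sym =
  [ cand | cand@(m, _) <- M.findWithDefault [] sym index, m `elem` imports ]
```

A real resolver also has to follow re-exports and qualified imports, which is exactly the bookkeeping the server exists to do.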

I'm developing the server at https://github.com/sergv/haskell-tags-server. It is editor-agnostic, but I'm still struggling to pick an interaction protocol (or protocols) that would make all editors reasonably happy. For now I'm using BERT, but I still haven't produced a decent interaction with Emacs. Actually, if you have any ideas about which protocol would be reasonable or easy to support in Vim, that would be great, since as primarily a non-Vim user I have no clue.

@elaforge
Owner

The tags server idea is interesting. I have solved the same problem with the --fully-qualified flag, and it works well for my project. But that's only because I almost always import qualified, and I almost never re-export. To deal with re-exports I would have to track imports and exports as your tags server does, and if I did that I might be on the road to supporting unqualified imports... but if you have a persistent server which already does that, then I should give it a try as soon as it's ready.

I don't know about protocols; I don't use vim plugins, so I don't know if there's a consensus. I gather that Neovim added a msgpack API and does plugins via that, so if you are OK with targeting Neovim, then that is probably the way to go.

The other thing I'm thinking about is how to keep tags generation fast (and therefore incremental) across git checkouts. Currently I rebuild tags from scratch after every checkout, and it's OK for 150k lines, but it will probably get quite annoying at 1 million. I have some incomplete ideas about a global tags file for the pristine branch, plus an overlay for local changes, but any more complete ideas you might have would be welcome!

@elaforge elaforge merged commit cefab1e into elaforge:master Sep 16, 2018
@sergv
Collaborator Author

sergv commented Sep 17, 2018

That's interesting about tagging 1 million lines of code. Incremental update should likely work on a per-file basis, regenerating tags only for those files that changed on a git checkout. Otherwise there are just too many files, and lexing/analysis in fast-tags can only get so fast (maybe it could be improved 2x, but 10x will be very tough).
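The per-file idea boils down to comparing fingerprints before and after the checkout and re-tagging only the differences (a sketch under assumed names; a real fingerprint would be a content hash or mtime):

```haskell
import qualified Data.Map.Strict as M

type Fingerprint = String  -- stand-in for a real content hash

-- Re-tag only files that are new or whose fingerprint changed; files that
-- disappeared simply drop out of the new map (their tags get removed).
filesToRetag
  :: M.Map FilePath Fingerprint  -- fingerprints from the previous run
  -> M.Map FilePath Fingerprint  -- fingerprints after the checkout
  -> [FilePath]
filesToRetag old new =
  [ path | (path, fp) <- M.toList new, M.lookup path old /= Just fp ]
```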

A background server could help: it would listen for modifications via file notifications and take action in the background as soon as anything changes. Provided you don't immediately need correct tags after performing a git checkout, there will be time to update the index while you're, say, looking for the file you need.

@elaforge
Owner

I was thinking of creating tags as part of the build process, so that you reuse them for unchanged builds the same way you reuse the library .so or .a files. But there would have to be an intermediate combining step, since I don't think vim could efficiently scan hundreds of tags files. That could be another build target: given a hackage snapshot, build merged tags for the whole thing, and then you just put that one file in your search path.
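The combining step could be as simple as concatenating and re-sorting the per-package files, since vim binary-searches a tags file sorted by tag name (a sketch; real tags files also carry !_TAG_ header lines that a full merge would need to strip or regenerate):

```haskell
import Data.List (sort)

-- Merge many tags files, each given as its list of raw tag lines.
-- Tag lines start with the tag name, so plain lexicographic sort
-- yields the ordering vim expects for binary search.
mergeTagsFiles :: [[String]] -> [String]
mergeTagsFiles = sort . concat
```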

The tricky thing would be keeping track of the changes across git pulls and branch checkouts. I very frequently want some notion of generated output in source control that is still linked to a commit; here is yet another use for that! Otherwise this is just the same problem as wanting to build anything, except that build systems are generally too high-latency to ask for an update every single time you want to follow a tag.
