Auto-language detection #24

Merged
merged 7 commits from auto-language into master on Jun 11, 2023

Conversation

helixbass (Owner):

In this PR:

  • "auto-detect" language (basically iterate project files of all recognized types, try parsing the query against each of those languages, and don't hard-fail if some of the languages can't parse it)
  • add a tree-sitter language/parser for JavaScript
  • print legitimate errors for a few failure cases

To test:
Existing behavior should continue to work.
If you omit the --language command-line option in a project with only a single supported file type (e.g. a Rust project), you should see the same matches as with --language rust. And hopefully performance (of a release build) is relatively comparable (it looked a bit slower when running against the TypeScript compiler codebase, but when running against a smaller TypeScript codebase it seemed indistinguishable; I'm not going to worry about it too much for now).
If you run in a directory with multiple types of supported files (Rust/TypeScript/JS), e.g. the newly added tests/fixtures/mixed_project/ directory, with a query that should only be parseable by one of those languages' tree-sitter grammars, you should see the expected results for those files but no errors related to trying to parse the query for the other supported file types encountered.
If you run in a directory with multiple types of supported files with a query that should be parseable by more than one of those languages' tree-sitter grammars (e.g. (arrow_function) @a for TypeScript/JS), you should see the expected results for files of all of those file types.

Closes #22

Based on specify-files

}

fn extensions(&self) -> Vec<&'static str> {
vec!["js", "jsx", "vue", "cjs", "mjs"]
helixbass (Owner Author):

This method is for mapping back to a known supported language when iterating through project files.

I got this list from the ignore source code. I think there will probably be more subtlety to some of this tree-sitter grammar <-> filename mapping (e.g. the TypeScript vs. TSX grammars), but this seems like a reasonable quick-and-dirty approach for the moment?
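
The lookup this extensions() method feeds into can be sketched roughly like so (the types and names here are illustrative stand-ins, not the PR's actual definitions): each supported language advertises the extensions it claims, and we scan the list to map a file's extension back to a language.

```rust
// Hypothetical stand-ins for the PR's SupportedLanguage machinery.
#[derive(Debug, PartialEq, Clone, Copy)]
enum LanguageName {
    Rust,
    Javascript,
}

struct Lang {
    name: LanguageName,
    extensions: &'static [&'static str],
}

fn all_languages() -> Vec<Lang> {
    vec![
        Lang { name: LanguageName::Rust, extensions: &["rs"] },
        Lang {
            name: LanguageName::Javascript,
            extensions: &["js", "jsx", "vue", "cjs", "mjs"],
        },
    ]
}

// Linear scan over the languages; fine for a handful of entries.
fn language_for_extension(ext: &str) -> Option<LanguageName> {
    all_languages()
        .into_iter()
        .find(|l| l.extensions.contains(&ext))
        .map(|l| l.name)
}

fn main() {
    assert_eq!(language_for_extension("mjs"), Some(LanguageName::Javascript));
    assert_eq!(language_for_extension("py"), None);
}
```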

SupportedLanguageJavascript
}

pub fn get_all_supported_languages() -> HashMap<SupportedLanguageName, Box<dyn SupportedLanguage>> {
helixbass (Owner Author):

I started to think about making this eg a static Lazy singleton and then decided I was being premature optimization-y
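
For reference, that lazy singleton needn't even pull in a dependency on recent Rust: std::sync::OnceLock (or the once_cell crate's Lazy on older toolchains) would look roughly like this. The map contents here are hypothetical, not the PR's actual data.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Hypothetical lazily-initialized global: built once on first access,
// then shared (immutably) by every subsequent caller.
fn supported_language_crates() -> &'static HashMap<&'static str, &'static str> {
    static LANGUAGES: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();
    LANGUAGES.get_or_init(|| {
        HashMap::from([
            ("rust", "tree-sitter-rust"),
            ("javascript", "tree-sitter-javascript"),
        ])
    })
}

fn main() {
    assert_eq!(
        supported_language_crates().get("rust"),
        Some(&"tree-sitter-rust")
    );
}
```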

])
}

pub fn maybe_supported_language_from_path(path: &Path) -> Option<Box<dyn SupportedLanguage>> {
helixbass (Owner Author):

This is the core thing mapping project file path back to supported language, so this is getting called once per project file

@@ -91,25 +99,77 @@ fn get_output_mode(args: &Args) -> OutputMode {
}
}

struct MaybeInitializedCaptureIndex(AtomicU32);
helixbass (Owner Author):

This is maybe questionable

The weird thing this is trying to address is that we can't validate a provided --capture (capture name) command-line argument (and, "parse-don't-validate"-style, resolve it to the corresponding capture index) until we have a successfully parsed tree-sitter query. But that now happens "on-the-fly" in parallel-world.

So it seems like we want a shared, atomic, lazily-initialized resolved capture index (where the failure case also has to be considered: if they provided an invalid capture name, we'd like that to fail "once" with a nicely printed error message)?

src/lib.rs Outdated

impl MaybeInitializedCaptureIndex {
fn mark_failed(&self) {
self.0.store(u32::MAX - 1, Ordering::Relaxed);
helixbass (Owner Author):

What I did (I guess in order to avoid having multiple atomics going on, at which point the whole Ordering thing gets harder to reason about) was just have a single AtomicU32 and encode the "uninitialized" and "invalid capture name" states into its value
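
The encoding described above can be sketched like this (method names are illustrative, not necessarily the PR's exact code): two sentinel values carve the "uninitialized" and "failed" states out of the top of the u32 range, leaving everything below u32::MAX - 1 as a valid capture index.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct MaybeInitializedCaptureIndex(AtomicU32);

impl MaybeInitializedCaptureIndex {
    const UNINITIALIZED: u32 = u32::MAX;
    const FAILED: u32 = u32::MAX - 1;

    fn new() -> Self {
        Self(AtomicU32::new(Self::UNINITIALIZED))
    }

    fn mark_failed(&self) {
        self.0.store(Self::FAILED, Ordering::Relaxed);
    }

    fn set(&self, index: u32) {
        // Real capture indices must stay below the sentinel range.
        assert!(index < Self::FAILED, "index collides with sentinel values");
        self.0.store(index, Ordering::Relaxed);
    }

    // Ok(None) = not resolved yet, Err(()) = known-bad capture name.
    fn get(&self) -> Result<Option<u32>, ()> {
        match self.0.load(Ordering::Relaxed) {
            Self::UNINITIALIZED => Ok(None),
            Self::FAILED => Err(()),
            index => Ok(Some(index)),
        }
    }
}

fn main() {
    let idx = MaybeInitializedCaptureIndex::new();
    assert_eq!(idx.get(), Ok(None));
    idx.set(3);
    assert_eq!(idx.get(), Ok(Some(3)));
}
```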

@@ -129,6 +189,38 @@ pub fn run(args: Args) {
.unwrap();
buffer_writer.print(printer.get_mut()).unwrap();
});

error_if_no_successful_query_parsing(&query_or_failure_by_language);
helixbass (Owner Author):

Turn "n lazy soft fails" into a "hard fail"

.expect("Walker should've been pre-filtered to just supported file types");
let query = return_if_none!(get_and_cache_query_for_language(
&query_source,
&query_or_failure_by_language,
helixbass (Owner Author):

There is currently no optimization for when you set the --language flag, other than the fact that the walker limits itself to returning only files of that type.

I.e. we're still using this Mutex<HashMap<...>> even when we "know" there's only one language/parsed query we're interested in.

On the one hand it seemed a little silly to tear out that "fast path", but it's hard to picture how keeping it wouldn't have made the code more complex, so I deemed maintaining it premature optimization.

.lock()
.unwrap()
.entry(language.name())
.or_insert_with(|| maybe_get_query(query_source, language.language()).map(Arc::new))
helixbass (Owner Author):

I thought a little bit about this in terms of performance. I don't know how slow query parsing is (vs. project file iteration), but there could be contention on this mutex, especially "at the beginning" (if a bunch of project files get into rayon-world and get stuck waiting on the mutex for access to the parsed query for their language).

But (in the "effectively single-language" scenario) they'd all have to wait for that query-parsing time anyway in some sense, so it doesn't seem terrible.

If multiple supported languages are being encountered, then I think parsing could be less-than-optimally parallelized; it seems like per-language mutexes (rather than one mutex for n supported languages) would be the optimization there?
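
The caching pattern under discussion can be sketched like so, with a stand-in "parse" step where the real code parses a tree-sitter query (names and the toy parse logic here are hypothetical). The first thread to hit a language pays the parse cost while holding the mutex; later lookups just clone the cached Arc.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// None in the map means "we tried this language and the query didn't parse",
// so we soft-fail once per language instead of retrying per file.
type Cache = Mutex<HashMap<&'static str, Option<Arc<String>>>>;

fn get_and_cache_query(
    cache: &Cache,
    language: &'static str,
    source: &str,
) -> Option<Arc<String>> {
    cache
        .lock()
        .unwrap()
        .entry(language)
        .or_insert_with(|| maybe_parse(source, language).map(Arc::new))
        .clone()
}

// Stand-in for tree-sitter query parsing: pretend only "rust" can parse it.
fn maybe_parse(source: &str, language: &str) -> Option<String> {
    (language == "rust").then(|| format!("{language}:{source}"))
}

fn main() {
    let cache: Cache = Mutex::new(HashMap::new());
    assert!(get_and_cache_query(&cache, "rust", "(fn) @f").is_some());
    assert!(get_and_cache_query(&cache, "javascript", "(fn) @f").is_none());
}
```

Moving to per-language mutexes would mean each entry's lock is held only while parsing that one language's query, rather than serializing all first-time parses behind a single lock.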

.current_dir(get_fixture_dir_path_from_name(fixture_dir_name))
.assert()
.failure()
// .stderr(predicate::function(|stderr: &str| {
helixbass (Owner Author):

Since it's not always super easy to visually parse out the expected vs. actual output when this fails (let alone how they differ), it's tempting to leave this "debug helper" commented out here like this for quick toggling.

Really it would be nice if it showed a "pretty" diff of expected vs. actual.
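
As a stopgap for that, a naive stdlib-only line diff might look like the following (a purely hypothetical helper, not in the PR; crates like pretty_assertions or similar do this properly, with coloring and smarter alignment).

```rust
// Naive positional line diff: marks lines that differ at the same
// position with -/+ prefixes. No alignment of insertions/deletions.
fn diff_lines(expected: &str, actual: &str) -> String {
    let mut out = String::new();
    let (mut e, mut a) = (expected.lines(), actual.lines());
    loop {
        match (e.next(), a.next()) {
            (None, None) => break,
            (exp, act) if exp == act => {
                // Identical line: print with neutral indentation.
                out.push_str(&format!("  {}\n", exp.unwrap_or("")));
            }
            (exp, act) => {
                if let Some(exp) = exp {
                    out.push_str(&format!("- {exp}\n"));
                }
                if let Some(act) = act {
                    out.push_str(&format!("+ {act}\n"));
                }
            }
        }
    }
    out
}

fn main() {
    let d = diff_lines("a\nb", "a\nc");
    assert!(d.contains("- b"));
    assert!(d.contains("+ c"));
}
```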

r#"
$ tree-sitter-grep --query-source '(arrow_function) @arrow_function'
javascript_src/index.js:1:const js_foo = () => {}
typescript_src/index.tsx:1:const foo = () => {}
helixbass (Owner Author):

Pretty cool cross-language queries!

(no idea how often that would be practically useful but sort of technically interesting)

@peterstuart (Collaborator) left a comment:

I'm not sure how it'd affect performance (probably faster in some ways, slower in others), but I think it'd be simpler to try and parse the queries against all supported languages before starting the path walking, so:

  • don't need to iterate over files that aren't needed
  • don't have to deal with lazily parsing the queries per thread
  • could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)

WDYT?

src/language.rs Outdated
SupportedLanguageName::Javascript,
Box::new(get_javascript_language()) as Box<dyn SupportedLanguage>,
),
])
peterstuart (Collaborator):

Alternatively, you can just .into() the array to convert into a HashMap.

helixbass (Owner Author):

Sure, updated to use .into()

let extension = path.extension().and_then(OsStr::to_str)?;
get_all_supported_languages()
.into_values()
.find(|language| language.extensions().contains(&extension))
peterstuart (Collaborator):

I'm sure it makes no difference performance-wise for this amount of data, but the natural collection I'd expected to look this up in is a hash map of extension -> language?

helixbass (Owner Author):

Ya, I think I was actually doing that (writing a helper to produce such a hash map, with the idea of then maybe "lazy singleton"-izing it), but (not sure if it was for some specific reason) I just decided this was easier for now

peterstuart added a commit to peterstuart/tree-sitter-grep.el that referenced this pull request Jun 11, 2023
Language detection is automatic as of
helixbass/tree-sitter-grep#24.
Base automatically changed from specify-files to master June 11, 2023 16:48
@helixbass (Owner Author) left a comment:

> I'm not sure how it'd affect performance (probably faster in some ways, slower in others), but I think it'd be simpler to try and parse the queries against all supported languages before starting the path walking, so:
>
>   • don't need to iterate over files that aren't needed
>   • don't have to deal with lazily parsing the queries per thread
>   • could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)
>
> WDYT?

Ya, that is certainly intriguing. I opened an issue. All of this may be "naive about optimization guy", but my instinct tends to go the other way:

I sort of like the lazy parsing.

And it seems like if the number of supported languages grows (I guess I'm thinking it would be 10-20-ish to start? I think I'd said something like "all languages with tree-sitter grammar/parser crates", but that's actually a lot, much of it obscure languages, so I don't know "where's the exact right place to land" as far as which to bundle with such a bundling approach), that would only increase that "up-front cost".

And if the "we're in effectively a single-language project but just didn't want to write out the --language flag" case is in fact common (as I'd guess it would be, to the point of being "predominant"), then it seems like you're just incurring the up-front cost of trying to parse the query in n additional languages (which is something, even if parallelized), since you wouldn't really be "stripping anything out"/saving anything in terms of iterating over fewer files (that would only apply in a directory where there are multiple types of supported files but the query only parses for a subset of them?).

> could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)

This is valid, but (even if on stderr) I think it could still be just "noise" for a lot of people (like "ya, no shit it's a Rust query, I just want to see the matches"), though others would probably like it?

src/lib.rs Outdated
let capture_index = query.capture_index_for_name(capture_name);
if capture_index.is_none() {
self.mark_failed();
fail(&format!("invalid capture name '{}'", capture_name));
helixbass (Owner Author):

OK, I updated this to use .compare_exchange() to avoid such a race. And in fact I "validated" it by artificially introducing a long sleep, seeing the race happen (i.e. multiple printings of the failure message), and then no longer seeing it happen after adding the .compare_exchange()
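
The fail-once idea can be sketched like this (free functions and constant names here are illustrative, matching the sentinel scheme above rather than the PR's exact code): compare_exchange from UNINITIALIZED to FAILED succeeds for exactly one thread, so only that thread prints the "invalid capture name" error.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const UNINITIALIZED: u32 = u32::MAX;
const FAILED: u32 = u32::MAX - 1;

// Returns true only for the single caller that won the transition;
// every other caller observes the state already changed and backs off.
fn try_mark_failed(state: &AtomicU32) -> bool {
    state
        .compare_exchange(UNINITIALIZED, FAILED, Ordering::Relaxed, Ordering::Relaxed)
        .is_ok()
}

fn main() {
    let state = AtomicU32::new(UNINITIALIZED);
    // First caller wins and would print the error message...
    assert!(try_mark_failed(&state));
    // ...subsequent callers lose the exchange and stay quiet.
    assert!(!try_mark_failed(&state));
}
```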

impl MaybeInitializedCaptureIndex {
const UNINITIALIZED: u32 = u32::MAX;

const FAILED: u32 = u32::MAX - 1;
helixbass (Owner Author):

These u32::MAX / u32::MAX - 1's (while encapsulated) were seeming a bit "strewn about", so I added named constants for them

} else {
// whichever other thread "won the race" will have called this fail()
// so we'll be getting killed shortly?
thread::sleep(Duration::from_millis(100_000));
helixbass (Owner Author):

Let me know if you disagree with this; I am certainly not seeing it cause anything to actually visibly sleep for 100s

@helixbass helixbass merged commit 9a4dc58 into master Jun 11, 2023
@helixbass helixbass deleted the auto-language branch June 11, 2023 17:44