Auto-language detection #24

Merged
merged 7 commits from auto-language into master on Jun 11, 2023

Conversation

helixbass (Owner):

In this PR:

  • "auto-detect" language (basically iterate project files of all recognized types, try parsing the query against each of those languages, and don't hard-fail if some of the languages can't parse it)
  • add a tree-sitter language/parser for JavaScript
  • print legitimate errors for a few failure cases

To test:
Existing behavior should continue to work.
If you omit the --language command-line option in a project with only a single supported file type (e.g. a Rust project), you should see the same matches as with --language rust. And hopefully performance (of a release build) is relatively comparable (it looked a bit slower when running against the TypeScript compiler codebase, but when running against a smaller TypeScript codebase it seemed indistinguishable; I'm not going to worry about it too much for now).
If you run in a directory with multiple types of supported files (Rust/TypeScript/JS), e.g. the newly added tests/fixtures/mixed_project/ directory, with a query that should only be parseable by one of those languages' tree-sitter grammars, you should see the expected results for those files but no errors related to trying to parse the query for the other supported file types encountered.
If you run in a directory with multiple types of supported files with a query that should be parseable by more than one of those languages' tree-sitter grammars (e.g. (arrow_function) @a for TypeScript/JS), you should see the expected results for files of all of those file types.

Closes #22

Based on specify-files

}

fn extensions(&self) -> Vec<&'static str> {
vec!["js", "jsx", "vue", "cjs", "mjs"]
helixbass (Owner Author):

This method is for mapping back to a known supported language when iterating through project files.

I got this list from the ignore source code. I think there will probably be more subtlety to some of this tree-sitter grammar <-> filename mapping (e.g. the TypeScript vs. TSX grammars), but this seems like a reasonable quick-and-dirty approach for the moment?
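
The lookup this extensions() method feeds into can be sketched roughly like so (the types and names here are illustrative stand-ins, not the PR's actual definitions): each supported language advertises the extensions it claims, and we scan the list to map a file's extension back to a language.

```rust
// Hypothetical stand-ins for the PR's SupportedLanguage machinery.
#[derive(Debug, PartialEq, Clone, Copy)]
enum LanguageName {
    Rust,
    Javascript,
}

struct Lang {
    name: LanguageName,
    extensions: &'static [&'static str],
}

fn all_languages() -> Vec<Lang> {
    vec![
        Lang { name: LanguageName::Rust, extensions: &["rs"] },
        Lang {
            name: LanguageName::Javascript,
            extensions: &["js", "jsx", "vue", "cjs", "mjs"],
        },
    ]
}

// Linear scan over the languages; fine for a handful of entries.
fn language_for_extension(ext: &str) -> Option<LanguageName> {
    all_languages()
        .into_iter()
        .find(|l| l.extensions.contains(&ext))
        .map(|l| l.name)
}

fn main() {
    assert_eq!(language_for_extension("mjs"), Some(LanguageName::Javascript));
    assert_eq!(language_for_extension("py"), None);
}
```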

SupportedLanguageJavascript
}

pub fn get_all_supported_languages() -> HashMap<SupportedLanguageName, Box<dyn SupportedLanguage>> {
helixbass (Owner Author):

I started to think about making this eg a static Lazy singleton and then decided I was being premature optimization-y
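
For reference, that lazy singleton needn't even pull in a dependency on recent Rust: std::sync::OnceLock (or the once_cell crate's Lazy on older toolchains) would look roughly like this. The map contents here are hypothetical, not the PR's actual data.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Hypothetical lazily-initialized global: built once on first access,
// then shared (immutably) by every subsequent caller.
fn supported_language_crates() -> &'static HashMap<&'static str, &'static str> {
    static LANGUAGES: OnceLock<HashMap<&'static str, &'static str>> = OnceLock::new();
    LANGUAGES.get_or_init(|| {
        HashMap::from([
            ("rust", "tree-sitter-rust"),
            ("javascript", "tree-sitter-javascript"),
        ])
    })
}

fn main() {
    assert_eq!(
        supported_language_crates().get("rust"),
        Some(&"tree-sitter-rust")
    );
}
```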

])
}

pub fn maybe_supported_language_from_path(path: &Path) -> Option<Box<dyn SupportedLanguage>> {
helixbass (Owner Author):

This is the core thing mapping project file path back to supported language, so this is getting called once per project file

@@ -91,25 +99,77 @@ fn get_output_mode(args: &Args) -> OutputMode {
}
}

struct MaybeInitializedCaptureIndex(AtomicU32);
helixbass (Owner Author):

This is maybe questionable

The weird thing this is trying to address is that we can't validate a provided --capture (capture name) command-line argument (and, "parse-don't-validate"-style, resolve it to the corresponding capture index) until we have a successfully parsed tree-sitter query. But that now happens "on-the-fly" in parallel-world.

So it seems like we want a shared, atomic, lazily-initialized resolved capture index (where the failure case also has to be considered: if they provided an invalid capture name, we'd like that to fail "once" with a nicely printed error message)?

src/lib.rs Outdated

impl MaybeInitializedCaptureIndex {
fn mark_failed(&self) {
self.0.store(u32::MAX - 1, Ordering::Relaxed);
helixbass (Owner Author):

What I did (I guess in order to avoid having multiple atomics going on, at which point the whole Ordering thing gets harder to reason about) was just have a single AtomicU32 and encode the "uninitialized" and "invalid capture name" states into its value
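
The encoding described above can be sketched like this (method names are illustrative, not necessarily the PR's exact code): two sentinel values carve the "uninitialized" and "failed" states out of the top of the u32 range, leaving everything below u32::MAX - 1 as a valid capture index.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct MaybeInitializedCaptureIndex(AtomicU32);

impl MaybeInitializedCaptureIndex {
    const UNINITIALIZED: u32 = u32::MAX;
    const FAILED: u32 = u32::MAX - 1;

    fn new() -> Self {
        Self(AtomicU32::new(Self::UNINITIALIZED))
    }

    fn mark_failed(&self) {
        self.0.store(Self::FAILED, Ordering::Relaxed);
    }

    fn set(&self, index: u32) {
        // Real capture indices must stay below the sentinel range.
        assert!(index < Self::FAILED, "index collides with sentinel values");
        self.0.store(index, Ordering::Relaxed);
    }

    // Ok(None) = not resolved yet, Err(()) = known-bad capture name.
    fn get(&self) -> Result<Option<u32>, ()> {
        match self.0.load(Ordering::Relaxed) {
            Self::UNINITIALIZED => Ok(None),
            Self::FAILED => Err(()),
            index => Ok(Some(index)),
        }
    }
}

fn main() {
    let idx = MaybeInitializedCaptureIndex::new();
    assert_eq!(idx.get(), Ok(None));
    idx.set(3);
    assert_eq!(idx.get(), Ok(Some(3)));
}
```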

@@ -129,6 +189,38 @@ pub fn run(args: Args) {
.unwrap();
buffer_writer.print(printer.get_mut()).unwrap();
});

error_if_no_successful_query_parsing(&query_or_failure_by_language);
helixbass (Owner Author):

Turn "n lazy soft fails" into a "hard fail"

.expect("Walker should've been pre-filtered to just supported file types");
let query = return_if_none!(get_and_cache_query_for_language(
&query_source,
&query_or_failure_by_language,
helixbass (Owner Author):

There is currently no optimization for when you set the --language flag, other than the fact that the walker limits itself to returning only files of that type.

I.e. we're still using this Mutex<HashMap<...>> even when we "know" there's only one language/parsed query we're interested in.

On the one hand it seemed a little silly to tear out that "fast path", but it's hard to picture how keeping it wouldn't have made the code more complex, so I deemed maintaining it premature optimization.

.lock()
.unwrap()
.entry(language.name())
.or_insert_with(|| maybe_get_query(query_source, language.language()).map(Arc::new))
helixbass (Owner Author):

I thought a little bit about this in terms of performance. I don't know how slow query parsing is (vs. project file iteration), but there could be contention on this mutex, especially "at the beginning" (if a bunch of project files get into rayon-world and get stuck waiting on the mutex for access to the parsed query for their language).

But (in the "effectively single-language" scenario) they'd all have to wait for that query-parsing time anyway in some sense, so it doesn't seem terrible.

If multiple supported languages are being encountered, then I think parsing could be less-than-optimally parallelized; it seems like per-language mutexes (rather than one mutex for n supported languages) would be the optimization there?
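
The caching pattern under discussion can be sketched like so, with a stand-in "parse" step where the real code parses a tree-sitter query (names and the toy parse logic here are hypothetical). The first thread to hit a language pays the parse cost while holding the mutex; later lookups just clone the cached Arc.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// None in the map means "we tried this language and the query didn't parse",
// so we soft-fail once per language instead of retrying per file.
type Cache = Mutex<HashMap<&'static str, Option<Arc<String>>>>;

fn get_and_cache_query(
    cache: &Cache,
    language: &'static str,
    source: &str,
) -> Option<Arc<String>> {
    cache
        .lock()
        .unwrap()
        .entry(language)
        .or_insert_with(|| maybe_parse(source, language).map(Arc::new))
        .clone()
}

// Stand-in for tree-sitter query parsing: pretend only "rust" can parse it.
fn maybe_parse(source: &str, language: &str) -> Option<String> {
    (language == "rust").then(|| format!("{language}:{source}"))
}

fn main() {
    let cache: Cache = Mutex::new(HashMap::new());
    assert!(get_and_cache_query(&cache, "rust", "(fn) @f").is_some());
    assert!(get_and_cache_query(&cache, "javascript", "(fn) @f").is_none());
}
```

Moving to per-language mutexes would mean each entry's lock is held only while parsing that one language's query, rather than serializing all first-time parses behind a single lock.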

.current_dir(get_fixture_dir_path_from_name(fixture_dir_name))
.assert()
.failure()
// .stderr(predicate::function(|stderr: &str| {
helixbass (Owner Author):

Since it's not always super easy to visually parse out the expected vs. actual output when this fails (let alone how they differ), it's tempting to leave this "debug helper" commented out here like this for quick toggling.

Really it would be nice if it showed a "pretty" diff of expected vs. actual.
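
As a stopgap for that, a naive stdlib-only line diff might look like the following (a purely hypothetical helper, not in the PR; crates like pretty_assertions or similar do this properly, with coloring and smarter alignment).

```rust
// Naive positional line diff: marks lines that differ at the same
// position with -/+ prefixes. No alignment of insertions/deletions.
fn diff_lines(expected: &str, actual: &str) -> String {
    let mut out = String::new();
    let (mut e, mut a) = (expected.lines(), actual.lines());
    loop {
        match (e.next(), a.next()) {
            (None, None) => break,
            (exp, act) if exp == act => {
                // Identical line: print with neutral indentation.
                out.push_str(&format!("  {}\n", exp.unwrap_or("")));
            }
            (exp, act) => {
                if let Some(exp) = exp {
                    out.push_str(&format!("- {exp}\n"));
                }
                if let Some(act) = act {
                    out.push_str(&format!("+ {act}\n"));
                }
            }
        }
    }
    out
}

fn main() {
    let d = diff_lines("a\nb", "a\nc");
    assert!(d.contains("- b"));
    assert!(d.contains("+ c"));
}
```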

r#"
$ tree-sitter-grep --query-source '(arrow_function) @arrow_function'
javascript_src/index.js:1:const js_foo = () => {}
typescript_src/index.tsx:1:const foo = () => {}
helixbass (Owner Author):

Pretty cool cross-language queries!

(no idea how often that would be practically useful but sort of technically interesting)

@peterstuart (Collaborator) left a comment:

I'm not sure how it'd affect performance (probably faster in some ways, slower in others), but I think it'd be simpler to try and parse the queries against all supported languages before starting the path walking, so:

  • don't need to iterate over files that aren't needed
  • don't have to deal with lazily parsing the queries per thread
  • could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)

WDYT?

src/language.rs Outdated
SupportedLanguageName::Javascript,
Box::new(get_javascript_language()) as Box<dyn SupportedLanguage>,
),
])
peterstuart (Collaborator):

Alternatively, you can just .into() the array to convert into a HashMap.

helixbass (Owner Author):

Sure, updated to use .into()

let extension = path.extension().and_then(OsStr::to_str)?;
get_all_supported_languages()
.into_values()
.find(|language| language.extensions().contains(&extension))
peterstuart (Collaborator):

I'm sure it makes no difference performance-wise for this amount of data, but the natural collection I'd expected to look this up in is a hash map of extension -> language?

helixbass (Owner Author):

Ya, I think I was actually doing that (writing a helper to produce such a hash map, with the idea of then maybe "lazy singleton"-izing it), but (not sure if it was for some specific reason) I just decided this was easier for now

peterstuart added a commit to peterstuart/tree-sitter-grep.el that referenced this pull request Jun 11, 2023
Language detection is automatic as of
helixbass/tree-sitter-grep#24.
Base automatically changed from specify-files to master June 11, 2023 16:48
@helixbass (Owner Author) left a comment:

> I'm not sure how it'd affect performance (probably faster in some ways, slower in others), but I think it'd be simpler to try and parse the queries against all supported languages before starting the path walking, so:
>
>   • don't need to iterate over files that aren't needed
>   • don't have to deal with lazily parsing the queries per thread
>   • could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)
>
> WDYT?

Ya, that is certainly intriguing. I opened an issue. All of this may be "naive about optimization guy", but my instinct tends to go the other way:

I sort of like the lazy parsing.

And it seems like if the number of supported languages grows (I guess I'm thinking it would be 10-20-ish to start? I think I'd said something like "all languages with tree-sitter grammar/parser crates", but that's actually a lot, much of it obscure languages, so I don't know "where's the exact right place to land" as far as which to bundle with such a bundling approach), that would only increase that "up-front cost".

And if the "we're in effectively a single-language project but just didn't want to write out the --language flag" case is in fact common (as I'd guess it would be, to the point of being "predominant"), then it seems like you're just incurring the up-front cost of trying to parse the query in n additional languages (which is something, even if parallelized), since you wouldn't really be "stripping anything out"/saving anything in terms of iterating over fewer files (that would only apply in a directory where there are multiple types of supported files but the query only parses for a subset of them?).

> could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)

This is valid, but (even if on stderr) I think it could still be just "noise" for a lot of people (like "ya, no shit it's a Rust query, I just want to see the matches"), though others would probably like it?

src/lib.rs Outdated
let capture_index = query.capture_index_for_name(capture_name);
if capture_index.is_none() {
self.mark_failed();
fail(&format!("invalid capture name '{}'", capture_name));
helixbass (Owner Author):

OK, I updated this to use .compare_exchange() to avoid such a race. And in fact I "validated" it by artificially introducing a long sleep, seeing the race happen (i.e. multiple printings of the failure message), and then no longer seeing it happen after adding the .compare_exchange()
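
The fail-once idea can be sketched like this (free functions and constant names here are illustrative, matching the sentinel scheme above rather than the PR's exact code): compare_exchange from UNINITIALIZED to FAILED succeeds for exactly one thread, so only that thread prints the "invalid capture name" error.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const UNINITIALIZED: u32 = u32::MAX;
const FAILED: u32 = u32::MAX - 1;

// Returns true only for the single caller that won the transition;
// every other caller observes the state already changed and backs off.
fn try_mark_failed(state: &AtomicU32) -> bool {
    state
        .compare_exchange(UNINITIALIZED, FAILED, Ordering::Relaxed, Ordering::Relaxed)
        .is_ok()
}

fn main() {
    let state = AtomicU32::new(UNINITIALIZED);
    // First caller wins and would print the error message...
    assert!(try_mark_failed(&state));
    // ...subsequent callers lose the exchange and stay quiet.
    assert!(!try_mark_failed(&state));
}
```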

impl MaybeInitializedCaptureIndex {
const UNINITIALIZED: u32 = u32::MAX;

const FAILED: u32 = u32::MAX - 1;
helixbass (Owner Author):

These u32::MAX / u32::MAX - 1's (while encapsulated) were seeming a bit "strewn about", so I added named constants for them

} else {
// whichever other thread "won the race" will have called this fail()
// so we'll be getting killed shortly?
thread::sleep(Duration::from_millis(100_000));
helixbass (Owner Author):

Let me know if you disagree with this; I am certainly not seeing it cause anything to actually visibly sleep for 100s

@helixbass helixbass merged commit 9a4dc58 into master Jun 11, 2023
@helixbass helixbass deleted the auto-language branch June 11, 2023 17:44