Auto-language detection #24
Conversation
}

    fn extensions(&self) -> Vec<&'static str> {
        vec!["js", "jsx", "vue", "cjs", "mjs"]
This method is for mapping back to a known supported language when iterating through project files.

I got this list from the `ignore` source code. I think there will probably be more subtlety to some of this tree-sitter grammar <-> filename mapping stuff (eg the Typescript vs TSX grammars) but this seems like a reasonable (if dirty) approach for the moment?
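For concreteness, a purely hypothetical sketch of the kind of subtlety lumped extension lists eventually run into: Typescript and TSX are separate tree-sitter grammars, so a per-grammar mapping would have to split them (the names below are illustrative, not this PR's API).

```rust
// Hypothetical illustration only: a per-grammar extension split, since the
// Typescript and TSX tree-sitter grammars are distinct even though both are
// "Typescript" from the user's point of view.
enum Grammar {
    Typescript,
    Tsx,
}

fn grammar_for_extension(extension: &str) -> Option<Grammar> {
    match extension {
        "ts" | "cts" | "mts" => Some(Grammar::Typescript),
        "tsx" => Some(Grammar::Tsx),
        _ => None,
    }
}
```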
    SupportedLanguageJavascript
}

pub fn get_all_supported_languages() -> HashMap<SupportedLanguageName, Box<dyn SupportedLanguage>> {
I started to think about making this eg a static `Lazy` singleton and then decided I was being premature-optimization-y.
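For reference, a minimal sketch of what that static `Lazy` singleton could look like using the `once_cell` crate (not something this PR does): the enum/trait here are simplified stand-ins for the real types, and the boxed trait objects would need to be `Send + Sync` for the map to live in a `static`.

```rust
use std::collections::HashMap;

use once_cell::sync::Lazy;

#[derive(PartialEq, Eq, Hash)]
enum SupportedLanguageName {
    Javascript,
}

// Send + Sync supertraits so the boxed trait objects can be shared from a static.
trait SupportedLanguage: Send + Sync {
    fn extensions(&self) -> Vec<&'static str>;
}

struct SupportedLanguageJavascript;

impl SupportedLanguage for SupportedLanguageJavascript {
    fn extensions(&self) -> Vec<&'static str> {
        vec!["js", "jsx", "vue", "cjs", "mjs"]
    }
}

// Built once on first access, then shared (immutably) by every thread.
static ALL_SUPPORTED_LANGUAGES: Lazy<
    HashMap<SupportedLanguageName, Box<dyn SupportedLanguage>>,
> = Lazy::new(|| {
    HashMap::from([(
        SupportedLanguageName::Javascript,
        Box::new(SupportedLanguageJavascript) as Box<dyn SupportedLanguage>,
    )])
});
```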
    ])
}

pub fn maybe_supported_language_from_path(path: &Path) -> Option<Box<dyn SupportedLanguage>> {
This is the core thing mapping a project file path back to a supported language, so it's getting called once per project file.
@@ -91,25 +99,77 @@ fn get_output_mode(args: &Args) -> OutputMode {
    }
}

struct MaybeInitializedCaptureIndex(AtomicU32);
This is maybe questionable.

The weird thing this is trying to address is that we can't validate a provided `--capture` (capture name) command-line argument (and, sort of "parse-don't-validate"-style, resolve it to the corresponding capture index) until we have a successfully parsed tree-sitter query. But now that happens "on-the-fly" in parallel-world.

So it seems like we want a shared, atomic-y, lazily initialized resolved capture index (where the failure case also has to be considered: if they provided an invalid capture name, we'd like that to just fail "once", with a nicely printed error message)?
src/lib.rs
Outdated

impl MaybeInitializedCaptureIndex {
    fn mark_failed(&self) {
        self.0.store(u32::MAX - 1, Ordering::Relaxed);
What I did (I guess in order to avoid having multiple atomics going on, at which point the whole `Ordering` thing gets harder to reason about) was just have a single `AtomicU32` and encode the "uninitialized" and "invalid capture name" states into its value.
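As a self-contained sketch of that encoding (simplified relative to the actual implementation): two reserved sentinel values keep "not resolved yet" and "known-bad capture name" distinguishable from a real capture index inside a single `AtomicU32`.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct MaybeInitializedCaptureIndex(AtomicU32);

impl MaybeInitializedCaptureIndex {
    const UNINITIALIZED: u32 = u32::MAX;
    const FAILED: u32 = u32::MAX - 1;

    const fn new() -> Self {
        Self(AtomicU32::new(Self::UNINITIALIZED))
    }

    fn mark_failed(&self) {
        self.0.store(Self::FAILED, Ordering::Relaxed);
    }

    fn set(&self, capture_index: u32) {
        assert!(capture_index < Self::FAILED, "sentinel values are reserved");
        self.0.store(capture_index, Ordering::Relaxed);
    }

    // Ok(Some(index)) once resolved, Ok(None) if nobody has resolved it yet,
    // Err(()) if some thread already determined the capture name is invalid.
    fn get(&self) -> Result<Option<u32>, ()> {
        match self.0.load(Ordering::Relaxed) {
            Self::UNINITIALIZED => Ok(None),
            Self::FAILED => Err(()),
            index => Ok(Some(index)),
        }
    }
}
```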
@@ -129,6 +189,38 @@ pub fn run(args: Args) {
            .unwrap();
        buffer_writer.print(printer.get_mut()).unwrap();
    });

    error_if_no_successful_query_parsing(&query_or_failure_by_language);
Turn "n lazy soft fails" into a "hard fail"
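Roughly the shape of that check, assuming (per the rest of this PR) that the shared cache maps each encountered language name to `Some(parsed query)` or `None`; the types here are simplified stand-ins and the exact error/exit handling is a guess.

```rust
use std::collections::HashMap;
use std::process;
use std::sync::{Arc, Mutex};

type LanguageName = &'static str;
struct Query; // stand-in for tree_sitter::Query

// If we saw supported files but the query failed to parse for every one of
// their languages ("n lazy soft fails"), abort the whole run with one error.
fn error_if_no_successful_query_parsing(
    query_or_failure_by_language: &Mutex<HashMap<LanguageName, Option<Arc<Query>>>>,
) {
    let cache = query_or_failure_by_language.lock().unwrap();
    if !cache.is_empty() && cache.values().all(|query| query.is_none()) {
        eprintln!("error: couldn't parse query for any of the encountered languages");
        process::exit(1);
    }
}
```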
.expect("Walker should've been pre-filtered to just supported file types"); | ||
let query = return_if_none!(get_and_cache_query_for_language( | ||
&query_source, | ||
&query_or_failure_by_language, |
There is currently no optimization for when you set the `--language` flag, other than the fact that the walker limits itself to just returning files of that type, ie we're still using this `Mutex<HashMap<...>>` even when we "know" there's only one language/parsed query we're interested in.

On the one hand it seemed a little silly to tear out that "fast path", but it's hard to picture how keeping it wouldn't have made the code more complex, so I deemed it premature optimization to try and maintain it.
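Just to illustrate what that torn-out "fast path" might have amounted to (a sketch of the alternative, not what this PR does): with `--language` given, you could parse the query exactly once up front and hand every worker a plain `Arc<Query>`, bypassing the cache. The `Query::new` signature here follows the tree-sitter 0.20-era Rust API.

```rust
use std::sync::Arc;

use tree_sitter::{Language, Query};

// Parse the query once for the single, known language and share it directly,
// instead of routing every worker through the Mutex<HashMap<...>> cache.
fn parse_single_language_query(language: Language, query_source: &str) -> Arc<Query> {
    Arc::new(
        Query::new(language, query_source)
            .expect("query failed to parse for the --language grammar"),
    )
}
```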
        .lock()
        .unwrap()
        .entry(language.name())
        .or_insert_with(|| maybe_get_query(query_source, language.language()).map(Arc::new))
I thought a little bit about this in terms of performance. I don't know how slow query-parsing is (vs project file iteration), but there could be competition for this mutex, especially "at the beginning" (if a bunch of project files get into `rayon`-world and get stuck waiting on the mutex to get access to the parsed query for their language).

But (in the "effectively single-language" scenario) they'd all have to wait for that query-parsing time anyway, in some sense, so it doesn't seem terrible.

If multiple supported languages are being encountered then I think parsing could end up less-than-optimally parallelized; it seems like per-language mutexes would be the optimization there (rather than one mutex for n supported languages)?
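A sketch of that per-language locking idea, using `once_cell`'s `OnceCell` rather than literal per-language `Mutex`es (same effect; types are simplified stand-ins): one cell per supported language, built up front, so threads working on different languages never contend and each query is still parsed at most once.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use once_cell::sync::OnceCell;

type LanguageName = &'static str;
struct Query; // stand-in for tree_sitter::Query

struct PerLanguageQueryCache {
    cells: HashMap<LanguageName, OnceCell<Option<Arc<Query>>>>,
}

impl PerLanguageQueryCache {
    fn new(language_names: impl IntoIterator<Item = LanguageName>) -> Self {
        Self {
            cells: language_names
                .into_iter()
                .map(|name| (name, OnceCell::new()))
                .collect(),
        }
    }

    // The first caller for a given language runs `parse` and caches the result;
    // later callers for that language just read the cached value.
    fn get_or_parse(
        &self,
        language_name: LanguageName,
        parse: impl FnOnce() -> Option<Query>,
    ) -> Option<Arc<Query>> {
        self.cells
            .get(language_name)
            .expect("cache should be built with every supported language")
            .get_or_init(|| parse().map(Arc::new))
            .clone()
    }
}
```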
        .current_dir(get_fixture_dir_path_from_name(fixture_dir_name))
        .assert()
        .failure()
        // .stderr(predicate::function(|stderr: &str| {
Since it's not always super easy to visually parse out the expected vs actual when this fails (let alone how they're different), it's tempting to leave this "debug helper" commented out here like this for quick toggling.

Really it would be nice if it showed a "pretty" diff of expected vs actual.
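For what it's worth, one possible shape for that debug helper, sketched with the `predicates` and `similar` crates (purely illustrative, not part of this PR): a predicate that prints a unified diff of expected vs actual whenever the match fails, which could then stand in for the commented-out `.stderr(...)` line.

```rust
use predicates::prelude::*;
use predicates::Predicate;
use similar::TextDiff;

fn stderr_matches_with_diff(expected: &'static str) -> impl Predicate<str> {
    predicate::function(move |actual: &str| {
        let matches = actual == expected;
        if !matches {
            // Print a unified diff so expected vs actual is easy to eyeball.
            eprintln!("{}", TextDiff::from_lines(expected, actual).unified_diff());
        }
        matches
    })
}
```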
r#" | ||
$ tree-sitter-grep --query-source '(arrow_function) @arrow_function' | ||
javascript_src/index.js:1:const js_foo = () => {} | ||
typescript_src/index.tsx:1:const foo = () => {} |
Pretty cool: cross-language queries!

(No idea how often that would be practically useful, but it's sort of technically interesting.)
I'm not sure how it'd affect performance (probably faster in some ways, slower in others), but I think it'd be simpler to try and parse the queries against all supported languages before starting the path walking, so:
- don't need to iterate over files that aren't needed
- don't have to deal with lazily parsing the queries per thread
- could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)
WDYT?
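If it helps the discussion, here's roughly what that up-front pass could look like (a sketch only: `Query::new` per the tree-sitter 0.20-era Rust API, and the `(name, Language)` pairs standing in for however the supported-languages registry exposes them). The walker could then be restricted to just the languages that survive this pass.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use tree_sitter::{Language, Query};

// Try the query against every supported grammar before walking any paths,
// keeping only the languages whose grammar can parse it.
fn parse_query_for_all_languages(
    query_source: &str,
    supported_languages: impl IntoIterator<Item = (&'static str, Language)>,
) -> HashMap<&'static str, Arc<Query>> {
    supported_languages
        .into_iter()
        .filter_map(|(name, language)| {
            Query::new(language, query_source)
                .ok()
                .map(|query| (name, Arc::new(query)))
        })
        .collect()
}
```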
src/language.rs
Outdated
            SupportedLanguageName::Javascript,
            Box::new(get_javascript_language()) as Box<dyn SupportedLanguage>,
        ),
    ])
Alternatively, you can just `.into()` the array to convert it into a `HashMap`.
Sure, updated to use `.into()`.
    let extension = path.extension().and_then(OsStr::to_str)?;
    get_all_supported_languages()
        .into_values()
        .find(|language| language.extensions().contains(&extension))
I'm sure it makes no difference performance-wise for this amount of data, but the natural collection I'd have expected to look this up in is a hash map of extension -> language?
Ya, I think I was actually doing that (writing a helper to produce such a hash map, with the idea of then maybe "lazy singleton"-izing it), but (not sure if it was for some specific reason) I just decided this was easier for now.
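For the record, a sketch of what that helper could look like (with a simplified stand-in enum; the JS extension list is the one from this PR, the others are illustrative), building a single extension -> language map, which would also be a natural candidate for the "lazy singleton" treatment mentioned earlier.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum SupportedLanguageName {
    Rust,
    Typescript,
    Javascript,
}

// Invert the per-language extension lists into one extension -> language map.
fn extension_to_language() -> HashMap<&'static str, SupportedLanguageName> {
    let mut map = HashMap::new();
    for (language, extensions) in [
        (SupportedLanguageName::Rust, vec!["rs"]),
        (SupportedLanguageName::Typescript, vec!["ts", "tsx"]),
        (
            SupportedLanguageName::Javascript,
            vec!["js", "jsx", "vue", "cjs", "mjs"],
        ),
    ] {
        for extension in extensions {
            map.insert(extension, language);
        }
    }
    map
}

fn maybe_language_from_extension(extension: &str) -> Option<SupportedLanguageName> {
    extension_to_language().get(extension).copied()
}
```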
Language detection is automatic as of helixbass/tree-sitter-grep#24.
> I'm not sure how it'd affect performance (probably faster in some ways, slower in others), but I think it'd be simpler to try and parse the queries against all supported languages before starting the path walking, so:
> - don't need to iterate over files that aren't needed
> - don't have to deal with lazily parsing the queries per thread
> - could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)
>
> WDYT?
Ya, that is certainly intriguing. I opened an issue. All of this may be "naive-about-optimization guy" talking, but my instinct tends to go the other way:

I sort of like the lazy parsing.

And it seems like if the # of supported languages grows (I guess I'm thinking it would be 10-20-ish to start? I think I'd said something like "all languages with tree-sitter grammar/parser crates", but that's actually a lot (a lot of which are obscure languages), so I don't know where the exact right place to land is as far as which ones to bundle, with such a bundling approach), that would only increase that up-front cost.

And if the "we're in effectively a single-language project but just didn't want to write out the `--language` flag" case turns out to be common (as I'd guess it would, to the point of being predominant), then it seems like you're just incurring the up-front cost of trying to parse the query in n additional languages (which is something, even if parallelized), since you wouldn't really be stripping anything out/saving anything in terms of iterating over fewer files (that would only apply in a directory where there are multiple types of supported files but the query only parses for a subset of them?).
> could print some info before the results about which languages it's trying to use (maybe on stderr so it doesn't "dirty" up the stdout results)

This is valid, but (even if on stderr) I think it could still be just "noise" for a lot of people (like "ya, no shit it's a Rust query, I just want to see the matches"), though others would probably like it?
src/lib.rs
Outdated

        let capture_index = query.capture_index_for_name(capture_name);
        if capture_index.is_none() {
            self.mark_failed();
            fail(&format!("invalid capture name '{}'", capture_name));
Ok, I updated this to use `.compare_exchange()` to avoid such a race. And in fact I "validated" it by artificially introducing a long sleep, seeing the race happen (ie multiple printings of the failure message), and then no longer seeing it happen after adding the `.compare_exchange()`.
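The essential shape of that fix, as a sketch (constants and names simplified): only the thread whose `compare_exchange` actually performs the UNINITIALIZED -> FAILED transition prints the error, so the invalid-capture-name message can't race into being printed more than once.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const UNINITIALIZED: u32 = u32::MAX;
const FAILED: u32 = u32::MAX - 1;

fn mark_failed_and_report_once(state: &AtomicU32, capture_name: &str) {
    if state
        .compare_exchange(UNINITIALIZED, FAILED, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
    {
        // We won the race: this thread alone is responsible for the message.
        eprintln!("error: invalid capture name '{}'", capture_name);
    }
    // Losing threads see Err(current_value) and stay quiet.
}
```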
impl MaybeInitializedCaptureIndex {
    const UNINITIALIZED: u32 = u32::MAX;

    const FAILED: u32 = u32::MAX - 1;
These `u32::MAX`/`u32::MAX - 1`'s (while encapsulated) were seeming a bit "strewn about", so I added named constants for them.
        } else {
            // whichever other thread "won the race" will have called this fail()
            // so we'll be getting killed shortly?
            thread::sleep(Duration::from_millis(100_000));
Let me know if you disagree with this; I am certainly not seeing it cause anything to actually visibly sleep for 100s.
In this PR:

To test:

- Existing behavior should continue to work.
- If you try omitting the `--language` command-line option in a project with only a single supported file-type (eg a Rust project), then you should see the same matches as with `--language rust`. And hopefully performance (of a release build) is relatively comparable (it looked a bit slower when running against the Typescript-compiler-in-Rust codebase, but when running against a smaller Typescript codebase it seemed indistinguishable; I'm not going to worry about it too much for now).
- If you try running in a directory with multiple types of supported files (Rust/Typescript/JS) (eg the newly added `tests/fixtures/mixed_project/` directory) with a query that should only be parseable by one of those languages' tree-sitter grammars, then you should see the expected results for those files but not see any errors related to trying to parse for the other encountered supported file types.
- If you try running in a directory with multiple types of supported files with a query that should be parseable by more than one of those languages' tree-sitter grammars (eg `(arrow_function) @a` for Typescript/JS), then you should see the expected results for files of all of those file types.

Closes #22

Based on `specify-files`