Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: pass file list through stdin #707

Closed
dpc opened this issue Apr 6, 2023 · 15 comments · Fixed by #708
Closed

Feature request: pass file list through stdin #707

dpc opened this issue Apr 6, 2023 · 15 comments · Fixed by #708

Comments

@dpc
Copy link
Contributor

dpc commented Apr 6, 2023

I have a long list of files (newline or \0 separated) to run through typos, and I don't see a good way to pass them other than a bash loop executing typos one by one (which is a waste of time).

Passing like:

typos $files

have issues around globbing and command line argument limit, and can't handle \0.

Would be nice is if I could:

echo "$files" | typos --stdin-paths   # for newline separated paths
echo "$files0" | typos --stdin-paths0 # for zero-separated paths
@epage
Copy link
Collaborator

epage commented Apr 7, 2023

Note that this would still be subject to #347.

Do you have prior art for a flag like this for me to look over how they present and document it?

@dpc
Copy link
Contributor Author

dpc commented Apr 7, 2023

I'll try to think of something. I'm sure I used commands like this in the pasts, but had to be something less popular than bash staples.

For most typical unix commands it's not necessary as calling them in a loop or via xargs is good enough. But with typos I think what happens is that overhead of loading some dictionaries etc. on every call is too much.

It's basically like having xargs and xargs -0 built-in to reuse setup time.

@epage
Copy link
Collaborator

epage commented Apr 8, 2023

I don't have the time right now but it would be interesting to profile startup time. There is no dictionary loading; its just part of the binary.

@dpc
Copy link
Contributor Author

dpc commented Apr 8, 2023

Oh, interesting.

Made me wonder if I haven't messed up something, so I added an echo on every call in my script

[I] {NS} dpc@mutex:~/l/f/main |typo-improvements
> just typos | wc -l
370
[I] {NS} dpc@mutex:~/l/f/main |typo-improvements
> typos --files | wc -l
352

So they roughly do the same work.

I can also see on the screen that each invocation of typos for each file takes about as much time as running typos for all files all at once.

> time typos

________________________________________________________
Executed in  168.08 millis    fish           external
   usr time  285.64 millis    2.30 millis  283.33 millis
   sys time   19.02 millis    0.07 millis   18.95 millis

In my (bash) script (per-file):

real	0m0.031s
user	0m0.017s
sys	0m0.014s

So it does look like startup time.

@epage
Copy link
Collaborator

epage commented Apr 8, 2023

So if I'm understanding you correctly, time typos <file> is taking 31ms while time typos is taking 168ms for 352 files?

One difference is it looks like your just file is running typos serially while native typos invocations are parallelized.

Similarly, if you passed all files on the command line or use #708, everything will be executed serially.

A quick comparison

  • clap $ typos Cargo.toml: 45ms
  • clap $ typos: 620ms
  • clap $ typos -j11,142ms

@dpc
Copy link
Contributor Author

dpc commented Apr 8, 2023

typos flame

@dpc
Copy link
Contributor Author

dpc commented Apr 8, 2023

The exact times are a bit scattered, and I ran typos for all files first, so it might have been a bit cold-cashed.

The above is flamegraph typos on my repo. I'm not sure if I'm reading it right, but it spends lots of time in winnon. I'm confused why _start is after all that winnon parsing.

@dpc
Copy link
Contributor Author

dpc commented Apr 8, 2023

> time typos -j1

________________________________________________________
Executed in  300.03 millis    fish           external
   usr time  280.08 millis    2.38 millis  277.70 millis
   sys time   20.07 millis    0.09 millis   19.98 millis

[I] {NS} dpc@mutex:~/l/f/main |typo-improvements
> time typos

________________________________________________________
Executed in  171.42 millis    fish           external
   usr time  293.08 millis    1.12 millis  291.97 millis
   sys time   21.03 millis    1.03 millis   20.00 millis

Not much of a difference here. Some, but I have 20 cores, so I would expect much higher.

@dpc
Copy link
Contributor Author

dpc commented Apr 8, 2023

________________________________________________________
Executed in    2.42 secs    fish           external
   usr time    7.54 secs    2.35 millis    7.53 secs
   sys time    5.32 secs    0.08 millis    5.32 secs

Using GNU parallel improved things somewhat, but still x10 slower.

@epage
Copy link
Collaborator

epage commented Apr 8, 2023

I'm not sure if I'm reading it right, but it spends lots of time in winnon.

We use winnow to parse the files which has a lot of fallback cases. There is room for improvement there (e.g. using the dispatch macro where it makes sense instead of alt) but there is also just a lot being done to ignore various syntaxes

@epage
Copy link
Collaborator

epage commented Oct 11, 2023

For me, my biggest hang up has been on the prior art / the design of the feature. The flag has been a little odd to me.

Thinking about this more, I realized that I could expand my scope of prior art if I looked to commands that take things in as files. rg --file exists but that is for patterns, and not paths. cspell --file-list is a thing.

I would lean towards

  • --file-list <PATH> where - is assumed to be stdin
  • --file-list conflicts with positional path arguments
  • --file-list prevents the default positional path argument behavior
  • We read --file-list as UTF-8 and newline delimited
    • Any non-UTF8 or newline-inclusive paths would be deferred out

Thoughts?

@dpc
Copy link
Contributor Author

dpc commented Oct 11, 2023

I have stuffed typos into a pre-commit hook and justfile so I don't really call it manually. From my PoV it's not really a typo-of (:wink:) command where I would worry all that much about CLI API convenience. rg is something I call ad-hoc when I'm looking for something (even then I rather use helix's built-in search).

I would lean towards

I'm fine with anything but since I'm here already:

--file-list

--file-list seems confusing to me. First - one can spellcheck directories and symlinks, etc. so "paths" seems more accurate. Then "list" seems just redundant. It's not going to be a map. So possibly plural noun would already carry all the meaning.

--paths or --paths-from ?

--file-list where - is assumed to be stdin

I find using - as some magic constant for stdin a clunky hack from the past. It has a potential of breaking scripts for no good reason and has poor discoverability. It's not even necessarily shorter (--file-list - vs --stdin-paths; BTW --paths-file & --paths-stdin would auto-complete well for discoverability, so would --paths-from and --paths-from-stdin). The only thing it has going for it is "our grandparents did it this way". But they also auto-generated thousand lines long build script using 2 layers of macro-processing.

We read --file-list as UTF-8 and newline delimited

I hope we retain 0-separated string support. Useful. Also kind of a crude hack from the past, but I don't know of any better practical solution. File paths can contain newline, but can't contain \0, making it a reliable path separator. And file paths don't need to be utf-8.

@epage
Copy link
Collaborator

epage commented Oct 11, 2023

Overall, I think my proposal remains unchanged based on this feedback.

--file-list seems confusing to me. First - one can spellcheck directories and symlinks, etc. so "paths" seems more accurate. Then "list" seems just redundant. It's not going to be a map. So possibly plural noun would already carry all the meaning.

I don't think we support directories. Symlinks to files seems to be fine for "files".

I find using - as some magic constant for stdin a clunky hack from the past. It has a potential of breaking scripts for no good reason and has poor discoverability.

Its a common enough convention and we are already using it.

For whatever you think of it, clig also recommends using -

I hope we retain 0-separated string support. Useful. Also kind of a crude hack from the past, but I don't know of any better practical solution. File paths can contain newline, but can't contain \0, making it a reliable path separator. And file paths don't need to be utf-8.

Unless this is something specifically causing problems, I still think this should be deferred out.

@dpc
Copy link
Contributor Author

dpc commented Oct 11, 2023

Deal. Done. :D

@dpc
Copy link
Contributor Author

dpc commented Oct 12, 2023

Thank you and till our next fruitful collaboration! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants