Control what is extracted with the extractcode command #15

Closed
pombredanne opened this issue Jul 5, 2015 · 6 comments

Comments

@pombredanne
Member

extractcode supports selecting what is extracted but this is not exposed as a command line option.
There should be a way to control what is extracted, possibly with an expanded option --extract=<kind> or by making extraction a separate subcommand.

@pombredanne pombredanne changed the title Control what is extracted with --extract Control what is extracted with the extractcode command Aug 12, 2015
@agneet42

agneet42 commented Mar 3, 2017

I'd like to work on this. As I understand it, the aim is to expose more options to the user.
Could you guide me a little further? @pombredanne

@agschrei
Contributor

Considering that this issue has seemingly gone stale, I am not sure how useful my contribution here is.
However, I would also like to get involved on this before filing a feature request in a new issue.

One problem I continue to have with extractcode is that it tries to extract sparse archives that are sometimes contained within source distributions, for example in the docker-ce source:
docker-ce-19.03.8/components/engine/vendor/archive/tar/testdata/pax-sparse-big.tar-extract/pax-sparse

These seemingly innocuous archives bloat to several gigabytes in size when extracted and regularly fill our scratchpad.
At the same time, since these are test archives, they don't provide any additional value in terms of identifying licenses/copyrights, etc.

My simple solution for this problem would be to add a CLI flag to extractcode that allows excluding files matching certain regex file patterns from being extracted.

An example might be:

extractcode myArchiveWithSparseTestFiles.tar --exclude="*sparse*;pax*"

Where the exclusion patterns are separated by semicolons.
Basically, this is the same approach that already exists in scancode as the --ignore flag.
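To make the proposal concrete, here is a minimal sketch of how such semicolon-separated exclusion patterns could be parsed and matched. It is not extractcode's or scancode's actual implementation: the function names `parse_exclude_option` and `is_excluded` are hypothetical, and the sketch assumes the patterns are shell-style globs (as the `*sparse*;pax*` example suggests) rather than regular expressions, matched with Python's standard `fnmatch` module.

```python
import fnmatch


def parse_exclude_option(value):
    """Split a semicolon-separated option value into a list of glob patterns.

    Hypothetical helper: empty entries and surrounding whitespace are dropped.
    """
    return [p.strip() for p in value.split(";") if p.strip()]


def is_excluded(path, patterns):
    """Return True if the full path or any single path segment matches a pattern.

    Assumption: patterns are shell-style globs handled by fnmatch, where
    "*" also matches path separators when applied to the full path.
    """
    normalized = path.replace("\\", "/")
    segments = [s for s in normalized.split("/") if s]
    for pattern in patterns:
        # Match against the whole path first, e.g. "*sparse*".
        if fnmatch.fnmatchcase(normalized, pattern):
            return True
        # Then against each directory or file name, e.g. "pax*".
        if any(fnmatch.fnmatchcase(seg, pattern) for seg in segments):
            return True
    return False


if __name__ == "__main__":
    patterns = parse_exclude_option("*sparse*;pax*")
    candidate = "vendor/archive/tar/testdata/pax-sparse-big.tar"
    print(candidate, "excluded:", is_excluded(candidate, patterns))
```

An extraction loop would call `is_excluded` on each candidate archive and skip it before unpacking, so a sparse test archive never reaches the disk. Note that a broad pattern such as `*sparse*` would also skip unrelated files with "sparse" in their path, so narrower patterns are safer in practice.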

I also assume that this issue is not unique to our use case, so if you agree that such an option provides additional value to users, I am more than willing to contribute it upstream @pombredanne

@steven-esser
Contributor

@agschrei There is an open PR that deals with a similar issue: #1946

This was done by a prospective GSoC student and AFAIK is a working solution. There is some cleanup to do on the PR itself, and I have pinged the original author to see if he has any additional code changes. Otherwise I will clean it up and hopefully have it merged soon.

We use extractcode often internally and have the same need for this feature that you do.

@steven-esser
Contributor

@agschrei If interested, you can test out that particular branch here: https://github.com/JRavi2/scancode-toolkit/tree/add-ignore-flag

You may want to test out this solution on your end to see if there are any bugs in a real use case.

@agschrei
Contributor

@MaJuRG Thanks a ton for your quick response - I had simply missed #1946 when searching for existing issues that deal with extractcode CLI.
I am now subscribed to that issue and will comment on our experience with the PR once I have had a chance to test.
Keep up the good work!

@steven-esser
Contributor

Thanks. Looking forward to your response!

pombredanne added a commit that referenced this issue Jan 12, 2022
Update .gitignore to ignore Jupyter temp files
4 participants