Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can I use nbstripout in Github action as a test ? #158

Closed
12rambau opened this issue Oct 27, 2021 · 20 comments
Closed

Can I use nbstripout in Github action as a test ? #158

12rambau opened this issue Oct 27, 2021 · 20 comments

Comments

@12rambau
Copy link

I'm using the pre-commit hook of nbstripout and black to improve code quality of my python applications. As some of the maintainer are not installing pre-commit they are still pushing code with outputs.

Would it be possible to use nbstripout in Github action to check the presence of output ?

@joaonc
Copy link

joaonc commented Oct 30, 2021

This question is probably better suited for StackOverflow.
That said, check out the --dry-run option. However, this currently always return 0, which may make it not usable as a GH action (not super familiar w/ these though, so may be wrong). #159

@kynan
Copy link
Owner

kynan commented Jan 2, 2022

Interesting use case. Have you had a chance to explore this further?

@kynan kynan added help wanted state:needs follow up This issue has interesting suggestions / ideas worth following up on state:waiting Waiting for response for reporter type:enhancement labels Jan 2, 2022
@12rambau
Copy link
Author

12rambau commented Jan 5, 2022

not yet, but that's definitely on my to-do list.

@kynan
Copy link
Owner

kynan commented Sep 24, 2022

@12rambau Are you still interested in this? Feel like working on it?

@12rambau
Copy link
Author

I'm still interested but I was swamped with many things all year, I'll do some open source work starting this week and this repository is on my todo list

@kynan
Copy link
Owner

kynan commented Sep 24, 2022

Perfect! Keep me posted and let me know if you have questions, get stuck etc. :)

@arobrien
Copy link
Contributor

I created a script to do this as a stop-gap measure until we can get this into nbstripout. I'm putting in a few Pull requests now to try and make that happen, starting with #176 to get tests working on windows, and then aiming to do a re-organisation to reduce code duplication and move functionality out of main() after that.

This allows me to call python nbstripout_wrapper.py --check notebooks in my CI, and users can run python nbstripout_wrapper.py notebooks to strip all of their notebooks manually if they don't want to implement a pre-commit hook for whatever reason.

It's a bit quick-and-dirty, and it ends up re-implementing (copy-paste) a lot of nbstripout code, but key improvements were to:

  • implement recursive walking of a directory, so you can pass directories instead of just paths
  • keep track of which files would have been modified, and if running with --check, return 1 if any files would have been modified.
""""
Re-packages nbstripout functionality to be more useful for CI/CD pipelines.

Supports most of the same options as nbstripout, but does not support installation as a git hook.

Based on nbstripout version 0.6.1
This is currently very fragile to changes in nbstripout internal structure, so it is recommended to fix the version in
requirements.txt
"""

from argparse import ArgumentParser
from copy import deepcopy
import io
from pathlib import Path
from typing import Any, List
import warnings

from nbformat import NotebookNode, read, write, NO_CONVERT
from nbstripout import strip_output
from nbstripout._nbstripout import _parse_size


def process_args():
    parser = ArgumentParser()
    parser.add_argument(
        "--check",
        action="store_true",
        help="Run check without modifying any files. Returns 1 if files would have been modified.",
    )
    parser.add_argument("--keep-count", action="store_true", help="Do not strip the execution count/prompt number")
    parser.add_argument("--keep-output", action="store_true", help="Do not strip output", default=None)
    parser.add_argument(
        "--extra-keys",
        default="",
        help="Space separated list of extra keys to strip " "from metadata, e.g. metadata.foo cell.metadata.bar",
    )
    parser.add_argument(
        "--drop-empty-cells",
        action="store_true",
        help="Remove cells where `source` is empty or contains only whitepace",
    )
    parser.add_argument(
        "--drop-tagged-cells", default="", help="Space separated list of cell-tags that remove an entire cell"
    )
    parser.add_argument(
        "--strip-init-cells", action="store_true", help="Remove cells with `init_cell: true` metadata (default: False)"
    )
    parser.add_argument("--max-size", metavar="SIZE", help="Keep outputs smaller than SIZE", default="0")
    parser.add_argument("paths", nargs="*", help="Files or folders to check with nbstripout")
    return parser.parse_args()


ALLOWED_EXTENSIONS = [".ipynb"]

extra_keys = [
    "metadata.signature",
    "metadata.widgets",
    "cell.metadata.collapsed",
    "cell.metadata.ExecuteTime",
    "cell.metadata.execution",
    "cell.metadata.heading_collapsed",
    "cell.metadata.hidden",
    "cell.metadata.scrolled",
]


def read_notebook_file(p: Path):
    with io.open(p, "r", encoding="utf8") as f:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=UserWarning)
            return read(f, as_version=NO_CONVERT)


def write_notebook_file(nb: NotebookNode, p: Path):
    with io.open(p, "w", encoding="utf8", newline="") as f:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=UserWarning)
            write(nb, f)


def process_file(p: Path, is_check: bool, stripout_args: List[Any]):
    if p.suffix not in ALLOWED_EXTENSIONS:
        return []

    nb_input = read_notebook_file(p)

    nb_input_copy = deepcopy(nb_input)  # stip_output modifies input, so save a copy for comparison later
    nb_output = strip_output(nb_input, *stripout_args)

    if not is_check:
        write_notebook_file(nb_output, p)

    if not (nb_input_copy == nb_output):
        print(f"File: {p} was stripped by nbstripout.")
        return [p]
    return []


def process_folder(p: Path, is_check: bool, stripout_args: List[Any]):
    if p.is_dir():
        return [f for e in p.iterdir() for f in process_folder(e, is_check, stripout_args)]
    return process_file(p, is_check, stripout_args)


def main():
    args = process_args()
    is_check: bool = args.check
    if is_check:
        print("Checking Notebooks for outputs using nbstripout:")
    else:
        print("Cleaning Notebooks outputs using nbstripout:")

    stripout_args = [
        args.keep_output,
        args.keep_count,
        extra_keys,
        args.drop_empty_cells,
        args.drop_tagged_cells.split(),
        args.strip_init_cells,
        _parse_size(args.max_size),
    ]

    modified_files = []

    for p_str in args.paths:
        p = Path(p_str)
        if not p.exists():
            raise FileNotFoundError(p)
        modified_files.append(process_folder(p, is_check, stripout_args))

    modified_files = [f for arr in modified_files for f in arr]

    if is_check:
        if modified_files:
            print(f"{len(modified_files)} files would have been modified by nbstripout.")
            exit(1)
        else:
            print("Success: all Notebooks are already clean.")
            exit(0)

    print(f"{len(modified_files)} files were modified by nbstripout.")
    exit(0)


if __name__ == "__main__":
    main()

@kynan
Copy link
Owner

kynan commented Nov 13, 2022

@arobrien Thanks for that contribution! I'm certainly open to extending / refactoring nbstripout to support that use case. I'll have to rely on contributors to do this work though as it's not a use case I have so I'm not really in a position to provide the requirements. It seems to me that you're motivated enough to get this implemented, which is great :)

More than happy to discuss how you'd like to approach this before you start sending PRs.

@kynan
Copy link
Owner

kynan commented Nov 13, 2022

FWIW, I was reluctant to implement any functionality to recursively walk a directory tree or anything like that as you can simply use e.g. find to achieve the same (see #127 for additional thoughts), but I might be convinced otherwise to e.g. support Windows users.

@kynan kynan added this to the Backlog milestone Nov 13, 2022
@arobrien
Copy link
Contributor

@kynan yes, I've got a little bit of capacity at the moment to help out.

As for approach, I think maintining current functionality is most important, but if we can make nbstripout work the same as black, at least as far as the Black The Basics help page then that would be great. Your licence seems to be pretty much exactly the same as the MIT licence used by black, so copying code shouldn't be a problem (I'm curious why you don't just use the standard MIT licence). Key features of black mentioned in that document:

  • Black accepts source_file_or_directory on the command line, as well as - to indicate stdin, with options for include/exclude rules when walking a directory
  • Black returns exit codes if run with --check
  • Black can output diffs
  • Black supports configuration via pyproject.toml

I can see that in the strictest sense none of these features are necessary if you combine them with tools like find, or more complex command line scripts, but I would suggest that Black serves as a model of how this sort of destructive linting tool can be easily worked into both IDEs and Continuous Integration pipelines. Black also serves as a model to how to decide design questions such as option lookup hierarchy.

I see a potential roadmap for achieving this as being:

  1. Improve testing for current use cases
  2. Re-organise the code to separate out main areas of concern:
    • Entry point/command line parsing
    • File list iteration/processing
    • Modification of the file/JSON structure
    • Git integration/installation/status
  3. Implement features one at a time, in small steps:
    1. Complete configuration file processing started in Support config files #169
    2. Implement directory walk the same
      way as in black (process arguments into a file list, and then iterate over that list) https://github.com/psf/black/blob/d97b7898b34b67eb3c6839998920e17ac8c77908/src/black/__init__.py#L614
    3. Implement --check
    4. etc...

@kynan
Copy link
Owner

kynan commented Nov 14, 2022

I've got a little bit of capacity at the moment to help out.

Great to hear!

I think maintining current functionality is most important

Fully agree!

If we can make nbstripout work the same as black, [...] then that would be great

Also agree :)

copying code shouldn't be a problem

I recommend against this, regardless of the license.

I'm curious why you don't just use the standard MIT licence

It is exactly the MIT License :)

@kynan
Copy link
Owner

kynan commented Nov 14, 2022

@arobrien your roadmap looks reasonable, thanks!

@12rambau
Copy link
Author

I see that folks have clearly taken over this matter and thanks for all progresses you made so far. I just want to make a small update on my initial use case.

The problem was the following: How to perform checks equivalent to the one done in pre-commits in the CI so that users that forgot to install it get a failed test. The solution was already existing in GitHub actions:
https://github.com/pre-commit/action

I now use it in all my repositories and I get the test I was expecting for nbstripout: https://github.com/12rambau/sepal_ui/blob/452bc1df1277308c789bc324251464928e7e49fa/.github/workflows/unit.yml#L13

I leave it up to you to continue working on it or close it.

@kynan
Copy link
Owner

kynan commented Jan 15, 2023

Thanks for the update @12rambau! It's not entirely clear to me where nbstripout is invoked in the example you linked to?

@12rambau
Copy link
Author

the github action process is calling the .pre-commit-config.yaml file of the repository. I just wanted to show how easy it is to reproduce the same behavior as the local pre-commit checks.

nbstripout is called in my config file here: https://github.com/12rambau/sepal_ui/blob/main/.pre-commit-config.yaml

@kynan
Copy link
Owner

kynan commented Jan 16, 2023

Gotcha, that's the connection I had missed! Do I understand correctly that you're happy with this workflow and don't require any (further) changes in nbstripout to support your use case?

@12rambau
Copy link
Author

Yes, I'm more than happy with this workflow and yes from my side everything I need is already covered by the current implementation of nbstripout.

@kynan
Copy link
Owner

kynan commented Nov 16, 2024

Another update: as of release 0.8.0, there is a --verify flag (thanks to @jspaezp, see #195), an enhanced dry run mode which exits with return code 1 if nbstripout would have stripped anything and 0 otherwise.

Also, as of 6d181dc, --dry-run only prints output if the notebook would have actually been modified.

@12rambau @arobrien @joaonc Does this satisfy your use cases?

@12rambau
Copy link
Author

it absolutely does ! thanks for working on it.

@kynan
Copy link
Owner

kynan commented Jan 18, 2025

I'll close this for now. Please reopen if there's any outstanding issues or anything that needs fixing / can be improved.

@kynan kynan added resolution:fixed and removed help wanted state:waiting Waiting for response for reporter state:needs follow up This issue has interesting suggestions / ideas worth following up on labels Jan 18, 2025
@kynan kynan modified the milestones: Backlog, 0.8.0 Jan 18, 2025
@kynan kynan self-assigned this Jan 18, 2025
@kynan kynan closed this as completed Jan 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants