Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for formatting reStructuredText code snippets #9003

Merged
merged 6 commits into from
Dec 5, 2023

Conversation

BurntSushi
Copy link
Member

@BurntSushi BurntSushi commented Dec 5, 2023

(This is not possible to actually use until #8854 is merged.)

ruff_python_formatter: add reStructuredText docstring formatting support

This commit makes use of the refactoring done in prior commits to slot
in reStructuredText support. Essentially, we add a new type of code
example and look for both literal blocks and code block directives.
Literal blocks are treated as Python by default because it seems to be a
common practice.

That is, literal blocks like this:

def example():
    """
    Here's an example::

        foo( 1 )

    All done.
    """
    pass

Will get reformatted. And code blocks (via reStructuredText directives)
will also get reformatted:

def example():
    """
    Here's an example:

    .. code-block:: python

        foo( 1 )

    All done.
    """
    pass

When looking for a code block, it is possible for it to become invalid.
In which case, we back out of looking for a code example and print the
lines out as they are. As with doctest formatting, if reformatting the
code would result in invalid Python or if the code collected from the
block is invalid, then formatting is also skipped.

A number of tests have been added to check both the formatting and
resetting behavior. Mixed indentation is also tested a fair bit, since
one of my initial attempts at dealing with mixed indentation ended up
not working.

I recommend working through this PR commit-by-commit. There is in
particular a somewhat gnarly refactoring before reST support is added.

Closes #8859

@BurntSushi BurntSushi added docstring Related to docstring linting or formatting formatter Related to the formatter labels Dec 5, 2023
@BurntSushi BurntSushi added this to the Formatter: Stable milestone Dec 5, 2023
Copy link
Contributor

github-actions bot commented Dec 5, 2023

ruff-ecosystem results

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.

My only concern is that I find it difficult to assess how the changes introducing the action queue impact performance.

I think we can get a good answer to this by simply enabling docstring formatting for our benchmarks until we find the time to add dedicated docstring tests. This gives us at least an idea on how expensive the feature is for docstring not containing code examples.

self.kind = Some(CodeExampleKind::Rst(litblock));
queue.push_back(CodeExampleAddAction::Print { original });
} else {
queue.push_back(CodeExampleAddAction::Print { original });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we'll push to the VecDeque even in case where the docstring contains no code exampels? I would be interested in understanding the performance implication of allocating here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it will. Although it will push to the queue for every line, it should only allocate the first time. Even in the case where a code example is present, I don't expect the queue to ever grow beyond more than a couple elements, so the allocation should be amortized. But it's only amortized within each docstring.

It does indeed look like the happy path in the status quo does not allocate. I still perceive this to likely be a marginal cost, but I'll see about adding some benchmarks to this PR.

// Pad to the next multiple of tab_width
seen_indent_len += 8 - (seen_indent_len.rem_euclid(8));
line = &line[1..];
} else if char.is_whitespace() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python only supports \t and (space) as valid indentation characters (and \f form feeds that reset the indentation). It's unclear to me if supporting all white space is okay because we're inside a docstring or should limit it to Python-whitespace only (the same applies for indent_with_suffix, we have trim_whitespace_start helper to do so)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. I see. I hadn't known about this or the trim_whitespace_start helper.

One possible issue I see here is that the docstring whitespace normalization uses just the bare trim_start() and trim_end() routines in places. Is that correct, or should those be switched to the Python specific definition too? Same deal with indentation_length. It uses char::is_whitespace from std, which covers all the Unicode forms of whitespace too.

In any case, I've updated my uses of whitespace to trim_whitespace_start and is_python_whitespace. (Probably that module should get re-tooled a bit. i.e., Add an extension trait for char and rename trim_whitespace_start and assorted methods to trim_python_start.)

@MichaReiser
Copy link
Member

@zanieb should we remove the formatter label from the PR, considering that the functionality isn't available yet or will we remove the changelog entry manually?

@charliermarsh
Copy link
Member

@MichaReiser - We clean the changelogs manually, so no issue keeping the label on there.

This fixes a bug where if the very last line in a docstring is a
doctest, then it wasn't getting formatted.

We do a little more than fix that bug in this commit. In particular, we
industrialize how we handle actions by using a queue. Instead of trying
to do too much with each action, we make each one a bit simpler and
build the infrastructure necessary to permit more arbitrary composition.
This in particular lets us handle weird corner cases like "the code
example and the docstring end at the same time" without much fuss.

In essence, actions like Format and Reset no longer carry the "original"
line. Instead of doing that, we generate two actions: the first to
format and the second (if necessary) to print the original line.

This infrastructure will become more apparently useful when dealing with
reStructuredText blocks.
There was really no good reason to have it panic if the code block given
is empty. Instead, we can just return `None` and everything will get
handled correctly. (It will turn into a `Reset` with zero lines, which
will in turn be a no-op.)

In practice I do believe empty code blocks are not possible, but I felt
like this makes the code a bit more robust.
The comment in the source code should explain, but basically,
if the user has tab indentation enabled, then this winds up
screwing with code snippet formatting. The essential problem
seems to be that docstring normalization strips tabs, and this
interplay between code snippet reformatting with tabs winds up
making formatting non-idempotent.

We justify this hard-coded option by pointing out that tabs
get stripped as part of docstring normalization anyway.
Small style tweak.

. o O { too bad `imports_granularity` isn't stable in rustfmt yet }
This commit makes use of the refactoring done in prior commits to slot
in reStructuredText support. Essentially, we add a new type of code
example and look for *both* literal blocks and code block directives.
Literal blocks are treated as Python by default because it seems to be a
common practice[1].

That is, literal blocks like this:

```
def example():
    """
    Here's an example::

        foo( 1 )

    All done.
    """
    pass
```

Will get reformatted. And code blocks (via reStructuredText directives)
will also get reformatted:

```
def example():
    """
    Here's an example:

    .. code-block:: python

        foo( 1 )

    All done.
    """
    pass
```

When looking for a code block, it is possible for it to become invalid.
In which case, we back out of looking for a code example and print the
lines out as they are. As with doctest formatting, if reformatting the
code would result in invalid Python or if the code collected from the
block is invalid, then formatting is also skipped.

A number of tests have been added to check both the formatting and
resetting behavior. Mixed indentation is also tested a fair bit, since
one of my initial attempts at dealing with mixed indentation ended up
not working.

Closes #8859

[1]: adamchainz/blacken-docs#195
This is what seems consistent with the prevailing code.
@BurntSushi
Copy link
Member Author

BurntSushi commented Dec 5, 2023

In lieu of micro-benchmarks, I decided to do a little ad hoc benchmarking with hyperfine on dagster.

Using this branch with a cherry-pick of #8854, I ran formatting with the default config and formatting with format-code-in-docstrings enabled:

$ hyperfine \
    --warmup 3 \
    --prepare 'git reset --hard master' \
    --cleanup 'git reset --hard master' \
    'ruff format --config /tmp/emptyruff.toml ./' \
    'ruff format --config /tmp/ruff.toml ./'
Benchmark 1: ruff format --config /tmp/emptyruff.toml ./
  Time (mean ± σ):      84.4 ms ±   1.8 ms    [User: 1084.2 ms, System: 155.3 ms]
  Range (min … max):    80.8 ms …  88.2 ms    16 runs

Benchmark 2: ruff format --config /tmp/ruff.toml ./
  Time (mean ± σ):      85.1 ms ±   2.1 ms    [User: 1104.4 ms, System: 160.0 ms]
  Range (min … max):    82.1 ms …  89.5 ms    16 runs

Summary
  ruff format --config /tmp/emptyruff.toml ./ ran
    1.01 ± 0.03 times faster than ruff format --config /tmp/ruff.toml ./

Where:

$ git remote -v
origin  git@github.com:dagster-io/dagster (fetch)
origin  git@github.com:dagster-io/dagster (push)

$ git rev-parse HEAD
dbb064c2ddda74265b8174edd9775e1302ca6ba0

$ cat /tmp/emptyruff.toml
$ cat /tmp/ruff.toml
[format]
format-code-in-docstrings = true

I ran it multiple times, and the same result occurred. So there is just a very slight observable slow-down here. But, this particular pile of code does have a large number of reStructuredText code blocks. So you'd expect it to possibly run a little more slowly since it is doing more work.

On a profile, I can see run_action_queue, but it is barely a blip. To be sure, I tweaked the top-level entry point for code snippet formatting to do this:

if !self.code_example.kind.is_none()
    || CodeExampleDoctest::new(line).is_some()
    || CodeExampleRst::new(line).is_some()
{
    self.code_example.add(line, &mut self.action_queue);
    self.run_action_queue()
} else {
    self.print_one(&line.as_output())
}

So basically, as long as we weren't already collecting a code example and the current line didn't look like the start of one, then we could avoid the queue and just print the line directly. I then baked this off against what I had:

$ hyperfine \
    --warmup 3 \
    --prepare 'git reset --hard master' \
    --cleanup 'git reset --hard master' \
    'ruff-rst-formatting-with-queue format --config /tmp/emptyruff.toml ./' \
    'ruff-rst-formatting-with-queue format --config /tmp/ruff.toml ./' \
    'ruff-rst-formatting-with-fast-path format --config /tmp/ruff.toml ./'
Benchmark 1: ruff-rst-formatting-with-queue format --config /tmp/emptyruff.toml ./
  Time (mean ± σ):      84.9 ms ±   2.2 ms    [User: 1077.6 ms, System: 163.4 ms]
  Range (min … max):    80.1 ms …  89.4 ms    17 runs

Benchmark 2: ruff-rst-formatting-with-queue format --config /tmp/ruff.toml ./
  Time (mean ± σ):      85.2 ms ±   2.5 ms    [User: 1119.1 ms, System: 148.7 ms]
  Range (min … max):    81.7 ms …  92.3 ms    17 runs

Benchmark 3: ruff-rst-formatting-with-fast-path format --config /tmp/ruff.toml ./
  Time (mean ± σ):      86.0 ms ±   2.2 ms    [User: 1098.4 ms, System: 163.7 ms]
  Range (min … max):    81.4 ms …  89.7 ms    16 runs

Summary
  ruff-rst-formatting-with-queue format --config /tmp/emptyruff.toml ./ ran
    1.00 ± 0.04 times faster than ruff-rst-formatting-with-queue format --config /tmp/ruff.toml ./
    1.01 ± 0.04 times faster than ruff-rst-formatting-with-fast-path format --config /tmp/ruff.toml ./

I ran this a few times and sometimes the fast path would be a hair faster and other times it would be flipped. Running without docstring formatting enabled at all was consistently faster by 1.00-1.01 times.

I think this satisfies me personally that the queue overhead is probably negligible, at least until we have some data suggesting otherwise. (I'm sure there are more synthetic benchmarks one could construct that might show a bigger difference.)

@BurntSushi
Copy link
Member Author

Going to bring this, but @MichaReiser feel free to leave more feedback and I'll address it in a follow-up PR. :-)

@BurntSushi BurntSushi merged commit c48ba69 into main Dec 5, 2023
17 checks passed
@BurntSushi BurntSushi deleted the ag/fmt/rest branch December 5, 2023 19:14
@MichaReiser
Copy link
Member

Thanks for running the manual benchmark. This looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docstring Related to docstring linting or formatting formatter Related to the formatter
Projects
None yet
Development

Successfully merging this pull request may close these issues.

docstring code formatter: add support for reStructuredText Python code snippets
3 participants