
Add docstring validation script (from pandas) #238

Merged
merged 18 commits into from
Oct 25, 2019
Conversation

datapythonista
Contributor

xref #213

Updated the pandas validation script to be generic. Validating all the docstrings of a project will probably require some more work, to account for how the RST lists all public objects.

Removed the following pandas-only features from the original script:

  • Setting the max_open_warning option from matplotlib
  • PEP-8 validation of examples (will add in a follow-up, but it will require a dependency; also, pandas auto-imports modules, which requires updating many docstrings in the tests)
  • Check to see if numpy or pandas were imported in the examples
  • Check to see if docstrings mention private classes (in pandas, NDFrame)
  • GitHub link in the JSON report
  • Replaced pprint_thing with str when rendering the wrong-parameter error message
  • Source file in the report is no longer relative to the pandas path
  • Methods validated by introspecting Series and DataFrame (besides the ones obtained from the API rst)

Tried with a scikit-learn class and it runs the validation correctly:

################################################################################
############## Docstring (sklearn.linear_model.LinearRegression)  ##############
################################################################################

Ordinary least squares Linear Regression.

Parameters
----------
fit_intercept : boolean, optional, default True
    whether to calculate the intercept for this model. If set
    to False, no intercept will be used in calculations
    (e.g. data is expected to be already centered).

normalize : boolean, optional, default False
    This parameter is ignored when ``fit_intercept`` is set to False.
    If True, the regressors X will be normalized before regression by
    subtracting the mean and dividing by the l2-norm.
    If you wish to standardize, please use
    :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on
    an estimator with ``normalize=False``.

copy_X : boolean, optional, default True
    If True, X will be copied; else, it may be overwritten.

n_jobs : int, optional, default 1
    The number of jobs to use for the computation.
    If -1 all CPUs are used. This will only provide speedup for
    n_targets > 1 and sufficient large problems.

Attributes
----------
coef_ : array, shape (n_features, ) or (n_targets, n_features)
    Estimated coefficients for the linear regression problem.
    If multiple targets are passed during the fit (y 2D), this
    is a 2D array of shape (n_targets, n_features), while if only
    one target is passed, this is a 1D array of length n_features.

intercept_ : array
    Independent term in the linear model.

Notes
-----
From the implementation point of view, this is just plain Ordinary
Least Squares (scipy.linalg.lstsq) wrapped as a predictor object.

################################################################################
################################## Validation ##################################
################################################################################

6 Errors found:
	Closing quotes should be placed in the line after the last text in the docstring (do not close the quotes in the same line as the text, or leave a blank line between the last text and the quotes)
	Double line break found; please use only one blank line to separate sections or paragraphs, and do not leave blank lines at the end of docstrings
	Parameter "fit_intercept" type should use "bool" instead of "boolean"
	Parameter "fit_intercept" description should start with a capital letter
	Parameter "normalize" type should use "bool" instead of "boolean"
	Parameter "copy_X" type should use "bool" instead of "boolean"
3 Warnings found:
	No extended summary found
	See Also section not found
	No examples section found

@larsoner
Collaborator

Validating all the docstrings of a project will probably require some more work, to account for how the RST lists all public objects.

It seems like the general thing would be to allow projects to enumerate their own public objects (somehow). Parsing RST might only be necessary for some projects, might involve going into multiple files, etc. So I'd rather not include any RST parsing.

Basically I would make the API so that you pass a list of public functions that should be checked (either the functions themselves or the names), and it validates the docstrings of those.
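A minimal sketch of that shape (the names here are illustrative, not a final API): each project enumerates its own public names and hands them to a generic driver built on the single-object validator.

```python
def validate_all(names, validate_one):
    """Validate the docstrings of the given fully qualified names.

    ``validate_one`` is a single-object validator (like the ``validate``
    function this PR adds); each project supplies its own list of names,
    so no RST parsing is needed here.
    """
    return {name: validate_one(name) for name in names}

# A project-side call might then look like:
# results = validate_all(["mypkg.foo", "mypkg.Bar"], validate)
```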

@larsoner
Collaborator

So I'd rather not include any RST parsing.

By this I mean no RST parsing for discovering what should be documented, clearly you'll need to parse the RST of functions/classes/methods to see if the docstring is correct :)

@larsoner
Collaborator

... actually maybe the first PR's only job should be to parse and check a single function, method, or class's (__init__) docstring to see if it's correct.

Then how to enumerate over these can be the domain of each project. It's not many lines to do this and it will vary project-by-project.

@datapythonista
Contributor Author

Ok, I think then I won't implement the validation as a script, but as functions to be imported by scripts in each project.

I'll keep the validate_all function, so we have consistent formatting. But it will receive an iterable returning the objects to validate.

In future iterations we can see if something else makes more sense.

@larsoner
Collaborator

In future iterations we can see if something else makes more sense.

Agreed -- it likely will, but it will be nice at first to have this simple function to start with. The other things (e.g., python -m numpydoc -v script support, or parsing python_reference.rst file(s) or so) can easily build on it, and this keeps review and testing easier in the meantime.

@brandondavid

no RST parsing for discovering what should be documented

This approach also nicely sails around the autodoc -vs- autosummary difference I encountered while adapting the original pandas script for use with scipy.

P.S. I am actively using the current version of validate.py on scipy.stats and it seems to be working perfectly. If you'd like me to test anything in particular, please let me know.

@datapythonista
Contributor Author

I simplified the script a bit more. I finally just left the function that validates a single docstring for now. I need to have a look at what's wrong in the CI, but tests are passing.

Things that probably make sense in follow up PRs:

  • Call the new validation from __main__.py
  • Move everything in validate.Docstring to docscrape.NumpyDocString
  • Add back flake8 validation
  • See if it makes sense to add functionality to validate all docstrings (probably not I'd say)

@datapythonista
Contributor Author

Is there any reason to keep compatibility with Python 2 in numpydoc? I hadn't seen all the Python 3 stuff that was implemented in the script recently, but making it Python 2 compatible would be a significant regression.

@larsoner
Collaborator

In #235 I think we came to the consensus that we can drop support for < 3.5. That will make things simpler because you can use the newer inspect functions, right?

@datapythonista
Contributor Author

Yes, that's the main issue at the moment. There were also problems with the imports...

Should I open a separate PR where I replace the Travis 2.7 build with a 3.5 one, then?

@larsoner
Collaborator

@datapythonista if you have time please feel free; if not, I can try to do it in the next couple of days.

@datapythonista
Contributor Author

This should be ready now. I added all the reasonable tests that were missing, and manually tested with pandas and sklearn docstrings; all look good.

Let me know if you see anything, but I think this should be a good first version of the validation.

@larsoner larsoner left a comment

Working on integrating this into MNE-Python now, which currently has its own validation. Thoughts thus far:

  • We need a mechanism for skipping some checks. We could tell people to monkeypatch numpydoc.validate, but it seems like it would be nicer to allow passing a list of str keys to skip to validate.
  • I'm a bit worried about trying to execute people's examples -- there are some existing utilities for this (doctest principally, but also Sphinx builds of docs) and I was a bit surprised it even tried. We should either have a switch to disable this, or maybe disable it automatically if and only if EX01 and EX02 are in the ignore list.

try:
    from io import StringIO
except ImportError:
    from cStringIO import StringIO

Remove now that 3.5 is required

except ImportError:
    pass
else:
    continue

Couldn't this be:

        for maxsplit in range(1, name.count(".") + 1):
            module, *func_parts = name.rsplit(".", maxsplit)
            try:
                obj = importlib.import_module(module)
            except ImportError:
                pass
            else:
                break
        else:
            raise ImportError('No module ...')
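A self-contained version of this suggestion (a sketch; the function name and error message are placeholders) would try the longest module prefix first and then walk the remaining attribute parts:

```python
import importlib

def import_object(name):
    """Import the longest module prefix of ``name``, then resolve the
    remaining dotted parts as attributes (e.g. "os.path.join" imports
    os.path and then looks up "join")."""
    for maxsplit in range(1, name.count(".") + 1):
        module_name, *func_parts = name.rsplit(".", maxsplit)
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            pass
        else:
            break
    else:
        # No prefix of the name was importable
        raise ImportError(f"No module can be imported from {name!r}")
    for part in func_parts:
        obj = getattr(obj, part)
    return obj
```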

@larsoner
Collaborator

... actually I guess the ignoring part can be done just by culling the list of errors obtained at the end. But it would maybe be nice to allow disabling the running of examples somehow.

@larsoner
Collaborator

So far it seems to do everything our old code did plus a lot more!

Just one false alarm:

SS02 : mne.beamformer._dics.tf_dics : Summary does not start with a capital letter

But the summary starts with a numeral (5D time-frequency beamforming).
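The fix amounts to only flagging summaries whose first character is a lowercase letter, so numerals and symbols pass (a sketch of the idea, not the exact code that was merged):

```python
def summary_starts_capitalized(summary):
    """Return True unless the summary starts with a lowercase letter.

    Digits and symbols are allowed, which avoids the SS02 false
    positive for summaries like "5D time-frequency beamforming".
    """
    first = summary.lstrip()[:1]
    return not first.islower()

print(summary_starts_capitalized("5D time-frequency beamforming"))  # True
print(summary_starts_capitalized("estimate object sizes"))          # False
```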

@larsoner
Collaborator

Okay got another:

class SizeMixin(object):
    """Estimate MNE object sizes."""

This one I get:

GL02 : mne.utils.mixin.SizeMixin : Closing quotes should be placed in the line after the last text in the docstring

But if I add a newline after the . then pydocstyle complains:

D200: One-line docstring should fit on one line with quotes

So it seems like the GL02 check here should only run if there is more than one line in the docstring.
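In code, the suggested condition amounts to something like this (illustrative only, not the merged implementation):

```python
def gl02_applies(docstring):
    """Only check GL02 (closing quotes on their own line) when the
    docstring spans more than one line; one-line docstrings are
    exactly what pydocstyle's D200 asks for."""
    return "\n" in docstring.strip()

print(gl02_applies("Estimate MNE object sizes."))                # False
print(gl02_applies("Summary line.\n\nExtended description.\n"))  # True
```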

@datapythonista
Contributor Author

Thanks for the reviews and the tests @larsoner. I addressed your comments, including getting rid of running the examples, and also avoiding errors for capitalization with numbers.

I'm a bit unsure about:

class SizeMixin(object):
    """Estimate MNE object sizes."""

This will fail in pandas anyway because it's lacking examples... I guess what you say makes sense anyway, because you could be ignoring the errors about lacking examples and others. But it still feels a bit wrong to have this special case. Personally, I'd prefer to fail for that case. Is pydocstyle consistency important? It's not a big deal to change it if you have a strong opinion about not raising that error, but I don't think the change is worth the extra complexity.

@larsoner
Collaborator

Is pydocstyle consistency important? It's not a big deal to change it if you have a strong opinion about not raising that error, but I don't think the change is worth the extra complexity.

Currently all other checks are consistent (at least for our code) but this one. If it's not implemented here, then we and anyone else who has docstrings like this will have to implement their own workaround to look for that error type, check how many lines there are, and make an exception at our end (so that we don't lose the valid checks for not ending on a newline).

So if it's not an issue for Pandas, my vote would be to keep things as consistent as possible. If you don't want it to be correct by default, you could add an argument to validate like allow_single_line=False by default. But even then, it seems weird/ecosystem-inconsistent to go against what pydocstyle recommends by default in this case.

@datapythonista
Contributor Author

Ok, fair enough. It's not a big deal for pandas, I just didn't want to add complexity to the script if it wasn't really needed. In pandas, docstrings will fail anyway for the lack of examples, so we won't be allowing anything we don't allow now.

Btw, I forgot to answer: you're right in assuming that we can filter errors after the call. I initially thought of passing the list of excludes, but that wouldn't save execution time and would make the code much more complex. So, as you say, I think the best approach is to filter out the errors you don't care about after the function is called.
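Such post-call filtering is only a few lines. This sketch assumes each entry in result["errors"] is a (code, message) pair, matching the error codes (SS02, GL02, ...) shown in the reports above:

```python
# Checks a hypothetical project chooses to skip (post-hoc filtering).
IGNORE = {"ES01", "SA01", "EX01"}

def filter_errors(result, ignore=IGNORE):
    """Drop validation errors whose code is in the ignore set."""
    return [(code, msg) for code, msg in result["errors"]
            if code not in ignore]

result = {"errors": [("ES01", "No extended summary found"),
                     ("GL02", "Closing quotes should be on their own line")]}
print(filter_errors(result))
# [('GL02', 'Closing quotes should be on their own line')]
```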

@datapythonista
Contributor Author

Ok, fixed now. Let me know if there is anything else. Thanks!

@larsoner larsoner left a comment

Looking at the changes quickly everything looks good!

Given that this was already used in pandas, I think validating it on other code bases is most important.

  • For MNE-Python, which has ~120k lines of code, it works well for me now -- catches all the stuff we used to catch, plus found a bunch of other stuff.
  • @datapythonista do you want to make a similar PR to Pandas to use this (e.g., using pip install https://api.github.com/repos/datapythonista/numpydoc/zipball/validation in a CI) and make sure it all works well? It seems like a good way to test that it does everything you need correctly.
  • For SciPy, I implemented a simple public function crawler + validator for SciPy and hit this error:
    scipy/_lib/tests/test_import_cycles.py:83: in check_parameters_match
        for err in validate(name_)['errors']
    ../numpydoc/numpydoc/validate.py:569: in validate
        if not doc.yields and "yield" in doc.method_source:
    ../numpydoc/numpydoc/validate.py:366: in method_source
        source = inspect.getsource(self.obj)
    /usr/lib/python3.8/inspect.py:985: in getsource
        lines, lnum = getsourcelines(object)
    ...
    OSError: could not get source code
    
    It's failing for scipy.optimize.anderson, which is wrapped here so it's likely that this can be fixed at the SciPy end.
  • For SciPy, after wrapping the validate call in a try/except OSError, I then hit a ParseError for some stuff in scipy.stats. There is some magical generation stuff there that looks to be to blame, so I also think it's not numpydoc's problem.
  • For SciPy, after wrapping the validate call in a try/except (OSError, ParseError): it ran and gave:
    E           ES01 : scipy.cluster.hierarchy.average : No extended summary found
    E           ES01 : scipy.cluster.hierarchy.complete : No extended summary found
    ...
    E           YD01 : scipy.special._basic.clpmn : No Yields section found
    E           YD01 : scipy.stats.stats.iqr : No Yields section found
    E           7458 errors
    
    So I think it should be usable for SciPy as well!
  • @jnothman do you want to see how well this works for sklearn?
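The crawler described above can be sketched as follows (a hypothetical helper, not SciPy's actual code; validate is passed in so the sketch stays self-contained, and OSError is swallowed as in the scipy.optimize.anderson case):

```python
import importlib
import inspect

def crawl_and_validate(module_name, validate):
    """Validate every public function/class in a module, skipping
    objects whose source cannot be retrieved (the OSError case above)."""
    module = importlib.import_module(module_name)
    errors = []
    for name in getattr(module, "__all__", dir(module)):
        if name.startswith("_"):
            continue
        obj = getattr(module, name, None)
        if not (inspect.isfunction(obj) or inspect.isclass(obj)):
            continue
        try:
            result = validate(f"{module_name}.{name}")
        except OSError:  # e.g. wrapped functions without source
            continue
        errors.extend(result["errors"])
    return errors
```

Wrapping the inner call in a broader try/except (OSError, ParseError) would cover the scipy.stats case as well.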

At this point I'm +1 for merge, since if there are any bugs, we can fix them iteratively. It already seems to work quite well.

@datapythonista
Contributor Author

Thanks a lot for the review and tests, and all the detailed info.

I think for pandas it'll require a decent amount of work to replace what we have now with this. I'd prefer to get this merged before that happens. The main thing is to keep the rest of the validations that we don't want to have here.

I assume there will be minor things to change, besides new validations, but since this won't be breaking anyone's code, I think it's fine to get this merged.

What I'd follow up with next is calling this with python -m numpydoc <object_to_validate>. After that, I'll have a go at moving the stuff in Docstring to NumpyDocString, since I think this code organization is good to start with, but not optimal (those classes seem to solve the same problem). Then I think it'll be time to discuss how we can best add the validation to the CI of each project, and whether it makes sense to add something else here.

Does this make sense?

@larsoner
Collaborator

Does this make sense?

Yes, this plan sounds good to me. Let's give @jnothman and @rgommers a couple of days to look (or request more time to look); if nobody complains, let's merge Friday. Feel free to ping me to do it.

@rgommers
Member

I won't have time to look into this in the near future. This sounds like a very nice improvement though, please go ahead with it :)

@datapythonista
Contributor Author

I won't be connected tomorrow, but it would be good to get this merged then (or soon), since we have people working on the script in pandas, and they'll have to work here once this is merged.

Maybe @rth wants to have a look to see if this makes sense for sklearn, and at the plan in #238 (comment).

@larsoner
Collaborator

Okay let's keep iterating on this as need be, I think it's clear enough this is a good start. Thanks @datapythonista !

@larsoner larsoner merged commit 7da2a4b into numpy:master Oct 25, 2019
@rth
Contributor

rth commented Oct 25, 2019

Very nice work!

Maybe @rth wants to have a look to see if this makes sense for sklearn

From what I saw it definitely does; I'll open separate issues if I run into any limitations.

@rth
Contributor

rth commented Oct 30, 2019

FYI, made a PR to use this validation tool in scikit-learn CI scikit-learn/scikit-learn#15404
