
Add docstring validation script (from pandas) #238

Merged
merged 18 commits into from
Oct 25, 2019
Conversation

datapythonista
Contributor

xref #213

Updated the pandas validation script to be generic. Validating all the docstrings of a project will probably require some more work, to account for how the RST lists all public objects.

Removed the following pandas-only features from the original script:

  • Setting the max_open_warning option from matplotlib
  • PEP-8 validation of examples (will add in a follow-up, but it will require a dependency; also, pandas auto-imports modules, which requires updating many docstrings in the tests)
  • Check to see if numpy or pandas were imported in the examples
  • Check to see if docstrings mention private classes (in pandas, NDFrame)
  • GitHub link in the JSON report
  • Replaced pprint_thing with str when rendering the wrong-parameter error message
  • Source file in the report is no longer relative to the pandas path
  • Methods validated by introspecting Series and DataFrame (besides the ones obtained from the API rst)

Tried with a scikit-learn class and it runs the validation correctly:

################################################################################
############## Docstring (sklearn.linear_model.LinearRegression)  ##############
################################################################################

Ordinary least squares Linear Regression.

Parameters
----------
fit_intercept : boolean, optional, default True
    whether to calculate the intercept for this model. If set
    to False, no intercept will be used in calculations
    (e.g. data is expected to be already centered).

normalize : boolean, optional, default False
    This parameter is ignored when ``fit_intercept`` is set to False.
    If True, the regressors X will be normalized before regression by
    subtracting the mean and dividing by the l2-norm.
    If you wish to standardize, please use
    :class:`sklearn.preprocessing.StandardScaler` before calling ``fit`` on
    an estimator with ``normalize=False``.

copy_X : boolean, optional, default True
    If True, X will be copied; else, it may be overwritten.

n_jobs : int, optional, default 1
    The number of jobs to use for the computation.
    If -1 all CPUs are used. This will only provide speedup for
    n_targets > 1 and sufficient large problems.

Attributes
----------
coef_ : array, shape (n_features, ) or (n_targets, n_features)
    Estimated coefficients for the linear regression problem.
    If multiple targets are passed during the fit (y 2D), this
    is a 2D array of shape (n_targets, n_features), while if only
    one target is passed, this is a 1D array of length n_features.

intercept_ : array
    Independent term in the linear model.

Notes
-----
From the implementation point of view, this is just plain Ordinary
Least Squares (scipy.linalg.lstsq) wrapped as a predictor object.

################################################################################
################################## Validation ##################################
################################################################################

6 Errors found:
	Closing quotes should be placed in the line after the last text in the docstring (do not close the quotes in the same line as the text, or leave a blank line between the last text and the quotes)
	Double line break found; please use only one blank line to separate sections or paragraphs, and do not leave blank lines at the end of docstrings
	Parameter "fit_intercept" type should use "bool" instead of "boolean"
	Parameter "fit_intercept" description should start with a capital letter
	Parameter "normalize" type should use "bool" instead of "boolean"
	Parameter "copy_X" type should use "bool" instead of "boolean"
3 Warnings found:
	No extended summary found
	See Also section not found
	No examples section found

@larsoner
Collaborator

Validating all the docstrings of a project will probably require some more work, to account for how the RST lists all public objects.

It seems like the general thing would be to allow projects to enumerate their own public objects (somehow). Parsing RST might only be necessary for some projects, might involve going into multiple files, etc. So I'd rather not include any RST parsing.

Basically I would make the API so that you pass a list of public functions that should be checked (either the functions themselves or the names), and it validates the docstrings of those.
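A minimal sketch of that shape (the names here are illustrative, not a final API): each project enumerates its own public names and hands them to a generic driver built on the single-object validator.

```python
def validate_all(names, validate_one):
    """Validate the docstrings of the given fully qualified names.

    ``validate_one`` is a single-object validator (like the ``validate``
    function this PR adds); each project supplies its own list of names,
    so no RST parsing is needed here.
    """
    return {name: validate_one(name) for name in names}

# A project-side call might then look like:
# results = validate_all(["mypkg.foo", "mypkg.Bar"], validate)
```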

@larsoner
Collaborator

So I'd rather not include any RST parsing.

By this I mean no RST parsing for discovering what should be documented, clearly you'll need to parse the RST of functions/classes/methods to see if the docstring is correct :)

@larsoner
Collaborator

... actually maybe the first PR's only job should be to parse and check a single function, method, or class's (__init__) docstring to see if it's correct.

Then how to enumerate over these can be the domain of each project. It's not many lines to do this and it will vary project-by-project.

@datapythonista
Contributor Author

Ok, I think then I won't implement the validation as a script, but as functions to be imported by scripts in each project.

I'll keep the validate_all function, so we have consistent formatting. But it will receive an iterable returning the objects to validate.

In future iterations we can see if something else makes more sense.

@larsoner
Collaborator

In future iterations we can see if something else makes more sense.

Agreed -- it likely will, but it will be nice at first to have this simple function to start with. The other things (e.g., python -m numpydoc -v script support, or parsing python_reference.rst file(s) or so) can easily build on it, and this keeps review and testing easier in the meantime.

@brandondavid

no RST parsing for discovering what should be documented

This approach also nicely sails around the autodoc -vs- autosummary difference I encountered while adapting the original pandas script for use with scipy.

P.S. I am actively using the current version of validate.py on scipy.stats and it seems to be working perfectly. If you'd like me to test anything in particular, please let me know.

@datapythonista
Contributor Author

I simplified the script a bit more. I finally just left the function that validates a single docstring for now. I need to have a look at what's wrong in the CI, but tests are passing.

Things that probably make sense in follow up PRs:

  • Call the new validation from __main__.py
  • Move everything in validate.Docstring to docscrape.NumpyDocString
  • Add back flake8 validation
  • See if it makes sense to add functionality to validate all docstrings (probably not I'd say)

@datapythonista
Contributor Author

Is there any reason to keep compatibility with Python 2 in numpydoc? I hadn't seen all the Python 3 stuff that was implemented in the script recently, but making it Python 2 compatible would be a significant regression.

@larsoner
Collaborator

In #235 I think we came to the consensus that we can drop support for < 3.5. That will make things simpler because you can use the newer inspect functions, right?

@datapythonista
Contributor Author

Yes, that's the main issue at the moment. There were also problems with the imports...

Should I open a separate PR where I replace the Travis 2.7 build with a 3.5 one, then?

@larsoner
Collaborator

@datapythonista if you have time please feel free; if not, I can try to do it in the next couple of days.

@datapythonista
Contributor Author

This should be ready now. I added all the reasonable tests that were missing, and manually tested with pandas and sklearn docstrings; all look good.

Let me know if you see anything, but I think this should be a good first version of the validation.

@larsoner larsoner left a comment

Working on integrating this into MNE-Python now, which currently has its own validation. Thoughts thus far:

  • We need a mechanism for skipping some checks. We could tell people to monkeypatch numpydoc.validate, but it seems like it would be nicer to allow passing a list of str keys to skip to validate.
  • I'm a bit worried about trying to execute people's examples -- there are some existing utilities for this (doctest principally, but also Sphinx builds of docs) and I was a bit surprised it even tried. We should either have a switch to disable this, or maybe disable it automatically if and only if EX01 and EX02 are in the ignore list.

try:
    from io import StringIO
except ImportError:
    from cStringIO import StringIO

Remove now that 3.5 is required

except ImportError:
    pass
else:
    continue

Couldn't this be:

        for maxsplit in range(1, name.count(".") + 1):
            module, *func_parts = name.rsplit(".", maxsplit)
            try:
                obj = importlib.import_module(module)
            except ImportError:
                pass
            else:
                break
        else:
            raise ImportError('No module ...')
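A self-contained version of this suggestion (a sketch; the function name and error message are placeholders) would try the longest module prefix first and then walk the remaining attribute parts:

```python
import importlib

def import_object(name):
    """Import the longest module prefix of ``name``, then resolve the
    remaining dotted parts as attributes (e.g. "os.path.join" imports
    os.path and then looks up "join")."""
    for maxsplit in range(1, name.count(".") + 1):
        module_name, *func_parts = name.rsplit(".", maxsplit)
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            pass
        else:
            break
    else:
        # No prefix of the name was importable
        raise ImportError(f"No module can be imported from {name!r}")
    for part in func_parts:
        obj = getattr(obj, part)
    return obj
```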

@larsoner
Collaborator

... actually I guess the ignoring part can be done just by culling the list of errors obtained at the end. But it would maybe be nice to allow disabling the running of examples somehow.

@larsoner
Collaborator

So far it seems to do everything our old code did plus a lot more!

Just one false alarm:

SS02 : mne.beamformer._dics.tf_dics : Summary does not start with a capital letter

But the summary starts with a numeral (5D time-frequency beamforming).
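The fix amounts to only flagging summaries whose first character is a lowercase letter, so numerals and symbols pass (a sketch of the idea, not the exact code that was merged):

```python
def summary_starts_capitalized(summary):
    """Return True unless the summary starts with a lowercase letter.

    Digits and symbols are allowed, which avoids the SS02 false
    positive for summaries like "5D time-frequency beamforming".
    """
    first = summary.lstrip()[:1]
    return not first.islower()

print(summary_starts_capitalized("5D time-frequency beamforming"))  # True
print(summary_starts_capitalized("estimate object sizes"))          # False
```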

@larsoner
Collaborator

Okay got another:

class SizeMixin(object):
    """Estimate MNE object sizes."""

This one I get:

GL02 : mne.utils.mixin.SizeMixin : Closing quotes should be placed in the line after the last text in the docstring

But if I add a newline after the . then pydocstyle complains:

D200: One-line docstring should fit on one line with quotes

So it seems like the GL02 check here should only run if there is more than one line in the docstring.
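In code, the suggested condition amounts to something like this (illustrative only, not the merged implementation):

```python
def gl02_applies(docstring):
    """Only check GL02 (closing quotes on their own line) when the
    docstring spans more than one line; one-line docstrings are
    exactly what pydocstyle's D200 asks for."""
    return "\n" in docstring.strip()

print(gl02_applies("Estimate MNE object sizes."))                # False
print(gl02_applies("Summary line.\n\nExtended description.\n"))  # True
```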

@datapythonista
Contributor Author

Thanks for the reviews and the tests @larsoner. I addressed your comments, including getting rid of running the examples, and also avoiding errors for capitalization with numbers.

I'm a bit unsure about:

class SizeMixin(object):
    """Estimate MNE object sizes."""

This will fail in pandas anyway because it's lacking examples... I guess what you say makes sense anyway, because you could be ignoring the errors about lacking examples and others. But it still feels a bit wrong to have this special case. Personally, I'd prefer to fail for that case. Is pydocstyle consistency important? It's not a big deal to change it if you have a strong opinion about not raising that error, but I don't think the change is worth the extra complexity.

@larsoner
Collaborator

Is pydocstyle consistency important? It's not a big deal to change it if you have a strong opinion about not raising that error, but I don't think the change is worth the extra complexity.

Currently all other checks are consistent (at least for our code) but this one. If it's not implemented here, then we and anyone else who has docstrings like this will have to implement their own workaround to look for that error type, check how many lines there are, and make an exception at our end (so that we don't lose the valid checks for not ending on a newline).

So if it's not an issue for Pandas, my vote would be to keep things as consistent as possible. If you don't want it to be correct by default, you could add an argument to validate like allow_single_line=False by default. But even then, it seems weird/ecosystem-inconsistent to go against what pydocstyle recommends by default in this case.

@datapythonista
Contributor Author

Ok, fair enough. It's not a big deal for pandas, I just didn't want to add complexity to the script if it wasn't really needed. In pandas, docstrings will fail anyway for the lack of examples, so we won't be allowing anything we don't allow now.

Btw, I forgot to answer: you're right in assuming that we can filter errors after the call. I initially thought of passing the list of excludes, but that wouldn't save execution time and would make the code much more complex. So, as you say, I think the best approach is to filter out the errors you don't care about after the function is called.
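Such post-call filtering is only a few lines. This sketch assumes each entry in result["errors"] is a (code, message) pair, matching the error codes (SS02, GL02, ...) shown in the reports above:

```python
# Checks a hypothetical project chooses to skip (post-hoc filtering).
IGNORE = {"ES01", "SA01", "EX01"}

def filter_errors(result, ignore=IGNORE):
    """Drop validation errors whose code is in the ignore set."""
    return [(code, msg) for code, msg in result["errors"]
            if code not in ignore]

result = {"errors": [("ES01", "No extended summary found"),
                     ("GL02", "Closing quotes should be on their own line")]}
print(filter_errors(result))
# [('GL02', 'Closing quotes should be on their own line')]
```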

@datapythonista
Contributor Author

Ok, fixed now. Let me know if there is anything else. Thanks!

@larsoner larsoner left a comment

Looking at the changes quickly everything looks good!

Given that this was already used in pandas, I think validating it on other code bases is most important.

  • For MNE-Python, which has ~120k lines of code, it works well for me now -- catches all the stuff we used to catch, plus found a bunch of other stuff.
  • @datapythonista do you want to make a similar PR to Pandas to use this (e.g., using pip install https://api.github.com/repos/datapythonista/numpydoc/zipball/validation in a CI) and make sure it all works well? It seems like a good way to test that it does everything you need correctly.
  • For SciPy, I implemented a simple public function crawler + validator for SciPy and hit this error:
    scipy/_lib/tests/test_import_cycles.py:83: in check_parameters_match
        for err in validate(name_)['errors']
    ../numpydoc/numpydoc/validate.py:569: in validate
        if not doc.yields and "yield" in doc.method_source:
    ../numpydoc/numpydoc/validate.py:366: in method_source
        source = inspect.getsource(self.obj)
    /usr/lib/python3.8/inspect.py:985: in getsource
        lines, lnum = getsourcelines(object)
    ...
    OSError: could not get source code
    
    It's failing for scipy.optimize.anderson, which is wrapped here so it's likely that this can be fixed at the SciPy end.
  • For SciPy, after wrapping the validate call in a try/except OSError, I then hit a ParseError for some stuff in scipy.stats. There is some magical generation stuff there that looks to be to blame, so I also think it's not numpydoc's problem.
  • For SciPy, after wrapping the validate call in a try/except (OSError, ParseError): it ran and gave:
    E           ES01 : scipy.cluster.hierarchy.average : No extended summary found
    E           ES01 : scipy.cluster.hierarchy.complete : No extended summary found
    ...
    E           YD01 : scipy.special._basic.clpmn : No Yields section found
    E           YD01 : scipy.stats.stats.iqr : No Yields section found
    E           7458 errors
    
    So I think it should be usable for SciPy as well!
  • @jnothman do you want to see how well this works for sklearn?
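The crawler described above can be sketched as follows (a hypothetical helper, not SciPy's actual code; validate is passed in so the sketch stays self-contained, and OSError is swallowed as in the scipy.optimize.anderson case):

```python
import importlib
import inspect

def crawl_and_validate(module_name, validate):
    """Validate every public function/class in a module, skipping
    objects whose source cannot be retrieved (the OSError case above)."""
    module = importlib.import_module(module_name)
    errors = []
    for name in getattr(module, "__all__", dir(module)):
        if name.startswith("_"):
            continue
        obj = getattr(module, name, None)
        if not (inspect.isfunction(obj) or inspect.isclass(obj)):
            continue
        try:
            result = validate(f"{module_name}.{name}")
        except OSError:  # e.g. wrapped functions without source
            continue
        errors.extend(result["errors"])
    return errors
```

Wrapping the inner call in a broader try/except (OSError, ParseError) would cover the scipy.stats case as well.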

At this point I'm +1 for merge, since if there are any bugs, we can fix them iteratively. It already seems to work quite well.

@datapythonista
Contributor Author

Thanks a lot for the review and tests, and all the detailed info.

I think for pandas it'll require a decent amount of work to replace what we have now with this. I'd prefer to get this merged before that happens. The main thing is to keep the rest of the validations that we don't want to have here.

I assume there will be minor things to change, besides new validations, but since this won't be breaking anyone's code, I think it's fine to get this merged.

What I'd follow up with next is calling this with python -m numpydoc <object_to_validate>. After that, I'll have a go at moving the stuff in Docstring to NumpyDocString, since I think this code organization is good to start with, but not optimal (those classes seem to solve the same problem). Then I think it'll be time to discuss how we can best add the validation to the CI of each project, and whether it makes sense to add something else here.

Does this make sense?

@larsoner
Collaborator

Does this make sense?

Yes, this plan sounds good to me. Let's give @jnothman and @rgommers a couple of days to look (or request more time to look); if nobody complains, let's merge Friday. Feel free to ping me to do it.

@rgommers
Member

I won't have time to look into this in the near future. This sounds like a very nice improvement though, please go ahead with it :)

@datapythonista
Contributor Author

I won't be connected tomorrow, but it would be good to get this merged then (or soon), since we have people working on the script in pandas, and they'll have to work here once this is merged.

Maybe @rth wants to have a look to see if this makes sense for sklearn, and at the plan in #238 (comment).

@larsoner
Collaborator

Okay let's keep iterating on this as need be, I think it's clear enough this is a good start. Thanks @datapythonista !

@larsoner larsoner merged commit 7da2a4b into numpy:master Oct 25, 2019
@rth
Contributor

rth commented Oct 25, 2019

Very nice work!

Maybe @rth wants to have a look to see if this makes sense for sklearn

From what I saw it definitely does; I'll open separate issues if I run into any limitations.

@rth
Contributor

rth commented Oct 30, 2019

FYI, made a PR to use this validation tool in scikit-learn CI scikit-learn/scikit-learn#15404
