Add docstring validation script (from pandas) #238
Conversation
It seems like the general thing would be to allow projects to enumerate their own public objects (somehow). Parsing RST might only be necessary for some projects, might involve going into multiple files, etc., so I'd rather not include any RST parsing. Basically, I would make the API so that you pass a list of public functions that should be checked (either the functions themselves or their names), and it validates the docstrings of those.
By this I mean no RST parsing for discovering what should be documented; clearly you'll need to parse the RST of functions/classes/methods to see if the docstring is correct :)
... actually, maybe the first PR's only job should be to parse and check a single function, method, or class's docstring. Then how to enumerate over these can be the domain of each project: it's not many lines to do, and it will vary project-by-project.
Ok, I think then I won't implement the validation as a script, but as a function to be imported by scripts in each project. I'll keep the validate_all function, so we have consistent formatting, but it will receive an iterable returning the objects to validate. In future iterations we can see if something else makes more sense.
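To make the proposed split concrete, here is a minimal, hypothetical sketch of what project-side enumeration feeding such an API could look like; the package name `my_project` and the filtering rules are assumptions for illustration, not code from this PR:

```python
import inspect
import my_project  # hypothetical package whose docstrings we want to check

def public_objects():
    """Yield dotted names of the package's public functions and classes."""
    for name in getattr(my_project, "__all__", dir(my_project)):
        if name.startswith("_"):
            continue
        obj = getattr(my_project, name)
        if inspect.isfunction(obj) or inspect.isclass(obj):
            yield "my_project." + name
```

The resulting iterable of names (or the objects themselves) would then be handed to the validation function.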
Agreed -- it likely will, but it will be nice at first to have this simple function to start with. The other things (e.g., ...) can come in later iterations.
This approach also nicely sails around the autodoc vs. autosummary difference I encountered while adapting the original pandas script for use with scipy. P.S. I am actively using the current version of validate.py on scipy.stats and it seems to be working perfectly. If you'd like me to test anything in particular, please let me know.
I simplified the script a bit more: I finally just left the function that validates a single docstring for now. I need to have a look at what's wrong in the CI, but tests are passing. A few things probably make sense as follow-up PRs.
Is there any reason to keep compatibility with Python 2 in numpydoc? I haven't gone through all the Python 3 stuff that was implemented in the script recently, but it would be a significant regression to make it Python 2 compatible.
In #235 I think we came to the consensus that we can drop support for Python 2.
Yes, that's the main issue at the moment. There were also problems with the imports... Should I open a separate PR where I replace the Travis 2.7 build with a 3.5 one, then?
@datapythonista if you have time, please feel free; if not, I can try to do it in the next couple of days.
This should be ready now. I added all the reasonable tests that were missing, and manually tested with pandas and sklearn docstrings; all look good. Let me know if you see anything, but I think this should be a good first version of the validation.
Working on integrating this into MNE-Python now, which currently has its own validation. Thoughts thus far:
- We need a mechanism for skipping some checks. We could tell people to monkeypatch `numpydoc.validate`, but it seems like it would be nicer to allow passing a list of str keys to skip to `validate` (see the post-hoc filtering sketch after this list).
- I'm a bit worried about trying to execute people's examples -- there are some existing utilities for this (`doctest` principally, but also `sphinx` builds of docs) and I was a bit surprised it even tried. We should either have a switch to disable this, or maybe automatically disable it if and only if `EX01` and `EX02` are in the ignore list.
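As a minimal sketch of the post-hoc filtering alternative -- assuming, consistent with the tracebacks later in this thread, that `validate` takes a dotted name and returns a dict whose `"errors"` key holds `(code, message)` tuples; the ignore set below is illustrative:

```python
from numpydoc.validate import validate

IGNORE = {"EX01", "EX02"}  # example-related checks we want to skip

result = validate("numpy.sum")
errors = [(code, msg) for code, msg in result["errors"] if code not in IGNORE]
for code, msg in errors:
    print(code, msg)
```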
numpydoc/validate.py
try:
    from io import StringIO
except ImportError:
    from cStringIO import StringIO
Remove this now that Python 3.5 is required.
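With Python 3 as the minimum, the fallback collapses to a single import (a trivial sketch of what the reviewer is asking for):

```python
from io import StringIO  # cStringIO no longer needed on Python 3
```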
numpydoc/validate.py
except ImportError:
    pass
else:
    continue
Couldn't this be:
for maxsplit in range(1, name.count(".") + 1):
    module, *func_parts = name.rsplit(".", maxsplit)
    try:
        obj = importlib.import_module(module)
    except ImportError:
        pass
    else:
        break
else:
    raise ImportError('No module ...')
... actually, I guess the ignoring part can be done just by culling the list of errors obtained at the end. But it would maybe be nice to allow disabling the running of examples somehow.
So far it seems to do everything our old code did plus a lot more! Just one false alarm: it flags a missing leading capital letter when the summary starts with a numeral, which has no capitalized form.
Okay, got another one. This one I get: GL02 is reported for a single-line docstring. But if I add a newline after the summary, before the closing quotes, the error goes away. So it seems like the GL02 check here should only run if there is more than one line in the docstring.
Thanks for the reviews and the tests @larsoner. I addressed your comments, including getting rid of running the examples, and also avoiding capitalization errors with numbers. I'm a bit unsure about:

    class SizeMixin(object):
        """Estimate MNE object sizes."""

This will fail in pandas anyway because it's lacking examples... I guess what you say makes sense, because you could be ignoring the errors about lacking examples and others. But it still feels a bit wrong to have this special case. Personally, I'd prefer to fail for that case.
Currently all other checks are consistent (at least for our code) except this one. If it's not implemented here, then we and anyone else who has docstrings like this will have to implement our own workaround: look for that error type, check how many lines there are, and make an exception at our end (so that we don't lose the valid checks for not ending on a newline). So if it's not an issue for pandas, my vote would be to keep things as consistent as possible. If you don't want it to be correct by default, you could add an argument to `validate`.
Ok, fair enough. Not a big deal for pandas; I just didn't want to add complexity to the script if it wasn't really needed. In pandas, docstrings will fail anyway for the lack of examples, so we won't allow anything we don't allow now. Btw, I forgot to answer earlier: you're right in assuming that we can filter errors after the call. I initially thought of passing the list of excludes, but that won't save execution time and will make the code much more complex. So, as you say, I think the best approach is to filter out the errors you don't care about after the function is called.
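For illustration, a sketch of the kind of downstream filter this exchange describes -- a hypothetical helper, assuming the returned dict also carries the docstring text under a `"docstring"` key as the pandas-derived script does:

```python
from numpydoc.validate import validate

def errors_allowing_single_line(name):
    """Validate a docstring, but drop GL02 for one-line docstrings."""
    result = validate(name)
    errors = result["errors"]
    if len(result["docstring"].strip().splitlines()) == 1:
        errors = [(code, msg) for code, msg in errors if code != "GL02"]
    return errors
```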
Ok, fixed now. Let me know if there is anything else. Thanks!
Looking at the changes quickly, everything looks good!

Given that this was already used in pandas, I think trying the validation on other code bases is most important.

- For MNE-Python, which has ~120k lines of code, it works well for me now -- it catches all the stuff we used to catch, plus it found a bunch of other stuff.
- @datapythonista do you want to make a similar PR to pandas to use this (e.g., using `pip install https://api.github.com/repos/datapythonista/numpydoc/zipball/validation` in a CI) and make sure it all works well? It seems like a good way to test that it does everything you need correctly.
- For SciPy, I implemented a simple public-function crawler + validator (a sketch follows this comment) and hit this error:

      scipy/_lib/tests/test_import_cycles.py:83: in check_parameters_match
          for err in validate(name_)['errors']
      ../numpydoc/numpydoc/validate.py:569: in validate
          if not doc.yields and "yield" in doc.method_source:
      ../numpydoc/numpydoc/validate.py:366: in method_source
          source = inspect.getsource(self.obj)
      /usr/lib/python3.8/inspect.py:985: in getsource
          lines, lnum = getsourcelines(object)
      ...
      OSError: could not get source code

  It's failing for `scipy.optimize.anderson`, which is wrapped, so it's likely that this can be fixed at the SciPy end.
- For SciPy, after wrapping the `validate` call in a `try/except OSError`, I then hit a `ParseError` for some stuff in `scipy.stats`. There is some magical docstring-generation stuff there that looks to blame, so I also think this is not `numpydoc`'s problem.
- For SciPy, after wrapping the `validate` call in a `try/except (OSError, ParseError):`, it ran and gave:

      E ES01 : scipy.cluster.hierarchy.average : No extended summary found
      E ES01 : scipy.cluster.hierarchy.complete : No extended summary found
      ...
      E YD01 : scipy.special._basic.clpmn : No Yields section found
      E YD01 : scipy.stats.stats.iqr : No Yields section found
      E 7458 errors

  So I think it should be usable for SciPy as well!
- @jnothman do you want to see how well this works for `sklearn`?
At this point I'm +1 for merge, since if there are any bugs, we can fix them iteratively. It already seems to work quite well.
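A minimal sketch of the kind of crawler + validator loop described above -- the module name, skip logic, and output format here are assumptions for illustration, not the reviewer's actual code:

```python
import importlib

from numpydoc.docscrape import ParseError
from numpydoc.validate import validate

def crawl(module_name):
    """Validate the public callables of one module, skipping unparseable ones."""
    module = importlib.import_module(module_name)
    for attr in getattr(module, "__all__", []):
        if not callable(getattr(module, attr, None)):
            continue
        name = module_name + "." + attr
        try:
            errors = validate(name)["errors"]
        except (OSError, ParseError):  # wrapped functions, generated docstrings
            continue
        for code, msg in errors:
            print("E %s : %s : %s" % (code, name, msg))

crawl("scipy.cluster.hierarchy")
```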
Thanks a lot for the review and tests, and all the detailed info. I think for pandas it'll require a decent amount of work to replace what we have now with this, and I'd prefer to get this merged before that happens. The main thing is to keep the rest of the validations that we don't want to have here. I assume there will be minor things to change, besides new validations, but since this won't be breaking anyone's code, I think it's fine to get this merged. What I'd follow up on next is calling this from the pandas side. Does this make sense?
I won't have time to look into this in the near future. This sounds like a very nice improvement though, please go ahead with it :)
I won't be connected tomorrow, but it would be good to get this merged then (or soon), since we have people working on the script in pandas, and they'll have to work here once this is merged. Maybe @rth wants to have a look to see if this makes sense for sklearn, and at the plan in #238 (comment).
Okay, let's keep iterating on this as needed; I think it's clear enough that this is a good start. Thanks @datapythonista!
Very nice work!
From what I saw it definitely does. I'll open separate issues if I run into any limitations.
FYI, I made a PR to use this validation tool in the scikit-learn CI: scikit-learn/scikit-learn#15404
xref #213
Updated the pandas validation script to be generic. Validating all docstrings of a project will probably require some work, to consider how the `rst` lists all public objects.

Removed the following pandas-only features from the original script:
- the `max_open_warning` setting from matplotlib
- `pprint_thing`, replaced by `str` when rendering the wrong-parameter error message
- the list of `Series` and `DataFrame` methods (besides the ones obtained from the API `rst`)

Tried with a scikit-learn class, and it runs the validation correctly:
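The output shown in the original description did not survive extraction; the call would have looked something like the following (the specific scikit-learn class is an assumption):

```python
from numpydoc.validate import validate

result = validate("sklearn.linear_model.LogisticRegression")
for code, message in result["errors"]:
    print(code, message)
```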