Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Enable Spellcheck in dockbuild #21402

Closed
wants to merge 14 commits into from

Conversation

FabioRosado
Copy link
Contributor

This is the initial PR to try and fix the issue #21354 the sphinx spelling extension needs every name to be divided and added to each line in order to get the name marked as spelt correctly. The file names_wordlist.txt is updated on every run of the function get_authors() located insidescripts/announce.py.

Since the issue #21396 is related to my previous PR I attempted to resolve this issue here as well, I followed the suggestion of @jorisvandenbossche and added a try/except to the conf.py if the dependency doesn't exist then nothing will happen.

@datapythonista since you submitted a PR(#21397) to try and fix the issue, perhaps you have an opinion about this bit of code. Also, if you would like to do the changes on your own PR I will just delete this bit and use your solution 😄 👍

I would like some feedback on this first attempt, the generator [names.extend(re.sub('\W+', ' ', x).split()) for x in cur.union(pre)] seems a bit smelly but does the trick, should this be refactored to use a long form and make it more readable? I extended the list to names because split() was creating a list inside the main list and this makes every name to be split and added to the same list (this is probably very memory inefficient though)

I tried to run the script from my machine but I got an issue with it, probably I'm doing something wrong but when I run ./scripts/announce.py $GITHUB v0.15.1..v0.23.0 I get the error message:

usage: announce.py [-h] [--repo REPO] revision_range 
announce.py: error: unrecognized arguments: v0.15.1..v0.23.0

Not sure what I'm doing wrong, to be honest. If I run just the get_authors() function everything seems to be fine (including the v0.15.1..v0.23.0 format) and the names_wordlist is updated successfully.

I will look forward to read your opinions about this PR 👍

@pep8speaks
Copy link

pep8speaks commented Jun 9, 2018

Hello @FabioRosado! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 24, 2018 at 14:32 Hours UTC

@@ -67,6 +67,17 @@ def get_authors(revision_range):
cur.discard('Homu')
pre.discard('Homu')

# Update doc/source/names_wordlist.txt with the names of every author
names = []
[names.extend(re.sub('\W+', ' ', x).split()) for x in cur.union(pre)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a generator expression instead of a list comprehension since you only iterate over names once

names = []
[names.extend(re.sub('\W+', ' ', x).split()) for x in cur.union(pre)]

path = os.path.sep.join(os.path.abspath(__file__).split(os.path.sep)[:-2])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use os.path.dirname to go up a couple directories or alternately os.path.join using ../.. (latter is more succinct though not completely sure of portability)

@gfyoung gfyoung added Testing pandas testing functions or related to the test suite Docs labels Jun 9, 2018
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FabioRosado do you plan to add more to this PR to actually enable it on CI?
In that case, I would quickly submit another PR with just the try except change in conf.py if you want (so at least the doc build works again)

extensions.append('sphinxcontrib.spelling')
spelling_word_list_filename = ['spelling_wordlist.txt', 'names_wordlist.txt']
spelling_ignore_pypi_package_names = True
except ModuleNotFoundError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also catch ImportError?
Because if you have installed sphinxcontrib.spelling, but not pyenchant, you get this (which will be a common case I think, since this is what you get if you simply pip/conda install it)

spelling_word_list_filename = ['spelling_wordlist.txt', 'names_wordlist.txt']
spelling_ignore_pypi_package_names = True
except ModuleNotFoundError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can print a message about no spell-check being done because it is not installed?

@FabioRosado
Copy link
Contributor Author

Heya yeah that is my plan but I have some busy days at work I plan to work on this tomorrow or Wednesday.

I appreciate your help @jorisvandenbossche I need to check datapythonista’s PR and see if he changed anything and then I’ll check if I need to add something or not.
Also, good point on the ImportError I didn’t thought about that.

I was planning to add some sort of message saying why it failed but I wasn’t sure if I should add a print or some sort of logging (I just need to check what is used in pandas)

@datapythonista
Copy link
Member

@FabioRosado, I can work on the simple PR to make the documentation build work again later this afternoon. Then you can check for a better solution when you have time.

@FabioRosado
Copy link
Contributor Author

@datapythonista that sounds awesome I didn't want to step in on your PR that's why I added that bit as a thing that could be done in case you did it differently haha my schedule just changed tomorrow so I will have to work on this on Wednesday :/

@datapythonista
Copy link
Member

No worries. Just updated the PR so the documentation can be built when the dependencies for the spellcheck are not present, as this is kind of urgent. Then, with no rush and whenever you have time, you can work on a good approach to manage the dependencies for the spellcheck.

@codecov
Copy link

codecov bot commented Jun 14, 2018

Codecov Report

Merging #21402 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21402   +/-   ##
=======================================
  Coverage    91.9%    91.9%           
=======================================
  Files         153      153           
  Lines       49547    49547           
=======================================
  Hits        45537    45537           
  Misses       4010     4010
Flag Coverage Δ
#multiple 90.3% <ø> (ø) ⬆️
#single 41.78% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f1aa08c...1290adf. Read the comment docs.

for name in all_names:
wordlist.write('{}\n'.format(name))

update_name_wordlist()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to remove this from my code will fix on the next commit

@FabioRosado
Copy link
Contributor Author

On this update, I have decided to create another function in scripts/announce.py that uses the get_authors in order to get all the names since version 0.3.0. I wanted to use the names already on the wordlist (because it contains some names located in release.rst that aren't in get_authors) so I turned those names into a set, then updated the set with the names that I got from get_authors.

The plan to use sets is to try and avoid duplicated names. I have also updated the regex to avoid some characters but not all non-words (names such as O'Leary as marked as a single word).

Now when the command python make.py spellcheck is run the function update_name_wordlist() will be called before the spelling extension - so all names will be updated before spellcheck is run.

I also updated the bit that datapythonista wrote to just add ModuleNotFoundError and replaced logging.warn by logging.warning - because the .warn method is deprecated.

I'm not sure if the code in update_name_wordlist is the most efficient one so I'm more than happy to change something if you think there are better ways to do this.

Finally, I have been reading about CI and how to add it to the docbuild but I am unsure how to do this, I have also read the scripts located in the ci folder. I'd appreciate some guidance in this matter as I have no idea how to proceed from here (my apologies).

Finally, is it worth it to make a test in order to try and bump coverage up since it went down a bit?

Thank you!

@@ -18,7 +18,7 @@ Fixed Regressions
**Comparing Series with datetime.date**

We've reverted a 0.23.0 change to comparing a :class:`Series` holding datetimes and a ``datetime.date`` object (:issue:`21152`).
In pandas 0.22 and earlier, comparing a Series holding datetimes and ``datetime.date`` objects would coerce the ``datetime.date`` to a datetime before comapring.
In pandas 0.22 and earlier, comparing a Series holding datetimes and ``datetime.date`` objects would coerce the ``datetime.date`` to a datetime before comparing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think you want this whatsnew or the one for 24

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was testing the spellcheck to see if all the names were marked as correct, on the process of the spellcheck it marked this and the other whatsnew with some grammatical errors, so I checked and it seems that they should be fixed that's why I included it here - sorry for the confusion since this change wasn't really related to the PR I could revert the changes and submit a new one with just those changes if you would like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK - it's simple enough that I think that's fine. It just wasn't readily apparent to me that's why those notes were included but that makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand I should have explained in my first comment. Could you help me with the CI integration, would it suffice to write a test and run the spellcheck and assert that no exception is raised?

ci/build_docs.sh Outdated
@@ -69,6 +69,10 @@ if [ "$DOC" ]; then
pandas/core/reshape/reshape.py \
pandas/core/reshape/tile.py

echo "Running spellcheck on documentation"
Copy link
Contributor Author

@FabioRosado FabioRosado Jun 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to follow the example on line 35 where make.py is being called. If I'm not mistaken we need to cd into the doc folder before running the make.py spellcheck since the script cd back to $TRAVIS_BUILD_DIR.

This is the first time I tried to write something like this so I'd like your opinion about it and if enabling the spellcheck within CI should be done in a different way.

Thank you

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you just move this closer to line 35 so you don't need to keep changing directories?

@FabioRosado
Copy link
Contributor Author

I added the code down there because I thought I needed to let the whole script run and then do the spell check at the end. I have moved that bit closer to ./make.py so this should run once the docs are build

@@ -1,1652 +1,1987 @@
Critchley
Villanova
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this list at all? (as you are now getting it from the authors)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so at build time, why don't we just generate the list (to a file)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the list is build when spellcheck is called. I have updated the spellcheck function located in make.py to call the function get_name_wordlist() which is located in scripts/announce.py

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so let's delete the list then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we still need the list, the only reason why we do need it is due to the fact that there are names(around 5-7) in the file release.rst that are not author names (or at least they are different so they don't get added when we run the function get_names_wordlist.

If these names are not added somewhere then the build will always fail. If you are worried about having a file with all the names of every contributor I could just delete most of them and add only those in release.rst since the file will be updated with all the names.

Perhaps this file could be good to have just in case people need to add their own names (like they use a different one or something and it gets flagged as a spelling error like those in release.rst), so with this file people would know that they have to add those names in there and not in the spelling_wordlist.txt, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could just delete most of them

Yes, I would only keep those that are not generated by the script (so the ones that are needed to get the script running)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would only keep those that are not generated by the script (so the ones that are needed to get the script running)

Actually, another option is to have them all in there, and have the script update it on demand (so if the release manager adds the names to the notes, the scripts needs to be run). That would save the extra step of creating it on each doc build.

@FabioRosado
Copy link
Contributor Author

Yes we do need the list, like I said on my previous PR, the Sphinx extension reads the words to be whitelisted from a txt file and currently there is no way to use stdout to whitelist words 😄

@jorisvandenbossche
Copy link
Member

The doc build on Travis is failing with "ModuleNotFoundError: No module named 'git'" (check the last build to see the log output of the doc build)

@FabioRosado
Copy link
Contributor Author

FabioRosado commented Jun 19, 2018

@jorisvandenbossche The build that failed is that #21511 or the build from my previous commit? If it was the latter I was getting a problem from the lint script, but removing the ModuleNotFoundError from the try/except (when trying to import the spelling library) seem to fix the issue.

But now that you mention that, i'm importing the function from announce.py which uses gitpython
and pygithub, I guess this could potentially be a problem, I'll try to import these dependencies along with sphinx spelling and just log if they weren't found.

-- EDIT --
I have tested my code on python 3.6 and I am currently having issues with importing the function update_names_wordlist() from the announce.py file. On 3.5 there is no issue so I will need to try and figure out how to update the list when the spellcheck runs.

Ideally, I would like the list to be updated when users run the command python make.py spellcheck and not just from travis since the build can break locally but not on travis.

@jorisvandenbossche
Copy link
Member

The build that failed is that #21511 or the build from my previous commit?

No, it is from this PR. Each PR (and each commit that you add to a PR) is tested on travis. You can follow the link to that by clicking on the "All checks have passed" -> "Show all checks".

Note that it is not actually failing (travis will show it as green), but you can see in the doc build log that it is actually failing. Eg for the last commit added here: https://travis-ci.org/pandas-dev/pandas/jobs/393561877

The reason this is failing is simply because the correct dependencies are not installed on travis (so the installation script needs to be updated, need to check where you can add this)

I'll try to import these dependencies along with sphinx spelling and just log if they weren't found.

Yes, that's a good idea.
Although that makes yet another optional dependency for a dev environment ...

@jorisvandenbossche
Copy link
Member

BTW, I think a very valuable contribution would be to add a feature to sphinxcontrib-spelling to skip a file ... (that would solve many problems here!)

@jorisvandenbossche
Copy link
Member

BTW, I think a very valuable contribution would be to add a feature to sphinxcontrib-spelling to skip a file ... (that would solve many problems here!)

And it might be as simple as skipping based on the docname in https://github.com/sphinx-contrib/spelling/blob/fa8c4b6140d7be0dcaa2aefbf690b84090bb0884/sphinxcontrib/spelling/builder.py#L144

@FabioRosado
Copy link
Contributor Author

@jorisvandenbossche Yeah you are right that would make things so much easier, I'll try to submit a PR to the spelling extension to add the possibility to avoid certain files.

I guess this PR will have to either be placed on hold or closed since if this gets merged then the build will eventually fail when new names are added - unless we add another dependency to make this work on the build.

I will wait until I hear from you as to what is the best course of action with these issues 👍

@jorisvandenbossche
Copy link
Member

'll try to submit a PR to the spelling extension to add the possibility to avoid certain files.

That would be really cool!

I guess this PR will have to either be placed on hold or closed

I think we can go ahead with this for now, as long as the script is there to update the list of names to not have the build fail. We can later remove that again if the feature to ignore certain files is added.

@FabioRosado
Copy link
Contributor Author

I had to remove ModuleNotFoundError since linting kept marking this as non-existent as this except is only available on Python 3.6, I have updated the name list back to all the names as well.

I did leave the reference to the function that updates the wordlist on the spellcheck function, but if you run the announce.py script the list will also be updated (since its not called from main). Hope this is okay.

Will look forward to fixing some issue here once I figure out how to avoid files/folders in the spelling extension haha

@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

can you rebase / update

@FabioRosado
Copy link
Contributor Author

I thought this PR wasn't accepted as there was some different opinions about it

@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

rereading i think this is ok if we can just auto generate the list of authors from the commit history rather than hard coding everything

@datapythonista
Copy link
Member

@jreback @jorisvandenbossche are you happy merging this if @FabioRosado rebases?

My opinion is that the whole typo validation adds too much complexity to the project. And that being based in enchant, which is a deprecated project, it even makes less sense to have it. We can surely have a separate repo in the pandas org that checks in the pandas repo whether there is any typo with all this system we have, that would be very useful. But having it in pandas, when afaik none of the maintainers was able to set it up, doesn't seem to be worth.

In any case, can you let @FabioRosado if he should rebase so we can merge it, or just close the PR. Thanks!

@jreback
Copy link
Contributor

jreback commented Nov 3, 2018

My opinion is that the whole typo validation adds too much complexity to the project. And that being based in enchant, which is a deprecated project, it even makes less sense to have it. We can surely have a separate repo in the pandas org that checks in the pandas repo whether there is any typo with all this system we have, that would be very useful. But having it in pandas, when afaik none of the maintainers was able to set it up, doesn't seem to be worth.

ok let's close this then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing dependency for sphinxcontrib-spelling
7 participants