Visualize manuscript/project growth #1019
Conversation
@mprobson this takes a very long time to run, but I think it is appropriate for map reduce? (If just turning things into a dictionary counts as reduction.) The input is a list of commits (one per line in a .txt file, generated by …
How slow is this to run? If we need to speed it up substantially, could we write the statistics to disk for commits that have already been analyzed and only update that file with newer commits? We would need to weigh the tradeoffs of adding that complexity to the code versus just waiting for it to run.

Do you envision this being run manually, with every new commit, or on a regular (e.g. daily/weekly) schedule? Ideally it could auto-update in some form.
# Plot the data
axes = growthdata.plot(kind='line', linewidth=2, subplots=True)
for ax in axes:
    ax.set_ylabel('Count')
For the word count, this will need to be `Count (thousands)` or something similar.
Haha Tony, we're on the same page -- I was about to make a comment asking whether it would make sense to write to disk, and then realized I should just do it, since that's probably the best/only way to speed things up. It's running in a not-unreasonable amount of time with 8 CPUs, but when I had it single-threaded, it was slow enough to be pretty annoying, so I definitely don't want it clogging up builds.

I don't think it needs to be run with every commit. Daily or even weekly would probably be fine, since it's not really critical to have a super up-to-date record of how many references are in the document, etc.

One thing I wanted to ask you is: presumably I should set up a separate conda environment because it's on a different branch from the rest of the visualizations?
AppVeyor build 1.0.4387 for commit 37c9270 is now complete. Found 4 potential spelling error(s). Preview: content/22.vaccines.md:21:devleopment content/22.vaccines.md:74:appraoches content/23.vaccines-app.md:15:IgGs content/23.vaccines-app.md:387:IgGs. The rendered manuscript from this build is temporarily available for download at: …
I see this relates to your comments in #944 (which I'm still reviewing). I don't see any harm in creating separate conda environments. If the environments are very similar, perhaps there would be some wasted time (~minutes) configuring multiple environments when one would suffice.

I'm wondering whether we should set up this analysis and #944 on the `external-resources` branch. Do you see any pros or cons of adding these project statistics analyses to the `external-resources` branch?
I changed it to just use the absolute word count instead of dividing the word count by 1000 (I'm sure we can get this working that way if we feel strongly that we want it, but this was easier).
In terms of run time, it takes 8-10 seconds on 12 CPUs (locally) to analyze all 498 commits. With the caching, it takes 1.2 seconds to process 10 new commits locally and 2.2 seconds/10 records if I set it to a single CPU.
Update: crunching everything through 8/27 took about 10 seconds using 12 CPUs.
I actually think it would be kind of nice to have all of the code-based visualization happening in one place (or at least, not filling up the part of the repository that non-coders might be looking at). For that reason, I'd be for moving them to `external-resources` even though they're not external! I implemented the caching because it was bothering me. The plan to add …
AppVeyor build 1.0.4388 for commit ded441f is now complete. Found 4 potential spelling error(s). Preview: content/22.vaccines.md:21:devleopment content/22.vaccines.md:74:appraoches content/23.vaccines-app.md:15:IgGs content/23.vaccines-app.md:387:IgGs …
I reviewed the code, and it looks good to me overall. I did not try to run it locally or test the caching behavior.
This pull request could focus on adding the code, figures, and statistics. Then I could follow up later to add that to an automated workflow and move the image and json outputs to `external-resources`. That could happen after the camera-ready deadline.
> I changed it to just use the absolute word count instead of dividing the word count by 1000 (I'm sure we can get this working that way if we feel strongly that we want it, but this was easier).

That works for me.
exit("No new commits") | ||
|
||
# Access the variables.json and references.json files associated with each commit and store in dictionary | ||
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count()) |
I haven't used pools in a long time. I see some examples using `with` statements (https://stackoverflow.com/questions/45718546/with-clause-for-multiprocessing-in-python) but do not know whether that is necessary. Perhaps it is more robust in case the commit processing has an unexpected error.
Yes, I think that's the only advantage of using `with` (not sure what happens if any of the tasks fails and you didn't close the pool). Usually, I prefer to use `futures`, like in this function, `write_phenotypes_jobs`. But if this is already working fine, no need to change.
Thank you both! I changed it to use `with`. @miltondp I wanted to try to learn to use futures, so I'll try to model off of your example code next time I do this!
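For reference, a minimal sketch of the `with` pattern adopted here, assuming the per-commit work lives in a function like the hypothetical `analyze_commit` below (the commit hashes are just the ones from the build messages in this thread):

```python
import multiprocessing

def analyze_commit(commit):
    # Placeholder for the real per-commit statistics gathering
    return len(commit)

if __name__ == "__main__":
    commits_to_process = ["37c9270", "ded441f", "3fbea6b"]
    # The with block terminates the pool even if a worker raises,
    # which is the robustness benefit discussed above
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        results = pool.map(analyze_commit, commits_to_process)
    print(results)
```

The `concurrent.futures.ProcessPoolExecutor` alternative @miltondp mentions offers the same context-manager convenience plus per-task `Future` objects for finer-grained error handling.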
print(f'Wrote {args.output_figure}.png and {args.output_figure}.svg')

# Write json output file
manuscript_stats = growthData.iloc[0].to_dict()
When saving to json, let's name these variables to be easier to use as template variables within the text. I'm using https://github.com/greenelab/covid19-review/blob/external-resources/csse/csse-stats.json as an example. In this case, what do you think about:
"stats_author_count": "52",
"stats_clean_date": "September 2, 2021",
"stats_iso_date": "2021-09-02",
"stats_references": "1588",
"stats_word_count": "134821"
Or we could use `growth_stats` as the prefix instead of `stats`.

Then we could use those variables instead of the hard-coded values in:

> As of April 30, 2021, there were 50 authors, 1,428 references, and 131,949 words in the documents that make up the project.
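A hedged sketch of writing those prefixed variables to disk (the output filename is hypothetical; the values are copied from the example above):

```python
import json

manuscript_stats = {
    "stats_author_count": "52",
    "stats_clean_date": "September 2, 2021",
    "stats_iso_date": "2021-09-02",
    "stats_references": "1588",
    "stats_word_count": "134821",
}

# sort_keys keeps diffs stable when the file is regenerated
with open("manuscript-stats.json", "w") as out:
    json.dump(manuscript_stats, out, indent=2, sort_keys=True)
```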
Looks good to me, Halie. I left some comments, and I agree with @agitter's suggestions. Regarding the parallelization of the code, I think it is fine if it works for you; I left some comments about that as well.
import matplotlib
import argparse
import time
import os.path
This is a minor comment. If you are using Python ≥ 3.4, I suggest migrating to `pathlib` instead of `os.path` in the future. It's object-oriented and more intuitive in my opinion, plus it is the modern way of handling file paths.
Thank you Milton! I made this update.
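As a rough illustration of the `pathlib` style (the file name here is hypothetical, not the script's actual output):

```python
from pathlib import Path

cache_file = Path("commit-cache.json")

# Object-oriented equivalents of common os.path calls
if cache_file.exists():                 # os.path.exists(...)
    text = cache_file.read_text()       # open/read/close in one step
print(cache_file.suffix)                # os.path.splitext(...)[1] -> '.json'
print(cache_file.with_suffix(".svg"))   # swap the extension
```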
Returns list of 5 statistics"""

variablesCommand = "git show " + commit + ":./variables.json"
variables = json.loads(subprocess.getoutput(variablesCommand))
What happens here if the command (`git show ...`) fails for some reason? Is that something that could potentially happen or not?
Good call! It shouldn't, but if the wrong branch were somehow checked out, I can certainly imagine it causing issues.
references = json.loads(subprocess.getoutput(referencesCommand))
num_ref = len(references)

return([date, clean_date, num_authors, word_count, num_ref])
Since you are returning several values, it might be more convenient to use a named tuple in the future, so you can access each value by name, which I think is more intuitive.
Good call!
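A minimal sketch of the named-tuple version (field names assumed from the five values returned above; the example values are taken from the statistics quoted earlier in this thread):

```python
from collections import namedtuple

# Field names assumed from the list returned by the function above
CommitStats = namedtuple(
    "CommitStats",
    ["date", "clean_date", "num_authors", "word_count", "num_ref"],
)

stats = CommitStats("2021-09-02", "September 2, 2021", 52, 134821, 1588)
print(stats.word_count)  # access by name rather than by position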
# If this analysis has been run before, load in the list of commits analyzed
# and only analyze new commits
# Assumes no commits will be added retrospectively (to take advantage of linearity)
One way to remove this assumption could be to create a `set` of old and new commits (I understand these are commit hashes). Then you can process only the commits that are new. Assuming `commits` and `oldCommits` are both lists of commit hashes, it would be something like: `commits_to_be_processed = set(commits) - set(oldCommits)`.
I think there is almost certainly a way to do this with sets, but it would be most straightforward with an ordered set, and it doesn't seem like there's a particularly good way to get that type of data structure in Python (unless I'm missing something!)
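One hedged middle ground, if commit order matters: keep the list for ordering and use a set only for membership tests (the hashes below are illustrative):

```python
# Illustrative commit hashes, oldest first
commits = ["a1b2c3", "d4e5f6", "0789ab"]
oldCommits = ["a1b2c3"]

# The list comprehension preserves chronological order;
# the set makes each membership check O(1)
old = set(oldCommits)
commits_to_be_processed = [c for c in commits if c not in old]
print(commits_to_be_processed)  # ['d4e5f6', '0789ab']
```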
exit("No new commits") | ||
|
||
# Access the variables.json and references.json files associated with each commit and store in dictionary | ||
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that's the only advantage of using with
(not sure what happens if any of the tasks fails and you didn't close the pool). Usually, I prefer to use futures
, like in this function, write_phenotypes_jobs
. But if this is already working fine, no need to change.
Co-authored-by: Milton Pividori <miltondp@gmail.com>
@agitter I believe I've implemented all the feedback from you and Milton! The only thing is that I need to switch the branch so that the commits are on top of the `external-resources` branch.
Thanks for the updates @rando2. I have a few more small comments. It's okay with me that you'll create a new pull request when it's time to merge this with `external-resources`.
# Access files and data in references.json associated with each commit
referencesCommand = "git show " + commit + ":./references.json"
references = json.loads(subprocess.getoutput(referencesCommand))
Can you please add the try/except here as well?
Thank you for catching this! I also moved the string creation outside of the try/except block, since that shouldn't fail (and definitely not with a JSON error).
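A sketch of the resulting pattern under those assumptions (the exit message wording is illustrative): the command string is built outside the try block because plain concatenation cannot raise a JSON error, and only the parsing step is guarded.

```python
import json
import subprocess

commit = "3fbea6b"  # illustrative hash from the build message below

# String creation outside the try/except: this cannot fail with a JSON error
referencesCommand = "git show " + commit + ":./references.json"
try:
    # getoutput returns the error text on failure, so a bad branch or
    # missing file surfaces here as a JSONDecodeError
    references = json.loads(subprocess.getoutput(referencesCommand))
except json.JSONDecodeError:
    exit(f"Could not load references.json for commit {commit}")
num_ref = len(references)
```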
AppVeyor build 1.0.4459 for commit 3fbea6b is now complete. Found 4 potential spelling error(s). Preview: content/22.vaccines.md:21:devleopment content/22.vaccines.md:74:appraoches content/23.vaccines-app.md:15:IgGs content/23.vaccines-app.md:387:IgGs …
Co-authored-by: Anthony Gitter <agitter@users.noreply.github.com>
See #1034
#1019 for plotting project growth, but on external-resources branch
Description of the proposed additions or changes
This is an updated version of #953 that works off of `master` to pull the commit history and then visualize commits.

Related issues
#953 #952
Suggested reviewers (optional)
Checklist