Visualize manuscript/project growth #1019

Closed. rando2 wants to merge 33 commits into master from ms-stats.

Conversation

@rando2 (Contributor) commented Aug 30, 2021

Description of the proposed additions or changes

This is an updated version of #953 that works off of master to pull the commit history and then visualize the commits.

Related issues

#953 #952

Suggested reviewers (optional)

Checklist

  • Text is formatted so that each sentence is on its own line.
  • Pre-prints cited in this pull request have a GitHub issue opened so that they can be reviewed.

@rando2 rando2 changed the title from "Ms stats" to "Visualize manuscript/project growth" Aug 30, 2021
@rando2 rando2 marked this pull request as draft August 30, 2021 22:14
@rando2 (Contributor, Author) commented Aug 31, 2021

@mprobson this takes a very long time to run, but I think it is appropriate for map reduce? (If just turning things into a dictionary counts as reduction). The input is a list of commits (one per line in a .txt file, generated by git log on the command line) and it extracts various data, then stores them in a dictionary. I don't see why we couldn't merge together a bunch of dictionaries. But since it's been a while since I parallelized anything in Python, figured I should run this by you!
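
(For illustration, the split-apply-merge shape of that idea; analyze_commit below is a hypothetical stand-in for the real extraction step, not code from this PR.)

import multiprocessing

def analyze_commit(commit):
    # Hypothetical map step: extract one commit's statistics into a
    # small dictionary keyed by the commit hash.
    return {commit: {"word_count": 0, "num_ref": 0}}

def merge_dicts(dicts):
    # Reduce step: merging the per-commit dictionaries is just update().
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged

if __name__ == "__main__":
    commits = ["abc123", "def456"]  # one hash per line of the git log .txt file
    with multiprocessing.Pool() as pool:
        per_commit = pool.map(analyze_commit, commits)
    stats = merge_dicts(per_commit)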

@rando2 rando2 added the Methods Strategies for review label Sep 1, 2021
@rando2 rando2 requested a review from miltondp September 2, 2021 19:17
@agitter (Collaborator) left a comment

How slow is this to run? If we need to speed it up substantially, could we write the statistics to disk for commits that have already been analyzed and only update that file with newer commits? We would need to weigh the tradeoffs of adding that complexity to the code versus just waiting for it to run.

Do you envision this being run manually, with every new commit, or on a regular (e.g. daily/weekly) schedule? Ideally it could auto-update in some form.
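
(A rough sketch of that caching idea; the file name and layout here are assumptions, not what the PR ends up implementing.)

import json
import os.path

CACHE_FILE = "commit_stats.json"  # hypothetical cache location

def load_cache():
    # Statistics for commits analyzed on a previous run.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_cache(stats):
    with open(CACHE_FILE, "w") as f:
        json.dump(stats, f, indent=2)

# On each run: load the cache, analyze only the commits whose hashes are
# missing from it, merge the new results in, and write the file back out.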

# Plot the data
axes = growthdata.plot(kind='line', linewidth=2, subplots=True)
for ax in axes:
    ax.set_ylabel('Count')

Collaborator:

For the word count, this will need to be Count (thousands) or something similar.
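
(For example, with a hypothetical word_count column; with subplots=True pandas draws one panel per column, in column order:)

growthdata['word_count'] = growthdata['word_count'] / 1000
axes = growthdata.plot(kind='line', linewidth=2, subplots=True)
for ax, column in zip(axes, growthdata.columns):
    # Only the word-count panel is scaled, so label it differently.
    ax.set_ylabel('Count (thousands)' if column == 'word_count' else 'Count')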

@rando2 (Contributor, Author) commented Sep 2, 2021

> How slow is this to run? If we need to speed it up substantially, could we write the statistics to disk for commits that have already been analyzed and only update that file with newer commits? We would need to weigh the tradeoffs of adding that complexity to the code versus just waiting for it to run.
>
> Do you envision this being run manually, with every new commit, or on a regular (e.g. daily/weekly) schedule? Ideally it could auto-update in some form.

Haha Tony, we're on the same page -- I was about to make a comment asking whether it would make sense to write to disk, and then realized I should just do it, since that's probably the best/only way to speed things up. It's running in a not-unreasonable amount of time with 8 CPUs, but when I had it single threaded, it was slow enough to be pretty annoying, so I definitely don't want it clogging up builds.

I don't think it needs to be run with every commit. Definitely daily and even weekly would probably be fine, since it's not really critical to have a super up-to-date record of how many references are in the document etc.

One thing I wanted to ask you is: presumably I should set up a separate conda environment because it's on a different branch from the rest of the visualizations?

@AppVeyorBot commented

AppVeyor build 1.0.4387 for commit 37c9270 is now complete.

Found 4 potential spelling error(s). Preview:
content/22.vaccines.md:21:devleopment
content/22.vaccines.md:74:appraoches
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs

The rendered manuscript from this build is temporarily available for download at:

@agitter (Collaborator) commented Sep 2, 2021

> One thing I wanted to ask you is: presumably I should set up a separate conda environment because it's on a different branch from the rest of the visualizations?

I see this relates to your comments in #944 (which I'm still reviewing). I don't see any harm in creating separate conda environments. If the environments are very similar, perhaps there would be some wasted time (~minutes) configuring multiple environments when one would suffice.

I'm wondering whether we should set up this analysis and #944 on the external-resources branch. They aren't actually external. However, that would keep our automated analyses more organized. We could have a new daily/weekly workflow modeled after update-external-resources.yaml that updates these internal project statistics (update-project-statistics.yaml?). We could also use that branch to store the latest statistics written to disk from this analysis and read from that file if it exists. However, if we are only running this ~weekly in a GitHub Actions workflow instead of with each build, it doesn't really matter if it runs for 10s of minutes.

Do you see any pros or cons of adding these project statistics analyses to the external-resources branch?

@rando2 (Contributor, Author) left a comment

I changed it to just use the absolute word count instead of dividing the word count by 1000 (I'm sure we can get this working that way if we feel strongly that we want it, but this was easier).

In terms of run time, it takes 8-10 seconds on 12 CPUs (locally) to analyze all 498 commits. With the caching, it takes 1.2 seconds to process 10 new commits locally and 2.2 seconds/10 records if I set it to a single CPU.

@rando2
Copy link
Contributor Author

rando2 commented Sep 2, 2021

Update: to crunch everything through 8/27 took about 10 seconds using 12 CPUs

> > One thing I wanted to ask you is: presumably I should set up a separate conda environment because it's on a different branch from the rest of the visualizations?
>
> I see this relates to your comments in #944 (which I'm still reviewing). I don't see any harm in creating separate conda environments. If the environments are very similar, perhaps there would be some wasted time (~minutes) configuring multiple environments when one would suffice.
>
> I'm wondering whether we should set up this analysis and #944 on the external-resources branch. They aren't actually external. However, that would keep our automated analyses more organized. We could have a new daily/weekly workflow modeled after update-external-resources.yaml that updates these internal project statistics (update-project-statistics.yaml?). We could also use that branch to store the latest statistics written to disk from this analysis and read from that file if it exists. However, if we are only running this ~weekly in a GitHub Actions workflow instead of with each build, it doesn't really matter if it runs for 10s of minutes.
>
> Do you see any pros or cons of adding these project statistics analyses to the external-resources branch?

I actually think it would be kind of nice to have all of the code-based visualization happening in one place (or at least, not filling up the part of the repository that non-coders might be looking at). For that reason, I'd be for moving them to external-resources even though they're not external!

I implemented the caching because it was bothering me. The plan to add .yaml files sounds good to me! I may de-prioritize that for the next couple days and try to get some more revisions done for the camera ready, but that seems like a great solution for keeping things up-to-date to me!

@AppVeyorBot commented

AppVeyor build 1.0.4388 for commit ded441f is now complete.

Found 4 potential spelling error(s). Preview:
content/22.vaccines.md:21:devleopment
content/22.vaccines.md:74:appraoches
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs...
The rendered manuscript from this build is temporarily available for download at:

@agitter (Collaborator) left a comment

I reviewed the code, and it looks good to me overall. I did not try to run it locally or test the caching behavior.

This pull request could focus on adding the code, figures, and statistics. Then I could follow up later to add that to an automated workflow and move the image and json outputs to external-resources. That could happen after the camera ready deadline.

> I changed it to just use the absolute word count instead of dividing the word count by 1000 (I'm sure we can get this working that way if we feel strongly that we want it, but this was easier).

That works for me.

exit("No new commits")

# Access the variables.json and references.json files associated with each commit and store in dictionary
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())

Collaborator:

I haven't used pools in a long time. I see some examples using with statements (https://stackoverflow.com/questions/45718546/with-clause-for-multiprocessing-in-python) but do not know whether that is necessary. Perhaps it is more robust in case the commit processing has an unexpected error.
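
(For reference, the context-manager form is a small change; analyze_commit and commits stand in for whatever the script actually maps over.)

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
    results = pool.map(analyze_commit, commits)
# Leaving the with block calls pool.terminate(), so workers are not left
# running if processing a commit raises an exception.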

Member:

Yes, I think that's the only advantage of using with (not sure what happens if any of the tasks fails and you didn't close the pool). Usually, I prefer to use futures, like in this function, write_phenotypes_jobs. But if this is already working fine, no need to change.
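
(The concurrent.futures equivalent, for comparison, using the same hypothetical analyze_commit helper; executor.map preserves input order, and a worker's exception is re-raised when its result is consumed.)

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    results = list(executor.map(analyze_commit, commits))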

@rando2 (author):

Thank you both! I changed it to use with. @miltondp I wanted to try to learn to use futures, so I'll try to model off of your example code next time I do this!

print(f'Wrote {args.output_figure}.png and {args.output_figure}.svg')

# Write json output file
manuscript_stats = growthData.iloc[0].to_dict()

Collaborator:

When saving to json, let's name these variables so that they are easier to use as template variables within the text. I'm using https://github.com/greenelab/covid19-review/blob/external-resources/csse/csse-stats.json as an example. In this case, what do you think about:

  "stats_author_count": "52",
  "stats_clean_date": "September 2, 2021",
  "stats_iso_date": "2021-09-02",
  "stats_references": "1588",
  "stats_word_count": "134821"

Or we could use growth_stats as the prefix instead of stats.

Then we could use those variables instead of the hard-coded values in:

> As of April 30, 2021, there were 50 authors, 1,428 references, and 131,949 words in the documents that make up the project.
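
(A sketch of that renaming, assuming manuscript_stats is the dict built above from growthData.iloc[0]; args.output_json is a made-up argument name.)

# Prefix the keys so they can be referenced as template variables in the
# manuscript text, matching the csse-stats.json naming convention.
manuscript_stats = {f"stats_{key}": str(value)
                    for key, value in manuscript_stats.items()}
with open(args.output_json, "w") as out:
    json.dump(manuscript_stats, out, indent=2, sort_keys=True)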

@miltondp (Member) left a comment

Looks good to me, Halie. I left some comments, and I agree with @agitter's suggestions. Regarding the parallelization of the code, I think the code is fine if it works for you. I left some comments regarding that as well.

import matplotlib
import argparse
import time
import os.path

Member:

This is a minor comment. If you are using Python > 3.4, I suggest in the future migrating to pathlib instead of os.path. It's object-oriented and more intuitive in my opinion, plus it is the new way of file path handling.
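
(The same two operations in both styles, with made-up paths, just to show the difference.)

import os.path
from pathlib import Path

# os.path: functions that operate on plain strings
cache = os.path.join("output", "stats.json")
have_cache = os.path.exists(cache)

# pathlib: one Path object with methods and the / join operator
cache = Path("output") / "stats.json"
have_cache = cache.exists()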

@rando2 (author):

Thank you Milton! I made this update.

Returns list of 5 statistics"""

variablesCommand = "git show " + commit + ":./variables.json"
variables = json.loads(subprocess.getoutput(variablesCommand))

Member:

What happens here if the command (git show...) fails for some reason? Is that something that could potentially happen or not?

@rando2 (author):

Good call! It shouldn't, but certainly if the wrong branch was somehow checked out, I can imagine it causing issues.

references = json.loads(subprocess.getoutput(referencesCommand))
num_ref = len(references)

return([date, clean_date, num_authors, word_count, num_ref])

Member:

Since you are returning several values, it might be more convenient to use a named tuple in the future, so you can access each value by name, which I think is more intuitive.
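
(A sketch using the five field names from the return statement above.)

from collections import namedtuple

CommitStats = namedtuple(
    "CommitStats",
    ["date", "clean_date", "num_authors", "word_count", "num_ref"],
)

# In the function: return CommitStats(date, clean_date, num_authors,
# word_count, num_ref), so callers can write stats.word_count instead
# of indexing stats[3].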

@rando2 (author):

Good call!


# If this analysis has been run before, load in the list of commits analyzed
# and only analyze new commits
# Assumes no commits will be added retrospectively (to take advantage of linearity)

Member:

One way to remove this assumption could be to create a set of old and new commits (I understand these are commits' hashes). Then you can process only the commits that are new. Assuming commits and oldCommits are both lists of commit hashes, it would be something like: commits_to_be_processed = set(commits) - set(oldCommits).

@rando2 (author):

I think there is almost certainly a way to do this with sets, but it would be most straightforward with an ordered set, and it doesn't seem like there's a particularly good way to get that type of data structure in Python (unless I'm missing something!)
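
(A plain list comprehension gets the ordered behavior without an ordered-set type; commits and oldCommits are the same lists of hashes as above.)

# Keep the git-log ordering of `commits` while skipping anything already
# analyzed; the set makes each membership test O(1).
already_analyzed = set(oldCommits)
commits_to_be_processed = [c for c in commits if c not in already_analyzed]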

@rando2 rando2 marked this pull request as ready for review September 10, 2021 22:22
@rando2 (Contributor, Author) commented Sep 10, 2021

@agitter I believe I've implemented all the feedback from you and Milton! The only thing is that I need to switch the branch so that the commits are on top of the external-resources branch -- but doing that will close this PR and open a new one, so I wanted to make a note here!

@agitter (Collaborator) left a comment

Thanks for the updates @rando2. I have a few more small comments. It's okay with me that you'll create a new pull request when it's time to merge this with external-resources.


# Access files and data in references.json associated with each commit
referencesCommand = "git show " + commit + ":./references.json"
references = json.loads(subprocess.getoutput(referencesCommand))

Collaborator:

Can you please add the try/except here as well?

@rando2 (author):

Thank you for catching this! I also moved the string creation outside of the try/except block, since that shouldn't fail (and definitely not with a JSON error).
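
(Roughly, as a sketch rather than the exact diff:)

referencesCommand = "git show " + commit + ":./references.json"  # built outside the try
try:
    references = json.loads(subprocess.getoutput(referencesCommand))
except json.JSONDecodeError:
    # git show printed an error (e.g., the wrong branch is checked out),
    # so there is no valid JSON to parse.
    exit(f"Could not retrieve references.json for commit {commit}")
num_ref = len(references)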

@AppVeyorBot commented

AppVeyor build 1.0.4459 for commit 3fbea6b is now complete.

Found 4 potential spelling error(s). Preview:
content/22.vaccines.md:21:devleopment
content/22.vaccines.md:74:appraoches
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs...
The rendered manuscript from this build is temporarily available for download at:

@rando2 (Contributor, Author) commented Sep 13, 2021

See #1034

@rando2 rando2 closed this Sep 13, 2021
agitter added a commit that referenced this pull request Sep 13, 2021
#1019 for plotting project growth, but on external-resources branch
@agitter agitter deleted the ms-stats branch September 13, 2021 21:02