Visualize manuscript/project growth #1019

Closed. rando2 wants to merge 33 commits into master from ms-stats.

Conversation

@rando2 (Contributor) commented Aug 30, 2021

Description of the proposed additions or changes

This is an updated version of #953 that works off of master to pull the commit history and then visualize the commits.

Related issues

#953 #952

Suggested reviewers (optional)

Checklist

  • Text is formatted so that each sentence is on its own line.
  • Pre-prints cited in this pull request have a GitHub issue opened so that they can be reviewed.

@rando2 rando2 changed the title from "Ms stats" to "Visualize manuscript/project growth" Aug 30, 2021
@rando2 rando2 marked this pull request as draft August 30, 2021 22:14
@rando2 (Contributor, Author) commented Aug 31, 2021

@mprobson this takes a very long time to run, but I think it is appropriate for map reduce? (If just turning things into a dictionary counts as reduction). The input is a list of commits (one per line in a .txt file, generated by git log on the command line) and it extracts various data, then stores them in a dictionary. I don't see why we couldn't merge together a bunch of dictionaries. But since it's been a while since I parallelized anything in Python, figured I should run this by you!
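
(For illustration, the split-apply-merge shape of that idea; analyze_commit below is a hypothetical stand-in for the real extraction step, not code from this PR.)

import multiprocessing

def analyze_commit(commit):
    # Hypothetical map step: extract one commit's statistics into a
    # small dictionary keyed by the commit hash.
    return {commit: {"word_count": 0, "num_ref": 0}}

def merge_dicts(dicts):
    # Reduce step: merging the per-commit dictionaries is just update().
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged

if __name__ == "__main__":
    commits = ["abc123", "def456"]  # one hash per line of the git log .txt file
    with multiprocessing.Pool() as pool:
        per_commit = pool.map(analyze_commit, commits)
    stats = merge_dicts(per_commit)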

@rando2 rando2 added the Methods Strategies for review label Sep 1, 2021
@rando2 rando2 requested a review from miltondp September 2, 2021 19:17
@agitter (Collaborator) left a comment

How slow is this to run? If we need to speed it up substantially, could we write the statistics to disk for commits that have already been analyzed and only update that file with newer commits? We would need to weigh the tradeoffs of adding that complexity to the code versus just waiting for it to run.

Do you envision this being run manually, with every new commit, or on a regular (e.g. daily/weekly) schedule? Ideally it could auto-update in some form.
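
(A rough sketch of that caching idea; the file name and layout here are assumptions, not what the PR ends up implementing.)

import json
import os.path

CACHE_FILE = "commit_stats.json"  # hypothetical cache location

def load_cache():
    # Statistics for commits analyzed on a previous run.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def save_cache(stats):
    with open(CACHE_FILE, "w") as f:
        json.dump(stats, f, indent=2)

# On each run: load the cache, analyze only the commits whose hashes are
# missing from it, merge the new results in, and write the file back out.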

# Plot the data
axes = growthdata.plot(kind='line', linewidth=2, subplots=True)
for ax in axes:
    ax.set_ylabel('Count')

Collaborator:

For the word count, this will need to be Count (thousands) or something similar.
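
(For example, with a hypothetical word_count column; with subplots=True pandas draws one panel per column, in column order:)

growthdata['word_count'] = growthdata['word_count'] / 1000
axes = growthdata.plot(kind='line', linewidth=2, subplots=True)
for ax, column in zip(axes, growthdata.columns):
    # Only the word-count panel is scaled, so label it differently.
    ax.set_ylabel('Count (thousands)' if column == 'word_count' else 'Count')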

@rando2 (Contributor, Author) commented Sep 2, 2021

> How slow is this to run? If we need to speed it up substantially, could we write the statistics to disk for commits that have already been analyzed and only update that file with newer commits? We would need to weigh the tradeoffs of adding that complexity to the code versus just waiting for it to run.
>
> Do you envision this being run manually, with every new commit, or on a regular (e.g. daily/weekly) schedule? Ideally it could auto-update in some form.

Haha Tony, we're on the same page -- I was about to make a comment asking whether it would make sense to write to disk, and then realized I should just do it, since that's probably the best/only way to speed things up. It's running in a not-unreasonable amount of time with 8 CPUs, but when I had it single threaded, it was slow enough to be pretty annoying, so I definitely don't want it clogging up builds.

I don't think it needs to be run with every commit. Definitely daily and even weekly would probably be fine, since it's not really critical to have a super up-to-date record of how many references are in the document etc.

One thing I wanted to ask you is: presumably I should set up a separate conda environment because it's on a different branch from the rest of the visualizations?

@AppVeyorBot commented

AppVeyor build 1.0.4387 for commit 37c9270 is now complete.

Found 4 potential spelling error(s). Preview:
content/22.vaccines.md:21:devleopment
content/22.vaccines.md:74:appraoches
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs

The rendered manuscript from this build is temporarily available for download at:

@agitter (Collaborator) commented Sep 2, 2021

> One thing I wanted to ask you is: presumably I should set up a separate conda environment because it's on a different branch from the rest of the visualizations?

I see this relates to your comments in #944 (which I'm still reviewing). I don't see any harm in creating separate conda environments. If the environments are very similar, perhaps there would be some wasted time (~minutes) configuring multiple environments when one would suffice.

I'm wondering whether we should set up this analysis and #944 on the external-resources branch. They aren't actually external. However, that would keep our automated analyses more organized. We could have a new daily/weekly workflow modeled after update-external-resources.yaml that updates these internal project statistics (update-project-statistics.yaml?). We could also use that branch to store the latest statistics written to disk from this analysis and read from that file if it exists. However, if we are only running this ~weekly in a GitHub Actions workflow instead of with each build, it doesn't really matter if it runs for 10s of minutes.

Do you see any pros or cons of adding these project statistics analyses to the external-resources branch?

@rando2 (Contributor, Author) left a comment

I changed it to just use the absolute word count instead of dividing the word count by 1000 (I'm sure we can get this working that way if we feel strongly that we want it, but this was easier).

In terms of run time, it takes 8-10 seconds on 12 CPUs (locally) to analyze all 498 commits. With the caching, it takes 1.2 seconds to process 10 new commits locally and 2.2 seconds/10 records if I set it to a single CPU.

@rando2
Copy link
Contributor Author

rando2 commented Sep 2, 2021

Update: to crunch everything through 8/27 took about 10 seconds using 12 CPUs

> > One thing I wanted to ask you is: presumably I should set up a separate conda environment because it's on a different branch from the rest of the visualizations?
>
> I see this relates to your comments in #944 (which I'm still reviewing). I don't see any harm in creating separate conda environments. If the environments are very similar, perhaps there would be some wasted time (~minutes) configuring multiple environments when one would suffice.
>
> I'm wondering whether we should set up this analysis and #944 on the external-resources branch. They aren't actually external. However, that would keep our automated analyses more organized. We could have a new daily/weekly workflow modeled after update-external-resources.yaml that updates these internal project statistics (update-project-statistics.yaml?). We could also use that branch to store the latest statistics written to disk from this analysis and read from that file if it exists. However, if we are only running this ~weekly in a GitHub Actions workflow instead of with each build, it doesn't really matter if it runs for 10s of minutes.
>
> Do you see any pros or cons of adding these project statistics analyses to the external-resources branch?

I actually think it would be kind of nice to have all of the code-based visualization happening in one place (or at least, not filling up the part of the repository that non-coders might be looking at). For that reason, I'd be for moving them to external-resources even though they're not external!

I implemented the caching because it was bothering me. The plan to add .yaml files sounds good to me! I may de-prioritize that for the next couple days and try to get some more revisions done for the camera ready, but that seems like a great solution for keeping things up-to-date to me!

@AppVeyorBot commented

AppVeyor build 1.0.4388 for commit ded441f is now complete.

Found 4 potential spelling error(s). Preview:
content/22.vaccines.md:21:devleopment
content/22.vaccines.md:74:appraoches
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs...
The rendered manuscript from this build is temporarily available for download at:

@agitter (Collaborator) left a comment

I reviewed the code, and it looks good to me overall. I did not try to run it locally or test the caching behavior.

This pull request could focus on adding the code, figures, and statistics. Then I could follow up later to add that to an automated workflow and move the image and json outputs to external-resources. That could happen after the camera ready deadline.

> I changed it to just use the absolute word count instead of dividing the word count by 1000 (I'm sure we can get this working that way if we feel strongly that we want it, but this was easier).

That works for me.

exit("No new commits")

# Access the variables.json and references.json files associated with each commit and store in dictionary
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())

Collaborator:

I haven't used pools in a long time. I see some examples using with statements (https://stackoverflow.com/questions/45718546/with-clause-for-multiprocessing-in-python) but do not know whether that is necessary. Perhaps it is more robust in case the commit processing has an unexpected error.
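
(For reference, the context-manager form is a small change; analyze_commit and commits stand in for whatever the script actually maps over.)

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
    results = pool.map(analyze_commit, commits)
# Leaving the with block calls pool.terminate(), so workers are not left
# running if processing a commit raises an exception.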

Member:

Yes, I think that's the only advantage of using with (not sure what happens if any of the tasks fails and you didn't close the pool). Usually, I prefer to use futures, like in this function, write_phenotypes_jobs. But if this is already working fine, no need to change.
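
(The concurrent.futures equivalent, for comparison, using the same hypothetical analyze_commit helper; executor.map preserves input order, and a worker's exception is re-raised when its result is consumed.)

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    results = list(executor.map(analyze_commit, commits))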

@rando2 (author):

Thank you both! I changed it to use with. @miltondp I wanted to try to learn to use futures, so I'll try to model off of your example code next time I do this!

print(f'Wrote {args.output_figure}.png and {args.output_figure}.svg')

# Write json output file
manuscript_stats = growthData.iloc[0].to_dict()

Collaborator:

When saving to json, let's name these variables so that they are easier to use as template variables within the text. I'm using https://github.com/greenelab/covid19-review/blob/external-resources/csse/csse-stats.json as an example. In this case, what do you think about:

  "stats_author_count": "52",
  "stats_clean_date": "September 2, 2021",
  "stats_iso_date": "2021-09-02",
  "stats_references": "1588",
  "stats_word_count": "134821"

Or we could use growth_stats as the prefix instead of stats.

Then we could use those variables instead of the hard-coded values in:

> As of April 30, 2021, there were 50 authors, 1,428 references, and 131,949 words in the documents that make up the project.
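
(A sketch of that renaming, assuming manuscript_stats is the dict built above from growthData.iloc[0]; args.output_json is a made-up argument name.)

# Prefix the keys so they can be referenced as template variables in the
# manuscript text, matching the csse-stats.json naming convention.
manuscript_stats = {f"stats_{key}": str(value)
                    for key, value in manuscript_stats.items()}
with open(args.output_json, "w") as out:
    json.dump(manuscript_stats, out, indent=2, sort_keys=True)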

@miltondp (Member) left a comment

Looks good to me, Halie. I left some comments, and I agree with @agitter's suggestions. Regarding the parallelization of the code, I think the code is fine if it works for you. I left some comments regarding that as well.

import matplotlib
import argparse
import time
import os.path

Member:

This is a minor comment. If you are using Python > 3.4, I suggest in the future migrating to pathlib instead of os.path. It's object-oriented and more intuitive in my opinion, plus it is the new way of file path handling.
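
(The same two operations in both styles, with made-up paths, just to show the difference.)

import os.path
from pathlib import Path

# os.path: functions that operate on plain strings
cache = os.path.join("output", "stats.json")
have_cache = os.path.exists(cache)

# pathlib: one Path object with methods and the / join operator
cache = Path("output") / "stats.json"
have_cache = cache.exists()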

@rando2 (author):

Thank you Milton! I made this update.

Returns list of 5 statistics"""

variablesCommand = "git show " + commit + ":./variables.json"
variables = json.loads(subprocess.getoutput(variablesCommand))

Member:

What happens here if the command (git show...) fails for some reason? Is that something that could potentially happen or not?

@rando2 (author):

Good call! It shouldn't, but certainly if the wrong branch was somehow checked out, I can imagine it causing issues.

references = json.loads(subprocess.getoutput(referencesCommand))
num_ref = len(references)

return([date, clean_date, num_authors, word_count, num_ref])

Member:

Since you are returning several values, it might be more convenient to use a named tuple in the future, so you can access each value by name, which I think is more intuitive.
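
(A sketch using the five field names from the return statement above.)

from collections import namedtuple

CommitStats = namedtuple(
    "CommitStats",
    ["date", "clean_date", "num_authors", "word_count", "num_ref"],
)

# In the function: return CommitStats(date, clean_date, num_authors,
# word_count, num_ref), so callers can write stats.word_count instead
# of indexing stats[3].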

@rando2 (author):

Good call!


# If this analysis has been run before, load in the list of commits analyzed
# and only analyze new commits
# Assumes no commits will be added retrospectively (to take advantage of linearity)

Member:

One way to remove this assumption could be to create a set of old and new commits (I understand these are commits' hashes). Then you can process only the commits that are new. Assuming commits and oldCommits are both lists of commit hashes, it would be something like: commits_to_be_processed = set(commits) - set(oldCommits).

@rando2 (author):

I think there is almost certainly a way to do this with sets, but it would be most straightforward with an ordered set, and it doesn't seem like there's a particularly good way to get that type of data structure in Python (unless I'm missing something!)
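
(A plain list comprehension gets the ordered behavior without an ordered-set type; commits and oldCommits are the same lists of hashes as above.)

# Keep the git-log ordering of `commits` while skipping anything already
# analyzed; the set makes each membership test O(1).
already_analyzed = set(oldCommits)
commits_to_be_processed = [c for c in commits if c not in already_analyzed]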

@rando2 rando2 marked this pull request as ready for review September 10, 2021 22:22
@rando2 (Contributor, Author) commented Sep 10, 2021

@agitter I believe I've implemented all the feedback from you and Milton! The only thing is that I need to switch the branch so that the commits are on top of the external-resources branch -- but doing that will close this PR and open a new one, so I wanted to make a note here!

@agitter (Collaborator) left a comment

Thanks for the updates @rando2. I have a few more small comments. It's okay with me that you'll create a new pull request when it's time to merge this with external-resources.


# Access files and data in references.json associated with each commit
referencesCommand = "git show " + commit + ":./references.json"
references = json.loads(subprocess.getoutput(referencesCommand))

Collaborator:

Can you please add the try/except here as well?

@rando2 (author):

Thank you for catching this! I also moved the string creation outside of the try/except block, since that shouldn't fail (and definitely not with a JSON error).
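
(Roughly, as a sketch rather than the exact diff:)

referencesCommand = "git show " + commit + ":./references.json"  # built outside the try
try:
    references = json.loads(subprocess.getoutput(referencesCommand))
except json.JSONDecodeError:
    # git show printed an error (e.g., the wrong branch is checked out),
    # so there is no valid JSON to parse.
    exit(f"Could not retrieve references.json for commit {commit}")
num_ref = len(references)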

@AppVeyorBot commented

AppVeyor build 1.0.4459 for commit 3fbea6b is now complete.

Found 4 potential spelling error(s). Preview:
content/22.vaccines.md:21:devleopment
content/22.vaccines.md:74:appraoches
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs...
The rendered manuscript from this build is temporarily available for download at:

@rando2 (Contributor, Author) commented Sep 13, 2021

See #1034

@rando2 rando2 closed this Sep 13, 2021
agitter added a commit that referenced this pull request Sep 13, 2021
#1019 for plotting project growth, but on external-resources branch
@agitter agitter deleted the ms-stats branch September 13, 2021 21:02