Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summary seppop number of alleles per locus incorrect #234

Closed
apaijmans opened this issue Jun 14, 2018 · 2 comments
Closed

Summary seppop number of alleles per locus incorrect #234

apaijmans opened this issue Jun 14, 2018 · 2 comments

Comments

@apaijmans
Copy link

Hi,

I have 8 different populations, each with different number of alleles per locus. I used the total data file, including all populations, to make a GENIND object. I then used seppop() to be able to run some analyses on each population separately.
However, when I use summary() to look at the basic statistics of a single population, the number of alleles per locus are incorrect. It displays the same numbers as for the total data file, instead of the numbers of alleles per locus for only the specific population.

I have included an example below, which hopefully makes it more clear. The first summary is made using the GENIND object for the complete dataset 'all_data', which included 8 populations. I then used seppop(), and made a summary of only the 8th population. The number of alleles per locus are exactly the same as for the first summary, which is not correct. When I made a GENIND object with only the individuals of the 8th population, and use summary() on that, the correct number of alleles per locus are given. All other information, such as Hobs, Hexp and missing data seem to be correct, so the problem seems only to be with how the number of alleles per locus are calculated.

I think this is a bug?

Thanks,
Anneke

> summary(all_data)
// Number of individuals: 1653
// Group sizes: 197 694 396 166 19 51 22 108
// Number of alleles per locus: 12 15 6 16 20 13 14 11 2 5 12 8 17 19 14 4 6 17 8 6 15 10 11 12 5 10 20 8 8 18 7 6 4 25 18 16 16 6 5
// Number of alleles per group: 378 385 372 369 263 272 247 312
// Percentage of missing data: 0.96 %
// Observed heterozygosity: 0.8 [removed the rest to save space]
// Expected heterozygosity: 0.83 [removed the rest to save space]

popdata <- seppop(all_data)

> summary(popdata[[8]])
// Number of individuals: 108
// Group sizes: 108
// Number of alleles per locus: 12 15 6 16 20 13 14 11 2 5 12 8 17 19 14 4 6 17 8 6 15 10 11 12 5 10 20 8 8 18 7 6 4 25 18 16 16 6 5
// Number of alleles per group: 312
// Percentage of missing data: 0.47 %
// Observed heterozygosity: 0.82 [removed the rest to save space]
// Expected heterozygosity: 0.82 [removed the rest to save space]
> summary(testpop8)
// Number of individuals: 108
// Group sizes: 108
// Number of alleles per locus: 7 13 2 8 14 8 9 9 2 4 11 6 10 15 12 4 6 10 7 5 10 8 9 6 4 8 11 7 5 15 5 3 2 15 12 11 12 4 3
// Number of alleles per group: 312
// Percentage of missing data: 0.47 %
// Observed heterozygosity: 0.82 [removed the rest to save space]
// Expected heterozygosity: 0.82 [removed the rest to save space]

@zkamvar
Copy link
Collaborator

zkamvar commented Jun 14, 2018

Hello @apaijmans,

Thank you for the detailed description and clear example. I was able to re-create this behavior with the "nancycats" data set:

library("adegenet")
#> Loading required package: ade4
#> 
#>    /// adegenet 2.1.1 is loaded ////////////
#> 
#>    > overview: '?adegenet'
#>    > tutorials/doc/questions: 'adegenetWeb()' 
#>    > bug reports/feature requests: adegenetIssues()
data("nancycats", package = "adegenet")
summary(nancycats)
#> 
#> // Number of individuals: 237
#> // Group sizes: 10 22 12 23 15 11 14 10 9 11 20 14 13 17 11 12 13
#> // Number of alleles per locus: 16 11 10 9 12 8 12 12 18
#> // Number of alleles per group: 36 53 50 67 48 56 42 54 43 46 70 52 44 61 42 40 35
#> // Percentage of missing data: 2.34 %
#> // Observed heterozygosity: 0.67 0.67 0.68 0.71 0.63 0.57 0.65 0.62 0.45
#> // Expected heterozygosity: 0.87 0.79 0.8 0.76 0.87 0.69 0.82 0.76 0.61
summary(seppop(nancycats)[[8]])
#> 
#> // Number of individuals: 10
#> // Group sizes: 10
#> // Number of alleles per locus: 16 11 10 9 12 8 12 12 18
#> // Number of alleles per group: 54
#> // Percentage of missing data: 0 %
#> // Observed heterozygosity: 0.5 0.9 0.7 0.7 0.4 0.5 0.7 0.7 0.4
#> // Expected heterozygosity: 0.66 0.73 0.75 0.66 0.69 0.82 0.66 0.75 0.7

Created on 2018-06-14 by the reprex package (v0.2.0).

Immediate Solution

The immediate solution is to specify drop = TRUE in your seppop call:

summary(seppop(nancycats, drop = TRUE)[[8]])
#> 
#> // Number of individuals: 10
#> // Group sizes: 10
#> // Number of alleles per locus: 6 5 7 5 7 7 4 6 7
#> // Number of alleles per group: 54
#> // Percentage of missing data: 0 %
#> // Observed heterozygosity: 0.5 0.9 0.7 0.7 0.4 0.5 0.7 0.7 0.4
#> // Expected heterozygosity: 0.66 0.73 0.75 0.66 0.69 0.82 0.66 0.75 0.7

Why this happened (gory details)

I think this falls in the grey area between "feature" and "bug", though I'm leaning toward "bug". One of the features of the genind object is that it stores each allele as a separate column of a matrix. The boundaries of loci are denoted by the @loc.fac slot (acessor: locFac()). You can count the number of alleles per locus by tabulating that slot, but this computation increases with the number of alleles, so the genind object also has the @loc.n.all slot (accessor: nAll()), which contains a pre-computed summary of the number of alleles per locus. The summary function simply calls the nAll() accessor when presenting this information.

The problem is how this is computed. Currently, It's calculated by the above method:

x@loc.n.all <- setNames(tabulate(loc.fac), levels(loc.fac))

Instead, it could be calculated by counting up the number of non-zero columns per locus like so:

present_alleles <- colSums(tab(x), na.rm = TRUE) > 0L
x@loc.n.all <- vapply(split(present_alleles, locFac(x)), sum, integer(1))

My only concern is that it may change people's validation workflows if they use nAll() to verify alleles were dropped. That being said, that could easily be checked by assessing the length of locFac().

I'll propose a change in this behavior, but @thibautjombart needs to approve it before it can be included in the next version.

@apaijmans
Copy link
Author

Thanks for the detailed explanation! That really helps!

All the best,
Anneke

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants