Summary seppop number of alleles per locus incorrect #234

apaijmans · 2018-06-14T14:08:38Z

Hi,

I have 8 different populations, each with different number of alleles per locus. I used the total data file, including all populations, to make a GENIND object. I then used seppop() to be able to run some analyses on each population separately.
However, when I use summary() to look at the basic statistics of a single population, the number of alleles per locus are incorrect. It displays the same numbers as for the total data file, instead of the numbers of alleles per locus for only the specific population.

I have included an example below, which hopefully makes it more clear. The first summary is made using the GENIND object for the complete dataset 'all_data', which included 8 populations. I then used seppop(), and made a summary of only the 8th population. The number of alleles per locus are exactly the same as for the first summary, which is not correct. When I made a GENIND object with only the individuals of the 8th population, and use summary() on that, the correct number of alleles per locus are given. All other information, such as Hobs, Hexp and missing data seem to be correct, so the problem seems only to be with how the number of alleles per locus are calculated.

I think this is a bug?

Thanks,
Anneke

> summary(all_data)
// Number of individuals: 1653
// Group sizes: 197 694 396 166 19 51 22 108
// Number of alleles per locus: 12 15 6 16 20 13 14 11 2 5 12 8 17 19 14 4 6 17 8 6 15 10 11 12 5 10 20 8 8 18 7 6 4 25 18 16 16 6 5
// Number of alleles per group: 378 385 372 369 263 272 247 312
// Percentage of missing data: 0.96 %
// Observed heterozygosity: 0.8 [removed the rest to save space]
// Expected heterozygosity: 0.83 [removed the rest to save space]

popdata <- seppop(all_data)

> summary(popdata[[8]])
// Number of individuals: 108
// Group sizes: 108
// Number of alleles per locus: 12 15 6 16 20 13 14 11 2 5 12 8 17 19 14 4 6 17 8 6 15 10 11 12 5 10 20 8 8 18 7 6 4 25 18 16 16 6 5
// Number of alleles per group: 312
// Percentage of missing data: 0.47 %
// Observed heterozygosity: 0.82 [removed the rest to save space]
// Expected heterozygosity: 0.82 [removed the rest to save space]

> summary(testpop8)
// Number of individuals: 108
// Group sizes: 108
// Number of alleles per locus: 7 13 2 8 14 8 9 9 2 4 11 6 10 15 12 4 6 10 7 5 10 8 9 6 4 8 11 7 5 15 5 3 2 15 12 11 12 4 3
// Number of alleles per group: 312
// Percentage of missing data: 0.47 %
// Observed heterozygosity: 0.82 [removed the rest to save space]
// Expected heterozygosity: 0.82 [removed the rest to save space]

The text was updated successfully, but these errors were encountered:

zkamvar · 2018-06-14T18:40:01Z

Hello @apaijmans,

Thank you for the detailed description and clear example. I was able to re-create this behavior with the "nancycats" data set:

library("adegenet")
#> Loading required package: ade4
#> 
#>    /// adegenet 2.1.1 is loaded ////////////
#> 
#>    > overview: '?adegenet'
#>    > tutorials/doc/questions: 'adegenetWeb()' 
#>    > bug reports/feature requests: adegenetIssues()
data("nancycats", package = "adegenet")
summary(nancycats)
#> 
#> // Number of individuals: 237
#> // Group sizes: 10 22 12 23 15 11 14 10 9 11 20 14 13 17 11 12 13
#> // Number of alleles per locus: 16 11 10 9 12 8 12 12 18
#> // Number of alleles per group: 36 53 50 67 48 56 42 54 43 46 70 52 44 61 42 40 35
#> // Percentage of missing data: 2.34 %
#> // Observed heterozygosity: 0.67 0.67 0.68 0.71 0.63 0.57 0.65 0.62 0.45
#> // Expected heterozygosity: 0.87 0.79 0.8 0.76 0.87 0.69 0.82 0.76 0.61
summary(seppop(nancycats)[[8]])
#> 
#> // Number of individuals: 10
#> // Group sizes: 10
#> // Number of alleles per locus: 16 11 10 9 12 8 12 12 18
#> // Number of alleles per group: 54
#> // Percentage of missing data: 0 %
#> // Observed heterozygosity: 0.5 0.9 0.7 0.7 0.4 0.5 0.7 0.7 0.4
#> // Expected heterozygosity: 0.66 0.73 0.75 0.66 0.69 0.82 0.66 0.75 0.7

Created on 2018-06-14 by the reprex package (v0.2.0).

Immediate Solution

The immediate solution is to specify drop = TRUE in your seppop call:

summary(seppop(nancycats, drop = TRUE)[[8]])
#> 
#> // Number of individuals: 10
#> // Group sizes: 10
#> // Number of alleles per locus: 6 5 7 5 7 7 4 6 7
#> // Number of alleles per group: 54
#> // Percentage of missing data: 0 %
#> // Observed heterozygosity: 0.5 0.9 0.7 0.7 0.4 0.5 0.7 0.7 0.4
#> // Expected heterozygosity: 0.66 0.73 0.75 0.66 0.69 0.82 0.66 0.75 0.7

Why this happened (gory details)

I think this falls in the grey area between "feature" and "bug", though I'm leaning toward "bug". One of the features of the genind object is that it stores each allele as a separate column of a matrix. The boundaries of loci are denoted by the @loc.fac slot (acessor: locFac()). You can count the number of alleles per locus by tabulating that slot, but this computation increases with the number of alleles, so the genind object also has the @loc.n.all slot (accessor: nAll()), which contains a pre-computed summary of the number of alleles per locus. The summary function simply calls the nAll() accessor when presenting this information.

The problem is how this is computed. Currently, It's calculated by the above method:

adegenet/R/basicMethods.R

Line 19 in ac436e9

x@loc.n.all <- setNames(tabulate(loc.fac), levels(loc.fac))

Instead, it could be calculated by counting up the number of non-zero columns per locus like so:

present_alleles <- colSums(tab(x), na.rm = TRUE) > 0L
x@loc.n.all <- vapply(split(present_alleles, locFac(x)), sum, integer(1))

My only concern is that it may change people's validation workflows if they use nAll() to verify alleles were dropped. That being said, that could easily be checked by assessing the length of locFac().

I'll propose a change in this behavior, but @thibautjombart needs to approve it before it can be included in the next version.

apaijmans · 2018-06-18T11:51:04Z

Thanks for the detailed explanation! That really helps!

All the best,
Anneke

zkamvar added a commit that referenced this issue Jun 14, 2018

update documentation for #234

0fabf0c

zkamvar mentioned this issue Jun 14, 2018

Change behavior of loc.n.all slot #235

Merged

thibautjombart closed this as completed in 07cd46d Jun 15, 2018

zkamvar mentioned this issue May 28, 2019

Make loc.n.all backwards compatible #257

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Summary seppop number of alleles per locus incorrect #234

Summary seppop number of alleles per locus incorrect #234

apaijmans commented Jun 14, 2018

zkamvar commented Jun 14, 2018

apaijmans commented Jun 18, 2018

Summary seppop number of alleles per locus incorrect #234

Summary seppop number of alleles per locus incorrect #234

Comments

apaijmans commented Jun 14, 2018

zkamvar commented Jun 14, 2018

Immediate Solution

Why this happened (gory details)

apaijmans commented Jun 18, 2018