Evidence of polygenic selection on human stature inferred from spatial distribution of allele frequencies

Spatial patterns of allele frequencies reveal a clear signal of natural (or sexual) selection on human height. The average frequencies of all hits (N=693) and of the 66 most significant common genetic variants (pruned for linkage disequilibrium) in 26 populations belonging to 5 sub-continental human groups were significantly correlated with average phenotypic population height. The method of correlated vectors provided additional evidence for a signal of natural selection in SNPs with higher significance. Factor analysis of the top five genome-wide association study (GWAS) hits revealed a clear factor indicating selection pressures on human height, peaking among northern Europeans and some African groups (Esan in Nigeria) whilst reaching a nadir among South-East Asians. Finally, a new polygenic score was created to take into account the overrepresentation of derived alleles among GWAS hits and population-level differences in derived allele frequencies.

Over the last few years, researchers have started moving away from the study of genetic evolution using a single-gene, Mendelian approach towards models that examine many genes together (polygenic). The more genes are involved in a given phenotype, the more the signal of natural selection will be "diluted" across different genomic regions (because each gene accounts for a tiny effect), making it difficult to detect using approaches focused on a single gene (Piffer, 2014a; Pritchard et al., 2010). A first attempt at empirically identifying polygenic selection was made by Turchin et al. (2012) on two populations (Northern and Southern Europeans), and evidence for a higher frequency of height-increasing alleles (obtained from GWAS studies) among Northern Europeans was provided. A drawback of that study was the reliance on populations from a single continent and the use of crude pairwise comparisons (e.g. French vs. Italian) without correlating frequency differences with average population height. Moreover, the strength of selection was not determined.
Two different approaches to identify selection based on the correlation of allele frequencies across different populations have been recently developed by Piffer (2013) and Berg & Coop (2014).
Piffer's method uses factor analysis of trait-increasing alleles (found by GWA studies) as a tool for finding a factor that represents the strength of selection on a phenotype and the underlying genetic variation (Piffer, 2014a). An additional methodology consists of computing the correlation between allele frequencies and the average phenotypes of different populations; the resulting correlation coefficients are then correlated with the corresponding alleles' genome-wide significance (p value). If the alleles contain selection signals, a positive correlation with significance will be found (that is, a negative correlation with the raw p value), as alleles with a high p value (more likely to be false positives) have a weaker correlation with the average population phenotype (Piffer, 2014a).
Piffer's method (Piffer, 2013; Piffer, 2014a) to identify signals of polygenic selection was used in this study and applied to the top five GWAS hits (ranked according to p value). Piffer (2014b) carried out a study on height SNPs, but it was based on a smaller GWAS sample and an older version (phase 1) of the 1000 Genomes data, containing data for only 14 populations. This paper uses the phase 3 1000 Genomes data, and the GWAS meta-analysis was carried out on a much larger sample, which produces more hits with better significance. The aim of this paper is to test the hypothesis that stature has undergone natural or sexual selection in populations after humans dispersed across different continents, giving rise to distinct genetic clusters.
This study also exploits additional information provided by frequencies of derived and ancestral alleles. At a theoretical level, an ancestral allele is the allele that was carried by the last common ancestor of humans and other primates, whereas an allele is derived when it arose in the human lineage after the split from other primates. In practice, the ancestral allele is usually ascertained via comparison with chimpanzees. One limitation of this procedure is that if a mutation arose in chimpanzees after the split from humans, then the ancestral allele is not the chimp allele. Thus, 1000 Genomes infers ancestral alleles via alignment with 6 primate species (Ensembl, 2015).
Derived allele frequencies (DAF) are not the same for all populations. Substantial DAF differences across populations have been found, largely due to random drift and population bottlenecks but in part also to different selection pressures (Henn et al., 2015).
Non-African populations tend to have higher frequencies of derived alleles, and DAF is positively correlated with distance from Africa (Henn et al., 2015). If among the GWAS hits there is an overrepresentation of one allele type (more ancestral than derived alleles or vice versa), this could bias the polygenic score towards the populations with a higher background frequency of that allele type. Thus, this study will take into account baseline differences in DAF to create an "unbiased" or "DAF-corrected" polygenic score.
Average population height was obtained from the references listed at http://en.wikipedia.org/wiki/Human_height, considering only statistics published after 2000 and young age groups (18-40).
Only 11 populations met these criteria (see references in Table 1).
First, the entire sample of SNPs meeting genome-wide significance (N=697) was downloaded and the population frequencies for the alleles with a positive Beta in the 250k GWAS meta-analysis were calculated. Four SNPs could not be included because they were absent from 1000 Genomes, leaving 693.
For each chromosome, the three alleles with the lowest p values were selected, and these were all unlinked (>500 kb apart from each other). Only unlinked alleles were used to avoid the confounding influence of linkage on cross-population allele frequency. Selection was restricted to the alleles with the highest significance because these are less likely to be false positives. The same number of SNPs (3) from each chromosome was used to get a representative sample of the entire genome and to avoid bias due to chromosome location. The conventional nominal threshold of p < 5×10⁻⁸ was used for significance (Barsh et al., 2012).
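The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text says the chosen alleles were more than 500 kb apart, but it does not describe how candidates were filtered, so the greedy filter below, the function name `select_top_unlinked`, the record fields, and the toy SNP records are all hypothetical.

```python
from collections import defaultdict

MIN_GAP_BP = 500_000   # spacing threshold taken from the text (>500 kb)
PER_CHROM = 3          # number of SNPs kept per chromosome, as in the text


def select_top_unlinked(snps, per_chrom=PER_CHROM, min_gap=MIN_GAP_BP):
    """Greedily keep the most significant SNPs on each chromosome,
    skipping any SNP that lies within min_gap bp of one already kept.
    (The greedy rule is an assumption, not the author's stated procedure.)"""
    by_chrom = defaultdict(list)
    for snp in snps:
        by_chrom[snp["chrom"]].append(snp)

    selected = []
    for chrom in sorted(by_chrom):
        kept = []
        # Lowest p value first, i.e. highest significance first.
        for snp in sorted(by_chrom[chrom], key=lambda s: s["p"]):
            if all(abs(snp["pos"] - k["pos"]) > min_gap for k in kept):
                kept.append(snp)
            if len(kept) == per_chrom:
                break
        selected.extend(kept)
    return selected


# Made-up records for illustration only (not real GWAS hits).
toy_snps = [
    {"id": "snpA", "chrom": 1, "pos": 1_000_000, "p": 1e-30},
    {"id": "snpB", "chrom": 1, "pos": 1_200_000, "p": 1e-28},  # within 500 kb of snpA
    {"id": "snpC", "chrom": 1, "pos": 5_000_000, "p": 1e-20},
    {"id": "snpD", "chrom": 1, "pos": 9_000_000, "p": 1e-12},
    {"id": "snpE", "chrom": 1, "pos": 20_000_000, "p": 1e-9},
    {"id": "snpF", "chrom": 2, "pos": 3_000_000, "p": 1e-15},
]
chosen = [s["id"] for s in select_top_unlinked(toy_snps)]
print(chosen)
```

In the toy data, snpB is dropped because it sits too close to the more significant snpA, and snpE is never reached because three SNPs have already been kept on chromosome 1.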

Amendments from Version 2
GWAS hits were divided according to derived or ancestral allele status. A paragraph was added to the introduction to justify this novel procedure. A new subsection titled "Controlling for population differences in derived allele frequencies" was added to the results section. Derived allele frequencies (DAF) differ between populations, thus a "DAF-corrected" polygenic score was computed. Two tables were added (6 and 7) to the results section, reporting the average DAF and the DAF and AAF (ancestral allele frequency) for the alleles with a positive effect, respectively. Table 7 also reports the DAF-corrected scores. A paragraph elaborating on these new findings was added to the Discussion.
The abstract has a minor correction and a sentence was added.

A polygenic score was calculated as the mean frequency of height-increasing alleles (defined as those with a positive Beta coefficient in the meta-analysis).

Analyses were carried out using R.

Results

Polygenic score
Polygenic scores and average country height are reported in Table 1.
The Pearson correlation between polygenic score including all hits (N=693) and average country height was r=0.79 (N=11, p=0.004).
The correlation with the 66 hits polygenic score was r=0.83 (N=11, p=0.002). Table 2 reports average frequencies by sub-continental populations.
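As a rough illustration of the two quantities reported here, the following Python sketch computes a polygenic score as the mean frequency of height-increasing alleles and correlates it with average height across populations. The population labels, frequencies and heights are invented toy values, and the helper names `polygenic_score` and `pearson` are hypothetical; the original analysis was carried out in R.

```python
from math import sqrt


def polygenic_score(freqs):
    """Mean frequency of the height-increasing alleles in one population."""
    return sum(freqs) / len(freqs)


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy data: three alleles in four made-up populations (not from the paper).
toy_freqs = {
    "pop1": [0.60, 0.70, 0.50],
    "pop2": [0.50, 0.60, 0.40],
    "pop3": [0.40, 0.50, 0.30],
    "pop4": [0.45, 0.40, 0.35],
}
toy_height_cm = {"pop1": 180.0, "pop2": 176.0, "pop3": 170.0, "pop4": 172.0}

pops = sorted(toy_freqs)
scores = [polygenic_score(toy_freqs[p]) for p in pops]
heights = [toy_height_cm[p] for p in pops]
r = pearson(scores, heights)
print(f"r = {r:.3f} across {len(pops)} populations")
```

The score is unweighted: every allele contributes equally regardless of its GWAS effect size, which matches the definition of the polygenic score as a mean frequency.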

Method of correlated vectors (MCV)
Spearman's rank order correlations between each allele's p value and its correlation with the polygenic score (66 hits) and with height were -0.26 and -0.34, respectively (N=66, p=0.037 and 0.0053). The "rcorr" and "cor" functions in R produced slightly different results due to differences in dealing with ties (equal values). "cor" produced slightly stronger coefficients (-0.28 and -0.37). The correlations using all hits were much smaller, but in the right direction (N=693; r=-0.059, p=0.121 and r=-0.111, p=0.003 for height and polygenic score, respectively).
This provides evidence for the hypothesis that more significant GWAS hits (alleles) are enriched with natural selection signal. A similar phenomenon was observed in a previous analysis of genes affecting human height (Piffer, 2014b).
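A minimal Python sketch of the method of correlated vectors as described in this section is given below. It assumes ties are handled by assigning average ranks, which is one common convention and not necessarily what either R function mentioned above does; the function names and the toy frequencies are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlated_vectors(p_values, allele_freqs, pop_trait):
    """Correlate each allele's p value with that allele's correlation
    (across populations) with the trait. allele_freqs[i] holds the
    frequencies of allele i in the same population order as pop_trait."""
    r_with_trait = [pearson(f, pop_trait) for f in allele_freqs]
    return spearman(p_values, r_with_trait)


# Toy example: four alleles, three populations (values are invented).
toy_heights = [170.0, 175.0, 180.0]
toy_p = [1e-30, 1e-20, 1e-10, 1e-9]
toy_allele_freqs = [
    [0.2, 0.4, 0.6],
    [0.3, 0.5, 0.4],
    [0.5, 0.5, 0.4],
    [0.6, 0.4, 0.2],
]
mcv = correlated_vectors(toy_p, toy_allele_freqs, toy_heights)
print(f"MCV Spearman rho = {mcv:.2f}")
```

A negative coefficient means that alleles with lower p values track the population trait more closely, which is the direction reported in this section.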
Factor analysis of the top 5 hits
Factor analysis requires a satisfactory case-to-variable ratio, thus only a handful of SNPs could be used, and these necessarily had to be those with the lowest p values, as they are more likely to be genuine hits (see previous section, MCV).
The top 5 alleles (i.e. those with the lowest p values) all correlated with the polygenic score and with average height in the expected direction (positively), as shown in Table 3 (see Dataset 3). The average correlations were 0.58 and 0.69, respectively, which is a substantial improvement compared to the average of the correlations with polygenic score and height of all the 66 alleles (r=0.03 and 0.04, respectively; see Dataset 2, cells BP38-39).
A factor analysis using minimum residuals was carried out. A single factor was extracted that explained 42% of the variance.
Factor loadings are displayed in Table 4. These are all positive (in the expected direction).
Factor scores were extracted with the Thurstone method (Thurstone, 1947), and are reported in Table 5.
The Pearson correlation between average country height and the factor score was strongly positive (r=0.88, N=11, p=0.001). This factor was also significantly correlated with the 66 hits polygenic score (r=0.78, N=26, p<0.001) and with the 693 hits polygenic score (r=0.545, N=26, p=0.004).

Controlling for population differences in derived allele frequencies
Allele status could be ascertained for 691 of the 697 SNPs. Among the alleles with a positive effect, there were 370 derived and 321 ancestral alleles. Since this is not an equal representation, it creates a potential confounding factor. The derived allele frequency (DAF) was computed including both positive and negative effect alleles, to verify that it varied among populations. Average DAF is reported in Table 6. The results confirmed previous findings that non-African populations have higher frequencies of derived alleles (Henn et al., 2015). Since there are more derived alleles with a positive effect in this sample, the polygenic scores for African populations are lowered compared to non-African populations. Correcting for this bias will thus increase the polygenic scores of African populations relative to the others.
Since the frequency of ancestral alleles is the complement of DAF (1-DAF), a polygenic score was calculated that gives equal weight to the ancestral and derived alleles by averaging the mean frequencies of ancestral and derived alleles with a positive GWAS effect. This ensures that polygenic scores of populations with higher background DAF are not biased upwards. Table 7 reports the average frequencies of derived and ancestral alleles with a positive effect, sorted in descending order by their mean polygenic score. The top and bottom scores are obtained by European and East Asian populations, respectively. The correlation with average population height is r=0.819.
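The equal-weighting idea can be illustrated with a short Python sketch. It assumes the corrected score is simply the average of two means (one over the height-increasing derived alleles, one over the height-increasing ancestral alleles), as the paragraph above describes; the function names and the toy frequencies are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def uncorrected_score(derived_freqs, ancestral_freqs):
    """Plain mean over all height-increasing alleles, so the more numerous
    allele type carries more weight."""
    return mean(list(derived_freqs) + list(ancestral_freqs))


def daf_corrected_score(derived_freqs, ancestral_freqs):
    """Average of the two group means, giving derived and ancestral
    height-increasing alleles equal weight regardless of their counts."""
    return (mean(derived_freqs) + mean(ancestral_freqs)) / 2


# Toy population with three derived and one ancestral height-increasing
# allele (frequencies are invented for illustration).
toy_derived = [0.8, 0.6, 0.7]
toy_ancestral = [0.3]

raw = uncorrected_score(toy_derived, toy_ancestral)
corrected = daf_corrected_score(toy_derived, toy_ancestral)
print(f"uncorrected = {raw:.2f}, DAF-corrected = {corrected:.2f}")
```

In the toy case the uncorrected score is pulled towards the derived alleles because there are three of them, whereas the corrected score treats the two groups symmetrically.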

Discussion
A polygenic score, created by averaging the frequencies in 26 populations of all SNPs that attained genome-wide significance (N=693; p<5×10⁻⁸) in the largest and most recent human height GWAS, was positively correlated with the average height of 11 populations (r=0.79). Another polygenic score, obtained from 66 height-increasing alleles pruned for linkage disequilibrium, was also positively correlated with the average height of 11 populations (r=0.83).
The method of correlated vectors revealed that alleles with lower p values had a higher correlation with phenotypic height and polygenic score, suggesting that they tend to be enriched with signal of natural selection. A factor analysis of the top five GWAS hits produced a factor (whose loadings are all in the expected direction) which is significantly and strongly correlated both with population average height and with the polygenic scores. This showed an improvement over the correlation of the five single alleles with population height (Table 3, last row), which averaged 0.66 and in turn improved over the average correlation of the 66 alleles, which was near zero.
The rankings of polygenic and factor scores match the folk perception of the stature of various racial groups: Africans > Europeans > South/Central Asians > Hispanics > East Asians (Table 2). However, the ranking of Africans was lower in the polygenic score computed using all the GWAS hits, whereas the others were little altered with respect to each other.
South East Asians had the lowest scores, a result which matches their anthropometric description.
Within Europe, northern Europeans (Finns and White Americans) had a higher genotypic stature than their southern counterparts (Italians and Spaniards), confirming the results from a previous study on GWAS loci which compared northern vs southern Europeans (Turchin et al., 2012).
A correction for background derived allele frequency (DAF) was performed using the SNPs for which this information was available (N=691). Derived alleles were less common among African populations, confirming findings from previous studies (Henn et al., 2015). Since there were slightly more derived alleles among the GWAS hits (i.e. the alleles with a positive effect on stature), this biased the polygenic scores of African populations downward. When a DAF-corrected score was created, the African populations moved up from the bottom scores, which were in turn "occupied" by the East Asian populations (e.g. Japanese, Chinese and Vietnamese), confirming the traditional anthropometric description. There was also a marginal improvement in the correlation with average population height (r=0.82 vs 0.79 for the corrected and uncorrected scores, respectively).
A limitation was the unavailability of sound statistics on the average height of many populations. Moreover, although human height is largely heritable, it is also heavily influenced by nutrition and living conditions. The importance of environment is suggested by the dramatic secular trend which took place in the 20th century in developed countries (e.g. Arcaleni, 2006; Webb et al., 2008); an association with dietary intake (i.e. milk consumption) and socioeconomic status has also been observed (Mamidi et al., 2011; Webb et al., 2008). Most of the missing data were for developing countries, which likely have not reached their full growth potential, or for ethnic groups living in Western societies (Indian Telugu or Gujarati) for which anthropometric statistics are not easily available. If the allele frequency factor represents a genuine signal of natural selection, then the difference between it and current phenotypic height could be used as an indicator of the quality of diet and living conditions in general.

Conclusion
Factor analysis of allele frequencies is a promising method for detecting signals of recent selection on polygenic traits.

F1000Research
The only major issue with this manuscript, in my opinion, is the availability of the R scripts used to produce the data tables (especially given the difference between cor and rcorr). I looked for any link to, or availability of, these scripts in Piffer 2013 or 2014a-d, and could not find it. Without these scripts, I do not think this work can be considered reproducible.
Two additional (minor) revisions are as follows.
First, the phenotypic data in Table 1 seem to rely solely on Wikipedia. It seems likely that the author could find additional height information beyond Wikipedia.
Second, the last four paragraphs of the article could be combined into one.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.
A GWAS meta-analysis (Wood et al., 2014) based on a very large sample (N=250K) identified common variants responsible for normal variation in human height within populations.

Table 1. Polygenic score and height.
*Not on Wikipedia. Region of a country; more specific statistics found elsewhere.

Table 3. Top five SNPs (p value and r with polygenic (pol) score).

Table 7. Average frequency of alleles with a positive effect. AA=Ancestral alleles; DA=Derived alleles. Mean (DAF-corrected) polygenic score sorted in descending order.

Hung MV, Pak S: The impact of environment on morphological and physical indexes of Vietnamese and South Korean students. VNU Journal of Science, Natural Science and Technology. 2008; 24: 50-55.
Mamidi RS, Kulkarni B, Singh A: Secular
Factor Analysis of Population Allele Frequencies as a Simple, Novel Method of Detecting Signals of Recent Polygenic Selection: The Example of Educational Attainment and IQ. Mankind Quart. 2013; 54(2): 168-200.
Piffer D: Simple statistical tools to detect signals of recent polygenic selection. IBC. 2014a; 6(1): 1-6.
Piffer D: Opposite selection pressure on stature and intelligence across human populations. Open Behav Genet. 2014b.
Piffer D: Dataset 1 in: Evidence of polygenic selection on human stature inferred from spatial distribution of allele frequencies. F1000Research. 2015a.
Dataset 3 in: Evidence of polygenic selection on human stature inferred from spatial distribution of allele frequencies. F1000Research. 2015b.
Piffer D: Dataset 4 in: Evidence of polygenic selection on human stature inferred from spatial distribution of allele frequencies. F1000Research. 2015c.
Pritchard JK, Pickrell JK, Coop G: The