Using sheep genomes from diverse U.S. breeds to identify missense variants in genes affecting fecundity

Background: Access to sheep genome sequences significantly improves the chances of identifying genes that may influence the health, welfare, and productivity of these animals. Methods: A public, searchable DNA sequence resource for U.S. sheep was created with whole genome sequence (WGS) of 96 rams. The animals shared minimal pedigree relationships and represent nine popular U.S. breeds and a composite line. The genomes are viewable online with the user-friendly Integrated Genome Viewer environment, and may be used to identify and decode gene variants present in U.S. sheep. Results: The genomes had a combined average read depth of 16, and an average WGS genotype scoring rate and accuracy exceeding 99%. The utility of this resource was illustrated by characterizing three genes with 14 known coding variants affecting litter size in global sheep populations: growth and differentiation factor 9 ( GDF9), bone morphogenetic protein 15 ( BMP15), and bone morphogenetic protein receptor 1B ( BMPR1B). In the 96 U.S. rams, nine missense variants encoding 11 protein variants were identified. However, only one was previously reported to affect litter size ( GDF9 V371M, Finnsheep). Two missense variants in BMP15 were identified that had not previously been reported: R67Q in Dorset, and L252P in Dorper and White Dorper breeds. Also, two novel missense variants were identified in BMPR1B: M64I in Katahdin, and T345N in Romanov and Finnsheep breeds. Based on the strict conservation of amino acid residues across placental mammals, the four variants encoded by BMP15 and BMPR1B are predicted to interfere with their function. However, preliminary analyses of litter sizes in small samples did not reveal a correlation with variants in BMP15 and BMPR1B with daughters of these rams. Conclusions: Collectively, this report describes a new resource for discovering protein variants in silico and identifies alleles for further testing of their effects on litter size in U.S. breeds.

There are currently 48 Mendelian traits and disorders in sheep where the causative variants are known 1 . Many of these variants affect the gene's protein sequence, and thereby alter its normal function. Although gene function may be affected by a wide range of large and small scale genomic sequence differences 2,3 , variants that alter amino acid sequences via missense, nonsense, frameshift, and splice site variants, are among those most likely to affect function 4 . DNA polymorphisms encoding these protein variants are readily identified by aligning genomic sequences of animals to a high-quality, annotated reference genome assembly like that available for sheep 3 . Identifying protein variants encoded by individuals in a population is an essential first step in characterizing genes known to influence traits 5,6 .
In principle, protein variants may be identified in silico for a gene of interest with access to population-scale whole genome sequence (WGS) data, like that found at the National Center for Biotechnology Information (NCBI) BioProjects and Sequence Read Archives (SRA). The first large ovine BioProject was deposited by the International Sheep Genomics Consortium (ISGC), which included the genome sequences of 75 sheep from 43 breed groups and two wild species from around the world (PRJNA160933). Although global diversity is outstanding in these sheep, these animals are not ideally suited for protein variant discovery across U.S. sheep populations due to their exotic breed composition and low numbers within breed. In addition, the terabyte size of SRA datasets is challenging to work with, and not readily searchable by gene or accessible on the internet with a user-friendly environment, such as the Integrated Genome Viewer (IGV) 7,8 .
We previously showed in cattle that protein variants for a gene of interest may be identified in silico with the appropriate population sample and 14x WGS datasets 9 . To that end, we created a similar publicly accessible, 16x WGS resource of 96 rams, that is viewable online with IGV. The rams share minimal pedigree relationships, and represent nine popular U.S. breeds and a composite line. Their genomes may be used to identify DNA polymorphisms in genes that affect the protein sequences in U.S. sheep populations.
To highlight the utility of this resource, we analyzed three wellstudied genes previously shown to encode protein variants affecting litter size in sheep: growth and differentiation factor 9 (GDF9), bone morphogenetic protein 15 (BMP15), and bone morphogenetic protein receptor 1B (BMPR1B). Together, there are 14 previously reported missense, nonsense, and frameshift variants affecting the protein function of these genes, and thereby affect ovulation rate and litter size 10,11 .
The proteins encoded by GDF9 and BMP15 are oocyte-secreted paralogs of the transforming growth factor-beta (TGF-β) superfamily that form homo-and heterodimeric ligands, and are essential for ovarian and follicular development 12 . These ligands synergistically regulate folliculogenesis through complex interactions with multiple receptors, such as BMPR1B. The BMPR1B gene encodes a type 1 membrane protein receptor that binds GDF9 and BMP15 in some mammals, although the identities of the BMPR1B ligands in sheep are unknown 13 . The amino acid sequences of GDF9, BMP15, and BMPR1B are highly conserved among placental mammals, and variants that alter key residues in peptide sequence, diminish function, and affect traits like ovulation rate and litter size. For example, substitution of arginine (R) for glutamine (Q) at position 249 (Q249R) in BMPR1B causes attenuation of BMPR1B signaling and ultimately leads to an increase ovulation rate 14,15 . Likewise, missense, nonsense, and frameshift variants in GDF9 and BMP15 may abolish function and cause an increase in ovulation rate in carrier ewes, while causing sterility in homozygous ewes 10 . However, some homozygous missense variants only diminish the protein's biological activity. For example, the homozygous substitution of methionine (M) for valine (V) at position 371 (V371M) in GDF9 allows ewes to remain fertile and hyper prolific. Since the types and distribution of protein variants encoded by these genes was unknown in U.S. sheep, we sought to identify them with WGS from the set of 96 U.S. rams.
We identified nine missense variants and 11 encoded protein variants in the three genes evaluated. Only one variant was previously known to be associated with increased litter size (GDF9, V371M). However, four variants were not previously reported. In BMP15, a Q for R substitution was observed at position 67 (R67Q), and a proline (P) for leucine (L) substitution was observed at position 252 (L252P). In BMPR1B, an isoleucine (I) for M at position 64 (M64I), and an asparagine (N) for threonine (T) was observed at position 345 (T345N). Based on the pattern of evolutionary conservation for these residues in vertebrates, it was hypothesized that some of these novel missense variants could interfere with protein function, affect litter size, and be useful for producers interested in modulating lamb production to match available resources.

Ethics statement
This article contains no studies performed with animal subjects. ; and one Navajo-Churro ram ( Figure 1). In addition to their contributions to the U.S. sheep industry, the breeds were selected to represent genetic diversity for traits such as fertility, prolificacy, maternal ability, growth rate, carcass leanness, wool quality, mature weight, and longevity. The Navajo-Churro ram was included for its rare lysine 171 (K171) substitution in the prion gene. The rams sampled from each breed were chosen to minimize their genetic relationships at the grandparent level. DNA samples of all 96 rams have been made available for global use as genotyping reference material since 2010 16 .
WGS production, alignment, and SNP genotyping DNA was extracted from whole blood with a typical phenol:chloroform method and stored at 4°C in 10 mM TrisCl, 1 mM EDTA (pH 8.0) as previously described 16 . Library preparation for DNA sequencing was also accomplished as previously described 9 . Briefly, 2 μg of ovine genomic DNA was fragmented and used to make indexed, 500 bp, paired-end libraries. Pooled libraries were sequenced with a massively parallel sequencing machine and highoutput kits (NextSeq500, two by 150 paired-end reads, Illumina Inc.). Pooled libraries with compatible indexes were repeatedly sequenced until 40 GB of data with greater than Q20 quality was collected for each ram, thereby producing at least 10-fold mapped read coverage for each index. This level of coverage provides scoring rates and accuracies that exceed 99% 9,18 . The DNA sequence alignment process was similar to that previously reported 18 . FASTQ files were aggregated for each animal and DNA sequences, aligned individually to Oar_v3.1 with the Burrows-Wheeler Alignment tool (BWA) aln algorithm version 0.7.12 19 , merged, and collated with the bwa sampe command. The resulting sequence alignment map (SAM) files were converted to binary alignment map (BAM) files, and subsequently sorted via SAMtools version 1.3.1 20 . Potential PCR duplicates were marked in the BAM files using the Genome Analysis Toolkit (GATK) version 3.6 21 . Regions in the mapped dataset that would benefit from realignment due to small indels were identified with the GATK module RealignerTargetCreator, and realigned using the module IndelRealigner. The BAM files produced at each of these steps were indexed using SAMtools. The resulting indexed BAM files were made immediately available via the Intrepid Bioinformatics genome browser with groups of animals linked at the USDA, ARS, USMARC internet site.
The raw reads were deposited at NCBI BioProject PRJNA324837. Mapped datasets for each animal were individually genotyped with the GATK UnifiedGenotyper with arguments "--alleles" set to the VCF file (Supplementary File S1), "--genotyping_mode" set to "GENOTYPE_GIVEN_ALLELES", and "--output_mode" set to "EMIT_ALL_SITES". Lastly, some SNP variants were identified manually by inspecting the target sequence with IGV software version 2.1.28 7,8 (described below in Methods section entitled 'Identifying protein variants encoded by GDF9, BMP15, and BMPR1B genes'). In these cases, read depth, allele count, allele position in the read, and quality score were taken into account when the manual genotype determination was made.
Evaluating WGS data integrity with 163 reference SNPs and 50 k bead array SNPs Genotypes from a set of 163 reference SNPs were used as an initial verification of the WGS datasets. These DNA markers have been used for parentage determination, animal identification, and disease traceback 22 . The 163 reference SNPs were previously genotyped across the MSDPv2.4 by multiple overlapping PCR-Sanger sequencing reactions, multiplexed matrix-assisted laser desorption/ ionization time-of-flight mass spectrometry (MALDI-TOF MS) genotyping assays, and 50 k bead array platforms 22 . The genotype call rate was defined as the number of SNP sites with three or more mapped reads, divided by the total number of sites tested. The error rate in the WGS data was estimated by comparing the independently-derived consensus genotypes for these SNPs to the WGS genotypes. An animal's WGS dataset passed initial verification when the accuracy of the WGS genotypes exceeded 97%, and the average mapped read depth was proportional to the amount of WGS data collected. Animals' datasets that failed this initial verification were inspected for contaminating and/or missing files. Once identified, the dataset was corrected and reprocessed. Linear regression analysis was accomplished in Excel version 2016. Access to the sequence was made available via USDA, ARS, USMARC internet site. Because the raw datasets were available online as they were produced, the raw FASTQ files were deposited in the NCBI SRA only after they were validated as described above. These 96 sets of files may be accessed through BioProject PRJNA324837 in the Project Data table under the Resource Name: SRA Experiments.
SNPs from the OvineSNP50 BeadChip (Illumina Inc.) were selected for comparison because they were numerous, uniformly distributed across the ovine genome, and available. Based on the nucleotide sequence of the 54,242 probes obtained from the manufacturer, the positions of 51,796 SNPs were verified via a BLAT process, as previously described 18 . There were 50,357 of these that mapped uniquely to autosomes and were used for analysis (Supplementary File S1). The genotypes from the WGS data were compared to those from the 50 k bead array with a custom program written specifically for this operation.
Identifying protein variants encoded by GDF9, BMP15, and BMPR1B genes The nucleotide variation in the exon regions of GDF9, BMP15, and BMPR1B was visualized through the public access portal at ARS USMARC with open source software installed on a laptop computer. Variants were recorded manually in a spreadsheet as previously described 9 . Briefly, a Java Runtime Environment version 8, update 131 (Oracle Corporation, Redwood Shores, CA) was first installed on the computer. When links to the data were selected from the appropriate web page, IGV software version 2.1.28 7,8 automatically loaded from a third-party site (University of Louisville, Louisville KY) and the mapped reads were loaded in the context of the ovine Oar_v3.1 reference genome assembly. Gene variants were viewed by loading WGS from a set of eight animals of different breeds, and the IGV browser was directed to the appropriate genome region by entering the gene abbreviation in the search field (e.g., GDF9). The IGV zoom function was used to view the first exon at nucleotide resolution with the "Show translation" option selected in IGV. Since GDF9 was in the reverse orientation with regards to the Oar_v3.1 assembly, the reference sequence was reversed so the translation was correctly viewed from right to left. The exon sequences were visually scanned for polymorphisms that would alter amino acid sequences, such as missense, nonsense, frameshift, and splice site variants. Once identified, the nucleotide position corresponding to a protein variant was viewed and recorded for all 96 animals. Using IGV, codon tables, and knowledge of the ovine GDF9, BMP15, and BMPR1B protein sequences (NP_001136360.2, NP_001108239.1, and NP_001009431.1, respectively), the codons affected by nucleotide alleles were translated into their corresponding amino acids and their Oar_v3.1 positions noted. Haplotype-phased protein variants were unambiguously assigned in individuals that were either: 1) homozygous for all variant sites, or 2) had exactly one heterozygous variant site. Maximum parsimony phylogenetic trees were manually constructed from the unambiguously phased protein variants. The phylogenetic trees were used, together with simple maximum parsimony assumptions, to infer haplotype phase in seven rams where two heterozygous variant sites occurred in GDF9.
The protein phylogenetic trees were rooted by comparing the variable residues in sheep to those from related species. Ovine peptide sequences for GDF9, BMP15, and BMPR1B were used to search NCBI's refseq_protein database with BLASTP 2.6.1 23,24 . Aligned protein sequences from a representative subset of 29 vertebrate species were used for the comparison.
Statistical analysis of litter size in daughters of carrier rams Lambing records for daughters of carrier rams were retrieved from the USMARC historical database and analyzed with the mixedmodel analysis of variance procedure (MIXED) of SAS (SAS Inst., Inc., Cary, NC; version 9.3). The phenotype evaluated was total number of lambs born (including stillborn) as a repeated record for each ewe. Different sets of ewes contributed to the analysis of each gene locus, and breed-specific genotype contrasts were evaluated. There were, however, similar models employed for all of the analyses. The models included fixed effects of classification for ewe age, and the sire-derived genotype class for the allele contrast in question. Three groups were created for ewe age to combine similar biologically performing ages: Group 1, ewe lambs; Group 2, ewes aged 2-5 years; and Group 3, ewes older than 5 years. The random effect of "ewe" was fitted and used to test the genotype contrast mean square. The Kenward-Roger option was used to approximate denominator degrees of freedom associated with the random effect of "ewe". For analysis of the X-linked BMP15 allele contrasts, the sire-derived gamete in these daughters was known directly. For analyses of autosomal genotype contrasts, it was inferred that rams of different genotypes had different distributions of daughter genotypes sampled. This inference reduced the power of analysis compared to a direct allelic test because we cannot determine the maternal-derived allele.

Genome sequencing and validation of WGS datasets
The average amount of genomic DNA sequence collected per animal was 50.4 GB (range 40.0 -97.7, SD 10.4). Independentlyderived genotypes from two sets of reference SNPs were used to confirm the identity and evaluate the quality of these data: those from 163 parentage markers, and those from approximately 50,000 markers on the OvineSNP50 bead array. Both sets have SNPs that are well distributed, highly-informative, and have been widely used. The WGS-derived genotypes for the 163 parentage SNPs were obtained by manually viewing an animal's mapped reads at the relevant genome coordinates via the internet and third-party software (illustrated in Figure 2A, and described in Methods). The expected genotypes and read depths were consistent for all but one of the 96 datasets, owing to missing data for that animal. After rectifying the data omission and performing regression analysis of the data for all 96 rams, the average calculated read depth (17.0) was directly proportional to the amount of sequence collected for each animal (range 11.9 -33.9, SD 3.6; Figure 2B). a mismapping of reads as this site averaged only 3.5 reads per animal. The overall accuracy of WGS genotypes for the 163 reference SNPs was 99.4%, and no animals had a SNP genotype accuracy less than 97% (i.e., not more than 4 errors in 163 SNP genotypes; Figure 2C). The few WGS genotype errors observed were typically caused by undetected heterozygous alleles at sites with low read coverage. Thus, comparing genotypes from 163 reference SNPs to those derived from the WGS file sets was effective for discovering and repairing errors, and independently verifying coverage.
The coverage and integrity of the WGS datasets were also evaluated at 50,357 evenly distributed, autosomal SNP sites from bead array data 25 . When plotted as a distribution of read depths by SNPs for all animals combined, the read depth was normally distributed with a mode near 16 ( Figure 3A). The calculated average read depth per SNP per animal was 16.8 for the 50 k bead array SNPs (Min 11.7, Max 34.2, SD 3.5), compared to 17.0 for the 163 reference SNPs above. Averaged over all animals, the concordance between WGS genotypes and those from the bead array was 99.5% ( Figure 3B) compared to 99.4% for the 163 reference SNPs. The genotype concordance reached a maximum at approximately 99.89% for the animal with the highest read depth (34.2-fold, 97.7 GB Q20 data). Taken together, the WGS genotype results for 163 reference SNPs was consistent with those for the 50 k bead array SNPs and indicated that the WGS datasets from these 96 rams are of sufficient quality and coverage for use in identifying and decoding gene variants in U.S. sheep.
Protein variants encoded by GDF9, BMP15, and BMPR1B genes The WGS data for the 96 rams were used to analyze the coding regions of GDF9, BMP15, and BMPR1B. These genes encode proteins of 453, 393, and 502 amino acids, respectively, each with multiple functional domains ( Figure 4A). Viewing the aligned sequences and detecting variants was simple, fast, and accurate with the IGV software and a publicly available web-based browser developed for this purpose ( Figure S1, Table S1). Nine missense variants were observed in the three genes with the 96 genomes ( Table 1). Four of the nine variants were not previously reported: BMP15 (R67Q, L252P) and BMPR1B (M64I, T345N). No other missense, nonsense, frameshift, splice site, or indel variants affecting the coding region were detected. A comparative list of the coding variants discovered here is given in Table 2, together with those previously reported for the three genes. Eleven protein sequence isoforms were predicted from phased combinations of codon variants (Table 3). Haplotypes were translated and placed in the context of a phylogenetic tree for predicted variants for GDF9, BMP15, and BMPR1B ( Figure 4B). The trees were rooted based on the pattern of evolutionary conservation of the residues in vertebrates ( Figure 5). All four of the previously unreported protein variants encoded by BMP15 and BMPR1B were on the distal nodes of their respective tree, indicating they arose after those on adjacent nodes.
The previously reported GDF9 V371M variant was present in our reference panel only in Finnsheep (Table 4). Alleles encoding the M371 residue are associated with increased litter size in both carriers and homozygous individuals ( An analysis of amino acid sequence conservation among species helped identify critical residues more likely to be involved in important protein functions. Ovine GDF9, BMP15 and BMPR1B were 80, 88, and 99% identical to other Artiodactyla species, at the propeptide sequence level ( Figure 5). We predict that variant residues in highly conserved protein domains are more likely to affect ovulation rate and litter size. The most well conserved residue in GDF9 showing missense variation among sheep breeds is V371,     located in the TGF-β-like domain. Throughout Eutheria, sheep were the only species observed to have the M371 residue encoded by GDF9 ( Figure 5). This is consistent with the substitution of the M371 residue causing reduced protein function, and therefore increased litter size in Finnish Landrace sheep (Table 2). Less conserved were the GDF9 residues at positions 87, 241, and 332, which are variable throughout Eutheria species and have not been associated with fecundity in sheep. With regards to missense variants in the other TGF-β ligand, BMP15 residues at positions 11, 67, and 252, were conserved through most of the Laurasiatheria, although the L11 deletion variant is common in sheep and has not been associated with fecundity (Table 2). Since Q67 and P252 substitutions in BMP15 have not been previously reported, their impact on protein function or reproductive phenotype has yet to be determined.
Conservation in the TGF-β receptor ligand receptor, BMPR1B, is particularly striking with 98% propeptide identity observed throughout Eutheria species, compared to approximately 76% and 77% for GDF9 and BMP15, respectively. Moreover, BMPR1B residues at positions 64 and 345 are also conserved throughout the Eutheria, suggesting that the I64 and the N345 substitutions in sheep may affect protein function. The I64 substitution in Katahdin sheep is in the extracellular activin receptor domain, whereas the N345 substitution in Romanov sheep is between the active site proton acceptor domain and the activation loop of the cytoplasmic domain ( Figure 4A). Although intriguing, the potential effects of the observed substitutions encoded by GDF9, BMP15 and BMPR1B in U.S. Sheep are unknown.

Retrospective analysis of litter size in daughters of carrier rams
The potential effects of the observed GDF9, BMP15 and BMPR1B variants on reproductive phenotypes were examined by analyzing lambing records from daughters of the rams sequenced in this where two leucine residues are commonly present; ni, a fourth protein variant was not identified for BMPR1B; nr, not in refseq_protein database and thus residues were determined by analyzing WGS data; dash, not enough sequence similarity for comparison or missing polypeptide region; nm, did not match a refseq_protein in the database for that species.  Figure 4. The distinctive missense variant or reference isoform is indicated in parentheses. b GDF9 protein "variant 4" contains the M371 amino acid previously associated with litter size in Finnish landrace sheep 11,26,37 . c BMP15 protein "variants 3 and 4" contain the previously unreported P252, and Q67 residues, respectively. d BMPR1B protein "variants 2 and 3" contain the previously unreported I64 and N345 missense variants, respectively. e Hyphen indicates the variant was not detected in that group.
project. There were no database records for daughters of the five Finnsheep rams carrying the GDF9 allele encoding the M371 residue (i.e., "Variant 4", Table S2). There were, however, records for 403 daughters sired by eight rams with at least one of the four BMP15 or BMPR1B variants. Together, the eight rams sired 480 lambs in various flocks in seven years, although not all variant genotypes were frequent in these rams (Table S3-Table S6). Analyses of these data did not reveal a significant correlation between litter size and any of the four BMP15 or BMPR1B variants (95% confidence interval). However, this simple test for association lacked power, and could only detect litter size effects. It remains possible that a well-designed, prospective genetic study may detect biologically and economically relevant differences associated with these variants of highly-conserved residues in developmentally important genes.

Discussion
We created a searchable and publicly viewable online genomics resource consisting of 96 individuals representing a broad cross section of U.S. sheep breeds, and demonstrated its use for identifying protein variants. The DNA for these 96 rams, together with their 95 tetrad families, is also available for confirming segregation alleles identified in the WGS 16 . A minimum of 40 GB of short read, paired-end DNA sequence data provided at least 11-fold mapped genome coverage for each animal. The aligned sequences were made available for downloading or viewing online with a customized IGV visualization software that supports accurate manual assessment of gene-specific genetic variation. The average coverage of the sheep diversity panel was 16.8-fold and resulted in an average genotype accuracy of approximately 99.5%. These numbers were consistent with previous results obtained with 96 beef bulls 9 . This online resource provides the ability to readily inspect gene variants reported in one breed, evaluate them in other breeds, and search for any additional variants that may affect protein structure. The ability to identify the full range of protein variants in a population is critical for designing studies intended to test a candidate gene's influence on a trait.
The web-based platform worked well for analyzing three ovine genes with previously documented missense variants affecting ovulation rate and litter size. In a matter of hours, each gene was evaluated for any obvious coding variants, scored in the group of 96 rams, and compared to the previously known variants. Of the 14 known causative variants affecting litter size in sheep, only one was observed in the 96 U.S. rams, and only in the Finnsheep breed (GDF9 V371M). This is consistent with reports that the highly prolific Finnish Landrace sheep are thought to be the source of the V371M variant 11,26 . With regards to U.S. Finnsheep, the frequency of the GDF9 V371M variant was 0.25%, with five of the 10 rams having zero copies of the V371M variant. Since ewes homozygous for the M317 variant are known to be fertile, there is a good opportunity for breeders to modulate the frequency of the GDF9 V371M variant in their purebred Finnsheep flocks, and thereby attain a more optimal litter size for their ewes.
The WGS analysis also revealed four previously unreported missense variants: BMP15 R67Q and L252P; BMPR1B M64I and T345N. Although our preliminary tests for association between variants and litter size did not detect a significant difference, the evidence for dismissing these candidates is not compelling due to the limited number of sires with the variant allele. In spite of having no direct evidence of phenotypic effects associated with these alleles, analysis of the evolutionary conservation of residues at variant sites, their locations within the protein domains, and the effects on ovulation in other species has provided some insight. For example, the BMP15 R67Q variant found in Dorset was the least conserved, and predicted to be the least likely to affect function among placental mammals. Since the Q67 residue is present in several other Eutheria, and is not part of the mature BMP15 peptide ligand, its occurrence would seem to be a functional evolutionary option ( Figure 5). In humans, the equivalent variant (R68Q) was reported in the 1000 Genomes Project with no apparent disease effect noted (rs782187019) 27 . However, a tryptophan (W) substitution at this same position in humans causes premature ovarian failure and primary ovarian insufficiency (i.e., R68W) 28 . Thus, some substitutions at this position may cause loss of function in some mammals, but it appears as though Q67 may not be one of them.
Unlike the R67Q variant, the L252P variant encoded by BMP15 was not observed in any other vertebrate species and was strictly conserved throughout the Laurasiatheria species. The P252 residue does not appear in the mature BMP15 peptide, however, it is plausible that the non-conservative substitution of P252 for L252 could interfere with post-translational processing of the mature peptide.
In primate species, M253 is the equivalent residue to ovine position L252P, and healthy human individuals represented in the 1000 Genomes Project have rare heterozygous substitutions of V253 and T253 with no pathology reported. Because alleles with the P252 residue were present at a high frequency (1.0 in four White Doper), it's unlikely that the homozygous state causes sterility in ewes. However, the possibility remains that P252 residue may decrease function, and that two copies of a slightly less functional BMP15 may increase the ovulation rate and litter size.
In contrast to the numerous missense variants encoded by the ovine GDF9 and BMP15 genes, there has been only one missense variant identified in the receptor gene, BMPR1B (Q249R The second BMPR1B variant, T345N, is located inside the cell between two closely spaced active site domains and was present in three of ten Romanov rams (including a homozygote), and one of ten Finnsheep rams. The T345 residue is conserved throughout Tetrapoda species and N345 was not found in any Vertebrata species. A search for human variants in the 1000 Genomes Project revealed only a rare S345 substitution with no pathology reported. Based on the location of the T345 variant near the active site, its strict evolutionary conservation in vertebrates, and that it was found in the two most prolific U.S. breeds, we hypothesize that the N345 residue diminishes the function of the BMPR1B receptor and may influence ovulation and litter size. The BMPR1B T345N variant thus represents a high-priority candidate allele for validation studies in these breeds. If any of these newly discovered variants are confirmed to be associated with litter size, DNA-based tests for them could be incorporated into existing genetic testing platforms and used to select for important traits and manage production. Since the number of lambs produced per ewe per year is of fundamental economic importance to sheep production regardless of the production system, these types of DNA tests would be helpful for producers interested in modulating lamb production to match available resources and maintain long-term sustainability.

Conclusion
In summary, the WGS resources described here are suitable for use in identifying and decoding gene variants in the vast majority of U.S. sheep. When applied to GDF9, BMP15 and BMPR1B genes, the findings suggest there may be variants circulating in the U.S. that could be further evaluated for potential use to increase litter size in U.S. breeds. These resources, including the web interface, underlying sequence data, and the associated information are available to researchers, companies, veterinarians, and producers for use without restriction.
The data have also been deposited with links to BioProject accession number PRJNA324837 in the NCBI Bio-Project database.
In addition, access to the aligned sequences is available via USDA internet site: http://www.ars.usda.gov/Services/Docs. htm?docid=25585. Download access to the BAM files is available at the Intrepid Bioinformatics site: http://server1.intrepidbio.com/FeatureBrowser/customlist/ record?listid=7918711123 Lambing records for daughters of carrier rams were retrieved from the USMARC historical database, which is not accessible to the public. Table S3-Table S6 provide summary data from these records, which is adequate for the reproducibility and re-analysis purposes of this article.

Competing interests
No competing interests were disclosed.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Acknowledgements
We thank the USMARC Core Facility staff for outstanding technical assistance, and J. Watts for secretarial support. We thank Dr. K. Leymaster for assistance in developing the sheep diversity panel and thoughtful discussions and improvements to the manuscript. This work was conducted in part using the resources of the University of Louisville's research computing group and the Cardinal Research Cluster, and we thank Mr. H. Simrall for his assistance.
Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA. The USDA is an equal opportunity provider and employer.  Click here to access the data. Table S3. Effect of sire copies of BMP15 P252 alleles ("Variant 3") on daughter litter size.

Supplementary material
Click here to access the data. Table S4. Effect of sire copies of BMP15 Q67 alleles ("Variant 4") on daughter litter size.
Click here to access the data. Table S5. Effect of sire copies of BMPR1B I64 alleles ("Variant 2") on daughter litter size.
Click here to access the data. sequence coverage was increased for this individual to ensure this result. Yet, the use of modern tools for predicting the functional effect of amino acid substitutions should have warned them that this variation (T345N) is not likely to produce a phenotype (PROVEAN score = -2.191, Neutral). In this respect, the 4 novel variations described in the fecundity genes are of minor importance. Nevertheless, the observation that none of the U.S. popular breeds carries the Booroola mutation should have an impact, as introgression of this mutation revolutionized sheep production in Spain and Israel . Despite the apparent weakness of the work in identifying important novel variants of fecundity genes, I approve this work as a valuable genomic resource. The authors are advised to control for DNA contaminants, to avoid the discrepancies described by extending the clear and accurate presentation of this work to the affiliated 1.

7.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. 29 August 2017 Referee Report doi:10.5256/f1000research.13224.r24897

Christine Couldrey
Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand The manuscript entitled "Using sheep genomes from diverse U.S. breeds to identify missense variants in genes affecting fecundity" provides an informative overview of a publicly searchable DNA sequence resource for U.S. sheep that the authors have generated. The manuscript is well written. The manuscript is largely descriptive in nature but does include a proof of principal type study to highlight the potential usefulness of this dataset. The study is appropriately designed and technically sound. Most of the conclusions drawn are adequately supported by the results. Where the authors have not identified phenotypic effects of amino acid changes they have acknowledged that this study does lack power to make definitive statements.
A few details could do with further detail/clarification/some modification: In the last paragraph on page 4 where the manuscript states "when the accuracy of the WGS genotypes exceeded 97%"…… Please explain why 97% was chosen as the threshold.
Page 5 under heading Identifying protein variants encoded by GDF9, BMP15, and BMPR1B: the methods describe haplotype phasing, however, it is unclear as to how many of the 96 sheep were able to be phased and used in analysis. It is therefore difficult to determine the validity of this method. The methods or corresponding results section should be expanded to include this information.
Page 5 Methods section under heading Statistical analysis of litter size in daughters of carrier rams: Please include information of how many rams had daughter lambing records for each breed and each of the variants identified Page 6 The authors should include some discussion/information about the inclusion of animals that had the lowest concordances with other genotyping platforms, particularly the animals with ~17X coverage and a concordance of ~97%. In some research facilities this (and some of the other animals with greater than 10X coverage and less than 99% concordance) would be treated as suspect and excluded, or further analysis undertaken -if the latter is the case, please include the further analysis. Page 7 the sentence "Alleles encoding the M371 residue…" refers to Table 2, however this residue is not referred to in Table 2. Please correct. Page 7 "We predict that variant residues in highly conserved protein domains are more likely to affect ovulation rate and littler size". While this sentence could be true, it is also possible that protein domains are highly conserved for functions other than ovulation rate and litter size. Page 12 There is insufficient data to really make the statement "Thus some substitutions at this position may cause loss of function in some mammals but it appears as though Q67 may not be one of them". Given that the phenotypes around ovarian failure in sheep are likely not well recorded given the culling of sheep relatively early in life.

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
No competing interests were disclosed.

Competing Interests:
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.