The effects of centuries of isolation shaped the genetics of Greenlanders
Greenland Genetic Database Project 2014-08 (Ref. No. 2015-16426) and Project 2021-09: Importance of High Latitudes for the Genetic Diversity
People of European descent account for much of the genetic material in genetic databases, whereas little research has taken place on the island. The genetic makeup of most of the Inuit and Europeans who live in the area has changed because of living at high latitudes.
The study was approved by the Scientific Ethics Committee in Greenland (ref. no. 2011–056978; project 2015–22; project 2014-08, ref. no. 2015-16426; and project 2021-09) and was conducted in accordance with the Declaration of Helsinki, second revision. All participants gave their consent in writing.
Sample-size-matched GWAS with UK Biobank
For the sample-size-matched GWAS with UK Biobank, we randomly sampled 5,996 people with no missingness on the phenotypes. For associations on the protein abundances, we matched the size of the random sample to the number of people in the Greenlandic cohort. The GRMs on the UK Biobank were estimated on variants with MAF > 5% and missingness below 1%, resulting in a total of 4.5 million variants. To identify independent associations, we selected the variant with the lowest P value, removed all variants within 2.5 Mb of that variant and repeated until no variant with a P value below 5 × 10⁻⁸ remained. Each association was assigned to whichever model gave the lowest P value. To avoid duplicated signals of strong associations that were found in both models, we excluded associations within ±1 Mb of an association with a lower P value.
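The distance-based selection of independent associations described above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in the study: the `Hit` record and the function name `clump_by_distance` are hypothetical, and the 2.5 Mb window and 5 × 10⁻⁸ threshold are taken from the text.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    chrom: str   # chromosome label
    pos: int     # base-pair position
    p: float     # association P value


def clump_by_distance(hits, window_bp=2_500_000, p_threshold=5e-8):
    """Greedy clumping: keep the most significant remaining variant,
    drop every variant within `window_bp` of it, and repeat."""
    # Variants above the threshold can never become lead variants,
    # so they are dropped up front.
    remaining = sorted((h for h in hits if h.p < p_threshold), key=lambda h: h.p)
    leads = []
    while remaining:
        lead = remaining[0]
        leads.append(lead)
        remaining = [
            h for h in remaining
            if h.chrom != lead.chrom or abs(h.pos - lead.pos) > window_bp
        ]
    return leads
```

The returned list is ordered by P value, so the first element is the strongest association.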
To find new variants, we kept only variants with good coverage in gnomAD (gnomADover15 > 80%, gnomADover50 < 10% and gnomAD filter = PASS or NA). Moreover, we did not allow for spanning deletions (variant call format (VCF) asterisk allele), variants with missingness greater than 20% or several variants 1 bp apart in the Greenlandic WGS data. All these filters were used to minimize the number of false-positive variants. According to the minor allele in gnomAD African populations, some variants were not evenly divided.
Height, weight, systolic blood pressure, diastolic blood pressure, and hip and waist circumference were measured, and body mass index and waist-hip ratio were calculated. All IHIT participants above 18 years, B99 participants above 35 years and a subset of B2018 participants underwent an oral glucose tolerance test, where blood samples were drawn after an overnight fast of at least 8 h, and at 30 min (only for B2018) and 2 h after receiving 75 g glucose. The measurement of haemoglobin A1c was taken at a time of 30 min and 2 h. The values of cholesterol and high-density lipoprotein cholesterol were measured. Type 2 diabetes was defined according to the World Health Organization 1999 criteria (ref. 57), and controls were defined as normal-glucose tolerant, based on the oral glucose tolerance test data.
The Olink protein data for the Greenlandic participants used for protein quantitative trait loci analysis are from ref. 31. The Olink Target 96 Inflammation and Cardiovascular II panels measure the relative levels of 186 proteins in 3,732 people. The OlinkAnalyze R package was used to bridge and normalize the two batches. Normalized protein expression values on a log2 scale were inverse-rank normalized, including normalized protein expression data below the limit of detection. Data with a quality control warning were excluded.
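Inverse-rank normalization maps each value to the standard-normal quantile of its rank. The Python sketch below shows one common variant and is not necessarily the transform used in the study: the function name, the rank offset of 0.5 and the average-rank handling of ties are all assumptions.

```python
from statistics import NormalDist


def inverse_rank_normalize(values, offset=0.5):
    """Replace each value by the standard-normal quantile of its rank.

    ASSUMPTIONS: tied values share their average rank, and ranks are
    converted to probabilities as (rank - offset) / (n - 2 * offset + 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the group
        for idx in range(i, j + 1):
            ranks[order[idx]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]
```

Missing values are not handled here and would need to be removed before ranking.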
Inuit and European admixture proportions were calculated using the software ADMIXTURE60 on a subset of variants with MAF > 5%, missingness less than 1% and LD-pruned within 1 Mb removing variants with R² > 0.8 using Plink v.1.9.0 (ref. 61).
For fine structure analysis of the Inuit ancestry, we used the neural network framework, HaploNet33, on the phased and imputed data of all 5,996 Greenlandic participants on SNPs with MAF > 5%. We did window-based haplotype clustering using a variational autoencoder. We used a window size of less than 1 billion genetic variations to create haplotype cluster likelihoods for all samples and also to infer fine-scale population structure through both ancestry estimation and principal component analysis. We performed unsupervised ancestry estimation allowing for two ancestral sources (K = 2) and ran it with several seeds to ensure that the expectation maximization algorithm of HaploNet had converged. The convergence criterion was defined as having two runs within five log-likelihood units of the best seed. The sources of ancestry were assumed to be European and Inuit.
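The convergence criterion can be checked with a few lines of Python. The sketch below assumes that "two runs" means two runs in addition to the best one; the function name and the example log-likelihoods are hypothetical.

```python
def has_converged(log_likelihoods, tolerance=5.0, n_required=2):
    """Return True if at least `n_required` runs other than the best one
    lie within `tolerance` log-likelihood units of the best run."""
    ranked = sorted(log_likelihoods, reverse=True)
    best, others = ranked[0], ranked[1:]
    n_close = sum(1 for ll in others if best - ll <= tolerance)
    return n_close >= n_required


# Hypothetical log-likelihoods from four seeds.
print(has_converged([-1000.0, -1002.5, -1004.0, -1100.0]))
```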
We used the branch length samples of the local tree from the estimated ARG to test for selection. These branch length samples were used as input to CLUES to infer allele frequency trajectories and test for selection44. To obtain empirical P values, we tested for selection for 999 additional variants matched to have a derived allele frequency within ±10% of that of each tested variant. The empirical P value was calculated from the rank of the log-likelihood relative to the total number of variants tested. The random variants were not drawn from within 5 Mb of any of the tested variants.
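A rank-based empirical P value of this kind can be sketched as follows. This is an illustration under assumptions, not the study's code: the function name is hypothetical, larger statistics are taken to indicate stronger evidence for selection, and the tested variant is counted among the variants tested, so that 999 matched variants give a smallest possible P value of 1/1,000.

```python
def empirical_p_value(tested_stat, matched_stats):
    """Share of variants (tested variant plus matched variants) whose
    statistic is at least as large as that of the tested variant."""
    n_at_least = 1 + sum(1 for s in matched_stats if s >= tested_stat)
    return n_at_least / (len(matched_stats) + 1)
```

With 999 matched variants, a tested variant that outranks all of them receives P = 0.001.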
Reference and alternative allele counts were counted using Plink v.1.9.0 (ref. 61), keeping allele order, and projected to the desired number of participants using a formula based on binomial coefficients, where k is the observed number of alternative alleles and m is the number of alleles to project to. For each site, this gives the probability that we would observe j alternative alleles in a subsample of m alleles. The probabilities were summed across sites and folded to get the SFS.
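A standard way to carry out such a projection is hypergeometric sampling without replacement, P(j | k, n, m) = C(k, j) C(n − k, m − j) / C(n, m). The Python sketch below assumes this form; the symbol n (the total number of observed alleles at a site) and both function names are introduced here for illustration and are not taken from the text.

```python
from math import comb


def project_site(k, n, m):
    """Probabilities of observing j = 0..m alternative alleles in a
    subsample of m alleles drawn without replacement from n alleles,
    k of which are alternative (requires m <= n)."""
    total = comb(n, m)
    return [comb(k, j) * comb(n - k, m - j) / total for j in range(m + 1)]


def projected_folded_sfs(sites, m):
    """Sum the per-site projections over (k, n) pairs, then fold the
    spectrum so that bin j holds minor-allele count j."""
    sfs = [0.0] * (m + 1)
    for k, n in sites:
        for j, prob in enumerate(project_site(k, n, m)):
            sfs[j] += prob
    folded = [0.0] * (m // 2 + 1)
    for j, count in enumerate(sfs):
        folded[min(j, m - j)] += count
    return folded
```

For diploid participants, m would be twice the number of participants projected to.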
To measure the number of segregating SNPs as a function of the number of participants sequenced, we projected the SFS to the desired number of participants, folded the SFS and summed across all the non-zero SFS bins. In this way, we get the number of segregating SNPs for all possible subsamples of participants from our data.
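Summing the non-zero bins of the folded, projected SFS is equivalent to summing, over sites, the probability that a subsample is neither all-reference nor all-alternative. The sketch below uses that shortcut and assumes hypergeometric subsampling; the function name and the (k, n) site representation (alternative allele count, total allele count) are hypothetical.

```python
from math import comb


def expected_segregating_sites(sites, m):
    """Expected number of sites still polymorphic in a random subsample
    of m alleles; `sites` holds one (k, n) pair per site."""
    total = 0.0
    for k, n in sites:
        denom = comb(n, m)
        p_all_ref = comb(n - k, m) / denom  # zero alternative alleles drawn
        p_all_alt = comb(k, m) / denom      # only alternative alleles drawn
        total += 1.0 - p_all_ref - p_all_alt
    return total


# Hypothetical data: curve of segregating sites against subsample size.
sites = [(1, 20), (3, 20), (10, 20)]
curve = {m: expected_segregating_sites(sites, m) for m in (2, 10, 20)}
```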
Predictors under the model were calculated using the effect size and the frequency of the effect allele. The model’s deviation was calculated using the effect size and the expected frequency. The lower and upper CIs were obtained by applying the same formula to the lower and upper CIs of the effect size. For choosing the variants with more than 1% variance explained, we used the formula PVE = β²/(β² + SE(β)² × N), where SE(β) is the standard error of β and N is the number of participants70. For binary traits, we calculated the liability-scale variance explained using the R package Mangrove (v.1.21) as previously described14.
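The PVE formula can be evaluated directly from summary statistics. In the short Python sketch below, the function name and the example numbers are hypothetical; only the formula and the 1% cut-off come from the text.

```python
def proportion_variance_explained(beta, se, n):
    """PVE = beta^2 / (beta^2 + SE(beta)^2 * N)."""
    return beta**2 / (beta**2 + se**2 * n)


# Hypothetical association: beta = 0.5, SE = 0.05, N = 1,000 participants.
pve = proportion_variance_explained(0.5, 0.05, 1000)
passes_threshold = pve > 0.01  # more than 1% variance explained
```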
As described previously72, we estimated relatedness using a filtered set of genetic variants with MAF > 5%, missingness < 5% and LD-pruned (Plink v.1.9.0 indep-pairwise 1,000 kb 1 0.8) along with the inferred admixture proportions as input to NGSremix73. Pairwise relatedness is described by k0, k1 and k2, the fractions of the genome in which a pair shares zero, one or two alleles identical by descent. Parent–offspring pairs were defined as relationships with k1 + k2 > 0.95 and k1 > 0.75, and the parent was inferred from the age of the participants. Full sibling pairs were defined as relationships with 0.3 < k1 < 0.7 and k2 > 0.1. Out of 5,828 people with age and location, we identified 1,727 parent–offspring and 1,841 full sibling relationships. Relationships were normalized to the number of possible pairs. For relationships between regions, for example, region 1 and 2, the number of possible pairs was calculated as n_possible(1,2) = n_region1 × n_region2, where n_region1 and n_region2 are the numbers of participants in region 1 and 2, respectively. Within region, the number of possible pairs of parent–offspring relationships was calculated as n_possible(1,1) = n_region1 × (n_region1 − 1). Within region, the number of possible pairs of full sibling relationships was calculated as n_possible(1,1) = n_region1 × (n_region1 − 1)/2.
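The classification thresholds and the normalization by possible pairs can be sketched in Python as follows. Function names and labels are hypothetical, and the within-region counts assume n × (n − 1) ordered pairs for parent–offspring relationships and n × (n − 1)/2 unordered pairs for full siblings.

```python
def classify_pair(k1, k2):
    """Classify a pair from its IBD-sharing fractions k1 and k2."""
    if k1 + k2 > 0.95 and k1 > 0.75:
        return "parent-offspring"
    if 0.3 < k1 < 0.7 and k2 > 0.1:
        return "full-sibling"
    return "other"


def possible_pairs(n_a, n_b=None, relationship="full-sibling"):
    """Number of possible pairs between two regions (n_a, n_b) or, if
    n_b is None, within one region of n_a participants."""
    if n_b is not None:
        return n_a * n_b
    if relationship == "parent-offspring":
        return n_a * (n_a - 1)       # ASSUMPTION: ordered pairs
    return n_a * (n_a - 1) // 2      # ASSUMPTION: unordered pairs


def normalized_rate(n_observed, n_possible):
    """Observed relationships per possible pair."""
    return n_observed / n_possible
```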
To calculate the expected frequency of homozygous carriers with the current fine structure, f_hom(structure), we estimated the regional allele frequencies, AF_region, based on sample location, calculated the expected number of homozygous carriers in each region as n_hom(region) = n_region × AF_region², calculated the total sum of homozygous carriers, n_hom = n_hom(region1) + n_hom(region2) + … + n_hom(region8), and divided by the total number of participants, f_hom(structure) = n_hom/n_total. The corresponding expected frequency of carriers was estimated assuming a panmictic population. 10,000 bootstrap samples were used to estimate the CIs of the frequencies.
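The structured expectation can be sketched in Python as below. The panmictic counterpart is not spelled out in the text, so the second function is only an assumption (the square of the sample-size-weighted mean allele frequency); all names and example numbers are hypothetical, and the bootstrap step is omitted.

```python
def f_hom_structure(regions):
    """Expected homozygote frequency given regional allele frequencies.
    `regions` is a list of (n_participants, allele_frequency) pairs."""
    n_total = sum(n for n, _ in regions)
    n_hom = sum(n * af**2 for n, af in regions)
    return n_hom / n_total


def f_hom_panmictic(regions):
    """ASSUMPTION: panmictic expectation taken as the square of the
    sample-size-weighted mean allele frequency."""
    n_total = sum(n for n, _ in regions)
    af_total = sum(n * af for n, af in regions) / n_total
    return af_total**2


# Hypothetical example: a variant absent in one region, at 20% in another.
regions = [(100, 0.0), (100, 0.2)]
structured = f_hom_structure(regions)
panmictic = f_hom_panmictic(regions)
```

Under these assumptions the structured expectation is never smaller than the panmictic one, because the mean of squared frequencies is at least the square of the mean.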
We made 10,000 branch length samples of the local tree, using the provided SampleBranchLengths script from Relate to estimate variant age. Relate does not allow for changes in the tree topology. For each sample, we took the time to the most recent common ancestor and the time to the preceding one. This yields a minimum and maximum age of the variant measured in generations for each branch length sample. From the intervals, we estimated the age of the variants and the credibility interval. We calculated a probability density in which each interval is weighted by the inverse of its length. By doing so we assumed that the age is equally likely to lie anywhere within each interval and we gave equal weight to each of the 10,000 sampled branch lengths. The age was estimated as the median of this probability density, and the credibility interval as its 2.5% and 97.5% quantiles. The age in generations was converted to years by multiplying with the assumed generation time of 28 years per generation.
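Treating every sampled interval as a uniform distribution with equal weight gives a mixture whose quantiles can be found numerically. The Python sketch below uses bisection on the mixture's cumulative distribution function; the function names and the example intervals are hypothetical, and only the quantile levels and the generation time of 28 years come from the text.

```python
def mixture_cdf(x, intervals):
    """CDF at x of an equal-weight mixture of uniform distributions,
    one per (minimum age, maximum age) interval."""
    total = 0.0
    for low, high in intervals:
        if x >= high:
            total += 1.0
        elif x > low:
            total += (x - low) / (high - low)
    return total / len(intervals)


def mixture_quantile(q, intervals, n_iter=100):
    """Smallest x with CDF(x) >= q, found by bisection."""
    low = min(a for a, _ in intervals)
    high = max(b for _, b in intervals)
    for _ in range(n_iter):
        mid = (low + high) / 2
        if mixture_cdf(mid, intervals) < q:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def variant_age_years(intervals, generation_time=28):
    """Lower bound, median and upper bound of the age in years."""
    return tuple(
        generation_time * mixture_quantile(q, intervals)
        for q in (0.025, 0.5, 0.975)
    )


# Hypothetical intervals (in generations) from two branch length samples.
lower, median, upper = variant_age_years([(0.0, 10.0), (10.0, 20.0)])
```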