Since I began blogging here in February we’ve come a long way in getting a better sense of South Asian genetic relationships. By “we,” I’m referring mostly to Zack Ajmal of the Harappa Ancestry Project, and to a lesser extent the Dodecad Ancestry Project and the Eurogenes Genetic Ancestry Project. These explicitly amateur enterprises have taken off-the-shelf population genetic analytic tools, such as ADMIXTURE, and combined them with a “crowd-sourced” sampling strategy. Zack now has over 100 individuals, the vast majority of them South Asian, some from ethnicities and communities which have never been analyzed in the academic literature.
The Times of India has now taken an interest in the Harappa Ancestry Project. I’m rather tickled by this. When I first began corresponding with Zack about the technical details of performing this survey of South Asian genomics, neither of us knew where we were going to go. The main issue we both felt needed to be addressed was the scope of sampling. In other words, there were simply too many under-sampled populations in South Asia when it came to academic analyses of the human genetics of the region.
A quick survey of a map of some participants in HAP shows that much of north-central India remains woefully under-sampled even after six months:
But these are still early days. So what have we found out? I outlined some of the tentative conclusions a month ago. As time passes I am coming more and more to the conclusion that the primary connection that we South Asians have with the peoples of western Eurasia is through an affinity with the broad swath of peoples between Europe and the Middle East. The plot to the left shows the reason behind my assertion. It is a two-dimensional representation of genetic distances between putative clusters. I have given concrete examples of populations which are close substitutes for these ideal types (which were constructed by disaggregating the ancestry of real populations). I have not given a population example for “Ancestral North Indians” because no real population exists as a good proxy.
As some of you know, modern South Asians can be viewed to a large, but not exclusive, extent as a two-way combination between a West Eurasian affiliated group, “Ancestral North Indians” (ANI), and another population termed “Ancestral South Indians” (ASI). The ASI are closer to East Eurasians than to West Eurasians, but their relationship to East Asians is nevertheless much more distant than that of the ANI to other West Eurasian groups. In fact, ANI can be substituted for other West Eurasian groups without too much disruption on a world-wide scale when it comes to representing the relationships between human populations. In contrast, the closest living population to the ASI are the tribes of the Andaman Islands, who are tens of thousands of years distant from the ASI of the mainland.
There are no pure ASI present today. The South Indian adivasi is 30-40% ANI and 60-70% ASI. The Pathan is 80% ANI and 20% ASI. Most people in South Asia fall somewhere along this range. But even this is too simple.
More detailed analyses often tend to suggest that there have been multiple intrusions into South Asia since the original ANI-ASI admixture event, whether it be the Southwest Asian affinities of the peoples of northwest and western India, or the clear East Asian connections of Munda tribes in northeast India. I don’t want to get into those details in this post. For more interpretation I invite you to peruse some of Thorfinn’s posts at Brown Pundits (for those of you who care, Thorfinn is a pseudonym. His family is from Gujarat Punjab and Uttar Pradesh).
What I want to do is get back to the title of the post: how do I show you some raw results in a gestalt fashion? By this, I want readers of this weblog to immediately see some general relationships and be able to place their own community into a broader South Asian and international genetic context.
Here is my crack at that task. All the raw results are from Zack’s K = 11 Reference 3 run. In plain English, this means that Zack took his huge population data set (which runs into the thousands) and told the algorithm to evaluate all the genetic variation and allocate ancestral quanta to individuals based on the 11 informative population clusters which fell out of the data. For a concrete example of how this works: if you have a population of Swedes, Nigerians, and African Americans, and set K = 2, then the Swedes would all be at 100% for one cluster, the Nigerians at 100% for another cluster, while the aggregate African American population would shake out to be at 80% in the cluster which is fixed in the Nigerians and 20% in the one fixed in the Swedes. If the African American population is representative then 10% of the individuals would have greater than 50% ancestry from the cluster which is at 100% in the Swedes.
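To make the K = 2 example concrete, here is a minimal sketch in Python. The ancestry fractions are invented for illustration (they are not output from ADMIXTURE or from Zack’s data), and the function names are hypothetical. It only shows how per-individual cluster fractions get summarized into the population-level numbers quoted above, not how the fractions are estimated in the first place.

```python
# Toy illustration of reading K = 2 results; all numbers are invented.
# Each individual is a tuple of fractions (cluster_a, cluster_b) summing to 1.

def mean_fractions(individuals):
    """Average each cluster's fraction across a list of individuals."""
    k = len(individuals[0])
    n = len(individuals)
    return [sum(ind[i] for ind in individuals) / n for i in range(k)]

def share_above(individuals, cluster, threshold=0.5):
    """Fraction of individuals whose ancestry in `cluster` exceeds `threshold`."""
    hits = sum(1 for ind in individuals if ind[cluster] > threshold)
    return hits / len(individuals)

# Hypothetical admixed sample of ten people: (cluster_a, cluster_b)
admixed = [
    (0.05, 0.95), (0.10, 0.90), (0.10, 0.90), (0.15, 0.85), (0.15, 0.85),
    (0.20, 0.80), (0.20, 0.80), (0.25, 0.75), (0.25, 0.75), (0.55, 0.45),
]

print(mean_fractions(admixed))   # aggregate fractions for the population
print(share_above(admixed, 0))   # share of people above 50% in cluster_a
```

In this made-up sample the aggregate comes out at roughly 20% and 80%, and one person in ten is above 50% in the first cluster, which mirrors the shape of the example in the text.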
Observe that I have not named the clusters. That’s because the clusters are statistical artifacts which fall out of the patterns of variation in the data. They map onto reality, but they are not reality. After the fact it seems reasonable to label one cluster “European” and the other cluster “African,” but always remember that these labels are aids to your interpretation; the algorithm itself is only sorting the variation across the data set. In other words, when looking at complex plots at higher K’s, focus on the relationships between the populations and individuals, and not on the absolute values of ancestral quanta and labels.
All this matters because at K = 11 Zack gave the clusters plausible labels, which usually correspond to high proportions in certain populations and regions. To make it more South Asian focused I removed the non-South Asian groups except Iranians in his reference set. The reference set consists of many individuals in each population (e.g., 10 Russians for the Russian population). Additionally, many of the population clusters which are defined for Africans, Oceanians, Amerindians, and more specific East Asian populations are not relevant for South Asians. So I amalgamated the African groups into one cluster, and all the peripheral East Eurasian, Oceanian, and Amerindian ones into another. I left the primary East Asian cluster, which defines China to Southeast Asia, disaggregated, since it is somewhat informative in South Asia (e.g., both my parents are ~10% East Asian).
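The amalgamation step amounts to summing columns. Below is a rough sketch in Python, under the assumption that each individual’s result is a mapping from cluster name to fraction. The cluster names in `mapping` and the values in `person` are placeholders for illustration, not Zack’s actual K = 11 labels or anyone’s real results, and `amalgamate` is a hypothetical helper.

```python
# Sketch of collapsing a fine-grained ancestry profile into fewer clusters.
# Cluster names and fractions are placeholders, not the actual K = 11 output.

def amalgamate(profile, mapping):
    """Sum the fractions of original clusters into merged clusters.

    profile: dict of original cluster name -> fraction
    mapping: dict of original cluster name -> merged cluster name;
             clusters missing from the mapping keep their own name.
    """
    merged = {}
    for cluster, fraction in profile.items():
        target = mapping.get(cluster, cluster)
        merged[target] = merged.get(target, 0.0) + fraction
    return merged

mapping = {
    "African 1": "African", "African 2": "African",
    "Siberian": "East Eurasian", "Amerindian": "East Eurasian",
    "Oceanian": "East Eurasian",
}

# Hypothetical individual; fractions sum to 1.
person = {"S Asian": 0.5, "European": 0.25, "E Asian": 0.125,
          "Siberian": 0.0625, "Oceanian": 0.0625}

merged = amalgamate(person, mapping)
print(merged)
```

Because the merged value is a plain sum, each profile still adds up to 100% afterwards; the only thing lost is the distinction between the clusters that were folded together.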
With all that done the clusters which remain are:
- S Asian, common among South Asian populations, though found at lower proportions in West Asia and Southeast Asia
- Onge, an Andaman Island tribe
- E Asian, the cluster which defines Han Chinese, Japanese, and Southeast Asians
- SW Asian, a Middle East focused component, found elsewhere at proportions that fall off with distance from the Middle East
- European, peaks in northeast Europe, on the Baltic
- East Eurasian, which is an aggregate of Oceanians, Siberians, and Amerindians. A “catchall” which is really noisy in South Asians
Understand that the genetic distance between these components is not the same. The European-SW Asian distance value is rather small. The East Eurasian category throws in some rather distantly related groups, but none are that important in South Asians.
So I generated a bar plot of these clusters for the reference populations. But Zack also has results for nearly 100 individuals of South Asian origin. I removed those of mixed background, kept Iranians as an outgroup, and combined their identities somewhat. So Bengali Brahmins are clustered with other Bengalis, but I note they are Brahmin. Many of the “Unspecified” people below don’t fall into a clear category, but it isn’t as if they’re totally generic. My parents and I are the three Bengalis with a lot of East Asian without specification, but I know I have Bengali Brahmin (paternal grandmother), Kayastha (maternal grandfather), and Middle Eastern ancestry (maternal grandmother) within memory (i.e., these origins are preserved orally or textually). Additionally, my paternal lineage has several individuals who carry the honorific Khan, though that’s not straightforwardly mapped onto any caste term.
All the reference populations, which are averages of many individuals of a given group, are at the top of the bar plot and in all caps. All the rest of the bars represent individuals from HAP, sorted by ethnic/regional label, and then by specific community/caste identity if possible.
If you want the raw spreadsheet, I put the OpenOffice file online.
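The row ordering just described can be sketched in a few lines of Python. This is only an illustration of the layout logic: the labels and fractions are placeholders, `order_rows` is a hypothetical helper, and sorting the reference populations alphabetically is my assumption rather than something fixed by the plot.

```python
# Sketch of ordering bar-plot rows: reference averages first (all caps),
# then individuals sorted by ethnic/regional label and then community.
# All labels and fractions are placeholders for illustration.

def order_rows(reference_means, individuals):
    """Return a list of (row_label, profile) in plotting order.

    reference_means: dict of population name -> averaged profile
    individuals: list of (ethnic_label, community, profile) tuples
    """
    # Assumption: reference populations are listed alphabetically.
    rows = [(name.upper(), profile)
            for name, profile in sorted(reference_means.items())]
    for ethnic, community, profile in sorted(
            individuals, key=lambda r: (r[0], r[1])):
        rows.append((f"{ethnic} {community}", profile))
    return rows

reference_means = {
    "Pathan": {"S Asian": 0.5, "European": 0.5},
    "Iranians": {"SW Asian": 0.75, "European": 0.25},
}
individuals = [
    ("Gujarati", "Patel", {"S Asian": 1.0}),
    ("Bengali", "Unspecified", {"S Asian": 0.75, "E Asian": 0.25}),
    ("Bengali", "Brahmin", {"S Asian": 1.0}),
]

labels = [label for label, _ in order_rows(reference_means, individuals)]
print(labels)
```

The resulting list of labels and profiles could then be handed to any plotting tool to draw one stacked bar per row.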
Notes: The Gujarati population was separated tentatively because it seems as if there are two clusters. One of them is likely affiliated with the Patel community, but the other is a grab-bag of various groups. Most of these individuals are unrelated, but I am in the list along with my parents, so don’t overweight the East Asian component in Bengalis on account of that (though my parents are unrelated, and have the same quantum of East Asian).