South Asian genetic variation in a glance

Since I began blogging here in February we’ve come a long way in getting a better sense of South Asian genetic relationships. By “we,” I’m referring mostly to Zack Ajmal of the Harappa Ancestry Project, and to a lesser extent the Dodecad Ancestry Project and the Eurogenes Genetic Ancestry Project. These explicitly amateur enterprises have taken off-the-shelf population genetic analytic tools, such as ADMIXTURE, and combined them with a “crowd-sourced” sampling strategy. Zack now has over 100 individuals, the vast majority of them South Asian, some from ethnicities and communities which have never been analyzed in the academic literature.

The Times of India has now taken an interest in the Harappa Ancestry Project. I’m rather tickled by this. When I first began corresponding with Zack about the technical details of performing this survey of South Asian genomics, neither of us knew where we were going to go. The main issue we both felt needed to be addressed was the scope of sampling. In other words, there were simply too many under-sampled populations in South Asia when it came to academic analyses of the human genetics of the region.

A quick survey of a map of some participants in HAP shows that much of north-central India remains woefully under-sampled even after six months:


[Map: Harappa Ancestry Project participants]

[Figure: ANI+4.jpg]

But these are still early days. So what have we found out? I outlined some of the tentative conclusions a month ago. As time passes I am coming more and more to the conclusion that the primary connection that we South Asians have with the peoples of western Eurasia is through an affinity with the broad swath of peoples between Europe and the Middle East. The plot to the left shows the reason behind my assertion. It is a two-dimensional representation of genetic distances between putative clusters. I have given concrete examples of populations which are close substitutes for these ideal types (which were constructed by disaggregating the ancestry of real populations). I have not given a population example for “Ancestral North Indians” because no real population exists as a good proxy.

As some of you know, modern South Asians can be viewed to a large, but not exclusive, extent as a two-way combination between a West Eurasian affiliated group, “Ancestral North Indians” (ANI), and another population termed “Ancestral South Indians” (ASI). The ASI are closer to East Eurasians than West Eurasians, but the ASI relationship to East Asians is much more distant than that of ANI to other West Eurasian groups. In fact, ANI can be substituted for other West Eurasian groups without too much disruption on a world-wide scale when it comes to representing the relationships between human populations. In contrast, the closest living population to the ASI are the tribes of the Andaman Islands, who are tens of thousands of years distant from the ASI of the mainland. There are no pure ASI present today. The South Indian adivasi is 30-40% ANI, and 70-60% ASI. The Pathan is 80% ANI and 20% ASI. Most people in South Asia span the gamut.

But even this is too simple. More detailed analyses often tend to suggest that there have been multiple intrusions into South Asia since the original ANI-ASI admixture event, whether it be the Southwest Asian affinities of the peoples of northwest and western India, or the clear East Asian connections of Munda tribes in northeast India. I don’t want to get into those details in this post. For more interpretation I invite you to peruse some of Thorfinn’s posts at Brown Pundits (for those of you who care, Thorfinn is a pseudonym. His family is from Gujarat, Punjab, and Uttar Pradesh).

What I want to do is get back to the title of the post: how do I show you some raw results in a gestalt fashion? By this, I want readers of this weblog to immediately see some general relationships and be able to place their own community into a broader South Asian and international genetic context.

Here is my crack at that task. All the raw results are from Zack’s K = 11 Reference 3 run. In plain English, this means that Zack took his huge population data set (which runs into the thousands) and told the algorithm to evaluate all the genetic variation and allocate ancestral quanta to individuals based on the 11 informative population clusters which fell out of the data. If you want a concrete example of how this works: if you have a population of Swedes, Nigerians, and African Americans, and set K = 2, then the Swedes would all be at 100% for one cluster, the Nigerians at 100% for another cluster, while the aggregate African American population would shake out to be at 80% in the cluster which is fixed in the Nigerians and 20% in the one fixed in the Swedes. If the African American population is representative then 10% of the individuals would have greater than 50% ancestry from the cluster which is at 100% in the Swedes.
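To make the K = 2 example concrete, here is a minimal sketch in Python. It does not run ADMIXTURE or infer anything from genotypes; it simply simulates the kind of per-individual output described above. The function names, the group sizes, and the Beta(1, 4) distribution used for the admixed group are all assumptions for illustration. The Beta(1, 4) choice gives a 20% average, but the share of individuals above 50% depends entirely on the spread you assume, so it is not calibrated to the 10% figure in the text.

```python
import random

def simulate_k2(n_per_group=10_000, seed=0):
    """Toy K = 2 output: each value is one individual's share in 'cluster A',
    the cluster that is fixed at 100% in the Swedes."""
    rng = random.Random(seed)
    swedes = [1.0] * n_per_group       # all cluster A
    nigerians = [0.0] * n_per_group    # all cluster B
    # Assumption: admixed individuals scatter around a 20% mean.
    # Beta(1, 4) has mean 1 / (1 + 4) = 0.2.
    african_americans = [rng.betavariate(1, 4) for _ in range(n_per_group)]
    return {
        "Swedes": swedes,
        "Nigerians": nigerians,
        "African Americans": african_americans,
    }

def summarize(fractions):
    """Return (mean cluster A share, share of individuals above 50% cluster A)."""
    mean_a = sum(fractions) / len(fractions)
    share_majority_a = sum(f > 0.5 for f in fractions) / len(fractions)
    return mean_a, share_majority_a

groups = simulate_k2()
for name, fractions in groups.items():
    mean_a, share = summarize(fractions)
    print(f"{name}: mean cluster A = {mean_a:.2f}, "
          f"individuals with >50% cluster A = {share:.1%}")
```

The point of the sketch is the distinction between the group average and the individual values: the aggregate can sit at 20% while individuals range widely around it.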

Observe that I have not named the clusters. That’s because the clusters are statistical artifacts which fall out of the patterns of variation in the data. They map onto reality, but they are not reality. After the fact it seems reasonable to label one cluster “European” and the other cluster “African,” but always remember that these labels are aids to your interpretation; the algorithm itself is just sorting the variation across the data set. In other words, when looking at complex plots at higher K’s, focus on the relationships between the populations and individuals, and not on the absolute values of ancestral quanta and labels.

All this matters because at K = 11 Zack gave the clusters plausible labels, which usually correspond to high proportions in certain populations and regions. To make it more South Asian focused I removed the non-South Asian groups except Iranians in his reference set. The reference set consists of many individuals in each population (e.g., 10 Russians for the Russian population). Additionally, many of the population clusters which are defined for Africans, Oceanians, Amerindians, and more specific East Asian populations are not relevant for South Asians. So I amalgamated the African groups into one cluster, and all the peripheral East Eurasian, Oceanian, and Amerindian ones into another. I left the primary East Asian cluster, which defines China to Southeast Asia, disaggregated, since it is somewhat informative in South Asia (e.g., both my parents are ~10% East Asian).

With all that done the clusters which remain are:

  • S Asian, common among South Asian populations, though found at lower proportions in West Asia and Southeast Asia
  • Onge, an Andaman Island tribe
  • E Asian, the cluster which defines Han Chinese, Japanese, and Southeast Asians
  • SW Asian, a Middle East focused component, but found elsewhere in proportion to distance
  • European, peaks in northeast Europe, on the Baltic
  • African
  • East Eurasian, which is an aggregate of Oceanians, Siberians, Amerindians. A “catchall” which is really noisy in South Asians

Understand that the genetic distance between these components is not the same. The European-SW Asian distance value is rather small. The East Eurasian category throws in some rather distantly related groups, but none are that important in South Asians.
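The amalgamation step described above amounts to summing columns. Here is a rough sketch, assuming a per-individual table of K = 11 proportions. The eleven input labels in `MERGE_MAP` are my guess at plausible names, since only the seven merged clusters are listed here, and the example percentages are invented.

```python
# Hypothetical K = 11 labels mapped onto the seven clusters listed above.
# Only the seven target names come from the post; the source labels are assumed.
MERGE_MAP = {
    "S Asian": "S Asian",
    "Onge": "Onge",
    "E Asian": "E Asian",
    "SW Asian": "SW Asian",
    "European": "European",
    "W African": "African",
    "E African": "African",
    "San/Pygmy": "African",
    "Siberian": "East Eurasian",
    "Papuan": "East Eurasian",
    "American": "East Eurasian",
}

def merge_clusters(proportions, merge_map=MERGE_MAP):
    """Sum the proportions of every source cluster into its merged cluster."""
    merged = {}
    for label, value in proportions.items():
        target = merge_map[label]
        merged[target] = merged.get(target, 0.0) + value
    return merged

# Invented percentages for one individual, for illustration only.
example_individual = {
    "S Asian": 50, "Onge": 12, "E Asian": 2, "SW Asian": 14, "European": 17,
    "W African": 0, "E African": 0, "San/Pygmy": 0,
    "Siberian": 3, "Papuan": 1, "American": 1,
}

print(merge_clusters(example_individual))
```

Merging this way keeps each individual's total unchanged, which is why the merged bars still add up. It says nothing about how genetically close the merged clusters are to one another.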

So I generated a bar plot with these clusters with the reference populations. But Zack also has results for nearly 100 individuals of South Asian origin. I removed those of mixed background, kept Iranians as an outgroup, and combined their identities somewhat. So Bengali Brahmins are clustered with other Bengalis, but I note they are Brahmin. Many of the “Unspecified” people below don’t fall into a clear category, but it isn’t as if they’re totally generic. My parents and I are the three Bengalis with a lot of East Asian without specification, but I know I have Bengali Brahmin (paternal grandmother), Kayastha (maternal grandfather), and Middle Eastern ancestry (maternal grandmother) within memory (i.e., these origins are preserved orally or textually). Additionally, my paternal lineage has several individuals who carry the honorific Khan, though that’s not straightforwardly mapped onto any caste term.

All the reference populations, which are averages of many individuals of a given group, are at the top of the bar plot and in all caps. All the rest of the bars represent individuals from HAP, sorted by ethnic/regional labels, and then by specific community/caste identity if possible.
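The row layout just described can be sketched as follows. This is only an illustration of the ordering logic, not how the plot was actually produced; the function names, the row structure, and all the numbers are invented for the example.

```python
from statistics import mean

def reference_average(name, individuals):
    """Average each cluster over the individuals of one reference population.
    The label is upper-cased to mark the row as a population average."""
    clusters = individuals[0].keys()
    averaged = {c: mean(ind[c] for ind in individuals) for c in clusters}
    return {"label": name.upper(), "community": "", "proportions": averaged}

def plot_order(reference_rows, hap_rows):
    """Reference averages first, then individuals sorted by label and community."""
    hap_sorted = sorted(hap_rows, key=lambda r: (r["label"], r["community"]))
    return reference_rows + hap_sorted

# Invented numbers for illustration only.
references = [
    reference_average("Iranian", [
        {"S Asian": 10, "SW Asian": 50},
        {"S Asian": 20, "SW Asian": 40},
    ]),
]
hap_individuals = [
    {"label": "Punjabi", "community": "Jatt", "proportions": {"S Asian": 49, "SW Asian": 13}},
    {"label": "Bengali", "community": "Brahmin", "proportions": {"S Asian": 55, "SW Asian": 8}},
    {"label": "Bengali", "community": "", "proportions": {"S Asian": 50, "SW Asian": 6}},
]

for row in plot_order(references, hap_individuals):
    print(row["label"], row["community"], row["proportions"])
```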

If you want the raw spreadsheet, I put the open office file online.


[Figure: hap2.jpg, bar plot of cluster proportions for the reference populations and HAP individuals]


Notes: The Gujarati population was separated tentatively because it seems as if there are two clusters. One of them is likely affiliated with the Patel community, but the other is a grab-bag of various groups. Most of these individuals are unrelated, but I am in the list along with my parents, so don’t overweight the East Asian component in Bengalis on account of that (though my parents are unrelated, and have the same quantum of East Asian).

33 thoughts on “South Asian genetic variation in a glance”

  1. don’t have time to go back…but i’m pretty sure that some of the andhra individuals are reddys. no idea why that isn’t showing up, might have been an error in importing the data and moving around formats. the reddys are characterized by a really high skew toward “sw asian” vs. “european” to the right of the bar plot.

  2. Kerala has rows for Brahmin, Muslim, and Christian, all 3 are minorities (albeit big ones). Why is there no data for Hindu-origin non-Brahmin Kerala people? Just curious!

  3. The stats for Punjab make sense (that they have a lot of European ancestry). Very often, you can mistake Punjabi origin folks to be white folks (the white genes are still strong in them).

  4. “Why is there no data for Hindu-origin non-Brahmin Kerala people? Just curious!”

    no one submitted yet 🙁 we want as many different types. though the muslim and christian are probably from non-brahmin backgrounds.

  5. The earliest Christians in Kerala claim to be from Nambudiri Brahmin families and from Jews. Later conversions were probably from all classes of Hindus. Kerala, being the spice source of the ancient world, attracted people from across the world on a regular basis, and some of them stayed, from places like Rome, Syria, Persia, the Middle East… So it’s complicated.

  6. “So it’s complicated.”

    if we had large sample sizes it would be trivially obvious to apportion quantities. e.g., look at the bene israel and cochin jews.

    • Although I mention a number of foreign countries in my previous post, my guess is that their influence on the genetic makeup is trivial. Mostly Keralites are some kind of Indian. (Sorry but I’m not up on the technical lingo in this field)

  7. “Mostly Keralites are some kind of Indian.”

    two points.

    1) there are tribal groups from kerala in the reference sets, just so people know. right now the main problem with a lot of these data sets is that we have a lot of tribal and low caste groups, and thousands of tamil brahmins (OK, i’m exaggerating, but tamil brahmins are really into genotyping 🙂)

    2) the stuff i’ve seen does indicate elevated southwest asian influences on the coast of western india. i am not totally skeptical of the possibility that a lot of nasranis will have moderate elevation of “southwest asian” ancestry anymore cuz of a few things i’ve seen. i doubt many of them are descended from brahmins though (this never made demographic sense, there aren’t enough brahmins). all these questions will be answered in the near future.

  8. Here’s a novel idea – why don’t we talk about how similar we brown people are rather than harping on the differences all the time?

    To them, we are all the same.

    • The demands of modern politics: Some demographic groups seek the status of being marginalized. The rest of the world owes them something.

  9. “tamil brahmins are really into genotyping”

    Ha ha. Tam Brams are insanely proud of being Tam Bram

    “all these questions will be answered in the near future.”

    I am Malayali Christian and this is fascinating stuff. Thanks to Razib and to the others working on this project. Can’t wait.

  10. there’s another kerala syrian christian in the next run. N = 2 is not great, but when you start from zero there’s huge marginal returns.

  11. I wonder what the results of this experiment would have been if (1) we used 4 informative population clusters all found in South Asia: Mundas, Eastern Indians [Mizoram or Nagaland area], some Tamil Dalit, and a Pathan, and (2) we set K=2.

    I suspect that the many populations of Desis would appear even more similar.

  12. It would be interesting to see the results of a Malayali who is Knanaya Christian! The Knanaya community’s endogamous practices are based on the idea that we migrated from the Assyrian town of Edessa, Mesopotamia and have a Jewish background. There has been a significant amount of criticism in the purity of the genetic pool of the community considering the sole reason the community continues to enforce endogamous practices is to preserve this lineage.

    • “It would be interesting to see the results of a Malayali who is Knanaya Christian! The Knanaya community’s endogamous practices are based on the idea that we migrated from the Assyrian town of Edessa, Mesopotamia and have a Jewish background. There has been a significant amount of criticism in the purity of the genetic pool of the community considering the sole reason the community continues to enforce endogamous practices is to preserve this lineage.”

      All Malayalis are really Arabic, Romans, Jewish Scythians. They take offense at being Dravidian.

  13. “we used 4 informative population clusters all found in South Asia: Mundas, Eastern Indians [Mizoram or Nagaland area], some Tamil Dalit, and a Pathan, and (2) we set K=2.”

    zack ajmal could do that. he has all the populations.

    “It would be interesting to see the results of a Malayali who is Knanaya Christian! The Knanaya community’s endogamous practices are based on the idea that we migrated from the Assyrian town of Edessa, Mesopotamia and have a Jewish background. There has been a significant amount of criticism in the purity of the genetic pool of the community considering the sole reason the community continues to enforce endogamous practices is to preserve this lineage.”

    N = 1 would answer that question. there are LOTS of mizrachi jewish data sets in the public domain (i have some on my computer). note that both the bene israel and cochin jews clearly exhibit signatures of being an admixed population with middle eastern jews.

  14. It’s quite telling that the newest Syrian Christian participant is more South-West Asian than European. As for the unspecified Telugus, most of them are Reddy (I’ve had some exchanges with them on the 23andMe forums). The same applies for the Gujaratis, wherein I think all three of them are Patels (once again, going by my exchanges with them in lieu of me informing them about the project). And yeah, TamBrahms are the most well represented ethnicity on HAP, never expected so many of them to have genotyped themselves.

  15. Also, the Sourasthrian is Gujarati whereas the Thathai Bhatia is Sindhim and likely Rajput.

  16. AV, there’s a Nair in the next round! booh-yah! going by the two syrian xtians my own hypothesis is that both brahmins and syrian xtians intermixed with the non-dalit populations of south india. that can explain the “sw asian” percentage elevation in south indian brahmins vs. north indian ones proportionally, as well as the enrichment of that component among syrian xtians, since they started from a high baseline.

  17. So the main difference between Iranians and many various India groups is the presence of Onge in varying degrees in the Indian groups. What stands out to me is that there isn’t a single Indian group that does not have a significant Onge presence. And it is the absence of Onge that is differentiating Indian groups from Kalash, Brahui, and to a small extent Pathan (I bet this sample is a desi Pathan and not Pathan from NWFP)

  18. “And it is the absence of Onge that is differentiating Indian groups from Kalash, Brahui, and to a small extent Pathan (I bet this sample is a desi Pathan and not Pathan from NWFP)”

    no, they’re pakistani pathan from the HGDP data set. and i think pathans are desi (yes, some pathans would deny this, but the genetic and cultural evidence is pretty convincing to me, despite their iranian language). your general point is spot on though. there is an ancient substratum of indian ancestry, probably about ~40% of the ancestry of all south asians, which separates indians from other eurasians, to the west and east. though this substratum seems to be present in low proportions among some southeast asians too (e.g., cambodians). you can probably affiliate it with some of the indian-specific M mtDNA haplogroups if you are into more old-school phylogeographic methods.

    • i may be misremembering my se asian history, but did the people responsible for pushing hindu deity worship and sanskrit into the region also leave genetic markers to any significant degree?

  19. “AV, there’s a Nair in the next round! booh-yah!”

    I know, I’m very excited about that. He’s an e-Friend of mine, and heard about HAP a while ago through me, so he submitted first thing :)! He scores extremely well with South Indian Brahmins on 23andMe (Compare Genes tool), so let’s see what HAP’s analyses have in store for him. Regarding your hypothesis, I think it goes without question that South Indian Brahmins took on local wives at first upon coming to South India, and this is attested by their mtDNA demographics too, considering that they exhibit the more aboriginal (in reference to its antiquity in South Asia) mtDNA M and R slightly more than they do the West-Eurasian U2, U7, T (and to a lesser extent HV + H) which is typical for North Indian Brahmins. If my 23andMe sharing list is anything to go by, they exhibit mtDNA M/R : West Eurasian mtDNA in a ratio of 3:2, possibly indicating that there were later waves wherein Brahmin males from the North also brought their own females with them. Unfortunately there haven’t been any caste/region-specific studies on South Asian mtDNA, so we can’t form any concrete conclusions as of now.

    Also, in my last comment, I meant Sindhi* and not Sindhim [typo].

    PS. IDK why, but every time I try to change my commenting name to “AV” it changes to “A93”. Maybe some glitch with this Google Account?

  20. Razib Khan: The South Indian adivasi is 30-40% ANI, and 70-60% ASI. The Pathan is 80% ANI and 20% ASI. Most people in South Asia span the gamut.

    I love this data. Thanks for putting this all together! I have a basic question. On one hand, we have: “The South Indian adivasi is 30-40% ANI, and 70-60% ASI. The Pathan is 80% ANI and 20% ASI. Most people in South Asia span the gamut.”

    Question: Is Onge a proxy for the “ASI” and is the SouthWestern Asian and/or European gene a proxy for “ANI”?

    Also, I’m amazed that one of your Jatt samples is basically identical to a Pashtun sample. Intuitively, I can see that the people sometimes bear a striking resemblance, and their cultural and behavioral attributes are very similar and specific. They both tend to treat guests really well, and are very patriarchal and loyal, with the same or similar physique, same type of hair growth, etc.

    Pathan: 48 S Asian; 11 Onge; 1 E Asian; 16 SW Asian; 17 European; 5 East Eurasian.
    Punjabi Jatt: 49 S Asian; 13 Onge; 1 E Asian; 13 SW Asian; 20 European; 3 East Eurasian.

    NOTE: Because of rounding, the numbers won’t necessarily sum to 100%.

    Also, I can’t help but think to myself that when the proto-Indo-Iranian language started to diverge, the homeland of the Iranian-speaking Pashtuns and the Indo-Aryan speaking Jatts is where this divergence took place. Moreover, their liturgical interpretations must have diverged at this time as well. It’s remarkable how Indo-Europeans first emerged in Iran around 1000 BC from the EAST (not West or the Caucasus mountains), and also at about this time, Zoroaster lived. Currently, linguists seem to think that this divergence took place in Central Asia around 2000 BC, I think.

    I think that the proto-Jatts and proto-Pashtuns eventually developed the proto-Sanskrit and proto-Iranian languages. Given that they are both culturally domineering and pugnacious, they spread these new views and methods.

  21. “Is Onge a proxy for the “ASI” and is the SouthWestern Asian and/or European gene a proxy for “ANI”?”

    And where does that leave the south asian component which is the largest one in desis and declines as you exit the subcontinent? The southwest asian and european components are a fraction of the south asian component among desis. Desis are overwhelmingly Onge + South Asian.

    It seems like the Onge component pretty much defines desiness. It is non-existent to vanishingly small in persians and other west asians and europeans, while it ranges from ~15% in punjabis of the peasant caste (jatts) to 25% in tamil brahmins and 35% in the vishvakarma “forward” caste and tribals of south and central india.

    The total absence of the Onge component in the Kalash sets them apart from desis. So much for the Harappa Project’s Kalash component……

  22. “but did the people responsible for pushing hindu deity worship and sanskrit into the region also leave genetic markers to any significant degree?”

    they found some markers in bali a few years back. i think it depends on the region. it seems more likely in maritime southeast asia. less likely in continental areas.

    “Question: Is Onge a proxy for the “ASI” and is the SouthWestern Asian and/or European gene a proxy for “ANI”?”

    well, onge is correlated with ASI, but it is not a substitute. the fraction of onge is lower than ASI. and “ANI” tracks s asian, sw asian, and european. it’s more complicated.

    “It seems like the Onge component pretty much defines desiness. It is non-existent to vanishingly small in persians and other west asians and europeans, while it ranges from ~15% in punjabis of the peasant caste (jatts) to 25% in tamil brahmins and 35% in the vishvakarma “forward” caste and tribals of south and central india.

    “The total absence of the Onge component in the Kalash sets them apart from desis. So much for the Harappa Project’s Kalash component……”

    1) the onge underestimates ASI

    2) the onge is not a perfect proxy since they’re genetically somewhat distant. look at the non-trivial fractions across southeast asia in the reference runs

    3) the kalash have ASI, albeit low levels, because onge tends to underestimate ASI

    see the supplements of reich et al.

  23. I read from some pro-Indian nationalistic source suggesting that all proto-Caucasians came from the Indian Peninsula after 60,000 years ago. This was an example of a north and western migration.

    This is my question: Could we prove using genetics, that a migration occurred from south India into the North? I would think that in order to do so, we must identify a gene, gene sequence that’s expressed very highly amongst South Indians, and measure the frequency of this gene/gene_sequence for other groups.

  24. “I read from some pro-Indian nationalistic source suggesting that all proto-Caucasians came from the Indian Peninsula after 60,000 years ago. This was an example of a north and western migration.”

    first, can we stop using the word caucasian? it’s confusing.

    http://blogs.discovermagazine.com/gnxp/2011/01/stop-using-the-word-caucasian-to-mean-white/

    second, the more correct model is that everyone outside of africa excluding some people in the middle east may have south asian origins ~50,000 years ago. that’s because south asia was probably the most congenial area for african humans outside of africa (much of the middle east was really dry). in fact it is almost certainly true that all east eurasians, oceanians, and amerindians, have a south asian root at that stage. it is a little more confused for the west eurasian populations. they may have a more middle eastern root population, which was a sister group to the south asian humans.

    “This is my question: Could we prove using genetics, that a migration occurred from south India into the North? I would think that in order to do so, we must identify a gene, gene sequence that’s expressed very highly amongst South Indians, and measure the frequency of this gene/gene_sequence for other groups.”

    yes, genetics can suggest this. the real proof though would have to be extractions from ancient DNA. i think north india is dry and cool enough that we’ll get some, especially from older indus societies where human remains were buried underneath the house. i am kind of skeptical about the feasibility in south india, but who knows? perhaps something from the deccan?

    • Razib Khan: first, can we stop using the word caucasian? it’s confusing.

      I usually use “honky” for someone of Northern European, white ancestry, and who manifests xenophobia; and I use Caucasian as a more inclusive demarcator to include N. Africans, Middle Easterns, and even Orissans (adapted from Cavalli-Sforza’s population map).

      Razib Khan: second, the more correct model is that everyone outside of africa excluding some people in the middle east may have south asian origins ~50,000 years ago. that’s because south asia was probably the most congenial area for african humans outside of africa (much of the middle east was really dry). in fact it is almost certainly true that all east eurasians, oceanians, and amerindians, have a south asian root at that stage. it is a little more confused for the west eurasian populations. they may have a more middle eastern root population, which was a sister group to the south asian humans.

      This is amazing! This is what I was referring to. I am so surprised about this. But when I ponder it a little more, I realize I never thought too deeply about this to begin with. You see, I’ve seen many maps showing when, where, and the migratory route of our ancestors. What was very remarkable was that humans migrated out of Africa 60,000 years ago, and almost instantly, they made it all the way to Australia. Perhaps they drove hybrid vehicles, but I forgot that wheels were only invented around 6,000 YA. These humans “hugged” the coastlines on their way to Australia. So, I would think that they hung out there for millennia, and then wandered elsewhere.

      Now, that being said, your information also corroborates another factoid that I knew, that I believe Cavalli-Sforza touched upon: That ~67% of Europeans’ genetics are from the Middle East. Now, this makes sense to me. Of course, he seems to believe that their Middle Eastern genetic contribution occurred sometime after 10,000 YA.

      Boston_Mahesh: This is my question: Could we prove using genetics, that a migration occurred from south India into the North? I would think that in order to do so, we must identify a gene, gene sequence that’s expressed very highly amongst South Indians, and measure the frequency of this gene/gene_sequence for other groups.

      Razib Khan: yes, genetics can suggest this. the real proof though would have to be extractions from ancient DNA. i think north india is dry and cool enough that we’ll get some, especially from older indus societies where human remains were buried underneath the house. i am kind of skeptical about the feasibility in south india, but who knows? perhaps something from the deccan?

      Very good! You know, RK, one thing that I’d love to see in the future is a very robust, scientific way to elucidate our past, and these genetic techniques seem to be an amazingly powerful tool to discern our genetic origins/migratory path, etc. We already are at a Golden Age of all this, I believe, now that sequencing is relatively inexpensive. I’m very sure that this would shed more appreciation for our pedigree, but at the same time, it would, perhaps, help to destabilize certain loose nations with very differing genotypes.

  25. “and I use Caucasian as a more inclusive demarcator to include N. Africans, Middle Easterns, and even Orissans (adapted from Cavalli-Sforza’s population map).”

    the term is caucasoid. caucasian is a different term.

    “Now, that being said, your information also corroborates another factoid that I knew, that I believe Cavalli-Sforza touched upon: That ~67% of Europeans’ genetics are from the Middle East. Now, this makes sense to me. Of course, he seems to believe that their Middle Eastern genetic contribution occurred sometime after 10,000 YA.”

    i think he’s correct. you need to “update” your model of the world where the ice patterns had a long term effect. i think europe was probably “replaced” by middle eastern/central eurasian populations several times. think of what happened to the australian aboriginals or bushmen in south africa. as for india, the “ANI” was probably intrusive to the subcontinent. or the “ASI” are. the two groups are genetically VERY different. since the ASI don’t resemble anyone else closely (the connection to the andaman islanders is probably somewhat more distant than that between amerindians and east asians), and the ANI are basically another generic west eurasian group, the most likely scenario is that ANI represents a “back migration” of west eurasians who may originally have been from the fringes of south asia.

    “I’m very sure that this would shed more appreciation for our pedigree, but at the same time, it would, perhaps, help to destabilize certain loose nations with very differing genotypes.”

    most people are ignorant, so i doubt it’ll matter.
