Structure within Houston Gujaratis resolved?

About two and a half months ago I brought your attention to the fact that there is population substructure in the Gujaratis of Houston. That might sound strange, but here’s the back story. Over the past 10 years or so there has been a project attempting to catalog common human genetic variation, known as the HapMap. The HapMap began with East Asian, West African, and European groups, but over the years it has been expanding. The first South Asian population added to the database was a sample of people of Gujarati origin in Houston, Texas. Therefore, you had a situation where in the medical genetic literature there was a lot of talk about “Gujaratis from Houston,” as if that was a group of particular importance.

The ultimate pragmatic rationale for the catalog was to allow researchers to control for ancestry when attempting to fix upon genes implicated in disease. To illustrate, if Chinese have disease X at a greater frequency than Europeans, and you had a common pool of Chinese + Europeans, then all the genetic variants associated with Chinese ancestry might come up as causal, when actually it’s just a correlation with ancestry.

And this brings me to the Houston Gujaratis. One thing that jumps out at you in analyses of genetic variation of this population set is that it has substructure. That is, there are two populations within the data set. More precisely, there is one tight cluster, while the rest of the individuals vary a great deal in their genetic character. The image above is my own plotting of the variation of Chinese and the Houston Gujaratis onto a cubic space. You immediately see that there is a Chinese cluster and a Gujarati cluster, and a range of Gujaratis who fall outside of the main cluster.
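The ancestry confound described above (disease X, Chinese + Europeans in one pool) can be put in numbers. Below is a minimal sketch in Python. The function name, the prevalences, and the allele frequencies are all made up for illustration and are not drawn from the HapMap or any real study. It computes the expected frequency of a non-causal variant among cases and among controls when groups with different disease rates are pooled.

```python
def allele_freq_by_status(groups):
    """Expected allele frequency among cases and among controls in a pooled sample.

    groups: list of (share_of_sample, disease_prevalence, allele_frequency).
    Assumption: within each group the allele is independent of the disease,
    i.e. the variant is NOT causal.
    """
    case_weight = sum(share * prev for share, prev, _ in groups)
    ctrl_weight = sum(share * (1 - prev) for share, prev, _ in groups)
    case_freq = sum(share * prev * freq for share, prev, freq in groups) / case_weight
    ctrl_freq = sum(share * (1 - prev) * freq for share, prev, freq in groups) / ctrl_weight
    return case_freq, ctrl_freq


# Hypothetical numbers: group A has more disease and more of the allele.
group_a = (0.5, 0.20, 0.8)  # half the sample, 20% prevalence, allele at 80%
group_b = (0.5, 0.05, 0.2)  # half the sample, 5% prevalence, allele at 20%

print("pooled:", allele_freq_by_status([group_a, group_b]))
print("group A alone:", allele_freq_by_status([group_a]))
```

With these invented inputs the pooled sample shows the allele at 0.68 in cases against roughly 0.47 in controls, even though within either group alone cases and controls carry it at the same frequency. That gap is pure ancestry, which is why you want a reference catalog to control for it.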

Knowing what we know about the prevalence of endogamy among South Asians the immediate model which jumped out at me was that the Houston Gujarati cluster was a specific subgroup which migrated to the United States. But who? My immediate hunch was that they might be a group of Patels. Others of you suggested Bohras.

I can now report something substantive thanks to Zack Ajmal. He has some Gujarati Patels in the Harappa Ancestry Project, and they match closely with the Gujarati cluster in question. This does not exclude the possibility that the cluster consists of Bohras, and does not entail that it must be Patels. I don’t know the relationship between these various groups in Gujarat. But I think we’re getting closer to a resolution of this mystery at least.

Of course the Gujarati HapMap cluster is not unknown to scholars. Two years ago in the supplements of the paper Reconstructing Indian Population History the authors observed the peculiar pattern in the principal component plots, which visualize the largest independent dimensions of variation in a data set. Most of the Indian populations fell along a line which has at one pole various groups like South Indian Dalits and at the other pole Europeans. But as the authors note, a section of the Gujaratis were outside of the expected pattern. Why? Here is their hypothesis:

…Interestingly, one of the GIH subgroups fall outside the main gradient of Indian groups, suggesting that they harbor substantial ancestry that is not a simple mixture of ASI and ANI. A speculative hypothesis is that some Gujarati groups descend from the founders of the “Gurjara Pratihara” empire, which is thought to have been founded by Central Asian invaders in the 7th century A.D. and to have ruled parts of northwest India from the 7-12th centuries. I. Karve noted that endogamous groups with names like “Gurjar” are now distributed throughout the northwest of the subcontinent, and hypothesized that they likely trace their names to this invading group.

This is wrong. The reason that a subset of Gujaratis fall outside of the main cluster is that they are a very genetically homogeneous group. It is for the same reason that you exclude close relatives from these analyses; the relatives will shake out into their own clusters, and you obviously don’t want that cluttering up the results. All the Gujaratis who are not in the cluster run the gamut you would expect in terms of ancestry for individuals from Central West India. Those in the distinctive cluster have a particular pattern in common.

To the left is a bar plot I generated from a selection of individuals and populations from Zack’s K = 11 ADMIXTURE run. You can see the raw data in Google Docs. What K = 11 means is that Zack took all the individuals in his data set, which runs into the thousands, and allowed the program to apportion their ancestry among 11 populations. These are not real populations necessarily, but abstractions. So you shouldn’t take the labels too seriously. I’ve limited it to the population components of particular relevance for South Asians. The labels in all caps are a number of individuals from public data sets. Those which are not in all caps are individuals from the Harappa Ancestry Project. I’ve constrained the individuals and populations to be somewhat informative of my overall point.

What is that point? The “Patel” Gujarati cluster is among the most “pure” of South Asian populations. The Bengali to the left is my mother, and you can see that her South Asian proportion drops mostly because of her elevated East Asian ancestry. Among the Jatts the European and Southwest Asian proportion is higher. The “Onge” component refers to an affinity with a tribe in the Andaman Islands. This, combined with the “S Asian” component, is probably a good shadow of patterns of variation which denote ancestral deep roots within the Indian subcontinent. Combining the two you see that the Gujarati cluster and individuals affiliated with it top out in excess of 90%! I think this is the outcome of the ancient admixture event between “Ancestral North Indians” and “Ancestral South Indians” which defines South Asians as a distinctive genetic unit on a worldwide canvas. All those who came later, whether it be Austro-Asiatics, Aryans, or Scythians, are overlays upon this robust common substrate.
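For anyone who wants to redo the “combine the two” step on the spreadsheet themselves, here is a minimal sketch in Python. It assumes each individual is a mapping from component label to fraction, with the fractions summing to 1. The function name and the numbers below are invented for illustration; they are not rows from Zack’s K = 11 results.

```python
def combined_fraction(proportions, components=("S Asian", "Onge")):
    """Sum the ancestry fractions for the named components.

    proportions: dict mapping component label -> fraction (0.0 to 1.0).
    Components missing from the dict are counted as zero.
    """
    return sum(proportions.get(name, 0.0) for name in components)


# Hypothetical individual, NOT taken from the actual ADMIXTURE output.
example = {
    "S Asian": 0.80,
    "Onge": 0.12,
    "European": 0.05,
    "SW Asian": 0.03,
}

# Sanity check: the fractions for one individual should sum to 1.
assert abs(sum(example.values()) - 1.0) < 1e-9

print(round(combined_fraction(example), 2))
```

Sorting individuals by this combined value is one simple way to see which ones sit at the “most South Asian” end of the plot.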

Ironically the geneticists who decided to select the Gujaratis of Houston stumbled onto a group which is archetypically representative of what it means to be South Asian in a biological sense.

18 thoughts on “Structure within Houston Gujaratis resolved?”

  1. Razib,

    Based on population genetics data, could you shed some light on which migration to the subcontinent was bigger in absolute numbers: The Aryan Migrations or the Scythian Migrations?

  2. Razib Khan: “former.”

    Boss_Mahesh: Absolutely amazing. I swear, I would have thought it would have been Scythians. The Scythians never influenced any languages or any religious movement, and they didn’t leave many place names, among other things that they didn’t leave behind. Maybe their migration into “India” was localized in Afghanistan and/or Pakistan. The only cultural relics that Scythians may have left in the Subcontinent that I know of are: 1. A “Saka” calendar, which began around 1900 years ago. 2. Rajput migration to Rajasthan. 3. Maybe “Gujjars” are Scythians. They’ve many place names such as “Gujranwala”, “Gujrat”, and “Gujarat State”, among other things.

    By the way, detecting Scythian movement may be much more challenging. They were very mixed people. Most spoke a Turkic language, and some, the ruling elite, may have spoken an Iranian language.

  3. By the way, detecting Scythian movement may be much more challenging. They were very mixed people. Most spoke a Turkic language, and some, the ruling elite, may have spoken an Iranian language.

    i don’t know how the ethnographic terms were confused, but the ancient scythians were clearly an iranian people. if they had turkic and it had an impact it would be easily detectable in the genetics. from what i can tell the jatts do not have it. there is quite often some turkic in muslims from UP north and west. some of the pashtuns in the HGDP data set clearly have it. i do not usually see it in hindus, except for people from the himalayan fringe (nepal, himachal pradesh). we bengalis tend to have a more southeast asian element. the two are differentiable.

    i say the indo-aryans had a bigger impact because the “european-like” component found at low fractions across south asia is WAY more widespread than the saka domains. in fact, it seems to be found in most indo-aryan people, even bengalis of non-brahmin origin like my parents, and south indian brahmins. so if you integrate over the distribution the indo-aryans clearly were much more of an impact. the saka influence was salient though obviously.

  4. hey, sorry if this has been answered because I skimmed the article really quickly, but why did you leave out African ancestry? Or is it so negligible that it wasn’t worth putting in? I always figured certain groups like pashtuns, balochis, sindhis, etc would have just as much African as East Asian heritage; although judging by the graph, I really underestimated just how much E.Asian ancestry non-Bengali S.Asians have

  5. sorry if this has been answered because I skimmed the article really quickly, but why did you leave out African ancestry? Or is it so negligible that it wasn’t worth putting in? I always figured certain groups like pashtuns, balochis, sindhis, etc would have just as much African as East Asian heritage; although judging by the graph, I really underestimated just how much E.Asian ancestry non-Bengali S.Asians have

    1) outside of muslims from the northwest part of the subcontinent, african is negligible

    2) at the 0-5% interval a lot of the “east asian” is probably noise, though not all (especially on the northern fringe of the subcontinent). the tell is if you have a group with low levels of east asian admixture that is uniform, vs. a group where lots of people have none, and then one person has a lot. that is not uncommon among north indians (excluding bengalis, who tend to have low levels which vary across the 5-15% interval). among some south indian tribals you have 2-3% in everyone. that is either VERY OLD admixture, or it is noise that emerges from the inability to properly separate ancestral components in the algorithm.

  6. Razib, My humble suggestion is that you summarize the point of your posts in the introduction or conclusion – in non-technical language if possible.

  7. It is interesting that the Armenians (multiple individuals) have such a high S.Asian component. I am wondering if the S.Asian in this case is ANI (not ASI).

    Also the (single) Rajasthani brahmin with E.Asian–is it a Scythian or Mongol contribution or just migrations from within other parts of N.India. There have been a lot of those as I’ve said before. Can you clarify what E.Asian is (something shared with peoples of SE Asia, the Han etc…)?

    About Gujaratis (Patels) I’ve been saying for sometime that some Indians are more Indian than others. I have a mental map of Indians and their stories (everyone has theirs). With Patels, contributing factors were that in the last 200 years as a diasporic community they faced the full brunt of the clash of civilizations. I won’t say who I feel are least Indian 😉 but your DNA PCA profiles support my ranking.

  8. I imagine that the Indo-Aryans were more of a “settling” population than the Scythian who were a tribal/ruling elite hence the disproportionate genetic impact.

    Impressive that n.w south asia (indus) has seen an admixture of other “continental” influences like African & N.E Asian, and so has Bengal for that matter; they were really the “borderlands”.

    Apologies for the confusion but is there any resolution or clarity on Gujarati B?

  9. Apologies for the confusion but is there any resolution or clarity on Gujarati B?

    so zack calls “gujarati b” “gujarati a.” so it is resolved. probably patels.

  10. I am wondering if the S.Asian in this case is ANI (not ASI).

    yes.

    Also the (single) Rajasthani brahmin with E.Asian–is it a Scythian or Mongol contribution or just migrations from within other parts of N.India. There have been a lot of those as I’ve said before. Can you clarify what E.Asian is (something shared with peoples of SE Asia, the Han etc…)?

    these results won’t allow clarification. but it’s not that hard. generally it is as you expect, ppl from nw south asia are more ‘turk’, those from east south asia more ‘southeast asian.’

  11. India was peopled LONG ago: about 60,000 years ago. To put this into perspective, the following were colonized at – N. America – 25KYA Europe – 40KYA Australia – also 60KYA. Humans emigrated from Africa around 60 KYA as well.

    Anyways, the narrative here seems to patronize the indigenous genotypes here. This doesn’t make sense, and it’s like saying that apes have a lot of human genes (not the semantically more lucid “humans have a lot of ape genes.”). The research’s narrative should attempt to elucidate how much Proto-Indian genes remains in C Asia and S Asia and SE Asia.

  12. India was peopled LONG ago: about 60,000 years ago. To put this into perspective, the following were colonized at

    there’s no guarantee that the first settlers made much of an impact and weren’t replaced. i think this has happened in europe, and, in much of east africa.

    This doesn’t make sense, and it’s like saying that apes have a lot of human genes (not the semantically more lucid “humans have a lot of ape genes.”). The research’s narrative should attempt to elucidate how much Proto-Indian genes remains in C Asia and S Asia and SE Asia.

    there is a non-trivial probability that all of non-africa was settled from the fringes of south asia. but the ASI lineage which is oldest in india, even if it is descended from the first settlers, obviously has changed since the last common ancestor with other lineages.

    fwiw, i bet ASI is basal to all east eurasian, oceanian, and amerindian lineages. it might be basal to west eurasian ones though. there is mounting evidence that anatomically modern humans left africa closer to ~100,000 years ago, but that there was a “pause” or interregnum somewhere in southwest asia before a second expansion event.

  13. Also, what is the centroidal European gene set? Is it someone from the geographic/population center of Europe in SE France/Lithuania? Or is it from the eskaru who may have been descendants of the original cromagnons? Or from someone just west of the Urals in Kazakhstan?

    Now that I think about it, there are many “poles” which to reference data sets.

    We could have even framed the poles as a Tamil Dalit Node, a Tibetan Node, a Portuguese Node (to detect admixtures amongst Konkani Brahmins and Goan Christians), a Tajik/Sogdian/Pamiri Node to detect Scythians, or even a Persian Node to detect Greeks (since Alexander’s troops were mostly Iranians when he came to Pakistan).

    All this being said, I’m very surprised at how homogeneous Desis are when it comes to the “Indian” gene, and it seems to cut off sharply when you go outside the pashtun/baluchi area in a non-cline manner.

  14. Also, what is the centroidal European gene set? Is it someone from the geographic/population center of Europe in SE France/Lithuania? Or is it from the eskaru who may have been descendants of the original cromagnons? Or from someone just west of the Urals in Kazakhstan?

    the “european” in south asians is clearly affiliated with eastern europeans. i’m not going to get into the details, but it is pretty obvious if you run ADMIXTURE yourself. it’s a band from the baltic down to the caucasus and sweep to northwest india and a little beyond.

    We could have even framed the poles as a Tamil Dalit Node, a Tibetan Node, a Portuguese Node (to detect admixtures amongst Konkani Brahmins and Goan Christians), a Tajik/Sogdian/Pamiri Node to detect Scythians, or even a Persian Node to detect Greeks (since Alexander’s troops were mostly Iranians when he came to Pakistan).

    some populations really are more “types” than others. a tamil dalit and portuguese node makes sense. the others don’t. those groups are simply always linear combinations of other groups in all the runs i’ve seen.

    • We are pre-supposing the points of origin of various immigrant groups into the Deshlands by measuring how similar or how frequent certain haplotypes/SNP/etc. show up amongst the Baltic-Caucasus Belt and India. I would think that a great experiment could be done the following way(s): 1. If we wanted to see how similar the various peoples of S.Asia are to a “polar node”, then we should see what haplotype/SNP/etc. this polar node has an extreme amount of (i.e. they express a STR many times about 95% of the time, whereas the rest of the world expresses it much less. Or conversely, they express it 2% and the rest of the world expresses it higher).

      For example: Suppose that you set up the experiment to detect how closely Desis cluster with Georgians, you would want to test for a gene expression that the Georgians have a lot or a dearth of. OTOH, if they only expressed something at an intermediate level, say 50%, then this is much less useful by itself. The neighbors of the Georgians may express it more or less than 50% but further out, they may express it at 50% as well. So it’s useless unstable information.

      1. Why not have one of the nodes at Tajikistan, which was the original homeland of the Aryans before they migrated to Desh and Iranian plateau? The Tajiks weren’t supplanted by the Turks either, but by other Iranian groups (i.e. Sogdians, etc.).

      2. Has anything been elucidated about the Greek/Persian immigration into Afghanistan/Pakistan in ~300BC?

      3. Why not have a “polar node” from the people who now live at the Harappa/Indus Valley area, and compare how the vicinal areas have this gene set and at what frequency? In other words, this would measure the equally-likely supposition that Harappans migrated OUTSIDE to even Afghanistan/Turkmenistan/etc. This actually makes sense, since this area was prepopulated early on and had a big population. They were very likely to migrate outwards.

      Thanks for all of your amazing insights and contributions here, Razib. I love your articles.

  15. I recall the B blood group being linked to S.Asia (including Afghan/Tajik populations). A map of the B blood group shows these regions being highlighted at 15% of the population (deep South India is not one of these regions–it tends to run in some brahmin families, most other people are O). This I think would be due to the Aryan contribution to the S.Asian gene pool. In Western Europe the B is not present at these levels. I’ve not really seen any discussion on this but maybe you could comment further.