The 2009 paper Reconstructing Indian population history was a watershed in understanding the genomics of South Asians. Earlier studies had relied on unrepresentative samples or fewer markers, or had treated South Asians as only a sidelight. This paper put the focus on South Asians to elucidate the group’s population history (it still undersampled eastern South Asians, though this seems to have been part of the plan, given the authors’ focus on two, not three, ancestral Indian components). If you want to know more about the paper, here is the ungated version. But in this post I want to focus on an issue which you can find only in the supplements to the paper.
The HapMap project, which surveys genetic variation in world populations, has a set of Gujaratis from Houston, Texas. This is currently the primary population of Indian origin you have to work with in the public data sets. There are other South Asian populations in the public domain, but their number of markers is far lower. So the Gujarati sample is very useful right now. But one thing that immediately jumps out at you is that there are in fact two Gujarati clusters. In the PCA plot I’ve extracted from the supplements you see the two largest components of genetic variation. PC 1, the x axis, separates whites from South Asians, and PC 2 separates one group of Gujaratis from everyone else. What’s going on here?
First, let’s take another look at the Gujarati population, and compare it to other South Asians. I ran ADMIXTURE the other day with a combined data set of Eurasians, Papuans, and Berbers from Algeria. I excluded Africans and New World populations because for the purposes of this analysis I’m not interested in that genetic variation (Africans are so diverse that they often take up many components of any analysis). The way ADMIXTURE works is that it takes a number K of hypothetical ancestral populations and assigns each individual a fraction of ancestry from each of them, taking as the reference the variation found in the aggregate sample. To make this concrete, I have two Bengalis in the sample, my parents. While the Chinese come out to be almost 100% one population, and the French 100% another, my parents tend to be a mix. That’s because they share commonalities with both groups. Also, for what it’s worth, I pruned my data set down to 80,000 markers.
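To give a feel for what assigning fractions of ancestry means, here is a minimal Python sketch. It is not ADMIXTURE's actual algorithm: it assumes the allele frequencies of the K ancestral populations are already known (ADMIXTURE estimates those too), and it uses a plain EM update to recover one individual's ancestry fractions. The function names, the simulated frequencies, and the 70/30 individual are all made up for illustration and have nothing to do with my real data.

```python
import random

def simulate_genotypes(q, freqs, rng):
    """Draw 0/1/2 genotypes for one individual with ancestry fractions q.

    freqs[k][j] is the (assumed known) allele frequency of marker j
    in hypothetical ancestral population k.
    """
    genotypes = []
    for j in range(len(freqs[0])):
        # The individual's allele frequency is a q-weighted mix of the ancestral ones.
        p_mix = sum(q[k] * freqs[k][j] for k in range(len(q)))
        genotypes.append(sum(rng.random() < p_mix for _ in range(2)))
    return genotypes

def estimate_ancestry(genotypes, freqs, n_iter=200):
    """Estimate ancestry fractions by EM, holding ancestral frequencies fixed."""
    n_pops = len(freqs)
    n_markers = len(genotypes)
    q = [1.0 / n_pops] * n_pops  # start from equal ancestry
    for _ in range(n_iter):
        new_q = [0.0] * n_pops
        for j, g in enumerate(genotypes):
            # Weight of each population for a counted allele vs. the other allele.
            alt = [q[k] * freqs[k][j] for k in range(n_pops)]
            ref = [q[k] * (1.0 - freqs[k][j]) for k in range(n_pops)]
            alt_total, ref_total = sum(alt), sum(ref)
            for k in range(n_pops):
                new_q[k] += g * alt[k] / alt_total + (2 - g) * ref[k] / ref_total
        # Each marker contributes two alleles, so the fractions sum to 1.
        q = [value / (2 * n_markers) for value in new_q]
    return q

# Toy run with K = 2 and simulated frequencies (assumptions, not real data).
rng = random.Random(0)
toy_freqs = [[rng.uniform(0.05, 0.95) for _ in range(2000)] for _ in range(2)]
toy_genotypes = simulate_genotypes([0.7, 0.3], toy_freqs, rng)
print(estimate_ancestry(toy_genotypes, toy_freqs))
```

The estimate should land near the simulated 70/30 split, in the same way my parents come out as a mix of components that are close to fixed in the Chinese and the French.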
To the right you see runs from K = 2 to K = 4. That means 2, 3, and 4 ancestral populations, respectively. Each bar is divided by color into population assignments. From top to bottom I have Sindhis, French, Papuans, Chinese, Bengalis, Gujarati_A, and Gujarati_B. Gujarati_A is the group of Gujaratis who are more diverse and do not form a tight distinctive cluster in the plot above. Gujarati_B are those who do form a tight distinctive cluster.
At K = 2 the Chinese and French elements are clear. This is an east vs. west division. South Asians are a bit more western than eastern, as you might expect, with the Bengalis being the most eastern, and the Sindhis the least. At K = 3 you see a Papuan element break out in red, which South Asians share. Note that my parents have the green element which is nearly 100% in Chinese. Finally at K = 4 you see a South Asian component, in green now.
Note the difference between the two Gujarati samples. Gujarati_A has much more of whatever is distinctive to the French. Gujarati_B is more singularly South Asian. On a finer grained level, you can see in this figure that Gujarati_B is very uniform in ancestral quanta in comparison to Gujarati_A.
What does this tell us? The PCA pulls out only a small fraction of genetic variance, but because it extracts independent dimensions it is really informative about between-population differences. You generally don’t want to include related individuals in this sort of analysis, because they often show up as their own cluster, as they share so many genetic variants. To me Gujarati_B looks to be a specific group within that ethnic group whose members are broadly related because of endogamy. This is a common pattern among South Asians because of jati. And in the paper I referenced above it shows up in the genomics.
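Here is a rough Python sketch of why a subgroup like this can claim a principal component of its own. Everything in it is an assumption for illustration: I simulate 30 "general" individuals from one set of allele frequencies and 10 individuals from a drifted copy of those frequencies (a crude stand-in for a long-endogamous group), then pull out the leading PC by power iteration. The group sizes, marker count, and amount of drift are invented, and real pipelines use dedicated PCA software rather than anything like this.

```python
import random

def simulate_cohort(n_general=30, n_endogamous=10, n_markers=2000, seed=1):
    """Simulate genotypes; the last n_endogamous rows come from drifted frequencies."""
    rng = random.Random(seed)
    base = [rng.uniform(0.25, 0.75) for _ in range(n_markers)]
    # Hypothetical drift: each frequency in the subgroup is nudged at random.
    drifted = [p + rng.uniform(-0.2, 0.2) for p in base]

    def draw(freqs):
        # Two independent allele draws per marker give a 0/1/2 genotype.
        return [(rng.random() < p) + (rng.random() < p) for p in freqs]

    general = [draw(base) for _ in range(n_general)]
    endogamous = [draw(drifted) for _ in range(n_endogamous)]
    return general + endogamous

def top_pc_scores(genotypes, n_iter=100, seed=0):
    """Return each individual's coordinate on the leading PC, up to scale and sign."""
    n, m = len(genotypes), len(genotypes[0])
    # Center each marker so the PC reflects differences between individuals.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Individual-by-individual similarity (Gram) matrix.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]
    # Power iteration converges to the leading eigenvector of the Gram matrix.
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

scores = top_pc_scores(simulate_cohort())
print("general:   ", [round(s, 2) for s in scores[:30]])
print("endogamous:", [round(s, 2) for s in scores[30:]])
```

In this toy setup the leading component does little except split the drifted subgroup from everyone else, which is the same shape as PC 2 peeling off one set of Gujaratis in the plot.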
Why does this matter? Because different South Asian groups have different genetic susceptibilities. England has many more South Asians, so it is known from there that Bengalis in particular have a greater risk of type 2 diabetes, all things controlled, than Punjabis. But more specifically, individual endogamous groups should have their own genetic predispositions. If this is true, and Gujarati_B is an endogamous group, then pooling it into the broader Gujarati sample may be problematic, at least without correction. If the Gujarati HapMap sample is used as a proxy for South Asian genetic variance, then the peculiarities of Gujarati_B may produce false positives when it comes to generalizing to South Asians more broadly. In contrast, Gujarati_A are all co-ethnics, and if they’re South Asian they’re likely the outcome of long-term endogamy. But the individuals may be the outcome of different histories, and so only share risk variants common to Gujaratis, or perhaps even South Asians.
That’s the practicality. Now to where I hope the readership here can be informative: does anyone know what this difference in the Houston Gujaratis could be? I have speculated that perhaps Gujarati_B are a specific jati which is heavily overrepresented in the United States. Ethnographically, Gujaratis should have a sense of this, I assume, just as non-Sylheti Bangladeshis and non-Mirpuri Pakistanis in England are aware of the dominance of these two subgroups.