The 2009 paper Reconstructing Indian population history was a watershed in understanding the genomics of South Asians. Before this point studies had relied on unrepresentative samples or fewer markers, or had treated South Asians as only a sidelight. This paper put the focus on South Asians to elucidate the group’s population history (it still undersampled eastern South Asians, though this seems to have been part of the plan, given the authors’ focus on two, not three, ancestral Indian components). If you want to know more about the paper, here is the ungated version. But in this post I want to focus on an issue which you can find only in the supplements to the paper.
The HapMap project, which surveys genetic variation in world populations, has a set of Gujaratis from Houston, Texas. This is currently the primary population of Indian origin you have to work with in the public data sets. There are other South Asian populations in the public domain, but their number of markers is far lower. So the Gujarati sample is very useful right now. But one thing that immediately jumps out at you is that there are in fact two Gujarati clusters. In the PCA plot I’ve extracted from the supplements you see the two largest components of genetic variation. PC 1, the x axis, separates whites from South Asians, and PC 2 separates one group of Gujaratis from everyone else. What’s going on here?
First, let’s take another look at the Gujarati population, and compare it to other South Asians. I ran ADMIXTURE the other day with a combined data set of Eurasians, Papuans, and Berbers from Algeria. I excluded Africans and New World populations because for the purposes of this analysis I’m not interested in that genetic variation (Africans are so diverse that they often take up many components of any analysis). The way ADMIXTURE works is that it takes a number K of hypothetical ancestral populations and assigns each individual a fraction of ancestry from each of those populations, taking as the reference the variation found in the aggregate sample. To make this concrete, I have two Bengalis in the sample, my parents. While the Chinese come out to be almost 100% one population, and the French 100% another, my parents tend to be a mix. That’s because they share commonalities with both groups. Also, for what it’s worth, I pruned my data set down to 80,000 markers.
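To give a feel for what "assigning fractions of ancestry" means, here is a minimal Python sketch. It is not ADMIXTURE itself: the real program estimates the ancestral allele frequencies and the ancestry fractions jointly, while this toy version assumes the frequencies are already known and only estimates the fractions for one individual with a simple EM loop. The function names, the simulated frequencies, and the 70/30 mix are all made up for illustration.

```python
import random

def simulate_individual(true_q, freqs, rng):
    """Simulate 0/1/2 genotypes for one person with ancestry fractions true_q."""
    genotype = []
    for j in range(len(freqs[0])):
        count = 0
        for _ in range(2):  # two allele copies per marker
            # Pick which ancestral population this copy came from
            k = rng.choices(range(len(freqs)), weights=true_q)[0]
            count += rng.random() < freqs[k][j]
        genotype.append(count)
    return genotype

def estimate_ancestry(genotype, freqs, n_iter=100):
    """EM estimate of one individual's ancestry fractions.

    genotype: list of 0/1/2 counts of a reference allele, one per marker.
    freqs:    freqs[k][j] is the reference-allele frequency at marker j in
              hypothetical ancestral population k (assumed known here).
    """
    k_pops = len(freqs)
    n_markers = len(genotype)
    q = [1.0 / k_pops] * k_pops  # start from equal ancestry fractions
    for _ in range(n_iter):
        new_q = [0.0] * k_pops
        for j, g in enumerate(genotype):
            # Chance that a random allele copy is reference / alternate
            p_ref = sum(q[k] * freqs[k][j] for k in range(k_pops))
            p_alt = 1.0 - p_ref
            for k in range(k_pops):
                # Expected number of the two copies at marker j that
                # trace back to population k, given the observed genotype
                new_q[k] += g * q[k] * freqs[k][j] / p_ref
                new_q[k] += (2 - g) * q[k] * (1.0 - freqs[k][j]) / p_alt
        q = [x / (2.0 * n_markers) for x in new_q]
    return q

# Toy data: two hypothetical ancestral populations, 2,000 markers
rng = random.Random(0)
n_markers = 2000
freqs = [[rng.uniform(0.05, 0.95) for _ in range(n_markers)] for _ in range(2)]
mixed = simulate_individual([0.7, 0.3], freqs, rng)
print([round(x, 2) for x in estimate_ancestry(mixed, freqs)])
```

An individual drawn entirely from one population should come out close to 100% that component, while a mixed individual lands in between, which is the pattern in the bar plots.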
To the right you see runs from K = 2 to K = 4, meaning 2, 3, and 4 ancestral populations respectively. Each bar is divided by color into population assignments. From top to bottom I have Sindhis, French, Papuans, Chinese, Bengalis, Gujarati_A, and Gujarati_B. Gujarati_A is the group of Gujaratis who are more diverse and do not form a tight distinctive cluster in the plot above. Gujarati_B are those who do form a tight distinctive cluster.
At K = 2 the Chinese and French elements are clear. This is an east vs. west division. South Asians are a bit more western than eastern, as you might expect, with the Bengalis being the most eastern, and the Sindhis the least. At K = 3 you see a Papuan element break out in red, which South Asians share. Note that my parents have the green element which is nearly 100% in Chinese. Finally at K = 4 you see a South Asian component, in green now.
Note the difference between the two Gujarati samples. Gujarati_A has much more of whatever is distinctive to the French. Gujarati_B is more singularly South Asian. At a finer-grained level, you can see in this figure that Gujarati_B is very uniform in ancestral quanta in comparison to Gujarati_A.
What does this tell us? PCA pulls out only a small fraction of the genetic variance, but because it extracts independent dimensions it is really informative about between-population differences. You don’t want to include related individuals in this sort of analysis, because they often show up as their own cluster, since they share so many genetic variants. To me Gujarati_B looks like a specific group within that ethnic group whose members are broadly related because of endogamy. This is a common pattern among South Asians because of jati. And in the paper I referenced above, it shows up in the genomics.
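Here is a toy Python sketch of why a drifted, endogamous subgroup can claim a principal component for itself. Everything in it is hypothetical: the allele frequencies are simulated, the amount of drift is arbitrary, and the first component is found by plain power iteration on the individual-by-individual matrix rather than by whatever the paper's authors used. It only illustrates the general behavior of PCA, not the Gujarati data.

```python
import random

def simulate_group(freqs, n_people, rng):
    """Draw 0/1/2 genotypes for n_people given per-marker allele frequencies."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_people)]

def top_pc_scores(genotypes, rng, n_iter=100):
    """Score of each individual on the first principal component (unit length)."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each marker on its mean across all individuals
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n-by-n matrix of similarities between individuals; its leading
    # eigenvector gives the PC 1 scores up to sign and scale
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)]
            for i in range(n)]
    v = [rng.gauss(0, 1) for _ in range(n)]  # random starting vector
    for _ in range(n_iter):  # power iteration
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    return v

rng = random.Random(0)
base = [rng.uniform(0.1, 0.9) for _ in range(1000)]
# Hypothetical endogamous subgroup: same frequencies plus random drift
drifted = [min(0.95, max(0.05, p + rng.gauss(0, 0.15))) for p in base]
group_a = simulate_group(base, 20, rng)
group_b = simulate_group(drifted, 20, rng)
scores = top_pc_scores(group_a + group_b, rng)
print("A:", [round(s, 2) for s in scores[:20]])
print("B:", [round(s, 2) for s in scores[20:]])
```

With this much simulated drift, the two groups should land on opposite sides of the first component, even though they started from the same underlying frequencies.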
Why does this matter? Because different South Asian groups have different genetic susceptibilities. England has many more South Asians, so it is known there that Bengalis in particular have a greater risk of type 2 diabetes, all things controlled, than Punjabis. But more specifically, individual endogamous groups should have their own genetic predispositions. If this is true, and Gujarati_B is an endogamous group, then pooling it into the broader Gujarati sample may be problematic, at least without correction. If the Gujarati HapMap sample is used as a proxy for South Asian genetic variance, then the peculiarities of Gujarati_B may produce false positives when it comes to generalizing to South Asians more broadly. In contrast, Gujarati_A are all co-ethnics and, as South Asians, likely the outcome of long-term endogamy too. But the individuals may be the outcome of different histories, and so only share risk variants common to Gujaratis, or perhaps even South Asians.
That’s the practicality. On to the reason why I hope the readership here will be informative: does anyone know what this difference in the Houston Gujaratis could be? I have speculated that perhaps Gujarati_B are a specific jati which is well overrepresented in the United States. Ethnographically, Gujaratis should have a sense of this, I assume, just as non-Sylheti and non-Mirpuri Bangladeshis and Pakistanis in England are aware of the dominance of these two subgroups.
Addendum: For those who want meatier South Asian focused population genomics, please see Zack Ajmal’s continuous series of results. Looks like he just got a new batch of participants.
By the way that paper you referenced is surely an old classic trick. Let’s divide the natives into North and South and let the North think they are better and that they are related closely to westerners. Come on, that has been done to death.
Indian, Indus valley, whatever you want to call it, has been around for at least 5000 years. Western civilization came FROM eastern civilisation. We brownies were here first, whether the world wants to believe it or not. And the Africans were before us. There is no Indo-european language, they are Indian languages. There was no Aryan invasion. This so called history that is touted is bs and pushed by people who can’t stand the fact that Indians were here first and did it all before the west came along. The Africans before that.
Don’t buy into this colonial, imperialistic nonsense. Think for yourself. The strength of India comes from its unity. That is why people are so afraid of that unity and strength being realized. If India actually united and realized its rich and deep cultural history and became proud of it again, they could take over the world.
There are still too many macaulayite desis unfortunately.
Who ever said the Northerners were better? All the best desi minds come from the South and East. Ramanujan? Bose? Chandrashekar? However you want to rationalize it, the fact is that most people from the subcontinent of all castes and creeds have a European-like component and an ancient Asian component to their ancestry. How is that “macaulayite?” I guess thinking for yourself means only paying attention to views that make you feel good about yourself.
And this comment is why India will never be a truly free, independent nation. Because the Indians themselves actually rail against the idea that they may have an amazing cultural heritage untainted by european influences. They have been so brainwashed by the propaganda that they believe it themselves. Not only believe it themselves but allow themselves to argue against the idea that they may have a strong, independent culture. As long as these attitudes persist, India will always be a pawn in a game which they don’t even realize is being played around them.
India really needs a wake-up call.
I think I said it best when I said “What type of facts are those?”
Peace out homies
The European-like component is older than the Vedas and NEVER ACTUALLY LIVED IN EUROPE!