The undersampled 1 billion (genetically that is)

[Figure: nastruc1.png]

Two issues compel this post. One is practical. The other is more, shall I say, spiritual (or at least fun!). As for the first, a few weeks ago I reviewed a paper which reported that the response to a particular leukemia treatment regimen depended on the amount of Native American ancestry an individual had. One has to be specific here, because many people who are white or black American have significant Native American ancestry (Brett Favre’s paternal grandfather was Choctaw), and many people who identify as Native American may not have as much Native American ancestry as others. But for the purposes of this blog post, I want to bring to your attention the figure above, which I extracted from the paper. Its implications may pose a major problem in the future for South Asian biomedical research in the United States.

The chart is a bar plot where each thin slice is one of several thousand children suffering from leukemia, and the color coding shows the amount of ancestry they have from a particular geographic region. So, most self-identified white children have predominantly European ancestry (red), though there are other elements (gray is African, purple is Native American, and green is Asian). Black Americans are mostly African, with a minority white component. Hispanic Americans are a mixture of various groups, depending on ethnicity. Finally, notice the Asian Americans: they are mostly Asian, but there is a minor European component! What’s going on here? It turns out that the hospitals recording race information are actually using the government categories. So Asian American includes South and East Asians, two very different groups genetically. In fact, genetically South Asians are somewhat closer to Europeans, on average, than they are to East Asians.

This is a problem, because as many of you know different ethnic groups have different responses to various drugs, not to mention divergent health risks (raise your hands if you have a parent with type 2 diabetes), and aggregating these genetically very different groups into one racial category is problematic for biomedical research. I also realized that I was part of the problem, insofar as I do recall blithely entering “Asian” into the forms that were occasionally presented to me. Of course I’m aware that I’m a different kind of Asian than a Korean person, but I hadn’t given thought to the downstream effect of my decision to check the Asian box. And I’m someone with a life sciences background!

Unfortunately, to add to this confusion of socio-political terms there is the problem that South Asians are one of the most undersampled groups when it comes to genetic variation in the world. Here are the South Asian groups in the Human Genome Diversity Project:

  • Balochi
  • Brahui
  • Burusho
  • Kalash
  • Makrani
  • Pashtun
  • Sindhi

Notice something? They’re all Pakistani. Why? Because apparently the government of India threw a lot of red tape at the scientists who wanted to collect Indian data. This is still apparently an issue. An Indian American graduate student acquaintance told me recently that there had still been difficulty getting access to genetic material for the 1000 Genomes Project, though it looks like some Indian results will finally come online late in 2011.

Unfortunately even the selected groups won’t be enough. South Asians are a very diverse category genetically, with a long history of endogamy and population substructure. What that means in plain English is that for medical purposes it may be particularly difficult to generalize from a small set of groups to the whole population for South Asians (in contrast to Europeans and East Asians, who are rather homogeneous). This is one reason that Indian Americans are useful, in that the regulatory barriers are somewhat lower in the United States. But as we all know Indian Americans are not a representative sample of South Asians, so that is no panacea.

[Figure: eurvar.png]

Speaking of representative, note that above I mentioned offhand that on average South Asians are genetically closer to Europeans than East Asians. I was careful to say on average, because I am part of the minority which is closer to East Asians than Europeans. Because so much of the research has been focused on Pakistani groups as proxies for South Asians, I believe that there has been an inordinate lack of awareness of the eastern element among groups like Bengalis. The 2009 paper “Reconstructing Indian population history” was a seminal contribution to understanding how South Asians came to be, but it too tended to neglect the eastern element of South Asian ancestry. The figure which I’ve posted is a PCA plot. In plain English the x and y axes represent the two largest dimensions of genetic variation in Eurasian populations. They are not to scale. The x axis explains much more genetic variation than the y axis. But the plot shows the relationships of various populations clearly. Please observe my position. I’m clearly an outlier when compared to South Asians, shifted toward East Asians as you’d expect, away from the dominant linear trend.

Or am I an outlier? Because eastern South Asians are undersampled compared to other South Asians, my outlier position may to some extent simply reflect how few Bengalis have been genotyped (or Assamese, people from Orissa, etc.). There may be plenty of people between me and the primary South Asian cluster; we just don’t know.
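
For readers who want to see what a PCA plot like this involves computationally, here is a minimal sketch in Python with NumPy. It is not the code behind the figure above. The data are synthetic, and the function name pca_top_components, the two made-up populations, and every parameter value are assumptions for illustration only.

```python
import numpy as np


def pca_top_components(genotypes, n_components=2):
    """Project individuals onto the top principal components.

    genotypes: (individuals x SNPs) array of allele counts (0, 1, or 2).
    Returns (coordinates, share of total variance explained per component).
    """
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # center each SNP column
    # SVD of the centered matrix; singular values are sorted largest first.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    coords = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return coords, explained[:n_components]


# Synthetic data only: two hypothetical populations whose allele
# frequencies differ by a small random amount at each SNP.
rng = np.random.default_rng(0)
n_snps = 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, n_snps))
pop_b = rng.binomial(2, freq_b, size=(30, n_snps))
genotypes = np.vstack([pop_a, pop_b])

coords, explained = pca_top_components(genotypes)
print(coords.shape)  # one (x, y) pair per individual
print(explained)     # share of variance on the first and second axes
```

The second returned value is the share of total variance each axis carries, which is the sense in which the x axis of a PCA plot can explain much more variation than the y axis even though both are drawn at the same size.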

Unlike the problem with medical research (which will require consciousness raising, politics, and a shift in fiscal resources), you can help “fill the gap” in our knowledge. I generated the PCA above from results produced by my friend Zack Ajmal, who is behind the Harappa Ancestry Project. The goal of HAP is basically to fill in the gaps in South Asian genetics by collecting data from individuals who have been typed by 23andMe. Zack is looking for data from people of Iranian, South Asian, Tibetan, or Burmese background (he is also taking data from people with partial South Asian ancestry). You can see who has submitted so far. Unfortunately, my own family represents all of Bengal currently (only unrelated individuals are useful for this, so my parents mean an N = 2, as I’ve now been removed from the data set). There is only one respondent who has provided data who is from Uttar Pradesh. Zack has been cranking away at this for under a month, so I think he’ll have a more robust data set soon enough.

That’s all for now. In the next post I will address a strange pattern in the sample of Gujaratis from Houston, which is one of the few public Indian data sets. Genetics is powerful, but it has to be supplemented with other sorts of information. Hopefully readers of this weblog can chip in to resolve this mystery, because from what I can tell geneticists see the pattern but don’t know what might explain it (I have a hunch).

Addendum: If you want to contribute to HAP, or know someone who might want to, here’s the page with the details. 1 million SNPs from 23andMe will now cost you $260. I wouldn’t recommend it if you want medically actionable results. But you get your raw data, so you can do your own analysis.

22 thoughts on “The undersampled 1 billion (genetically that is)”

  1. For those thinking of signing up for 23andme, keep in mind that they OWN YOUR DNA. Use it for Scientific and Commercial research. By signing up to this service you consent to the use of your dna in any experiments 23andme and its partners and/or third parties 23andme decides to share your data with choose to conduct. Cloning? Organ generation? Gene specific bio-weapons? The sky’s the limit! For only $399.99 you have the pleasure of giving up your genetic identity to the highest bidder. Cheers

    • @mario Quite right. 23andme’s co-founder is the wife of Google’s Sergey Brin – who knows you down to your tastes in XXX. A self-selecting database of schnooks if there was one.

      Would be worth a lot of money to Chemical Alis and AQ Khans, labelled by south asian ethnic/caste origin, to minimize friendly casualties in Hindustan. And dichotomists.

      @Alina-M Don’t get hung up on small things, it’s a running theme on SM. It’s my personal observation that South Asians are notoriously literate in race/gene lingo, as opposed to culture/arts/history/religion/literature etc., compared to other populations. An indicator of what they spend their free time doing.

  2. Interesting, Mr. Khan.

    However, I’m skeptical about how linear the arc is from Iranians to South Indians.

    Luigi Cavalli Sforza has written about the lack of genetic variation in Europe, and him and another researcher – one Malhotra – found out that NW India/Pakistan/Afghanistan area has a LOT of genetic variations.

    Also, I read that perhaps it was the Great Plague that had the effect of homogenizing the population of Europe, but I’m not sure if I buy this.

    Finally, I believe that the Indian government wants to squelch this research in order to preserve unity within India.

  3. I will address a strange pattern in the sample of Gujaratis from Houston

    Very compelling! I’ll have to discuss this with the bhais.

    boston_mahesh: How do you figure Europe lacks genetic variation? mtDNA and Y-chromosome haplogroups alone are pretty diverse. Or are you looking only at blunt phenotypes?

    I don’t think the diversity of the south/central Asian nations was in question here. The positioning of the Hazara is telltale enough.

  4. Luigi Cavalli Sforza has written about the lack of genetic variation in Europe, and him and another researcher – one Malhotra – found out that NW India/Pakistan/Afghanistan area has a LOT of genetic variations.

    fyi, this is the same population set that cavalli-sforza used. but the marker density is much higher. cavalli-sforza had to rely on classical genetic techniques, so he didn’t have many markers per person. this research uses hundreds of thousands of markers per individual.

    Also, I read that perhaps it was the Great Plague that had the effect of homogenizing the population of Europe, but I’m not sure if I buy this.

    the population drop was not great enough, so that wouldn’t do it. the homogeneity of europe probably has deeper roots, since most of the continent was empty at the end of the last ice age.

    the linearity is an artifact of the fact that you’re looking at the two top dimensions of variance. there are others, though the east-west one (x axis) is an order of magnitude bigger than the other ones in the low number values (high PC).

  5. boston_mahesh: How do you figure Europe lacks genetic variation? mtDNA and Y-chromosome haplogroups alone are pretty diverse. Or are you looking only at blunt phenotypes?

    if you remove african admixture from the middle east, indians may be the most diverse non-african group in the world. though it depends on which populations you compare, as both india and the middle east have a lot of diverse groups. genetic diversity is arrayed like so:

    african > mideast, india > europe > east asia > oceania > new world

  6. I haven’t seen a people more desperately interested in genetic origin than South Asians. Honestly the amount of chest-beating over the True Aryans, Scythians, Dravidian supremacists, Invader blood, Purity sheesh… Not to mention all sorts of inferiority/superiority complexes, stereotypes and conspiracies – not attractive ’cause I’m dark/proud of my dark skin, big nose, penis sizes damn… It seems each linguistic group/caste/religion is very keen to “explain” why they do better/worse than other South Asians from Nobel Prizes to Bollywood and quicker to judge individuals based on these pet prejudices. I think it’s rooted in a more fundamental cultural characteristic of not taking responsibility for your faults, an entitlement mentality and blaming “the man”. Grow up.

  7. I don’t think the diversity of the south/central Asian nations was in question here. The positioning of the Hazara is telltale enough.

    the hazara are exceptional. they’re around 40% east asian, which along with their “genghis khan” Y chromosomal lineages validates their self-conception as a people whose origins go back to the mongol period in iran.

  8. I haven’t seen a people more desperately interested in genetic origin than South Asians. Honestly the amount of chest-beating over the True Aryans, Scythians, Dravidian supremacists, Invader blood, Purity sheesh… Not to mention all sorts of inferiority/superiority complexes, stereotypes and conspiracies – not attractive ’cause I’m dark/proud of my dark skin, big nose, penis sizes damn… It seems each linguistic group/caste/religion is very keen to “explain” why they do better/worse than other South Asians from Nobel Prizes to Bollywood and quicker to judge individuals based on these pet prejudices. I think it’s rooted in a more fundamental cultural characteristic of not taking responsibility for your faults, an entitlement mentality and blaming “the man”. Grow up

    ….and yet, you’re the only one here who even thought to introduce such childish topics on a thread about genomics/health. Or did I miss the debate on penis size VS genetic origin?

    This is a problem, because as many of you know different ethnic groups have different responses to various drugs, not to mention divergent health risks (raise your hands if you have a parent with type 2 diabetes), and aggregating these very different groups in a genetic sense into one racial category is problematic for biomedical research. I also realized that I was part of the problem, insofar as I do recall blithely entering “Asian” into the forms that were occasionally presented to me. Of course I’m aware that I’m a different kind of Asian than a Korean person, but I hadn’t given thought to the downstream effect of my decision to check the Asian box. And I’m someone with a life sciences background

    Same, I work in a hospital lab and until recently we had a box labeled East Asian/South Asian/Pacific Islander…I guess since we’re such a small minority in the USA this gets overlooked. Thanks for sharing the data Razib

  9. Quite right. 23andme’s co-founder is the wife of Google’s Sergey Brin – who knows you down to your tastes in XXX. A self-selecting database of schnooks if there was one.

    Would be worth a lot of money to Chemical Alis and AQ Khans, labelled by south asian ethnic/caste origin, to minimize friendly casualties in Hindustan. And dichotomists.

    this is an ignorant comment which bespeaks a total ignorance of the power, or lack thereof, of modern genetics and applied biology. i’m not having a discussion about this, it’s just a fact i’m laying out there for other readers. the most brilliant geneticists at the top american universities couldn’t do any of the things outlined above in the foreseeable future. it’s laughable to presume that scientists in pakistan could. it seems like we know which commenter is illiterate in genetics.

  10. It’s my personal observation that South Asians are notoriously literate in race/gene lingo, as opposed to culture/arts/history/religion/literature etc., compared to other populations. An indicator of what they spend their free time doing.

    I’ve had the opposite experience – most South Asian immigrants I’ve met are wayyyy more well-versed in their religion/culture than in science (at least this seems typical of Paki/Afghani Moslems, the desi group I interact with most often). Many of them are superstitious and outright reject scientific beliefs in favor of their religion.

    Although desi’s (like many 1st/2nd/3rd gen immigrants) are more likely to go into STEM fields, particularly medicine/health sciences so it wouldn’t surprise me if we’re overall more versed in gene lingo than the overall American population. But I can’t pretend I’ve ever heard any discuss racial genetics the way Razib’s doing here.

  11. alina, american hindus are confessionalized (they are affiliated), but perceive themselves as relatively secular. see:

    http://religions.pewforum.org/comparisons

    click “importance of religion in one’s life.”

    all that being said, it is laughable to say that south asians are well versed in genetics. they have a lot of interest in folk taxonomy, but it isn’t grounded in the new genomics. i try to bring some of that in the comments here, and now i’m pushing in on the front pages.

    i understand that there are long recurrent themes on this weblog in relation to racism and ethnocentrism. additionally, there are real substantive concerns (not just paranoid fantasies as in comment #1) re: genetic privacy. but i’d ask readers to hold off on those issues for now, since they’re mooted regularly here and elsewhere. rather, i’m open to questions about what a PCA plot means, how south asians relate genomically to each other and to other populations, as entailed by the newest science.

  12. It’s the data-set, labelled with ethnic group, that is valuable. The state-of-the-art (as it is publicly known) to interpret the data will always improve with time.

    If this data could help in understanding why “different ethnic groups have different responses to various drugs, not to mention divergent health risks” (your words), then weaponizing a drug to target, or at the very least pose significantly more health risks to, certain ethnic groups (more than others) is plausible.

    In a terrorist hell hole like South Asia, where environmental monitoring is non-existent, this data could pose significant risks to the population there.

  13. your words then weaponizing a drug to target, or at the very least pose significantly more health risks, to certain ethnic groups (more than others) is plausible.

    no, it’s not plausible. if you knew of the difficulties of designing and deploying bioweapons you wouldn’t find it plausible at all. but you don’t know enough to make an informed assessment. instead, you’re leaning on generalities. that’s fine, you’re free to express your opinion. but stop repeating yourself now.

    • I’m not repeating myself, nor is any of it irrelevant, so removing my comment which had facts from wikipedia, was poor form.

      You’re asking for SNP (“1 million SNP will cost…”), here is what it says about it on wikipedia: Variations in the DNA sequences of humans can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. SNPs are also thought to be key enablers in realizing the concept of personalized medicine.

      You cited Human Genome Diversity Project as lacking data on Indians, here is wikipedia: In 1995, the National Research Council (NRC) issued its recommendations on the HGDP. While the NRC endorsed the concept of diversity research, it criticized the HGDP’s procedure, claiming that the HGDP had too many ethical lapses and problems. The NRC report suggested several alternatives such as doing sampling anonymously (i.e. sampling genetic data without tying it to specific racial groups). … such approaches would eliminate the concerns discussed below (regarding racism, weapons development, etc.)….

      It goes on to say that there were reports that funding for it had been drained due to protests based on ethics.

  14. What is a computer science (and EE) person, originally from Lahore, doing collecting raw DNA samples?

    He could be a researcher. Computer sciences and biological sciences are linked, especially genomics which is heavily computational in nature and an interdisciplinary field – people approach it with backgrounds in engineering, physiology, neural sciences, biochem, etc…

    hey, i don’t know this guy and i have no clue why he’s collecting raw DNA samples, but i’m willing to give him the benefit of the doubt is all.

    @Razib – Just curious, where would sub-saharan Africans fall on the PCA plot? In regards to homogeneity, are they more/less/just as diverse as South Asians? thanks

  15. Just curious, where would sub-saharan Africans fall on the PCA plot? In regards to homogeneity, are they more/less/just as diverse as South Asians? thanks

    the first component of variance always separates africans from non-africans. here’s a PCA i generated from my own pooled data set:

    http://blogs.discovermagazine.com/gnxp/files/2011/02/HGDPme.png

    africans are more diverse than south asians. they’re more diverse than anyone else. in some ways you can model all non-africans as a subset of northeast african. the main exception to this is the 4% neandertal in all non-africans, as well as the 4-6% other hominid on top of the neandertal among oceanians.

    please stop responding to the “concern troll.”

  16. Is there a particular reason that there are this many Hazaras in the database at all? Is it data from the Y chromo studies done previously? Just curious, as a Hazara, what variables are on the two axes on that graph to make Hazaras fall smack dab in the middle.

    Also, I gotta say, from personal experience over the past 4 years volunteering for bone marrow drives, brown people do NOT even entertain the idea of registering. That includes south american brown and south asian brown.

  17. Is there a particular reason that there are this many Hazaras in the database at all? Is it data from the Y chromo studies done previously? Just curious, as a Hazara, what variables are on the two axes on that graph to make Hazaras fall smack dab in the middle.

    the hazaras are in the database because they sampled various pakistani groups, and i guess there were some pakistani hazaras that they got. the two axes are basically west-east and north-south in eurasia. the hazaras clearly have substantial mongol ancestry admixed with persian & northwest south asian lineages. so they are between west and east eurasians, and pulled a touch south.

  18. I am an EECS (Electrical Engineering and Computer Science) guy. I am from Atlanta, not Lahore. I went to school in Lahore almost 2 decades ago. I do not run any blog on blogspot, so zachandamber.blogspot is not mine. My blog is linked via my name on this comment. And calling my blog Islamic is almost like calling Razib’s Brown Pundits blog Islamic. :)

    While there are no Ashkenazi samples in HGDP, there are other studies and papers about Jewish genetics and data is available publicly. For example, Behar et al has Jewish samples from a number of regions.

  19. Sanjaya:

    zackandamber.blogspot is where I started blogging almost a decade ago. I moved to my own domain in May 2003 and lost control of the blogspot URL soon after. I have no idea whose blog that is now. And I have no link to the Google cached page you linked to.