*Corresponding Author:
H. Padh
Provost, Vanita Vishram Women's University
Athwa Gate, Surat, Gujarat 395007, India
E-mail: [email protected]
Date of Submission 05 December 2020
Date of Revision 03 July 2021
Date of Acceptance 04 November 2021
Indian J Pharm Sci 2021;83(6):1107-1113  


Remarkable advancement in next generation sequencing technology has made personal genome analysis feasible and affordable. Here we present the whole genome sequence analysis of an individual from the Kashmir region of India who is from the pandit community (IHGP04). The Indian male’s genome was sequenced at 38.6 X coverage with a total of 110 Gb of sequence data generated using the Illumina HiSeq 2000 platform. The variant analysis revealed over 3.6 million single nucleotide variants and 5 46 681 small insertions and deletions including about 2.7 % of novel (unreported) variants. The known variants were analyzed for their health and disease relevance and drug pharmacogenetic profile. The mitochondrial haplogroup (L3e2b1a1) is indicative of maternal ancestors’ arrival to the Indian subcontinent about 70 000 y ago, while the Y-chromosome haplogroup (J) is suggestive of the arrival of paternal ancestors less than 25 000 y ago. The haplogroups also suggest different routes used by the ancestors to arrive on the subcontinent. The analysis and validation of the novel variants and relationship with health and diseases should be the next logical step.


Genome analysis, genome and health, chromosomes, lymphocytes, blood

Indian population has two features that collectively make it a unique population to understand: 1.3 billion Indians make up 1/6th of the world population and is perhaps the most diverse in its anthropological as well as genetic makeup. It is a matter of concern that the Human reference genome originally developed in 2003[1] is derived from about 200 anonymous individuals from six countries: China, France, Germany, Great Britain, Japan and the United States, but did not have Deoxyribonucleic Acid (DNA) from any Indian. Subsequently, significant work has been done on Indian genomes but not much variant data from the Indian population has been added to the human genome variant database[2-7]. This leaves the reference human genome not an ideal representative of the human race. The reference Human Genome and related variant databases need to be enriched with data from underrepresented populations.

It has been an effort of our laboratory to sequence and analyze Indian genomes with the following broader objectives: To help develop the variant database of the Indian population; to enrich the reference human genome variants with new variants reported from the Indian population; to understand the human migration waves and admixing during past 70 000 y which now make up the Indian population.

With affordable technology and accessible reference genome database,   analyzing   individual   genome has become feasible to understand one’s health and associated health related risk factors. Collectively the database has also revealed important aspects of the development of the human race. Particularly, Y-chromosome and mitochondrial haplogroups have helped to develop a high resolution human migration map during the past 100 000 y. With individual genome analysis, we are now able to ascertain the ethnic affiliation of the ancestors and develop a migratory pattern for that particular group and develop a health profile of the individual.

The contemporary Indian population is an admixture of several waves of human migration from various directions, which is believed to have been segregated into various communities over the past 2000 y or so[2]. In this context, we have recently reported a complete sequence analysis of a male from the Western part of The contemporary Indian population is an admixture of several waves of human migration from various directions, which is believed to have been segregated into various communities over the past 2000 y or so[2]. In this context, we have recently reported a complete sequence analysis of a male from the Western part of

The objectives of the present study were-DNA sequencing of a male Indian individual from Kashmir and analysis of the genome variants for known risk factors for health and disease; to identify unreported novel variants present in this individual; and through the analysis of mitochondrial and Y-chromosome haplogroups decipher the migratory routes as well as the time for the ancestors of this individual to reach India.

Materials and Methods

Sample selection and DNA isolation:

A healthy 61 y old male individual (IHGP04) who is a Hindu Brahmin resident of Kashmir in Northern India, with no chronic disease condition was selected for this study. The institutional ethics committee approved the study and the signed informed consent was obtained from the individual before initiation of the study. After the approval, the blood sample (10 ml of the peripheral blood) was withdrawn using a heparinized needle and was collected in anticoagulant (Dipotassium     Ethylenediaminetetraacetic     Acid (K2 EDTA) or sodium heparin) precoated collection tubes under aseptic conditions. Genomic DNA was extracted using the phenol-chloroform method[8-11]. Assessment of the isolated DNA sample was carried out in terms of quality, quantity and integrity of the DNA. The DNA quality and quantity were measured using NanoDrop 2000 (Thermo Fisher Scientific Inc., USA) and Qubit 2.0 DNA Broad Range Assay (Invitrogen, USA). The sample was also run on 1 % agarose gel to determine the quality of genomic DNA.

Biochemical analysis and family disease profiling:

The basic biochemical analysis like haemogram profiling, lipid profiling [total cholesterol, triglycerides, High-Density Lipoprotein (HDL), Low-Density Lipoprotein (LDL)] and biochemical profiling (blood urea, serum creatinine, liver tests, blood and urine glucose) was performed using the manufacturer’s protocol. The demographic details and the disease history profiling of the individuals and their relatives were also recorded.

Cytogenetic analysis:

Chromosomal aberrations accounting for the identification of genetic diseases through the photographic representation was performed using karyotyping technique. Karyotyping was carried out using peripheral blood lymphocytes by applying standard techniques and Giemsa Banding (GTC banding) with 400-500 band resolution. No consistent structural and numerical chromosomal abnormalities were detected in the cytogenetic analysis through G-banded karyotyping chromosome imaging.

Library construction and sequencing:

The isolated genomic DNA was sent to Theragen BioInstitute, the Republic of Korea for sequence analysis. After passing the quality check measures the isolated sample was processed for DNA library preparation.

Library construction: Genomic DNA was fragmented using Covaris S220 (Covaris Inc, USA) to a targeted size of 350–500 bp. The fragmented DNA was then end-repaired, ligated to Illumina TruSeq adapters and Polymerase Chain Reaction (PCR)-enriched using TruSeq DNA sample preparation kit (Illumina, USA) according to the manufacturer’s protocol. The final sequencing library was quantified using KAPA kit (KAPA Biosystem, USA) on Agilent Stratagene Mx- 3005p quantitative PCR (Agilent, USA). The library size was confirmed using Agilent Bioanalyzer High Sensitivity DNA Chip (Agilent, USA). The parameters of the library constructed were similar to what was published earlier.

Whole-genome sequencing: The resulting library obtained was sequenced using an Illumina flow cell and 202 cycles on the Illumina HiSeq 2000 platform (Illumina, USA). A total of 98 GB of raw data was generated on the sequencer for IHGP04 sample, 98.9 % of mappable reads were mapped with a mean depth of 34.6 X.

Alignment and variant analysis:

Pre-processing: The raw data generated after the sequencing in the form of Binary Base Call (BCL) files were first converted plain-text QSeq format files, followed by FASTQ format using Illumina’s BCL converter. After conversion, the pre-processing was carried out. In pre-processing, the duplicated pair- end reads  and  reads with  more  than 2Ns  (N  bases representing uncalled nucleotides) were filtered. This ensured the high quality reads in the sample.

Mapping of reads: We first filtered low-quality reads by using Next Generation Sequencing Quality Control (NGS QC) toolkit v 2.3.3 with ‘-l 70 -s 20’ options cutoff read length for HQ=70 %, cutoff quality score=20). After filtering, the filtered reads were aligned onto the Human Genome version 19 (hg19) using Burrows-Wheeler Alignment-Maximal Exact Match (BWA-MEM) 0.7.8 with the default option. In addition, SnpEff with dbSNP138, dbSNP142, 1000 Genomes Phase 1 release v3, and ESP6500 were used to supplement hg19. SAM files were converted to BAM files using Samtools 0.1.19. PCR duplicated reads were removed by MarkDuplicate subroutine in Picard v1.9.2 (http://broadinstitute.github.io/picard/). To increase the accuracy of variants call, IndelRealigner and BaseRecalibration in GATK v2.3.9 were used. The variants were called by GATK UnifiedGenotyper with ‘--heterozygosity 0.0010 -dcov 200 -stand_call_conf 30.0 -stand_emit_conf 30.0’ options.

Identification & Classification of Variants: The identification of all the variants like Single Nucleotide Polymorphism (SNPs), Single Nucleotide Insertions and Deletions (InDel), structural variants, Copy Number Variations (CNV) and multiple nucleotide variants were carried out using the SnpEff software [(SnpEff version 3.4 (build 2013-11-23), by Pablo Cingolani]. The identified classes of variants like SNPs, InDels and multi-base Indels were subjected to classification in the form of known and Novel variations by aligning the data to dbSNP v137.

Variant analysis: A comprehensive list of 500 plus genes associated with various disease categories was screened and generated from the databases like Online Mendelian Inheritance in Man (OMIM), The National Center for Biotechnology Information (NCBI), SNPedia etc., The generated gene list consisted of genes associated with major diseases such as neurological disorders, cardiovascular diseases, cancer, diabetes, obesity, asthma and allergy. In addition, genes involved in drug metabolism and nutrigenomics were included in the data set.

The known SNPs in the coding region of the sample (IHGP04) were analyzed using the comprehensive gene list to assess the presence of the potential risk variants involved in the disease susceptibility.

The amino acid position of the novel variants in the coding region of disease-associated genes was determined using NCBI. The novel variants were then evaluated for their Polymorphism Phenotyping v2 (Polyphen score)[12] to predict the variants with significant damaging/deleterious effects.

Results and Discussion

The isolated human DNA sample (IHGP04) was subjected to Whole Genome Sequencing (WGS) using Illumina HiSeq 2000. A total of 98 Gb of sequence data (736 million high-quality pair-end reads of length 101 bp) was generated for the analysis. The sequence data thus generated was then aligned to the human genome reference v37.1 (hg19). Out of the total reads generated, 98.9% of the reads were successfully mapped with an average of 38.6 X coverage of the human reference genome (Table 1).

Total Reads 736 428 860
Read Length (bp) 150
Total yield (Mbp) 110 464
Reference size (Mbp) 2858
Throughput mean depth (X) 38.60
De-duplicated reads 695 64 382
Mappable Reads 659 280 879

Table 1: Preprocess Data Statistics Summary

The sample IHGP04 was aligned with the human genome reference v37.1 (hg19) for identification of variants (includes SNPs, Indels, CNVs and SV), as shown in Table 2A, Table 2B and Table 2C. In total 3 608 953 Single Nucleotide Variants (SNVs) and 546 681 small INDELs were identified in the IHGP04 genome. To get the estimate of novel variants, the SNVs were compared with those of the dbSNP 138 database. Out of the total 3 642 767, 96.3 % SNVs were reported in the dbSNP 137 database and subsequently, 97.6 % were found in dbSNP142. Total of 208 604 novel SNV and 68 354 novel INDELs were identified. Further classification of the SNVs showed 18 631 were homozygous and 32 961 were heterozygous in intragenic regions. Apart from SNVs, a total of 3545 short indels (up to ±20 bases) were identified in intragenic homozygous state and 6714 heterozygous INDELs were also identified, of which 789 were novel INDELs in the intragenic region. These and other characterizations of the variants are provided in Table 3. The genome also contained larger structural variants including inversions, translocations and duplication which are listed in Table 2C.

  SNPs Small insertions Small deletions
Variants 3 608 953 273,806 272,875
Synonymous variants 11 548    
Non-synonymous variants 10 267    
Splicing variants 256    
Stop gained 66    
Stop loss 46    
Frame shift 346    
Found in dbSNP138, % 96.3    
Found in dbSNP142, % 97.3    
Het/Hom ratio 1.67    
Ts/Tv ratio 2.0771    

Table 2A: Variant Analysis

  No. of Variants
Copy number gains (>2) 618
Copy number losses (<2) 380

Table 2B: Copy Number Variant

SV type No. of Variants
Duplications 79
Insertions 2173
Deletions 5229
Inversions 88
Translocations 130

Table 2C: Structural Variants

Coding SNPs (cSNPs) may lead to amino-acid substitution in proteins encoded by the gene. Out of a total of 24 620 cSNPs in the IHGP04 genome, 12 091 were identified as non-synonymous substitutions (nsSNPs), of which about one-third were homozygous SNPs. In addition, 685 novel SNPs were of the nsSNPs category. Any possible significance of nsSNPs needs to be analyzed to determine the contribution to phenotype. Among the novel INDELs, 74 indels were found in the coding region (Table 3).

Homozygous Heterozygous Novel  SNP Homozygous Heterozygous Novel INDEL
Total 1 495 832 25 83 170 2 08 796 2 80 604 4 93 097 68 354
INTERGENIC 830 954 14 27 671 1 25 346 1 46 733 2 48 154 35 749
INTRAGENIC 18 631 32 961 1777 3545 6714 789
UPSTREAM 36 878 67 479 6718 8305 14 357 2388
DOWNSTREAM 41 323 79 606 7571 9147 17 158 2517
UTR_5_PRIME 1552 2807 276 303 317 89
UTR_3_PRIME 8940 15 692 1294 2199 3 805 398
INTRON 545 130 9 32 742 63 764 1 09 631 2 01 394 26 222
Noncoding exon variant 3761 8255 892 493 883 128
Synonymous mutation 4488 8041 473 0 0 0
Non-synonymous mutation 4175 7916 685 1 1 0
Indels in coding region 0 0 0 247 314 74

Table 3: Variant Mapping Statistics

Based on analyses several variants in the non-coding intragenic regions can also be ‘probably damaging’[12-14]. Novel variants need to be thoroughly analyzed for their locations in the coding region, possible effect on the structure-function of the putative proteins and overall impact on the phenotype. The frequency of the occurrence of the novel variants in the target population should be ascertained.

Mitochondrial DNA (mtDNA) sequence is widely used to understand the maternal genetic and migration history of human populations[15]. Variation in the mtDNA sequence has been analyzed in the human population, both in terms of evolution and population dispersals and in terms of the role that mtDNA mutations play in human disease[16-19]. The analysis of the IHGP04 mitochondrial genome (16 569 bp) showed that it had 41 SNPs as compared to the Cambridge Reference Sequence (rCRS)[17]. The haplogroup analysis of the mitochondrial sequence was carried out using Mito Tool (http://www.mitotool.org/)[20-22] that revealed L3e2b1a1 haplogroup to be the most related haplogroup for IHGP04. L3 haplogroup is found in the Indian subcontinent, particularly in the southern and eastern parts of India. The members of this group are believed to be the earliest arrival on the subcontinent directly from Africa probably through the coastal route and spread further to the Australian Aboriginal population.

The timeline of arrival in India is about 70 000 y back (http://www.eupedia.com/europe/Haplogroup_J_ mtDNA.shtml) (fig. 1A).


Fig. 1: Illustration of tracking maternal and paternal ancestry through analysis of mitochondrial and Y-chromosome haplogroups, respectively. Mitochondrial DNA sequence analysis pointed to L3 group which is expected to be the earliest entrants to the subcontinent. They perhaps came directly from Africa through coastal route. Paternal haplogroup was J group which is also abundant in Middle-East, Northern Africa, Southern Europe and West Asia. Its entry into the subcontinent is dated around 20 000 y ago. The inset picture depicts distribution of J haplogroup.

Similarly, the Y haplogroup was calculated using the Y-chromosome DNA sequence SNP index (http://www.isogg.org/tree/ISOGG_YDNA_SNP_Index.html) and the variants for IHGP04 individual. J haplogroup was determined for the IHGP04 sample (fig. 1B), which is found in southern Europe, the Northern Sub-Saharan region around the Mediterranean Sea, along with West Asia and the Indo-European belt[22-24]. The arrival of this haplogroup members in India is estimated to be later than 20 000 y back through land routes from the North-West region. Drastically different   ancestors of the maternal and paternal lineages, in terms of the time of arrival and the haplogroups, indicate extensive admixing of the populations on this continent. Earlier, we have analyzed the genome of the Gujarati male from the western region of India[8]. In that particular case, the haplogroup migration pattern for both mtDNA and Y chromosome followed a similar timeline and route of migration[8,22-24].

The analysis of the Copy Number Variation (CNV) was carried out using the readDepth program. This program detects CNVs by measuring the depth of coverage. The CNV in terms of gene loss and gene gain was reported for the IHGP04 sample. Out of the total 998 CNVs, the genome contained a total of 380 deleted genes (copy number <2) and 618 duplicated genes (copy number >2), shown in Table 2B.

To have a health profile of an individual, the cSNPs identified in the IHGP04 genome were annotated using several databases like OMIM, SNPedia, Human Gene Mutation Database (HGMD) and others[25-31]. A comprehensive list of 500 plus genes and their variants that have been associated with disease susceptibility or risk was developed. Further, all the variants of the IHGP04 genome were analyzed, which revealed susceptibility to hypertension, lung cancer, heart disease and others. The variant genes and their correlation to health and disease are listed in Table 4. The individual has a lower risk of Alzheimer's and dementia due to variants in the Cholesteryl Ester Transfer Protein (CETP) gene.

Disease profile  
Increased risk  
AGT, CYP4A11 Higher hypertension risk
HNF1A, MMP9, XPC ~2X increased lung cancer risk
LRP8, SNX19, GIPR & ROS1 Increased risk for heart disease; better response to statins for SNX19
HCRTR2 2-6X increased risk for cluster headaches
TAS2R38 possible unable to taste bitter
ATF4 1.78X increased risk for schizophrenia in males
NAT2 Increased risk of hearing loss
EDA2R Increased risk of baldness
ERAP1 Higher risk for spondylitis
ERAP2 1.3X increased risk for preeclampsia in most population
FAM71F1 >2X increased risk of familial obesity
Protection/reduced risk  
CEPT Lower risk of dementia and Alzheimer’s
Pharmacogenomics profile  
KCNJ11 Better response to Repaglinide in case of T2D
SLC22A2 may have reduced risk to nephrotoxicity in response to cisplatin treatment
HSPA1L May be protective against adverse reaction to carbamazepine
KIF6 May have increased risk for adverse cardiovascular events when treated with statins

Table 4: Disease Association And Drug Response Prediction In Genome Ihgp04

The genetic variants (SNPs) present in an individual can be used for the prediction of adverse drug reactions and also manage the effective dose of drugs prescribed. To understand the pharmacogenomic profile of IHGP04 individual, the variants (SNPs) present in the genome were annotated based on the published literature relevant to pharmacokinetic and pharmacodynamics relevance[32-34]. Upon analysis, we found the genome to have several SNPs associated with the metabolism of or response to various drugs. The variant analysis suggests that the person may have a better response to the drug Repagalinide in case of type 2 diabetes and may have a reduced risk of nephrotoxicity to cisplatin therapy. Some of the relevant profile parameters are listed in Table 4.

With technology becoming accessible and affordable, individual genome analysis is becoming a reality. The outcome from such exercise is important to the individual to glimpse at one’s genetic background and impact on one’s health. A large number of unreported new and novel variants were present in the individual (208 796 SNPs and 68 354 Indels). The novel variants add to the database for human genetic variants. This study is particularly important because the human genetic variant database does not contain adequate representation from people of the Indian subcontinent, which makes up approximately 1/6th of the world population. More analysis of individual genomes from the underrepresented populations will eventually make the human genome database “complete” representing the entire human population. Analysis of mitochondrial DNA and Y-chromosome indicate timeline and route of human migration as revealed from the present analysis. Indian population is an admixture of several waves of migration for 80 000 y; it is therefore not possible to have a representative Indian genome. The current analysis has picked one individual from the Kashmir region of the North-Northwest part of India. Many more genome analyses across the country would help us to develop a better pic of the genetic background of Indians, which may help in the development of a better health policy of the nation.


The author is indebted to Suhani Almal for analysis of some results; Milee Agarwal, Sweta Patel for their advice during the initial phase of the project; Je Hoon Jun and Jong Bhak of Personal Genomics Institute, Genome Research Foundation, Suwon, Republic of Korea. Financial support from GSBTM is gratefully acknowledged.

Conflict of interests:

The authors declared no conflicts of interest.