Genetic data type: whole genome or whole exome?
While the genomics industry has favored genotyping and whole exome sequencing (WES) because of cost and analysis difficulty, the secrets of the genome reside in the whole genome, which only whole genome sequencing (WGS) captures. Those who can innovate to solve the herculean challenge of WGS analysis will unlock tremendous value for their stakeholders and society.
Picture the iceberg that you've been told about many times before. You remember that the visible portion at the top is only about 10% of the iceberg. The submerged portion makes up the great majority and remains unseen below the surface, even though it is the portion most vital to the structure and function of the iceberg as a whole.
Well, the genome is a bit like the iceberg, except that the proportions are even more lopsided. You've heard that there is an all-important protein-coding region of the genome, called the exome, and that the remaining 99% of the genome is "junk DNA". This great majority of the genome, the intergenic regions, contains so much noise and diversity that it has befuddled scientists. The idea that the intergenic 99% is junk arose precisely because researchers could not make sense of it.
Many studies have pushed back against the "junk" idea, showing that intergenic regions serve critical but poorly understood functions, for instance in gene regulation (i.e., control over how the famous protein-coding 1% functions). The human genome is estimated to contain at least 200,000 intergenic regulatory regions (compared to ~20,000 protein-coding genes). Most variants associated with complex disease are also located in these dark regions. Still, despite the mounting evidence for the biological importance of non-exomic regions of the genome, most genomic research has remained focused on the exome.
This dramatic, asymmetric focus on the exome is born of cost and analysis difficulties. Because genotyping and WES cover a minuscule portion of the genome, the cost of running them is much lower. The attendant data storage costs are lower too: where one human WGS is about 100 GB of data on disk, a WES is only about 1 GB, and genotyping data is a fraction of that. Data management and analysis follow the same trend, as the computational resources and time required to process a given number of WES samples are much lower than those required for the same number of WGS samples.
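To make the storage gap concrete, here is a minimal sketch that scales the per-sample sizes quoted above to a whole cohort. The cohort sizes and the `cohort_storage_tb` helper are illustrative assumptions, not figures or tooling from any particular project, and the sizes are rough on-disk approximations.

```python
# Approximate on-disk size per sample, in GB, as quoted in the text.
GB_PER_SAMPLE = {"WGS": 100.0, "WES": 1.0}


def cohort_storage_tb(data_type: str, n_samples: int) -> float:
    """Approximate storage for a cohort, in terabytes (1 TB = 1000 GB)."""
    return GB_PER_SAMPLE[data_type] * n_samples / 1000.0


# Hypothetical cohort sizes, chosen only for illustration.
for n in (1_000, 100_000):
    wgs_tb = cohort_storage_tb("WGS", n)
    wes_tb = cohort_storage_tb("WES", n)
    print(f"{n:>7,} samples: WGS ~{wgs_tb:,.0f} TB, WES ~{wes_tb:,.0f} TB")
```

Under these assumptions the ratio stays fixed at about 100 to 1, so a WES cohort of 100,000 samples needs roughly the same storage as a WGS cohort of 1,000.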
A brief comparison of the three major genetic data types
Data type | Genotyping | Whole exome (WES) | Whole genome (WGS) |
---|---|---|---|
Genome fraction | 0.02%* | 0.85%** | 100% |
Information content | Lowest | Very low | Full |
Typical size | ~20 MB | ~1 GB | ~100 GB |
Storage required | Lowest | Low | High |
Analysis required | Simple | Manageable | Complex |
Cost | Lowest | Low | Medium |
*One of the most popular genotyping chips (23andMe's v5 SNP chip) covers only ~640,000 SNPs, compared to 3,200,000,000 bases in the genome. That genome figure does not include the additional ~300,000,000 new bases identified in African and Asian populations.
**The exome makes up ~1% of the genome, but 15-20% of the exome is not included in WES due to cloning difficulty.
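The genome fractions in the table follow directly from the numbers in these two footnotes. The short sketch below reproduces the arithmetic; the variable names are ours, and the inputs are the approximate values quoted above.

```python
GENOME_BASES = 3_200_000_000  # bases in the genome, per the first footnote
CHIP_SNPS = 640_000           # SNPs on the genotyping chip, per the first footnote
EXOME_SHARE = 0.01            # the exome as ~1% of the genome
WES_MISSED = (0.15, 0.20)     # share of the exome not included in WES

# Genotyping: SNPs covered as a percentage of all bases.
genotyping_pct = CHIP_SNPS / GENOME_BASES * 100

# WES: the exome share, reduced by the portion WES does not include.
wes_pct = [EXOME_SHARE * (1 - missed) * 100 for missed in WES_MISSED]

print(f"Genotyping: {genotyping_pct:.2f}% of the genome")
print(f"WES: {min(wes_pct):.2f}% to {max(wes_pct):.2f}% of the genome")
```

The 0.85% shown in the table is the upper end of the WES range, corresponding to 15% of the exome being missed.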
More daunting still, because of the great noise present in intergenic regions, it is exceedingly difficult to tease out statistically significant differences that may underlie disease. Some have called it an impossible task. Because WES is much less noisy, the task is somewhat easier there, which adds to the previously described advantages in cost, time, and data size. That is why the industry has favored WES research and is building vast WES datasets. However, the numerous drawbacks of WES point to a much-needed course correction towards WGS (even in the clinic, where WES has seen the most success as a diagnostic, it is only effective in 28% of cases). It is on the WGS turf that the battle to elucidate inherited disease and bring about precision, preventative medicine and curative therapies will be fought and won.
In upcoming posts, we will address the "data size" question (are vast genetic datasets of millions of samples really required?) and Genetic Intelligence's singular, proven solutions to the described problems.