For the past two decades, hundreds of international scientists affiliated with the $3 billion Human Genome Project have been on a quest to read every letter in a person's genetic blueprint. In 2003, the first map of the human genome was reported, a momentous breakthrough that would usher in a new age of science. But this wasn't a complete map. About 8% of the genome, equivalent to the information contained by one chromosome, was missing. This tricky region of the genome contained highly repetitive sequences that were difficult to assemble -- until now.
How to build a human: now with the complete instructions
"Generating a truly complete human genome sequence represents an incredible scientific achievement, providing the first comprehensive view of our DNA blueprint," Eric Green, director of the National Human Genome Research Institute (NHGRI), part of the U.S. National Institutes of Health, said in a statement.
"This foundational information will strengthen the many ongoing efforts to understand all the functional nuances of the human genome, which in turn will empower genetic studies of human disease," he added.
The genome is the complete set of DNA in a cell, including both the nuclear DNA and the mitochondrial DNA (mtDNA). It contains all the instructions a living being needs to survive and replicate, written in chemical building blocks or “bases” (G, A, T, and C), whose order or sequence encodes biological information.
For humans, the size of the genome is considered to be the total number of bases in one copy of their nuclear DNA. That's despite the fact that humans and other mammals carry duplicate copies of almost all of their DNA. For instance, we have pairs of chromosomes, with one chromosome of each pair inherited from each parent. But scientists are only interested in sequencing the bases of one copy of each chromosome pair. A person’s actual genome is roughly six billion bases in size, but a single “representative” copy of the human genome is about three billion bases in size.
Since the genome is so large, it's virtually impossible to read it all in one sitting from head to tail. To sequence the genome, researchers first break the DNA down into smaller, more manageable pieces. Each piece is then subjected to chemical reactions that allow scientists to determine the order of its bases. It's then a matter of stitching these pieces back together wherever their sequences overlap, like solving a giant jigsaw puzzle.
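The break-and-stitch idea can be sketched in a few lines of Python. This is only a toy illustration: the 18-base "genome", the read length, and the helper names `shred`, `overlap`, and `greedy_assemble` are all made up for this example, and real assemblers use far more sophisticated methods than this greedy merge.

```python
def shred(genome: str, read_len: int, step: int) -> list[str]:
    """Cut the genome into overlapping reads (hypothetical helper)."""
    return [genome[i:i + read_len]
            for i in range(0, len(genome) - read_len + 1, step)]


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; return the separate pieces
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return reads


toy_genome = "ATGGCGTACCTGAAGTCA"  # invented sequence, for illustration only
reads = shred(toy_genome, read_len=8, step=5)
print(reads)                   # ['ATGGCGTA', 'GTACCTGA', 'TGAAGTCA']
print(greedy_assemble(reads))  # ['ATGGCGTACCTGAAGTCA']
```

The toy works because every overlap in this invented sequence is unique, so there is only one way to fit the pieces together.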
The problem is that some regions of the genome, particularly the centromeres (the pinched regions that hold the two copies of a duplicated chromosome together), repeat the same sequences over and over again. In the past, these repetitive regions made it virtually impossible to assemble the genome in the right order, like putting together an intricate puzzle with many identical pieces. To add to the challenge, each cell carries two copies of the genome -- one from the mother, one from the father -- and it is easy to mix their sequences together.
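A small Python sketch shows why identical pieces are such a problem. The two sequences below are invented for illustration, as are the names `all_reads` and `distinguishable`. They contain the same six-base repeat three times but with the middle segments swapped. As long as the reads are too short to span a whole repeat plus a base on either side, both sequences produce exactly the same collection of reads, so no assembler could tell them apart.

```python
from collections import Counter

REPEAT = "ACACAC"  # invented 6-base repeat, standing in for a repetitive region

# Two different toy "genomes": the segments between the repeats are swapped.
genome_1 = "TTG" + REPEAT + "GGT" + REPEAT + "CTT" + REPEAT + "GAA"
genome_2 = "TTG" + REPEAT + "CTT" + REPEAT + "GGT" + REPEAT + "GAA"


def all_reads(genome: str, read_len: int) -> Counter:
    """Count every read of the given length, one starting at each position."""
    return Counter(genome[i:i + read_len]
                   for i in range(len(genome) - read_len + 1))


def distinguishable(read_len: int) -> bool:
    """True if the two genomes yield different collections of reads."""
    return all_reads(genome_1, read_len) != all_reads(genome_2, read_len)


print(genome_1 != genome_2)   # True: the sequences really are different
print(distinguishable(7))     # False: 7-base reads cannot span a repeat
print(distinguishable(8))     # True: 8-base reads span the repeat and its flanks
```

The same logic applies at genome scale: the longer the reads relative to the repeats, the more repetitive regions can be placed unambiguously.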
Initially, this problem wasn't seen as all that important. Repetitive sequences of DNA were thought to be 'junk' -- just some extra copies with little to no relevance to the bigger picture. That was later shown to be wishful thinking. Centromeres contain vital instructions for protein manufacturing while other repetitive sections may contain unique genes that have helped our species adapt across our evolutionary history.
Because centromeres play such a vital role in accurately dividing genetic material when one cell splits into two, malfunctions can lead to conditions such as Down syndrome, a genetic condition in which children are born with an extra copy of chromosome 21.
A 100% complete human genome sequence
These challenges were overcome thanks to modern technology and some very clever thinking. At the University of Pittsburgh, reproductive geneticists keep very rare human cell samples which, because of glitches in their development, have two copies of the father's DNA and none from the mother. This single-genome cell line is orders of magnitude easier to sequence than the typical two-copy genome that we all inherit.
This single-genome sample was analyzed using a new nanopore sequencing machine that can read a million bases of DNA at a time. While they worked on this effort, the researchers banded together to form the Telomere-to-Telomere (T2T) consortium to sequence each chromosome from one end, or telomere, to the other.
For six months, T2T researchers sequenced the human genome nonstop. The last piece of the puzzle was a new sequencing machine developed by Pacific Biosciences that could perform long-read sequencing with an accuracy greater than 99%.
“It was the last piece of the puzzle – like putting on a new pair of glasses,” said Adam Phillippy of the National Human Genome Research Institute who led the team of over 100 scientists involved in this research.
Overall, the now-complete version of the human genome contains 3.055 billion base pairs, corresponding to 19,969 genes. Of these genes, about 2,000 are new to science. However, most are disabled, meaning they do not express proteins, and only 115 are active. The researchers also uncovered unexpectedly high levels of genetic variation in centromeres, whose significance remains to be determined.
Finishing the human genome sequence, as described by the authors of the study, is like putting on a new pair of glasses -- we can now clearly see everything. In the future, armed with this complete genomic information, doctors will be able to provide personalized healthcare to patients. The first human genome sequence cost billions. Now, mapping the genome of a patient -- including the recently filled gaps -- could cost as little as $1,000.
But ultimately, having a complete genome is paramount to reverse-engineering the course of human evolution. Some of the genes previously associated with bigger brains were found to be highly variable. One person might have 10 copies of a particular gene, while others might have only one or two. This variation is a double-edged sword as “these regions become a crucible for both rapid evolutionary changes and disease susceptibility, both within and between species,” explained Evan Eichler, a genome scientist at the University of Washington who took part in the T2T effort.
This monumental work is far from over, though. The T2T consortium is busy sequencing a new genome, this time one with different chromosomes inherited from each parent. Another ambitious goal is a pan-genome project, in which the DNA of hundreds of people across the world is fully sequenced in order to capture the richness of human diversity as closely as possible.
The new results appeared in six papers in Science and more than a dozen papers elsewhere. The authors had released a paper detailing these results on a preprint server in 2021, but the findings have now been peer-reviewed.