Twenty-one years ago, researchers announced the first “draft” of the complete human genome sequence. It was a monumental achievement, but the sequence was still missing about 8 percent of the genome. Now, scientists working together around the world say they’ve finally filled in that reclusive 8 percent.
If their work holds up to peer review and it turns out they really did sequence and assemble the human genome in its entirety, gaps and all, it could change the future of medicine.
What’s in a Genome?
Sequencing the human genome has long been a huge project with worthy goals. Why? Because as humans understand their genetic code better, they can make better, more customized medicines, including the kind of gene-focused medicine that powered the first effective COVID-19 vaccines.
Humans have 46 chromosomes, in 23 pairs, that represent tens of thousands of individual genes. Each gene consists of some number of base pairs made of adenine (A), thymine (T), guanine (G), and cytosine (C). There are billions of base pairs in the human genome.
In June 2000, the Human Genome Project (HGP) and private company Celera Genomics announced that first “draft” of the human genome. This was the result of years of work that picked up the pace as humans continued to make better computers and algorithms for processing the genome. At the time, scientists were surprised that of the over 3 billion individual “letters” of base pairs, they estimated humans have just 30,000 to 35,000 genes. Today, that estimate is far lower, hovering around 20,000.
Three years later, HGP completed its mission to map the whole human genome and defined its terms this way:
“Current technology” is doing a lot of heavy lifting here. At the time, HGP used a process built on bacterial artificial chromosomes (BACs), where scientists used bacteria to clone each piece of the genome and then studied the pieces in smaller groups. A complete “BAC library” is 20,000 carefully prepared bacteria with cloned genes inside.
But that BAC process inherently misses some portions of the whole genome. The reason why is a great lead-in to what the new team of scientists has helped to accomplish.
A Sequencing Breakthrough
What’s lurking in the secretive 8 percent of the genome that the 2000 “draft” left untouched? The base pairs in this section are made of many, many repeated patterns, which made the region too unwieldy to study using the bacteria cloning method.
BAC and other approaches just weren’t right for the repeats-heavy remaining 8 percent of the genome. “The current workhorse DNA sequencers, made by Illumina, take little fragments of DNA, decode them, and reassemble the resulting puzzle,” Stat’s Matthew Herper reports. “This works fine for most of the genome, but not in areas where DNA code is the result of long repeating patterns.”
That makes intuitive sense; imagine counting from 1 to 50 versus simply counting 1, 2, 1, 2, . . . over and over again. Part of what made the BAC method successful is that scientists took care to minimize and match up the overlaps between pieces, and that matching became almost impossible in the repeats-heavy unexplored portion of the genome.
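To see why repeats defeat short fragments, here is a minimal Python sketch. Everything in it is a made-up illustration: the toy sequences, the helper names `toy_genome` and `all_reads`, and the read lengths are assumptions, not anything taken from the study or from a real sequencer. It builds two toy sequences that differ only in how many times "AT" repeats, then compares the fragments each one could produce.

```python
def all_reads(genome: str, read_len: int) -> set[str]:
    """Every distinct error-free fragment of length read_len the sequence could yield."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


def toy_genome(copies: int) -> str:
    """Hypothetical toy sequence: unique flank, 'AT' repeated `copies` times, unique flank."""
    return "GATTACA" + "AT" * copies + "CCGTGA"


ten_copies = toy_genome(10)
twelve_copies = toy_genome(12)

SHORT, LONG = 6, 30  # illustrative read lengths, not real instrument specs

# Short fragments never span the whole repeat, so both sequences yield the same pieces.
short_reads_identical = all_reads(ten_copies, SHORT) == all_reads(twelve_copies, SHORT)

# Fragments long enough to span the repeat and touch both flanks tell the two apart.
long_reads_identical = all_reads(ten_copies, LONG) == all_reads(twelve_copies, LONG)

print(short_reads_identical)  # True
print(long_reads_identical)   # False
```

With six-letter fragments, the two sequences are indistinguishable from their pieces alone, so the number of repeats cannot be recovered. Real assemblers also draw on clues such as how often each fragment appears, which this sketch leaves out.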
So, what’s different in the new approaches? Let’s first look at what they are. The California-based Pacific Biosciences (PacBio) and the U.K.-based Oxford Nanopore have different technologies, but are racing toward the same goal.
PacBio uses a system called HiFi, where base pairs are circulated, literally as circles, until they’re read in full and in high fidelity—hence the name. The system dates back just a few years and represents a big step forward in both length and accuracy for those longer sequences.
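The idea of reading the same circle over and over to improve fidelity can be sketched as a simple vote. This is a loose illustration only, assuming the repeated passes are already lined up and the same length. The function name `consensus` and the sample passes are hypothetical, and a per-position majority vote is a stand-in for whatever statistical method PacBio actually uses.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Per-position majority vote across equal-length passes of the same molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this sketch assumes aligned passes of equal length")
    # zip(*passes) yields one column of letters per position; keep the most common letter.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Five hypothetical passes over the same nine-letter molecule, three with one error each.
passes = [
    "GATTACAGT",
    "GATTGCAGT",  # error at position 4
    "GATTACAGT",
    "CATTACAGT",  # error at position 0
    "GATTACAGA",  # error at position 8
]

print(consensus(passes))  # GATTACAGT
```

Because the errors land in different places on each pass, the vote at every position still recovers the correct letter.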
Oxford Nanopore, meanwhile, uses electrical current in its proprietary devices. Strands of base pairs are pressed through a microscopic nanopore—just one molecule at a time—where a current zaps them in order to observe what kind of molecule they are. By zapping each molecule, scientists can identify the full strand.
In the new study posted on the biology preprint server bioRxiv, an international consortium of about 100 scientists used both PacBio and Oxford Nanopore technologies to chase down some of the remaining unknown sections of the human genome.
The amount of ground the consortium covered is staggering. “The consortium said that it increased the number of DNA bases from 2.92 billion to 3.05 billion, a 4.5 [percent] increase. But the count of genes increased by just 0.4 [percent], to 19,969,” Stat reports. This shows how big the heavily repeating base pair sequences in this zone are compared to the genes they represent.
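The quoted percentage is easy to check. This short Python sketch uses only the two base counts reported above; the variable names are just for illustration.

```python
bases_before = 2_920_000_000  # 2.92 billion bases, as quoted
bases_after = 3_050_000_000   # 3.05 billion bases, as quoted

added = bases_after - bases_before
pct_increase = added / bases_before * 100

print(f"{added / 1e6:.0f} million bases added, {pct_increase:.2f}% increase")
# 130 million bases added, 4.45% increase
```

The result, about 4.45 percent, rounds to the 4.5 percent figure in the quote.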
The Missing Links
Sequencing godfather George Church, a biologist at Harvard University, told Stat that if this work goes through peer review successfully, it will be the first time any vertebrate genome has been fully mapped. And the reason seems to be simply that both new technologies allow very long strings of base pairs to be read at once.
Why is the missing gene information so important? Well, the study of genes experiences a lot of favoritism, with a handful of the most popular genes taking up the bulk of research interest and funding. The overlooked genes may hold key mechanisms that cause disease, for example.
There’s one little snag, although it was also a snag for the 2000 announcement of the first draft of the genome. Both projects studied cells that had just 23 chromosomes instead of the full 46. That’s because they used cells derived from the reproductive system, where eggs and sperm each carry half of a full chromosomal load.
The cell is from a hydatidiform mole, a kind of reproductive growth that represents an extremely early, unviable union between a sperm and an egg cell that has no nucleus. Choosing this kind of cell, which has been kept and cultured as a “cell line” used for research purposes, cuts the huge sequencing job in half.
The next step is for the study to appear in a peer-reviewed publication. After that, though, both PacBio and Oxford Nanopore seek to sequence the entire 46-chromosome human genome. But we might be waiting a while.