Researchers at the EMBL-European Bioinformatics Institute (EMBL-EBI) have created a way to store data in the form of DNA – a material that lasts for tens of thousands of years. The new method, published January 23 in the journal Nature, makes it possible to store at least 100 million hours of high-definition video in about a cup of DNA.
There is a lot of digital
information in the world – about three zettabytes’ worth (that’s 3000 billion
billion bytes) – and the constant influx of new digital content poses a real
challenge for archivists. Hard disks are expensive and require a constant supply
of electricity, while even the best ‘no-power’ archiving materials such as
magnetic tape degrade within a decade. This is a growing problem in the life
sciences, where massive volumes of data – including DNA sequences – make up the
fabric of the scientific record.
“We already know
that DNA is a robust way to store information because we can extract it from
bones of woolly mammoths, which date back tens of thousands of years, and make
sense of it,” explains Nick Goldman of EMBL-EBI. “It’s also incredibly small,
dense and does not need any power for storage, so shipping and keeping it is
easy.”
Reading DNA is fairly
straightforward, but writing it has until now been a major hurdle to making DNA
storage a reality. There are two challenges. First, current methods can only
manufacture DNA in short strings. Second, both writing and
reading DNA are prone to errors, particularly when the same DNA letter is
repeated. Nick Goldman and co-author Ewan Birney, Associate Director of
EMBL-EBI, set out to create a code that overcomes both problems.
“We knew we needed to
make a code using only short strings of DNA, and to do it in such a way that
creating a run of the same letter would be impossible. So we figured, let’s
break up the code into lots of overlapping fragments going in both directions,
with indexing information showing where each fragment belongs in the overall
code, and make a coding scheme that doesn't allow repeats. That way, you would
have to have the same error on four different fragments for it to fail – and
that would be very rare,” says Ewan Birney.
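The idea Birney describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published scheme: the function names, the fixed six-digit base-3 conversion, the lookup table and the fragment sizes (100 bases, stepping by 25, so bases away from the two ends fall in four fragments) are all hypothetical choices made for the example. Each base-3 digit picks one of the three bases that differ from the previous base, so a run of the same letter cannot occur, and every second fragment is reverse-complemented so that fragments go in both directions.

```python
# Illustrative sketch only: names, table and sizes are assumptions,
# not the scheme as published.

# For each previous base, the three bases allowed next (never the same one).
NEXT_BASE = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def bytes_to_trits(data):
    """Write each byte as six base-3 digits (3**6 = 729 covers 0..255)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            byte, digit = divmod(byte, 3)
            digits.append(digit)
        trits.extend(reversed(digits))  # most significant digit first
    return trits


def trits_to_bytes(trits):
    """Inverse of bytes_to_trits."""
    out = bytearray()
    for i in range(0, len(trits), 6):
        value = 0
        for trit in trits[i:i + 6]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def trits_to_dna(trits, start="A"):
    """Each digit selects a base different from the previous base."""
    prev, bases = start, []
    for trit in trits:
        prev = NEXT_BASE[prev][trit]
        bases.append(prev)
    return "".join(bases)


def dna_to_trits(dna, start="A"):
    """Inverse of trits_to_dna."""
    prev, trits = start, []
    for base in dna:
        trits.append(NEXT_BASE[prev].index(base))
        prev = base
    return trits


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def fragments(dna, length=100, step=25):
    """Cut overlapping fragments and tag each with its index.

    Odd-numbered fragments are reverse-complemented. Padding of a
    trailing partial fragment is left out of this sketch.
    """
    result = []
    last_offset = max(len(dna) - length, 0)
    for index, offset in enumerate(range(0, last_offset + 1, step)):
        piece = dna[offset:offset + length]
        if index % 2 == 1:
            piece = reverse_complement(piece)
        result.append((index, piece))
    return result


message = b"I have a dream"
strand = trits_to_dna(bytes_to_trits(message))
assert trits_to_bytes(dna_to_trits(strand)) == message  # round trip
```

Because the index travels with each fragment, a decoder in this sketch could put the fragments back in order, flip the odd-numbered ones, and compare the overlapping copies of each position before converting back to bytes.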
The new method requires
synthesising DNA from the encoded information: enter Agilent Technologies, Inc,
a California-based company that volunteered its services. Ewan Birney and Nick
Goldman sent them encoded versions of: an .mp3 of Martin Luther King’s speech,
“I Have a Dream”; a .jpg photo of EMBL-EBI; a .pdf of Watson and Crick’s
seminal paper, “Molecular structure of nucleic acids”; a .txt file of all of
Shakespeare's sonnets; and a file that describes the encoding.
“We downloaded the files
from the Web and used them to synthesise hundreds of thousands of pieces of DNA
– the result looks like a tiny piece of dust,” explains Emily Leproust of
Agilent. Agilent mailed the sample to EMBL-EBI, where the researchers were able
to sequence the DNA and decode the files without errors.
“We’ve created a code
that's error tolerant using a molecular form we know will last in the right
conditions for 10 000 years, or possibly longer,” says Nick Goldman. “As long
as someone knows what the code is, you will be able to read it back if you have
a machine that can read DNA.”
Although there are many
practical problems to solve, the inherent density and longevity of DNA make it
an attractive storage medium. The next step for the researchers is to perfect
the coding scheme and explore practical aspects, paving the way for a
commercially viable DNA storage model.
How much data can you get in one gram of DNA?
Scientists from the European Bioinformatics
Institute are squeezing unparalleled amounts of data into synthetic DNA, and
now they’ve achieved something remarkable: a storage density equivalent to
2.2 petabytes of information in a single gram of DNA, with the data recovered
at 100 per cent accuracy.
The researchers have encoded an MP3 of
Martin Luther King’s 1963 “I have a dream” speech, along with all 154 of
Shakespeare’s sonnets and several other files, into fragments of synthetic DNA. Scaled up, that represents a
storage density of 2.2 petabytes per gram. What’s amazing, though, is that
they’ve managed to achieve that whilst also implementing error correction in
the complex chains of molecules, allowing them to retrieve content with 100 per
cent accuracy.
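To put that density in perspective, here is a rough back-of-the-envelope calculation using only the two figures quoted in this piece: 2.2 petabytes per gram, and roughly three zettabytes of digital information in the world. It assumes decimal (SI) units and that the quoted density would hold at scale, which is an extrapolation rather than a demonstrated result. The variable names are illustrative.

```python
# Back-of-the-envelope estimate; assumes SI (decimal) units and that the
# quoted density scales up unchanged.
PETABYTE = 10**15
ZETTABYTE = 10**21

density_bytes_per_gram = 2.2 * PETABYTE   # figure quoted in the article
world_data_bytes = 3 * ZETTABYTE          # figure quoted in the article

grams_needed = world_data_bytes / density_bytes_per_gram
tonnes_needed = grams_needed / 1_000_000  # 1 tonne = 1,000,000 grams

print(f"{grams_needed:,.0f} g, or about {tonnes_needed:.2f} tonnes")
```

On those assumptions the world's three zettabytes would come to roughly 1.4 million grams of DNA, a little under one and a half tonnes.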
The technique uses the four bases of DNA —
A, T, C and G — to achieve the high information density. It is, understandably,
still incredibly expensive: creating synthetic DNA and then sequencing it to
read off the data is getting far easier, but it’s still a time- and
cash-consuming business. Keep hold of your hard drives for now, but DNA could
represent a viable storage solution in the future.