Digital Data Storage – DNA as a Tool

INTRODUCTION

The explosion of information is challenging existing means of data storage. Current storage methods (e.g., magnetic and optical media) are expected to be inadequate for the exponentially growing volume of data. In 2020, every individual in the world generated around 1.7 MB of data each second, which amounted to 418 zettabytes and would require approximately 418 billion one-terabyte hard drives for storage.^[1] To put this in perspective, the Large Hadron Collider at the European Organization for Nuclear Research (CERN) is one contributor to the unprecedented volume of research data being added continuously. It generates about 50 million GB of data per year as it records the results of experiments involving approximately 600 million particle collisions per second. In the life sciences, DNA sequencing alone generates millions of GB of data per year, and it is predicted that within a decade we will be swamped with 40 billion GB of genomic data.

Traditional mass-storage technologies are approaching their limits while the need for data storage keeps surging. Hard-disk drives are limited to 1 TB per square inch.^[2] The biggest tape archive facilities can store an exabyte of data, but these facilities take up ample space, cost billions in upkeep, use considerable amounts of energy, and must have their data copied at regular intervals so that it is not lost to degradation.^[3] In addition, temperature fluctuations can cause the magnetically charged material of a disk to flip, corrupting the data it holds. Avoiding this requires more heat-resistant materials, which means altering the technology and, in turn, huge investments.^[2] Data scientists are therefore looking for better, more stable, and more space-efficient alternatives for storing huge datasets. DNA-based data storage has recently emerged as a promising approach for long-term digital information storage. Highly condensed DNA has great potential to become a storage material of the future.

A solution to the shortage of digital storage space may be found in deoxyribonucleic acid (DNA), the molecular repository of biological information. DNA has an astonishing ability to store biological data, as it is the basic storage unit for all the information that governs biological life. DNA is not only abundant and sustainable, it also provides greater storage density than currently available data storage media.^[4] Figure 1 compares the amount of traditional storage media and the amount of DNA required to store 40 ZB of data. Furthermore, data stored in DNA can be kept and accessed for longer periods of time without losing any information.^[5]^[6]

Figure 1: Schematic representation of the amount of traditional storage media needed to store 40 ZB of data versus DNA (Source)

The process of storing digital information^[7]^[8] using DNA as a storage medium involves the following steps:

Coding: This involves encoding binary data into synthetic strands of DNA. To store a binary digital file in DNA, the 1s and 0s of the binary digits are converted into the letters A, C, G and T, which represent the four basic nucleotides of DNA, i.e., adenine, cytosine, guanine and thymine.

Synthesis: The resulting sequence of ‘A’s, ‘C’s, ‘G’s and ‘T’s is synthesized as DNA, in the order corresponding to the digital file, yielding a physical storage medium.

Decoding: To recover the data, the chain of DNA is sequenced and the order of ‘A’s, ‘C’s, ‘G’s and ‘T’s is decoded back to the original digital sequence.
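
The coding and decoding steps can be sketched in a few lines of Python. This is a minimal illustration only, assuming a simple mapping of two bits per nucleotide (00 to A, 01 to C, 10 to G, 11 to T). The mapping and the function names encode and decode are assumptions made for this example and are not taken from the cited sources; practical schemes typically add error correction and addressing on top. The synthesis step sits between the two functions and takes place in the laboratory.

```python
# Hypothetical mapping of two bits per nucleotide, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Coding step: convert bytes into a string of A/C/G/T letters."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Decoding step: convert a sequenced strand back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"hi")
print(strand)          # CGGACGGC
print(decode(strand))  # b'hi'
```

With this mapping each byte becomes four nucleotides, so a file of n bytes needs a sequence of 4n letters before any redundancy is added.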

The proof of principle of storing digital data in DNA was first presented in 1988 by the artist Joe Davis in collaboration with researchers from Harvard University. Davis stored the image of a runic symbol, representing 35 bits of data, by treating the light and dark pixels as binary 1s and 0s and encoding them in 28 base pairs within the DNA of E. coli.^[9] Later, Seth Shipman and his colleagues at Harvard University used a version of Clustered Regularly Interspaced Short Palindromic Repeats (‘CRISPR’) with a different enzyme, called CRISPR/Cas1-Cas2, which allowed them to add a message to the genome rather than cut a notch in it. The message consisted of an image of a human hand and five images showing a galloping horse from Eadweard Muybridge’s 1878 photographic study of the animal’s motion. To get the encoded DNA sequences inside the cells, the team applied an electrical current that opened channels in the cells’ walls, which allowed the DNA to flow in. Once inside, the CRISPR system went into action and embedded the code. To read the data back, the team sequenced the DNA of more than 600,000 cells. Sequencing such a large number of cells was necessary because most of the cells would not have edited their genome entirely accurately.^[10]

According to Microsoft, synthetic DNA could be the next major milestone in long-term data storage, with just one gram of DNA capable of storing 215 petabytes of data for up to 2,000 years. If scientists could realize its full potential, the technology could drastically reduce the space required to store the world’s ever-growing data.^[11]
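
As a rough worked example, the density figure above can be set against the 418 zettabytes quoted in the introduction. The short Python calculation below assumes decimal units (1 ZB = 10^6 PB = 10^9 TB) and treats 215 petabytes per gram as an idealized density with no allowance for redundancy, indexing or error correction, so the result is an order-of-magnitude estimate rather than a practical figure. The variable names are chosen for this example only.

```python
# Idealized comparison; the constants are the figures quoted in the text.
PB_PER_GRAM = 215    # quoted storage density of DNA
PB_PER_ZB = 10**6    # decimal units: 1 ZB = 10^21 bytes, 1 PB = 10^15 bytes
TB_PER_ZB = 10**9    # 1 TB = 10^12 bytes

data_zb = 418        # data generated in 2020, as quoted in the introduction

hard_drives = data_zb * TB_PER_ZB                  # one-terabyte drives needed
grams_of_dna = data_zb * PB_PER_ZB / PB_PER_GRAM   # grams of DNA needed

print(f"{hard_drives:,} one-terabyte drives")      # 418,000,000,000 one-terabyte drives
print(f"{grams_of_dna / 1000:,.0f} kg of DNA")     # 1,944 kg of DNA
```

Under these assumptions, the same volume of data that would fill 418 billion one-terabyte drives corresponds to roughly two tonnes of DNA.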

RECENT DEVELOPMENTS

Shuichi Hoshika et al. have reported DNA- and RNA-like systems built from eight nucleotide “letters”, named ‘hachimoji’. These eight nucleotides form four orthogonal pairs. These synthetic systems meet the structural requirements needed to support Darwinian evolution, including a polyelectrolyte backbone, predictable thermodynamic stability, and stereo-regular building blocks that fit a Schrödinger aperiodic crystal.^[12]
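
From a data storage point of view, a larger alphabet means each letter can carry more information: a letter drawn from an alphabet of size N can represent at most log2(N) bits. The sketch below compares the standard four-letter alphabet with an eight-letter one. It is a theoretical upper bound that assumes every letter can be written and read equally reliably, which is an assumption of this example and not a result reported in the cited work; the function names are likewise illustrative.

```python
import math


def bits_per_letter(alphabet_size: int) -> float:
    """Upper bound on the information carried by one letter."""
    return math.log2(alphabet_size)


def letters_needed(num_bits: int, alphabet_size: int) -> int:
    """Minimum number of letters needed to represent num_bits bits."""
    return math.ceil(num_bits / bits_per_letter(alphabet_size))


one_kilobyte = 8 * 1000  # bits in 1 KB (decimal)

print(bits_per_letter(4), letters_needed(one_kilobyte, 4))  # 2.0 4000
print(bits_per_letter(8), letters_needed(one_kilobyte, 8))  # 3.0 2667
```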

Researchers from the University of Washington and Microsoft collaborated to demonstrate the first fully automated system for storage and retrieval of data in manufactured DNA. The team successfully encoded the word “hello” in snippets of fabricated DNA and converted it back to digital data using a fully automated end-to-end system.^[13]

Separately, North Carolina State University (NCSU) researchers have developed a system called Dynamic Operations and Reusable Information Storage (DORIS), which can work at room temperature, unlike approaches based on the polymerase chain reaction (PCR), which typically require heating to access stored files. DORIS’s primer-binding sequences are made up of a single-stranded tail of DNA that hangs off the end of each strand. This allows the system to find and retrieve files without needing to rip open the data-encoded DNA strands through the heating and cooling process used in PCR.^[14]

Researchers from the University of Cambridge have employed machine learning methods to predict DNA hybridisation, which will aid in scaling up digital data storage. For this purpose, they introduced an in silico-generated hybridisation dataset of over 2.5 million data points, which enabled the use of deep learning.^[15]

Twist Bioscience Corporation has stored an episode of a Netflix Original Series in its synthetic DNA. Twist manufactures more than one million pieces of DNA on a single silicon chip using semiconductor technology and is working towards a next-generation silicon chip that will be able to synthesize, or write, 10 GB of data in DNA per chip, which should help reduce the cost of digital data storage.^[16]

START-UPS

CATALOG, a DNA-based data storage and computation platform based in Boston, hopes to provide commercial DNA data storage services built on its DNA synthesis process. CATALOG has stored around 1 KB of data in DNA, including literary works by Douglas Adams and Robert Frost, by utilizing a large collection of premade molecules.^[17]

Iridia, formerly known as Dodo OmniData, is a San Diego–based start-up working on developing novel methods for storing data in DNA. Its prime focus is to develop a highly parallel format enabling an array of nanomodules to store data with high density. This will be achieved through the combination of DNA polymer synthesis, electronic nano-switches and semiconductor fabrication technologies.^[18]

Helixworks, an Irish start-up, is set to offer a DNA storage drive with a capacity to store 512 KB of data in specially encoded DNA, encapsulated in a gold pill, with a potential shelf-life of thousands of years.^[19]

INVESTMENTS AND COLLABORATIONS

Roswell Biotechnologies Inc., in collaboration with Georgia Tech Research Institute (GTRI), has received a $25 million contract from the Intelligence Advanced Research Projects Activity’s (IARPA) Molecular Information Storage (MIST) program. The investment is to support the development of a sequencing technology capable of reading data stored in DNA at speeds of up to 10 TB per day, more than 400 times the speed of currently available sequencing technologies.^[20]

French biotech company DNA Script has been awarded a contract worth €20.7 million by the US Intelligence Advanced Research Projects Activity (IARPA) towards the development of a prototype instrument that is able to store and retrieve 1 terabyte of information in 24 hours.^[21]

CATALOG has secured $10 million in Series A funding from Horizons Ventures and Airbus Ventures, which will be used to fund early product trials and continued research and development. In all, CATALOG has raised $21 million, with additional investors including NEA, OS Fund, Data Collective, AME Cloud, SOSV and others.^[22]

Illumina, Microsoft, Twist Bioscience, and Western Digital are the founding members of the DNA Data Storage Alliance, formed by fifteen tech-based companies and institutions. Led by these four members, the alliance is committed to addressing the exponential growth of digital data by establishing the foundation for a cost-effective commercial archival storage ecosystem.^[23]

RECENT PATENT PUBLICATIONS

In the patent application US20200143909A1, researchers from Seoul National University and Kyung Hee University have disclosed a biochemical carrier comprising biochemical molecules whose sequence encodes digital data; a carrier particle consisting of a polymer matrix, with the biochemical molecules attached to its surface or located inside it; and an index code introduced into the carrier particle.

ETH Zurich has developed a method, as disclosed in their PCT patent application WO2019081145A1, for encoding information based on the generation of an encryption key according to polymorphic features of nucleic acids from one or more entities. This is followed by information encryption based on the generated key and then encoding the encrypted information into the synthetic DNA.

In EP3478852A1, Microsoft has disclosed a method for encoding binary data in double-stranded deoxyribonucleic acid (dsDNA). The method involves creating a double-strand break (DSB) at a target site in the dsDNA with an enzyme; selecting a homologous repair template according to a binary digit, the target site, and an encoding scheme; and contacting the dsDNA with the homologous repair template.

In US20180189448A1, Intel Corporation has disclosed a data storage apparatus, consisting of a microfluidic droplet storage array which contains information-encoding polymer molecules and an interface to receive the droplets from a data writer that writes the droplets into the microfluidic droplet storage array.

CATALOG Technologies Inc. has recently filed US20210079382A1, which discloses a method for encoding digital information in DNA molecules without base-by-base synthesis. The method involves encoding bit-value information based on unique nucleic acid sequences within a pool by specifying each bit location and value.
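
The general idea of representing bits through the presence of predefined sequences in a pool, instead of writing one long strand base by base, can be illustrated with a toy model. The Python sketch below is not the patented method: the identifier scheme, the function names and the 8-letter identifier length are all hypothetical and chosen only to show how an unordered pool can specify each bit location and value.

```python
BASES = "ACGT"


def identifier(position: int, value: int, length: int = 8) -> str:
    """Hypothetical premade sequence meaning 'the bit at <position> is <value>'."""
    n = 2 * position + value          # unique number per (position, value) pair
    letters = []
    for _ in range(length):           # write n in base 4 using A/C/G/T
        n, remainder = divmod(n, 4)
        letters.append(BASES[remainder])
    return "".join(reversed(letters))


def write_pool(bits: str) -> set:
    """Select one premade identifier per bit; the pool itself has no order."""
    return {identifier(i, int(b)) for i, b in enumerate(bits)}


def read_pool(pool: set, num_bits: int) -> str:
    """Recover the bits by checking which identifiers are present."""
    return "".join("1" if identifier(i, 1) in pool else "0" for i in range(num_bits))


message = "1011001"
pool = write_pool(message)
print(read_pool(pool, len(message)))  # 1011001
```

Because every identifier carries its own bit location, the data can be recovered from the pool regardless of the order in which the molecules are read.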

CONCLUSION

The challenges associated with storing huge amounts of data could be resolved using DNA-based technologies. The ample availability and long life of DNA can be harnessed to use it as a data storage unit. However, drawbacks such as exorbitant costs, slow writing and reading mechanisms, uncertainty in in vitro DNA synthesis and sequencing techniques, and a lack of preservation techniques can lead to severe errors and data loss, which may limit its practical applications. To address these issues, technologies for DNA synthesis, sequencing and retrieval that were developed for life sciences applications need to be tailored to support digital data storage applications. In addition, integration of DNA technologies with life sciences, material science and information technology is necessary to eliminate the barriers and facilitate the commercialization of DNA as a tool for digital storage.

References

1. How Much Data Is Created Every Day in 2022?
2. Exabytes In A Test Tube: The Case For dna Data Storage
3. DNA Data Storage – Setting the Data Density Record with DNA Fountain
4. Team Edinburgh UG
5. Bacterial nanopores open the future of data storage
6. DNA could store all of the world’s data in one room
7. How DNA could store all the world’s data in a semi-trailer
8. Digital Data Storage on DNA
9. How DNA could store all the world’s data
10. CRISPR–Cas encoding of a digital movie into the genomes of a population of living bacteria
11. Microsoft: This is world’s first automated DNA data storage, retrieval system
12. Hachimoji DNA and RNA: A genetic system with eight building blocks
13. fully-automated-dna-storage
14. Breakthrough tech makes DNA data storage more practical and scalable
15. Scaling up DNA digital data storage by efficiently predicting DNA hybridisation using deep learning
16. Twist Bioscience Synthetic DNA Stores New Netflix Original Series ‘BIOHACKERS’
17. Bostons Catalog Secures 9-Million In Funding To Advance dna Data Storage Technology
18. Data storage Solution May Be In the dna
19. DNA Drive
20. Roswell Biotechnologies Awarded Government Contract for DNA Digital Data Storage
21. DNA Data Storage Project Receives €20.7M from US Intelligence Agency
22. DNA-based Data Storage and Computation Provider CATALOG, Founded by MIT Scientists, Raises $10 Million
23. Illumina, Microsoft, Twist Lead New DNA Data Storage Alliance

Disclaimer:

This document has been created for educational and instructional purposes only
Copyrighted materials used have been specifically acknowledged
We claim the right of fair use as ascertained by the author

Author

Dr. Sivaprasad

Dr. Sivaprasad earned his doctorate in Materials Science and Engineering. He has vast experience in nanotechnology, in particular drug delivery systems based on polymer nanoparticles; his other areas of interest include energy storage systems. He currently works for SciTech Patent Art as a Group Leader.
