Layers of Data Infrastructure 3: Storage
Design decisions for how your systems and pipelines store data.
This post is the second in a three-part series exploring the high-level design decisions you need to make for each stage in each use case category. The Storage layer defines how data is stored between, during, and after all the stages in a pipeline or system.
The quick-read version:
Storage options are defined by the structure of the data and the ways you’ll write and query it.
Local files - Your hard drive. Simple to implement, but mostly for local compute.
Remote files - Shared hard drive. Scales well. Built-in backup & security. Limited read/write compared to local files, and it’s still just files. AKA “Data Lake”
Relational database - Excel-like data, read and write a few entries at a time, or query with SQL-like languages. Up to millions of rows, but not billions.
Analytics database - Similar structure to relational. Write in large batches, query at any scale. AKA “Data Warehouse”
KV-store - Write anything, with or without structure, one entry at a time or in large batches. Scales well, but querying is limited. AKA “NoSQL”
Graph database - Similar data structure to relational, but viewed as a graph/network. Slow for SQL-like queries, but the only option for graph-like queries.
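To make the relational vs. KV-store trade-off concrete, here is a minimal sketch using only Python's standard library. The `samples` records and their field names are invented for illustration, and a plain dictionary stands in for a real KV-store, so treat this as a toy model of the access patterns rather than a benchmark or a recommended setup.

```python
import sqlite3

# Hypothetical records, invented for this example.
samples = [
    ("s1", "liver", 4.2),
    ("s2", "kidney", 3.1),
    ("s3", "liver", 5.0),
]

# Relational: fixed columns declared up front, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id TEXT PRIMARY KEY, tissue TEXT, conc REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", samples)
liver_avg = conn.execute(
    "SELECT AVG(conc) FROM samples WHERE tissue = ?", ("liver",)
).fetchone()[0]

# KV-store (a dict standing in for one): any value can live under a key.
kv_store = {sid: {"tissue": tissue, "conc": conc} for sid, tissue, conc in samples}
kv_store["s4"] = {"notes": "free text, no tissue field"}  # no schema is enforced

# Lookup by key is direct...
s2 = kv_store["s2"]

# ...but any other question means scanning every entry in application code.
liver_concs = [v["conc"] for v in kv_store.values() if v.get("tissue") == "liver"]
kv_liver_avg = sum(liver_concs) / len(liver_concs)
```

Both approaches arrive at the same average, but the relational version pushes the filtering into the query language, while the KV version leaves it to the caller and has to tolerate entries that lack the expected fields.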
For Further Consideration
How many of these options is your organization currently using? Do you store the same data in different places for different purposes?
What types of queries do your users currently rely on? Could you improve performance by switching to a more or less structured form of storage?
Further Reading
The rise of graph databases is a relatively recent trend in the storage layer.
Some of the earliest interest in graph databases came from Google’s Knowledge Graph, which was introduced in a blog post in 2012.
For a slightly more technical take, Graph Databases. What’s the Big Deal? provides a nice introduction, starting from the definition of a graph.
And Understanding benefits of Graph Databases over Relational Databases looks at how a specific schema is stored in a graph or relational database, and compares queries in each.
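As a rough illustration of why graph-like queries are awkward in a relational layout, the sketch below stores the same small set of edges two ways. The edge list is made up, and the adjacency dictionary with a breadth-first search is only a stand-in for what a graph database does natively; it is not the approach taken in either linked article.

```python
import sqlite3
from collections import deque

# Hypothetical directed edges, invented for this example.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")]

# Relational view: each edge is a row. A fixed two-hop query needs a self-join,
# and every extra hop would need another join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
two_hop = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT e2.dst FROM edges e1 "
        "JOIN edges e2 ON e1.dst = e2.src WHERE e1.src = ?",
        ("a",),
    )
)

# Graph view: an adjacency list, traversed to any depth with breadth-first search.
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)


def reachable(start):
    """Return every node reachable from start, at any number of hops."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


all_reachable = reachable("a")
```

The join answers "exactly two hops from a", while the traversal answers "anything connected to a" without knowing the depth in advance, which is the kind of question graph storage is built for.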
Up Next
My next post will attempt to demystify some aspects of data governance.
After that comes a series of case studies exploring design options for specific use cases.
Then I want to take a step back and examine what it means to have a coherent, integrated data platform, and why you might want to invest in one.
The myriad ways of organizing data are confusing. The articles you've linked to are good reads. However, designing the data structures for a given industry is more than an afternoon's work. It feels far removed from the day to day of biotech. Any strategies to avoid "paralysis by analysis"?
https://miro.medium.com/max/720/0*h60AcWEOy-5Qdmr2
https://www.sqlshack.com/wp-content/uploads/2018/05/word-image-281.png
For storage options I like to consider capacity, cost, convenience, and latency. Over the years there have been many expensive high tech solutions such as tape libraries and data closets.
The ETL vs ELT analysis you mentioned is a good place to start. Understanding scale and scope is hard to do in advance, so it's important to leverage lessons learned. Data Lakes and Graph Databases require understanding of the broader objectives, significant planning, and commitment of resources. Biologists grapple with the layering of biochemical, cellular, organ, system, and behaviour. A haphazard storage strategy will be as temperamental as a hyena and as sluggish as, well, as sluggish as a slug.
https://media.sciencephoto.com/image/c0049078/400wm/C0049078-Computer_Tape_Library.jpg
https://images.computerhistory.org/revonline/images/500004392-03-01.jpg