Layers of Data Infrastructure 3: Storage

Dec 9, 2020

Design decisions for how your systems and pipelines store data.

3 Comments

Dec 17, 2022 (edited)

The myriad ways of organizing data are confusing. The articles you've linked to are good reads. However, designing the data structures for a given industry is more than an afternoon's work. It feels far removed from the day-to-day of biotech. Any strategies to avoid "paralysis by analysis"?

https://miro.medium.com/max/720/0*h60AcWEOy-5Qdmr2

https://www.sqlshack.com/wp-content/uploads/2018/05/word-image-281.png



Jesse Johnson

Jan 1, 2023

That's a great point - schema/ontology design is often much more complex and difficult than infrastructure, and often runs into issues with politics and personal preferences. I have some thoughts on this that I might write about in the future, but more than I want to put in the comments section :)


en zyme

Dec 17, 2022

For storage options I like to consider capacity, cost, convenience, and latency. Over the years there have been many expensive high-tech solutions, such as tape libraries and data closets.

The ETL vs ELT analysis you mentioned is a good place to start. Understanding scale and scope is hard to do in advance, so it's important to leverage lessons learned. Data lakes and graph databases require an understanding of the broader objectives, significant planning, and a commitment of resources. Biologists grapple with the layering of biochemical, cellular, organ, system, and behaviour. A haphazard storage strategy will be as temperamental as a hyena and as sluggish as, well, as sluggish as a slug.

https://media.sciencephoto.com/image/c0049078/400wm/C0049078-Computer_Tape_Library.jpg

https://images.computerhistory.org/revonline/images/500004392-03-01.jpg


Scaling Biotech
