DDIA stands for Designing Data-Intensive Applications

2026-06-14

Ch3 Storage and Retrieval

Two engines:
- log-structured storage engine
  - basically appending to a log and keeping an index in a hash map
  - pros:
    - appending and segment merging are sequential writes, faster than random writes
  - cons:
    - the hash table has to fit in memory
    - range queries are inefficient, e.g. k1234 to k4455 means looking up every key individually
  - SSTables & LSM-Trees
    - SSTable (sorted string table): segments sorted by key, sparse in-memory index, writes go to a memtable before being saved to disk as an SSTable
    - one can use red-black trees / AVL trees to keep the memtable in sorted order
    - LSM-tree (log-structured merge-tree)
      - lookups can be slow, especially for keys that do not exist: they have to check the memtable, then the segments on disk from newest to oldest; can be mitigated with a Bloom filter
    - advantages: sequential writes -> high write throughput and often lower write amplification; compaction makes storage smaller
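
A minimal sketch of the simplest log-structured engine above: an append-only file plus an in-memory hash map from key to byte offset. The class name `HashIndexedLog`, the `key,value` line format, and the single-segment compaction are assumptions made for illustration (keys must not contain commas or newlines, values must not contain newlines, and there is a single writer).

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only key-value log with an in-memory hash index (illustrative)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        open(path, "ab").close()  # create the log file if it does not exist

    def put(self, key, value):
        # The new record starts at the current end of the file.
        offset = os.path.getsize(self.path)
        with open(self.path, "ab") as log:
            log.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)  # one seek, then read a single record
            line = log.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value

    def compact(self):
        """Rewrite the log so it keeps only the latest value per key."""
        live = {key: self.get(key) for key in self.index}
        compacted_path = self.path + ".compact"
        new_index = {}
        with open(compacted_path, "wb") as out:
            for key, value in live.items():
                new_index[key] = out.tell()
                out.write(f"{key},{value}\n".encode("utf-8"))
        os.replace(compacted_path, self.path)
        self.index = new_index


with tempfile.TemporaryDirectory() as tmp:
    db = HashIndexedLog(os.path.join(tmp, "data.log"))
    db.put("k1", "a")
    db.put("k2", "b")
    db.put("k1", "c")  # overwrites by appending; the old record becomes garbage
    size_before = os.path.getsize(db.path)
    db.compact()
    size_after = os.path.getsize(db.path)
    latest = db.get("k1")
```

Every write is an append, so the disk only sees sequential writes. The cost shows up in the cons listed above: the whole `index` dict lives in memory, and a range scan would have to probe each key one by one.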

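The SSTable / LSM-tree read and write path can be sketched in memory as follows. This is only an illustration under loud assumptions: `TinyLSM`, `Segment`, and `BloomFilter` are hypothetical names, segments are Python lists instead of files, the memtable is a dict sorted at flush time rather than a balanced tree, compaction merges through a dict instead of streaming, and deletes (tombstones) are left out.

```python
import bisect
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Segment:
    """Immutable sorted run of (key, value) pairs, standing in for an SSTable."""

    def __init__(self, sorted_items):
        self.keys = [k for k, _ in sorted_items]
        self.values = [v for _, v in sorted_items]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent, so skip the search
        i = bisect.bisect_left(self.keys, key)  # binary search on sorted keys
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.segments = []  # oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new sorted segment."""
        if self.memtable:
            self.segments.append(Segment(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            value = segment.get(key)
            if value is not None:
                return value
        return None

    def compact(self):
        """Merge all segments into one, keeping the newest value per key."""
        merged = {}
        for segment in self.segments:  # oldest first, newer values overwrite
            merged.update(zip(segment.keys, segment.values))
        self.segments = [Segment(sorted(merged.items()))] if merged else []


db = TinyLSM(memtable_limit=2)
for key, value in [("b", 1), ("a", 2), ("b", 3), ("c", 4)]:
    db.put(key, value)
```

A read for a missing key has to fall through the memtable and every segment, which is why the per-segment Bloom filter is worth its memory: it lets most of those segment searches be skipped.
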
page-oriented storage engines (e.g., B-tree)
- balanced trees
Transaction processing / analytics
- OLTP (online transaction processing)
- OLAP (online analytic processing)
Data warehousing
- ETL (extract, transform, load)
- Star and snowflake: schemas for analytics
Column-oriented storage
- faster for analytic queries that read a few columns across many rows; columns also compress well
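
To make the row vs. column difference concrete, here is a small sketch. The sales rows are made-up sample data, and the helper names (`total_quantity_rows`, `total_quantity_columns`, `run_length_encode`) are hypothetical; a real column store keeps each column in its own compressed file instead of a Python list.

```python
from itertools import groupby

# Hypothetical fact-table rows, sorted by product so that column compresses well.
rows = [
    {"date": "2026-01-01", "product": "apple", "quantity": 3},
    {"date": "2026-01-02", "product": "apple", "quantity": 2},
    {"date": "2026-01-01", "product": "banana", "quantity": 5},
    {"date": "2026-01-02", "product": "banana", "quantity": 1},
    {"date": "2026-01-03", "product": "banana", "quantity": 4},
]

# Column-oriented layout: one list per column, same row order in every list.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_quantity_rows(rows, product):
    """Row store: every full row is loaded, even the unused `date` field."""
    return sum(row["quantity"] for row in rows if row["product"] == product)


def total_quantity_columns(columns, product):
    """Column store: only the `product` and `quantity` columns are touched."""
    pairs = zip(columns["product"], columns["quantity"])
    return sum(quantity for name, quantity in pairs if name == product)


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


apple_total = total_quantity_columns(columns, "apple")
encoded_products = run_length_encode(columns["product"])
```

Both functions return the same totals. The column version simply never reads `date`, and the sorted `product` column shrinks to two run-length pairs.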

Ch4, Encoding and Evolution

backward compatibility: newer code can read data written by older code
forward compatibility: older code can read data written by newer code
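
A small sketch of both directions using JSON. The `name` / `email` record and the function names are hypothetical; the point is only that the old reader ignores fields it does not know, and the new reader supplies a default for a field that old data lacks.

```python
import json


def encode_v1(name):
    """Old writer: only knows the `name` field."""
    return json.dumps({"name": name})


def encode_v2(name, email):
    """New writer: adds an `email` field."""
    return json.dumps({"name": name, "email": email})


def decode_v1(payload):
    """Old reader: picks out `name` and ignores unknown fields.

    Ignoring the unknown `email` field is what makes it forward compatible.
    """
    record = json.loads(payload)
    return {"name": record["name"]}


def decode_v2(payload):
    """New reader: treats `email` as optional with a default of None.

    The default is what makes it backward compatible with v1 data.
    """
    record = json.loads(payload)
    return {"name": record["name"], "email": record.get("email")}


old_data = encode_v1("Ada")
new_data = encode_v2("Ada", "ada@example.com")

backward = decode_v2(old_data)  # newer code reads older data
forward = decode_v1(new_data)   # older code reads newer data
```

One caveat with this approach: if the old reader decodes a record and writes it back, it silently drops `email`, so code that rewrites records needs to preserve unknown fields.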

Part II, distributed data

Replication

Goals

reduce latency (keep data geographically close to the users)
increase availability (keep working even if some nodes fail)
increase read throughput (more machines can serve read queries)