DDIA stands for Designing Data-Intensive Applications
Ch3, Storage and Retrieval
Two families of storage engines:
- log-structured storage engines
  - simplest form: append every write to a log and keep an in-memory hash map from each key to its position in the log
    - pros: appending and segment merging are sequential writes, which are faster than random writes
    - cons:
      - the hash table has to fit in memory
      - range queries are inefficient (e.g., keys k1234-4455), since every key must be looked up individually
  - SSTables & LSM-trees
    - SSTable (Sorted String Table): segment files sorted by key, located through a sparse in-memory index; writes go to an in-memory memtable first, which is later saved to disk as an SSTable
    - a red-black tree or AVL tree can keep the memtable in sorted order
    - LSM-tree = log-structured merge-tree
    - lookups can be slow, especially for missing keys: check the memtable, then segment after segment on disk; a Bloom filter mitigates this
    - advantages: sequential writes, often lower write amplification than B-trees, and compaction keeps storage smaller
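A minimal sketch of the append-only log with a hash-map index, to make the pros and cons concrete. The class name `HashIndexedLog`, the `key,value` line format, and the single unsegmented file are all assumptions made for illustration; segment merging and crash recovery are left out, and keys are assumed not to contain commas or newlines.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only key-value log with an in-memory hash index (hypothetical sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        open(path, "ab").close()  # create the log file if it does not exist

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)  # make the current end-of-file offset explicit
            offset = f.tell()
            f.write(record)  # sequential write: old records are never modified
        self.index[key] = offset  # the newest offset shadows older records

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # one seek, then read a single record
            line = f.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value


log_dir = tempfile.mkdtemp()
db = HashIndexedLog(os.path.join(log_dir, "data.log"))
db.set("k1", "a")
db.set("k2", "b")
db.set("k1", "c")  # appended as a new record; the index now points at it
```

Every key lives in `self.index`, which is why the whole key set has to fit in memory, and a range scan would need one `get` per key because the log is not sorted.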
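The memtable, SSTable, and sparse index ideas can be sketched in a few lines. This is only an illustration under stated assumptions: `TinyLSM` and its parameters are hypothetical names, in-memory lists stand in for on-disk segment files, the memtable is a plain dict that is sorted at flush time instead of a red-black or AVL tree, and no Bloom filter is included.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: memtable plus sorted segments with sparse indexes (sketch)."""

    def __init__(self, memtable_limit=4, sparse_every=2):
        self.memtable = {}    # recent writes, held in memory
        self.segments = []    # oldest first; each is (sorted records, sparse index)
        self.memtable_limit = memtable_limit
        self.sparse_every = sparse_every

    def _build_segment(self, items):
        records = sorted(items)  # an SSTable is sorted by key
        # Sparse index: remember only every Nth key and its position.
        sparse = [(records[i][0], i)
                  for i in range(0, len(records), self.sparse_every)]
        return records, sparse

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.segments.append(self._build_segment(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for records, sparse in reversed(self.segments):  # newest segment first
            sparse_keys = [k for k, _ in sparse]
            i = bisect.bisect_right(sparse_keys, key) - 1
            if i < 0:
                continue  # key sorts before the first key of this segment
            start = sparse[i][1]
            # Scan only the short run between two sparse index entries.
            for k, v in records[start:start + self.sparse_every]:
                if k == key:
                    return v
        return None  # a missing key costs a check of every segment

    def compact(self):
        merged = {}
        for records, _ in self.segments:  # oldest first, so newer values win
            merged.update(records)
        self.segments = [self._build_segment(merged.items())] if merged else []


lsm = TinyLSM(memtable_limit=2, sparse_every=2)
for k, v in [("b", "1"), ("a", "2"), ("c", "3"), ("a", "4"), ("d", "5")]:
    lsm.put(k, v)
```

The `return None` path in `get` is the slow case the notes mention: a missing key is only ruled out after every segment has been checked, which is the work a Bloom filter would let the engine skip.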
- page-oriented storage engines (e.g., B-trees)
  - B-trees are balanced trees of fixed-size pages, so a lookup follows O(log n) page references from the root
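A rough sketch of a B-tree lookup, assuming a B+-tree-style layout where values live only in leaf pages. The `Page` class and the hand-built two-level tree are hypothetical; inserts, page splits, and on-disk storage are omitted.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Page:
    keys: list
    values: list = field(default_factory=list)    # filled in on leaf pages
    children: list = field(default_factory=list)  # filled in on internal pages


def search(page, key):
    """Walk from the root to a leaf, following one child reference per level."""
    while page.children:
        # Child i holds keys below keys[i]; the last child holds the rest.
        page = page.children[bisect.bisect_right(page.keys, key)]
    i = bisect.bisect_left(page.keys, key)
    if i < len(page.keys) and page.keys[i] == key:
        return page.values[i]
    return None


# Hand-built tree: one root page with three leaf pages.
root = Page(keys=[20, 40], children=[
    Page(keys=[5, 10], values=["a", "b"]),
    Page(keys=[20, 30], values=["c", "d"]),
    Page(keys=[40, 50], values=["e", "f"]),
])
```

Because every leaf sits at the same depth, the number of pages read per lookup is the height of the tree.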
Transaction processing / analytics
- OLTP (online transaction processing)
- OLAP (online analytic processing)
Data warehousing
- ETL (extract, transform, load)
- Star and snowflake: schemas for analytics
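To illustrate the difference between OLTP and OLAP access patterns on a star schema, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_store`) and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
-- The fact table sits in the middle of the star and references the dimensions.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    quantity INTEGER,
    price REAL
);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [
    (1, "apple", "fruit"), (2, "banana", "fruit"), (3, "soap", "household"),
])
conn.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "Berlin"), (2, "Paris")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 0.5),
    (2, 2, 1, 4, 0.25),
    (3, 3, 2, 2, 2.0),
    (4, 1, 2, 6, 0.5),
])

# OLTP-style access: fetch one record by key.
oltp_row = conn.execute(
    "SELECT quantity, price FROM fact_sales WHERE sale_id = ?", (3,)
).fetchone()

# OLAP-style access: aggregate over many fact rows, grouped by a dimension.
olap_rows = conn.execute("""
    SELECT p.category, SUM(f.quantity * f.price) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

In a snowflake schema the dimension tables would be normalized further, for example by moving `category` into its own table that `dim_product` references.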
Column-oriented storage
- faster for analytic queries that scan a few columns over many rows, because only the needed columns are read
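A minimal sketch of the row layout versus the column layout, using plain Python lists as a stand-in for files on disk. The sample data and function names are made up for illustration; real column stores also add compression and sorting, which are not shown.

```python
# Row-oriented layout: all values of one row are stored together.
rows = [
    {"date": "2024-01-01", "product": "apple", "quantity": 10, "price": 0.5},
    {"date": "2024-01-01", "product": "banana", "quantity": 4, "price": 0.25},
    {"date": "2024-01-02", "product": "apple", "quantity": 6, "price": 0.5},
]

# Column-oriented layout: one list per column. The i-th entry of every
# list belongs to row i, so all columns must keep the same row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_quantity_row_store(rows):
    # Has to touch every whole row even though one field is needed.
    return sum(row["quantity"] for row in rows)


def total_quantity_column_store(columns):
    # Reads just the one column the query asks for.
    return sum(columns["quantity"])


def reconstruct_row(columns, i):
    # A full row is reassembled by taking entry i from each column.
    return {name: values[i] for name, values in columns.items()}
```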
Ch4, Encoding and Evolution
- backward compatibility: newer code can read data written by older code
- forward compatibility: older code can read data written by newer code
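Both directions can be shown with two versions of a record, here encoded as JSON purely for illustration. The `v1`/`v2` function names and the `email` field are assumptions for the sketch, not something from the book.

```python
import json


def write_v1(name):
    """Old writer: knows only the `name` field."""
    return json.dumps({"name": name})


def write_v2(name, email):
    """New writer: adds an `email` field."""
    return json.dumps({"name": name, "email": email})


def read_v1(data):
    """Old reader: keeps the fields it knows and ignores unknown ones,
    which is what makes it forward compatible with v2 data."""
    record = json.loads(data)
    return {"name": record["name"]}


def read_v2(data):
    """New reader: falls back to a default when `email` is absent,
    which is what makes it backward compatible with v1 data."""
    record = json.loads(data)
    return {"name": record["name"], "email": record.get("email")}
```

The pattern only works because the new field is optional: a v2 reader that required `email` would fail on data written by v1.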
Part II, Distributed Data
Replication
Goal
- reduce latency (keep data geographically close to users)
- increase availability (keep working even if some nodes fail)
- increase read throughput