DDIA stands for Designing Data-Intensive Applications

2026-06-14

Ch3 Storage and Retrieval

Two engines:

  • log-structured storage engine

    • hash-indexed log: basically append every write to a log and keep the index in an in-memory hash map
      • pros: appending and segment merging are sequential writes, which are faster than random writes
      • cons: the hash table has to fit in memory; range queries are difficult (e.g., scanning keys k1234-4455)
    • SSTables & LSM-trees
      • SSTable (sorted string table): segments sorted by key, so a sparse in-memory index is enough; writes go to a memtable first, which is later saved to disk as an SSTable
      • one can use red-black trees / AVL trees to keep the memtable in sorted order
      • LSM-tree: log-structured merge-tree
      • lookups can be slow, especially for keys that do not exist: the search goes through the memtable, then the most recent segment, then older segments on disk; this can be mitigated with a Bloom filter
      • advantages: sequential writes give high write throughput and can reduce write amplification; compaction keeps storage smaller
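A minimal sketch of the two log-structured designs above, to make the read and write paths concrete. Everything here is an illustrative assumption, not how any real engine is implemented: the class names `HashIndexLog` and `TinyLSM` are made up, Python lists stand in for files on disk, the memtable is a plain dict sorted at flush time instead of a red-black/AVL tree, compaction uses a dict instead of a streaming merge, and the Bloom filter is left out.

```python
import bisect


class HashIndexLog:
    """Append-only log plus an in-memory hash map: key -> position of latest record."""

    def __init__(self):
        self.log = []    # stands in for an append-only file on disk
        self.index = {}  # must fit in memory

    def put(self, key, value):
        self.index[key] = len(self.log)  # point the index at the new record
        self.log.append((key, value))    # sequential write, never overwrite in place

    def get(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][1]

    def compact(self):
        # Keep only the latest value per key, then rebuild the index.
        live = [(k, self.log[pos][1]) for k, pos in self.index.items()]
        self.log = live
        self.index = {k: i for i, (k, _) in enumerate(live)}


class TinyLSM:
    """Memtable plus sorted segments (SSTables), newest segment last."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # each segment is a list of (key, value) sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: write the memtable out as a new sorted segment.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Search order: memtable, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for seg in reversed(self.segments):
            i = bisect.bisect_left(seg, (key,))  # binary search works because seg is sorted
            if i < len(seg) and seg[i][0] == key:
                return seg[i][1]
        return None  # a missing key forces a look at every segment

    def scan(self, lo, hi):
        # Range query over keys lo..hi (inclusive); newer values win.
        merged = {}
        for seg in self.segments:  # oldest first, so later updates overwrite
            start = bisect.bisect_left(seg, (lo,))
            for k, v in seg[start:]:
                if k > hi:
                    break
                merged[k] = v
        for k, v in self.memtable.items():
            if lo <= k <= hi:
                merged[k] = v
        return sorted(merged.items())

    def compact(self):
        # Merge all segments into one, dropping overwritten values.
        merged = {}
        for seg in self.segments:
            merged.update(seg)
        self.segments = [sorted(merged.items())]
```

Note how `HashIndexLog` has no cheap equivalent of `scan`: without sorted keys, a range query means checking every key in the hash map. `TinyLSM.get` shows why missing keys are the worst case, which is the cost a Bloom filter is meant to avoid.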

  • page-oriented storage engines (e.g., B-tree)

    • balanced trees
  • Transaction processing / analytics

    • OLTP (online transaction processing)
    • OLAP (online analytic processing)
  • Data warehousing

    • ETL (extract, transform, load)
    • Star and snowflake: schemas for analytics
  • Column-oriented storage

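    • faster for analytic queries that read a few columns across many rows, since only the needed columns are loaded; columns also compress well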
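A small sketch of the row vs. column layout idea, assuming a made-up fact table (the field names `date_key`, `product_sk`, `quantity` and the helper names are illustrative only). The point is that an aggregate over one or two fields only has to touch those columns, not every full row.

```python
# Row-oriented: each record is stored together.
rows = [
    {"date_key": 140102, "product_sk": 31, "quantity": 1},
    {"date_key": 140102, "product_sk": 69, "quantity": 3},
    {"date_key": 140103, "product_sk": 31, "quantity": 2},
]


def to_columns(rows):
    """Column-oriented: one list per field; position i in every list is row i."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_quantity_for_product(columns, product_sk):
    # Only two columns are read; date_key is never touched.
    return sum(
        qty
        for sk, qty in zip(columns["product_sk"], columns["quantity"])
        if sk == product_sk
    )


columns = to_columns(rows)
total = total_quantity_for_product(columns, 31)
```

With many columns per row and queries that use only a handful of them, skipping the unused columns is where most of the saving comes from.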

Ch4, Encoding and Evolution

  • backward compatibility: newer code can read data written by older code
  • forward compatibility: older code can read data written by newer code
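The two directions can be sketched with a record that gains a field between versions. This is a toy example using JSON and hypothetical `write_v1` / `read_v2` style helpers, not a specific encoding format from the chapter: the new field `email` is optional, the new reader falls back to a default when it is missing, and the old reader ignores fields it does not know.

```python
import json


def write_v1(name):
    return json.dumps({"name": name})


def write_v2(name, email):
    # v2 adds a new field.
    return json.dumps({"name": name, "email": email})


def read_v1(data):
    record = json.loads(data)
    # Old code: picks out the fields it knows and ignores the rest.
    return {"name": record["name"]}


def read_v2(data):
    record = json.loads(data)
    # New code: the added field may be absent in old data, so use a default.
    return {"name": record["name"], "email": record.get("email")}


# Backward compatibility: new reader, old data.
old_data_new_code = read_v2(write_v1("alice"))

# Forward compatibility: old reader, new data.
new_data_old_code = read_v1(write_v2("bob", "bob@example.com"))
```

If `read_v2` used `record["email"]` instead of `record.get("email")`, it would fail on old data, which is the backward-compatibility break to watch for when adding a field.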

Part II, Distributed Data


Replication

Goal

  • reduce latency (close to the user)
  • increase availability (some nodes may fail)
  • increase read throughput