SerCeMan / intensivedata.txt

Save SerCeMan/e0c1232ab5d9219bb86d5571b7ef53e8 to your computer and use it in GitHub Desktop.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

# Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
1. **Chapter 1. Reliable, Scalable and Maintainable Applications**
1. R faults != failures, faults cause failures. Systems should be fault-tolerant, resistant to some types of faults
2. S Amazon cares about 99.9% percentile because people with higher latencies usually are people who have the most data and therefore, they’re most valuable customers
3. S tail latency amplification - multiple requests one critical path during one page served
2. **Chapter 2. Data Models and Query Languages**
1. Hierarchical model - imperative querying, no way to change schema, children are ordered, no many-to-many
2. CODASYL (network) vs SQL
3. NoSQL - often no schema (precisely - schema on read vs schema on write)
1. Document databases - docs are self-contained, map reduce
2. Graph databases - vertices and unordered set of edges. Propety Graphs (Cypher) or Tripples (SPARQL, RDF model), +Datalog.
3. **Chapter 3. Storage and Retrieval**
1. Storage engines - log-structured (LSM Trees) vs page-oriented (B-tree).
2. SSTables - String Sorted Tables. (LevelDB, RocksDB, similar - CS and HBase)
1. Grow a sorted collection (rbtree or avl tree), once it’s big dump it onto disk
2. Sometimes do compaction, it’s a simple mergesort
3. sparse index for search
4. write a separate append only log - for recovery
3. B-Trees
1. in-place updates/copy on write - interesting
2. Don’t suffer from compaction
4. Indexes:
1. clustered vs secondary index
2. R-tree - multidimentional
3. fuzzy search - Lucene
4. all in-memory, but distributed - VoltDB, MemSQL, Redis/Couchbase - weak durability
5. OLTP (online translation processing - classic) vs OLAP (online analytics processing)
1. OLTP - rows vs OLAP - columnar is better
2. Data Warehouses
1. Extract Transform Load (ETL)
2. OSS - Apache Hive, Spark SQL, Cloudera Impala, FB Presto, Apache Drill, based on Google Dremel:
3. columnar storage, column compression!, 1 column is sorted
4. materialised views ~= multidimentional data cubes
6. TODO:
1. https://blog.acolyer.org/2015/01/26/dremel-interactive-analysis-of-web-scale-datasets/
2. https://blog.acolyer.org/2014/11/26/the-log-structured-merge-tree-lsm-tree/
3. [Bigtable: A Distributed Storage System for Structured Data](https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf)
4. **Chapter 4. Encoding and Evolution**
1. Backward compatibility vs Forward compatibility, rolling upgrades, evolvability
2. embedded into programming language are not crossplatform
3. text JSON, XML, CSV vs binary (protobuf, thrift, avro)
1. protobuf, thrift - tags + single schema, avro - no tags + writer schema and reader schema.
1. avro - suitable for dynamic schema generation
4. Dataflows: db, service calls, async messages
1. Has to be careful with passing unrecognised fields through and not deleting them
2. DB: data outlives code
3. REST vs RPC (sometimes through SOAP - nooo), RPC vs local calls
4. Message passing dataflow - buffer, redelivering, avoids service discovery, one message several recipients, decoupling
1. actors - broker + actor programming
5. **Chapter 5. Replication**
1. geo-closeness, availability++, throughput++
2. Single leader (master-slave):
1. + simple,
2. - requires fail-over
3. Sync vs Async replication
1. sync - up-to date, but bigger chance of failure, fully sync is impractical
2. async - a number of nodes can get writes asynchronously (semi synchronous - at least 2 nodes are synchronous)
4. Failover:
1. determine that the leader has failed - e.g. no response in 30 sec
2. choose new leader - election (usually most up-to-date)
3. reconfiguring system to use the new leader, old leader might still think that he is the leader (split brain), might violate client’s reliability expectations
5. Replication:
1. statement based - compact, but might be nondeterministic (mysql < 5.1)
2. WAL - low lever writes to the engine (not portable)
3. Logical (row-based) - heavier, but portable (mysql binlog)
4. trigger-based - application layer, slow but flexibel
6. Replication lag
1. read-after-write
2. monotonic reads (no back in time when read from different followers)
3. consistent prefix reads - if a sequence of bytes written in a certain oder, it will always be read in the same order
3. Multi-Leader - e.g. across multiple datacenter
1. Conflict resolving
1. conflict detection is async (or else the same as single leader)
2. either avoid conflicts or converge:
1. LWW - data loss
2. higher ID - data loss
3. merging- CRDT (2-way merge), OT, mergeable persistent data structures (3-way merge)
4. record conflict and ask user
3. on read - when next read happens vs on write - as soon as db detects
4. topologies - circular, start, all-to-all
4. Leaderless
1. Catching-up:
1. read repair - when client reads in parallel from several replicas
2. anti-entropy - constantly look for differences
2. Quorums
1. r+w>n
2. staleness can be monitored, quantify eventually
3. sloppy quorum - use neighbour nodes
3. Happens before
1. meging
2. version vectors
5. TODO:
1. https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/
2. [A Conflict-Free Replicated JSON Datatype](https://arxiv.org/pdf/1608.03960.pdf)
6. **Chapter 6. Partitioning**
1. Partitioning algorithms:
1. By key range, good for range scans, but tend to be hot spots
2. By hash of key, no efficient range scans, still can be hotspot if (celebrity with 1M followers)
2. Secondary indexes - don’t map neatly to partitions
1. By document - local indexes + scatter/gather
2. By term - global index - often async and needs to be partitioned itself
3. Rebalancing
1. Mod - how **not to** - too many movements = very inefficient
2. Fixed number of partitions - new node steals a few partitions from other nodes
3. Dynamic partitioning - like b-trees, adapts to the data volume, might need pre-split
4. Proportionally to nodes - when new node joins, it randomly choses a fixed number of partitions to split - unfair, but good on average
4. Automatic vs Manual - operational tradeoff
5. Request routing
1. clients do round-robin, nodes redirect
2. routing tier, often with ZK-like subsystem
3. client knows where to send a request
6. TODO:
1. [A Fast, Minimal Memory, Consistent Hash Algorithm](https://arxiv.org/pdf/1406.2294.pdf)
7. **Chapter 7. Transactions**
1. ACID
1. Atomicity - (abortability) changes visible together
2. Consistency - **overloaded**, consistent data invariants, doesn’t really belong to database, it’s rather app property
3. Isolation - serialisability, like it’s only one at a time
4. Durability - if DB said yes, it’s safely written to disk, but:
1. if disk dies, data might be lost, replication solves the problem but all dbs might be knocked out due to bad input
2. in async replication data might be lost
3. SSD might violate expectations, fsync isn’t guaranteed to work correctly
4. 30-80% SSDs develop at least 1 bad block during first year
5. If SSD disconnected from power, it might lose data
2. Isolation levels
1. Read committed - often default
1. No dirty reads - only see what’s been committed - solved by storing both values old&new + transaction id
2. No dirty writes - only overwrite already committed data - solved by row level locks
2. Repeatable read
1. Non-repeatable read/read skew - solved by snapshot isolation, implemented via MVCC. Indexes might be impl via persistent data structures or copy-on-write
2. Lost updates. Read-Modify-Write, overwriting without taking changes into account. Some repeatable read implement this, some not
3. Serialisable (oracle can’t do that)
1. Write skew - transaction make decisions based on some calculations (on-call doctors)
2. Phantom reads - reads object that match search and make a decision based on that, seems to be the same as write skew
3. Implementations:
1. Execute in a serial order (transactions have to be fast)
2. 2-phase locking - standard, read-write locks
3. SSI - serialisable snapshot isolation, new, optimistic approach, checks before commit
3. TODO:
1. https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/
2. [Unix’s File Durability Problem](https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem)
3. https://dzone.com/articles/debunking-myths-about-voltdb
8. **Chapter 8. The Trouble with Distributed Systems**
1. HPC vs cloud computing - offline jobs vs online processing
2. Network faults - async vs synchronous networks
1. async networks have unbounded delays - but good for bursty traffic
3. Unreliable clocks
1. Monotonic vs time-of-day clocks
1. time-of-day `System.currentTimeMillis()/clock_gettime(CLOCK_REALTIME)`
1. time since epoch
2. synchronised via NTP
3. not very good resolution
4. can jump back in time
5. users can change on their devices
2. monotonic - `System.nanotime()/clock_gettime(CLOCK_MONOTONIC)`
1. for duration
2. monotonic
3. absolute value is meaningless
3. NTP might fail on many levels, jumps back and forwards are usual, but can be reduced by using GPS
1. therefore LWW shouldn’t rely on timestamps
2. google does confidence intervals
4. Pauses - gc, SIGSTOP, swapping, virtual machines, closed laptops
5. Fencing tokens - leases should be validated
6. byzantine faults - make everything complicated
7. TODO:
1. https://shipilev.net/blog/2014/nanotrusting-nanotime/ - latency, granularity, monotonicity
2. http://blog.scalyr.com/2012/10/a-systematic-look-at-ec2-io/
3. https://aphyr.com/posts/299-the-trouble-with-timestamps
4. https://martinfowler.com/articles/lmax.html
5. https://lamport.azurewebsites.net/pubs/pubs.html
9. **Chapter 9. Consistency and Consensus**
1. Eventual consistency ~= convergence, weak
2. Linearizability
1. once value is written, everybody can see it, doesn’t jump back
2. It is possible to test if a behaviour is linearizable by recording the timing of all req and res and checking whether they can be arranged into a valid seq order (but it’s compute expensive, see jepsen)
3. vs Serializability
1. serializability - isolation property, transaction were executed in some serial order
2. llinearazibility doesn’t prevent write skew
3. combination = strict serializability
4. Serialisable snapshot isolation is not linearisable
4. Implementations:
1. Single leader replication (sync)
2. Consensus alrorithm
3. Leaderless replication where a reader does read repair synchronously (anyways without CAS)
4. NOT: async leader replication, multi-leader, leaderless replication (usually)
5. CAP - classic, but it’s unhelpful because it defines only one kind of network faults
3. Ordering guarantees - helps to preserve causality
1. causal ordering - between a question and an answer, can’t be used to implement constraints
2. happens before vs concurrent
3. total order vs partial order
1. linearizability - we have total order of ops
2. causality - defines partial order
4. Lamport timestamps - consistent with causality
5. Total order broadcast - knowing when the total order is finalized
1. reliable delivery - no messages are logs
2. totally ordered delivery - to each node in the same order
3. total order broadcast = linearizability = consensus (can be implemented through each other)
4. Consensus
1. FLP result = valid only in a restrictive model with no timeouts
2. Distributed transactions
1. 2PC - two phase commit
1. XA Transactions - can be implemented
2. Coordinator can fail, so coordinator should be a distributed system by itself
3. operations nightmare
2. 3PC - only works with bounded delays in network
3. database-internal - VoltDB, MySQL NDB vs heterogeneous - XA
4. very slooooow
3. Fault tolerant consensus
1. Properties
1. Uniform agreement - no two nodes decide differently
2. Integrity - no node decide twice
3. Validity - if a node decides value v, then v was proposed by some node
4. Termination - every node that doesn’t crash eventually decides a value
4. impl: VSR, Paxos, Raft, Zab
5. TODO:
1. [Eventual Consistency Today: Limitations, Extensions, and Beyond](http://queue.acm.org/detail.cfm?id=2462076)
2. [Adventures in building your own database](https://vimeo.com/album/3660528/video/145842297)
3. https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
4. https://ayende.com/blog/167362/the-fallacy-of-distributed-transactions
5. http://blog.willportnoy.com/2012/06/lessons-learned-from-paxos.html
10. **Chapter 10. Batch Processing**
1. unix tools - simplest batch processing (uniform interfaces), sorting + piping vs in-memory aggregation (programming languages)
2. distributed = 2 problems - partitioning & fault tolerance
3. Map Reduce
1. stdin, stdout = HDFS
2. Job execution: read into records, extract keys, sort, reduce by iteration
1. Mapper - called once for every record, extracts the key and value
2. Reducer - takes the key-value pairs, collects all values belonging to the same key and calls the reducer with an iterator over this collection
3. shuffle - partitioning and distributing by key, typically implicit
3. Reduce-side joins and grouping
1. Sort-merge joins: sort 1, sort 2, join 1 & 2
1. handling skew (hot keys): sampling (Pig) vs manual (Hive, Crunch)
4. Map-side joins - require assumptions (metadata) about the input
1. Broadcast hash joins: each mapper reads the smallest dataset instance entirely into memory to join
2. Partitioned hash joins: if the inputs are partitioned in the same way, then the hash join can be applied independently. Each mapper one small partitions from each dataset
3. Map-side merge joins: the same as ↑ but also sorted - doesn't have to fit in memory
5. VS distributed databases
1. storage - just byte sequences
2. processing models - both OLTP and OLAP
3. frequent faults - retries should be safe
4. Dataflow engines (Spark, Flink, Tez,…)
1. map-reduce functions => operators that can be piped
1. + repartition and sort by key 0 - sort-merge joins & grouping
2. + take a few inputs, partition but skip sorting, saves up for hash joins
2. pros:
1. sorting only where it’s required
2. no unnecessary map tasks, (the work might be included into preceding reducer)
3. all steps are explicitly declared = the workflow can be optimised by placing everything on the same machine
4. intermediate state can be kept in memory (crash = retry)
3. fault tolerance: jobs should be deterministic
5. Graph processing (iterative style) (OLAP way)
1. run a batch process → check condition → go 1 or stop
2. Pregel model (bulk synchronous parallel BSP): Apache Giraph, Spark GraphX, …
1. one vertex can send a message to another vertex (by ID) along the edges →
in each iteration a function called for each vertex which processes all messages →
(stateful - each vertex remembers state, similar to the actor model)
2. a lot of cross-machine communication, optimisation is ongoing research, so one machine is better if possible
6. TODO:
1. [The Elephant was a Trojan Horse: On the Death of Map-Reduce at Google](http://the-paper-trail.org/blog/the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/)
2. [Apache Kafka, Samza, and the Unix Philosophy of Distributed Data](https://www.confluent.io/blog/apache-kafka-samza-and-the-unix-philosophy-of-distributed-data/)
3. [Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing)
11. **Chapter 11. Stream Processing**
1. log is partitioned and clients use batch reading with offsets
1. messages may be expensive - JMS/AMQP broker
2. messages are fast to process - log based broker
2. disk - circular buffer, it takes 11h to fill 6TB disk on max throughput
3. change data capture - observe all changes and replicate them to other storages
4. Event sourcing
1. db = replayed log of events, real db = performance optimisation
2. Request comes (it’s command), once committed, it becomes an event (fact)
3. CQRS = separate the form in which data is read to the form in which data is written
4. Sometimes immutable data needs to be edited for audit, etc..
5. Processing streams
1. Complex event processing (CEP 1990s) - you specify rules in a declarative language, CEP looks for patterns in the stream
2. streams need to maintain materialised view - app state
3. Time: ideally, system should use event times and not local processing times, but events might be delayed (straggler events) which makes it hard to calculate windows
1. you can ignore straggler events
2. or publish corrections
4. Window types
1. Tumbling window - fixed length (03:00-03:59), take event, round down minutes
2. Hopping window - same fixed length, but allows for overlapping (01-06, 02-07, etc…)
3. Sliding window - all the events that occur within some interval of each other. Can be implemented buy keeping queue and removing old ones
4. Session window - no fixed duration, all events of the same user\
5. Stream joins
1. Stream-stream join (window join)
1. state is needed
2. Stream-table join
1. local copy of DB, but the DB needs to be updated
3. Table-table join
1. maintain materialized view
6. Fault tolerance
1. batch processing - exactly once
2. in streams, similar can be with “microbatching” and checkpointing (state to disk)
3. atomic commit is possible (similar to XA)
4. events may be idempotent instead
6. TODO
1. [The Dataflow Model: A Practical Approach to Balancing. Correctness, Latency, and Cost in Massive-Scale](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)
2. [Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/)
3. http://tech.labs.oliverwyman.com/blog/2016/05/05/pushing-back/
4. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
5. https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda
6. https://www.oreilly.com/ideas/questioning-the-lambda-architecture
12. **Chapter 12. The Future Of Data Systems**
1. Unbundling databases
1. index = materialised view
2. piping data
3. reactions, subscribe to changes, pushing them to clients
4. shifting boundaries between the write path and the read path (index)
5. reads are events too (helpful to determine causality)
2. Correctness
1. op ids for safe retries
2. end to end fault-tolerance would be nice to have
3. Timeliness vs integrity
1. timeliness = eventual consistency, can be violated in many apps
2. integrity = general integrity, can’t be violated most of the time
4. Verification, e.g. Merkle trees
3. Culture
1. Data might be “biased”
2. Be careful with privacy
4. TODO
1. https://www.voltdb.com/blog/2016/03/23/winning-now-future-voltdb-shines/
2. https://blog.twitter.com/engineering/en_us/a/2016/strong-consistency-in-manhattan.html