Notes on Google Bigtable.
Overview.
- Google Bigtable - a horizontally-scalable multidimensional sorted map.
- API allows users to write rows with a unique key, and a set of columns. Tables are sorted by key and queryable with the basic operators (e.g. equality, lt, gt).
- The key feature of Bigtable is its horizontal scalability.
- Let’s use the example of an English dictionary for our database table. The keys are the words, the values are the definitions.
- When we add more words, the dictionary might become too large for a single node to host. So we split the table into two halves - these are called tablets.
- A tablet covers a key range. So for example, entries A-L are stored in tablet #1, and entries M-Z are stored in tablet #2. This covers the entire keyspace of A-Z.
- Tablets have a fixed capacity, and get split when they become too large. For example, if we added 10K new words beginning with A to tablet #1, it might become over capacity and get split. Splitting would produce new tablets, each with smaller key ranges - e.g. tablet #1 of [Aa, Ae], tablet #2 of [Ae, L], and tablet #3 would remain as [L, Z].
- Now that we can partition work into equal-sized workloads, it is simple to distribute.
- Network architecture:
- The Bigtable network is organised into tablet nodes and a master node. The master node assigns tablets to tablet servers, and tablet servers are responsible for serving reads and writes.
- A tablet is only owned by one tablet server. Writes to tablets are atomic, and it is straightforward to implement mutable data this way.
- The Bigtable client libraries automatically handle routing read/writes to the correct tablet servers, based on the tablet identified as being read/written.
- Summary:
- Tables are split into tablets, the basic unit of load balancing. Tablets have a fixed capacity and cover a key range.
- The master node is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, and balancing tablet-server load (splitting/merging tablets).
Database storage.
Log structured merge tree (memtable, sstable) is used on tablet servers.
The underlying storage for sstables is GFS. Tablet servers store their data in /gfs/tablets/<id>/x.dat
. Upon tablet server failure, another tablet server can start and read from this log. GFS provides the durability of data across failures.
Consistency.
- Master role is kept on Chubby (single lock)
- Master state is kept on Chubby (strongly consistent)
- Master state is the tablet location hierarchy.
- They use something like a B+Tree so that the master looks up the initial locations through Chubby, which then point to sublocations, etc.
- Master and tablet servers are strongly consistent - only a single node.
- Failures: if master dies, chubby lock released, another node takes it.
Node roles.
- Tablet node - responsible for serving read and write requests to a specified tablet.
- Master node - responsible for:
- assigning tablets to tablet servers,
- detecting the addition and expiration of tablet servers,
- balancing tablet-server load
Master.
The controller regulates the state:
- partitioning tables into tablets, splitting/merging tablets when they get too large/small.
- assigning tablets to tablet servers.
- detecting failures in tablet servers.
Algorithm.
- handle events:
- new tablet server
- assign tablets
- tablet server liveness changes - dies
- re-assign tablets
- tablet.size ≥ 70% max capacity:
- split tablet.
- assign tablets.
- tablet.size ≤ 20% max capacity:
- merge tablet.
- assign tablets.
- create table
- new tablet server
- mechanisms
- tablet server liveness
- challenge on-chain for latest state, store it for later and slash if incorrect
- getting tablet size
- tablet servers self-report this to master when it hits upper/lower thresholds.
- tablet server liveness
Tablet server.
- tablets
- handle writes of rows (append, mutate)
- handle reads of rows
- heartbeat to master
- report load (for all tablets, each size)
- handle master instructions
- split:
- mark old tablet as read only
- create 2 new tablets
- copy rows to the new smaller tablet
- merge:
- opposite of split
- split: