‹ Notes

Notes on Google Bigtable.

Paper

Overview.

  • Google Bigtable - a horizontally-scalable multidimensional sorted map.
    • API allows users to write rows with a unique key, and a set of columns. Tables are sorted by key and queryable with the basic operators (e.g. equality, lt, gt).
  • The key feature of Bigtable is its horizontal scalability.
    • Let’s use the example of an English dictionary for our database table. The keys are the words, the values are the definitions.
    • When we add more words, the dictionary might become too large for a single node to host. So we split the table into two halves - these are called tablets.
    • A tablet covers a key range. So for example, entries A-L are stored in tablet #1, and entries M-Z are stored in tablet #2. This covers the entire keyspace of A-Z.
    • Tablets have a fixed capacity, and get split when they become too large. For example, if we added 10K new words beginning with A to tablet #1, it might become over capacity and get split. Splitting would produce new tablets, each with smaller key ranges - e.g. tablet #1 of [Aa, Ae], tablet #2 of [Ae, L], and tablet #3 would remain as [L, Z].
    • Now that we can partition work into equal-sized workloads, it is simple to distribute.
  • Network architecture:
    • The Bigtable network is organised into tablet nodes and a master node. The master node assigns tablets to tablet servers, and tablet servers are responsible for serving reads and writes.
    • A tablet is only owned by one tablet server. Writes to tablets are atomic, and it is straightforward to implement mutable data this way.
    • The Bigtable client libraries automatically handle routing read/writes to the correct tablet servers, based on the tablet identified as being read/written.
  • Summary:
    • Tables are split into tablets, the basic unit of load balancing. Tablets have a fixed capacity and cover a key range.
    • The master node is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, and balancing tablet-server load (splitting/merging tablets).

Database storage.

Log structured merge tree (memtable, sstable) is used on tablet servers.

The underlying storage for sstables is GFS. Tablet servers store their data in /gfs/tablets/<id>/x.dat. Upon tablet server failure, another tablet server can start and read from this log. GFS provides the durability of data across failures.

Consistency.

  • Master role is kept on Chubby (single lock)
  • Master state is kept on Chubby (strongly consistent)
    • Master state is the tablet location hierarchy.
    • They use something like a B+Tree so that the master looks up the initial locations through Chubby, which then point to sublocations, etc.
  • Master and tablet servers are strongly consistent - only a single node.
  • Failures: if master dies, chubby lock released, another node takes it.

Node roles.

  • Tablet node - responsible for serving read and write requests to a specified tablet.
  • Master node - responsible for:
    • assigning tablets to tablet servers,
    • detecting the addition and expiration of tablet servers,
    • balancing tablet-server load

Master.

The controller regulates the state:

  • partitioning tables into tablets, splitting/merging tablets when they get too large/small.
  • assigning tablets to tablet servers.
  • detecting failures in tablet servers.

Algorithm.

  • handle events:
    • new tablet server
      • assign tablets
    • tablet server liveness changes - dies
      • re-assign tablets
    • tablet.size ≥ 70% max capacity:
      • split tablet.
      • assign tablets.
    • tablet.size ≤ 20% max capacity:
      • merge tablet.
      • assign tablets.
    • create table
  • mechanisms
    • tablet server liveness
      • challenge on-chain for latest state, store it for later and slash if incorrect
    • getting tablet size
      • tablet servers self-report this to master when it hits upper/lower thresholds.

Tablet server.

  • tablets
    • handle writes of rows (append, mutate)
    • handle reads of rows
  • heartbeat to master
    • report load (for all tablets, each size)
  • handle master instructions
    • split:
      • mark old tablet as read only
      • create 2 new tablets
      • copy rows to the new smaller tablet
    • merge:
      • opposite of split