Notes on Google File System.
Paper, Presentation to FTF group
Overview.
The Google File System was described in a 2003 SOSP paper.
It follows the Jeff Dean design philosophy: split the data into a single unit of load balancing (the chunk) and have a central master allocate those units to workers.
The file system index is stored in memory on a single master server.
Each file is split into fixed-size chunks (64 MB), which are distributed to chunkservers.
The system scales horizontally by adding nodes called chunkservers, which are assigned chunks to store by the master.
Each file has configurable replication. By default, the replication factor is r=3, meaning each chunk is replicated on 3 different chunkservers.
The master handles file metadata creation, chunk creation, chunk allocation, and chunk reallocation when chunkservers die.
The consistency model of GFS is a bit complex; it is summarised in the Consistency model section below, but read the paper for the full details.
In GFS v2 (Colossus), replication is done using Reed-Solomon codes, as in Amazon S3. This is referred to as striping.
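A quick worked comparison of the storage overhead of the two schemes (my own arithmetic; the RS(6,3) parameters are illustrative, not a documented Colossus configuration):

```python
# Storage overhead of r=3 replication vs. Reed-Solomon striping.
# RS(d, p): d data blocks plus p parity blocks; any d of the
# d + p stored blocks are enough to reconstruct the data.

def replication_overhead(r: int) -> float:
    """Bytes stored per byte of user data under r-way replication."""
    return float(r)

def striping_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Bytes stored per byte of user data under RS(data, parity) striping."""
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead(3))   # 3.0x (GFS default, r=3)
print(striping_overhead(6, 3))   # 1.5x (illustrative RS(6,3) code)
```

Same fault tolerance class (survives loss of any 3 block holders in the RS(6,3) case) at half the storage cost, which is the usual argument for striping.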
Design.
Modelling a file system.
- The Google File System is a distributed file system.
- What is a file system?
- Content
- Files
- Directories
- Paths
- FAT32 file system:
- Fixed-size allocation units (clusters), tracked in the file allocation table.
- Google models the file system as follows:
- The directory structure is stored as a prefix trie.
- The root node is the root directory - “/”.
- The intermediate nodes are the directories - “/usr”, “/home”, “/home/jim”
- And the leaf nodes are files.
- “the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.”
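A minimal sketch of the namespace-as-prefix-trie idea (my own illustration; the real in-memory representation additionally uses prefix compression to stay under 64 bytes per file):

```python
class Node:
    """A trie node: a directory (has children) or a file (has chunks)."""
    def __init__(self):
        self.children = {}   # path component -> Node
        self.chunks = None   # list of chunk handles if this node is a file

class Namespace:
    def __init__(self):
        self.root = Node()   # the root directory, "/"

    def create_file(self, path: str, chunk_handles: list):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.children.setdefault(part, Node())
        node.chunks = chunk_handles   # leaf node: a file

    def lookup(self, path: str):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.children.get(part)
            if node is None:
                return None
        return node.chunks

ns = Namespace()
ns.create_file("/home/jim/data.log", ["chunk-0001", "chunk-0002"])
print(ns.lookup("/home/jim/data.log"))  # ['chunk-0001', 'chunk-0002']
```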
Scaling a file system - chunkservers.
- GFS scales file storage by:
- Splitting each file into fixed-size chunks and storing these across a number of chunkservers.
- The master server allocates chunks, much as a memory allocator allocates blocks of memory.
- The master server tracks the chunkservers and the assignment of chunks to chunkservers.
- GFS enables files to be mutable:
- For each chunk, GFS designates one chunkserver as the primary.
- The primary is given a short-lived lease (60s), extended via heartbeat messages. If the primary stops responding, the master lets the lease expire and grants it to another chunkserver.
- The primary serialises writes to the chunk.
- Whenever the master grants a new lease on a chunk, it increments the chunk version number.
- How do clients read files?
- The GFS client library is the key.
- The client translates the file name and byte offset into a chunk index and asks the master.
- The master returns the chunk handle and the locations of the chunkservers holding its replicas.
- What are the operations of GFS?
- Read - standard.
- Write:
- The client pushes the data to all replicas.
- Each chunkserver stores the data in an internal LRU buffer cache.
- Once all replicas acknowledge receiving the data, the client sends a write request to the primary.
- The primary assigns the write a serial number (it determines the order of all mutations to the chunk) and applies it locally.
- The primary forwards the write request to all secondary replicas, which apply it in the same serial order and acknowledge back to the primary.
- The primary replies to the client.
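A runnable sketch of the write flow (the class and method names are my own stand-ins for the real GFS RPC interfaces, not the paper's API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks

class Chunkserver:
    def __init__(self, name):
        self.name = name
        self.buffer = {}   # data_id -> pushed-but-not-yet-applied data (the "LRU cache")
        self.chunks = {}   # chunk handle -> bytearray of chunk contents

    def push_data(self, data_id, data):
        # Step 1 of a write: data is pushed to every replica and buffered.
        self.buffer[data_id] = data

    def apply_write(self, handle, offset, data_id):
        # Apply a previously-buffered mutation at the given offset.
        data = self.buffer.pop(data_id)
        chunk = self.chunks.setdefault(handle, bytearray())
        if len(chunk) < offset + len(data):
            chunk.extend(bytes(offset + len(data) - len(chunk)))
        chunk[offset:offset + len(data)] = data

class Primary(Chunkserver):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries

    def commit_write(self, handle, offset, data_id):
        # The primary serialises the write: it applies the mutation locally,
        # then forwards it so all secondaries apply it in the same order.
        self.apply_write(handle, offset, data_id)
        for s in self.secondaries:
            s.apply_write(handle, offset, data_id)

def client_write(primary, handle, offset, data):
    data_id = id(data)
    # Push the data to every replica first...
    for replica in [primary] + primary.secondaries:
        replica.push_data(data_id, data)
    # ...then ask the primary to commit once all replicas have it.
    primary.commit_write(handle, offset % CHUNK_SIZE, data_id)

secondaries = [Chunkserver("cs2"), Chunkserver("cs3")]
primary = Primary("cs1", secondaries)
client_write(primary, "chunk-0001", 0, b"hello gfs")
print(bytes(primary.chunks["chunk-0001"]))  # b'hello gfs' on every replica
```

Note the key decoupling: data flows to replicas in any order (step 1), but the mutation order is fixed by the primary (the commit step).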
Consistency model.
- What is the GFS consistency model?
- A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.
- A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.
- Scenarios:
- Non-concurrent. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written.
- Concurrent. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations.
- Failure. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times.
- After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation.
- How GFS achieves this guarantee:
- GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations.
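A minimal sketch of mechanism (b), stale-replica detection via chunk version numbers (my own illustration of the Section 4.5 idea; the names are hypothetical):

```python
class MasterChunkState:
    """Master-side record of one chunk: current version and replica versions."""
    def __init__(self):
        self.version = 0
        self.replica_versions = {}   # chunkserver id -> last version it holds

    def grant_lease(self, live_replicas):
        # Whenever the master grants a new lease, it bumps the chunk version
        # and informs the up-to-date replicas of the new number.
        self.version += 1
        for cs in live_replicas:
            self.replica_versions[cs] = self.version

    def report(self, chunkserver, reported_version):
        # A restarting chunkserver reports the versions of the chunks it holds.
        self.replica_versions[chunkserver] = reported_version

    def fresh_replicas(self):
        # Stale replicas (version < current) are never given to clients
        # and are eventually garbage collected.
        return [cs for cs, v in self.replica_versions.items() if v == self.version]

chunk = MasterChunkState()
chunk.grant_lease(["cs1", "cs2", "cs3"])   # version becomes 1
chunk.grant_lease(["cs1", "cs2"])          # cs3 was down: version 2, cs3 missed it
chunk.report("cs3", 1)                     # cs3 comes back, still at version 1
print(chunk.fresh_replicas())              # ['cs1', 'cs2']  (cs3 is stale)
```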
Mutation model.
- What is the GFS mutation model?
- Two classes of mutation: writes and appends:
- A write causes data to be written at an application-defined offset.
- An append causes data to be appended atomically at least once, but at an offset of GFS’ choosing.
- Leases.
- We use leases to maintain a consistent mutation order across replicas.
- The master grants a chunk lease to one of the replicas, which we call the primary.
- The primary picks a serial order for all mutations to the chunk.
- All replicas follow this order when applying mutations.
- Thus, the global mutation order is defined:
- by the lease grant order chosen by the master, and
- within a lease by the serial numbers assigned by the primary.
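A sketch of how the primary imposes a serial order within its lease (the 60-second lease term is from the paper; the class and method names are my own):

```python
import time

LEASE_SECONDS = 60  # lease term from the paper; extended via heartbeats

class PrimaryLease:
    def __init__(self):
        self.granted_at = time.monotonic()
        self.next_serial = 0

    def valid(self) -> bool:
        return time.monotonic() - self.granted_at < LEASE_SECONDS

    def order(self, mutation) -> tuple:
        # The primary assigns consecutive serial numbers to mutations as
        # they arrive; every replica applies them in this serial order,
        # which defines the mutation order for the chunk within this lease.
        assert self.valid(), "lease expired; master may have granted a new one"
        serial = self.next_serial
        self.next_serial += 1
        return (serial, mutation)

lease = PrimaryLease()
for m in ["write A", "append B", "write C"]:
    print(lease.order(m))   # (0, 'write A'), (1, 'append B'), (2, 'write C')
```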
Master server roles.
- What are the roles of the master?
- Chunkserver tracking.
- Push rather than pull model: chunkservers report their status and the chunks they host via regular HeartBeat messages, rather than the master polling them (see the sketch after this list).
- File metadata management.
- Prefix trie of file paths to chunks.
- Permissions. (stored in Chubby)
- Chunk creation.
- Chunk re-replication.
- Chunk re-balancing.
- Chunk garbage collection.
- Replica placement.
- Distribute across racks not just machines.
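A sketch of the push-model tracking and the re-replication decision (illustrative names and timeout; assumes the default replication target of r=3):

```python
import time

HEARTBEAT_TIMEOUT = 30.0   # illustrative; a server missing heartbeats is presumed dead
TARGET_REPLICAS = 3

class MasterTracker:
    def __init__(self):
        self.last_heartbeat = {}   # chunkserver -> time of last HeartBeat
        self.chunk_locations = {}  # chunk handle -> set of chunkservers

    def heartbeat(self, chunkserver, hosted_chunks):
        # Push model: the chunkserver reports its status and hosted chunks.
        self.last_heartbeat[chunkserver] = time.monotonic()
        for handle in hosted_chunks:
            self.chunk_locations.setdefault(handle, set()).add(chunkserver)

    def reap_dead(self):
        now = time.monotonic()
        dead = {cs for cs, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT}
        for locations in self.chunk_locations.values():
            locations -= dead   # drop dead servers from chunk locations
        return dead

    def needs_rereplication(self):
        # Any chunk below the replication target gets re-replicated elsewhere.
        return [h for h, locs in self.chunk_locations.items()
                if len(locs) < TARGET_REPLICAS]

m = MasterTracker()
m.heartbeat("cs1", ["chunk-0001"])
m.heartbeat("cs2", ["chunk-0001"])
print(m.needs_rereplication())   # ['chunk-0001'] (only 2 of 3 replicas)
```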
Software architecture.
- GFS architecture is split into:
- Client library.
- “GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.”
- Caching.
- Master server.
- Acquires its role by locking a Chubby file. Chubby is a strongly consistent distributed lock service (see the sketch after this list).
- Chubby.
- Chunkservers.
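A sketch of the lock-based master-election idea, using a generic lock-service interface as a hypothetical stand-in (this is not Chubby's real API, and the lock file path is made up):

```python
class LockService:
    """Stand-in for a strongly consistent lock service such as Chubby."""
    def __init__(self):
        self.holders = {}   # lock file path -> current holder

    def try_acquire(self, path: str, holder: str) -> bool:
        # Atomically acquire the lock file if no one holds it.
        if path not in self.holders:
            self.holders[path] = holder
            return True
        return False

def become_master(lock_service, my_id):
    # Whichever candidate locks the master file acts as the master;
    # the others stand by (e.g. as shadow masters).
    return lock_service.try_acquire("/gfs/master-lock", my_id)

ls = LockService()
print(become_master(ls, "master-a"))  # True  - master-a is now the master
print(become_master(ls, "master-b"))  # False - master-b stands by
```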
Media.
Diagrams.
Readings.
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
http://sghose.me/talks/storage systems/2015/11/23/GFS-Talk/
https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/papers/colossus.md