Wednesday, July 24, 2013

Technical overview of OneFS continued

Locks and concurrency in OneFS are implemented by a lock manager that marshals locks across all nodes in a storage cluster. Several kinds of locks, referred to as lock "personalities", can be acquired: file system locks as well as cluster-coherent protocol-level locks such as SMB share mode locks and NFS advisory-mode locks. Delegated locks such as CIFS oplocks and NFSv4 delegations are also supported.
Every node in a cluster can act as a coordinator for locking resources, and a coordinator is assigned to each lockable resource based upon an advanced hashing algorithm; the coordinator is usually a different node than the initiator. When a lock is requested, such as a shared lock for reads or an exclusive lock for writes, the call sequence proceeds something like this (a minimal sketch follows the list):
1) Let's say Node 1 is the initiator for a write, and Node 2 is designated the coordinator. Node 3 and Node 4 are shared readers.
2) The readers request a read lock from the coordinator at the same time.
3) The coordinator checks whether an exclusive lock is already held on the file.
4) If no exclusive lock exists, the coordinator grants shared locks to the readers.
5) The readers begin their read operations on the requested file.
6) The writer now requests an exclusive lock on the same file that the readers are reading.
7) The coordinator checks whether the shared locks can be reclaimed.
8) The writer blocks while the readers hold their shared locks.
9) Once the readers release their shared locks, the coordinator grants the exclusive lock and the writer begins writing to the file.
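
Here is a minimal sketch, in Python, of the coordinator behaviour described in the steps above. It is purely illustrative: the node list, the hash-based coordinator_for() helper, and the LockCoordinator class are assumptions made for the example, not OneFS internals, and the real lock manager runs distributed across nodes rather than inside one process.

# Illustrative sketch (not OneFS code): the coordinator for a resource is
# chosen by hashing, shared locks are granted while no exclusive lock is
# held, and an exclusive request waits until all shared holders release.
import threading
import hashlib

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

def coordinator_for(resource_id: str) -> str:
    """Pick a coordinator node by hashing the lockable resource."""
    digest = hashlib.sha1(resource_id.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

class LockCoordinator:
    """Grants shared (read) and exclusive (write) locks on one resource."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # number of shared holders
        self._writer = False   # True while an exclusive lock is held

    def acquire_shared(self):
        with self._cond:
            # Steps 3/4: grant a shared lock only if no exclusive lock exists.
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            # Steps 7/8: the writer blocks until all readers have finished.
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

print(coordinator_for("/ifs/data/file.txt"))  # deterministically picks the coordinating node

The property mirrored here is the one the list describes: a shared lock is granted only when no exclusive lock is held, and an exclusive request waits until every shared holder has released.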
When files are large and the number of nodes is large, high throughput and low latency become important. In such cases multi-writer support is provided by dividing the file into separate regions and granting granular locks per region, so multiple writers can work on the same file concurrently.
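
As a rough illustration of region-based locking, the sketch below divides a file into fixed-size regions and takes a separate lock per region; the region size, the class name, and the use of os.pwrite are assumptions made for the example, not OneFS's actual multi-writer implementation.

# Sketch of per-region locking: writes to different regions of the same
# file take different locks and therefore do not serialize on each other.
import os
import threading
from collections import defaultdict

REGION_SIZE = 8 * 1024 * 1024  # assumed region size, not the real OneFS value

class RegionLockedFile:
    def __init__(self, path: str):
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT)
        self._locks = defaultdict(threading.Lock)  # one lock per region index

    def write(self, offset: int, data: bytes):
        region = offset // REGION_SIZE
        # Only the region containing this offset is serialized; writers in
        # other regions of the same file proceed concurrently.
        with self._locks[region]:
            os.pwrite(self._fd, data, offset)

f = RegionLockedFile("/tmp/region-demo.dat")
f.write(0, b"header")                # locks region 0
f.write(64 * 1024 * 1024, b"tail")   # locks region 8, independent of region 0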
Failures such as power loss are tolerated. A journal records changes to the file system, enabling fast, consistent recovery from a power loss or other outage; no file scan or disk check is required. The journal is maintained on a battery-backed NVRAM card. When a node boots up, it checks its journal and replays the transactions. If the NVRAM is lost or the transactions were not recorded, the node will not mount the file system.
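
The recovery idea can be pictured with a toy write-ahead journal like the one below. The record format and the Journal class are invented for illustration and have nothing to do with the on-NVRAM format OneFS actually uses.

# Toy write-ahead journal: each change is recorded before it is applied,
# and after a crash the journal is replayed in order to restore a
# consistent state without a full file-system scan.
import json

class Journal:
    def __init__(self, path: str):
        self.path = path

    def log(self, txn: dict):
        # Record the transaction durably before applying it.
        with open(self.path, "a") as f:
            f.write(json.dumps(txn) + "\n")
            f.flush()

    def replay(self, apply):
        # On boot, re-apply every recorded transaction in order.
        try:
            with open(self.path) as f:
                for line in f:
                    apply(json.loads(line))
        except FileNotFoundError:
            pass  # nothing to replay

journal = Journal("/tmp/onefs-demo.journal")
journal.log({"op": "write", "block": 42, "data": "..."})
journal.replay(lambda txn: print("replaying", txn))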
In order for the cluster to function, a quorum of nodes must be active and responding: a simple majority, where one more than half the nodes are functioning. A node that is not part of the quorum is in a read-only state. The simple-majority rule helps avoid split-brain conditions when the cluster temporarily splits in two. The quorum also dictates the number of nodes required to support a given data protection level: for an N+M protection level, 2*M+1 nodes must be in quorum.
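
The arithmetic can be restated in a couple of small helpers; they simply encode the simple-majority and 2*M+1 rules from the paragraph above.

# Quorum arithmetic: a simple majority is one more than half the nodes,
# and an N+M protection level needs at least 2*M + 1 nodes in quorum.
def has_quorum(active_nodes: int, total_nodes: int) -> bool:
    return active_nodes >= total_nodes // 2 + 1

def min_nodes_for_protection(m: int) -> int:
    """Minimum nodes that must be in quorum for an N+M protection level."""
    return 2 * m + 1

print(has_quorum(3, 5))              # True: 3 of 5 is a simple majority
print(min_nodes_for_protection(2))   # 5 nodes needed for N+2 protection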
The global cluster state is available via a group management protocol that guarantees a consistent view of the state of the other nodes across the entire cluster. When one or more nodes become unreachable, the group is split and all nodes resolve to a new, consistent view of their cluster. In the split state the file system remains reachable, and it is modifiable for the group that holds the quorum. A node that is down is rebuilt using the redundancy stored in the cluster. If the node becomes reachable again, a "merge" occurs and the two groups are brought back into one; the nodes can rejoin the cluster without being rebuilt and reconfigured. If the protection group changes during the merge, files may be restriped for rebalance. When a cluster splits, some blocks may become orphaned because they are re-allocated on the quorum side; such blocks are collected through a parallelized mark-and-sweep scan.
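
The orphaned-block cleanup can be pictured as a mark-and-sweep pass: blocks reachable from live file metadata are marked, and any allocated block left unmarked is reclaimed. The sketch below is a single-threaded toy with made-up data structures, whereas the scan described above is parallelized across the cluster.

# Mark phase: walk live file metadata and mark every referenced block.
# Sweep phase: any allocated block that was never marked is an orphan.
def mark_and_sweep(files: dict, allocated: set) -> set:
    marked = set()
    for blocks in files.values():
        marked.update(blocks)
    return allocated - marked            # orphans to reclaim

files = {"/ifs/a": {1, 2, 3}, "/ifs/b": {4, 5}}
allocated = {1, 2, 3, 4, 5, 6, 7}        # 6 and 7 were re-allocated on the quorum side
print(mark_and_sweep(files, allocated))  # {6, 7} reclaimed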
