Wednesday, July 24, 2013

Technical overview of OneFS, continued

In a file write operation on a three-node cluster, each node participates via a two-layer stack: the initiator and the participant. The node that the client connects to acts as the captain. In an Isilon cluster, data, parity, metadata, and inodes are all distributed across multiple nodes. Reed-Solomon erasure encoding is used to protect data, and on raw disk it is generally more efficient (up to 80%) with five nodes or more. The stripe width of any given file is determined by the number of nodes in the cluster, the size of the file, and the protection setting, such as N+2.
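To make that efficiency claim concrete, here is a toy calculation, my own illustration rather than OneFS internals: for an N+M layout where each stripe places one unit per node, raw-disk efficiency is simply the ratio of data units to total units.

```python
def stripe_efficiency(nodes: int, parity_units: int) -> float:
    """Raw-disk efficiency of an N+M stripe that places one unit per node:
    data units divided by total units. (Illustrative function, not OneFS code.)"""
    data_units = nodes - parity_units
    if data_units <= 0:
        raise ValueError("need more nodes than parity units")
    return data_units / nodes

# Five nodes, N+1 protection: 4 data + 1 parity -> 80% of raw disk holds data.
print(f"{stripe_efficiency(5, 1):.0%}")   # 80%
# Five nodes, N+2 protection: 3 data + 2 parity -> 60%.
print(f"{stripe_efficiency(5, 2):.0%}")   # 60%
# Compare 2x mirroring, which can never do better than 50%.
```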
OneFS uses an InfiniBand back-end network for fast intra-cluster communication. Data is written in atomic units called protection groups: if every protection group is safe, the entire file is safe. For files protected by erasure codes, a protection group consists of a series of data blocks plus a set of erasure codes. The protection type of a group can be switched dynamically: when node failures prevent the desired number of erasure codes from being used, OneFS temporarily relies on mirroring, then reverts to erasure-coded protection groups without admin intervention.
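The per-group safety rule and the mirroring fallback can be sketched roughly as follows. The names and the fallback condition (enough live nodes to place each data and parity unit on a distinct node) are my own schematic reading of the description above, not OneFS code.

```python
from enum import Enum

class Protection(Enum):
    ERASURE = "erasure codes"   # normal N+M protection
    MIRROR = "mirroring"        # temporary fallback

def choose_group_protection(nodes_up: int, data_units: int, parity_units: int) -> Protection:
    # Fall back to mirroring when too few nodes are available to place
    # each data and parity unit of the group on a distinct node.
    if nodes_up >= data_units + parity_units:
        return Protection.ERASURE
    return Protection.MIRROR

def file_is_safe(protection_groups) -> bool:
    # A file is safe exactly when every one of its protection groups is safe.
    return all(group.is_safe() for group in protection_groups)

# With one of five nodes down, an N+2 group (3 data + 2 parity) cannot
# place five distinct units, so it is temporarily mirrored instead:
print(choose_group_protection(4, 3, 2))   # Protection.MIRROR
```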
The OneFS file system block size is 8 KB, and the on-disk structures scale well enough that a billion such small files can be written at high performance. For larger files, multiple contiguous 8 KB blocks (up to sixteen) can be striped onto a single node's disk. For even larger files, OneFS maximizes sequential performance by writing in stripe units of sixteen contiguous blocks. An AutoBalance service reallocates and rebalances data to keep storage space efficient and usable.
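The arithmetic behind those stripe units is straightforward; this small sketch (constant and function names are mine) shows how sixteen 8 KB blocks aggregate into 128 KB stripe units:

```python
BLOCK_SIZE = 8 * 1024                 # OneFS block size: 8 KB
BLOCKS_PER_STRIPE_UNIT = 16           # contiguous blocks per stripe unit
STRIPE_UNIT = BLOCK_SIZE * BLOCKS_PER_STRIPE_UNIT   # 16 * 8 KB = 128 KB

def stripe_units_for(file_size_bytes: int) -> int:
    """Ceiling division: how many stripe units a file occupies.
    A toy model that ignores metadata, parity, and small-file packing."""
    return -(-file_size_bytes // STRIPE_UNIT)

print(stripe_units_for(1_000_000))    # a ~1 MB file needs 8 stripe units
```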
The captain node uses a two-phase commit transaction to safely distribute writes to multiple NVRAMs across the cluster. The mechanism relies on the NVRAM for journaling all transactions on every node in the cluster; using the NVRAMs in parallel allows high-throughput writes and safety against failures. When a node returns from failure, the only actions required are replaying its journal from NVRAM and, occasionally, an AutoBalance pass to rebalance files that were involved in the transaction. No resynchronization event ever needs to take place. (A minimal sketch of the two-phase commit idea follows the access-pattern list below.)

Access patterns can be chosen at a per-file or per-directory level from the following:
Concurrency - optimizes for current load on the cluster, featuring many simultaneous clients.
Streaming - optimizes for high-speed streaming of a single file, for example to enable very fast reading by a single client.
Random - optimizes for unpredictable access to the file by adjusting striping and disabling the use of any pre-fetch cache.
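Here is the minimal two-phase commit sketch promised above. It is schematic only: the class and method names, and the Python lists standing in for NVRAM journals, are illustrative assumptions, not OneFS code.

```python
class Participant:
    """A cluster node taking part in a distributed write."""

    def __init__(self, name: str):
        self.name = name
        self.nvram_journal = []   # a list standing in for the NVRAM journal

    def prepare(self, txn) -> bool:
        # Phase 1: journal the pending write durably, then vote yes.
        self.nvram_journal.append(("prepared", txn))
        return True

    def commit(self, txn):
        self.nvram_journal.append(("committed", txn))

    def abort(self, txn):
        self.nvram_journal.append(("aborted", txn))

def two_phase_commit(txn, participants) -> bool:
    """Run by the captain node: the write becomes permanent only if
    every participant has first journaled it in NVRAM."""
    votes = [p.prepare(txn) for p in participants]   # phase 1
    if all(votes):
        for p in participants:                       # phase 2: commit
            p.commit(txn)
        return True
    for p in participants:                           # phase 2: roll back
        p.abort(txn)
    return False

nodes = [Participant(f"node{i}") for i in (1, 2, 3)]
assert two_phase_commit("write file.txt stripe 0", nodes)
```

On node recovery, replaying the journal is enough precisely because a transaction only commits once it is journaled everywhere; there is never a half-written state that requires resynchronization.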
Caching is used to accelerate the process of writing data into an Isilon cluster. Data is written into the NVRAM-based cache of the initiator node before being written to disk, then batched up and flushed to disk later at a more convenient time. In the event of a cluster split or an unexpected node outage, uncommitted cached writes are fully protected.
Caching operates as follows:
1) An NFS client sends node 1 a write request for a file with N+2 protection.
2) Node 1 accepts the write into its NVRAM write cache (the fast path) and then mirrors it to the participant nodes' logfiles.
3) A write acknowledgement is returned to the NFS client immediately, so write-to-disk latency is avoided.
4) As node 1's write cache fills, it is flushed via the two-phase commit protocol with the appropriate parity protection.
5) The write cache and the participant nodes' logfiles are cleared and made available to accept new writes.
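Putting those five steps together, a rough sketch of the write path might look like this. All names here are hypothetical, and commit_to_disk_with_parity is a stub standing in for the two-phase commit flush described earlier.

```python
def commit_to_disk_with_parity(writes):
    pass   # stand-in for the flush-to-disk path with N+2 parity (step 4)

class ParticipantNode:
    def __init__(self):
        self.logfile = []          # mirrored copy of cached writes

class InitiatorNode:
    CACHE_LIMIT = 4                # flush threshold, arbitrary for the demo

    def __init__(self, participants):
        self.write_cache = []      # NVRAM-backed fast path
        self.participants = participants

    def handle_nfs_write(self, data) -> str:
        self.write_cache.append(data)          # step 2: cache the write
        for p in self.participants:
            p.logfile.append(data)             # step 2: mirror to logfiles
        if len(self.write_cache) >= self.CACHE_LIMIT:
            self.flush()                       # step 4
        return "ACK"                           # step 3: ack without disk I/O

    def flush(self):
        # Step 4: batch-write to disk with parity via two-phase commit.
        commit_to_disk_with_parity(self.write_cache)
        # Step 5: clear the cache and the participants' logfiles.
        self.write_cache.clear()
        for p in self.participants:
            p.logfile.clear()

node1 = InitiatorNode([ParticipantNode(), ParticipantNode()])
for i in range(5):
    node1.handle_nfs_write(f"block {i}")
```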
