Tuesday, July 23, 2013

Technical overview of OneFS file system continued

Each cluster presents a single namespace and file system. The file system is distributed across all nodes; there is no partitioning and no need to create volumes. The file system can be accessed from any node in the cluster. Data and metadata are striped across the nodes for redundancy and availability. The single file tree can grow with node additions without any interruption to users. System services take care of the associated management routines such as tiering, replication, etc. Files are transferred in parallel, regardless of the depth and breadth of the tree.
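To make that concrete, here is a minimal Python sketch (illustrative only, not OneFS internals) of why a single pool needs no repartitioning: placement is computed against the whole cluster, so adding a node widens placement immediately.

class Cluster:
    def __init__(self, node_count):
        self.node_count = node_count

    def add_node(self):
        # New capacity joins the single pool; there is no volume layout to rebuild.
        self.node_count += 1

    def placement(self, block_index):
        # Round-robin placement across every node in the cluster.
        return block_index % self.node_count

cluster = Cluster(node_count=3)
print([cluster.placement(b) for b in range(6)])   # [0, 1, 2, 0, 1, 2]
cluster.add_node()                                # the namespace is untouched
print([cluster.placement(b) for b in range(6)])   # [0, 1, 2, 3, 0, 1]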
This design differs from taking separate namespaces or volumes and aggregating them so that they appear as a single one. In that case, the administrator has to lay out the tree, migrate files between volumes, and bear the additional purchase, power, and cooling costs.
OneFS uses physical pointers and extents for metadata, and stores file and directory metadata in inodes. B-trees are used extensively throughout the file system. Data and metadata are redundant, and the amount of redundancy can be configured by the administrator. Metadata accounts for only about 1% of the system. Metadata access and locking are managed collectively in a peer-to-peer architecture.
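As a rough illustration of these structures, the sketch below models an inode holding physical extent pointers and a configurable mirror count. The field names are hypothetical; the real on-disk formats differ.

from dataclasses import dataclass, field

@dataclass
class Extent:
    node: int       # which node holds the blocks
    drive: int      # which drive within that node
    offset: int     # starting block offset on that drive
    length: int     # number of contiguous blocks

@dataclass
class Inode:
    mirror_copies: int                             # metadata redundancy, e.g. 3x
    extents: list = field(default_factory=list)    # physical pointers to the data
    # A directory inode would instead reference a B-tree of name-to-inode
    # entries, which keeps lookups fast even in very large directories.

inode = Inode(mirror_copies=3,
              extents=[Extent(node=2, drive=5, offset=102400, length=64)])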
OneFS blocks are accessed by multiple devices simultaneously, so the addressing scheme is indexed by a tuple of {node, drive, offset}. Data is protected by erasure coding and is labeled by the number of simultaneous failures it can tolerate; for example, a file protected at "N+2" can withstand two failures, with its metadata mirrored 3x.
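The parity idea behind "N+M" protection can be sketched with plain XOR, which is enough for one failure; tolerating two simultaneous failures, as "N+2" does, requires a second independent code word, typically Reed-Solomon. The block contents here are made up for the example.

def xor_parity(blocks):
    # XOR all blocks together byte by byte to produce one parity unit.
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return parity

data = [b"node", b"wide", b"file"]    # N = 3 data stripe units
parity = xor_parity(data)             # +1 parity unit stored elsewhere

# Lose any single unit and the XOR of the survivors reconstructs it.
recovered = xor_parity([data[0], data[2], parity])
assert recovered == data[1]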
OneFS controls the placement of files directly, down to the sector level on any drive. This allows for optimized data placement and I/O patterns, and avoids unnecessary read-modify-write operations. As opposed to dedicating entire RAID volumes to a particular performance type and protection setting, the file system is homogeneous, and spindle usage can be optimized with configuration changes made at any time and online.
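A small sketch of why that control matters, assuming a stripe of N data units plus one parity unit: a write covering the full stripe computes parity from the new data alone, while a partial write must first read the untouched units back, which is the read-modify-write penalty.

def write_cost(units_written, stripe_width):
    if units_written == stripe_width:
        # Full-stripe write: parity comes from the new data, no reads needed.
        return {"reads": 0, "writes": stripe_width + 1}
    # Partial write: read the unmodified units so parity can be recomputed.
    return {"reads": stripe_width - units_written,
            "writes": units_written + 1}

print(write_cost(4, 4))   # {'reads': 0, 'writes': 5}
print(write_cost(1, 4))   # {'reads': 3, 'writes': 2}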
Every node that participates in I/O has a stack consisting of two layers. The top half is the initiator, which serves client-side protocols such as NFS, CIFS, iSCSI, HTTP, FTP, etc.; the bottom half is the participant, which writes to the node's local storage and is backed by NVRAM. When a client connects to a node to write a file, it connects to the top half, where the file is broken into smaller logical chunks called stripes before being written. Failure-safe buffering through a write coalescer ensures that the writes are efficient.
OneFS stripes the data across all nodes and not simply across all disks.
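Putting the two halves together, here is a hedged sketch of that write path; STRIPE_UNIT and the fan-out policy are assumptions for the example, not OneFS parameters.

STRIPE_UNIT = 128 * 1024   # assumed chunk size for the example

def initiator_write(data, node_count):
    # Top half: break the file into smaller logical chunks (stripe units).
    chunks = [data[i:i + STRIPE_UNIT] for i in range(0, len(data), STRIPE_UNIT)]
    # Coalesce and fan the chunks out across nodes, not just local disks;
    # the bottom half (participants) would journal each one to NVRAM.
    return [(i % node_count, chunk) for i, chunk in enumerate(chunks)]

for node, chunk in initiator_write(b"x" * 400_000, node_count=4):
    print(f"node {node}: {len(chunk)} bytes")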
