Saturday, July 27, 2013

API Explorer for Starbucks and a Graph API implementation

The Starbucks APIs are OAuth enabled. This means they don't grant access based on API keys alone but require an access token issued by an OAuth provider. The Starbucks APIs are available through Mashery, which redirects to the Starbucks authorization endpoint, and this is where API users obtain their access tokens. OAuth defines four different workflows for getting access tokens.
Implicit grant - such as when a mobile application obtains an access token directly from the authorization endpoint based on its client id and the user's identity.
Authorization code grant - such as when a user logs in to an IIS-hosted site and the browser is redirected to the Starbucks authorization endpoint to get a one-time, short-lived authorization code. The client then exchanges the code for an access token.
Resource owner password credentials grant - such as when a user provides his or her username and password in exchange for a token.
Client credentials grant - such as when an application on a secured kiosk or site authenticates as itself, regardless of the user.
In building an explorer for the Starbucks APIs, we will need an access token to make the API calls. Since this application, which we call the API explorer, lets API users try out the different APIs with various input parameters and view the responses, we will choose either the client credentials grant or the implicit grant so that an access token can be retrieved on demand at the push of a button. Both XML and JSON responses can be displayed in the text area panel of the API explorer. This is conceived to be very similar to the Graph API Explorer from Facebook.
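As a rough illustration of how the explorer might obtain a token with the client credentials grant, the sketch below makes a single token request and then calls an API with the bearer token. This is a minimal Python sketch assuming the requests library; the token endpoint URL, the client id/secret values and the API URL are placeholders for illustration, not the actual Mashery or Starbucks endpoints.

import requests  # third-party HTTP library

# Hypothetical token endpoint and credentials -- placeholders, not the real Mashery/Starbucks values.
TOKEN_URL = "https://api.example.com/oauth/token"
CLIENT_ID = "my-client-id"
CLIENT_SECRET = "my-client-secret"

def get_access_token():
    """Request an access token using the OAuth 2.0 client credentials grant."""
    response = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
        },
    )
    response.raise_for_status()
    return response.json()["access_token"]

def call_api(url):
    """Call an API with the bearer token and return the raw JSON/XML body for display."""
    token = get_access_token()
    response = requests.get(url, headers={"Authorization": "Bearer " + token})
    return response.text  # shown as-is in the explorer's text area panel

The explorer would simply render the returned text in its response panel, whether the API answers with XML or JSON.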

Another application of the Starbucks API could be a deeper integration with Facebook's location data. For example, Starbucks customers might like to know which of their Facebook friends visited the same Starbucks store on the same day they are there. The Starbucks mobile application today maintains card history and rewards. If it could push Facebook location updates for the purchases it tracks at the store being visited, then Facebook friends could see where each other have been on a given day. This could encourage more sales at the Starbucks store as friends try to catch up with each other, and at the very least it gives the customer useful knowledge of who else has been at the same store. Additionally, the Starbucks mobile application need not take the user to their Facebook page to view or post this data; it could offer a tip or balloon notification of which of the user's friends had been at this store and when, if any. Such tips are non-invasive, information only, and make the coffee experience an avenue for social networking. Interested users could be taken to a map that displays not just the stores but also the Facebook friends who have visited each store in the past day, week or month.

Localization and globalization testing of websites

Usually referred to by the notations L10N and I18N, locale-specific website rendering is a significant test concern both in terms of the resources required for the testing and the time it consumes. The primary considerations for this testing are the linguistic, cosmetic and basic functionality issues in displaying information in a culture-specific manner. Some languages such as German require around 30% more space while Chinese, for instance, requires around 30% less. Moreover, right-to-left languages such as Arabic and Hebrew require proper alignment, indentation and layout. Since the UI resources for a website are typically collected and stored in resx files, their collation and translation are made easy with tools such as resgen.exe. However, the content alone does not guarantee its appropriateness in the rendered website, hence additional testing is required. As with any variation of the website, a full test pass with functionality and load tests is incurred. These sites also require significant environment resources to be allocated, including culture-specific domain name registrations and associated servers. Each such resource requires setup, maintenance and constant tracking in various measurement and reporting systems. Such tasks increase the matrix of web testing. Fundamentally, this testing is rigorous, end to end, and repeated for each locale. What would be desirable is to unify the testing for the common content and factor out the testing specific to each locale. By unifying the tests upstream for much of the content and its display, significant savings are made in the test cost. Consider the steps involved in culture-specific testing today, as depicted below. Each is a full iteration over common content with repeated functionality and load tests, even though the locale-specific testing is focused on linguistic translation and cosmetics.
test-en-us  : setup ->deployment->functionality testing->load testing->translation and cosmetics->completion
test-uk-en : setup ->deployment->functionality testing->load testing->translation and cosmetics->completion
test-de-de : setup ->deployment->functionality testing->load testing->translation and cosmetics->completion
If there were a solution that enables a common test bed for much of this redundancy, such as below:
test-neutral: setup->deployment->functionality testing->load testing, followed per locale by:
   -> linguistic and translation tests
   -> layout, width, height and indentation checks from static resource checking
   -> variations of locale repeating the above
This way, the redundancies are removed and testing is more streamlined and focused on explicitly culture-specific tasks.
Moreover, in the earlier model, test failures in one locale environment could differ from those in another on a case by case basis. By unifying the resources and the operations, much of this triage and variation can be avoided. The blog posts on the Pseudoizer can be very helpful here.
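As a rough sketch of the pseudo-localization idea behind the Pseudoizer, resource strings can be machine-transformed with accented characters, padding and markers so that layout, truncation and hard-coded string issues surface in the neutral test pass before any real translation exists. The transformation below is an illustrative Python example; the character map and the 30% expansion factor are assumptions chosen to mimic the growth seen with languages such as German.

# Minimal pseudo-localization sketch: transform resource strings so that width
# expansion and non-ASCII rendering issues show up in the locale-neutral pass.
ACCENT_MAP = str.maketrans("AEIOUaeiou", "ÀÉÎÕÜàéîõü")

def pseudoize(text, expansion=0.3):
    """Return an accented, padded and bracketed version of a resource string."""
    accented = text.translate(ACCENT_MAP)
    padding = "." * int(len(text) * expansion)   # simulate roughly 30% growth
    return "[" + accented + padding + "]"        # brackets make truncation obvious

if __name__ == "__main__":
    print(pseudoize("Add to cart"))   # [Àdd tõ càrt...]

Running every resx string through such a transform lets the functionality and load passes double as a layout check, leaving only real translation review for the per-locale passes.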
                                                                               

Friday, July 26, 2013

Technical overview OneFS continued

Software upgrade of the Isilon cluster is done in one of two methods:
Simultaneous upgrade - This method installs the software updates and reboots all the nodes at the same time. It does cause a temporary interruption in serving data to clients, but the interruption is typically kept under two minutes. The benefit is that system-wide changes can be made with no data operations in flight, so the window of customer impact is small and the upgrade can be considered safer even though the service is interrupted, albeit temporarily.
Rolling upgrade - This method upgrades and restarts each node in the cluster sequentially. The cluster remains online and there is no disruption of service to the customer. This is ideal for minor revisions, but for major revisions of, say, the OneFS code it may be better to perform a simultaneous upgrade so that version incompatibilities are avoided.
In either case, a pre-verification script is run to ensure that only supported configurations are permitted to upgrade. If the checks fail, instructions for troubleshooting the issues are typically provided. Upgrades can be invoked through the administrative interfaces mentioned earlier, such as the CLI or the web administration UI. After the upgrade completes, the cluster is verified with a health status check.
Among the various services for data protection and management in OneFS, some are listed below:
InsightIQ: This is a performance management service. It maximizes the performance of your Isilon scale-out storage system with innovative performance monitoring and reporting tools. A backend job called FSAnalyze gathers the file system analytics data used in conjunction with InsightIQ.
SmartPools is a resource management service that implements a highly efficient, automated tiered storage strategy. It keeps the single file system tree intact while tiering aged data. Recall that SmartPools subdivides the large set of homogeneous nodes into smaller, Mean Time to Data Loss (MTTDL)-friendly disk pools. By subdividing a node's disks into multiple, separately protected pools, nodes are also significantly more resilient to multiple disk failures.
SmartQuotas is a data management service. It assigns and manages quotas that seamlessly partition the storage into easily managed segments at the cluster, directory and sub-directory levels.
SmartConnect is a data access service that enables client connection load balancing and dynamic NFS failover and failback of client connections. Connections target different nodes to optimize the use of cluster resources.
SnapshotIQ is a data protection service that takes near-instantaneous snapshots while incurring little or no performance overhead. Recovery is equally fast with near-immediate on-demand snapshot restores. Snapshot revert and delete are separate services.
Cloud management such as Isilon for vCenter is a software service that manages Isilon functions from vCenter. vCenter also comes with its own automatable framework.
SyncIQ is a data replication service that asynchronously replicates and distributes large, mission-critical data sets to one or more alternate clusters. Replication can be targeted to a wide variety of sites and devices, which helps with disaster recovery. Replication is a simple push-button operation.
SmartLock is a data retention service that protects critical data against accidental, premature or malicious alteration or deletion. It is also compliant with security standards.
Aspera for Isilon is a content delivery service that provides high performance wide area file and content delivery.

Thursday, July 25, 2013

Technical overview OneFS continued

OneFS is designed to scale out as opposed to some storage systems that scale up. We can seamlessly increase the existing file system or volumes by adding more nodes to the cluster. This is done in three easy steps by the administrator:
1) adding another node into the rack
2) attaching the node to the Infiniband network
3) instructing the cluster to add the additional node
The data in the cluster is moved to the new node by the AutoBalance feature in an automatic, coherent manner, such that the new node does not become a hot spot and existing data benefits from the additional performance capability. This works transparently so that storage can grow from terabytes to petabytes without administration overhead.
The storage system is designed to work with all kinds of workflows - sequential, concurrent or random. OneFS provides for all these workflows because throughput and IOPS scale linearly with the number of nodes present in the system. Balancing plays a large role in keeping performance linear with capacity. Each node is treated the same as it is added, and the cluster is homogeneous. Since each node has a balanced data distribution, and there is automatic rebalancing and distributed processing, each additional CPU, network port and memory is utilized as the system scales.
Administrators have a variety of interfaces to configure the OneFS.
The Web administration User Interface ("WebUI")
The command line interface that operates via SSH interfaces or RS232 serial connection
The LCD panel on the nodes themselves for simple add/remove functions.
RESTful platform API for programmatic control of cluster configuration and management.
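As an example of the last option, cluster settings can be queried or changed over HTTPS. The sketch below is a minimal Python illustration only; the cluster address, credentials and resource path are assumptions, so the actual URIs should be taken from the OneFS platform API documentation.

import requests

# Hypothetical cluster address and credentials -- placeholders for illustration.
CLUSTER = "https://cluster.example.com:8080"
AUTH = ("admin", "password")

def get_cluster_config():
    """Fetch cluster configuration over the RESTful platform API (assumed path)."""
    response = requests.get(
        CLUSTER + "/platform/1/cluster/config",  # illustrative endpoint; see product docs
        auth=AUTH,
        verify=False,  # appliances often use self-signed certificates
    )
    response.raise_for_status()
    return response.json()

print(get_cluster_config())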
Files are secured by a variety of authentication techniques:
Active Directory (AD)
LDAP (Lightweight Directory Access Protocol)
Network Information Service (NIS)
Local users and groups.
Active Directory, which is a directory service for network resources, is integrated with the cluster by joining the cluster to the domain. The nodes of the cluster are then reachable via DNS and users can be authenticated based on their membership in Active Directory.
LDAP provides a protocol to reach out to other directory service providers, so many more platforms can be targeted.
NIS, another protocol sometimes referred to as the yellow pages, provides a way to authenticate users.
And finally, the local users and groups of a node can be used to grant permissions on that node.
Cluster access is partitioned into access zones. Access zones are logical divisions comprising:
cluster network configuration
file protocol access
authentication
Zones are associated with a set of SMB/CIFS shares and one or more authentication providers for access control.




Technical overview of OneFS continued

OneFS manages protection of its data directly, allocating data during normal operations and rebuilding data after recovery. It does not rely on hardware RAID levels. OneFS determines which files are affected by a failure in constant time, and files are repaired in parallel. As the cluster size increases, so does its resiliency.
Systems that use a "hot spare" drive use it to replace a failed drive. OneFS avoids hot spare drives and instead uses available free space to recover from failure. This is referred to as virtual hot spare and guarantees that the system can self-heal.
Data protection is applied at the file level and not at the system level, enabling the system to focus on only those files affected by a failure. Reed-Solomon erasure codes are used for data, but metadata and inodes are protected by mirroring only.
Further, the data protection is configurable and can be applied dynamically and online. For a file protected with N data blocks, M error-code blocks and b file stripes, the protection level is expressed as N+M/b. When b = 1, M members can fail simultaneously while the data remains 100% available. As opposed to the double-failure protection of RAID-6, this scheme can provide up to quadruple failure protection.
OneFS also does automatic partitioning of nodes to improve Mean Time to Data Loss (MTTDL). If an 80-node cluster at a +4 protection level is partitioned into four twenty-node pools at +2, then the protection overhead is reduced, space is better utilized and there is no net addition to management overhead.
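A back-of-the-envelope sketch of that overhead comparison is below. The stripe width of sixteen data blocks is an illustrative assumption; the point is only that the fraction of each protection group spent on erasure codes shrinks when +4 protection is replaced with +2 inside smaller pools.

def protection_overhead(n_data, m_parity):
    """Fraction of each protection group consumed by erasure-code blocks for N+M."""
    return m_parity / (n_data + m_parity)

# Illustrative stripe widths: an 80-node cluster at +4 versus twenty-node pools at +2.
print(protection_overhead(16, 4))   # 0.20 -> 20% overhead at +4
print(protection_overhead(16, 2))   # ~0.11 -> about 11% overhead at +2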
Automatic provisioning subdivides the nodes into pools of twenty nodes each with six drives per node. Furthermore, a node's disks are now subdivided into multiple, separately protected pools and are significantly more resilient to multiple disk failures than previously possible.
Supported protocols for client access to create, modify and read data include the following:
NFS: Network File System, used by unix/linux based computers
SMB/CIFS: Server Message Block and Common Internet File System
FTP: File Transfer Protocol
HTTP: Hypertext Transfer Protocol
iSCSI: Internet Small Computer System Interface
HDFS: Hadoop Distributed File System
REST API: Representational State Transfer Application Programming Interface
By default, only SMB/CIFS and NFS are enabled. The root for all file data is /ifs, the Isilon OneFS file system. The SMB/CIFS protocol exposes an ifs share and NFS exports /ifs.
Changes made through one protocol are visible to all the others because the file data is common.

Wednesday, July 24, 2013

Technical overview of OneFS continued

Locks and concurrency in OneFS are implemented by a lock manager that marshals locks across all nodes in a storage cluster. Multiple and different kinds of locks, referred to as lock "personalities", can be acquired. File system locks as well as cluster-coherent protocol-level locks such as SMB share mode locks or NFS advisory-mode locks are supported. Even delegated locks such as CIFS oplocks and NFSv4 delegations are supported.
Every node in a cluster is a coordinator for locking resources, and a coordinator is assigned to lockable resources based upon an advanced hashing algorithm. Usually the coordinator is different from the initiator. When a lock is requested, such as a shared lock for reads or an exclusive lock for writes, the call sequence proceeds something like the numbered steps below (a simplified sketch follows the steps):
1) Let's say Node 1 is the initiator for a write, and Node 2 is designated the coordinator. Node 3 and Node 4 are shared readers.
2) The readers request a read lock from the coordinator at the same time.
3) The coordinator checks whether an exclusive lock has been granted for the file.
4) If no exclusive locks exist, the coordinator grants shared locks to the readers.
5) The readers begin their read operations on the requested file.
6) An exclusive lock for the same file that is being read by the readers is now requested by the writer.
7) The coordinator checks whether the locks can be reclaimed.
8) The writer is made to block/wait while the readers are reading.
9) The coordinator grants the exclusive lock and the writer begins writing to the file.
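The single-process Python sketch below mimics the shared/exclusive semantics of the steps above. It is only an illustration of the read/write lock behavior, not the distributed OneFS lock manager itself.

import threading

class Coordinator:
    """Toy shared/exclusive lock coordinator mimicking the numbered steps above."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:                       # step 3: wait if an exclusive lock is held
                self._cond.wait()
            self._readers += 1                        # step 4: grant the shared lock

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()               # wake a waiting writer

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:      # steps 7-8: writer blocks while readers read
                self._cond.wait()
            self._writer = True                       # step 9: exclusive lock granted

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()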
When the files are large and the number of nodes is large, high throughput and low latency become important. In such cases multi-writer support is made available by dividing the file into separate regions and providing granular locks for each region.
Failures are tolerated such as in the case of power loss. A journal is maintained to record changes to the file system and this enables fast, consistent recovery from a power loss or other outage. No file scan or disk check is required with a journal. The journal is maintained on a battery backed NVRAM card. When the node boots up, it checks its journal and replays the transactions. If the NVRAM is lost or the transactions are not recorded, the node will not mount the file system.
In order for the cluster to function, a quorum of nodes must be active and responding. This can be a simple majority where one more than half the nodes are functioning. A node that is not part of the quorum is in a read only state. The simple majority helps avoid split-brain conditions when the cluster temporarily splits into two. The quorum also dictates the number of nodes required in order to move to a given data protection level. For an N+M protection level, 2*M+1 nodes must be in quorum.
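A tiny sketch of those two quorum rules, assuming only what is stated above:

def has_quorum(active_nodes, total_nodes):
    """Simple majority: strictly more than half of the nodes must be responding."""
    return active_nodes > total_nodes // 2

def supports_protection_level(nodes_in_quorum, m_parity):
    """An N+M protection level requires at least 2*M + 1 nodes in quorum."""
    return nodes_in_quorum >= 2 * m_parity + 1

print(has_quorum(3, 5))                  # True: 3 of 5 nodes form a majority
print(supports_protection_level(5, 2))   # True: N+2 needs at least 5 nodes in quorum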
The global cluster state is available via a group management protocol that guarantees a consistent view, across the entire cluster, of the state of the other nodes. When one or more nodes become unreachable, the group is split and all nodes resolve to a new consistent view of their cluster. In the split state, the file system remains reachable, and for the group with the quorum it is modifiable. A node that is down is rebuilt using the redundancy stored in the cluster. If the node becomes reachable again, a "merge" occurs and the two groups are brought back into one. The nodes can rejoin the cluster without being rebuilt and reconfigured. If the protection group changes during the merge, files may be restriped for rebalance. When a cluster splits, some blocks may become orphaned because they are re-allocated on the quorum side. Such blocks are collected through a parallelized mark-and-sweep scan.

Technical overview OneFS continued

In a file write operation on a three-node cluster, each node participates with a two-layer stack - the initiator and the participant. The node that the client connects to acts as the captain. In an Isilon cluster, data, parity, metadata and inodes are all distributed across multiple nodes. Reed-Solomon erasure encoding is used to protect data, and this is generally more efficient (up to 80%) on raw disk with five nodes or more. The stripe width of any given file is determined by the number of nodes in the cluster, the size of the file, and the protection setting such as N+2.
OneFS uses an InfiniBand back-end network for fast network access. Data is written in atomic units called protection groups. If every protection group is safe, the entire file is safe. For files protected by erasure codes, a protection group consists of a series of data blocks as well as a set of erasure codes. The type of protection group can be switched dynamically, temporarily relying on mirroring when node failures prevent the desired number of erasure codes from being used, and then reverting to erasure-code protection groups without administrator intervention.
The OneFS file system block size is 8 KB, and a billion such small files can be written at high performance because the on-disk structures can scale to that size. For larger files, multiple contiguous 8 KB blocks, up to sixteen, can be striped onto a single node's disk. For even larger files, OneFS can maximize sequential performance by writing in stripe units of sixteen contiguous blocks. An AutoBalance service reallocates and rebalances data to make storage space more efficient and usable.
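As a rough illustration of that layout arithmetic (8 KB blocks, sixteen contiguous blocks per stripe unit; the file size is just an example):

import math

BLOCK_SIZE = 8 * 1024             # 8 KB file system blocks
STRIPE_UNIT = 16 * BLOCK_SIZE     # sixteen contiguous blocks = 128 KB per stripe unit

def stripe_units(file_size_bytes):
    """Number of 128 KB stripe units needed to hold a file of the given size."""
    return math.ceil(file_size_bytes / STRIPE_UNIT)

print(stripe_units(1 * 1024 ** 2))   # a 1 MB file occupies 8 stripe units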
The captain node uses a two-phase commit transaction to safely distribute writes to multiple NVRAMs across the cluster. The mechanism relies on NVRAM for journaling all transactions on every node in the cluster. Using the NVRAMs in parallel allows high-throughput writes and safety against failures. When a node returns from failure, the only actions required are to replay its journal from NVRAM and, occasionally, for AutoBalance to rebalance files that were involved in the transaction. No resynchronization event ever needs to take place. Access patterns can be chosen from the following at a per-file or per-directory level:
Concurrency - optimizes for the current load on the cluster, featuring many simultaneous clients.
Streaming - optimizes for high speed streaming of a single file, for example to enable very fast reading within a single client.
Random - optimizes for unpredictable access to the file, by adjusting striping and disabling the use of any pre-fetch cache.
Caching is used to accelerate the process of writing data into an Isilon cluster. Data is written into the NVRAM-based cache of the initiator node before being written to disk, then batched up and flushed to disk later at a more convenient time. In the event of a cluster split or unexpected node outage, uncommitted cached writes are fully protected.
Caching operates as follows (a simplified sketch follows the steps):
1) An NFS client sends node 1 a write request for a file with N+2 protection.
2) Node 1 accepts the writes into its NVRAM write cache (fast path) and then mirrors the writes to the participant nodes' logfiles.
3) Write acknowledgements are returned to the NFS client immediately, so the write-to-disk latency is avoided.
4) As node 1's write cache fills, it is flushed via the two-phase commit protocol with the appropriate parity protection.
5) The write cache and participant node logfiles are cleared and made available to accept new writes.
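The toy Python sketch below walks through steps 1 to 5 in a single process. It is only an illustration of the write-back caching idea; the real implementation involves NVRAM, journals and a two-phase commit across nodes.

class ParticipantNode:
    """Participant that mirrors the initiator's cached writes in its logfile."""
    def __init__(self):
        self.journal = []

class InitiatorNode:
    """Toy model of the initiator node's write cache (steps 1-5 above)."""
    def __init__(self, participants, flush_threshold=4):
        self.write_cache = []              # stands in for the NVRAM write cache
        self.participants = participants
        self.flush_threshold = flush_threshold

    def write(self, data):
        self.write_cache.append(data)                  # step 2: accept into the write cache
        for node in self.participants:
            node.journal.append(data)                  # step 2: mirror to participants' logfiles
        if len(self.write_cache) >= self.flush_threshold:
            self.flush()                               # step 4: flush the batched writes
        return "ack"                                   # step 3: acknowledge before disk I/O

    def flush(self):
        # In OneFS this is a two-phase commit with parity protection; here the
        # batched writes are simply assumed to have reached stable storage.
        self.write_cache.clear()                       # step 5: clear the cache ...
        for node in self.participants:
            node.journal.clear()                       # ... and the participant logfiles

node1 = InitiatorNode([ParticipantNode(), ParticipantNode()])
print(node1.write(b"block-0"))   # "ack" is returned immediately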