Monday, July 29, 2013

Location services using spatial data

In applications like the NerdDinner samples posted online, location data is often represented as latitudes and longitudes stored in the database as non-nullable doubles or floats. Stores or restaurants in the neighborhood of a location are found by computing the distance between the latitude and longitude of the location and those of each store or restaurant, usually in a database function or stored procedure. The table of stores or restaurants and their locations is then scanned, the distance to the input location is computed for each row, and the results are ordered. The list is filtered down to the top handful and displayed on a map, such as with the Google Maps API. Application developers find it easy to write controllers that use these database objects or LINQ to SQL to populate the appropriate views.
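As a rough sketch of that pattern (not the NerdDinner code itself), the query below scans a list of stores whose coordinates are stored as doubles, orders them by great-circle distance and takes the closest few; the Store class and the haversine helper are illustrative names, not part of any published sample.

using System;
using System.Collections.Generic;
using System.Linq;

public class Store
{
    public int StoreId { get; set; }
    public string Name { get; set; }
    public double Latitude { get; set; }
    public double Longitude { get; set; }
}

public static class NearestStores
{
    // Haversine formula: great-circle distance in kilometers between two points.
    public static double DistanceKm(double lat1, double lon1, double lat2, double lon2)
    {
        const double earthRadiusKm = 6371.0;
        double dLat = ToRadians(lat2 - lat1);
        double dLon = ToRadians(lon2 - lon1);
        double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
                   Math.Cos(ToRadians(lat1)) * Math.Cos(ToRadians(lat2)) *
                   Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
        return earthRadiusKm * 2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a));
    }

    private static double ToRadians(double degrees)
    {
        return degrees * Math.PI / 180.0;
    }

    // Scan every row, compute the distance to the input location, order and take the top five.
    public static List<Store> FindNearest(IEnumerable<Store> stores, double lat, double lon)
    {
        return stores
            .OrderBy(s => DistanceKm(lat, lon, s.Latitude, s.Longitude))
            .Take(5)
            .ToList();
    }
}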
However, there are some limitations to this approach. First, it does not scale when there are hundreds of stores or restaurants within the radius of interest. Second, it does not answer frequently repeated queries such as finding the points inside a polygon, for example the polygon formed by a zip code. Queries could also compute the distance between two points more efficiently if the location were stored using the GEOGRAPHY or GEOMETRY data type in SQL Server. The geography data type stores ellipsoidal data such as GPS latitude and longitude, while the geometry data type stores data in a Euclidean (flat) coordinate system. One could then have a table such as:
ZipCodes
 - ZipCodeId
 - Code
 - StateID
 - Boundary
 - Center Point
Boundary can be considered the polygon formed by the zip, and the Center Point is the central location within that zip. Distances between stores and their membership in a zip can be calculated from this center point. The geography data type also lets you perform clustering analytics, which answer questions such as how many stores or restaurants satisfy a certain spatial condition and/or match certain attributes. These are implemented using R-Tree data structures, which support such clustering techniques.
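As a sketch only, the table above could be modeled in code-first Entity Framework using the DbGeography type described in the next paragraph; the class and the point-in-polygon membership check shown here are illustrative, not a definitive schema.

using System.Data.Spatial; // DbGeography in Entity Framework 5

public class ZipCode
{
    public int ZipCodeId { get; set; }
    public string Code { get; set; }
    public int StateId { get; set; }
    public DbGeography Boundary { get; set; }    // polygon formed by the zip
    public DbGeography CenterPoint { get; set; } // central location within the zip
}

// Intersects between a point and a polygon acts as a point-in-polygon test,
// answering whether a store's location belongs to this zip:
// bool inZip = zip.Boundary.Intersects(store.Location);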
Spatial data types such as the geography data type now enjoy support in Entity Framework 5, as described here, and are therefore available in LINQ, as explained here. .NET also supports these data types with the SqlGeography and SqlGeometry types for easy translation and mapping to their equivalent SQL Server data types.
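To illustrate, a minimal Entity Framework 5 sketch (assuming a hypothetical StoreContext and Store entity, and SQL Server 2008 or later as the store) could order stores by their distance from a point and take the nearest few:

using System;
using System.Data.Entity;
using System.Data.Spatial;
using System.Linq;

public class Store
{
    public int StoreId { get; set; }
    public string Name { get; set; }
    public DbGeography Location { get; set; } // maps to the SQL Server geography type
}

public class StoreContext : DbContext
{
    public DbSet<Store> Stores { get; set; }
}

class Program
{
    static void Main()
    {
        // Well-known text is POINT(longitude latitude); 4326 is the WGS 84 SRID.
        var here = DbGeography.PointFromText("POINT(-122.3321 47.6062)", 4326);

        using (var db = new StoreContext())
        {
            var nearest = db.Stores
                .OrderBy(s => s.Location.Distance(here)) // translated to STDistance on the server
                .Take(5)
                .ToList();

            foreach (var store in nearest)
            {
                Console.WriteLine("{0}: {1:F0} meters away", store.Name, store.Location.Distance(here));
            }
        }
    }
}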
One approach is to store the coordinates in a separate table where the primary key is the pair of latitude and longitude, declared unique so that a pair does not repeat. Such an approach is questionable because the uniqueness constraint on locations carries a maintenance overhead. For example, two records could refer to the same point, and unreferenced rows might then need to be cleaned up. Locations also change ownership: a store A could take over a location previously owned by store B while B never updates its record. Moreover, stores can undergo renames or conversions. Thus it may be better to keep the spatial data stored, in a repeatable way, alongside the rest of the information about the location.
Map APIs such as Google Maps or Bing Maps let you work with spatial data types, with the usual caveat that you should not store or cache locations independently.
References: StackOverflow and MSDN


Sunday, July 28, 2013

Let us look at an implementation for a map showing the coffee stores in your neighborhood and the friends of yours who have visited those same stores in the past day, month or year.
The API to retrieve store information in your neighborhood is the Starbucks location API. The friends' information comes from the Facebook API. The location updates by your friends are made on Facebook through automatic push notifications, including from a proposed Starbucks application.
Different applications make Facebook posts on their customers' walls using the permissions requested from them. These show up under location updates on the Facebook profiles. The same can be queried across Facebook friends for a given location. A list of the friends who have the given location on their wall is collected, and this is rendered along with the store mentioned.
The Starbucks API provides the following services:
1) OAuth API: provides access tokens based on one or more methods to acquire them, relying on a) password, b) client id and user id, c) authorization code, or d) client credentials. Refresh tokens are also provided.
2) Loyalty API: provides licensees with a way to integrate with their Siebel loyalty systems.
3) Location API : Provides methods to list all stores, stores near a geographic point, stores by specific ids, stores within a region.
4) Account API: provides methods to create US and non-US Starbucks accounts, to validate and recover credentials, to create account profiles and to update social profile data.
5) Card API: provides customers' card information to keep track of their purchases. This includes the card URL, card details and card balances, and the ability to reload a card.
6) Payment methods: enables collecting payment from the customer using a payment method such as a credit card or PayPal.
7) Rewards API: provides rewards summary and history information for the application user, such as points needed for the next level, points needed for the next free reward, points needed for re-evaluation, etc.
8) Barcode API: The barcode API generates Barcodes to communicate with in-store scanners and applications.
9) Content API: used to retrieve localized content for any web client. Content is stored and retrieved in Sitecore.
10) Transaction History API: These are used to retrieve the history data for a specified user, for all cards.
11) eGift API: These are used to retrieve the themes associated with eGifts.
12) Device Relationship API: used to insert or update a device and to report actions such as touch-to-pay, which displays the barcode for scanning, and touch-when-done, which reports when the barcode scan completes.
13) Product API: used to browse food, beverages, coffee, eGifts, etc.


Saturday, July 27, 2013

API Explorer for Starbucks and a Graph API implementation

The Starbucks APIs are OAuth enabled. This means they don't just grant access based on API keys but require an access token issued by an OAuth provider. The Starbucks APIs are available through Mashery, which provides a redirect to the Starbucks authorization endpoint, and this is where API users get their access tokens. OAuth enables one of four different workflows to get access tokens.
Implicit Grant - such as when a mobile application tries to get an access token from the authorization endpoint based on client id and user id.
Authorization Code grant - such as when a user logs in to an IIS-hosted site and the user's browser is redirected to the Starbucks authorization endpoint to get a one-time, short-lived authorization code. The client can then exchange the code for an access token.
Credentials Grant - such as when a user provides his or her username and password for a token.
Client Credentials Grant - such as when an application on a secured kiosk or site presents its own credentials, regardless of the user.
In building an explorer for the Starbucks API, we will need an access token to make the API calls. Since this application, which we call the API explorer, lets API users try out the different APIs with varying input parameters and inspect the responses, we will choose either the client credentials grant or the implicit grant to retrieve an access token on demand. Both XML and JSON responses can be displayed in the text area panel of the API explorer. This is conceived to be very similar to the Graph API Explorer from Facebook.
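As an illustration of the client credentials grant, the sketch below posts a standard OAuth 2.0 token request with HttpClient; the token endpoint URL here is a placeholder, and the exact parameter names should be confirmed against the Mashery/Starbucks developer documentation.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class TokenClient
{
    // Placeholder: substitute the authorization server URL from the developer documentation.
    private const string TokenEndpoint = "https://example.com/oauth/token";

    // Standard OAuth 2.0 client credentials grant: a form-encoded POST that returns
    // a JSON body containing the access_token.
    public static async Task<string> GetClientCredentialsTokenAsync(string clientId, string clientSecret)
    {
        using (var http = new HttpClient())
        {
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                { "grant_type", "client_credentials" },
                { "client_id", clientId },
                { "client_secret", clientSecret }
            });

            HttpResponseMessage response = await http.PostAsync(TokenEndpoint, form);
            response.EnsureSuccessStatusCode();

            // Parse the access_token out of the JSON with your preferred JSON library.
            return await response.Content.ReadAsStringAsync();
        }
    }
}

The API explorer can request such a token on push-button demand and reuse it for subsequent calls until it expires.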

Another application of the Starbucks API could be a deeper integration with Facebook's location data. For example, Starbucks customers would like to know which of their Facebook friends frequented the same Starbucks store on the same day as the one they are at currently. The Starbucks mobile application today maintains card history and rewards. If it could push Facebook location updates for the purchases it tracks at the store being visited, then Facebook friends could see where each other have been on a given day. This could encourage more sales at the store as friends try to catch up with each other, and at the very least it gives the coffee customer useful knowledge of who else has been doing the same at this store. Additionally, the Starbucks mobile application need not take the user to their Facebook page to view or post this data; it can instead offer a tip or balloon notification of which of the user's friends have been at this store and when, if any. Such tips are non-invasive, information only, and make the coffee experience an avenue for social networking. Interested users could be taken to a map that displays not just the stores but the Facebook friends who have visited each store in the past day, week or month.

Localization and globalization testing of websites

Usually referred to by the notations L10N and I18N, locale-specific website rendering is a significant test concern, both in the resources required for the testing and the time it consumes. The primary considerations for this testing are the linguistic, cosmetic or basic functionality issues in displaying information in a culture-specific manner. Some languages such as German require around 30% more space, while Chinese, for instance, requires around 30% less. Moreover, right-to-left languages such as Arabic and Hebrew require proper alignment, indentation and layout. Since UI resources for a website are typically collected and stored in resx files, their collation and translation is made easy with tools such as resgen.exe. However, the content alone does not guarantee its appropriateness when the website is rendered, hence additional testing is required. As with any variation of a website, a full test pass using functionality tests and load tests is incurred. These sites also require significant environment resources to be allocated, including culture-specific domain name registrations and associated servers. Each such resource requires setup, maintenance and constant tracking in various measurement and reporting systems. Such tasks increase the matrix of web testing. Fundamentally, this testing is rigorous, end to end, and repeated for each locale. What would be desirable is to unify the testing for the common content and factor out the testing specific to the locale. By unifying the tests upstream for much of the content and its display, significant savings are made in the test cost. Consider the steps involved in culture-specific testing today, as depicted below. Each is a full iteration over common content with repeated functionality and load tests, even though the locale-specific testing is focused on linguistic translation and cosmetics.
test-en-us : setup -> deployment -> functionality testing -> load testing -> translation and cosmetics -> completion
test-uk-en : setup -> deployment -> functionality testing -> load testing -> translation and cosmetics -> completion
test-de-de : setup -> deployment -> functionality testing -> load testing -> translation and cosmetics -> completion
If there were a solution that enabled a common test bed for much of this redundancy, such as:
test-neutral : setup -> deployment -> functionality testing -> load testing, followed per locale by
 - linguistic and translation tests
 - layout, width, height and indentation checks from static resource checking
 - repetition of the above for each locale variation
then the redundancies are removed and testing is more streamlined and focused on explicit culture-specific tasks.
Moreover, in the earlier model, test failures in one locale environment could differ from those in another locale environment on a case-by-case basis. By unifying the resources and the operations, much of this triage and variation can be avoided. The blog posts on the Pseudoizer can be very helpful here.
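To give a rough idea of the pseudo-localization technique (a sketch only, not the Pseudoizer tool itself), each resource string can be accented and padded by roughly a third so that truncation, layout and hard-coded-string bugs surface before any real translation exists:

using System;
using System.Text;

static class PseudoLocalizer
{
    // Map ASCII vowels to accented look-alikes and pad the string by roughly a third,
    // so German-style expansion and truncation problems show up early.
    public static string Pseudoize(string value)
    {
        var sb = new StringBuilder();
        foreach (char c in value)
        {
            switch (c)
            {
                case 'a': sb.Append('å'); break;
                case 'e': sb.Append('é'); break;
                case 'i': sb.Append('î'); break;
                case 'o': sb.Append('ö'); break;
                case 'u': sb.Append('ü'); break;
                default: sb.Append(c); break;
            }
        }
        int padding = Math.Max(1, value.Length / 3);
        sb.Append(new string('!', padding));
        return "[" + sb + "]";
    }
}

Running every resx value through such a transform turns "Sign in" into something like "[Sîgn în!!]", which remains readable to the tester while immediately revealing strings that were never externalized.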
                                                                               

Friday, July 26, 2013

Technical overview OneFS continued

Software upgrade of the Isilon cluster is done in one of two methods:
Simultaneous upgrade - This method installs the software updates and reboots all the nodes at the same time. It does cause a temporary interruption in serving data to clients, but the outage is typically kept under two minutes. The benefit is that system-wide changes can be made without data operations in progress; this keeps the impact on the customer small and can be considered safer, even though the service is interrupted, albeit temporarily.
Rolling upgrade - This method upgrades and restarts each node in the cluster sequentially. The cluster remains online and there is no disruption of service to the customer. This is ideal for minor revisions, but for major revisions of, say, the OneFS code, it may be better to perform a simultaneous upgrade so that version incompatibilities are avoided.
The same holds true for an upgrade. Additionally, a pre-verification script is run to ensure that only supported configurations are permitted to upgrade. If the checks fail, instructions for troubleshooting the issues are typically provided. Upgrades can be invoked from the administrative interfaces mentioned earlier, such as the CLI or the web administration UI. After the upgrade completes, the cluster is verified with a health status check.
Among the various services for data protection and management in the OneFS, some are listed below:
InsightIQ: This is a performance management service. It maximizes the performance of an Isilon scale-out storage system with performance monitoring and reporting tools. A back-end job called FSAnalyze gathers the file system analytics data used in conjunction with InsightIQ.
SmartPools is a resource management service that implements a highly efficient, automated tiered storage strategy. It keeps the single file system tree intact while tiering aged data. Recall that SmartPools subdivides a large set of homogeneous nodes into smaller, Mean Time to Data Loss (MTTDL)-friendly disk pools. By subdividing a node's disks into multiple, separately protected pools, nodes are also significantly more resilient to multiple disk failures.
SmartQuotas is a data management service that assigns and manages quotas which seamlessly partition the storage into easily managed segments at the cluster, directory and sub-directory levels.
SmartConnect: is a data access service that enables client connection, load balancing and dynamic NFS failover and fallback of client connections. Connections target different nodes to optimize the use of cluster resources.
SnapshotIQ is a data protection service that takes near-instantaneous snapshots while incurring little or no performance overhead. Recovery is equally fast with near-immediate, on-demand snapshot restores. Snapshot revert and delete are separate services.
Cloud management such as Isilon for vCenter is a software service that manages Isilon functions from vCenter. vCenter also comes with its own automatable framework.
SyncIQ  is a data replication service that replicates and distributes large, mission critical data sets, asynchronously to one or more alternate clusters. Replication can be targeted to a wide variety of sites and devices and this helps disaster recovery. The replication has a simple push-button operation.
SmartLock  is a data retention service that protects critical data against accidental premature or malicious alteration or deletion. It is also security standards compliant.
Aspera for Isilon is a content delivery service that provides high performance wide area file and content delivery.

Thursday, July 25, 2013

Technical overview OneFS continued

OneFS is designed to scale out as opposed to some storage systems that scale up. We can seamlessly increase the existing file system or volumes by adding more nodes to the cluster. This is done in three easy steps by the administrator:
1) adding another node into the rack
2) attaching the node to the Infiniband network
3) instructing the cluster to add the additional node
The data in the cluster is moved across to the new node by the AutoBalance feature in an automatic, coherent manner, such that the new node does not become a hot spot and existing data benefits from the additional performance capability. This works transparently, so storage can grow from terabytes to petabytes without any administration overhead.
The storage system is designed to work with all kinds of workflows - sequential, concurrent or random. OneFS provides for all these workflows because throughput and IOPS scale linearly with the number of nodes present in the system. Balancing plays a large role in keeping the performance linear with capacity. Each node is treated the same as it is added, and the cluster is homogeneous. Since each node has a balanced data distribution and there is automatic rebalancing and distributed processing, each additional CPU, network port and unit of memory is utilized as the system scales.
Administrators have a variety of interfaces to configure the OneFS.
The Web administration User Interface ("WebUI")
The command line interface that operates via SSH interfaces or RS232 serial connection
The LCD panel on the nodes themselves for simple add/remove functions.
RESTful platform API for programmatic control of cluster configuration and management.
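As a rough sketch of using the RESTful platform API, the call below authenticates with basic credentials over HTTPS and reads the cluster configuration; the resource path and port shown are illustrative assumptions, so check the OneFS Platform API reference for the endpoints your cluster version exposes.

using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class PlatformApiClient
{
    public static async Task<string> GetClusterConfigAsync(string clusterAddress, string user, string password)
    {
        using (var http = new HttpClient())
        {
            // Basic authentication against the cluster's management interface.
            string credentials = Convert.ToBase64String(Encoding.ASCII.GetBytes(user + ":" + password));
            http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", credentials);

            // Illustrative resource path and port; consult the Platform API documentation for exact URIs.
            string url = "https://" + clusterAddress + ":8080/platform/1/cluster/config";

            HttpResponseMessage response = await http.GetAsync(url);
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync(); // JSON describing the cluster configuration
        }
    }
}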
Files are secured by a variety of techniques:
Active Directory (AD)
LDAP (Lightweight Directory Access Protocol)
Network Information Service (NIS)
Local users and groups
Active Directory, which is a directory service for network resources, is integrated with the cluster by joining the cluster to the domain. The nodes of the cluster are then reachable via DNS, and users can be authenticated based on their membership in Active Directory.
LDAP provides a protocol to reach out to other directory service providers, so many more platforms can be targeted.
NIS is another protocol, often referred to as the yellow pages, that provides a way to authenticate users.
And finally the local users and groups of a node can be used to grant permission to that node.
Cluster access is partitioned into access zones. Access zones are logical divisions comprising
cluster network configuration
file protocol access
authentication
Zones are associated with a set of SMB/CIFS shares and one or more authentication providers for access control.




Technical overview of OneFS continued

OneFS manages protection of its data directly, allocating data during normal operations and rebuilding data after recovery. It does not rely on hardware RAID levels. OneFS determines which files are affected by a failure in constant time, and files are repaired in parallel. As the cluster size increases, its resiliency increases.
Systems that use a "hot spare" drive use it to replace a failed drive. OneFS avoids hot spare drives and instead uses available free space to recover from failure. This is referred to as virtual hot spare and guarantees that the system can self-heal.
Data protection is applied at the file level and not the system level, enabling the system to focus on only those files affected by a failure. Reed-Solomon error correction codes are used for data, but metadata and inodes are protected by mirroring only.
Further, the data protection is configurable and can be applied dynamically and online. For a file protected with N data blocks, M error code blocks and b file stripes, the protection level is N + M / b. When b = 1, M members can fail simultaneously and still provide 100% availability. For example, at +4 protection any four members can fail at once without data loss. As opposed to the double failure protection of RAID-6, this system can provide up to quadruple failure protection.
OneFS also does automatic partitioning of nodes to improve Mean Time to Data Loss (MTTDL). If an 80-node cluster at the +4 protection level is partitioned into four 20-node pools at +2, then the protection overhead is reduced, space is better utilized, and there is no net addition to the management overhead.
Automatic provisioning subdivides the nodes into pools of twenty nodes each, with six drives per node. Furthermore, a node's disks are then subdivided into multiple, separately protected pools, which are significantly more resilient to multiple disk failures than previously possible.
Supported protocols for client access to create, modify and read data include the following:
NFS: Network File System, used by UNIX/Linux-based computers
SMB/CIFS: Server Message Block and Common Internet File System
FTP: File Transfer Protocol
HTTP: Hypertext Transfer Protocol
iSCSI: Internet Small Computer System Interface
HDFS: Hadoop Distributed File System
REST API: Representational State Transfer Application Programming Interface
By default only SMB/CIFS and NFS are enabled. The root for all file data is /ifs, the Isilon OneFS file system. The SMB/CIFS protocol exposes an ifs share and NFS exposes an /ifs export.
Changes made through one protocol are visible to all others because the file data is common.