Wednesday, July 31, 2013

The datetime conversions in SQL Server include the following functions:
CONVERT is a good general choice, but DATENAME and DATEPART provide additional flexibility.
With SET DATEFORMAT mdy in effect, the string in the cast below is interpreted as month/day/year:
SELECT CAST('7/31/2013' AS DATETIME)
You can also use different styles to convert to and from text:
SELECT CONVERT(VARCHAR, CONVERT(DATETIME, '31-Jul-2013', 106), 106)
will display 31 Jul 2013
The allowed styles include:
Style 0, 100 (default): mon dd yyyy hh:miAM (or PM)
Style 101 (USA): mm/dd/yyyy
Style 102 (ANSI): yyyy.mm.dd
Style 103 (British/French): dd/mm/yyyy
Style 104 (German): dd.mm.yyyy
Style 105 (Italian): dd-mm-yyyy
Style 106: dd mon yyyy
Style 107: mon dd, yyyy
Style 108: hh:mi:ss
Style 9, 109: default with milliseconds
Style 110 (USA): mm-dd-yyyy
Style 111 (Japan): yyyy/mm/dd
Style 112 (ISO): yyyymmdd
Style 13, 113: European default with milliseconds and 24-hour clock
Style 114: hh:mi:ss:mmm with a 24-hour clock
Style 20, 120 (ODBC canonical, 24-hour clock): yyyy-mm-dd hh:mi:ss
Style 21, 121 (ODBC canonical with milliseconds, 24-hour clock): yyyy-mm-dd hh:mi:ss.mmm
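As a quick sketch, the same value can be rendered with a few of these styles (the literal used here is arbitrary):

DECLARE @d DATETIME = '20130731 18:30:00';
SELECT CONVERT(VARCHAR(30), @d, 101) AS UsaStyle,      -- 07/31/2013
       CONVERT(VARCHAR(30), @d, 103) AS BritishStyle,  -- 31/07/2013
       CONVERT(VARCHAR(30), @d, 112) AS IsoStyle,      -- 20130731
       CONVERT(VARCHAR(30), @d, 120) AS OdbcCanonical; -- 2013-07-31 18:30:00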
 
A look at some of the SQL constructs in no particular order:
1) Common table expression (CTE) - a temporary, named result set maintained only for the duration of a single statement such as a SELECT, INSERT, UPDATE or DELETE.
It can be thought of as a substitute for a view that doesn't require the view definition to be stored in the metadata. It follows the syntax WITH CTE_Name (Col1, Col2, Col3) AS (SELECT query to populate), followed by the outer query that references the CTE.
A CTE can be written in a recursive manner as follows:
WITH CTE_name (column_name) AS
(
    CTE_query_definition   -- anchor member
    UNION ALL
    CTE_query_definition   -- recursive member, defined by referencing CTE_name
)
-- statement using the CTE
SELECT * FROM CTE_name
For example:
WITH DirectReports (ManagerID, EmployeeID, Level) AS
(
    SELECT e.ManagerID, e.EmployeeID, 0 AS Level
    FROM dbo.Employee AS e
    WHERE e.ManagerID IS NULL
    UNION ALL
    SELECT e.ManagerID, e.EmployeeID, d.Level + 1
    FROM dbo.Employee AS e
    INNER JOIN DirectReports AS d ON e.ManagerID = d.EmployeeID
)
SELECT ManagerID, EmployeeID, Level FROM DirectReports
 2) Some query hint options include:
HASH GROUP or ORDER GROUP - specifies whether aggregations use hashing or ordering
MERGE, HASH or CONCAT UNION - specifies which operator should be used for a union
LOOP, MERGE or HASH JOIN - specifies which operator should be used for a join
FAST number_rows - optimizes for fast retrieval of the first number_rows rows
FORCE ORDER - specifies that the join order indicated by the query syntax is preserved during optimization
MAXDOP - overrides the max degree of parallelism configuration setting (0 means use the maximum available)
OPTIMIZE FOR UNKNOWN - informs the query optimizer to use statistical data instead of the initial values for variables
RECOMPILE - discards the plan generated for the query after it executes, forcing the query optimizer to recompile
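For example, hints are attached to a statement with an OPTION clause; a minimal sketch (the Orders and Customer tables here are hypothetical):

SELECT c.CustomerID, COUNT(*) AS OrderCount
FROM dbo.Orders AS o
INNER JOIN dbo.Customer AS c ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID
OPTION (HASH GROUP, HASH JOIN, MAXDOP 1, RECOMPILE);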

Tuesday, July 30, 2013

stored procedure to get stores by latitude and longitude

USE Store
GO

/****** Object:  StoredProcedure [dbo].[GetStoresByGeo]    Script Date: 7/30/2013 6:36:37 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author:           Ravishankar Rajamani
-- Create date: July 2013
-- Description:      Get a list of all Stores by co-ordinates
-- =============================================

CREATE PROCEDURE [dbo].[GetStoresByGeo]
@Latitude VARCHAR(50),
@Longitude VARCHAR(50),
@Total int OUTPUT
AS
BEGIN
SET NOCOUNT ON;

      -- get neighboring coordinates
      DECLARE @g geography; -- center point; note that WKT lists longitude first, then latitude
      SET @g = geography::STPointFromText('POINT(' + @Longitude + ' ' + @Latitude + ')', 4326);

      DECLARE @n geography; -- neighborhood: a buffer of 1000 miles around the center
      SET @n = @g.STBuffer(1000 * 1609.344);

      -- get count (the STBuffer result could be persisted as a computed column)
      SELECT @Total = COUNT(*)
      FROM Store (nolock)
      WHERE Coordinates.STWithin(@n) = 1;

      SELECT @Total AS Total;

      -- Get a table of StoreIDs of results
      DECLARE @StoreIds TABLE (ID int);
      INSERT INTO @StoreIds
      SELECT StoreID FROM Store (nolock) AS S WHERE Coordinates.STWithin(@n) = 1 ORDER BY StoreID;

      -- Get stores
      SELECT * FROM Store (nolock) AS S INNER JOIN @StoreIds AS Ids ON Ids.ID = S.StoreID;
END

Caveat: this stored procedure has not been tested and merely illustrates an idea from a similar real-world example.
Here's the location usage in the NerdDinner sample, which we could modify to use the geography data type as per the previous post:


        public IQueryable<Dinner> FindByLocation(float latitude, float longitude)
        {
            var dinners = from dinner in FindUpcomingDinners()
                          join i in NearestDinners(latitude, longitude)
                          on dinner.DinnerID equals i.DinnerID
                          select dinner;

            return dinners;
        }


/****** Object:  UserDefinedFunction [dbo].[NearestDinners]    Script Date: 7/30/2013 7:39:11 AM ******/
DROP FUNCTION [dbo].[NearestDinners]
GO

/****** Object:  UserDefinedFunction [dbo].[NearestDinners]    Script Date: 7/30/2013 7:39:11 AM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE FUNCTION [dbo].[NearestDinners]
(
@lat real,
@long real
)
RETURNS  TABLE
AS
RETURN
SELECT     Dinners.DinnerID
FROM         Dinners
WHERE dbo.DistanceBetween(@lat, @long, Latitude, Longitude) <100

GO


Monday, July 29, 2013

The advantage of using geography data types for REST-based location services is that the URI can continue to remain the same for the customer. The latitude and longitude are in any case represented in the well-known text (WKT) format described by the Open Geospatial Consortium.
For example,
DECLARE @g geography;
SET @g = geography::STPointFromText('POINT(-122.34900 47.65100)', 4326);
SELECT @g.ToString();
Internally these WKTs are translated to Geography data type instances and stored like regular columns along with the attributes of the location.
They are queried using functions based on the geography data types. The functions could be based on distance between two points or their membership to a polygon such as one representing a zip.
They invoke corresponding methods on the geography data type, such as STDistance():
DECLARE @g geography;
DECLARE @h geography;
SET @g = geography::STGeomFromText('LINESTRING(-122.360 47.656, -122.343 47.656)', 4326);
SET @h = geography::STGeomFromText('POINT(-122.34900 47.65100)', 4326);
SELECT @g.STDistance(@h);
or find all the points that intersect a polygon, as follows:
DECLARE @Seattle geography = 'POLYGON(....)'; 
SELECT c.customerid FROM customer c WHERE c.geog.STIntersects(@Seattle.STBuffer(10 * 1609.344));
In either case, we can retrieve the points found in the vicinity of a location of interest.
By changing the column data type, we may have to handle all dependencies on the earlier data type. If the location data was used for any index creation or used in conjunction with other tables or database objects, those would need to be modified to use the geography data type. In general, this should not be a problem since pretty much all the semantic operations performed on the coordinates are supported by the geography data type.
Also, one more thing to point out is that the geography column does not require an index to be created. In fact, this might be a good opportunity to refactor all references to the geography column and limit their usage to only the database objects that support querying by the user.
Finally, database unit-tests (DUT) must be authored before the changes are made so that we can keep the same behaviour before and after we make the changes. These DUTs could include specific use case scenarios, as well as column scanning to see if any of the conversions failed, or relative distances between select points are maintained with the modified column type.
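As a rough sketch of such a conversion, assuming a hypothetical Store table with existing Latitude and Longitude float columns (the names are not from the post itself):

ALTER TABLE dbo.Store ADD Coordinates geography NULL;
GO
-- geography::Point takes latitude, then longitude, then the SRID
UPDATE dbo.Store
SET Coordinates = geography::Point(Latitude, Longitude, 4326)
WHERE Latitude IS NOT NULL AND Longitude IS NOT NULL;
GO
-- A simple check, in the spirit of the database unit-tests above, that no rows were missed
SELECT COUNT(*) AS Unconverted
FROM dbo.Store
WHERE Coordinates IS NULL AND Latitude IS NOT NULL;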

Location services using spatial data

In applications like the NerdDinner samples posted online, the location data is often represented with latitudes and longitudes that are stored in the database as non-nullable doubles or floating point values. Stores or restaurants in the neighborhood of a location are found by the cartesian distance between the latitude and longitude of the location and those of the store or restaurant, usually implemented as a database function or stored procedure. The table of stores or restaurants with their locations is then scanned to find the distances from the input location and ordered. This is then filtered to the top handful and displayed on maps such as with the Google API. Application developers find it easy to write controllers that use the said database objects or LINQ to SQL to display the appropriate views.
However, there are some limitations with this approach. First, it doesn't scale when there are hundreds of stores or restaurants in the radius of interest. Second, it doesn't answer often-repeated queries such as finding the points within a polygon, for example the polygon formed by a zip code. Queries could also find the distance between two points more efficiently if the data type used to store the location were, say, the GEOGRAPHY or GEOMETRY data type in SQL Server. The geography data type stores ellipsoidal data such as GPS latitude and longitude, and the geometry data type stores Euclidean (flat) coordinates. One could then have a table such as:
ZipCodes
- ZipCodeId
- Code
- StateID
- Boundary
- CenterPoint
Boundary can be considered the polygon formed by the zip, and CenterPoint is the central location in that zip. Distances between stores and their membership in a zip can be calculated based on this center point. The geography data type also lets you perform clustering analytics that answer questions such as the number of stores or restaurants satisfying a certain spatial condition and/or matching certain attributes. Spatial indexes, commonly built on R-Tree or grid-based structures, support such clustering queries.
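A minimal sketch of such a table and a containment query (the names, sizes and the Store table are assumptions):

CREATE TABLE dbo.ZipCodes
(
    ZipCodeId   int IDENTITY(1,1) PRIMARY KEY,
    Code        varchar(10) NOT NULL,
    StateID     int NOT NULL,
    Boundary    geography NOT NULL,   -- polygon formed by the zip
    CenterPoint geography NOT NULL    -- central location within the zip
);
GO
-- Stores whose coordinates fall within a given zip boundary
SELECT s.StoreID
FROM dbo.Store AS s
INNER JOIN dbo.ZipCodes AS z ON z.Code = '98052'
WHERE s.Coordinates.STIntersects(z.Boundary) = 1;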
Spatial data types such as the geography data type now enjoy support in Entity Framework 5 as described here and are therefore available in LINQ as explained here. .NET also supports these data types with the SqlGeography and SqlGeometry types for easy translation and mapping to their equivalent SQL Server data types.
One approach is to store the coordinates in a separate table where the primary keys are saved as the pair of latitude and longitude, described as unique so that a pair of latitude and longitude does not repeat. Such an approach is questionable because the uniqueness constraint for locations has a maintenance overhead. For example, two locations could refer to the same point, and unreferenced rows might need to be cleaned up. Locations also change ownership; for example, store A could own a location that was previously owned by store B, but B never updates its location. Moreover, stores could undergo renames or conversions. Thus it may be better to keep the spatial data associated, in a repeatable way, with the information about the location.
Map APIs such as Google Maps or Bing Maps let you work with spatial data types, with the usual caveat that locations should not be stored or cached independently.
References: StackOverflow and MSDN


Sunday, July 28, 2013

Let us look at an implementation for a map showing the coffee stores in your neighborhood and the friends of yours who have visited these same stores in the past day, month or year.
The API to retrieve store information in your neighborhood is the Starbucks Location API. The friends' information is given by the Facebook API. The location updates by your friends are made on Facebook through automatic push notifications, including from a proposed Starbucks application.
Different applications make Facebook posts on their customers' walls using the permission requested from them. These show up under location updates on the Facebook profiles. The same can be queried for other Facebook friends for a given location. A list of such friends who have the given location on their wall is collected, and this should be rendered along with the store mentioned.
The Starbucks API provides the following services:
1) OAuth APIs: provide access tokens based on one or more methods to acquire them, relying on
a) password, b) client id and user id, c) authorization code, d) client credentials. Refresh tokens are also provided.
2) Loyalty API: provides licensees with a way to integrate with their Siebel loyalty systems.
3) Location API : Provides methods to list all stores, stores near a geographic point, stores by specific ids, stores within a region.
4) Account API: such as to create US Starbucks account and non-US Starbucks account, to validate and recover credentials, to create account profiles and to update social profile data.
5) Card API: Provides customer's card information to keep track of their purchases. This gives information such as Card URL, Card details, Card balances, and the ability to reload a card.
6) Payment methods : Enables collection of sums from the customer using a payment method such as a credit card or PayPal
7) Rewards API: provides rewards summary and history information based on application user, points needed for next level, points needed for next free reward, points needed for re-evaluation etc.
8) Barcode API: The barcode API generates Barcodes to communicate with in-store scanners and applications.
9) Content API: These are used to retrieve localized content for any web client. Content is stored and retrieved in Sitecore.
10) Transaction History API: These are used to retrieve the history data for a specified user, for all cards.
11) eGift API: These are used to retrieve the themes associated with eGifts.
12) Device Relationship API: These are used to insert or update a device, and to get or report actions such as touch-to-pay (to display the barcode for scanning) and touch-when-done (to report when barcode scanning is complete).
13) Product API : to browse food, beverage, Coffee, eGifts etc.


Saturday, July 27, 2013

API Explorer for Starbucks and a Graph API implementation

Starbucks APIs are OAuth-enabled. This means that they don't just grant access based on API keys but require an access token that is provided by an OAuth provider. Starbucks APIs are available from Mashery, which provides a redirect to the Starbucks authorization endpoint, and this is where API users get their access token. OAuth enables one of four different workflows to get access tokens.
Implicit Grant - such as when a mobile application tries to get an access token from the authorization endpoint based on client id and user id.
Authorization Code Grant - such as when a user logs in to an IIS-hosted site and the user's browser is redirected to the Starbucks authorization endpoint to get a one-time, short-lived authorization code. The client can then exchange the code for an access token.
Credentials Grant - such as when a user provides his or her username and password for a token.
Client Credentials Grant - such as when an application from a secured kiosk or site provides context regardless of the user.
In building an explorer for Starbucks API, we will need to get an access token to make the API calls. Since this application that we call the API explorer enables API users to try out the different APIs based on input parameters and responses, we will choose either the client credentials grant or the implicit grant to retrieve an access token at push button demand. Both XML and JSON responses can be displayed in the text area panel of the API explorer. This is conceived to be very similar to the Graph API Explorer from Facebook.

Another application of Starbucks API could be a deeper integration with the FaceBook's location data. For example Starbucks customers would like to know which of their friends from FaceBook frequented the same Starbucks store the same day as the one they are at currently. Starbucks mobile application today maintains card history and rewards on their application. If they could push FaceBook location updates on purchases that they track with their mobile application at the store that they visit, then Facebook friends could see where each other have been on a given day. This could encourage more sales at the Starbucks store as friends try to catch up with each other and at the very least provides useful knowledge to the Starbucks coffee customer of who else has been doing the same at this store. Additionally Starbucks mobile application need not take the user to their Facebook page to view or post this data, but offer a tip or balloon notification of which of the application user's friends had been at this store and when, if any. Such tips are non-invasive, information only and enables the coffee experience to be an avenue for social networking.  Interested users could be taken to a map that displays not just the stores but the Facebook friends that have visited that store in the past day, week or month.

Localization and globalization testing of websites

Usually referred to by the notations L10N and I18N, locale-specific website rendering is a significant test concern both in terms of the resources required for the testing and the time it consumes. The primary considerations for this testing are the linguistic, cosmetic and basic functionality issues in displaying information in a culture-specific manner. Some languages such as German require around 30% more space while Chinese, for instance, requires around 30% less. Moreover, right-to-left languages such as Arabic and Hebrew require proper alignment, indentation and layout. Since UI resources for a website are typically collected and stored in resx files, their collation and translation is made easy with tools such as resgen.exe. However, the content alone does not guarantee its appropriateness to the website rendering, hence additional testing is required. As with any variation of the website, a full test pass using functionality tests and load tests is incurred. These sites also require significant environment resources to be allocated, including culture-specific domain name registrations and associated servers. Each such resource requires setup, maintenance and constant tracking in various measurement and reporting systems. Such tasks increase the matrix of the web testing. Fundamentally, this testing is rigorous, end to end, and repeated for each locale.
What would be desirable is to unify the testing for the common content and factor out the testing specific to the locale. By unifying the tests upstream for much of the content and its display, there are significant savings in the test cost. Consider the steps involved in culture-specific testing today, as depicted below. Each of them is a full iteration over common content with repeated functionality and load tests, even though the locale-specific testing is focused on linguistic translation and cosmetics.
test-en-us  : setup ->deployment->functionality testing->load testing->translation and cosmetics->completion
test-uk-en : setup ->deployment->functionality testing->load testing->translation and cosmetics->completion
test-de-de : setup ->deployment->functionality testing->load testing->translation and cosmetics->completion
If there were a solution that enabled a common test bed for the redundancies, it could look like this:
test-neutral: setup->deployment->functionality testing->load testing, followed by locale-specific steps:
-> linguistic and translation tests
-> layout, width, height and indentation checks from static resource checking
-> variations of locale for repeating the above
This way, the redundancies are removed and testing is more streamlined, focused on explicit culture-specific tasks.
Moreover, in the earlier model, test failures in one locale environment could differ from those in another locale environment on a case by case basis. By unifying the resources and the operations, much of this triage and variation can be avoided. The blog posts on the Pseudoizer can be very helpful here.
                                                                               

Friday, July 26, 2013

Technical overview OneFS continued

Software upgrade of the Isilon cluster is done in one of two methods:
Simultaneous upgrade - This method installs the software updates and reboots all the nodes at the same time. It causes a temporary interruption in serving data to clients, but the interruption is typically kept under two minutes. The benefit is that system-wide changes can be made without any concurrent data operations, which can be considered safer even though the service is briefly interrupted.
Rolling upgrade - This method upgrades and restarts each node in the cluster sequentially. The cluster remains online and there is no disruption of service to the customer. This is ideal for minor revisions, but for major revisions of, say, the OneFS code, it may be better to perform a simultaneous upgrade so that version incompatibilities are avoided.
The same holds true for an upgrade. Additionally, a pre-verification script is run to ensure that only supported configurations are permitted to upgrade. If the checks fail, instructions on troubleshooting the issues are typically provided. Upgrades can be invoked through the administrative interfaces mentioned earlier, such as the CLI or the web admin UI. After the upgrade completes, the cluster is verified with a health status check.
Among the various services for data protection and management in the OneFS, some are listed below:
InsightIQ: This is a performance management service. It maximizes the performance of the Isilon scale-out storage system with innovative performance monitoring and reporting tools. A backend job called FSAnalyze gathers the file system analytics data used in conjunction with InsightIQ.
SmartPools is a resource management service that implements a highly efficient, automated tiered storage strategy. It keeps the single file system tree intact while tiering aged data. Recall that SmartPools subdivides the large set of homogeneous nodes into smaller Mean Time to Data Loss (MTTDL)-friendly disk pools. By subdividing a node's disks into multiple, separately protected pools, nodes are also significantly more resilient to multiple disk failures.
SmartQuotas: is a data management service. This assigns and manages quota that seamlessly partition the storage into easily managed segments at the cluster, directory and sub-directory levels.
SmartConnect: is a data access service that enables client connection, load balancing and dynamic NFS failover and fallback of client connections. Connections target different nodes to optimize the use of cluster resources.
SnapshotIQ is a data protection service that takes near-instantaneous snapshots while incurring little or no performance overhead. Recovery is equally fast with near-immediate, on-demand snapshots. Snapshot revert and delete are separate services.
Cloud management such as Isilon for VCenter is a software service that manages Isilon functions from VCenter. VCenter also comes with its own automatable framework.
SyncIQ  is a data replication service that replicates and distributes large, mission critical data sets, asynchronously to one or more alternate clusters. Replication can be targeted to a wide variety of sites and devices and this helps disaster recovery. The replication has a simple push-button operation.
SmartLock  is a data retention service that protects critical data against accidental premature or malicious alteration or deletion. It is also security standards compliant.
Aspera for Isilon is a content delivery service that provides high performance wide area file and content delivery.

Thursday, July 25, 2013

Technical overview OneFS continued

OneFS is designed to scale out as opposed to some storage systems that scale up. We can seamlessly increase the existing file system or volumes by adding more nodes to the cluster. This is done in three easy steps by the administrator:
1) adding another node into the rack
2) attaching the node to the Infiniband network
3) instructing the cluster to add the additional node
The data in the cluster is moved across to the new node by the AutoBalance feature in an automatic, coherent manner such that the new node will not become a hot spot and existing data benefits from the additional performance capability. This works transparently so that storage can grow from TB to PB without any administration overhead.
The storage system is designed to work with all kinds of workflows - sequential, concurrent or random. OneFS provides for all these workflows because throughput and IOPS scale linearly with the number of nodes present in the system. Balancing plays a large role in keeping the performance linear with capacity. Each node is treated the same as it is added, and the cluster remains homogeneous. Since each of the nodes has a balanced data distribution and there is automatic rebalancing and distributed processing, each additional CPU, network port and memory is utilized as the system scales.
Administrators have a variety of interfaces to configure the OneFS.
The Web administration User Interface ("WebUI")
The command line interface that operates via SSH interfaces or RS232 serial connection
The LCD panel on the nodes themselves for simple add/remove functions.
RESTful platform API for programmatic control of cluster configuration and management.
Files are secured by a variety of techniques :
Active Directory (AD)
LDAP Lightweight Directory Access Protocol
Network Information Service
Local users and groups.
Active Directory which is a directory service for the network resources is integrated with the cluster by joining the cluster to the domain. The nodes of the cluster are now reachable via the DNS and the users can be authenticated based on their membership to Active Directory.
LDAP provides a protocol to reach out to other directory services provider. So many more platforms can be targeted.
NIS is another protocol that is referred to as the yellow pages and provides a way to secure the users
And finally the local users and groups of a node can be used to grant permission to that node.
Cluster access is partitioned into access zones. Access zones are logical divisions comprising:
cluster network configuration
file protocol access
authentication
Zones are associated with a set of SMB/CIFS shares and one or more authentication providers for access control.




Technical overview of OneFS continued

OneFS manages protection of its data directly by allocating data during normal operations and rebuilding data after recovery.  It does not rely on hardware RAID levels. OneFS determines which files are affected by a failure in constant time. Files are repaired in parallel. As the cluster size increases, their resiliency increases.
Systems that use a "hot spare" drive, use it to replace a failed drive.  OneFS avoids the use of hot spare drives and instead uses available free space to recover from failure. This is referred to as virtual hot spare and guarantees that the system can self-heal.
Data protection is applied at the file level and not the system level, enabling the system to focus on only those files affected by failure. Reed Solomon error codes are used for data but metadata and inodes are protected by mirroring only.
Further, the data protection is configurable and can be applied dynamically and online. For file protection involving N data blocks protected by M error code blocks across b file stripes, the protection level is denoted N+M/b. When b = 1, M members can fail simultaneously and the data remains 100% available. As opposed to the double-failure protection of RAID-6, this system can provide up to quadruple failure protection.
OneFS also does automatic partitioning of nodes to improve Mean Time to Data Loss (MTTDL). If an 80-node cluster at the +4 protection level is partitioned into four twenty-node pools at +2, then the protection overhead is reduced, space is better utilized and there is no net addition to management overhead.
Automatic provisioning subdivides the nodes into pools of twenty nodes each and six drives per node. Furthermore, the node's disks are now subdivided into multiple, separately protected pools and they are significantly more resilient to multiple disk failures than previously possible.
Supported protocols for client access to create, modify and read data include the following:
NFS Network file system used for unix/linux based computers
SMB/CIFS (server message block and Common Internet File System)
FTP : File Transfer Protocol
HTTP : Hypertext Transfer Protocol
iSCSI : Internet Small Computer System Interface
HDFS: Hadoop distributed file system
REST API : Representational state transfer Application Programming Interface.
By default only SMB/CIFS and NFS are enabled. The root for all file data is /ifs, the Isilon OneFS file system. The SMB/CIFS protocol exposes an ifs share and NFS exposes an /ifs export.
Changes made by one protocol are visible to all others because the file data is common.

Wednesday, July 24, 2013

Technical overview of OneFS continued

Locks and Concurrency in OneFS is implemented by a lock manager that marshals locks across all nodes in a storage cluster. Multiple and different kinds of locks referred to as lock "personalities" can be acquired. File System locks as well as cluster coherent protocol-level locks such as SMB share mode locks or NFS advisory-mode locks are supported. Even delegated locks such as CIFS oplocks and NFSv4 delegations are supported.
Every node in a cluster is a coordinator for locking resources, and a coordinator is assigned to lockable resources based upon an advanced hashing algorithm. Usually the coordinator is different from the initiator. When a lock is requested, such as a shared lock for reads or an exclusive lock for writes, the call sequence proceeds something like this:
1) Let's say Node 1 is the initiator for a write, and Node 2 is designated the co-ordinator. Node 3 and Node 4 are shared readers.
2) The readers request a read lock from the co-ordinator at the same time.
3) Coordinator checks if an exclusive lock is granted for the file.
4) if no exclusive locks exist, then the co-ordinator grants shared locks to the readers.
5) The readers begin their read operations on the requested file
6) An exclusive lock for the same file that is being read by the readers is now requested by the writer.
7) The co-ordinator checks if the locks can be reclaimed
8) The writer is made to block/wait while the readers are reading.
9) The exclusive lock is granted by the coordinator and the writer begins writing to the file.
When the files are large and the number of nodes is large, high throughput and low latency become important. In such cases multi-writer support is made available by dividing the file into separate regions and providing granular locks for each region.
Failures are tolerated such as in the case of power loss. A journal is maintained to record changes to the file system and this enables fast, consistent recovery from a power loss or other outage. No file scan or disk check is required with a journal. The journal is maintained on a battery backed NVRAM card. When the node boots up, it checks its journal and replays the transactions. If the NVRAM is lost or the transactions are not recorded, the node will not mount the file system.
In order for the cluster to function, a quorum of nodes must be active and responding. This can be a simple majority where one more than half the nodes are functioning. A node that is not part of the quorum is in a read only state. The simple majority helps avoid split-brain conditions when the cluster temporarily splits into two. The quorum also dictates the number of nodes required in order to move to a given data protection level. For an N+M protection level, 2*M+1 nodes must be in quorum.
The global cluster state is available via a group management protocol that guarantees a consistent view across the entire cluster of the state of the other nodes. When one or more nodes become unreachable, the group is split and all nodes resolve to a new consistent view of their cluster. In the split state, the file system is reachable, and for the group with the quorum it is modifiable. The node that is down is rebuilt using the redundancy stored in the cluster. If the node becomes reachable again, a "merge" occurs where the two groups are brought back into one. The nodes can rejoin the cluster without being rebuilt and reconfigured. If the protection group changes during the merge, files may be restriped for rebalance. When a cluster splits, some blocks may get orphaned because they are re-allocated on the quorum side. Such blocks are collected through a parallelized mark and sweep scan.

Technical overview OneFS continued

In a file write operation on a three-node cluster, each node participates with a two-layer stack - the initiator and the participant. The node that the client connects to acts as the captain. In an Isilon cluster, data, parity, metadata and inodes are all distributed on multiple nodes. Reed-Solomon erasure encoding is used to protect data, and this is generally more efficient (up to 80%) on raw disk with five nodes or more. The stripe width of any given file is determined based on the number of nodes in the cluster, the size of the file, and the protection setting such as N+2.
OneFS uses an InfiniBand back-end network for fast network access. Data is written in atomic units called protection groups. If every protection group is safe, the entire file is safe. For files protected by erasure codes, a protection group consists of a series of data blocks as well as a set of erasure codes. Types of protection groups can be switched dynamically, temporarily relying on mirroring when node failures prevent the desired number of erasure codes from being used, and then reverting to erasure-coded protection groups without admin intervention.
The OneFS file system block size is 8 KB, and a billion such small files can be written at high performance because the on-disk structures can scale to that size. For large files, multiple contiguous 8 KB blocks (up to sixteen) can be striped onto a single node's disk. For even larger files, OneFS can maximize sequential performance by writing in stripe units of sixteen contiguous blocks. An AutoBalance service reallocates and rebalances data to make storage space more efficient and usable.
The captain node uses a two-phase commit transaction to safely distribute writes to multiple NVRAMs across the cluster. The mechanism relies on NVRAM for journaling all transactions on every node in the cluster. Using NVRAMs in parallel allows high-throughput writes and safety against failures. When a node returns from failure, the only action required is to replay its journal from the NVRAM and occasionally have AutoBalance rebalance files that were involved in the transaction. No resynchronization event ever needs to take place. Access patterns can be chosen from the following at a per-file or per-directory level:
Concurrency - optimizes for the current load on the cluster, featuring many simultaneous clients.
Streaming - optimizes for high speed streaming of a single file, for example to enable very fast reading within a single client.
Random - optimizes for unpredictable access to the file, by adjusting striping and disabling the use of any pre-fetch cache.
Caching is used to accelerate the process of writing data into an Isilon cluster.  Data is written into the NVRAM based cache of the initiator node before writing to disk, then batched up and flushed to disks later at a more convenient time. In the event of a cluster split or unexpected node outage, uncommitted cached writes are fully protected.
Caching operates as follows :
1) an NFS client sends node 1 a write request for file with N+2 protection
2) Node 1 accepts the writes into its NVRAM write cache (fast path) and then mirrors the writes to the participant nodes' logfiles
3) Write acknowledgements are returned to the NFS client immediately, and as such write-to-disk latency is avoided
4) As node 1's write cache fills, it is flushed via two phase commit protocol  and appropriate parity protection
5) the write cache and participant node files are cleared and available to accept new writes.

Tuesday, July 23, 2013

Technical overview of OneFS file system continued

Each cluster creates a single namespace and file system. This means that the file system is distributed across all nodes. There is no partitioning and no need for volume creation. Each cluster can be accessed by any node in the cluster. Data and metadata are striped across the nodes for redundancy and availability. The single file tree can grow with node additions without any interruption to the user. Services take care of all the associated management routines such as tiering, replication etc. Files are transferred in parallel without regard to the depth and breadth of the tree.
This design is different from having separate namespaces or volumes and aggregating them to appear as a single one. There, the administrator has to lay out the tree, move files between volumes, and bear the purchase, power and cooling costs.
OneFS uses physical pointers and extents for metadata and stores files and directory metadata in inodes. B-Trees are used extensively in the file system. Data and metadata are redundant and the amount of redundancy can be configured by the administrator. Metadata only takes up about 1% of the system. Metadata access and locking are collectively managed in a peer to peer architecture.
The OneFS blocks are accessed by multiple devices simultaneously, so the addressing scheme is indexed by a tuple of {node, drive, offset}. Data is protected by erasure coding and is labeled by the number of failures it can tolerate simultaneously, such as "N+2", which can withstand two failures when the metadata is mirrored 3x.
OneFS controls the placement of files directly, down to the sector level on any drive. This allows for optimized data placement and I/O patterns and avoids unnecessary read-modify-write operations. As opposed to dedicating entire RAID volumes to a particular performance type and protection setting, the file system is homogeneous and spindle usage can be optimized with configuration changes made at any time and online.
Every node that participates in the I/O has a stack that comprises two layers. The top half is the initiator that serves client-side protocols such as NFS, CIFS, iSCSI, HTTP, FTP etc., and the bottom layer is the participant that comprises the NVRAM. When a client connects to a node to write to a file, it connects to the top half where the files are broken into smaller logical chunks called stripes before being written. Failure-safe buffering using a write coalescer is used to ensure that the writes are efficient.
OneFS stripes the data across all nodes and not simply across all disks.
OneFS file system overview:
The EMC OneFS file system is a distributed file system that runs on a storage cluster. It combines three layers of traditional storage architecture namely, a File System, Volume Manager, and data protection.
It scales out because it relies on intelligent software, commodity hardware, and distributed architecture.
OneFS works with a cluster that consists of multiple nodes and starts out with as few as three nodes. In the cluster, nodes could provide different ratios of throughput and input/output operations per second (IOPS). OneFS combines these into a whole - RAM is grouped together into a single coherent cache, NVRAM is grouped together for high-throughput writes, and spindles and CPUs are combined to increase throughput, capacity and IOPS.
There are two types of networks associated with a cluster : internal and external. Intra-node communication in a cluster is performed using a proprietary unicast node to node protocol. IP over InfiniBand network is used with redundant switches.
Clients connect to the cluster using Ethernet connections, each node provides its own ports. File system protocols such as NFS, CIFS, HTTP, iSCSI, FTP and HDFS are supported.
The operating system of OneFS is BSD-based and supports both Windows and Linux/Unix semantics such as hard links, delete-on-close, atomic rename, ACLs and extended attributes.
To communicate with the client, a variety of protocols as mentioned above are supported. The I/O subsystem is split in two halves - the top half is the initiator and the bottom half is the participant. Any node that the client connects to acts as the initiator.
Cluster operations involve checking and maintaining the health of the cluster. These jobs are run through a job engine and have priorities associated with them. Jobs might include balancing free space in a cluster, scanning for viruses, reclaiming disk space, associating a path and its contents with a domain, rebuilding and re-protecting the file system, gathering file system analytics, scrubbing disks for media-level errors, and reverting an entire snapshot to disk.
The granularity of the jobs ensures that OneFS performs adequately and appropriately for every impact interval in the customer's environment.


Monday, July 22, 2013

Barry Wise on five progressive steps to database normalization
We start out with an example where we store a user's name, company, company address, and personal urls - say url1 and url2.
The zero form is when all of this is in a single table and no normalization has occurred.
The first normal form is achieved by
1) eliminating repeating groups in individual tables
2) creating a separate table for each set of data
3) identifying each set of related data with a primary key
This yields a table where the user information is repeated for each url, so the limitation on the number of url fields is solved.
The Second normal form is achieved by
1) Creating separate tables for a set of values that apply to multiple records
2) Relate these tables with a foreign key
Basically, we break the url values into a separate table so we can add more in the future
The third normal form is achieved by
1) eliminating fields that do not depend on the key
Company name and address have nothing to do with the user id, so they are broken off into their own table
The fourth and higher forms depend on data relationships involving one-to-one, one-to-many and many-to-many.
The fourth normal form is achieved by
1) In a many-to-many relationship, independent entities cannot be stored in the same table.
To get many users related to many urls, we define a url_relations table where the user id and url id are paired.
The next normal form is the Fifth normal form which suggests that
1) The original table must be reconstructed from the tables into which it has been broken down. This is a way to check that no new columns have been added.
As always, remember that denormalization has its benefits as well.
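A rough sketch of the tables after these steps (the names follow the example above and are illustrative):

CREATE TABLE companies (
    company_id int IDENTITY(1,1) PRIMARY KEY,
    company    varchar(100) NOT NULL,
    address    varchar(200) NOT NULL
);

CREATE TABLE users (
    user_id    int IDENTITY(1,1) PRIMARY KEY,
    name       varchar(100) NOT NULL,
    company_id int NOT NULL REFERENCES companies (company_id)  -- third normal form: company data moved out
);

CREATE TABLE urls (
    url_id int IDENTITY(1,1) PRIMARY KEY,
    url    varchar(255) NOT NULL
);

-- fourth normal form: many users related to many urls
CREATE TABLE url_relations (
    user_id int NOT NULL REFERENCES users (user_id),
    url_id  int NOT NULL REFERENCES urls (url_id),
    PRIMARY KEY (user_id, url_id)
);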
Also, Litt's tips additionally mention the following:
1) create a table for each list. More than likely every list will have additional information
2) create non-meaningful identifiers.
This is to make sure that business rule changes do not affect the primary identifier 

Sunday, July 21, 2013

Database normalization and denormalization rules are discussed here.
Codd described the objectives of normalization as follows:
1) To free the collection of relations from undesirable insertion, update and deletion dependencies.
2) To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the lifespan of application programs.
3) To make the relational model more informative to users.
4) To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
The undesired side effects of insert, update or delete may include the following:
- Multiple rows for the same information which are updated independently and get out of sync
- no rows such as when a new user is added but not assigned to anything
- inconsistent deletes, such as where a deletion in one table implies and requires a deletion in a completely different table
If the addition of new data requires changes to the existing structure, then such changes can cause regressions.
Tables, when normalized, correspond directly to real-world concepts and their relationships.
The normalized tables are suited for general querying across any set of tables.
Some common normalization terms:
Functional dependency : FD: X->Y where Y attribute has a functional dependency on a set of X attributes if and only if each X value is associated with one and only one Y value.
Trivial functional dependency: an FD of an attribute on a superset of itself.
Full functional dependency: when an attribute is FD on X but not on any proper subset of X.
Transitive dependency : X->Y and Y->Z implies  X->Z
Multivalued dependency : presence of some rows implies presence of some others
Join dependency - table can be recreated with joins
Superkey : A superkey is a combination of attributes that can be used to identify a database record.
Candidate key is a minimal superkey such as the Social Security Number
Non-prime attribute is one that does not occur in any candidate key. A prime attribute is one which occurs in some candidate key
A candidate key may be designated as the primary key but is otherwise usually not treated differently from the other candidate keys.
Normal forms include the following:
1) First normal form - the table has at least one candidate key and no repeating groups
2) Second - no non-prime attribute is FD on a proper subset of any candidate key
3) Third - every non-prime attribute is non-transitively dependent on every candidate key in the table (no transitive dependency is allowed)
4) EKNF - Every non-trivial FD is either the dependency of an elementary key or a dependency of a superkey
5) BCNF - Every non-trivial FD in the table is a dependency on a super key
6) Fourth - Every non-trivial multivalued dependency is a dependency on a super key
7) Fifth - Every non-trivial join dependency in the table is implied by the super keys of the table.
8) Domain key - Every constraint is a logical consequence of the table's domain constraints or key constraints.
9) Sixth - no non-trivial join dependencies
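To make the second and third forms concrete, here is a small sketch with illustrative tables: in OrderLines the candidate key is (OrderID, ProductID), and the dependency ProductID -> ProductName makes a non-prime attribute depend on a proper subset of that key, violating 2NF; moving it out restores the form.

-- Violates 2NF: ProductName depends only on ProductID,
-- a proper subset of the candidate key (OrderID, ProductID)
CREATE TABLE OrderLines_Unnormalized
(
    OrderID     int NOT NULL,
    ProductID   int NOT NULL,
    ProductName varchar(100) NOT NULL,
    Quantity    int NOT NULL,
    PRIMARY KEY (OrderID, ProductID)
);

-- Decomposition that satisfies 2NF and 3NF
CREATE TABLE Products
(
    ProductID   int PRIMARY KEY,
    ProductName varchar(100) NOT NULL
);

CREATE TABLE OrderLines
(
    OrderID   int NOT NULL,
    ProductID int NOT NULL REFERENCES Products (ProductID),
    Quantity  int NOT NULL,
    PRIMARY KEY (OrderID, ProductID)
);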
Denormalization - OLAP is denormalized as compared to OLTP. The redundant data is carefully controlled during ETL. Normalized snowflake schema becomes denormalized star schema. Non-first normal form is formed by nesting 1NF and the reverse is unnesting.
The KMP algorithm for string pattern matching proceeds like this:

#include <iostream>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>
using namespace std;
// Build the KMP failure ("next") table for the pattern
int* PreProcess(string pattern) {
 int patternLength = pattern.length();
 if (patternLength == 0) return 0;
    int * next = new int[patternLength + 1];
    if (next == 0) return 0;
    next[0] = -1;  // set up for loop below; unused by KMP

    int i = 0;
    while (i < patternLength) {
  next[i + 1] = next[i] + 1;
  while ( next[i+1] > 0 &&
    pattern[i] != pattern[next[i + 1] - 1])
   next[i + 1] = next[next[i + 1] - 1] + 1;
  i++;
 }
    return next;
}
// Search for pattern in text; the start offset of every match is appended to positions
void KMP(string pattern, string text, vector<int> *positions) {
    int patternLength = pattern.length();
    int textLength = text.length();
    int* next =  PreProcess(pattern);
 if (next == 0) return;
    int i = 0;
    int j = 0;
    while ( j < textLength )
 {
  while(true)
   if (text[j] == pattern[i]) //matches
   {
    i++;   // yes, move on to the next state
    if (i == patternLength)  // maybe that was the last state
    {
     // found a match;
     positions->push_back(j-(i-1));
     i = next[i];
    }
    break;
   }
   else if (i == 0) break; // no match in state j = 0, give up
   else i = next[i];
  j++;
 }
 delete[] next;  // free the failure table
}
// Map each character to a class code; unclassified characters are returned unchanged
char CharCode(char chr) {
    if ( 'a' <= chr && chr <= 'z' )
        return 'a';
    if ( 'A' <= chr && chr <= 'Z' )
        return 'A';
    if ( '0' <= chr && chr <= '9' )
        return '0';
    if ( chr == '.' || chr == '?' || chr == '!' || chr == ',' || chr == ':' || chr == ';' || chr == '-' )
        return '.';
    return chr;  // leave other characters (e.g., spaces) unchanged
}
string CodeText(string text) {
    string ret = text;
    for (int i = 0; i < text.length(); ++i) {
        ret[i] = CharCode(text[i]);
    }
    return ret;
}
void FancyOutput(string pattern, string code, vector<int> *positions) {
    cout << "Matched positions: ";
    for (int i = 0; i < positions->size()-1; ++i)
        cout << (*positions)[i] + 1 << ", ";
    cout << (*positions)[positions->size()-1] + 1 << "." << endl;

    std::cerr << "Text: " << code.c_str() << endl;
    for (int i = 0; i < positions->size(); ++i) {
        printf("%5d ", i+1);
        for (int j = 0; j < (*positions)[i]; ++j) cout << " ";
        cout << pattern.c_str() << endl;
    }
}
int main()
{
    string pattern, text, code;
 char pattext[1024];
 char txttext[1024];
    cout << "Input pattern:" << endl;
    cin.getline(pattext, 1024, '\n');
 pattern.assign(pattext);

    cout << "Input text:" << endl;
    cin.getline(txttext, 1024, '\n');
 text.assign(txttext);
    cout << endl;

    code = CodeText(text);
    cout << "Processed text:" << endl;
    cout << code.c_str() << endl;
    cout << endl;

    vector<int> *positions = new vector<int>();
    KMP(pattern, code, positions);

    if ( positions->size() )
        cout << "Y" << endl;
    else
        cout << "N" << endl;

    if ( positions->size() )
        FancyOutput(pattern, code, positions);
    return 0;
}
 

Saturday, July 20, 2013

Kimball architecture

Kimball architecture:
The Kimball architecture is based on dimensional modeling. The following rules of thumb are observed in this modeling:
1) Load detailed atomic data into dimensional structures, i.e. do not load summarized data into the dimensional tables.
2) Structure dimensional models around the business processes. These business processes have performance metrics that often translate to dimensions or facts. Combined metrics could also be additional dimensions.
3) Ensure that every fact table has an associated date dimensional table. The business processes  and performance metrics mentioned above are often associated with measurement events which are usually periodic with holiday indicators
4) Ensure that all facts in a single fact table are at the same grain or level of detail. The measurements within a fact table must be at the same grain or level of detail such as transactional, periodic snapshot or accumulating snapshot.
5) Resolve many to many relationships in fact tables. The events stored in a table are inherently associated with many places on many days. These foreign key fields should never be null. Sometimes dimensions can take on multiple values for a single measurement event, in which case, a many-many dual keyed bridge table is used in conjunction with the fact table.
6) Resolve many-to-one relationships in dimension tables. Hierarchical, fixed-depth many-to-one relationships are typically collapsed into a flattened dimension table. Do not normalize or snowflake an M:1 relationship; denormalize the dimensions instead.
7) Store report labels and filter domain values in dimension tables. The codes, decodes and descriptors used for labeling and querying should be captured in dimension tables. Again, such attributes should have no nulls.
8) Make certain that dimension tables use a surrogate key. Meaningless, sequentially assigned surrogate keys help make smaller fact tables, smaller indexes and improved performance.
9) Create conformed dimensions to integrate data across the enterprise. Conformed dimensions, also referred to as common, master, standard or reference dimensions, are defined once in the ETL system and deliver consistent descriptive attributes across dimensional models, supporting the ability to drill across and integrate data from multiple business processes.
10) Continuously balance requirements and realities to deliver a DW/BI solution that's accepted by business users and that supports their decision making. User requirements and underlying realities of the associated source data needs to be reconciled.
Dimensional modeling, project strategy, technical ETL/BI architectures and deployment/maintenance all require balancing acts.
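A minimal sketch of rules 3, 4 and 8 above (a date dimension, a single-grain fact table and surrogate keys); the retail sales tables are illustrative assumptions:

CREATE TABLE dbo.DimDate
(
    DateKey          int PRIMARY KEY,   -- surrogate key, e.g. 20130720
    FullDate         date NOT NULL,
    HolidayIndicator bit NOT NULL
);

CREATE TABLE dbo.DimProduct
(
    ProductKey  int IDENTITY(1,1) PRIMARY KEY,   -- meaningless, sequentially assigned surrogate key
    ProductName varchar(100) NOT NULL
);

-- Transactional grain: one row per product per sales transaction line
CREATE TABLE dbo.FactSales
(
    DateKey     int NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey  int NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    Quantity    int NOT NULL,
    SalesAmount money NOT NULL
);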

Friday, July 19, 2013

Corporate Information Factory Architecture

CIF operational systems topology:
The CIF architecture comprises the data warehouse and the operational data store as the core of the architecture. This core is surrounded by a metadata management plane where the operational systems reside. Operational systems can include external, distribution, product, account and customer operational systems. These operational systems add data into the data management core through transformations and integrations. Exploration warehouses, data marts and data delivery also extract information from the data management core. The decision support interface may use these data marts while the transaction interface may use the operational data store directly. CIF consumers acquire the information produced via data delivery, manipulate it in data marts, and assimilate it in their own environments. Outside this metadata management plane are the Information Services as well as Operations and Administration.
The Information Services comprise groups of items such as the library and toolbox as well as the workbench. Operations and Administration involve systems management, data acquisition management, service management and change management.
Producers are the first link in the information food chain. They synthesize the data into raw information and make it available for consumption across the enterprise.
Operational systems are the core of the day-to-day operations. The operational systems are organized by the product they support. However, businesses are more customer-oriented than product-oriented so that they can differentiate their offerings. CIF provides facilities to define how this data relates to a customer using rules that form the metadata.
Operational data usually stands alone, so it has not been integrated. CIF synthesizes, cleanses and integrates the data before it is usable. CIF also acts as a history store for the enterprise.
Next, integration and transformation consist of the processes to capture, integrate, transform, cleanse, re-engineer and load the source data into the data warehouse. Typically data is pulled from the operational systems as opposed to pushing. Configuration management and scheduling play a large role here. Typically these packages can be written with the knowledge of the source data and once written don't change too much. So they are good candidates for scheduled jobs.
Next, the data warehouse plays the big role in CIF. It is often described as a "subject-oriented, integrated, time-variant (temporal) and non-volatile collection of data." Sizes of up to terabytes of data are not uncommon for a data warehouse.
Next, data management extends the data warehouse with archival/restoration, partitioning, and movement of data based on triggering and aggregation. These are often in-house development tasks.
Finally, the consumers extract information from this collection of data.  The data-marts support these analytic requirements and they are made available via the decision support interface.
Also, the metadata enables metrics for measuring support.
Courtesy : Imhoff
OAuth

OAuth is a feature by which users can sign in with a third-party intermediary authorization. For example, if you were at an online printing store and wanted to print photos from an album on another website, the printing store can retrieve the photos from that website on your behalf and you can continue with your print order, provided you sign in to that website. It differs from a direct login in that you don't have to repeatedly provide a username and password for different websites. The sign-in on one website can be reused for others.
The way this works is by providing users access tokens based on third party referral. This token grants a user access to resources. The token may expire every few minutes and may need to be refreshed. So a refresh token can be requested. Admins to OAuth provider can see which users have registered from which clients. They can also revoke access to users and clients.
OAuth has four different workflows for granting access. These are:
Implicit Grant – such as when the mobile application follows a redirect to the client application.
Credentials Grant – such as when the user provides username and password for a token
Client Credentials Grant – such as when admin applications from secured kiosks provide context regardless of the user
Refresh Grant – a client can retrieve an access token by exchanging a previous refresh token.
An access token request consists of the following
- a grant type (authorization_code for this flow)
- code (the authorization code received earlier)
- redirect URI
- client id
while the response consists of
- a return type (data or redirect)
- access token
- token type
- expires in (seconds until expiry)
- refresh token
- scope
The authorization request consists of response type, client id, scope and state. The response consists of code and state.

Thursday, July 18, 2013


This post talks about some business intelligence practices :
OLTP and OLAP processing have different systems due to several differences:
The first difference is that online transaction processing is response-time critical and touches existing data. Online analytical processing, on the other hand, is based on data accumulated over time, such as for decision support.
The second difference is that the latter requires a schema with a star, multi-level star or snowflake-like design.
The third difference is that the latter requires very different data structures, such as bitmaps and conjunctive filters, while the former uses B+ trees (a small index sketch follows this list).
The fourth difference is that the databases in the latter require periodic fast loads of large volumes of data.
The fifth difference is that joins on the data are very costly to perform, so the views need to be captured and persisted till the next update.
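As a rough illustration of analytics-oriented storage in SQL Server, which does not expose bitmap indexes directly, a nonclustered columnstore index can serve the scan-heavy OLAP workload. This is only a sketch; the dbo.FactSales table and its columns are assumptions.

-- Sketch: a hypothetical fact table with a columnstore index for OLAP scans.
-- OLTP tables would instead rely on rowstore (B+ tree) indexes.
CREATE TABLE dbo.FactSales
(
    SaleID     BIGINT NOT NULL,
    DateKey    INT    NOT NULL,
    StoreKey   INT    NOT NULL,
    ProductKey INT    NOT NULL,
    Quantity   INT    NOT NULL,
    Amount     MONEY  NOT NULL
);
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_FactSales_ColumnStore
ON dbo.FactSales (DateKey, StoreKey, ProductKey, Quantity, Amount);
GO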

OLAP systems could be implemented with a CIF architecture or a Star-Kimball model: 
The CIF architecture uses entity relationships and database normalization rules to build a normalized data model in its data warehouse. In the Star-Kimball model, a dimensional model is used where the transactions are split into facts and dimensions and arranged in a star-like schema, or, if they are hierarchical, in a multi-level star-like schema.
The advantage of the dimensional model is that it is easy to understand and use. The joins are simpler, and users don't need to know the source of the data or the data structure but can work off materialized views (a query sketch follows below), although the operational systems have to handle the complex transformations to maintain the dimensions.
The advantage of the normalized model is that it holds a lot of information and is flexible to changing business needs.
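A minimal sketch of a dimensional (star) query over the hypothetical dbo.FactSales table from the earlier sketch, joined to assumed DimDate and DimStore dimension tables:

-- Sketch: a star-schema rollup. The dimension tables and their columns are
-- illustrative assumptions.
SELECT d.CalendarYear, s.Region, SUM(f.Amount) AS TotalSales
FROM dbo.FactSales AS f
INNER JOIN dbo.DimDate AS d ON f.DateKey = d.DateKey
INNER JOIN dbo.DimStore AS s ON f.StoreKey = s.StoreKey
GROUP BY d.CalendarYear, s.Region;

Users query the facts through simple joins like this without needing to know how the dimensions are maintained behind the scenes.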

A day-to-day task may involve preventing orphaned data. We prevent this by adding constraints and checks to the data. For example, in a table that maintains employees and their managers, the column that denotes the manager points back to another employee in the table. For the top of the organization, there is usually no manager; we represent this by making that person their own manager. This is useful because none of the employees can be orphaned when all the data in the manager column is checked for consistency. In general, we use constraints, checks, primary and foreign keys, and surrogate and natural keys to enforce integrity of the data. A sketch follows.
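A minimal sketch of that pattern, using a hypothetical dbo.Employee table whose names are purely illustrative:

-- Sketch: the manager column points back at another employee; the head of
-- the organization is stored as his or her own manager.
CREATE TABLE dbo.Employee
(
    EmployeeID INT           NOT NULL PRIMARY KEY,
    Name       NVARCHAR(100) NOT NULL,
    ManagerID  INT           NOT NULL,
    CONSTRAINT FK_Employee_Manager
        FOREIGN KEY (ManagerID) REFERENCES dbo.Employee (EmployeeID)
);
GO
-- Consistency check: with the constraint in place this should return no rows.
SELECT e.EmployeeID
FROM dbo.Employee AS e
LEFT JOIN dbo.Employee AS m ON e.ManagerID = m.EmployeeID
WHERE m.EmployeeID IS NULL;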

Wednesday, July 17, 2013

A quick look at some of the web technologies
web crawling: start with a list of seed URLs, visit them according to a policy, find more links and add them to the list.

Tuesday, July 16, 2013

Memory Manager

The memory manager components include a set of executive system services for allocating, deallocating, and managing virtual memory, a translation-not-valid and access fault trapper for hardware exceptions, and a set of kernel mode system thread routines as listed below:
The working set manager to handle global policies such as trimming, aging, and modified page writing.
The process/stack swapper that handles inswapping and outswapping.
The modified page writer that writes dirty pages in mapped files to disk.
The dereference segment thread, which is responsible for system cache and page file growth and shrinkage.
The zero page thread that zeros out pages on the free list.
The memory manager provides reserving and committing pages, locking memory, allocation granularity, shared memory and mapped files, protecting memory, copy on write, heap functions and address windowing extensions. A look at these in brief:
reserving and committing: reserved pages are not mapped to any storage. This is useful for reserving a large contiguous buffer. For example, when a thread is created, a stack is reserved. Committed pages are pages that, when accessed, ultimately translate to valid pages in physical memory. Committed pages are either private and not share-able, or mapped to a view of a section.
locking memory: Pages can be locked in memory in two ways:
1) Many pages can be requested by, say, a device driver and locked until the driver releases them.
2) User mode applications can lock pages in their process working set.
Allocation granularity: this defines the granularity at which virtual address space is reserved, so that the risks associated with assumptions about allocation alignment are reduced.
Shared memory and mapped files: each process maintains its private memory area in which to store private data, but program instructions and unmodified data pages can be shared. Shared memory is implemented via section objects.
Memory is protected, first, by keeping all system-wide data structures and memory pools accessible only to kernel-mode components, which user threads can't access.
Second, each process has a separate private address space which others can't access.
Third, in addition to implicit protection, some form of hardware-supported protection is also provided.
And finally, shared memory section objects are protected via the standard access-control lists.
Copy-on-write page protection is an optimization the memory manager uses to conserve physical memory: processes share a page until one of them writes to it, at which point the writer gets its own private copy.
A heap is a region of one or more pages of reserved address space that can be subdivided and allocated in smaller chunks by the heap manager.
Address Windowing Extensions (AWE) works by allocating the physical memory to be used, creating a region of virtual address space to act as a window, and mapping views of the physical memory into that window.
AWS implements REST API security with an API key and encrypted parameters.

Monday, July 15, 2013

WCF and WPF Fundamentals review

This post mentions the important contracts for each aspect of the communication
1. Service has three main parameters : Address Binding and Contract. Bindings can be of three different types: TCP/IP binding, http binding, net msmq binding.
2. MEX endpoint for metadata exchange. API Versioning using a standard workflow resulting in different actions taken.
3. Hosting can be of three types: IIS, a Windows service, and Windows Activation Service (WAS).
4. Use attributes such as ServiceContractAttribute, DataContractAttribute, and OperationContractAttribute. Define exception handling via FaultContract attributes.
5. Use appcmd.exe to determine which instance is hosting your code.
6. Use appcmd.exe to configure site, application, virtual directory or URL.
7. Use TransactionScope for reliability.
8. Use MSMQ for queued calls. Use poison messages and dead-letter queues to handle retrying.
9. Security can be none, transport or message.
10. Role-based security features of .NET can be reused for services.
11. Use PrincipalPermission and PrincipalPermissionMode.UseWindowsGroups for Windows role-based security.
12. Use certificates for encryption. Certificates have to be in the right store and not have expired.

ASP.Net page life cycle events
The following are the page events during life cycle
1. PreInit: check IsPostBack to determine whether the page is being processed for the first time.
2. Init: raised after all the controls have been initialized.
3. InitComplete: signals the end of initialization.
4. PreLoad: raised after the page loads view state for itself and all controls.
5. Load: the Page object calls the OnLoad method on itself and then recursively for each child control.
6. Control events: used to handle individual control events.
7. LoadComplete: raised at the end of the event-handling stage.
8. PreRender: raised after the Page object has created all the controls that are required to render the page.
9. PreRenderComplete: data binding occurs here.
10. SaveStateComplete: raised after view state and control state have been saved for the page and for all controls.
11. Render: the Page object calls this method on each control.
12. Unload: raised for each control and then for the page. In controls, use this event to do final cleanup for specific controls.

Reviewed from MSDN

Creating automated load test run report reminders

Load Test Run results can be populated in a database. Scripts to set up this database are available here.
In order to send automated reports on new results, we can register a trigger and a report-rendering function like so:

USE PerfResults;
GO
IF OBJECT_ID ('dbo.LoadTestRunCompleted','TR') IS NOT NULL
    DROP TRIGGER dbo.LoadTestRunCompleted;
GO
CREATE TRIGGER dbo.LoadTestRunCompleted
ON LoadTestRun
AFTER UPDATE
AS
-- Report only on rows affected by this update; skip if nothing completed.
DECLARE @RunID uniqueidentifier;
SELECT TOP 1 @RunID = RunID FROM inserted WHERE Outcome = 'Completed' ORDER BY EndTime DESC;
IF @RunID IS NULL RETURN;
DECLARE @msg nvarchar(max);
SELECT @msg = dbo.fn_DisplayLoadTestReport(@RunID);
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'Performance Reports Administrator',
    @recipients = 'user@xyz.com',
    @body_format='HTML',
    @body = @msg,
    @subject = 'Automated Performance Reports' ;
GO


-- ================================================
-- Template generated from Template Explorer using:
-- Create Scalar Function fn_DisplayLoadReport.SQL
--
-- Use the Specify Values for Template Parameters
-- command (Ctrl-Shift-M) to fill in the parameter
-- values below.
--
-- This block of comments will not be included in
-- the definition of the function.
-- ================================================
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author: Ravishankar Rajamani
-- Create date: 15th July, 2013
-- Description: function to render an html for summary display
-- =============================================
CREATE FUNCTION dbo.fn_DisplayLoadTestReport
(
-- Add the parameters for the function here
@RunId uniqueidentifier
)
RETURNS nvarchar(MAX)
AS
BEGIN

    DECLARE @name nvarchar(255)
DECLARE @description nvarchar(255)
DECLARE @starttime datetime
DECLARE @endtime datetime
DECLARE @warmuptime int
DECLARE @runduration int
DECLARE @controller nvarchar(255)
DECLARE @runsettings nvarchar(255)

SELECT @name = LoadTestName,
       @description = ISNULL(Description, ''),      -- guard against NULLs so the string concatenation below doesn't blank the report
       @starttime = StartTime,
       @endtime = EndTime,
       @warmuptime = (WarmupTime/60),
       @runduration = (RunDuration/60),
       @controller = ISNULL(ControllerName, ''),
       @runsettings = ISNULL(RunSettingUsed, '')
FROM LoadTestRun
WHERE RunId = @RunId

DECLARE @ret nvarchar(max)
SET @ret = N'<head>
<style type="text/css">
#fullwidth
{
width:100%;
clear:both
}
#leftcol
{
width:49%;
float:left;
clear:all;
position:relative;
}
#rightcol
{
width:49%;
float:right;
position:relative;
}
.pagetitle
{
font-family: Tahoma, Verdana, Arial, Helvetica, sans-serif;
font-size: .8em;
font-weight: bold;
}
.bodytext
{
font-family: Tahoma, Verdana, Arial, Helvetica, sans-serif;
font-size: .7em;
font-weight: normal;
}
.sectiontitle
{
font-family: Tahoma, Verdana, Arial, Helvetica, sans-serif;
font-size: .7em;
font-weight: bold;
text-align: left;
padding-top: 1.5em;
}
.content table
{
width: 100%;
border-top: 1px solid #D4D0C8;
border-right: 1px solid #D4D0C8;
}
.content td
{
font-family: Tahoma, Verdana, Arial, Helvetica, sans-serif;
font-size: .7em;
text-align: left;
border-bottom: 1px solid #D4D0C8;
border-left: 1px solid #D4D0C8;
padding-top: .2em;
padding-left: .3em;
}
.content th
{
font-family: Tahoma, Verdana, Arial, Helvetica, sans-serif;
font-size: .7em;
font-weight: normal;
text-align: left;
border-bottom: 1px solid #D4D0C8;
border-left: 1px solid #D4D0C8;
background-color: #ECE9D8;
padding-top: .2em;
padding-left: .4em;
}
</style>
<script language="JavaScript">
function clickHandler()
{
var targetId, srcElement, targetElement;
srcElement = window.event.srcElement;
if (srcElement.className == "Expandable")
{
targetId = srcElement.id + "div";
targetElement = document.all(targetId);
if (targetElement.style.display == "none")
{
targetElement.style.display = "";
srcElement.innerText = "6";
} else {
targetElement.style.display = "none";
srcElement.innerText = "4";
}}}
document.onclick = clickHandler
</script>
</head>
<body bgcolor="#ffffff">
<div class="pagetitle">Load Test Summary</div>
<div id="leftcol">
<div class="sectiontitle">Test Run Information</div>
<div class="content">
<table cellspacing="0" id="TestRunInformation">
<tr>
<td nowrap width="50%">Load test name</td>
<td nowrap width="50%">' + @name + '</td>
</tr>
<tr>
<td nowrap width="50%">Description</td>
<td nowrap width="50%">' + @description + '&nbsp;</td>
</tr>
<tr>
<td nowrap width="50%">Start time</td>
<td nowrap width="50%">' + CONVERT(nvarchar, @starttime) + '</td>
</tr>
<tr>
<td nowrap width="50%">End time</td>
<td nowrap width="50%">' + CONVERT(nvarchar, @starttime) + '</td>
</tr>
<tr>
<td nowrap width="50%">Warm-up duration</td>
<td nowrap width="50%">' + CONVERT(nvarchar, @warmuptime) + '</td>
</tr>
<tr>
<td nowrap width="50%">Duration</td>
<td nowrap width="50%">' + CONVERT(nvarchar, @runduration) + '</td>
</tr>
<tr>
<td nowrap width="50%">Controller</td>
<td nowrap width="50%">' + @controller + '</td>
</tr>
<tr>
<td nowrap width="50%">Run settings used</td>
<td nowrap width="50%">' + @runsettings + '</td>
</tr>
</table>
</div>
</div>
';

RETURN @ret;


END
GO

Remember to set the content-type of the message as
Content-Type: multipart/mixed
so that it can be displayed  as desired.

Sunday, July 14, 2013

Crash course in file systems

File System Fundamentals :
File attributes include name, type, location, size, protection, and time, date and user identification.
Operations supported are creating a file, writing a file, reading a file, repositioning within a file, deleting a file, and  truncating a file.
Data structures include the following
1) Two levels of internal tables: there is a per-process table of all the files that each process has opened. An entry points to the location inside a file where data is to be read or written; this table is arranged by file handle and has the name, permissions, access dates and a pointer to the disk block. The other table is a system-wide table with the open count, the file pointer, and the disk location of the file.
Sections of a file can be locked for multi-process access, and on virtual memory systems sections of a file can even be mapped into memory. The latter is called memory mapping and it enables multiple processes to share the data: each sharing process's virtual memory map points to the same page of physical memory, the page that holds a copy of the disk block.
File structure is dependent on the file type. Internal file structure is operating-system dependent. Disk access is done in units of blocks. Since logical records vary in size, several of them are packed into a single physical block, for example at byte granularity. The logical record size, the physical block size and the packing technique determine how many logical records fit in each physical block. There are three major allocation methods: contiguous, linked and indexed. Internal fragmentation is a common occurrence because of the wasted bytes within a block.
Access methods are either sequential or direct. The block number is relative to the beginning of the file. The use of relative block numbers helps the program determine where the file should be placed and helps prevent the user from accessing portions of the file system that may not be part of his file.
The file system is broken into partitions. Each disk on the system contains at least one partition. Partitions are like separate devices or virtual disks. Each partition contains information about the files within it, referred to as the directory structure. The directory can be viewed as a symbol table that translates file names into their directory entries. Directories are represented as a tree structure, and each user has a current directory. A pure tree structure prohibits the sharing of files or directories. An acyclic graph allows directories to have shared sub-directories and files. Sharing means there is one actual file, and changes made by one user are visible to the others. Shared files can be implemented via a symbolic link, which is resolved via the path name. Garbage collection may be necessary to avoid cycles.
Protection involves access lists and groups. Consistency is maintained via open and close operation wrapping.
The file system is layered in the following manner:
1) application programs, 2) logical file system, 3) file-organization module, 4) basic file system, 5) I/O control and 6) devices. The last layer is the hardware. The I/O control layer consists of device drivers and interrupt handlers; the basic file system issues generic commands to the appropriate device driver. The file-organization module knows about files and their logical blocks. The logical file system uses the directory structure to inform the file-organization module. The application program is responsible for creating and deleting files.
Syllabus :
Algorithms and Data Structures
Programming questions
Design patterns
.Net Technology Stack
ASP.Net
WCF
WPF
SQL Server
Windows operating system
 

Saturday, July 13, 2013

Today mom was 8 - 4. When mom was at work, I got to watch two shows just like mom told Patti and Patti told me. Dad gave me some cursive writing so I did that. I played with some Legos. After mom came back, I rode bikes for some time. First Kailey rode bikes with me then she went inside. In the night, we all played with chutes and ladders.
 
The client logon process
We look at how DCs and clients communicate during the logon process, and in addition at how an RODC gets passwords from a writeable domain controller and caches them locally. Each RODC has a krbtgt account. The krbtgt accounts hold the credentials used by DCs to encrypt Kerberos messages. The RODC keeps its own krbtgt password locally; writeable DCs, however, keep the passwords for all krbtgt accounts. Consider an example where a user logs in to his PC and there are two DCs: one read-write (RWDC) and one read-only (RODC). First, the user's machine contacts the RODC and provides a Kerberos authentication service request (KRB_AS_REQ) packet. When the RODC receives this KRB_AS_REQ, it checks its local database to see if it already has the machine's password. If it does not, it forwards the KRB_AS_REQ it received to the RWDC. The RWDC generates a KRB_AS_REP, the Kerberos authentication service reply, and forwards it back to the RODC, which relays it to the user's machine. At this point, the user's machine has a Kerberos ticket-granting ticket (TGT) signed with the domain krbtgt account. Two additional steps are performed next: the RODC requests the RWDC to have the PC's credentials replicated to it, and the RWDC verifies that the replication policy permits this and replicates the password.
Now let's look at the process of user authentication. The following are the steps taken when the user attempts to login to his machine.
The user presents his credentials to his machine, which sends a KRB_AS_REQ to the RODC. When the RODC receives this request, it checks its local database to see if the user's password is cached; otherwise it follows the same steps it took for the machine account. However, the user logon is still not complete. Before the user can use his workstation, he must obtain a Kerberos service ticket (TGS) for his PC. The user's machine sends a KRB_TGS_REQ including the user's TGT from the previous exchange. The RODC is unable to decrypt it since it is encrypted with the domain krbtgt account, so it transmits the KRB_TGS_REQ to the RWDC, which replies with the corresponding response. The RODC receives a valid KRB_TGS_REP. At this point, instead of forwarding it to the user's machine, the RODC may decide to send an error indicating the ticket has expired. Since the RODC now holds cached credentials for the user, it is able to construct a new KRB_AS_REP locally, and thus a new TGT for the user, encrypt it with its local krbtgt account and transmit it to the user's machine. The user's machine sends a new TGS request to the RODC including the new TGT, which the RODC is able to decrypt, and the RODC constructs a TGS response permitting the user to use his PC. After completing these steps, the user is logged on.

reports and xslt and load test runs.

XSLT transformation enables test results to be displayed. This is how we prepare the data for display or for mailing out to subscribers; the resulting XHTML is easy to share. First we get the results from a .trx file or from a stored procedure execution in a database. This gives us the data in the form of XML or a dataset. Then we create the XSLT with the summary we would like to see. Note that Visual Studio has a default summary view and results that you can open from a load test run using the "open and manage results" button on the toolbar. This already converts the summary and the results to HTML that can be cut and pasted into any application using object linking and embedding technology. The views we create with XSLT merely define a customized view, using headings, rows and columns to summarize the data.
The results from a .trx file or a stored procedure execution need not be wrapped in HTML. They can be converted to an XML or Excel file with a load test plugin. The load test plugin simply has an event handler invoked at the end of the relevant test execution and can be written in C#.
Likewise, the XSLT transform and Database Mail can be driven from a SQL stored procedure, so that newly added test runs can trigger Database Mail. This also scales well to enterprise load, where the runs and the results could be stored in the order of gigabytes. It is easier to design the HTML and transforms using tools such as Report Manager and Word prior to moving them inside a stored procedure. A sketch of extracting the run data as XML for such a transform follows.
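As an illustration, the run data can be pulled out as XML from within a stored procedure and then handed to an XSLT stylesheet (applied by a client application, SSRS or a CLR routine). This is only a sketch: LoadTestRun and its columns come from the load test results store used above, while the element names are assumptions.

-- Sketch: extract the most recent completed run as XML for an XSLT transform.
DECLARE @runXml xml;
SET @runXml =
(
    SELECT TOP 1
        LoadTestName AS [Name],
        Description  AS [Description],
        StartTime    AS [StartTime],
        EndTime      AS [EndTime],
        RunDuration  AS [DurationSeconds]
    FROM LoadTestRun
    WHERE Outcome = 'Completed'
    ORDER BY EndTime DESC
    FOR XML PATH('Run'), ROOT('LoadTestRuns'), TYPE
);
SELECT @runXml;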
Reports can be generated for all kinds of runs. For performance testing, these runs are usually load tests, stress tests, and capacity tests. The load test determines the throughput required to support the anticipated peak production load, the adequacy of a hardware environment and the adequacy of a load balancer. It also detects functionality errors under load and collects data for scalability and capacity planning. However, it is not designed to measure speed of response. The stress test determines if data can be corrupted by overstressing the system, provides an estimate of how far beyond the target load an application can go before causing failures and errors in addition to slowness, allows establishing application-monitoring triggers to warn of impending failures, and helps to plan what kinds of failures are most valuable to plan for. The capacity test provides information about how the workload can be handled to meet the business requirements, provides actual data that capacity planners can use to validate or enhance their models or predictions, and determines current usage and capacity of the existing system as well as trends to aid in capacity planning. Note that in practice the most frequently used test is the smoke test, which is the initial run of the performance test to see if your application can perform its operations under normal load. For all these runs, report generation and subscription is somewhat similar.

Friday, July 12, 2013

Publishing load test results.

In Visual Studio, when we open a load test, we see an option to "open and manage results" in the toolbar. This brings up a dialog box which lists the results associated with the load test. Each of these results can be selected and opened. Opening a result brings up the summary view by default. This view can then be cut and pasted into the body of an e-mail for reporting. Alternatively, it can be exported to a file on a file share.

SQL Server Reporting Services' Report Manager provides great functionality to design custom reports. These reports can draw data using SQL queries. They can also be subscribed to, with delivery by e-mail.

Team Foundation Server enables automation of a performance test cycle.  The steps involved in a performance test cycle are as follows:
1. Understand the process and compliance criteria
2. Understand the system and the project plan
3. Identify performance acceptance criteria
4. Plan performance-testing activities
5. Design tests
6. Configure the test environment
7. Implement the test design
8. Execute the work items
9. Report results and archive data
10. Modify the plan and gain approvals for modifications
11. Return to activity 5
12. Prepare the final report.

The first step involves getting buy-in on the performance testing process up front and complying with standards, if any. The second step is to determine the use-case scenarios and their priority. The third step is to determine the requirements and goals for performance testing as agreed with stakeholders and informed by project documentation, usability studies and competitive analysis; the goals should be articulated in a measurable way and recorded somewhere. The fourth step is to add the performance-testing activities as work items to the project plan and schedule them accordingly; this planning is required to line up the activities ahead of time. Designing performance tests involves identifying usage scenarios and user variances and generating test data. Tests are designed based on real operations and data to produce more credible results and enhance the value of performance testing, and they include component-level testing. Next, configure the environments using load-generation and application-monitoring tools and isolated network environments, and ensure compatibility, all of which takes time. Test designs are then implemented to simulate a single user or virtual users. Next, work items are executed in the order of their priority, and their results are evaluated, recorded and communicated while the test plan is adapted. Results are then reported and archived; even runs that may not be usable are sometimes archived with appropriate labels. After each testing phase, it is important to review the performance test plan: mark the test plan items that have been completed and evaluated and submit them for approval. Repeat the iterations. Finally, prepare a report to be submitted to the relevant stakeholders for acceptance.

Thursday, July 11, 2013

Mailing data and database objects best practices:
Data can be attached to mail in several different file formats, including Excel, RTF, CSV, text and HTML.
Data can also be included in the body of the mail.
You would need access to MAPI or SMTP to send mails
When sending a data access page, share the database so that users can interact with the page.
Create the data access page using UNC paths so that they are not mapped to local drives
Store the database and the page on the same server.
Publish from a trusted intranet security zone
Send a pointer instead of a copy of the HTML source code
For intranet users, UNC and domains alleviate security considerations while the same can be used to demand permissions for external users.
Always send the page to yourself and view the code before mailing others.
System-generated mails for periodic activities or alerts are common practice in most workplaces. There are several layers from which such mails can be generated. SQL Server has the legacy xp_sendmail extended procedure and the newer Database Mail procedure sp_send_dbmail, which can send messages through an SMTP server; Database Mail needs to be enabled via server configuration (the 'Database Mail XPs' option). These procedures can be invoked directly from stored procedures, which are very close to the data.
SSRS is another layer from which well-formed reports can be mailed out; these are designed and delivered from SSRS with e-mail subscriptions. TFS or source control is another place that can send mail. Automated performance reports can also be sent out this way; a Database Mail sketch follows.
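A minimal sketch of mailing query results as an attachment with Database Mail; the profile name reuses the one from the load test trigger earlier, while the recipient, subject and query are placeholders.

-- Sketch: e-mail query results as a comma-separated attachment.
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'Performance Reports Administrator',
    @recipients = 'user@xyz.com',
    @subject = 'Weekly load test summary',
    @body = 'The summary of completed load test runs is attached.',
    @query = 'SELECT LoadTestName, StartTime, EndTime, Outcome
              FROM PerfResults.dbo.LoadTestRun
              WHERE Outcome = ''Completed''',
    @attach_query_result_as_file = 1,
    @query_attachment_filename = 'LoadTestRuns.csv',
    @query_result_separator = ',';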

Tuesday, July 9, 2013

REST API : resource versus API throttling

REST APIs should have throttling cost associated with the resources consumed rather than with the API calls themselves, because the number of calls alone says little about load. If response sizes can be reduced with an inline filter, that directly translates to savings for both the sender and the receiver.

Some of the performance degradations occur due to :
premature optimization
guessing
caching everything
fighting the framework

Performance can be improved with
1) finding the target baseline
2) knowing the current state
3) profiling to find bottlenecks
4) removing bottlenecks
5) repeating the above

Request distribution per hour, most requested endpoints, HTTP statuses returned, request durations, failed requests, etc. all help with the analysis. Server logs can carry all this information, and tools to parse the logs for it could help. Process id and memory usage can be added directly to the server logs. Server-side and client-side performance metrics help to isolate issues. A sketch of this kind of analysis over logs loaded into a table follows.
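As an illustration, if the server logs were bulk-loaded into a table, a couple of aggregate queries give these distributions. This is a sketch only; dbo.RequestLog and its columns are assumptions.

-- Sketch: request distribution per hour and status code breakdown over a
-- hypothetical dbo.RequestLog table populated from the server logs.
SELECT DATEPART(hour, RequestTime) AS HourOfDay,
       COUNT(*)                    AS Requests,
       AVG(1.0 * DurationMs)       AS AvgDurationMs
FROM dbo.RequestLog
GROUP BY DATEPART(hour, RequestTime)
ORDER BY HourOfDay;

SELECT HttpStatus, COUNT(*) AS Requests
FROM dbo.RequestLog
GROUP BY HttpStatus
ORDER BY Requests DESC;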

Benchmarks are available for performance testing of APIs. The CDN should not matter in performance measurements; use a static file return as the baseline. Separate out I/O-bound and CPU-bound processes.

Courtesy : rails performance best practices