Cluster computing

Tuesday, June 18, 2013

Some of the common techniques in finding patterns with data mining include Association rule mining which consists of finding frequent item sets from which strong association rules are generated.Associations can be analyzed to uncover correlation rules which give statistical information.
Frequent pattern mining can be categorized based on completeness, levels and dimensions of data, types of values, kinds of rules, patterns. Frequent pattern mining can be classified into frequent itemset mining, sequential pattern mining, structured pattern mining etc. Algorithms for frequent itemset mining can be of three types :
1) apriori-like algorithms: The Apriori algorithm mines frequent item sets for Boolean association rules. Based on the property that the non-empty subsets of a frequent itemset must also be frequent, the kth iteration, it forms frequent k-itemset candidates based on the frequent (k-1) itemsets.
2) frequent pattern based algorithms: FP-Growth does not generate any candidates but constructs a highly compact data structure (FP-tree) and uses fragment growth.
3) algorithms that use vertical data format transform a given data set of transactions in the horizontal data format of TID-itemset into the vertical data format of item-TID_set. Then it mines using the Apriori property and additional optimization techniques such as diffset.
These same methods can be extended for the mining of closed frequent itemsets from which the set of frequent itemsets can easily be derived. These include additional techniques such as item merging, sub-itemset pruning and item skipping.
These techniques can be extended to multilevel association rules and multidimensional association rules.
Techniques for mining multidimensional association rules can be categorized according to their treatment of quantitative attributes. For example, they can be discretized statically based on predefined concept hierarchies. Quantitative association rules can be mined where quantitative attributes are discretized dynamically based on binning/clustering.
Association rules should be augmented with a correlation measure such as lift, all_confidence and cosine.
Constraint-based rule mining refines the search for rules by providing meta rules and constraints which can be antimonotonic, monotonic, succint, convertible and inconvertible. Association rules should not be used for prediction without training.

book review

Data Mining concepts and techniques Jiawei Han and Micheline Kamber

This book mentions the data preprocessing steps as descriptive data summarization, data cleaning, data integration and transformation, data reduction, data discretization and automatic generation of concept hierarchies.
Descriptive data summarization provides the analytical foundation for data pre-processing using statistical measures such as mean, weighted mean, median and mode for center, range, quartiles, interquartile range, variance, and standard deviation for dispersion, histograms, boxplots, quantile plots, scatter plots, and scatter plot matrices for visual representation.
Data cleaning routines fill in missing values, smooth out noise, identifying outliers and inconsistencies in the data.
Data integration combines data from multiple sources into a coherent data store by smoothing out data conflicts, semantic heterogeneity and contribute towards data integration.
Data transformation routines convert the data into appropriate forms for mining involving steps such as normalization.
Data reduction techniques such as data cube aggregation, dimensionality reduction, subset selection and discretization can be used to obtain a reduced representation of data.
Data discretization can involve techniques such as binning, histogram analysis, entropy based discretization, cluster analysis and intuitive partitioning. Data processing methods continue to evolve due to the size and complexity of the problem.

Data Mining is the discovery of knowledge based on finding hidden patterns and associations, constructing analytical models, perform classification and prediction and presenting the mining results using visualization tools. Data Warehousing helps with providing summarized data. A data warehouse is defined in this book as a subject-oriented, integrated, time-variant, and non-volatile collection of data organized for decision making. A multidimensional data model is used to design the data warehouse and consists of a data cube with a large set of facts or measures and a number of dimensions. A data cube consists of a lattice of cuboids.Concept hierarchies organize the values into levels of abstraction.
Data Mining can use OLAP queries as well as On-line analytical mining (OLAM)

Monday, June 17, 2013

Q: Find the second largest number in an integer array
A: int GetSecondLargest(int[] array, uint length)
{
if (array == null || length < 1 ) throw new Exception();
int max = array[0];
bool found = false;
int secondmax = 0;
for (int i =1; i < length; i++)
{
if (found == false)
{
if (array[i] < max)
{
found = true;
secondmax = array[i];
}
}
if (found != false && array[i] > secondmax)
{
if (array[i] < max)
{
found = true;
secondmax = array[i];
}
}
if (array[i] > max)
{
found = true;
secondmax = max;
max = array[i];
}
}
if ( found == false) throw new Exception();
if ( secondmax == max) throw new Exception();
return secondmax;
}

Question: Given an arbitrary two dimensional matrix of integers where the elements are sorted in increasing order along rows and columns, find a number in the matrix closest to and less than or equal to a given number.
Answer:
uint GetElement(int [,] matrix, uint startrow, uint startcol, uint endrow, uint endcol, uint number)
{
while(startrow < endrow && startcol < endCol)
{
uint midrow = (startrow + endrow) / 2 ;
uint midcol = (startcol + endcol) / 2;

if (matrix[midrow, midcol] < number))
{
startrow = midrow;
startcol = midcol;
}
else
{
endrow = midrow;
endcol = midcol;
}
}
if (startrow == endrow && startcol == endcol)
{
return matrix[startrow, startcol] < number ? matrix[startrow, startcol] : 0;
}
if ((startcol == endcol && startrow == endrow - 1) || (startrow == endrow && startcol == endcol - 1) )
{
if (matrix[endrow, endcol] < number) return matrix[endrow, endcol];
if (matrix[startrow, startcol] < number) return matrix [ startrow, startcol];
return 0;
}
if (matrix[startrow, startcol] < number)
{
startrow = endrow;
startcol = endcol;
}
uint topright = startcol - 1 > 0 && startrow - 1 > 0 ? GetElement(matrix, 0, startcol, startrow - 1, endcol, number) : 0;
uint bottomleft = startrow + 1 <= endrow && startcol - 1 > 0 ? GetElement(matrix, startrow + 1, 0, endrow, startcol - 1,
number) : 0;
if (topright < bottomleft)
return bottomleft;
else
return topright;
}

Sunday, June 16, 2013

Microsoft Exchange Architecture continued

Microsoft Exchange Architecture
We will look at the Exchange Server Architecture in detail here:
1) Unified Messaging: The unified messaging server role enables unified messaging for an Exchange Server organization.
Features included are :
a) Outlook voice access : user logs onto the mailbox and accesses it via a voice user interface. An associated UM server checks Active Directory for addresses and access information.
b) Call answering : UM Server plays the individual's greeting and captures voice mail message which is then sent to Hub transport server for delivery
c) Play on Phone: User receives a voice mail message and selects play. Outlook uses https to communicate with the UM web services to fetch the appropriate message.
d) One Inbox: Unified messaging puts all the voice, video and audio content in the same mailbox for the users to pull it from different devices.
e) The Active Directory UM objects has a dial plan comprising of an auto-attendant and user dictated Mailbox policies. A UM IP gateway communicates with the external PBX switchboard.

2) The Mailbox server role: includes the following features:
a) Resource Booking Attendant: This enables conference room booking, enforces duration, who can book, delegates for approval and provides conflict information. The policies and resources for auto-accept are booked using OWA or Exchange management shell.
b) Generate Offline Address Book: OAB files are generated, compressed and placed on a local share. Administrators can configure how the address books are distributed.
c) Outlook Client Connection : Clients inside the mailbox server can access the mailbox server directly to send and retrieve messages.
d) Exchange administration : Administrator-only computer retrieves active directory topology information from the corresponding AD service.
e) Mailbox and Public Folder Databases: Private user database as well as public folder information are stored in Exchange databases as logical containers
f) Exchange Search : generates fulltext index and indexes new message and attachments automatically
g) Calendar attendant : automatically puts new meetings as tentative appointments and deletes out of date meeting requests.
i) Messaging records management: This is a managed folder.

3) Client Access Server Role: includes the following features:
1) Exchange web services : These web services comprise of the autodiscover service, exchange data service, availability service, synchronization service, notification service, and managed folder service. Clients using EWS communicate over the https and SOAP / REST and these services are hosted in the IIS. The autodiscover service lets the clients find the exchange server via AD or DNS. The Exchange data service provides read or write access to mailbox and public folder mail, contact, tasks and calendar data. The synchronization and notification services alerts changes to mailbox and synchronizes public folders. The availability service retrieves free/busy information and meeting time suggestions.
2) Exchange Active Sync : Used mainly by the handheld devices, this service is used to push messages from the intranet to the devices using cellular / wireless network and SSL. Remote device wipe can also be initiated.
3) Outlook web access : A variety of authentication schemes together with light and full feature clients enable mailbox to be accessed via browser.
4) CAS Proxy and redirection. Proxy enables another CAS server to be made available when one is not. Redirection informs the user the correct OWA url when user tries to access another.
5) Intranet features include Sharepoint, file share integration, conversion of pdf and office attachments to HTML, single sign on for mailbox server access, mailbox server access and most OWA configuration settings are stored in Active Directory.

4) High Availability includes features such as
1) no replication : Failover cluster is built using shared storage array and Microsoft Cluster service.
2) replication to a local disk set : partitions data for performance and recovery and adds data redundancy without service redundancy
3) replication to a standby server : Source server can be standalone. Target must be stand-alone and passive. Logs are copied, verified and replayed. Database is copied. There is a built-in delay for log replay activity.
4) replication within a cluster : Failover cluster built using Microsoft Windows cluster service with log replay and copying database. Hub Transport Server acts as the file share witness.

5) Hub Transport Server includes features such as :
1) directly delivers message between the source server and the target server by reducing the hops. Internally, it has a pickup/replay directory, a submission queue, a categorizer that takes the messages from the submission queue and processes the message by resolving recipients, routing, converting content, processing routed messages, and message packaging before delivering it to the delivery queue.
Courtesy : MSDN

Microsoft Exchange Server Architecture

Microsoft Exchange Servers are deployed to Active Directory Sites except for the perimeter role. The AD must have at least one domain controller in each domain and at least one global catalog server. Generally, there should be 4 : 1 ratio of exchange processors to global catalog server processors. There are several different roles included with the Exchange Server.
These are :
1) Edge Transport Server Role:
The Edge Transport Server is deployed to the perimeter beyond which lies the internet. The Exchange hosted services are hosted in the Internet.The Edge Transport server runs in the perimeter network and provides message hygiene and security over untrusted networks. The EdgeSync service pushes configuration information to the Edge server using secure LDAP. The Edge server sends messages using SMTP TLS. The Edge transport filters the messages using antispam and antivirus filters which comprise of connection filter, address rewriting agent, edge rule agent, sender ID agent, recipient/sender filter, content writer, attachment filter and virus scanning.
2) Hub Transport Server Role:
The Hub Transport Server role handles all e-mail flow inside the organization, applies transport rules, applies journaling policies and delivers messages to a recepients mailbox.
3) Client Access Server Role: The client access server role supports the WebAccess and ActiveSync client applications, and the POP3 and IMAP4 protocols.
4) Mailbox Server Role :
The mailbox server role hosts mailbox and public folder databases. It also provides advanced scheduling services for Microsoft Office Outlook users, generates the offline address book, provides services that calculate e-mail address policies and address list for recepients.
5) Unified Messaging Server Role:
The unified messaging server role enables combines voice, fax, and e-mail messaging into a single infrastructure. These are all delivered to the user's inbox so that they can be accessed from a variety of devices. All unified messaging servers are placed in a central location and IP gateways are enabled in each branch office.

Management and Monitoring components : Exchange Management and monitoring is made easy with tools such as management shell and console. This helps answer questions like are all Exchange Services running, are all databases active, do disks have enough space, can clients connect with reasonable performance, are the servers performing efficiently and reliably, are they configured correctly and are they secure ?

High Availability : The Microsoft Exchange Server has a component for high-availability. It includes built in features that can provide quick recovery, high availability and site resiliency for mailbox servers. Availability is improved with no replication such as with single copy cluster (SCC) or shared storage cluster feature, replication within a cluster using cluster continuous replication feature, replication to a standby server using standby continuous replication and replication to a local disk set using local continuous replication.

Saturday, June 15, 2013

Finding the closest pair of points
Two points are closest based on their euclidean distance which is computed as sqrt( (x1 - x2) ^ 2 + (y1 - y2) ^ 2 ). The naive approach would be to compare two points at a time and exhaust the nC2 choices
Instead we can use a divide and conquer algorithm whose running time is T(n)=2T(n/2)+O(n)
Let us take a subset P of points in Q and have two arrays with each holding all the points in P. Array X is sorted with monotonically increasing x coordinates and Y is sorted with monotonically increasing y coordinates. We keep the arrays presorted.
First the point set P is divided into two set with a vertical line such that there is roughly half the points in each. These are stored as two subarrays within X. Similarly the Y is stored as two sub-arrays which contains the points sorted by monotonically increasing y-coordinate.
Next, we find the closest pair of points recursively first in the left subarray and then in the right subarray. The inputs to the first call are the subset PL and arrays XL and YL and this is repeated for the right. The minimum of the closest distances between the left and the right are chosen.
Combine The closest pair is either the pair with the distance delta found by one of the recursive calls or it is a pair of points with one point in PL and another in PR. To find such points, we create an array Y' with only those points in Y that are not within the 2-delta wide distance from the line dividing the points. Delta is the minimum distance found from the earlier step.For each point p in Y', try to find points within Y' that are less than delta apart while keeping track of the smallest delta' found in Y'. It has to compare with only seven others in the delta times 2 delta rectangle. If delta' < delta we have found points that exist in this narrow band and are the results of the search otherwise the recursive calls gives the points with the closest distance.