Sunday, June 30, 2013

How are replication conflicts resolved?
There can be conflicts during a replication cycle. For example, server A creates an object with a particular name at roughly the same time that server B creates an object with the same name. The conflict reconciliation process kicks in at the next replication cycle. The server compares the version numbers of the updates and the higher version wins the conflict. If the version numbers are the same, the attribute that was changed at the later time wins. If the timestamps are also equal, the originating server GUIDs are compared and the higher GUID wins.
If an object is moved to a parent that has since been deleted, that object is placed in the LostAndFound container.
If two objects with the same RDN are created, then after the usual resolution process mentioned above, the name of one of the conflicting objects is modified with a known unique value. Between three servers, when a server receives updates from the other two, the conflict resolution is worked out at this server and repeated at the other two when they request the updates from this server.
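A minimal sketch of this precedence ordering (version, then timestamp, then originating-server GUID); the Update structure and field names below are illustrative, not the actual directory metadata:

from dataclasses import dataclass

@dataclass
class Update:
    version: int       # per-attribute version number
    timestamp: float   # time of the originating write
    server_guid: str   # GUID of the originating server

def winner(a: Update, b: Update) -> Update:
    # Higher version wins; ties fall through to timestamp, then GUID.
    return max(a, b, key=lambda u: (u.version, u.timestamp, u.server_guid))

# Example: same version and timestamp, so the higher GUID wins.
print(winner(Update(3, 100.0, "aaa"), Update(3, 100.0, "bbb")))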
Active Directory uses DNS for name resolution; WINS is no longer supported. DNS is a hierarchical name resolution system and one of the largest directory services. DNS nomenclature involves zones, resource records and dynamic DNS. Zones are delegated portions of the DNS namespace that a name server maintains; a zone is essentially a collection of resource records. A resource record is the unit of information in DNS. Common record types include the address record, pointer record, alias record, mail exchange record, name server record, start of authority record and service record. The last one is used by DCs and AD clients to locate servers for a particular service. Service records depend on address records: a list of all computers running a service is maintained as A records.

Saturday, June 29, 2013

Active Directory replication continued ...
When replicating a naming context, a domain controller maintains a high watermark table to pick up where it left off. There is one table for every naming context, which totals three if we include the schema, configuration and domain NCs. Each table stores the highest USN of the updates received so that only new information is requested.
This is different from the up-to-dateness vector, another table that the DC maintains to make replication of a naming context more efficient by removing redundant replication and preventing endless replication loops. The two tables can be used together to improve the efficiency of replication.
By filtering out the same changes arriving from multiple sources, only the updates that have not yet been applied are made. This is called propagation dampening. Thus we have seen that the Active Directory is split into separate naming contexts, each of which is replicated independently, and that within each naming context a variety of metadata is held. Update entries consist of the originating-DSA-GUID, the originating-USN and a timestamp indicating the last successful replication with the originating domain controller. These values are updated only during a replication cycle.
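A rough sketch of how an up-to-dateness vector dampens redundant updates; the data shapes and GUID strings are simplified and purely illustrative:

# up_to_dateness: originating DSA GUID -> highest originating USN already applied
up_to_dateness = {"dsa-A": 1500, "dsa-B": 900}

# candidate updates offered by a partner: (originating DSA GUID, originating USN, change)
offered = [
    ("dsa-A", 1450, "old change, already applied elsewhere"),
    ("dsa-A", 1501, "new change"),
    ("dsa-B", 901,  "new change"),
]

def needed(updates, vector):
    # Propagation dampening: skip anything at or below the USN already applied
    # from that originating server, so the same change is never applied twice.
    return [u for u in updates if u[1] > vector.get(u[0], 0)]

for dsa, usn, change in needed(offered, up_to_dateness):
    print(dsa, usn, change)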
As an example of replication, we take the following example from the book: Step 1) a user is created on DC A. Step 2) that object is replicated to DC B. Step 3) the object is subsequently modified on DC B, and Step 4) the new changes to that object are replicated back to DC A. The Active Directory database transaction representing step 1 consists of a USN and a timestamp. The replication of the originating write to DC B allocates a different USN, and the user's uSNCreated and uSNChanged attributes are updated. In step 3, the password change for the user on DC B also modifies the uSNChanged attribute for the user; in addition, the password attribute is modified and the corresponding USN and timestamp are updated. Step 4 is similar to step 2: a change transaction is issued and the attributes updated. To look at how the replication occurs, we follow five steps: Step 1) replication with a partner is initiated. Step 2) the partner works out what updates to send. Step 3) the partner sends the updates to the initiating server. Step 4) the initiating server processes the updates. Step 5) the initiating server checks whether it is up to date.

Friday, June 28, 2013

Active Directory Site topology and replication

While data is usually replicated from a single master server to subordinate servers, Active Directory offers multimaster replication. Single-master replication has the following drawbacks: it has a single point of failure, there is geographic distance from the master to the clients performing the updates, and replication is less efficient because updates originate from a single location. With multimaster replication you avoid these, but you first have to create a site topology and define how the domain controllers replicate with each other. The Knowledge Consistency Checker sets up and manages the replication connections. Subnets, using 32-bit (IPv4) or 128-bit (IPv6) addressing, are added to sites to determine relative locations on a network. AD sites are defined in terms of a collection of well-connected AD subnets. Replication flows are set up between sites, and DFS shares or DCs are located using sites. Sites are used to perform DNS queries via the DC locator service, which finds the nearest DC or global catalog. Most members of a domain dynamically determine their site when they start up. By default, one site is created automatically. Multiple sites can be defined for a single location to segregate resources.
Site links allow you to define which sites are connected to each other and the cost associated with each connection. Site links are used for replication and can use IP or SMTP as the transport. Replication normally happens over IP, but SMTP can be used where connectivity is poor or unreliable. Site links also help a DC determine which other sites to cover, in addition to its own, for logons from sites without a DC of their own. As an aside, if the network is not fully routed, i.e. not all site links are available, then site link bridges have to be defined between sites. Connection objects specify which DCs replicate with which other DCs and are generally managed by the DCs themselves, although it isn't always possible to let AD manage all of these connections.
The Knowledge Consistency Checker (KCC) automatically generates and maintains the connection objects and knows how and when to replicate over them. It uses two algorithms, one intrasite and one intersite. The intrasite algorithm is designed to create a minimal-latency ring topology that guarantees no more than three hops between any two DCs in the site. The intersite algorithm, on the other hand, tries to keep the sites connected via a spanning tree so that replication can occur, and uses the site link costs to choose the connections. RepAdmin is a command-line tool for administering replication; Replmon is a graphical utility for managing and monitoring replication.
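As a toy illustration of the intersite idea, the sketch below picks a minimum-cost spanning tree over site links using Kruskal's algorithm; the site names and costs are invented, and the real KCC logic is more involved:

# site links as (cost, site1, site2); names and costs are hypothetical
links = [(100, "Seattle", "Portland"), (300, "Seattle", "Denver"),
         (250, "Portland", "Denver"), (500, "Denver", "Dallas")]

parent = {}
def find(s):
    parent.setdefault(s, s)
    while parent[s] != s:
        parent[s] = parent[parent[s]]   # path compression
        s = parent[s]
    return s

# Kruskal's algorithm: take the cheapest links that do not form a cycle.
tree = []
for cost, a, b in sorted(links):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb
        tree.append((a, b, cost))

print(tree)   # a spanning tree connecting all sites at minimum total cost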
 Replication is done with either update sequence number (USN) or timestamps. Each DC maintains its highest combined USN for all naming contexts.
Constructing a web image graph uses the VIPS algorithm, which separates a web page into visual blocks organized in a tree. Different blocks relate to different topics, so we use the hyperlinks from block to page rather than from page to page. The page-to-block and block-to-page relationships are extracted first. The block-to-page matrix Z, with dimensions n x k, is constructed first; a cell holds the inverse of the number of pages to which a block links, or zero otherwise. The page-to-block matrix X, with dimensions k x n, is populated with a normalized importance value based on the size of the block and its distance from the center, or zero otherwise. These two matrices are combined to form a new web page graph W = ZX. Let Y be a block-to-image matrix with dimensions n x m, where a cell holds the inverse of the number of images contained in the block, or zero otherwise. Using link-level analysis, the block graph is defined as Wb = (1 - t)ZX + tU/D, where t is a constant and D is a diagonal matrix. The diagonal matrix is populated with zero if block i and block j are contained in two different web pages; otherwise it is set to the degree of coherence from the VIPS algorithm. The block graph attempts to define the probability of jumping from block a to block b. Then the image graph is constructed as Wi = Y^T Wb Y. This image graph better reflects semantic relationships between images and can be used with the data mining techniques discussed earlier.
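A small numpy sketch combining toy matrices according to the formulas above; the dimensions and random values are placeholders, U is treated as a generic block affinity matrix, and U/D is interpreted as multiplication by the inverse of the diagonal matrix D:

import numpy as np

np.random.seed(0)
n, k, m, t = 4, 3, 5, 0.2          # blocks, pages, images, constant t (toy sizes)

Z = np.random.rand(n, k); Z /= Z.sum(axis=1, keepdims=True)   # block-to-page
X = np.random.rand(k, n); X /= X.sum(axis=1, keepdims=True)   # page-to-block importance
Y = np.random.rand(n, m); Y /= Y.sum(axis=1, keepdims=True)   # block-to-image
U = np.random.rand(n, n)                                      # generic block affinity (placeholder)
D = np.diag(np.random.rand(n) + 0.1)                          # diagonal degree-of-coherence matrix

W  = Z @ X                                         # combined graph, as in W = ZX
Wb = (1 - t) * (Z @ X) + t * np.linalg.inv(D) @ U  # block graph Wb = (1-t)ZX + tU/D
Wi = Y.T @ Wb @ Y                                  # image graph Wi = Y^T Wb Y, m x m

print(Wi.shape)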
 
A frequently occurring interview question is to print the nodes of a tree using breadth first traversal. The breadth first traversal of a tree visits each node from left to right at each level of the tree, starting from the root and descending to the leaves. The order in which the nodes of a level occur is the order in which their parents occur. The traversal is ordered by maintaining a queue (the same idea that Dijkstra's shortest-path algorithm generalizes with a priority queue). The root is added to the queue; there are no more nodes at this level. When the root is dequeued, its children, if any, are added to the queue. When one of those nodes is dequeued, its children are enqueued. When we dequeue the last node at any level and add its children to the queue, the next node to be dequeued is the first node of the next level. We repeat this until the queue is empty, which happens once all of the nodes have been added and removed. Since each node is enqueued and dequeued exactly once, the time complexity is linear.

        // Prints the tree level by level; a null entry in the list marks the end of a level.
        static void PrintBreadthNodes(BinaryTreeNode root)
        {
            if (root == null) return;
            var list = new List<BinaryTreeNode>();
            list.Add(root);
            list.Add(null);              // sentinel marking the end of the first level
            while (list.Count > 0)
            {
                var item = list.First();
                list.RemoveAt(0);        // dequeue from the front of the list
                if (item != null)
                {
                    Console.Write(item);
                    if (item.left != null) list.Add(item.left);
                    if (item.right != null) list.Add(item.right);
                }
                else
                {
                    // end of a level: stop if nothing is left, otherwise mark the next level
                    if (list.Count == 0) return;
                    list.Add(null);
                    Console.WriteLine();
                }
            }
        }

Thursday, June 27, 2013

Mining multimedia data on the web
Multimedia can be embedded on the web pages and associated with link information. These texts and link information can be regarded as features. Using some web page layout mining techniques, a web page can be partitioned into a set of semantic blocks, then the block that contains the semantic data can be considered a whole. Searching and organizing multimedia data can be considered as searching the multimedia as a whole. VIPS discussed earlier can help identify the surrounding text and this text can be used to build an image index. Then image search is partially completed using traditional text search techniques. To construct a web-image search, in addition to the block to page and page to block relations, a block to image relation is also used.
For the automatic classification of web documents, different vendors maintain their own taxonomy of web documents, and new documents are classified based on this taxonomy. Otherwise, the procedures described earlier for keyword-based document classification and keyword-based association analysis also apply here.
We discussed web content and web structure mining in the previous posts; we now discuss web usage mining. Here web log records are used to discover who accesses what pages and their access patterns. This can give clues to who the potential customers are, enhance the quality and delivery of internet information services to end users and improve server performance. Web log records grow quickly and can amount to a huge data size. There is a lot of information that can be mined, but making sure it is valid and reliable is equally important. So we begin with cleaning, condensing and transforming as preprocessing methods. Next, with the available URL, time, IP address and web page content information, a weblog database is populated. This is used with a typical data-warehouse-based multidimensional OLAP analysis. Then we use association mining, as discussed earlier, to find patterns and trends of web access. For example, users' browsing sequences of web pages can be helpful to improve their web experience. Systems can be improved with better web caching, web page prefetching, web page swapping, understanding the nature of web traffic and understanding user reaction and motivation. Web sites can improve themselves based on the access patterns; such sites are referred to as adaptive sites.
Web usage, together with web content and web structure, helps with web page ranking. Hence the quality of search is improved, because search is contextualized and personalized.
Having talked about the VIPS algorithm and structural relationships, it may be interesting to note that we extract the page-to-block and block-to-page relationships to construct a graph model with a page graph and a block graph. Based on this graph model, new link analysis algorithms capable of discovering the intrinsic semantic structure of the web can be developed. The block-to-page (link structure) and page-to-block (page layout) relationships are used in block-level link analysis. The block-to-page relationship is obtained from link analysis, using a matrix for distances, which gives a more accurate and robust representation of the link structures of the web. The page-to-block relationship is obtained from the page layout analysis, which segments a page into blocks. Then we construct the block graph and the image graph in turn. This image graph better reflects the semantic relationships between the images.

Wednesday, June 26, 2013

Mining the world wide web
The world wide web serves as a huge, widely distributed, global information service center for a variety of markets such as finance, consumer, education, government, e-commerce, and many others. In addition to the content on the web pages, web page access, usage information and hyperlinks also provide additional sources for data mining. The challenges are: 1) the web is too huge for effective data warehousing and data mining, 2) web pages are complex, lack a unifying structure and vary quite a bit, 3) the web is a highly dynamic information source, 4) the web serves a broad diversity of user communities, and 5) the bulk of the information is considered irrelevant.
Web search engines serve up resources on the internet and they usually index the web pages and store huge keyword-based indices. However, such an approach has limitations. First, a topic can contain hundreds of thousands of documents, which can lead to a huge number of document entries returned by a search engine. Second, relevant documents may not be retrieved because they don't contain the keywords. Since a keyword-based web search engine is not sufficient for web resource discovery, web mining has to address search on web structures, rank the contents, discover patterns, etc. This is typically categorized as web content mining, web structure mining, and web usage mining. Web pages are supposed to have a DOM tree structure and have a layout that is generally shared by many pages. Hence vision-based page segmentation is a common technique: the page layout is taken and the blocks are extracted while finding the separators, so that the web page can be represented as a semantic tree.
Search engines also automatically identify authoritative web pages. This is done based on the collective endorsement of web pages by way of hyperlinks pointing from one page to another; these hyperlinks infer the notion of authority. Also, a so-called hub can provide a collection of links to authorities. Hub pages may not be prominent, and there may exist few links pointing to them, but they can be mined to find authoritative pages. The HITS algorithm (Hyperlink-Induced Topic Search) involves using the query terms to collect a starting set, also called the root set. Since many of these pages are presumably relevant, the root set can be expanded to a base set by following links, up to a threshold of additional pages to include, and a weight propagation method is initiated. Finally, HITS outputs a short list of pages with large hub weights and pages with large authority weights. To work around the size of the hub pages, the weights of the links are adjusted and the hub pages are broken down into smaller units. Google's PageRank uses something similar. Other newer algorithms include block-level link analysis and the block-to-page relationship. Next we will look at mining multimedia data on the web.
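A compact sketch of the HITS weight-propagation step on a tiny invented link graph:

import numpy as np

# adjacency matrix of a small base set: A[i, j] = 1 if page i links to page j (toy data)
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

hubs = np.ones(4)
auths = np.ones(4)
for _ in range(50):
    auths = A.T @ hubs            # a page's authority comes from the hubs pointing to it
    hubs  = A @ auths             # a page's hub score comes from the authorities it points to
    auths /= np.linalg.norm(auths)
    hubs  /= np.linalg.norm(hubs)

print("authorities:", auths.round(3))
print("hubs:       ", hubs.round(3))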
Comparison of decision tree versus vector space model for document databases
A decision tree operates well on structured data where each tuple is defined by a set of attribute-value pairs. The analysis decides which features are more discriminating and builds the tree top down. Once the tree is constructed, it is applied to all records one by one. Alternatively, the rules can be derived and ordered in a flat list based on their discriminative power and occurrence frequency. Either way, both the attribute-value pairs and the order of evaluation are set prior to evaluating new data. This is efficient for large data sets where each data object can be evaluated against the tree to find the label to assign.
But for document databases this is not the case. The set of attributes or dimensions is not fixed, and furthermore some attributes may be more relevant than others. Vector space models work well in such a high-dimensional space since they use a mapping function that maps the term space to a quantitative variable. However, a vector space model may assign improper weights to rare items since it does not take into account the distribution characteristics of other data points. This can be addressed with a preprocessing step called feature selection, which removes the terms in the training documents that are statistically uncorrelated with the class labels.
To recap, here are the advantages and disadvantages of using decision trees versus vector space model.
 
Decision Tree
Advantages:
· Useful for structured data
· Applicable to hierarchical categorical distinctions
· Automated tree induction is straightforward
· Accuracy at each level can be determined before growing the tree
· Can be printed out and interpreted
· Amount of training data is reduced as we descend the decision stumps
Disadvantages:
· Forces features to be checked in a specific order even if the features act relatively independently
· Doesn't work well with weak predictors of the correct label

Vector space model
Advantages:
· Works with varied-length attribute-value pairs
· Allows quantitative representation of data points
· Allows all features to act in parallel
· Ordering is not required
· Overcomes weak predictors with feature selection
Disadvantages:
· Disregards class distributions
· May assign large weights to rare items
· Can be extremely slow

For example,  the following code attempts to find the topic as a single keyword in the provided text.

import nltk
import nltk.corpus
from nltk.text import TextCollection
from nltk.cluster.kmeans import KMeansClusterer
from nltk.cluster.util import euclidean_distance
from numpy import array

# method to get the topic of a given text as a single keyword
def getTopic(text):
    # clean input: lower-case, tokenize, strip punctuation and stop words
    stop = set(open('stopwords.txt').read().split())
    src = [w.strip(" .,?!") for w in nltk.word_tokenize(text.lower()) if w not in stop]

    # candidate keywords: the ten most frequent terms longer than three characters
    candidates = [w for w, _ in nltk.FreqDist(w for w in src if len(w) > 3).most_common(10)]

    # score each candidate by tf-idf against the Brown corpus
    brown = TextCollection(nltk.corpus.brown)
    scores = [(w, brown.tf_idf(w, src)) for w in candidates]

    # cluster the one-dimensional tf-idf scores with k-means (two clusters)
    vectors = [array([s]) for _, s in scores]
    clusterer = KMeansClusterer(2, euclidean_distance, avoid_empty_clusters=True)
    labels = clusterer.cluster(vectors, assign_clusters=True)

    # pick the candidate closest to the center of the largest cluster
    largest = max(set(labels), key=labels.count)
    centroid = clusterer.means()[largest]
    best = min((abs(vectors[i][0] - centroid[0]), scores[i][0])
               for i in range(len(scores)) if labels[i] == largest)
    print(best[1])
 

Tuesday, June 25, 2013

Keyword-based association analysis: Such analysis collects sets of keywords or terms that occur together and finds correlations between them. Each document is considered a transaction and the set of keywords in that document is considered a set of items in the transaction. The association mining process can detect compound associations, such as domain-dependent terms or phrases, and non-compound associations such as units of measure. This is called term-level association mining, as opposed to mining on individual words, and the terms are typically bigrams and trigrams. Terms and phrases are automatically tagged and the number of meaningless results is greatly reduced. With such term and phrase recognition, standard association mining or max pattern mining may be invoked, as in the sketch after this paragraph.
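A minimal sketch of term-level association mining over a toy document set, keeping bigrams that meet a minimum support; the documents and threshold are made up:

from collections import Counter

# each document is treated as a transaction of terms (toy data)
docs = [
    ["data", "mining", "association", "rule", "mining"],
    ["data", "mining", "frequent", "pattern"],
    ["association", "rule", "mining", "support"],
]
min_support = 2   # a bigram must appear in at least two documents

counts = Counter()
for doc in docs:
    # count each distinct adjacent pair (bigram) once per document
    bigrams = {tuple(doc[i:i + 2]) for i in range(len(doc) - 1)}
    counts.update(bigrams)

frequent = {bg: c for bg, c in counts.items() if c >= min_support}
# ('data', 'mining'), ('association', 'rule') and ('rule', 'mining') each appear in two documents
print(frequent)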
Document classification analysis: Automated document classification assigns labels to documents for easy lookup and retrieval. This is used for tagging topics, constructing topic directories, identifying document writing styles, and grouping hyperlinks associated with a set of documents. Document classification is automated in the following way: first, a set of pre-classified documents is used as the training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process before it can be used for classifying other documents. This is different from classifying relational data, which is structured: each tuple in relational data is defined by a set of attribute-value pairs, and the classification analysis decides which attribute-value pair is most discriminating. Document databases, on the other hand, contain a set of keywords per document and don't have a fixed set of attributes, and treating each keyword as a dimension results in a very large dimension set. Plain decision tree analysis is therefore not well suited to document classification.
Techniques commonly used for document classification include nearest neighbor classification, feature selection methods, Bayesian classification, support vector machines, and association based classification.
The k-nearest neighbor classifier assumes that similar documents have similar document vectors and are expected to be assigned the same class label. We can simply index all of the training documents, each associated with its corresponding class label. For a given test document, we retrieve the k most similar training documents and use their labels. To refine the label selection, we can tune k or use weights associated with documents. Furthermore, a feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels. Bayesian classification is another popular technique that involves the statistical distribution of documents in specific classes. This classifier first trains the model by estimating a document distribution for each class c and then tests which class is most likely to have generated the test document d. Another classification method is the support vector machine: the classes are represented by numbers and a direct mapping function from the term space to the class variable is constructed. The least squares linear regression method can be used to discriminate this classification. Association-based classification classifies documents based on a set of associated, frequently occurring text patterns. However, frequent terms are usually poor discriminators, and the not-so-frequent ones have better discriminative power. Such an association-based classification method proceeds as follows. First, we extract keywords and terms by information retrieval and the simple association analysis techniques mentioned above. Second, we use a concept hierarchy with term classes, such as WordNet, or expert knowledge, or some keyword classification system. Documents in the training set can also be classified into class hierarchies. Term association mining can then be applied to discover sets of associated terms that can be used to maximally distinguish one class of documents from another. This derives a set of association rules associated with each document class, which can be ordered based on their discriminative power and occurrence frequency and then used to classify new documents.
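A minimal sketch of the k-nearest-neighbor idea using term-frequency vectors and cosine similarity; the tiny training set, labels and k value are invented:

import math
from collections import Counter

train = [
    (Counter("replication topology site link".split()), "directory"),
    (Counter("cluster kmeans centroid distance".split()), "mining"),
    (Counter("frequent itemset apriori support".split()), "mining"),
]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc, k=2):
    # rank training documents by similarity and vote among the k closest
    ranked = sorted(train, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    return Counter(label for _, label in ranked).most_common(1)[0][0]

print(knn_classify(Counter("apriori support cluster distance".split())))   # -> "mining"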

Monday, June 24, 2013

Steps for processing text:
Text cleaning : clean the data, remove stop words and resolve inconsistencies and stem the terms.
Keywords are extracted based on their conditional probability distributions.
Text is usually analyzed for topics by clustering with K-means using a cosine based distance vector.
Another data mining tool is clustering for high-dimensional data. For example, text documents may contain thousands of terms or keywords as features. Document classification differs from topic analysis in that documents are high-dimensional, with many keywords as features. Text documents are clustered on the frequent terms they contain. Terms are extracted and stemmed to reduce each term to its basic stem, which reduces the document to a bag of words. If each term is treated as a dimension, the document space becomes high-dimensional, with as many features as keywords. By using frequent-term-based analysis, a well-selected subset of all documents can be considered as a clustering. A frequent-term-based cluster refers to a cluster consisting of the set of documents containing all of the terms of the frequent term set. A good subset of all the frequent term sets is selected, where a good subset is one that covers all of the documents. Cluster overlap is measured by the distribution of documents supporting some clusters over the remaining cluster candidates. Clusters are automatically described by their frequent term set.
Information retrieval is concerned with the organization and retrieval of information from a large number of documents usually with metrics such as Term-frequency-inverse-document-frequency and Shannon information. Two documents are considered to be similar if they have a high cosine measure. Unlike database systems that address concurrency control, recovery, transaction management and update, information retrieval is concerned with keywords and the notion of relevance.
Popular text indexing techniques include inverted indices and signature files. An inverted index consists of two indexed tables (hash or B+-tree): a document_table and a term_table. The document table contains the document_id and a list of terms for each document, also known as the posting list. The term table contains the term_id and the list of document identifiers in which the term appears. The list of terms can be quite large, so a hashing technique and a superimposed coding technique are used to encode the list of terms into a bit representation.
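A toy sketch of the two tables of an inverted index using plain dictionaries; a real system would back these with hash or B+-tree indexes as noted above:

from collections import defaultdict

docs = {1: "replication uses update sequence numbers",
        2: "sequence data and stream data",
        3: "replication topology and sites"}

document_table = {}                 # document_id -> posting list of terms
term_table = defaultdict(set)       # term -> set of document_ids containing it

for doc_id, text in docs.items():
    terms = text.split()
    document_table[doc_id] = terms
    for term in terms:
        term_table[term].add(doc_id)

print(term_table["replication"])    # -> {1, 3}
print(document_table[2])            # -> ['sequence', 'data', 'and', 'stream', 'data']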
 A signature file stores a signature record for each document in the database. Signatures are usually b bits long, initialized to zero, and a bit is set to 1 if the term appears in the document.  To reduce the high-dimensionality of documents, reduction techniques such as latent semantic indexing, probabilistic latent semantic indexing and locality preserving indexing can be used.
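A small illustration of a signature file, hashing each term to a bit position in a fixed-width signature; the width and hash choice are arbitrary:

B = 16   # signature width in bits (arbitrary for illustration)

def signature(terms):
    sig = 0
    for t in terms:
        sig |= 1 << (hash(t) % B)    # set the bit this term hashes to
    return sig

def may_contain(sig, term):
    # a signature can give false positives but never misses a term that was indexed
    bit = 1 << (hash(term) % B)
    return (sig & bit) == bit

sig = signature("stream time series sequence".split())
print(may_contain(sig, "stream"))        # True
print(may_contain(sig, "replication"))   # usually False, occasionally a false positive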
Latent semantic indexing decomposes the term-document matrix using singular value decomposition, i.e. it extracts the most representative features while minimizing the reconstruction error. If the rank of the term-document matrix X is r, then LSI decomposes X as X = U * Sigma * V^T, where Sigma is a diagonal matrix of the singular values of X. LSI uses the first k singular vectors to embed the documents in a k-dimensional subspace.
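A short numpy sketch of the LSI decomposition on a toy term-document matrix, keeping only the first k singular values:

import numpy as np

# toy term-document matrix X: rows are terms, columns are documents
X = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 3, 1],
              [0, 0, 1, 2]], dtype=float)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(sigma) @ Vt

k = 2                                    # keep the k largest singular values
docs_k = np.diag(sigma[:k]) @ Vt[:k, :]  # documents embedded in a k-dimensional subspace

print(docs_k.round(2))                   # one k-dimensional column per document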
Locality preserving index - extracts the most discriminative features by preserving the locality of documents in the reduced dimensionality space. Locality implies semantic closeness. LPI uses a minimization function with constraints.
While LSI seeks to uncover the most representative features, LPI aims to discover the geometrical structure of the document space.
Probabilistic Latent Semantic indexing is similar to LSI but achieves reduction through a probabilistic mixture model. Specifically there are k latent common themes in the document collection and each is characterized by a multinomial word distribution.
There are other approaches to text mining as well in addition to keyword based approach. These are 1) tagging approach and 2) information extraction approach. Tagging may be manual or done by some automated categorization algorithm where the tag set is small and predefined. Information extraction is deeper in that it requires semantic analysis of text by natural language understanding and machine learning methods. Other text mining tasks may include document clustering, classification, information extraction, association analysis, and trend analysis.
Self organizing feature maps is a neural network method for cluster analysis. A neural network is a set of connected input/output units, where each connection has a weight associated with it. They are popular for clustering because 1) they are inherently parallel and distributed processing architectures 2) they learn by adjusting their inter-connection weights so as to best fit the data. With this, they normalize the patterns and act as feature extractors for the various clusters. 3) They process numerical vectors and require object patterns to be represented by quantitative patterns.
Each cluster is represented as an exemplar which means a prototype and does not have to match a data example. Data points are assigned to cluster that is most similar to an exemplar based on a distance measure. The attributes are then predicted from the attributes of the exemplar.
Self-organizing feature maps represent all points in a high-dimensional source space by a points in 2-D or 3-D space such that distance and closeness are maintained. The method is useful when the problem is inherently non-linear.
SOM can be viewed as constrained versions of k-means clustering where the cluster centers are in low dimensional space.
Clustering is performed with several units competing for the current object. The unit whose weight vector is closest to the current object becomes the winning unit. The weights of this unit and those of its neighbors are adjusted to move closer to the input object. The assumption is that there is some topology or ordering in the input and that the units will eventually take that shape, producing what is called a feature map. Such processing has been applied to web mining but is costly for large databases.
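A bare-bones sketch of the competitive update at the heart of a self-organizing map: find the winning unit and pull its weights toward the input (a full SOM also updates a shrinking neighborhood); the grid size, learning rate and data are illustrative:

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))            # 200 three-dimensional input patterns (toy data)
weights = rng.random((5, 5, 3))        # a 5x5 grid of units, each with a 3-D weight vector
rate = 0.1

for x in data:
    # the winning unit is the one whose weight vector is closest to the input
    dists = np.linalg.norm(weights - x, axis=2)
    wi, wj = np.unravel_index(dists.argmin(), dists.shape)
    # move the winner toward the input; a full SOM also updates its neighbors
    weights[wi, wj] += rate * (x - weights[wi, wj])

print(weights[0, 0].round(3))          # one unit's learned weight vector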

Sunday, June 23, 2013

A closer look at decision tree induction
Decision trees can be built from training data using algorithms of this kind. The non-leaf nodes denote tests on attributes and the leaf nodes denote class labels. The attribute values of each tuple are evaluated before a class label is assigned to it. Decision trees can be applied to high-dimensional data because multiple attributes can be added to the tree.
Tuples from the training data are class-labeled and these are used to build the decision tree. Attribute selection measures are used to select the attribute that best partitions the tuples into distinct cases. The set of candidate attributes and the attribute selection method that best partitions the data tuples into individual classes are available as input.
Generate_Decision_Tree(D, attribute_list):
First we create a node N.
If all the tuples in D are all of the same class C then return N as a leaf node labeled with class C.
If the attribute list is empty then return N as a leaf labeled with the majority class in D.
Apply attribute_selection_method(D, attribute_list) to find the best splitting criterion. The splitting criterion tells us which attribute to test at node N by determining the best way to partition the tuples. It also tells us which branches to grow from node N with respect to the outcomes of the chosen test. The partitions are kept as pure as possible i.e. they belong to the same class. A partition is pure if all of the tuples in it belong to the same class.
Label the node N with splitting criterion
A branch is grown from each of the outcomes of the splitting criterion. The tuples in D are partitioned accordingly.
If the splitting attribute is discrete-valued and more than one split is possible, then set the attribute list to the remainder without the splitting attribute, i.e. remove the splitting attribute.
foreach outcome j of splitting criterion
  partition the tuples and grow subtrees for each partition
  let Dj be the set of data tuples satisfying outcome j
  if Dj is empty then attach a leaf labeled with the majority class in D to node N;
  else attach the node returned by Generate_Decision_Tree(Dj, attribute_list) to node N;
end for
return N
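A small sketch of one possible attribute_selection_method, the information gain measure used by ID3; the toy tuples and attribute are invented:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(tuples, attr):
    # gain = entropy of D minus the weighted entropy of the partitions induced by attr
    labels = [t["class"] for t in tuples]
    partitions = Counter(t[attr] for t in tuples)
    remainder = 0.0
    for value, count in partitions.items():
        subset = [t["class"] for t in tuples if t[attr] == value]
        remainder += (count / len(tuples)) * entropy(subset)
    return entropy(labels) - remainder

D = [{"age": "young", "class": "yes"}, {"age": "young", "class": "no"},
     {"age": "old", "class": "yes"}, {"age": "old", "class": "yes"}]
print(information_gain(D, "age"))   # higher gain means a better splitting attribute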
 
Microsoft OLE DB for data mining :
OLE DB standardized data mining language primitives  and became an industry standard. Prior to OLE DB it was difficult to integrate data mining products. If one product was written using decision tree classifiers and another was written with support vectors and they do not have a common interface, then the application had to be rebuilt from scratch.  Furthermore, the data that these products analyzed was not always in a relational database which required data porting and transformation operations.
OLEDB for DM consolidates all these. It was designed to allow data  mining client applications to consume data mining services from a wide variety of data mining software packages. Clients communicate with data mining providers via SQL. 
The OLE DB for Data Mining stack uses Data Mining Extensions (DMX), a SQL-like data mining query language, to talk to different DM providers. DMX statements can be used to create, modify and work with different data mining models. DMX also contains several functions that can be used to retrieve statistical information. Furthermore, the data and not just the interface is unified: data mining providers can work against data stores such as a cube, a relational database, or miscellaneous other data sources to retrieve and display statistical information.
The three main operations performed are model creation, model training and model prediction and browsing.
 Model creation A data mining model object is created just like a relational table. The model has a few input columns and one or more predictable columns, and the name of the data mining algorithm to be used when the model is later trained by the data mining provider.
Model training : The data are loaded  into the model and used to train it. The data mining provider uses the algorithm specified during the creation to search for patterns. These patterns are the model content.
Model prediction and browsing: A select statement is used to consult the data mining model content in order to make model predictions and browse statistics obtained by the model.
An example of a model can be seen with a nested table for customer id, gender, age and purchases, where the purchases are associations between item_name and item_quantity and a customer can make more than one purchase. Models can be created with attribute types such as ordered, cyclical, sequence_time, probability, variance, stdev and support.
Model training involves loading the data into the model. The openrowset statement supports querying data from a data source through an OLE DB provider. The shape command enables loading of nested data.
 

Saturday, June 22, 2013

Interview questions on SQL Server:
1) What are the different isolation levels
2) What is the difference between repeatable read and read committed ?
3) What does the statement select from update do ?
4) What is the difference between Where and Having clauses ?
5) What are the differences between delete and truncate ?
6) What is key lookup in query plan ?
7) What is the statement to get query execution plan ?
8) What are the three parameters used to find bad query
9) What is the difference between clustered index and non-clustered index
10) What is SET NOCOUNT ON ? What is the count after three DML statements ?
11) What is collation and case sensitivity ?
12) How do you handle errors in stored procedures ?
13) What is the statement to create a table by copying the schema of  another table ?
Interview questions on WCF, ASP.Net, NHibernate or MVC:
1) What is the difference between NHibernate and EntityFramework ?
2) If hbm files are not loaded, how do you include them ?
3) How do you define transactions for service calls ?
4) What is the transaction scoped to ?
5) What is MVC and how is it used ?
6) How are ASP.Net pages different from MVC pages ?
7) What is the difference between the postback call in ASP.Net and in MVC ?
8) What are the other characteristics of ASP.Net ?


 
Applications and Trend in data mining:
Data mining tools have been domain specific such as in finance, telecommunications and retail industry.  These tools integrate the domain specific knowledge with data analysis techniques to answer usually very specific queries.
Tools are evaluated on data types, system issues, data sources, data mining functions, coupling with a database or data warehouse, scalability, visualization and user interface.
Visual data mining is done with a designer style user interface that renders data, results and process in a graphical and usually interactive presentation. Audio data mining uses audio signals to indicate patterns or features.
Data analysis that use statistical methods involve regression, generalized linear models, analysis of variance, mixed-effect models, factor analysis, discriminant analysis, time-series analysis, survival analysis, and quality control.
Data mining or statistical techniques can also be applied to recommendations and opinions of customers to search for similarities and rank products. Such systems are called collaborative recommender systems.
Data mining may be ubiquitous: it is applied in the way we shop, work, search for information, use our leisure time, and maintain our health and well-being. It may not always be visible, and it may participate behind the scenes in managing our e-mails or in web search engines. Such usage sometimes brings up questions around privacy and data security, which fair information practice principles governing the usage of such data attempt to address. On the other hand, data mining can help with counterterrorism. Solutions that balance these trade-offs try not to interpret the data while obtaining mining results, to preserve privacy, and attempt to encrypt data to preserve its security. Recent trends include standardization of data mining languages, visualization methods, and new methods for handling complex data types.
 

Given a vector of integers, find the longest consecutive sub-sequence of increasing numbers. If two sub-sequences have the same length, use the one that occurs first. An increasing sub-sequence must have a length of 2 or greater to qualify.
Example input:
[1 0 1 2 3 0 4 5]
Result:
[0 1 2 3]

#include <assert.h>
#include <stdio.h>

typedef unsigned int uint;

void GetLongestRun ( int* A, uint N )
{
     assert ( A != NULL && N > 0 );

     uint start = 0;       // start index of current candidate sequence
     uint end = 0;         // end index of current candidate sequence
     uint globalStart = 0; // start index of the best candidate seen so far
     uint globalEnd = 0;   // end index of the best candidate seen so far

     for (uint i = 1; i < N; i++)
     {
        if ( A[i] > A[i-1] )     // the increasing run continues
        {
              end++;
        }
        else                     // the run is broken; start a new candidate here
        {
              start = i;
              end = i;
        }

        if (end - start > globalEnd - globalStart)
        {
            globalStart = start;
            globalEnd = end;
        }
     }

     // a qualifying sub-sequence must have at least two elements
     if (globalStart == globalEnd) return;

     for (uint j = globalStart; j <= globalEnd; j++)
         printf ("%d ", A[j]);
}
A tic-tac-toe board is represented by a two dimensional vector. X is represented by :x, O is represented by :o, and empty is represented by :e. A player wins by placing three Xs or three Os in a horizontal, vertical, or diagonal row. Write a function which analyzes a tic-tac-toe board and returns :x if X has won, :o if O has won, and nil if neither player has won.
Example input:
[[:x :e :o]
[:x :e :e]
[:x :e :o]]
Result:
:x
// 'x', 'o' and 'e' stand for :x, :o and :e; '\0' is returned when neither player has won (nil).
public char GetWinner(char[,] Board)
{
     var ret = '\0';

     // check horizontals
     for (int i = 0; i < 3; i++)
     {
          if (Board[i,0] != 'e' && Board[i,0] == Board[i,1] && Board[i,1] == Board[i,2]) return Board[i,0];
     }

     // check verticals
     for (int j = 0; j < 3; j++)
     {
          if (Board[0,j] != 'e' && Board[0,j] == Board[1,j] && Board[1,j] == Board[2,j]) return Board[0,j];
     }

     // check diagonals
     if (Board[0,0] != 'e' && Board[0,0] == Board[1,1] && Board[1,1] == Board[2,2]) return Board[0,0];
     if (Board[0,2] != 'e' && Board[0,2] == Board[1,1] && Board[1,1] == Board[2,0]) return Board[0,2];

     return ret;
}

Friday, June 21, 2013

Mining object spatial multimedia and text
These are complex types of data. If the database is an object-relational or object-oriented database, then the data can be mined by using generalization and assigning classes to these complex objects, including set-based, list, inheritance and composition based hierarchies. They can also be mined by visualizing them as object data cubes. Finally, they can be used with generalization-based mining.
Spatial data mining finds interesting patterns from large geospatial databases. Spatial data cubes are constructed using spatial dimensions and measures, and these can be queried using spatial OLAP. Spatial mining includes mining spatial association and collocation patterns, clustering, classification, and spatial trend and outlier analysis.
Multimedia data mining finds interesting patterns from multimedia databases which store audio data, image data, video data, sequence data, and hypertext data containing text, markups and linkages.  Mining involves finding patterns based on content and similarity measures, generalization and multidimensional analysis. Mining also involves classification and prediction, mining associations and audio and video data mining.
Text or document database mining uses precision, recall and the F-score to measure the effectiveness of mining. As discussed earlier, various text retrieval methods have been developed where the queries can specify constraints on the documents to select, or the documents are ranked to enable a selection. As an example, if we use similarity measures between keywords, then the documents can be ranked in order of relevance. Text that has a lot of attributes can be reduced with indexing methods such as Latent Semantic Indexing (LSI), Locality Preserving Indexing (LPI), and probabilistic LSI. Text mining is not limited to keyword-based and similarity-based search. It can involve keyword-based associations, document classification and document clustering.
Web mining looks for web linkage structures, web contents and web access patterns. Web page layouts, web link structures, associated multimedia data and classification of web pages  are all used in this mining.

Thursday, June 20, 2013

Graph Mining
This involves finding frequent sub-graph patterns  and performing some of the mining operations discussed earlier such as categorization, classification, and cluster analysis over large graph data sets. These approaches are of two types - Apriori based and pattern growth based approaches. The Apriori based approach generates candidates by extracting the nodes at each level using a breadth first approach. The pattern growth approach is more flexible with respect to the search method.  Graph mining uses efficient graph index structures and can be used to search for similarity based on the structure and not the values. Using structure has more benefits than comparing values. Classification and cluster analysis of graph data can now be integrated with pattern mining process.
An example of a graph for mining is the social network, which is a heterogeneous and multirelational data set represented by a graph. Such networks can be arbitrarily large. They tend to follow the densification power law, which states that the network becomes increasingly dense over time. They also exhibit a property where the effective diameter decreases as the network grows. The out-degrees and in-degrees of nodes are distributed such that there are a large number of outliers. The growth of these networks is often modeled with the analogy of a forest fire.
Link mining is an interdisciplinary pursuit between social networks, link analysis, hypertext and web mining, graph mining, relational learning, and inductive logic programming. The tasks here involve classifying objects based on links, prediction of object types, link types and link existence, estimating link cardinality, predicting whether two objects are the same, and detecting clusters of objects. Other tasks involve finding sub-graphs within networks and extracting metadata from unstructured data.
Link prediction uses measures for the proximity of network nodes and ranks links based on these measures. Some examples include the shortest path and the number of neighbors that two nodes share. This degree of connection is exploited in viral marketing, where more money is spent on the individual who can advertise by positive word of mouth to a larger number of individuals. Similarly, in newsgroups people respond to posts more when they disagree than when they agree, so graph partitioning can be used to mine newsgroups by grouping authors into opposite camps. Most community mining methods assume that there is only one kind of relation in the network, and in addition assume that the results are independent of the user's information needs; however, there may be multiple relations, and some relations may matter more depending on the user's information needs. Furthermore, a combination of such hidden relations might reveal a hidden community within the network.
 
Mining stream, time-series, and Sequence Data
Stream data could come from telecom, financial markets or satellite feeds, and it is typically continuous with a varying update rate, sequenced in time, fast changing and massive. Stream summaries are called synopses. Synopses help answer queries on stream data with an approximation; they include random sampling, sliding windows (as in TCP), histograms, multiresolution methods, sketches, and randomized algorithms.
The tilted time frame model allows data to be stored at the finest granularity in the most recent time and at the coarsest granularity in the most distant time. A stream data cube can store compressed data by using the tilted time frame model on the time dimension, storing data only at critical layers, which reflect the levels of data that are of most interest to the analyst, and performing partial materialization based on "popular paths" through the critical layers.
Stream-based operations are typically forward only. The methods discussed earlier for frequent itemset mining, classification and clustering tend to scan the data multiple times. Stream-based methods instead find approximate answers, as for example with the lossy counting algorithm (sketched below) and the CluStream algorithm for stream data clustering. A time series database consists of sequences of values or events changing with time at regular time intervals, such as for weather forecasting. This database is studied with trend analysis, which includes long term trend movements, cyclic movements, seasonal movements, and irregular movements. Subsequence matching is a form of similarity search that finds subsequences similar to a given query subsequence. Such methods match subsequences that have the same shape while accounting for gaps and differences in baseline and scale.
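A rough sketch of the lossy counting idea mentioned above, which keeps approximate frequency counts that undercount by at most epsilon times the stream length; the stream and epsilon are toy values:

import math

def lossy_count(stream, epsilon=0.1):
    width = math.ceil(1 / epsilon)        # bucket width
    counts, deltas = {}, {}
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            deltas[item] = bucket - 1     # maximum possible undercount so far
        if n % width == 0:                # at each bucket boundary, prune small entries
            for key in [k for k in counts if counts[k] + deltas[k] <= bucket]:
                del counts[key], deltas[key]
            bucket += 1
    return counts                         # counts are low by at most epsilon * N

print(lossy_count("a b a c a b d a b a c a".split(), epsilon=0.25))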
A sequence database consists of ordered elements not necessarily based on time, such as web clickstreams. Any sequence that satisfies a minimum support is frequent, and these patterns are queried from the database, for example a customer who bought this also bought that. Constraints for mining sequential patterns are user-defined and help further narrow down the patterns being searched. These constraints can be expressed in terms of the duration of the sequence, a window of time when events occur, and gaps between events. Analysis of recurring patterns is called periodicity analysis and may involve full or half periods or association rules between periods. Another example of sequence analysis is biological sequence analysis, which compares, aligns, indexes and analyzes biological sequences such as sequences of amino acids. It comes in two types, pairwise sequence alignment and multiple sequence alignment, and usually involves dynamic programming. Common techniques to analyze biological sequences are Markov chains and hidden Markov models, which attempt to find the probability of a symbol x in the model given the sequence of symbols.

Wednesday, June 19, 2013

Cluster analysis:
A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in another cluster. The quality of  a cluster can be measured based on the dissimilarity of objects, which can be computed for different types of data such as interval-scaled, binary, categorical, ordinal and ratio-scaled variables. Vector data can be measured with cosine measure and Tanimoto coefficient. Cluster analysis can be a stand alone data mining tool. Clustering methods can be grouped as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern based methods) and constraint-based methods.
A partitioning method creates an initial set of k partitions and assigns the data points to the nearest clusters; it then iteratively re-computes the center of each cluster and reassigns the data points. Some examples are k-means, k-medoids, and CLARANS. A hierarchical method groups data points into a hierarchy by either bottom-up or top-down construction; iterative relocation can also be applied to the sub-clusters. A density-based approach clusters objects based on a density function or according to the density of neighboring objects; some examples are DBSCAN, DENCLUE and OPTICS. A grid-based method first allocates the objects into a finite number of cells that form a grid structure, and then performs clustering on the grid structure; STING is an example of a grid-based method where the cells contain statistical information. A model-based method hypothesizes a model for each of the clusters and finds the best fit of the data to the model; self-organizing feature maps are an example of a model-based method. Clustering high-dimensional data has applications to text documents. There are three methods: dimension-growth subspace clustering, dimension-reduction projected clustering and frequent pattern based clustering. A constraint-based clustering method groups objects based on application-dependent or user-specified constraints. Outlier detection and analysis can be useful for applications such as fraud detection; outliers are detected based on statistical distribution, distance, density, or deviation.
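A compact sketch of the partitioning idea described at the start of this paragraph, using k-means on toy two-dimensional points:

import numpy as np

rng = np.random.default_rng(1)
points = rng.random((30, 2))                      # toy 2-D data
k = 3
centers = points[rng.choice(len(points), k, replace=False)]   # initial partition centers

for _ in range(20):
    # assign each point to its nearest center, then recompute each center
    labels = np.argmin(((points[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
    centers = np.array([points[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                        for i in range(k)])

print(centers.round(3))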
Classification and prediction are forms of data analysis to extract models. Classification predicts labels or data classes while prediction models continuous-valued functions.
Same steps for data preprocessing are applied as discussed before.
ID3, C4.5 and CART are greedy algorithms for the induction of decision trees. Each algorithm tests the attribute for each non-leaf node before selection.
Naïve Bayes classification and Bayesian belief networks are based on Bayes probability theory. The former assumes data classes are conditionally independent while the latter assumes subsets of variables can be conditionally independent.
A rule based classifier uses a set of IF-THEN rules for classification. Rules can be generated directly from training data using sequential covering algorithms and associative classification algorithms.
Associative classification uses association mining techniques that search for frequently occurring patterns in large databases
The set of all classifiers that use training data to build a generalization model are called eager learners. This is different from lazy learners or instance based learners that store different training tuples in pattern space and wait for test data before performing generalization. Lazy learners require indexing techniques.
Linear, non-linear and generalized linear models of regression can be used for prediction. Non-linear problems can be converted to linear ones by transforming the predictor variables.

Tuesday, June 18, 2013

Some of the common techniques for finding patterns with data mining include association rule mining, which consists of finding frequent item sets from which strong association rules are generated. Associations can be analyzed to uncover correlation rules, which give statistical information.
Frequent pattern mining can be categorized based on completeness, levels and dimensions of data, types of values, kinds of rules, patterns. Frequent pattern mining can be classified into frequent itemset mining, sequential pattern mining, structured pattern mining etc. Algorithms for frequent itemset mining can be of three types :
1) apriori-like algorithms: The Apriori algorithm mines frequent item sets for Boolean association rules. Based on the property that all non-empty subsets of a frequent itemset must also be frequent, at the kth iteration it forms candidate k-itemsets from the frequent (k-1)-itemsets (see the sketch after this list).
2) frequent pattern based algorithms: FP-Growth does not generate any candidates but constructs a highly compact data structure (FP-tree) and uses fragment growth.
3) algorithms that use the vertical data format: these transform a given data set of transactions from the horizontal data format (TID: itemset) into the vertical data format (item: TID_set). They then mine using the Apriori property and additional optimization techniques such as diffsets.
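A condensed sketch of the Apriori level-wise candidate generation referenced in item 1; the transactions and minimum support are toy values:

from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "diaper", "beer"},
                {"milk", "bread", "diaper"}, {"bread", "diaper"}]
min_support = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# level-wise: build candidate k-itemsets from frequent (k-1)-itemsets, then prune by support
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    # Apriori property: every (k-1)-subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

print([set(s) for level in frequent for s in level if s])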
These same methods can be extended for the mining of closed frequent itemsets from which the set of frequent itemsets can easily be derived. These include additional techniques such as item merging, sub-itemset pruning and item skipping.
These techniques can be extended to multilevel association rules and multidimensional association rules.
Techniques for mining multidimensional association rules can be categorized according to their treatment of quantitative attributes. For example, they can be discretized statically based on predefined concept hierarchies. Quantitative association rules can be mined where quantitative attributes are discretized dynamically based on binning/clustering.
Association rules should be augmented with a correlation measure such as lift, all_confidence and cosine.
Constraint-based rule mining refines the search for rules by providing meta-rules and constraints, which can be antimonotonic, monotonic, succinct, convertible or inconvertible. Association rules should not be used for prediction without training.
 

book review

Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber

This book mentions the data preprocessing steps as descriptive data summarization, data cleaning, data integration and transformation, data reduction, data discretization and automatic generation of concept hierarchies.
Descriptive data summarization provides the analytical foundation for data pre-processing using statistical measures such as mean, weighted mean, median and mode for center, range, quartiles, interquartile range, variance, and standard deviation for dispersion, histograms, boxplots, quantile plots, scatter plots,  and scatter plot matrices for visual representation.
Data cleaning routines fill in missing values, smooth out noise, and identify outliers and inconsistencies in the data.
Data integration combines data from multiple sources into a coherent data store, resolving data conflicts and semantic heterogeneity along the way.
Data transformation routines convert the data into appropriate forms for mining involving steps such as normalization.
Data reduction techniques such as data cube aggregation, dimensionality reduction, subset selection and discretization can be used to obtain a reduced representation of data.
Data discretization can involve techniques such as binning, histogram analysis, entropy based discretization, cluster analysis and intuitive partitioning. Data processing methods continue to evolve due to the size and complexity of the problem.
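As a small illustration of the normalization step mentioned above, here is a minimal min-max normalization sketch; the value range used in the comment is made up for the example.

// Min-max normalization: rescales a value v from the observed range [min, max]
// to a new range [newMin, newMax].
double MinMaxNormalize(double v, double min, double max, double newMin = 0.0, double newMax = 1.0)
{
    if (max == min) return newMin;   // degenerate range: every value maps to newMin
    return (v - min) / (max - min) * (newMax - newMin) + newMin;
}

// Example: a value of 73,600 observed in the range [12,000, 98,000] maps to about 0.716 in [0, 1].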

Data Mining is the discovery of knowledge based on finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results with visualization tools. Data warehousing helps by providing summarized data. A data warehouse is defined in this book as a subject-oriented, integrated, time-variant, and non-volatile collection of data organized for decision making. A multidimensional data model is used to design the data warehouse; it consists of a data cube with a large set of facts or measures and a number of dimensions. A data cube consists of a lattice of cuboids. Concept hierarchies organize the values into levels of abstraction.
Data Mining can use OLAP queries as well as On-line analytical mining (OLAM)

Monday, June 17, 2013

Q: Find the second largest number in an integer array
A: A single pass that tracks both the maximum and the runner-up suffices:
int GetSecondLargest(int[] array)
{
    if (array == null || array.Length < 2) throw new ArgumentException("need at least two elements");
    int max = array[0];
    int secondmax = 0;
    bool found = false;          // becomes true once a value distinct from max has been seen
    for (int i = 1; i < array.Length; i++)
    {
        if (array[i] > max)
        {
            // the old maximum becomes the second largest
            secondmax = max;
            max = array[i];
            found = true;
        }
        else if (array[i] < max && (!found || array[i] > secondmax))
        {
            secondmax = array[i];
            found = true;
        }
        // array[i] == max adds no new information and is skipped
    }
    if (!found) throw new InvalidOperationException("all elements are equal");
    return secondmax;
}
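For example, for the array {3, 7, 7, 2} the function returns 3, while an array whose elements are all equal throws because there is no second largest value.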
 
Question: Given an arbitrary two dimensional matrix of integers where the elements are sorted in increasing order along rows and columns, find a number in the matrix closest to and less than or equal to a given number.
Answer: A staircase walk from the top-right corner answers this in O(rows + cols) time: if the current element is less than or equal to the number it is the best candidate in its row, so record it and move down; otherwise discard the column and move left.
int? GetClosestLessOrEqual(int[,] matrix, int number)
{
    int rows = matrix.GetLength(0);
    int cols = matrix.GetLength(1);
    int row = 0;
    int col = cols - 1;
    int? best = null;   // stays null when every element exceeds number

    while (row < rows && col >= 0)
    {
        int value = matrix[row, col];
        if (value <= number)
        {
            // value is the largest entry in this row that is <= number;
            // record it and move down to look for something closer.
            if (best == null || value > best) best = value;
            row++;
        }
        else
        {
            // value is too large, and everything below it in this column is larger still,
            // so the whole column can be discarded.
            col--;
        }
    }
    return best;
}
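For example, with the matrix {{1, 4, 7}, {2, 5, 8}, {3, 6, 9}} and the number 6 the walk returns 6, and with the number 0 it returns null because every element is larger.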

Sunday, June 16, 2013

Microsoft Exchange Architecture continued

Microsoft Exchange Architecture
We will look at the Exchange Server Architecture in detail here:
1) Unified Messaging: The unified messaging server role enables unified messaging for an Exchange Server organization.
Features included are :
a) Outlook voice access : user logs onto the mailbox and accesses it via  a voice user interface. An associated UM server checks Active Directory for addresses and access information.
b) Call answering : UM Server plays the individual's greeting and captures voice mail message which is then sent to Hub transport server for delivery
c) Play on Phone: User receives a voice mail message and selects play. Outlook uses https to communicate with the UM web services to fetch the appropriate message.
d) One Inbox: Unified messaging puts all the voice, video and audio content in the same mailbox for the users to pull it from different devices.
e) The Active Directory UM objects include a dial plan comprising an auto-attendant and user-dictated mailbox policies. A UM IP gateway communicates with the external PBX switchboard.

2) The Mailbox server role: includes the following features:
a) Resource Booking Attendant: This enables conference room booking; enforces duration, who can book, and delegates for approval; and provides conflict information. The auto-accept policies and resources are configured using OWA or the Exchange Management Shell.
b) Generate Offline Address Book: OAB files are generated, compressed and placed on a local share. Administrators can configure how the address books are distributed.
c) Outlook client connection: Clients inside the organization can access the mailbox server directly to send and retrieve messages.
d) Exchange administration : Administrator-only computer retrieves active directory topology information from the corresponding AD service.
e) Mailbox and Public Folder Databases: Private user database as well as public folder information are stored in Exchange databases as logical containers
f) Exchange Search: generates a full-text index and indexes new messages and attachments automatically
g) Calendar attendant : automatically puts new meetings as tentative appointments and deletes out of date meeting requests.
i) Messaging records management: This is a managed folder.

3) Client Access Server Role: includes the following features:
1) Exchange web services: These comprise the Autodiscover service, Exchange data service, availability service, synchronization service, notification service, and managed folder service. Clients using EWS communicate over HTTPS using SOAP / REST, and the services are hosted in IIS. The Autodiscover service lets clients find the Exchange server via AD or DNS. The Exchange data service provides read and write access to mailbox and public folder mail, contacts, tasks and calendar data. The synchronization and notification services alert clients to mailbox changes and synchronize public folders. The availability service retrieves free/busy information and meeting time suggestions.
2) Exchange Active Sync : Used mainly by the handheld devices, this service is used to push messages from the intranet to the devices using cellular / wireless network and SSL. Remote device wipe can also be initiated.
3) Outlook web access : A variety of authentication schemes together with light and full feature clients enable mailbox to be accessed via browser.
4) CAS proxy and redirection: Proxying makes another CAS server available when one is not directly reachable. Redirection informs the user of the correct OWA URL when they try to access a different one.
5) Intranet features include SharePoint and file share integration, conversion of PDF and Office attachments to HTML, and single sign-on for mailbox server access; most OWA configuration settings are stored in Active Directory.

4) High Availability includes features such as
1) no replication : Failover cluster is built using shared storage array and Microsoft Cluster service.
2) replication to a local disk set : partitions data for performance and recovery and adds data redundancy without service redundancy
3) replication to a standby server : Source server can be standalone. Target must be stand-alone and passive. Logs are copied, verified and replayed. Database is copied. There is a built-in delay for log replay activity.
4) replication within a cluster : Failover cluster built using Microsoft Windows cluster service with log replay and copying database. Hub Transport Server acts as the file share witness.

5) Hub Transport Server includes features such as :
1) Delivers messages directly between the source server and the target server, reducing the number of hops. Internally it has a pickup/replay directory, a submission queue, and a categorizer that takes messages from the submission queue and processes them by resolving recipients, routing, converting content, processing routed messages, and packaging the message before handing it to the delivery queue.
Courtesy : MSDN
 

Microsoft Exchange Server Architecture

Microsoft Exchange Servers are deployed to Active Directory sites, except for the perimeter role. The AD must have at least one domain controller in each domain and at least one global catalog server. Generally, there should be a 4:1 ratio of Exchange processors to global catalog server processors. There are several different roles included with Exchange Server.
These are :
1) Edge Transport Server Role:
The Edge Transport Server is deployed to the perimeter network, beyond which lies the Internet; the Exchange Hosted Services are hosted on the Internet. The Edge Transport server provides message hygiene and security over untrusted networks. The EdgeSync service pushes configuration information to the Edge server using secure LDAP, and the Edge server sends messages using SMTP over TLS. The Edge Transport server filters messages using antispam and antivirus agents, which comprise a connection filter, address rewriting agent, edge rule agent, sender ID agent, recipient/sender filter, content filter, attachment filter and virus scanning.
2) Hub Transport Server Role:
The Hub Transport Server role handles all e-mail flow inside the organization, applies transport rules, applies journaling policies and delivers messages to a recipient's mailbox.
3) Client Access Server Role: The client access server role supports the WebAccess and ActiveSync client applications, and the POP3 and IMAP4 protocols.
4) Mailbox Server Role :
The mailbox server role hosts mailbox and public folder databases. It also provides advanced scheduling services for Microsoft Office Outlook users, generates the offline address book, and provides services that calculate e-mail address policies and address lists for recipients.
5) Unified Messaging Server Role:
The unified messaging server role combines voice, fax, and e-mail messaging into a single infrastructure. These messages are all delivered to the user's inbox so that they can be accessed from a variety of devices. All unified messaging servers are placed in a central location, and IP gateways are enabled in each branch office.

Management and Monitoring components : Exchange Management and monitoring is made easy with tools such as management shell and console. This helps answer questions like are all Exchange Services running, are all databases active, do disks have enough space, can clients connect with reasonable performance, are the servers performing efficiently and reliably, are they configured correctly  and are they secure ?

High Availability : The Microsoft Exchange Server has a component for high-availability. It includes built in features that can provide quick recovery, high availability and site resiliency for mailbox servers. Availability is improved with no replication such as with single copy cluster (SCC)  or shared storage cluster feature, replication within a cluster using cluster continuous replication feature, replication to a standby server using standby continuous replication  and replication to a local disk set using local continuous replication.
 

Saturday, June 15, 2013

Finding the closest pair of points
Two points are closest based on their Euclidean distance, computed as sqrt((x1 - x2)^2 + (y1 - y2)^2). The naive approach compares every pair of points, exhausting all nC2 = n(n-1)/2 choices.
Instead we can use a divide and conquer algorithm whose running time satisfies T(n) = 2T(n/2) + O(n), which solves to O(n log n).
Let us take a subset P of points in Q and have two arrays with each holding all the points in P. Array X is sorted with monotonically increasing x coordinates and Y is sorted with monotonically increasing y coordinates. We keep the arrays presorted.
First the point set P is divided into two sets with a vertical line such that roughly half the points fall on each side. These are stored as two subarrays within X. Similarly, Y is stored as two subarrays containing the points sorted by monotonically increasing y-coordinate.
Next, we find the closest pair of points recursively first in the left subarray and then in the right subarray. The inputs to the first call are the subset PL and arrays XL and YL and this is repeated for the right. The minimum of the closest distances between the left and the right are chosen.
Combine: The closest pair is either the pair with distance delta found by one of the recursive calls, or a pair with one point in PL and the other in PR. To find such a pair, we create an array Y' containing only those points in Y that lie within the 2-delta-wide vertical strip centered on the dividing line, where delta is the minimum distance found so far. For each point p in Y', we compare it with the points that follow it in Y' and are within delta in the y-direction, keeping track of the smallest distance delta' found; each point needs to be compared with at most seven others, because only that many points can fit in the delta-by-2-delta rectangle. If delta' < delta, the pair found in the strip is the answer; otherwise the recursive calls have already produced the closest pair.
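A minimal sketch of this combine step, assuming strip already holds the points of the 2-delta strip sorted by y-coordinate; the method and parameter names are chosen for the illustration.

using System;
using System.Collections.Generic;

// Combine step of the closest-pair recursion: strip holds the points lying within
// delta of the dividing line, already sorted by y-coordinate; delta is the best
// distance found by the two recursive calls.
double CombineStrip(List<(double X, double Y)> strip, double delta)
{
    double best = delta;
    for (int i = 0; i < strip.Count; i++)
    {
        // Only the next few points (at most seven) can be within delta vertically.
        for (int j = i + 1; j < strip.Count && strip[j].Y - strip[i].Y < best; j++)
        {
            double dx = strip[i].X - strip[j].X;
            double dy = strip[i].Y - strip[j].Y;
            best = Math.Min(best, Math.Sqrt(dx * dx + dy * dy));
        }
    }
    return best;
}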

Finding the convex hull

The convex hull of a set Q of points is the smallest convex polygon for which each point in Q is either on the boundary or inside the polygon. We discuss two techniques for computing it: Graham's scan and Jarvis's march.
Graham's approach is based on the following steps:
1) Choose the point with the lowest y-coordinate and the left most in case of a tie as the starting point.
2) Push this point and the next two points visited in the counter clockwise order on stack S
3) For each remaining point, taken in order of increasing polar angle around the starting point: while the angle formed by the next-to-top of the stack, the top of the stack and the candidate point makes a non-left turn, pop the top of the stack;
then push the candidate point onto the stack and proceed.
When all points have been processed, the stack contains the convex hull vertices.
The complexity is O(nlogn)
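A compact sketch of the scan, assuming distinct input points and ignoring the tie-breaking for collinear points that a complete implementation needs:

using System;
using System.Collections.Generic;
using System.Linq;

// Returns the convex hull vertices in counter-clockwise order.
List<(double X, double Y)> GrahamScan(List<(double X, double Y)> points)
{
    // Cross product of (a - o) x (b - o): positive means the path o -> a -> b turns left.
    double Cross((double X, double Y) o, (double X, double Y) a, (double X, double Y) b)
        => (a.X - o.X) * (b.Y - o.Y) - (a.Y - o.Y) * (b.X - o.X);

    // Anchor: lowest y-coordinate, leftmost on ties.
    var p0 = points.OrderBy(p => p.Y).ThenBy(p => p.X).First();
    var sorted = points.Where(p => p != p0)
                       .OrderBy(p => Math.Atan2(p.Y - p0.Y, p.X - p0.X))
                       .ToList();

    var stack = new List<(double X, double Y)> { p0 };
    foreach (var p in sorted)
    {
        // Pop while the candidate makes a non-left turn with the top two stack entries.
        while (stack.Count >= 2 && Cross(stack[stack.Count - 2], stack[stack.Count - 1], p) <= 0)
            stack.RemoveAt(stack.Count - 1);
        stack.Add(p);
    }
    return stack;
}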
Jarvis' approach also known as package wrapping is based on the following steps:
We start with the lowest point and wrap around the set, building a sequence in which the next vertex of the hull is the point with the smallest polar angle with respect to the previous point; ties are broken by picking the point farthest from the previous one. Once we reach the highest vertex, the left chain is built back down to the starting point in the same way.
The complexity is O(nh), where n is the number of input points and h is the number of hull vertices.

Friday, June 14, 2013

Computational geometry is the branch of computer science that studies algorithms for solving geometric problems. The input to a problem in this area is typically a set of geometric objects such as a set of points, a set of line segments, or the vertices of a polygon in counter-clockwise order.
To determine whether a set of n line-segments contains any intersections, we review a technique called sweeping.
Before we discuss sweeping, let us use the following:
Two consecutive segments p0p1 and p1p2 turn right or left at p1 according to the sign of the cross product (p2 - p0) x (p1 - p0): positive means a right turn, negative a left turn.
Two line segments intersect each other if each segment straddles the line containing the other.
The procedure segments-intersect(p1, p2, p3, p4) determines whether the segments straddle each other by computing the four direction values (cross products) and checking that each pair has opposite signs. If any direction is zero, it additionally checks whether the corresponding endpoint is collinear with and lies on the other segment.
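A minimal sketch of this straddle test; the helper names are chosen for the illustration.

using System;

// Cross product (pk - pi) x (pj - pi): its sign gives the orientation of pk relative to segment pi-pj.
double Direction((double X, double Y) pi, (double X, double Y) pj, (double X, double Y) pk)
    => (pk.X - pi.X) * (pj.Y - pi.Y) - (pk.Y - pi.Y) * (pj.X - pi.X);

// pk is known to be collinear with segment pi-pj; check that it lies within the bounding box.
bool OnSegment((double X, double Y) pi, (double X, double Y) pj, (double X, double Y) pk)
    => Math.Min(pi.X, pj.X) <= pk.X && pk.X <= Math.Max(pi.X, pj.X)
    && Math.Min(pi.Y, pj.Y) <= pk.Y && pk.Y <= Math.Max(pi.Y, pj.Y);

bool SegmentsIntersect((double X, double Y) p1, (double X, double Y) p2,
                       (double X, double Y) p3, (double X, double Y) p4)
{
    double d1 = Direction(p3, p4, p1);
    double d2 = Direction(p3, p4, p2);
    double d3 = Direction(p1, p2, p3);
    double d4 = Direction(p1, p2, p4);

    // Each segment straddles the line containing the other.
    if (((d1 > 0 && d2 < 0) || (d1 < 0 && d2 > 0)) &&
        ((d3 > 0 && d4 < 0) || (d3 < 0 && d4 > 0)))
        return true;

    // Boundary cases: an endpoint is collinear with and lies on the other segment.
    if (d1 == 0 && OnSegment(p3, p4, p1)) return true;
    if (d2 == 0 && OnSegment(p3, p4, p2)) return true;
    if (d3 == 0 && OnSegment(p1, p2, p3)) return true;
    if (d4 == 0 && OnSegment(p1, p2, p4)) return true;
    return false;
}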
Now we look at the sweeping technique that describes whether any two line segments in a set of segments intersect.
In sweeping, an imaginary vertical sweep line passes through the given set of geometric objects, usually from the left to the right.  This technique provides a way of ordering the geometric objects by placing them in a dynamic data structure. We further assume that no input segment is vertical and that no three input segments intersect at a single point.
The first assumption tells us that any segment crossing a vertical sweep line intersects it at only one point.
Where line segments cross the sweep line, the intersection points are comparable and are ordered by increasing y-coordinate. Just past the point where two segments intersect each other, their order along the sweep line is reversed, so any sweep line placed on the far side of the intersection sees them in the opposite order.
 
Review of Text analytics 2011 talk by Seth Grimes
Text analysis adds value where transactional information stops. From the information retrieval perspective, people want to publish, manage, archive, index and search, categorize and classify, and extract metadata. Text analytics adds semantic understanding of named entities, pattern-based entities, concepts, facts and relationships, concrete and abstract attributes, subjectivity, etc. Text analytics applications let users search terms, retrieve material from large-scale structures, search features such as entities or topics, retrieve materials such as facts and relationships, group results based on topics and visually explore information. Some examples are SiloBreaker, FirstRain, Bing, Google, etc. Text analytics includes metadata and metadata population. Search results are measured based on precision and recall. Accuracy is measured with a combination of the two in a term called the f-score, defined as 2 * (precision * recall) / (precision + recall). Typical steps in text analytics include: identify and retrieve documents for analysis; apply techniques to discern, tag and extract entities; and apply techniques to classify documents and organize extracted features. BeyeNetwork and Ranks.NL are some examples of these. Applications such as Connexor and VisuWords display part-of-speech tagging and ontology. Search logs suggest that ...
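A tiny sketch of the f-score computation just described:

// F-score: the harmonic mean of precision and recall.
double FScore(double precision, double recall)
    => (precision + recall) == 0 ? 0.0 : 2 * precision * recall / (precision + recall);

// Example: precision 0.8 and recall 0.6 give an f-score of about 0.686.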

Thursday, June 13, 2013

Slide review of text analytics user perspectives on solution and providers by Seth Grimes ( Continued )
Text analysis involves statistical methods for a relative measure of the significance of words, first for individual words and then for sentences. Vector space models are used to represent documents for information retrieval, classification, and other tasks. The text content of a document is viewed as an unordered bag of words, and measures such as TF-IDF (term frequency-inverse document frequency) weight the terms so that documents can be compared in the vector space. Additional analytic techniques are then used to group the text and identify the salient topics.
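A minimal TF-IDF sketch, under the assumption that documents are already tokenized into lowercase terms; the method name is made up for the example.

using System;
using System.Collections.Generic;
using System.Linq;

// Computes a TF-IDF weight vector (term -> weight) for each document.
List<Dictionary<string, double>> ComputeTfIdf(List<string[]> documents)
{
    int n = documents.Count;

    // Document frequency: in how many documents does each term appear?
    var df = new Dictionary<string, int>();
    foreach (var doc in documents)
        foreach (var term in doc.Distinct())
            df[term] = df.TryGetValue(term, out var count) ? count + 1 : 1;

    var weights = new List<Dictionary<string, double>>();
    foreach (var doc in documents)
    {
        // Term frequency within this document.
        var tf = doc.GroupBy(t => t).ToDictionary(g => g.Key, g => (double)g.Count() / doc.Length);
        // Weight = tf * log(N / df): rare terms get boosted, ubiquitous terms approach zero.
        weights.Add(tf.ToDictionary(kv => kv.Key, kv => kv.Value * Math.Log((double)n / df[kv.Key])));
    }
    return weights;
}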
However, statistical methods have a hard time making sense of nuanced human language. Hence natural language processing is proposed, in which one step or a sequence (pipeline) of resolving steps is applied to the text. These include:
tokenization - identification of distinct elements
stemming - identification of variants of word bases
lemmatization - use of stemming together with analysis of context and parts of speech
entity recognition - lookup in lexicons and gazetteers and use of pattern matching
tagging - XML markup of distinct elements
Software using the above approaches has found applications in business, scientific and research problems. The application domains include brand management, competitive intelligence, content management, customer service, e-discovery, financial services, compliance, insurance, law enforcement, life sciences, product / service design, research, voice of the customer, etc. Text analytics solution providers include young and mature software vendors as well as software giants. Early adopters have very high expectations of the return on investment from text analytics. Among the adopters surveyed, some more findings are as follows:
Bulk of the text analytics users have been using it for four years or more.
Primary uses include brand management, competitive intelligence, customer experience, voice of the customer, Research and together they represent more than 50% of all applications.
Textual information sources are primarily blogs, news articles, e-mails, forums, surveys and technical literature.
Return on investment is measured by increased sales to existing customers, higher satisfaction ratings, new-customer acquisition, and higher customer retention.
A third of all spenders had budget below $50,000 and a quarter used open-source.
Software users also showed likes and dislikes primarily on flexibility, effectiveness, accuracy, and ease of use.
More than 70% of the users wanted the ability to extend the analytics to named entities such as people, companies, geographic locations, brands, ticker symbols, etc.
Ability to use specialized dictionaries, taxonomies, or extraction rules was considered more important than others.
This study was conducted in 2009.
 

A moving point within a bounded square

Q: You are given a bounded square with Cartesian coordinates 0 to 99 along each axis and a point within the square that moves in straight lines and rebounds off the edges. Write a function that gives the new position of the point on each update.
A: Here's the code for the update method where previous and current are two points on the  board.

Point Update(Point previous, Point current)
{
    // The point keeps moving with the displacement (current - previous)
    // and reflects off the 0..99 bounds of the square.
    int dx = current.X - previous.X;
    int dy = current.Y - previous.Y;

    // If the point has not moved yet, place it at a random position to get it started.
    if (dx == 0 && dy == 0)
    {
        var random = new Random();
        return new Point(random.Next(100), random.Next(100));
    }

    int nextX = current.X + dx;
    int nextY = current.Y + dy;

    // Reflect off the right and left edges.
    if (nextX > 99) nextX = 99 - (nextX - 99);
    if (nextX < 0) nextX = -nextX;

    // Reflect off the top and bottom edges.
    if (nextY > 99) nextY = 99 - (nextY - 99);
    if (nextY < 0) nextY = -nextY;

    return new Point(nextX, nextY);
}
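A short usage sketch, assuming System.Drawing.Point and the Update method above:

// Advance the point a few steps, starting at (10, 10) and heading toward (12, 15).
var previous = new Point(10, 10);
var current = new Point(12, 15);
for (int step = 0; step < 5; step++)
{
    var next = Update(previous, current);
    previous = current;
    current = next;
}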