Sunday, November 5, 2017

#classifier
Another way to do k-means: cexamples/classifier.c
but the unit tests are still missing (sigh)

Yesterday we discussed data virtualization, which also helps when visualizing data. In fact, visualization is an important functional area of software development, and many tools have been written to find knowledge in vast sets of data.
Today we explore data visualization. This is what distinguishes Data Mining from machine learning.
While machine learning uses concepts such as supervised and unsupervised classifiers, it can be understood as a set of algorithms. Data Mining on the other hand uses those and other algorithms in conjunction with a database so that the data can be queried to yield the result set that summarizes the findings. These result sets can then be drawn on charts and represented on dashboards.
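As a small illustration of that pairing, the sketch below stores the labels produced by some classifier in a database table and then queries it for a result set that a chart or dashboard could draw. It is only a sketch under assumptions: the table name `predictions`, its columns and the sample rows are invented for this example, and an in-memory SQLite database stands in for whatever store is actually used.

```python
import sqlite3

# Hypothetical classifier output: (item id, predicted label, confidence).
# These rows are invented purely for illustration.
predictions = [
    (1, "churn", 0.91),
    (2, "stay", 0.75),
    (3, "churn", 0.64),
    (4, "stay", 0.88),
    (5, "stay", 0.52),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions (item_id INTEGER PRIMARY KEY, label TEXT, confidence REAL)"
)
conn.executemany("INSERT INTO predictions VALUES (?, ?, ?)", predictions)

# The result set summarizes the findings: one row per label.
summary = conn.execute(
    "SELECT label, COUNT(*) AS items, MIN(confidence) AS lowest_confidence "
    "FROM predictions GROUP BY label ORDER BY label"
).fetchall()

for label, items, lowest_confidence in summary:
    print(label, items, lowest_confidence)
```

Each row of `summary` is the kind of aggregate that would feed a bar on a chart, while the algorithm that produced the labels stays outside the database.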
Yet data mining and machine learning are separate domains in themselves. Machine learning may find use with text analysis, images and other static data that is not represented in tables. Data Mining, on the other hand, translates most data into something that can be stored in a database, and this has worked well for organizations that want to safeguard their data. Moreover, we can view the difference as top-down versus bottom-up. For example, when we use statistics to build a regression model, we bind different parameters together to mean something and then tune the model with experimental data. A machine learning algorithm, on the other hand, builds a model such as a decision tree classifier from the data as it is made available. The output from a machine learning algorithm may be input for a data mining process. Some machine learning algorithms are forms of batch processing, while data mining techniques may be applied in a streaming manner.
Both data mining and machine learning tools have been domain specific, such as in the finance, retail or telecommunications industries. These tools integrate the domain specific knowledge with data analysis techniques to answer usually very specific queries.
Tools are evaluated on data types, system issues, data sources, data mining functions, coupling with a database or data warehouse, scalability, visualization and user interface. Among these, visual data mining is popular for its designer-style user interface that renders data, results and process in a graphical and usually interactive presentation.
Visualization tools such as the Grafana stack, used for viewing elaborate charts and eye candy, only require read permissions on the data because they simply execute queries to fetch the data for the charts.


Saturday, November 4, 2017

Data Virtualization deep dive
Data evolves over time and with the introduction of new processes. As data ages, it becomes difficult to re-organize it. In some cases, the data is actively used by the business that may not even permit a downtime. Moreover, as data grows, it may be repurposed with changing requirements. As more and more departments and organizations visit the data, it may require separation of concerns. For example, an organization may want to see a customer's identity but not his or her credit cards. Similarly, another might want to see the items purchased by a customer but not the shipping addresses. Data also explodes at a phenomenal rate and once it starts accruing it does not stop until the business shuts down.
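The separation of concerns in the two examples above can be sketched with database views. This is a minimal illustration, not a security design: the table and view names are made up, SQLite is used only because it runs in memory, and in a real server the enforcement would come from granting each department access to its view rather than to the base tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT,
    card_number TEXT,
    shipping_address TEXT
);
CREATE TABLE purchases (customer_id INTEGER, item TEXT);

-- One organization sees who the customer is, but not the credit card.
CREATE VIEW customer_identity AS
    SELECT id, name FROM customers;

-- Another sees what was purchased, but not where it was shipped.
CREATE VIEW customer_purchases AS
    SELECT c.id, c.name, p.item
    FROM customers c JOIN purchases p ON p.customer_id = c.id;
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', '4111-XXXX', '1 Main St')")
conn.execute("INSERT INTO purchases VALUES (1, 'keyboard')")

# Column names exposed by the first view, and the rows of the second.
identity_columns = [d[0] for d in conn.execute("SELECT * FROM customer_identity").description]
purchase_rows = conn.execute("SELECT * FROM customer_purchases").fetchall()
print(identity_columns)
print(purchase_rows)
```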
Organizations struggle to tame the data with compartmentalized databases. Databases are convenient for storing data because they ensure atomicity, consistency, isolation and durability. They are also extremely performant and efficient in how the data is stored physically and accessed over the web. By separating databases for different purposes, companies try to stay nimble and reduce the time it takes to release operations to production. However, this only helps with expediting new offerings to market. It does not handle data analysis and insights. Consequently, data is staged from operations for loading into a warehouse, which is better suited to gathering all the data for analysis. Even then, the warehouses proliferate. In addition, workflows that extract-transform-load the data between operational databases are reused across different databases, which makes more copies of the data. Syntax and semantics vary for the same entity from database to database. Databases also become distributed and separated over regions, requiring web services to pull and process data.
There are many types of databases used by companies because they serve different purposes. A relational database organizes data for efficient querying. A NoSQL database organizes data for large scale distributed batch processing. A graph database persists many forms of relationships between entities. Databases fragment the view of data from the perspective of the business domain. This calls for a unified experience regardless of where or how the data is stored. Data Virtualization tries to address this with consistent, wholesome, unified views and manipulation. It introduces a platform and tool that abstracts away the real topology of how data is organized.
The word virtual indicates that we are no longer looking at the physical representation and are instead looking at the semantics. With data virtualization, we can explore and discover related information. We can also view the entire collection of databases as a unified repository. The actual data source may not just be a database. It could be a database, a data warehouse, an Online Analytical Processing (OLAP) application, web services, Software-as-a-Service, a NoSQL database or any mix of these.
A certain degree of consolidation and consistency is preferred by data virtualization users. It is easier to query everything with the same syntax than to change it over and over again. Even though virtualization may aim to span a vast breadth of technologies and software stacks, it cannot be a panacea. Therefore, virtualization runs the risk of becoming fragmented just like databases. Some have taken this question a step further. Can each database come with its own logic and be granular enough to be made available over the web? In other words, can each data source be a service in itself, so that databases and data virtualization are no longer the frontend for users, and users can instead mix and match different data sources with the same programmability over the web? This so-called microservices architecture puts clean boundaries around the source of truth and still manages to hide the complexity of a farm or a cluster behind the service. While services are great for programmers, they are not intended for users who want to work with the data visually using a tool. Therefore, data virtualization has moved even closer to the user by pushing the microservices down to being a source of data. Finally, data virtualization comes with extensive capabilities to browse and search the data.

Friday, November 3, 2017

We were discussing MDM usage. MDM users still prefer to use MS Excel. This introduces ETL-based workflows and siloed views of data. Materialized views don't help because they are not updated in time. Also, any separation of data manipulation into stages introduces human errors and inconsistencies, in addition to delaying access to the data. The logic in the ETL also has to be idempotent because it is needlessly exercised in full even if there are only a few rows to insert. Moreover, the operation on each row now has to be made robust by making sure the corresponding row does not already exist. For example, to move a record from source to destination, there must be a check to see if it already exists in the destination before it is inserted there and deleted from the source. The delete cannot happen unless the record has already been inserted, and each of these operations has to be done for each row. Error checking for the workflow now includes checks against duplicate entries, against syntactically or semantically equivalent entries, and that the progression of state for an entry is forward only. These kinds of checks could all be avoided if the work were left to a service rather than an ETL workflow, but programmatic access to the service is not always preferred, so we have ADO.NET clients or cursor-like tools that translate to LINQ queries.
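The per-row move described above can be sketched as follows. This is an illustration under assumptions rather than a real ETL tool: the tables `source` and `destination`, the `payload` column and the function name `move_record` are all hypothetical, and SQLite stands in for the actual databases.

```python
import sqlite3

def move_record(conn, record_id):
    """Move one row from `source` to `destination`; safe to re-run."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT id, payload FROM source WHERE id = ?", (record_id,)
        ).fetchone()
        if row is None:
            return False  # nothing left to move (already moved or never existed)
        exists = conn.execute(
            "SELECT 1 FROM destination WHERE id = ?", (record_id,)
        ).fetchone()
        if exists is None:
            conn.execute("INSERT INTO destination (id, payload) VALUES (?, ?)", row)
        # Delete only after the row is known to be in the destination.
        conn.execute("DELETE FROM source WHERE id = ?", (record_id,))
        return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE destination (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO source VALUES (1, 'a'), (2, 'b')")
# Simulate an earlier run that copied row 2 but failed before deleting it.
conn.execute("INSERT INTO destination VALUES (2, 'b')")

first_run = [move_record(conn, i) for i in (1, 2)]
second_run = [move_record(conn, i) for i in (1, 2)]
moved = conn.execute("SELECT id, payload FROM destination ORDER BY id").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM source").fetchone()[0]
```

Running the move a second time changes nothing, which is the idempotence the workflow needs. The cost is visible too: every row pays for an existence check, an insert and a delete.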
One of the ways MDM users overcome this challenge is cited in the example from Denodo. 
Denodo is a data virtualization platform, which means that it lets you work seamlessly with data regardless of which database the data is physically located in. It gives you the ability to access complete information with business entities and pre-integrated views. It allows you to explore related information via discovery and self-service. It lets you access data in real time from different apps and devices. Basically, it avoids point-to-point integration, such as the ETL workflows that IT builds case by case for business departments. As an abstraction layer, it gives a unified repository view regardless of whether the actual data sources are databases, warehouses, OLAP, applications, web services, SaaS or NoSQL. Each CRUD operation on the unified data can then be executed against its respective data source.
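To make the idea of an abstraction layer concrete, here is a toy sketch of a unified view over two different kinds of sources. It is not Denodo's API or architecture: the class `VirtualView`, its methods and the sample data are hypothetical, only reads are shown, and a plain Python list stands in for a web service response.

```python
import sqlite3

class VirtualView:
    """Hypothetical abstraction layer: one entity name, many physical sources."""

    def __init__(self):
        self._readers = {}  # entity name -> callable returning a list of dict rows

    def register(self, entity, reader):
        self._readers[entity] = reader

    def query(self, entity, **filters):
        # The caller uses the same syntax no matter where the rows come from.
        rows = self._readers[entity]()
        return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]

# Source 1: a relational database (in-memory SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace')")

def read_customers():
    cur = db.execute("SELECT id, name FROM customers")
    return [{"id": i, "name": n} for i, n in cur]

# Source 2: a pretend web-service payload (a list stands in for the HTTP call).
def read_orders():
    return [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 2}]

view = VirtualView()
view.register("customers", read_customers)
view.register("orders", read_orders)

ada = view.query("customers", name="Ada")
ada_orders = view.query("orders", customer_id=ada[0]["id"])
```

Writes could be routed the same way, with each entity registering a writer that executes the operation against its own source.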
If we compare this model with OData, which sought to expose the database directly to the web so that users can do much the same thing, we realize that both the interface Denodo uses against all data sources regardless of origin and the REST-based OData interface correspond to standard DML statements on a database. It would be ideal if every database vendor also supported OData browsability.
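The correspondence between REST verbs and DML can be sketched roughly as below. OData uses GET, POST, PATCH and DELETE for read, create, update and delete. Everything else here is an assumption made for illustration: the function `to_dml`, the entity set `Products`, and the convention that the key column is called `id`. A real OData service also handles query options such as `$filter`, which this sketch ignores.

```python
# Rough correspondence between OData-style REST requests and SQL DML.
ODATA_TO_DML = {
    "GET": "SELECT",
    "POST": "INSERT",
    "PATCH": "UPDATE",
    "DELETE": "DELETE",
}

def to_dml(method, entity_set, key=None, fields=None):
    """Return a parameterized SQL string for a simplified OData-style request."""
    verb = ODATA_TO_DML[method]
    fields = fields or []
    if verb == "SELECT":
        # GET /Products reads the set; GET /Products(1) reads one entity.
        where = " WHERE id = ?" if key is not None else ""
        return f"SELECT * FROM {entity_set}{where}"
    if verb == "INSERT":
        cols = ", ".join(fields)
        marks = ", ".join("?" for _ in fields)
        return f"INSERT INTO {entity_set} ({cols}) VALUES ({marks})"
    if verb == "UPDATE":
        sets = ", ".join(f"{f} = ?" for f in fields)
        return f"UPDATE {entity_set} SET {sets} WHERE id = ?"
    return f"DELETE FROM {entity_set} WHERE id = ?"

statements = [
    to_dml("GET", "Products", key=1),
    to_dml("POST", "Products", fields=["name", "price"]),
    to_dml("PATCH", "Products", key=1, fields=["price"]),
    to_dml("DELETE", "Products", key=1),
]
for s in statements:
    print(s)
```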
#codingexercise
Segregate and sort odd and even numbers on either side of the input array 
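// Requires: using System.Collections.Generic; using System.Linq;
void SegregateAndSort(List<int> input)
{
    // x % 2 != 0 also treats negative odd numbers as odd (-3 % 2 is -1 in C#).
    var oddCount = input.Count(x => x % 2 != 0);
    int i = 0;      // next slot in the odd region [0, oddCount)
    int j = i + 1;  // scans ahead for an odd number to pull forward
    while (i < oddCount && j < input.Count)
    {
        if (input[i] % 2 == 0 && input[j] % 2 != 0)
        {
            // even at i, odd at j: exchange them
            (input[i], input[j]) = (input[j], input[i]);
            i = i + 1;
            j = j + 1;
            continue;
        }
        if (input[j] % 2 == 0)
        {
            j = j + 1;
            continue;
        }
        if (input[i] % 2 != 0)
        {
            i = i + 1;
            if (j <= i)
                j = i + 1;
            continue;
        }
    }
    // Sort the odd prefix and the even suffix independently (null selects the default comparer).
    input.Sort(0, oddCount, null);
    input.Sort(oddCount, input.Count - oddCount, null);
}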


Thursday, November 2, 2017

We were discussing Master Data Management. Some of the top players in this space include Informatica, IBM InfoSphere, Microsoft, SAP Master and Riversand. Informatica offers an end-to-end MDM solution with an ecosystem of applications. It does not require the catalog to be in a single domain. InfoSphere has been a long-time player, and its product is considered mature, with more power in its collaborative and operational capabilities. It plays well with other IBM solutions and their ecosystem. SAP consolidates governance over the master data with an emphasis on data quality and consistency. It supports collaborative workflows and is noted for supplier-side features such as supplier onboarding. Microsoft's data services, which include SQL Server, make it easy to create master lists of data, with the benefit that the data is made reliable and centralized so that it can participate in intelligent analysis. Most products require some changes to existing workflows to enable customers to make the transition.
The trouble with MDM users is that they still prefer to use MS Excel. This introduces ETL-based workflows and siloed views of data. Materialized views don't help because they are not updated in time. Also, any separation of data manipulation into stages introduces human errors and inconsistencies, in addition to delaying access to the data. The logic in the ETL also has to be idempotent because it is needlessly exercised in full even if there are only a few rows to insert. Moreover, the operation on each row now has to be made robust by making sure the corresponding row does not already exist. For example, to move a record from source to destination, there must be a check to see if it already exists in the destination before it is inserted there and deleted from the source. The delete cannot happen unless the record has already been inserted, and each of these operations has to be done for each row. Error checking for the workflow now includes checks against duplicate entries, against syntactically or semantically equivalent entries, and that the progression of state for an entry is forward only. These kinds of checks could all be avoided if the work were left to a service rather than an ETL workflow, but programmatic access to the service is not always preferred, so we have ADO.NET clients or cursor-like tools that translate to LINQ queries.
#codingexercise
Two of the nodes of a BST are swapped. Correct the BST: 

// performed during inorder traversal
void CorrectBSTHelper(Node root, ref Node prev, ref Node first, ref Node middle, ref Node second)
{
    if (root == null) return;
    // Inorder: left subtree, current node, right subtree
    CorrectBSTHelper(root.left, ref prev, ref first, ref middle, ref second);

    if (prev != null && root.data < prev.data)
    {
        if (first == null)
        {
            // first violation: prev is out of place; root is its partner only if the swapped nodes are adjacent
            first = prev;
            middle = root;
        }
        else
        {
            // second violation: root is the other swapped node
            second = root;
        }
        // We can only check prev and not next, but the incorrect node may be the first,
        // in the middle, or the last of a sequence.
        // When next becomes root, prev is already not null, so we can change the above to checking
        // if prev == null
        // if prev != null && root < prev
        // if prev != null && root > prev
        // and return first and second only as found.
        // This avoids having to keep track of middle.
    }
    prev = root;
    CorrectBSTHelper(root.right, ref prev, ref first, ref middle, ref second);
}

void CorrectBST(Node root)
{
    Node first = null;
    Node middle = null;
    Node second = null;
    Node prev = null;
    CorrectBSTHelper(root, ref prev, ref first, ref middle, ref second);
    // Swap the data of exactly one pair: (first, second) if two violations were seen,
    // otherwise (first, middle) for adjacent nodes.
    if (first != null && second != null)
        (first.data, second.data) = (second.data, first.data);
    else if (first != null && middle != null)
        (first.data, middle.data) = (middle.data, first.data);
}

      4
   2     6
 1   5 3   7
InOrder : 1 2 5 4 3 6 7

Wednesday, November 1, 2017

#codingexercise
Two of the nodes of a BST are swapped. Correct the BST: 

We could also detect this during the inorder traversal. We keep track of the previous and next nodes visited, and the two times we see a violation of previous < current or current < next, we record the nodes and then swap their data.
Alternatively,
Node CorrectBST(Node root)
{
    if (root == null) return null;
    var inorder = new List<Node>();
    InOrder(root, ref inorder);
    var swapped = FindSwappedAsTupleFromSequence(inorder);
    // swap the data of the two nodes.
    var temp = swapped.first.data;
    swapped.first.data = swapped.second.data;
    swapped.second.data = temp;
    return root;
}
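The list-based approach leaves the search for the swapped pair to a helper. One way that helper might work is sketched below in Python. The names `Node`, `in_order`, `find_swapped` and `correct_bst` are illustrative only, and the sketch assumes exactly two nodes were exchanged and that all values are distinct.

```python
class Node:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def in_order(root, out):
    """Append the nodes of the tree to `out` in in-order sequence."""
    if root is None:
        return
    in_order(root.left, out)
    out.append(root)
    in_order(root.right, out)

def find_swapped(nodes):
    """Return the two out-of-place nodes in an in-order sequence, or None."""
    first = second = None
    for prev, cur in zip(nodes, nodes[1:]):
        if cur.data < prev.data:
            if first is None:
                first = prev   # larger element of the first inversion
            second = cur       # smaller element of the latest inversion
    return (first, second) if first is not None else None

def correct_bst(root):
    nodes = []
    in_order(root, nodes)
    swapped = find_swapped(nodes)
    if swapped:
        a, b = swapped
        a.data, b.data = b.data, a.data
    return root

# A BST holding 1..7 rooted at 4, with the values 3 and 5 exchanged.
root = Node(4, Node(2, Node(1), Node(5)), Node(6, Node(3), Node(7)))
before = []
in_order(root, before)
before_values = [n.data for n in before]
correct_bst(root)
after = []
in_order(root, after)
after_values = [n.data for n in after]
```

When the swapped nodes are adjacent in the sequence there is only one inversion, and `first` and `second` come from that single pair, so no separate middle pointer is needed.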

Tuesday, October 31, 2017

We were discussing Master Data Management. Some of the top players in this space include Informatica, IBM InfoSphere, Microsoft, SAP Master and Riversand. Informatica offers an end-to-end MDM solution with an ecosystem of applications. It does not require the catalog to be in a single domain. InfoSphere has been a long-time player, and its product is considered mature, with more power in its collaborative and operational capabilities. It plays well with other IBM solutions and their ecosystem. SAP consolidates governance over the master data with an emphasis on data quality and consistency. It supports collaborative workflows and is noted for supplier-side features such as supplier onboarding. Microsoft's data services, which include SQL Server, make it easy to create master lists of data, with the benefit that the data is made reliable and centralized so that it can participate in intelligent analysis. Most products require some changes to existing workflows to enable customers to make the transition.

#codingexercise
Two of the nodes of a BST are swapped. Correct the BST: 
Node CorrectBST(Node root)
{
    if (root == null) return null;
    var inorder = new List<Node>();
    InOrder(root, ref inorder);
    var swapped = FindSwappedAsTupleFromSequence(inorder);
    // swap the data of the two nodes.
    var temp = swapped.first.data;
    swapped.first.data = swapped.second.data;
    swapped.second.data = temp;
    return root;
}

Monday, October 30, 2017

We were discussing storing the product catalog in the cloud. MongoDB provides cloud scale deployment. Workflows to get data into MongoDB, however, required organizations to come up with Extract-Transform-Load operations and did not deal with JSON natively. Cloud scale deployment of MongoDB lets us have the benefit of relational as well as document stores. Catalogs are indeed considered a collection of documents, which are reclassified or elaborately named and tagged items. These digital assets also need to be served via browse and search operations, which differ in their queries. Cloud scale deployments come with extensive monitoring support. This kind of monitoring specifically looks at data size versus disk size, active set size versus RAM size, disk IO and write locks, and generally accounts for and tests the highest possible traffic. Replicas are added to reduce latency and increase read capacity.


#codingexercise
We were discussing how to find whether a given tree is a subtree of another tree. We gave a recursive solution and an iterative solution. The recursive solution did not address the case when the subtree's leaves are not leaves in the original tree. Since we know that the original and the subtree are distinct, we can improve the recursive solution by adding a condition that returns true if the subtree node is null and the original is not.
We could also make this more lenient by admitting mirror trees as well. A mirror tree is one where left and right may be swapped.

      4
   2     6
 1   3 5   7
InOrder : 1 2 3 4 5 6 7
bool isEqual(Node root1, Node root2)
{
    // root1 walks the original tree, root2 walks the candidate subtree
    if (root1 == null && root2 == null) return true;
    if (root1 == null) return false;
    if (root2 == null) return true; // modified: the subtree may end before the original does
    if (root1.data != root2.data) return false;
    // accept either the same orientation or the mirrored one
    return (isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right))
        || (isEqual(root1.left, root2.right) && isEqual(root1.right, root2.left));
    // as opposed to return root1.data == root2.data && isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right);
}