Friday, November 3, 2017

We were discussing MDM usage. MDM users still prefer to use MS Excel. This introduces ETL-based workflows and siloed views of data. Materialized views don't help because they are not updated in time. Also, any separation of stages in data manipulation introduces human errors and inconsistencies, in addition to delaying access to the data. The logic in the ETL must also be made idempotent, because it is needlessly re-executed even when there are only a few rows to insert. Moreover, the operation on each row now has to be made robust by making sure the corresponding row does not already exist. For example, to move a record from source to destination, there must be a check to see whether it already exists in the destination before inserting it there and deleting it from the source. The delete cannot happen unless the record has already been inserted, and each of these operations has to be done for each row. Error checking for the workflow now includes checks against duplicate entries, syntactically or semantically equivalent entries, and the progression of state for an entry being forward only. These kinds of checks could all be avoided if the work were left to a service rather than an ETL workflow, but programmatic access to the service is not always preferred, so we have ADO.NET clients or cursor-like tools that translate to LINQ queries.
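As a sketch of that robustness requirement, the move can be written as an insert-if-absent followed by a guarded delete inside one transaction, so a re-run does no harm. The Source and Destination tables, the Id key, and the connection string below are hypothetical; only the pattern matters.

using System.Data.SqlClient;

class RecordMover
{
    // Idempotent move: safe to re-run because the insert checks for an existing
    // destination row and the delete only fires once the destination row exists.
    public static void MoveRecord(string connectionString, int recordId)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var txn = conn.BeginTransaction())
            {
                var insert = new SqlCommand(
                    @"INSERT INTO Destination (Id, Payload)
                      SELECT Id, Payload FROM Source
                      WHERE Id = @id
                        AND NOT EXISTS (SELECT 1 FROM Destination WHERE Id = @id);",
                    conn, txn);
                insert.Parameters.AddWithValue("@id", recordId);
                insert.ExecuteNonQuery();

                var delete = new SqlCommand(
                    @"DELETE FROM Source
                      WHERE Id = @id
                        AND EXISTS (SELECT 1 FROM Destination WHERE Id = @id);",
                    conn, txn);
                delete.Parameters.AddWithValue("@id", recordId);
                delete.ExecuteNonQuery();

                txn.Commit();
            }
        }
    }
}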
One of the ways MDM users overcome this challenge is illustrated by the example of Denodo.
Denodo is a data virtualization platform, which means that it lets you seamlessly work with data regardless of which database the data is physically located in. It gives you the ability to access complete information through business entities and pre-integrated views. It allows you to explore related information via discovery and self-service. It lets you access data in real time from different apps and devices. Basically, it avoids the point-to-point integrations, such as ETL workflows built by IT for case-by-case usage by business departments. As an abstraction layer, it gives a unified repository view regardless of whether the actual data sources are databases, warehouses, OLAP, applications, web services, SaaS, or NoSQL. Each CRUD operation on the unified data can then be executed against the respective data sources.
If we compare this model with OData, which sought to expose the database directly to the web so that users can do pretty much the same thing, then we realize that both the interface used by Denodo against all data sources regardless of origin and the OData REST-based interface correspond to standard DML statements on a database. It would be ideal if every database vendor also supported OData browsability.
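For a flavor of that browsability, each OData request reduces to a DML statement. The service URL below is hypothetical; $filter, $orderby, and $top are standard OData query options.

using System;
using System.Net.Http;
using System.Threading.Tasks;

// Each OData request maps onto a standard DML statement:
//   GET    /Products?$filter=Price gt 10  ->  SELECT ... WHERE Price > 10
//   POST   /Products                      ->  INSERT
//   PATCH  /Products(1)                   ->  UPDATE
//   DELETE /Products(1)                   ->  DELETE
class ODataBrowse
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // hypothetical OData service endpoint
            var json = await client.GetStringAsync(
                "https://example.com/odata/Products?$filter=Price gt 10&$orderby=Name&$top=5");
            Console.WriteLine(json);
        }
    }
}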
#codingexercise
Segregate and sort odd and even numbers on either side of the input array 
void SegregateAndSort(ref List<int> input)
{
    int oddCount = input.Count(x => x % 2 == 1);
    int i = 0;
    int j = i + 1;
    while (i < oddCount && j < input.Count)
    {
        // an even number on the odd side paired with an odd number on the even side: swap them
        if (input[i] % 2 == 0 && input[j] % 2 == 1)
        {
            Swap(ref input, i, j);
            i = i + 1;
            j = j + 1;
            continue;
        }
        // even element already on the even side: move j forward
        if (input[j] % 2 == 0)
        {
            j = j + 1;
            continue;
        }
        // odd element already on the odd side: move i forward
        if (input[i] % 2 == 1)
        {
            i = i + 1;
            if (j <= i)
                j = i + 1;
            continue;
        }
    }
    // sort each side separately (the range overload of List<T>.Sort takes a comparer)
    input.Sort(0, oddCount, Comparer<int>.Default);
    input.Sort(oddCount, input.Count - oddCount, Comparer<int>.Default);
}
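The Swap helper used above is assumed; a minimal version exchanges two elements of the list in place:

void Swap(ref List<int> input, int i, int j)
{
    int temp = input[i];
    input[i] = input[j];
    input[j] = temp;
}

For example, {2, 1, 4, 3, 5} has oddCount = 3 and comes out as {1, 3, 5, 2, 4}: odds sorted on the left, evens sorted on the right.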


Thursday, November 2, 2017

We were discussing Master Data Management. Some of the top players in this space include Informatica, IBM InfoSphere, Microsoft, SAP Master Data Governance, and Riversand. Informatica offers an end-to-end MDM solution with an ecosystem of applications. It does not require the catalog to be in a single domain. InfoSphere has been a long-time player and its product is considered mature, with more power for collaborative and operational capabilities. It plays well with other IBM solutions and their ecosystem. SAP consolidates governance over the master data with an emphasis on data quality and consistency. It supports workflows that are collaborative and is noted for supplier-side features such as supplier onboarding. Microsoft Master Data Services, which is included with SQL Server, makes it easy to create master lists of data, with the benefit that the data is made reliable and centralized so that it can participate in intelligent analysis. Most products require changes to existing workflows to some degree to enable customers to make the transition.
The trouble with MDM users is that they still prefer to use MS Excel. This introduces ETL-based workflows and siloed views of data. Materialized views don't help because they are not updated in time. Also, any separation of stages in data manipulation introduces human errors and inconsistencies, in addition to delaying access to the data. The logic in the ETL must also be made idempotent, because it is needlessly re-executed even when there are only a few rows to insert. Moreover, the operation on each row now has to be made robust by making sure the corresponding row does not already exist. For example, to move a record from source to destination, there must be a check to see whether it already exists in the destination before inserting it there and deleting it from the source. The delete cannot happen unless the record has already been inserted, and each of these operations has to be done for each row. Error checking for the workflow now includes checks against duplicate entries, syntactically or semantically equivalent entries, and the progression of state for an entry being forward only. These kinds of checks could all be avoided if the work were left to a service rather than an ETL workflow, but programmatic access to the service is not always preferred, so we have ADO.NET clients or cursor-like tools that translate to LINQ queries.
#codingexercise
Two of the nodes of a BST are swapped. Correct the BST: 

// performed during inorder traversal
void CorrectBSTHelper(Node root, ref Node prev, ref Node first, ref Node middle, ref Node second)
{
    if (root == null) return;
    // inorder: visit the left subtree first
    CorrectBSTHelper(root.left, ref prev, ref first, ref middle, ref second);

    if (prev != null && root.data < prev.data)
    {
        if (first == null)
        {
            first = prev;
            middle = root;
        }
        else
        {
            second = root;
        }
        // we can only check prev and not next, but the incorrect node may be the first, in the middle, or the last of a sequence.
        // when next becomes root, prev is already not null, so we can change the above to checking
        // if prev == null
        // if prev != null && root.data < prev.data
        // if prev != null && root.data > prev.data
        // and return first and second only as found.
        // This avoids having to keep track of middle.
    }
    prev = root;
    CorrectBSTHelper(root.right, ref prev, ref first, ref middle, ref second);
}

void CorrectBST(Node root)
{
    Node first = null;
    Node middle = null;
    Node second = null;
    Node prev = null;
    CorrectBSTHelper(root, ref prev, ref first, ref middle, ref second);
    // two violations found: the swapped nodes were not adjacent in the inorder sequence
    if (first != null && second != null)
        Swap(first, second);
    // one violation found: the swapped nodes were adjacent
    else if (first != null && middle != null)
        Swap(first, middle);
    return;
}

       4
     2   6
    1 5 3 7
InOrder: 1 2 5 4 3 6 7 (5 and 3 are the swapped nodes)

Wednesday, November 1, 2017

#codingexercise
Two of the nodes of a BST are swapped. Correct the BST: 

We could also detect this during an inorder traversal. We keep track of the previous and next nodes visited, and the two times we see violations of previous < current or current < next, we record the offending nodes and swap their data.
Alternatively, 
Node CorrectBST(Node root)
{
    if (root == null) return null;
    var inorder = new List<Node>();
    InOrder(root, ref inorder);
    var swapped = FindSwappedAsTupleFromSequence(inorder);
    // swap the data of the two nodes
    var temp = swapped.first.data;
    swapped.first.data = swapped.second.data;
    swapped.second.data = temp;
    return root;
}
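FindSwappedAsTupleFromSequence is left undefined above. A minimal sketch, assuming a simple pair type with first and second fields to match the call sites, scans the inorder sequence for the order violations:

class NodePair { public Node first; public Node second; }

NodePair FindSwappedAsTupleFromSequence(List<Node> inorder)
{
    var result = new NodePair();
    for (int i = 0; i + 1 < inorder.Count; i++)
    {
        if (inorder[i].data > inorder[i + 1].data)
        {
            // first violation: the earlier node is out of place
            if (result.first == null)
                result.first = inorder[i];
            // last violation: the later node is out of place
            result.second = inorder[i + 1];
        }
    }
    return result;
}

When the swapped nodes are adjacent in the inorder sequence there is only one violation, and the pair is taken from its two sides.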

Tuesday, October 31, 2017

We were discussing Master Data Management. Some of the top players in this space include Informatica, IBM InfoSphere, Microsoft, SAP Master Data Governance, and Riversand. Informatica offers an end-to-end MDM solution with an ecosystem of applications. It does not require the catalog to be in a single domain. InfoSphere has been a long-time player and its product is considered mature, with more power for collaborative and operational capabilities. It plays well with other IBM solutions and their ecosystem. SAP consolidates governance over the master data with an emphasis on data quality and consistency. It supports workflows that are collaborative and is noted for supplier-side features such as supplier onboarding. Microsoft Master Data Services, which is included with SQL Server, makes it easy to create master lists of data, with the benefit that the data is made reliable and centralized so that it can participate in intelligent analysis. Most products require changes to existing workflows to some degree to enable customers to make the transition.

#codingexercise
Two of the nodes of a BST are swapped. Correct the BST: 
Node CorrectBST(Node root)
{
    if (root == null) return null;
    var inorder = new List<Node>();
    InOrder(root, ref inorder);
    var swapped = FindSwappedAsTupleFromSequence(inorder);
    // swap the data of the two nodes
    var temp = swapped.first.data;
    swapped.first.data = swapped.second.data;
    swapped.second.data = temp;
    return root;
}

Monday, October 30, 2017

We were discussing storing the product catalog in the cloud. MongoDB provides cloud scale deployment. Workflows to get data into MongoDB, however, required organizations to come up with Extract-Transform-Load operations that did not deal with JSON natively. Cloud scale deployment of MongoDB lets us have the benefits of relational as well as document stores. Catalogs are indeed considered a collection of documents, which are reclassified or elaborately named and tagged items. These digital assets also need to be served via browse and search operations, which differ in their queries. Cloud scale deployments come with extensive monitoring support. This kind of monitoring specifically looks for data size versus disk size, active set size versus RAM size, disk IO, and write lock, and generally accounts for and tests the highest possible traffic. Replicas are added to reduce latency and increase read capacity.
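As a sketch of that last point, with the MongoDB .NET driver a read preference in the connection string routes queries to replicas. The hosts, replica set, database, and collection names below are hypothetical.

using MongoDB.Bson;
using MongoDB.Driver;

class CatalogReads
{
    static void Main()
    {
        // secondaryPreferred sends reads to replicas when one is available
        var client = new MongoClient(
            "mongodb://host1,host2,host3/?replicaSet=rs0&readPreference=secondaryPreferred");
        var products = client.GetDatabase("catalog").GetCollection<BsonDocument>("products");
        // a browse-style query that can now be served by a replica
        var page = products.Find(Builders<BsonDocument>.Filter.Eq("category", "shoes"))
                           .Limit(20)
                           .ToList();
        System.Console.WriteLine(page.Count);
    }
}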


#codingexercise
We were discussing how to find whether a given tree is a subtree of another tree. We gave a recursive solution and an iterative solution. The recursive solution did not address the case when the subtree is not a leaf in the original tree. Since we know that the original and the subtree are distinct, we can improve the recursive solution by adding a condition that returns true if the subtree node is null and the original is not.
We could also make this more lenient by admitting mirror trees as well. A mirror tree is one where the left and right subtrees may be swapped.

       4
     2   6
    1 3 5 7
InOrder: 1 2 3 4 5 6 7
bool isEqual(Node root1, Node root2)
{
    if (root1 == null && root2 == null) return true;
    if (root1 == null) return false;
    if (root2 == null) return true; // modified: a null subtree node matches any original node
    return root1.data == root2.data && ((isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right)) || (isEqual(root1.left, root2.right) && isEqual(root1.right, root2.left)));
    // as opposed to: return root1.data == root2.data && isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right);
}

Sunday, October 29, 2017

We were discussing Master Data Management and how it differs between MongoDB and traditional databases. MongoDB provides two important functionalities over a catalog, namely browsing and searching. Both of these services look like they could do with improvements. While the browsing-based service could expand on the standard query operator capabilities, including direct SQL invocations, the search capabilities could be made similar to those of Splunk, which can query events using search expressions and operators that have the same look and feel as customary tools on Unix. These two distinct services are provided over a single managed view of the catalog. The challenge with browsing over a catalog, unlike other datasets, is that the data is quite large and difficult to navigate. Traditional REST APIs solve this with the use of a few query parameters such as page, offset, and search-term, leaving cursor-like logic to the callers. Instead, I propose to push the standard query operators to the service itself, as sketched below.
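A minimal sketch of that proposal, with a hypothetical CatalogItem type and an in-memory stand-in for the catalog store: the service takes the operator parameters itself, so callers do not hand-roll cursor logic.

using System.Collections.Generic;
using System.Linq;

// hypothetical catalog item; in practice the store is the catalog database
class CatalogItem { public string Name; public string Category; public decimal Price; }

class BrowsingService
{
    private readonly List<CatalogItem> items = new List<CatalogItem>();

    // Paging and filtering are expressed with the standard query operators
    // (Where, OrderBy, Skip, Take) inside the service itself.
    public IEnumerable<CatalogItem> Browse(string category, int page, int pageSize)
    {
        return items.Where(i => i.Category == category)
                    .OrderBy(i => i.Name)
                    .Skip(page * pageSize)
                    .Take(pageSize);
    }
}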
The browsing service has to be optimized for the mobile experience as compared to the desktop experience. These optimizations include smaller page size, reorganized page layouts, smaller resources, smaller images and bigger thumbnails, mobile-optimized styling, faster page load times, and judicious use of properties on the page.
Since much of the content for a catalog remains static resources, web proxies, geo-sharding, content delivery networks, and app-fabric-like caches are immensely popular. However, none of these are really required if the data store is indeed cloud based. For example, the service level agreement from a cloud database is the same regardless of the region from where the query is issued.
#codingexercise
We were discussing how to find whether a given tree is a subtree of another tree. We gave a recursive solution and an iterative solution. The recursive solution did not address the case when the subtree is not a leaf in the original tree. Since we know that the original and the subtree are distinct, we can improve the recursive solution by adding a condition that returns true if the subtree node is null and the original is not.

       4
     2   6
    1 3 5 7
InOrder: 1 2 3 4 5 6 7
bool isEqual(Node root1, Node root2)
{
    if (root1 == null && root2 == null) return true;
    if (root1 == null) return false;
    if (root2 == null) return true; // modified: a null subtree node matches any original node
    return root1.data == root2.data && isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right);
}

Saturday, October 28, 2017

We were discussing Master Data management and how it differs between MongoDB and traditional databases.
Master Data Management participates in:
Planning and architecture
Execution and deployment
Quality and monitoring
Governance and Control
Standards and Compliance

Due to the lagging embrace of cloud-based master data management technologies (with the possible exception of Snowflake, due to its architecture), Riversand still continues to enjoy widespread popularity as an on-premise solution.
Riversand offers data modeling, data synchronization, data standardization, and flexible workflows within the tool. It offers scalability, performance and availability based on its comprehensive Product Information Management (PIM) web services. Functionality is provided in layers of information management, starting from the bottom:
print/translation workflows
workflow or security for access to the assets, their editing, insertions and bulk insertions
integrations or portals with flexible integration capabilities, full/data exports and multiple exports
integration portals for integration with imports/exports, data pools and platforms
a digital asset management layer for asset on-boarding and delivery to channels
and lastly data management for searches, saved searches, channel-based or localized content, and the ability to author variants, categories, attributes and relationships to stored assets.



#codingexercise
Find if one binary tree is a subtree of another
We were discussing the recursive solution and why it does not apply when the subtree is not a leaf in the original. There is also an iterative solution that can be modified to suit this purpose.
// two stacks keep the inorder traversals of the two trees in step
bool IsEqualIterative(Node root1, Node root2)
{
    var stk1 = new Stack<Node>();
    var stk2 = new Stack<Node>();
    var current1 = root1;
    var current2 = root2;

    while (current1 != null || current2 != null || stk1.Count > 0)
    {
        // one tree has a node where the other does not
        if (current1 == null && current2 != null) return false;
        if (current1 != null && current2 == null) return false;

        if (current1 != null)
        {
            if (current1.data != current2.data) return false;
            stk1.Push(current1);
            stk2.Push(current2);
            current1 = current1.left;
            current2 = current2.left;
            continue;
        }

        if (stk1.Count > 0)
        {
            // we can introduce sequence equal-logic-so-far here
            current1 = stk1.Pop();
            current2 = stk2.Pop();
            current1 = current1.right;
            current2 = current2.right;
        }
    }
    return true;
}
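A sketch of the outer search using the helper above: try to match the candidate subtree anchored at every node of the original tree.

bool IsSubtree(Node original, Node subtree)
{
    if (original == null) return subtree == null;
    if (IsEqualIterative(original, subtree)) return true;
    return IsSubtree(original.left, subtree) || IsSubtree(original.right, subtree);
}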