Monday, October 30, 2017

We were discussing storing the product catalog in the cloud. MongoDB provides cloud-scale deployment. Workflows to get data into MongoDB, however, required organizations to come up with Extract-Transform-Load operations and did not deal with JSON natively. Cloud-scale deployment of MongoDB lets us have the benefit of relational as well as document stores. Catalogs are indeed considered a collection of documents which are reclassified or elaborately named and tagged items. These digital assets also need to be served via browse and search operations, which differ in their queries. Cloud-scale deployments come with extensive monitoring support. This kind of monitoring specifically looks for data size versus disk size, active set size versus RAM size, disk I/O and write lock, and generally accounts for and tests the highest possible traffic. Replicas are added to reduce latency and increase read capacity.


#codingexercise
We were discussing how to find whether a given tree is a subtree of another tree. We gave a recursive solution and an iterative solution. The recursive solution did not address the case when the subtree is not a leaf in the original tree. Since we know that the original and the subtree are distinct, we can improve the recursive solution by adding a condition that returns true if the subtree node is null and the original is not.
We could also make this more lenient by admitting mirror trees as well. A mirror tree is one where the left and right children may be swapped.

      4
    2   6
   1 3 5 7
InOrder: 1 2 3 4 5 6 7
bool isEqual(Node root1, Node root2)
{
    if (root1 == NULL && root2 == NULL) return true;
    if (root1 == NULL) return false;
    if (root2 == NULL) return true; // modified: a null subtree node matches
    return root1.data == root2.data &&
           ((isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right)) ||
            (isEqual(root1.left, root2.right) && isEqual(root1.right, root2.left)));
    // as opposed to: return root1.data == root2.data && isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right);
}
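As a quick sanity check on the mirror-tolerant comparison, here is a minimal Python sketch; the Node class and function name are illustrative, not from the post:

```python
class Node:
    """Minimal binary tree node for illustration."""
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def is_equal_mirror(orig, sub):
    """Compare two trees, admitting mirror images at every level.
    As in the modified recursion, a null subtree node matches."""
    if orig is None and sub is None:
        return True
    if orig is None:
        return False
    if sub is None:
        return True  # subtree ran out first: treated as a match
    if orig.data != sub.data:
        return False
    return ((is_equal_mirror(orig.left, sub.left) and is_equal_mirror(orig.right, sub.right)) or
            (is_equal_mirror(orig.left, sub.right) and is_equal_mirror(orig.right, sub.left)))
```

For example, the tree 2-(1, 3) matches its mirror 2-(3, 1) under this comparison.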

Sunday, October 29, 2017

We were discussing Master Data Management and how it differs between MongoDB and traditional databases. MongoDB provides two important functionalities over a catalog, namely browsing and searching. Both of these services look like they could do with improvements. While the browsing-based service could expand on the standard query operator capabilities, including direct SQL invocations, the search capabilities could be made similar to those of Splunk, which can query events using search expressions and operators that have the same look and feel as customary tools on Unix. These two distinct services are provided over a single managed view of the catalog. The challenge with browsing over a catalog, unlike other datasets, is that the data is quite large and difficult to navigate. Traditional REST APIs solve this with the use of a few query parameters such as page, offset and search-term, leaving cursor-like logic to the callers. Instead I propose to push the standard query operators to the service itself.
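To make the proposal concrete, here is a rough in-memory Python sketch of a browse endpoint that accepts declarative filter, sort and paging parameters and applies them service-side; the function and parameter names (browse, where, order_by) are invented for illustration:

```python
def browse(catalog, where=None, order_by=None, page=0, page_size=20):
    """Service-side paging: the caller sends declarative query parameters
    instead of maintaining its own cursor-like logic."""
    items = [item for item in catalog if where is None or where(item)]
    if order_by is not None:
        items.sort(key=order_by)
    start = page * page_size
    return items[start:start + page_size]

# Example: first page of items priced above 90, cheapest first.
catalog = [{"sku": i, "price": 100 - i} for i in range(50)]
page = browse(catalog, where=lambda x: x["price"] > 90,
              order_by=lambda x: x["price"], page=0, page_size=5)
```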
The browsing service has to be optimized for the mobile experience as compared to the desktop experience. These optimizations include smaller page size, reorganized page layouts, smaller resources, smaller images and  bigger thumbnails, mobile optimized styling, faster page load times and judicious use of properties on the page.
Since much of the content for the catalog remains static resources, web proxies, geo-sharding, content delivery networks and app-fabric-like caches are immensely popular. However, none of these are really required if the data store is indeed cloud based. For example, the service level agreement from a cloud database is the same regardless of the region from where the query is issued.
#codingexercise
We were discussing how to find whether a given tree is a subtree of another tree. We gave a recursive solution and an iterative solution. The recursive solution did not address the case when the subtree is not a leaf in the original tree. Since we know that the original and the subtree are distinct, we can improve the recursive solution by adding a condition that returns true if the subtree node is null and the original is not.

      4
    2   6
   1 3 5 7
InOrder: 1 2 3 4 5 6 7
bool isEqual(Node root1, Node root2)
{
    if (root1 == NULL && root2 == NULL) return true;
    if (root1 == NULL) return false;
    if (root2 == NULL) return true; // modified: a null subtree node matches
    return root1.data == root2.data && isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right);
}
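Putting the modified equality to work, the subtree check tries it rooted at every node of the original. This Python sketch is a stand-in with illustrative names:

```python
class Node:
    """Minimal binary tree node for illustration."""
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def matches(orig, sub):
    """The modified equality: a null subtree node matches any original node."""
    if sub is None:
        return True
    if orig is None:
        return False
    return (orig.data == sub.data and
            matches(orig.left, sub.left) and matches(orig.right, sub.right))

def is_subtree(orig, sub):
    """Try the modified equality rooted at every node of the original."""
    if orig is None:
        return sub is None
    return matches(orig, sub) or is_subtree(orig.left, sub) or is_subtree(orig.right, sub)
```

Note that the matched node need not be a leaf of the original; the node 2 in the sample tree is matched even though it has children.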

Saturday, October 28, 2017

We were discussing Master Data management and how it differs between MongoDB and traditional databases.
Master Data Management participates in:
Planning and architecture
Execution and deployment
Quality and monitoring
Governance and Control
Standards and Compliance

Due to the lagging embrace of cloud-based master data management technologies (with the possible exception of Snowflake, due to its architecture), Riversand still continues to enjoy widespread popularity as an on-premise solution.
Riversand offers data modeling, data synchronization, data standardization, and flexible workflows within the tool. It offers scalability, performance and availability based on its comprehensive Product Information Management (PIM) web services. Functionality is provided in layers of information management, from the bottom layer up:
- print/translation workflows
- workflow or security for access to the assets, their editing, insertions and bulk insertions
- integrations or portals, followed by flexible integration capabilities, full/data exports and multiple exports
- integration portals for integration with imports/exports, data pools and platforms
- digital asset management for asset on-boarding and delivery to channels
- data management for searches, saved searches, channel-based content, or localized content and the ability to author variants, categories, attributes and relationships to stored assets



#codingexercise
Find if one binary tree is a subtree of another
We were discussing the recursive solution and why it does not apply when the subtree is not a leaf in the original. There is also an iterative solution that can be modified to suit this purpose.
bool IsEqualIterative(Node root1, Node root2)
{
    var stk1 = new Stack<Node>();
    var stk2 = new Stack<Node>(); // keep the two cursors' ancestors in step
    var current1 = root1;
    var current2 = root2;

    while (current1 != null || current2 != null || stk1.Count > 0)
    {
        if (current1 == null && current2 != null) return false;
        if (current1 != null && current2 == null) return false;

        if (current1 != null)
        {
            if (current1.data != current2.data) return false;
            stk1.Push(current1);
            stk2.Push(current2);
            current1 = current1.left;
            current2 = current2.left;
            continue;
        }

        if (stk1.Count > 0)
        {
            // we can introduce sequence equal-logic-so-far here
            current1 = stk1.Pop().right;
            current2 = stk2.Pop().right;
        }
    }
    return true;
}

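The same traversal translates to Python; this sketch keeps (node, node) pairs on one stack so the two cursors always back up together (names are illustrative):

```python
class Node:
    """Minimal binary tree node for illustration."""
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def is_equal_iterative(root1, root2):
    """Iterative structural equality via an explicit inorder-style stack."""
    stack = []
    cur1, cur2 = root1, root2
    while cur1 is not None or cur2 is not None or stack:
        # exactly one cursor exhausted means the shapes differ
        if (cur1 is None) != (cur2 is None):
            return False
        if cur1 is not None:
            if cur1.data != cur2.data:
                return False
            stack.append((cur1, cur2))
            cur1, cur2 = cur1.left, cur2.left
            continue
        # both cursors are null: back up and take the right branches
        n1, n2 = stack.pop()
        cur1, cur2 = n1.right, n2.right
    return True
```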
Friday, October 27, 2017

We have been discussing MongoDB improvements in catalog management. Let us quickly recap the initiatives taken by MongoDB and how it differs from regular Master Data Management. Specifically, we can compare it with the MDM from Riversand which we discussed earlier:
Comparisons
MongoDB:
   - organizes Catalog as a one stop shop in its store
            -- no sub-catalogs or fragmentation or ETL or MessageBus
   - equally available to Application servers, API data and services and webservers.
   - also available behind the store for supply chain management and data warehouse analytics
   - catalog available for browsing as well as searching via Lucene search index
   - geo-sharding with persisted shard ids or more granular store ids for improving high availability and horizontal scalability
   - local real-time writes and tuned for read-dominated workload
   - bulk writes for refresh
   - RelationalDB stores point in time loads while overnight they are pushed to catalog information management and made available for real-time views
   - NoSQL powers insight and analytics based on aggregations
   - provides front-end data store for real time queries and aggregations from applications.
   - comes with incredible monitoring and scaling

Both MDM and the MongoDB catalog store hierarchy and facets, in addition to items and SKUs, as ways of organizing items.

Riversand MDM
   - rebuildable catalog via change data capture
   - .Net powered comprehensive web services
   - data as a service model
   - Reliance on traditional relational databases only


#codingexercise
Find if one binary tree is a subtree of another
bool IsSubTree(Node root1, Node root2)
{
    var original = new List<Node>();
    var sequence = new List<Node>();
    Inorder(root1, ref original);
    Inorder(root2, ref sequence);
    return original.Intersect(sequence).SequenceEqual(sequence);
}
      4
    2   6
   1 3 5 7
InOrder: 1 2 3 4 5 6 7
Caveat: this works for 2,3,4 but for 2,4,6 we need the sequence-equals method on the intersection.
bool isEqual(Node root1, Node root2)
{
    if (root1 == NULL && root2 == NULL) return true;
    if (root1 == NULL) return false;
    if (root2 == NULL) return false;
    return root1.data == root2.data && isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right);
}
Note that we cannot apply the isEqual solution as the isSubtree alternative because the subtree may not be a leaf in the original.
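The limitation of the inorder-intersection approach can be demonstrated with a small Python stand-in for LINQ's Intersect (which keeps the first sequence's order; real Intersect also removes duplicates, ignored here). The check passes for 2, 4, 6 even though no subtree has that inorder traversal, which is exactly why a structural comparison is still needed:

```python
def intersect_keeping_order(original, sequence):
    """Elements of `original` that also occur in `sequence`,
    in the order they appear in `original` (LINQ Intersect-style)."""
    wanted = set(sequence)
    return [x for x in original if x in wanted]

root_inorder = [1, 2, 3, 4, 5, 6, 7]  # inorder of the sample tree
# 2, 4, 6 is not the inorder of any subtree, yet the check still passes:
false_positive = intersect_keeping_order(root_inorder, [2, 4, 6]) == [2, 4, 6]
```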

Thursday, October 26, 2017

Yesterday we were discussing how user activities are logged for insight by MongoDB. 
All user activity is recorded using HVDF API which is staged in User History store of MongoDB and pulled by external analytics such as Hadoop using MongoDB-Hadoop connector. Internal analytics for aggregation is also provided by Hadoop. Data store for Product Map, User preferences, Recommendations and Trends then store and make the aggregations available for personalization that the apps can use for interacting with the customer.
Today we look at monitoring and scaling in MongoDB in detail. The user-activity-analysis-insight design works well for any kind of deployment. For example, we can have a local deployment, an AWS deployment or a remote deployment of the database, and they can be scaled from a standalone instance to a replica set to a sharded cluster, and in all these cases the querying does not suffer. A replica set is a group of MongoDB instances that host the same data set; it provides redundancy and high availability. A sharded cluster stores data across multiple machines. When the data sets are large and the throughput is high, sharding eases the load. Tools to help monitor instances and deployments include MongoDB Monitoring Service, mongostat, mongotop, iostat and plugins for popular frameworks. These help troubleshoot failures immediately. The observation here is that MongoDB enables comprehensive monitoring through its own services, API and user interface. Being a database, the data from monitoring may also be made available for external query and dashboards such as the Grafana stack. Some of the popular metrics involved here include data size versus disk size, active set size versus RAM size, disk I/O and write lock. The goal for the metrics and monitoring is that they should be able to account for the highest possible traffic.
MongoDB recommends that we add replicas to reduce latency to users, add read capacity and increase data safety. If reads do not scale, the data may potentially become stale. Adding or removing a replica is seamless.
#codingexercise
Find if one binary tree is a subtree of another
bool IsSubTree(Node root1, Node root2)
{
    var original = new List<Node>();
    var sequence = new List<Node>();
    Inorder(root1, ref original);
    Inorder(root2, ref sequence);
    return original.Intersect(sequence).SequenceEqual(sequence);
}

Wednesday, October 25, 2017

We were discussing how user activities are logged for insight by MongoDB. 
All user activity is recorded using HVDF API which is staged in User History store of MongoDB and pulled by external analytics such as Hadoop using MongoDB-Hadoop connector. Internal analytics for aggregation is also provided by Hadoop. Data store for Product Map, User preferences, Recommendations and Trends then store and make the aggregations available for personalization that the apps can use for interacting with the customer.
Thus MongoDB powers applications for products and inventory, recommended products, customer profile and session management. Hadoop powers analysis for elastic pricing, recommendation models, predictive analytics and clickstream history. The user activity model is a json document with attributes for geoCode, sessionId, device, userId, type of activity, itemId, sku, order, location, tags, and timestamp. Recent activity for a user is a simple query as db.activity.find({userId:"123"}).sort({time: -1}).limit(1000)
Indices that can be used include userId+time, itemId+time, time. 
Aggregations are very fast and in real time. Queries like finding the recent number of views for a user, the total sales for a user, or the number of views/purchases for an item are now near real-time.
A batch query over NoSQL such as a map-reduce calculation for unique visitors is also performant.
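A rough in-memory Python stand-in shows the shape of these queries; the sample documents and helper names are invented, and the real versions would be MongoDB find() and aggregation calls:

```python
from collections import Counter

# Sample activity documents mirroring the JSON model described above.
activity = [
    {"userId": "123", "itemId": "A", "type": "view", "time": 1},
    {"userId": "123", "itemId": "B", "type": "view", "time": 2},
    {"userId": "456", "itemId": "A", "type": "purchase", "time": 3},
    {"userId": "123", "itemId": "A", "type": "purchase", "time": 4},
]

def recent_activity(user_id, limit=1000):
    """Analogue of find({userId: ...}).sort({time: -1}).limit(n)."""
    mine = [a for a in activity if a["userId"] == user_id]
    return sorted(mine, key=lambda a: a["time"], reverse=True)[:limit]

def views_per_item():
    """Analogue of a group-by-itemId aggregation over view events."""
    return Counter(a["itemId"] for a in activity if a["type"] == "view")
```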
This design works well for any kind of deployment. For example, we can have a local deployment, an AWS deployment or a remote deployment of the database, and they can be scaled from a standalone instance to a replica set to a sharded cluster, and in all these cases the querying does not suffer. A replica set is a group of MongoDB instances that host the same data set; it provides redundancy and high availability. A sharded cluster stores data across multiple machines. When the data sets are large and the throughput is high, sharding eases the load. Tools to help monitor instances and deployments include MongoDB Monitoring Service, mongostat, mongotop, iostat and plugins for popular frameworks. These help troubleshoot failures immediately.
In the second iteration we can even compare the distance and exit when a coprime was located in the first iteration and the distance for that coprime is more than the current distance.
int dist = 0;
// First Iteration as described yesterday
// Second Iteration as described yesterday
    // within the second iteration
    if (dist > 0 && Math.Abs(n - i) < dist) {
        break;
    }


Monday, October 23, 2017

We were discussing how user activities are logged for insight by MongoDB. These activities include search terms, items viewed or wished for, cart additions or removals, orders submitted, sharing on social networks, and impressions or clickstream events. Instead of using a time-series database, the activity store and reporting are improved right out of the box by MongoDB. Moreover, the user activities are used to compute user/product history, the product map, user preferences, recommendations and trends. The challenges associated with recording user activities include the following:
1) they are too voluminous and usually available only in logs
2) when they are made available elsewhere, they hamper performance.
3) If a warehouse is provided to record user activities, it helps reporting but becomes costly to scale
4) a compromise is available in NoSQL stores
5) but it is harder to provide a NoSQL store at the front-end where real-time queries are performed.
MongoDB addresses these with 
1) providing a dedicated store for large stream of data samples with variable schema and controlling the retention period
2) computing aggregations and derivatives separately
3) providing low latency to update data.
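Point 1, controlling the retention period, behaves much like a capped collection. A bounded deque gives the flavor in a few lines of Python (CappedStore is an invented name, not a MongoDB API):

```python
from collections import deque

class CappedStore:
    """Fixed-size sample store: the oldest samples fall off as new ones
    arrive, approximating a capped collection's retention behavior."""
    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def record(self, sample):
        self.samples.append(sample)

store = CappedStore(3)
for sample in [1, 2, 3, 4, 5]:
    store.record(sample)
# only the three most recent samples are retained
```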
All user activity is recorded using the HVDF API, which is staged in the User History store of MongoDB and pulled by external analytics such as Hadoop using the MongoDB-Hadoop connector. Internal analytics for aggregation is also provided by Hadoop. Data stores for the Product Map, User Preferences, Recommendations and Trends then store the aggregations and make them available for the personalization that the apps can use when interacting with the customer.
This completes the Data->Insight->Actions translation for user activities.


#codingexercise
We were implementing the furthest coprime of a given number such that it is in the range 2-250.
We made a single pass over the range of 2 to 250 to determine which ones were coprime.
However, we only need the farthest. Consequently, we could split the range into the part below the number and the part above it, so long as the input number lies within the range, and we could bail at the first encountered coprime from either extreme. In the second iteration we can even compare the distance and exit when a coprime was located in the first iteration and the distance for that coprime is more than the current distance.
int dist = 0;
// First Iteration as described yesterday
// Second Iteration as described yesterday
    // within second iteration
    if (dist > 0 && Math.Abs(n - i) < dist) {
        break;
    }
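Putting the two passes together, here is a complete Python sketch; the lo/hi parameter names for the 2-250 range are assumptions:

```python
from math import gcd

def furthest_coprime(n, lo=2, hi=250):
    """Farthest m in [lo, hi] with gcd(m, n) == 1, scanning inward
    from both extremes and bailing as soon as no improvement is possible."""
    best, dist = -1, 0
    # first pass: walk up from the low extreme to the first coprime
    for i in range(lo, n):
        if gcd(i, n) == 1:
            best, dist = i, n - i
            break
    # second pass: walk down from the high extreme; once the remaining
    # candidates are closer than the distance already found, stop early
    for i in range(hi, n, -1):
        if dist > 0 and abs(n - i) < dist:
            break
        if gcd(i, n) == 1:
            if abs(n - i) > dist:
                best = i
            break
    return best
```

For example, for n = 10 the first pass finds 3 (distance 7), but the second pass immediately finds 249 (distance 239), which wins.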