Cluster computing

Tuesday, May 30, 2017

We continue our discussion of system design for an online store as mentioned here and here. We now discuss the data storage aspects across all services from the point of view of scalability. We assume the number of users will grow without bound and plan accordingly. The services will need to store large volumes of data, both user data and logs. The user portal may be composed of data from many different data sources. Images and large static content will likely be served from storage that is optimized for blobs. Most of the per-user information is stored in sharded relational databases. High-volume short text, such as community feedback forums, social engineering transcripts, chats and messages, and ticket and case troubleshooting conversations, will likely be stored in a large distributed key-value store. A conventional relational database may be used as a queuing system on top of this store. Almost all of this data is still organized on a per-user basis: it is the data generated by users.

Unlike user data, log data is generated by the system from the various operations of the services in the form of log events. Log events help with analytics. For example, they may be used for correlation and as feedback, which then leads to improvements in the operations of the services. This virtuous cycle of feedback and improvement can go on regardless of which user is using the system. Log events translate to feedback only through analytics. For example, users may be shown the trending bestsellers or newcomers to the store, which may require correlation and collaborative filtering to produce a ranked list. Analytics results are often presented as charts.

User and log data may also be used in many other ways. Data may appear in the form of feeds to users to improve the shopping experience around a product. Data may come in the form of recommendations, such as people who liked this also liked that. Data may be represented as a graph and used with search.
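To make the "people who liked this also liked that" idea concrete, here is a minimal sketch based on co-occurrence counting, one of the simplest approaches in the collaborative filtering family. It is not necessarily what a production store would run. The class name AlsoLiked, the method Recommend, and the in-memory likesByUser map are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AlsoLiked
{
    // likesByUser is a hypothetical input: user id -> set of product ids that user liked.
    // Returns up to `top` products most often liked together with productId.
    public static List<string> Recommend(
        Dictionary<string, HashSet<string>> likesByUser, string productId, int top)
    {
        var counts = new Dictionary<string, int>();
        foreach (var liked in likesByUser.Values)
        {
            // Only users who liked the given product contribute.
            if (!liked.Contains(productId)) continue;
            foreach (var other in liked)
            {
                if (other == productId) continue;
                counts[other] = counts.TryGetValue(other, out var c) ? c + 1 : 1;
            }
        }
        // Rank by co-occurrence count; break ties by product id so the order is stable.
        return counts
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Select(kv => kv.Key)
            .Take(top)
            .ToList();
    }
}
```

At real scale the counting would presumably run as an offline job over the stored user data rather than in memory on each request.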
Data may also be used for the integrity of the site. Ads, reviews and insights may also appear as additional data while remaining separate and distinct in their purpose or usage. Data expands the possibilities for the business, and hence it eventually becomes the center of gravity. For example, logging may channel all logs to a central repository, which may grow over time in a time series database on a dedicated cluster or in a data warehouse. As data expands, scalability concerns grow. Systems may become mature, but when size grows, even architectures change. The embrace of Big Data over relational databases is a trend that comes directly from scalability. Developers may find it enticing to use SQL statements instead of map-reduce to get to the same result, and consequently they may require an additional stack over the data. Visualization will pull data and will also come with its own stack. Data tools may evolve over the data stack, and the tools and the stack will both evolve to better suit scalability and functionality. The design discussion here is borrowed from what has already been shown to work in companies like Facebook that have grown significantly.
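As a rough illustration of why a SQL-like layer is enticing, the sketch below counts log events per service in two ways: once with a declarative LINQ query that reads like SELECT ... GROUP BY, and once with explicit map and reduce phases. Both run in memory in a single process, so this only mirrors the shape of the two styles and is not a distributed job. The LogCounts class and the (Service, Message) event shape are assumptions made for the example.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LogCounts
{
    // Declarative, SQL-like form: SELECT Service, COUNT(*) ... GROUP BY Service.
    public static Dictionary<string, int> WithQuery(
        IEnumerable<(string Service, string Message)> events)
    {
        var query = from e in events
                    group e by e.Service into g
                    select new { Service = g.Key, Count = g.Count() };
        return query.ToDictionary(r => r.Service, r => r.Count);
    }

    // Explicit map and reduce phases that compute the same counts.
    public static Dictionary<string, int> WithMapReduce(
        IEnumerable<(string Service, string Message)> events)
    {
        // Map: emit a (service, 1) pair for every event.
        var mapped = events.Select(e => new KeyValuePair<string, int>(e.Service, 1));

        // Reduce: sum the emitted values for each key.
        var reduced = new Dictionary<string, int>();
        foreach (var pair in mapped)
        {
            reduced.TryGetValue(pair.Key, out var sum); // sum is 0 when the key is new
            reduced[pair.Key] = sum + pair.Value;
        }
        return reduced;
    }
}
```

The query form states what is wanted and leaves the grouping strategy to the engine, which is roughly the convenience a SQL stack over a large data store would offer.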
#codingexercise
Given a list of integers, return a list in which each position holds the product of all the elements except the one at that position.

input [2,3,1,4]
output [12,8,24,6]
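using System.Collections.Generic;
using System.Diagnostics;

// Division-based version: valid only when no element is zero.
List<int> GetSumProduct(List<int> A)
{
    Debug.Assert(!A.Contains(0));
    // Product of every element in the list.
    var product = 1;
    A.ForEach(x => { product *= x; });
    // Dividing the total by each element leaves the product of the others.
    var ret = new List<int>();
    A.ForEach(x => { ret.Add(product / x); });
    return ret;
}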
If we were to avoid division, we could, for each position, multiply every entry other than the one at that position, which takes quadratic time. To make it linear and still avoid division, we would keep track of front and rear products in separate passes and combine them for the result. Both running products need to start at 1.
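A minimal sketch of the linear, division-free variant follows. The class name ProductExercise and the method name GetProductsExceptSelf are chosen here for illustration and do not come from the exercise above. Unlike the division-based version, this one also tolerates zeros in the input.

```csharp
using System.Collections.Generic;

public static class ProductExercise
{
    // Returns, for each index, the product of all elements except the one at that index.
    public static List<int> GetProductsExceptSelf(List<int> A)
    {
        var ret = new List<int>(A.Count);

        // Forward pass: ret[i] holds the product of everything before index i.
        var front = 1;
        for (int i = 0; i < A.Count; i++)
        {
            ret.Add(front);
            front *= A[i];
        }

        // Backward pass: multiply in the product of everything after index i.
        var rear = 1;
        for (int i = A.Count - 1; i >= 0; i--)
        {
            ret[i] *= rear;
            rear *= A[i];
        }
        return ret;
    }
}
```

For the input [2,3,1,4], the forward pass leaves [1,2,6,6] and the backward pass turns that into [12,8,24,6], matching the expected output. Integer overflow is not handled in this sketch.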
