Saturday, December 2, 2017

We were discussing the argument for standard query operators. Today I want to contrast this with OData. While services provide scrubbing and analysis over data from tables, OData exposes the entire database to the web so that it may be accessed through REST APIs. The caller can then use the database just like any other browsable API, from any device. It uses the well-known HTTP methods, query parameters and request body to carry on the web conversation. The difference between standard query operators and this API is that the former standardizes programming needs across applications while the latter serves to open up a data source to the web. One may even be considered a layer on top of the other, and there is no denying that the former has a lot more flexibility, since we can mix and match collections even across data sources. The former plays an important role in data virtualization while the latter plays an important role in connecting a data source. Still, they are both services.
There was not much difference between the two when we don't worry about the syntax of the query and we view the results as an enumerable. Even popular relational databases are hosted as a service with programmability features, so you can leverage them in your code. Similarly, standard query operators may be implemented entirely in an ORM.
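To make the contrast concrete, here is a small hedged illustration: the same question asked through standard query operators in code, and through OData query options on a URL. The Order entity, the sample rows and the service root are hypothetical stand-ins, not from any real service.
using System;
using System.Collections.Generic;
using System.Linq;

class Order { public int Id; public decimal Total; }

class Demo
{
    static void Main()
    {
        // a hypothetical in-memory collection standing in for a table
        var orders = new List<Order>
        {
            new Order { Id = 1, Total = 250m },
            new Order { Id = 2, Total = 40m },
            new Order { Id = 3, Total = 120m }
        };

        // standard query operators: the query lives in application code
        var bigOrders = orders
            .Where(o => o.Total > 100m)  // filter
            .OrderBy(o => o.Total)       // sort
            .Take(10);                   // page

        foreach (var o in bigOrders)
            Console.WriteLine($"{o.Id}: {o.Total}");

        // OData carries the same intent as query options on the URL, e.g.
        //   GET https://example.com/odata/Orders?$filter=Total gt 100&$orderby=Total&$top=10
    }
}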
#codingexercise
Yesterday we were given three sorted arrays and we wanted to find one element from each array such that they are closest to each other. One way to do this is to traverse all three arrays simultaneously while keeping track of the smallest range encountered over the candidate set of three elements. We advance one element at a time by incrementing the index of the array holding the minimum value.
By advancing only the minimum element, we make sure the sweep is progressive and exhaustive.
List<int> GetClosest(List<int> A, List<int> B, List<int> C)
{
    var ret = new List<int>();
    int i = 0;
    int j = 0;
    int k = 0;
    int diff = int.MaxValue; // smallest range seen so far
    while (i < A.Count && j < B.Count && k < C.Count)
    {
        var candidates = new List<int>() { A[i], B[j], C[k] };
        int range = candidates.Max() - candidates.Min();
        if (range < diff)
        {
            diff = range;
            ret = candidates.ToList();
        }
        if (range == 0) return ret; // cannot do better than identical elements
        // advance the array holding the minimum so the sweep stays progressive
        if (candidates.Min() == A[i])
        {
            i++;
        }
        else if (candidates.Min() == B[j])
        {
            j++;
        }
        else
        {
            k++;
        }
    }
    return ret;
}
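A quick usage sketch, borrowing the sample arrays from the November 30 exercise further down; by my trace of the code this returns 10, 15 and 10, a spread of 5:
var A = new List<int> { 1, 4, 10 };
var B = new List<int> { 2, 15, 20 };
var C = new List<int> { 10, 12 };
var closest = GetClosest(A, B, C); // yields { 10, 15, 10 }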

Friday, December 1, 2017

The argument for standard query operators. 
Recently I came across a mindset among the folks of a company that databases are bad and services are good. There is not much difference between the two when we don't worry about the syntax of the query and we view the results as an enumerable. Even popular relational databases are hosted as a service with programmability features, so you can leverage them in your code. With the introduction of microservices, it became easy to host not only a dedicated database but also a dedicated database server instance. Using microservices with Mesos-based clusters and shared volumes, we can now have many copies of the server for high availability and failover. This is possibly great for small and segregated data, but larger companies often require massive investments in their data, often standardizing tools, processes and workflows to better manage it. In such cases consumers of the data don't talk to the database directly but via a service that may even sit behind a message bus. If the consumers proliferate, they end up creating and sharing many different instances of services for the same data, each with its own view rather than the actual table. APIs for these services are more domain-based rather than implementing a query-friendly interface that lets you work directly with the data. As services are organized, data may get translated or massaged as it makes its way from one to another. I have seen several forms of organizing the services, from service-oriented architecture at the enterprise level to fine-grained individual microservices. It is possible to have a bouquet of microservices that can take care of most data processing for the business requirements. Data may even be at most one or two fields of an entity along with its identifier for such services. This works very well to alleviate the onus and rigidity that comes with organization, the interactions between the components and the various chores that must be performed to keep it flexible to suit changing business needs. The flat ring of services, on the other hand, is already business friendly to begin with, letting services do their work. The graph of service dependencies may get heavily connected, but at least it becomes better understood, with very little of the stickiness that comes with ownership of data. Therefore, a vast majority of services may now be decoupled from any data ownership considerations, and those that do own data may find it convenient to not remain database specific and can even form a chain if necessary.
Enterprise architects strive to lay the rules for different services, but most are all the more willing to embrace their company's initiatives, including investments in the cloud or making a service more consistent with the others. Unless a team specifically asks for one-off treatment by way of non-traditional databases or special requirements, they are all the more excited to use cookie cutters or corral the processing to a service. If, instead, these same architects were to also take on the responsibility to open up some services with APIs implementing standard query operators on their data, akin to what a well-known managed language does or what web developers practice with their REST APIs using standard query parameters, they would do away with much of the case-by-case needs that come their way. In essence, promoting standard query operators for data over and on top of business interactions with the service seems a win-win for everyone.
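As a hedged sketch of what opening up standard query operators over a service's data could look like: a handler that maps a few made-up query parameters (nameContains, orderBy, top) onto Where, OrderBy and Take. The Product type and the parameter names are illustrative assumptions, not any framework's contract.
using System.Collections.Generic;
using System.Linq;

class Product { public string Name; public decimal Price; }

static class CatalogService
{
    // Translate incoming query parameters into standard query operators.
    public static IEnumerable<Product> Query(IEnumerable<Product> source,
                                             string nameContains, string orderBy, int top)
    {
        var results = source;
        if (!string.IsNullOrEmpty(nameContains))
            results = results.Where(p => p.Name.Contains(nameContains)); // ?nameContains=...
        if (orderBy == "price")
            results = results.OrderBy(p => p.Price);                     // ?orderBy=price
        return results.Take(top);                                        // ?top=...
    }
}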
#codingexercise
Yesterday we were given three sorted arrays and we were finding one element from each array such that it is closest to a given value, with the chosen elements coming one from each array.
Now if we wanted to find one element from each array such that they are closest to each other, we can reuse the GetClosest method from earlier, iterating over every element of one of the arrays until the criterion is satisfied; we check the absolute value of the difference to the candidate value, as sketched below. Alternatively, we could also traverse all three arrays while keeping track of the smallest range encountered over the candidate set of three elements. We traverse by choosing one element at a time in any one array by incrementing the index of the array with the minimum value.

By advancing only the minimum element, we make sure the sweep is progressive and exhaustive.
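Here is a minimal sketch of the first approach, assuming the binary-search GetClosest(List<int>, int) helper from the November 30 post; it anchors on each element of A and asks the other two arrays for their nearest neighbors, keeping the triple with the smallest spread. Note that anchoring on a single array is a heuristic and can miss the optimum in edge cases; the three-pointer sweep from the December 2 post is the exhaustive alternative.
List<int> GetClosestTriple(List<int> A, List<int> B, List<int> C)
{
    var best = new List<int>();
    int bestRange = int.MaxValue;
    foreach (var a in A)
    {
        int b = GetClosest(B, a); // nearest to a in B, via binary search
        int c = GetClosest(C, a); // nearest to a in C
        var candidates = new List<int> { a, b, c };
        int range = candidates.Max() - candidates.Min();
        if (range < bestRange)
        {
            bestRange = range;
            best = candidates;
        }
    }
    return best;
}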

Thursday, November 30, 2017

We resume our discussion about correlation versus regression. We saw that one of the best advantages of a linear regression is prediction with regard to time as an independent variable. When the data points have many factors contributing to their occurrence, a linear regression gives an immediate ability to predict where the next occurrence may happen. This is far easier than coming up with a model that is a good fit for all the data points. It gives an indication of the trend, which is generally more helpful than the data points themselves. Also, a scatter plot shows only one dependent variable changing in conjunction with the independent variable, which lets us pick the dimension we consider to fit the linear regression independently of the others. Lastly, the linear regression also gives an indication of how much the data adheres to the trend via the estimation of errors.
We also saw how model parameters for linear regressions are computed. We saw how the best values for the model parameters can be determined from the data by minimizing the sum of squared residuals, the familiar least-squares criterion.
The correlation coefficient describes the strength of the association between two variables. If the two variables increase together, the correlation coefficient tends to +1. If one decreases as the other increases, the correlation coefficient tends to -1. If they are not related to one another, the correlation coefficient stays near zero. In addition, the correlation coefficient can be related to the results of the regression. This is helpful because we now find a correlation not between parameters but between our notions of cause and effect. This also lets us use correlation between any x and y which are not necessarily independent and dependent variables. This follows from the fact that the correlation coefficient (denoted by r) is symmetric in x and y, which differentiates the coefficient from the regression.
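For reference, the standard sample formulas make the symmetry and the link to the regression slope explicit; these are textbook definitions, not derived from the post:
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}, \qquad b = r \, \frac{s_y}{s_x}
where b is the least-squares slope of the regression of y on x, and s_x and s_y are the sample standard deviations. Swapping x and y leaves r unchanged but changes the slope, which is exactly the asymmetry noted above.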
Non-linear equations can also be "linearized" by selecting a suitable change of variables. This is quite popular because it makes the analysis simpler. But reducing the dimensions this way is prone to distorting the error structure. It is an oversimplification of the model. It violates key assumptions and impacts the resulting parameter values. All of this contributes toward incorrect predictions, so it is best avoided. Non-linear least-squares analysis has well-defined techniques that are not too difficult computationally. Therefore it is better to do non-linear least-squares analysis when dealing with non-linear inverse models.
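A classic example of such a change of variables (a standard textbook transformation, not from the post) is fitting an exponential model by taking logarithms:
y = a\,e^{bx} \quad\Longrightarrow\quad \ln y = \ln a + b\,x
which is linear in x, but additive noise in y is distorted by the logarithm, illustrating the error-structure problem described above.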
#codingexercise
Given three sorted arrays, find one element from each array such that it is closest to a given value. The chosen elements should come one from each array.
For example:
A[] = {1, 4, 10}
B[] = {2, 15, 20}
C[] = {10, 12}
Given input: 10
Output: 10 15 10
10 from A, 15 from B and 10 from C
List<int> GetClosestToGiven(List<int> A, List<int> B, List<int> C, int value)
{
    System.Diagnostics.Debug.Assert(A.Count > 0 && B.Count > 0 && C.Count > 0);
    var ret = new List<int>();
    ret.Add(GetClosest(A, value)); // using binary search
    ret.Add(GetClosest(B, value));
    ret.Add(GetClosest(C, value));
    return ret;
}

int GetClosest(List<int> items, int value)
{
    // caller guarantees items is non-empty
    int start = 0;
    int end = items.Count - 1;
    int closest = items[start];
    while (start < end)
    {
        // the nearer of the two bracketing endpoints is our fallback answer
        closest = Math.Abs(items[start] - value) < Math.Abs(items[end] - value)
                  ? items[start] : items[end];
        int mid = (start + end) / 2;
        if (mid == start) return closest; // bracket has collapsed to two neighbors
        if (items[mid] == value)
        {
            return value;
        }
        if (items[mid] < value)
        {
            start = mid;
        }
        else
        {
            end = mid;
        }
    }
    return closest;
}
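A quick check against the example above, by my trace of the code:
var A = new List<int> { 1, 4, 10 };
var B = new List<int> { 2, 15, 20 };
var C = new List<int> { 10, 12 };
var picks = GetClosestToGiven(A, B, C, 10); // { 10, 15, 10 }, matching the expected output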

Wednesday, November 29, 2017

We were discussing detecting accounts owned by a user and displaying last signed in. We referred only to the last-signed-in feature without a description of its implementation. This information can be persisted as a single column in the identity table. Most tables generally have Created and Modified timestamps, so in this case we could re-purpose the Modified timestamp. However, the identity record may also be modified for purposes other than signing in. In addition, the last-signed-in activity is more informational when it describes the device from which the user signed in. Therefore, keeping a devices table joined with the login time will help this case. Making an entry for the device id and timestamp is sufficient, and we only need to keep track of one per login used by the owner. Translation to domain objects can then proceed with Object-Relational Mapping. Finally, we merely add an attribute on the view model for display on the page.
The data is all read and written by the system and therefore has no special security considerations. It is merely for display purposes.
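As a sketch of what the ORM mapping might look like, here are hypothetical domain classes for the idea above; the names (Identity, DeviceLogin and their properties) are mine, not from any schema in the post.
using System;
using System.Collections.Generic;

// Hypothetical entities; one DeviceLogin row per (identity, device) pair,
// updated in place on each successful sign-in.
class Identity
{
    public int Id { get; set; }
    public string UserName { get; set; }
    public List<DeviceLogin> DeviceLogins { get; set; }
}

class DeviceLogin
{
    public int IdentityId { get; set; }        // foreign key to Identity
    public string DeviceId { get; set; }
    public DateTime LastSignedIn { get; set; } // re-purposed timestamp
}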

#codingexercise
Print all possible palindromic partitions of a string

We can thoroughly exhaust the search space by enumerating all substrings,
with starting positions from 0 to Length-1
and substring sizes from a single character up to a length that includes the last possible character.
For each of these substrings, we can determine whether it is a palindrome or not:
bool isPalindrome(string A)
{
    if (A.Length == 0) return false;
    int start = 0;
    int end = A.Length - 1;
    // compare characters pairwise from both ends toward the middle
    while (start <= end)
    {
        if (A[start] != A[end])
        {
            return false;
        }
        start++;
        end--;
    }
    return true;
}
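The isPalindrome helper tests a single substring; to actually print the partitions, a recursive sketch along the lines described above could look like this (my own arrangement, assuming the helper just shown):
// Recursively cut the string: every palindromic prefix starts a branch,
// and we recurse on the remainder until the whole string is consumed.
void PrintPartitions(string s, List<string> current)
{
    if (s.Length == 0)
    {
        Console.WriteLine(string.Join("|", current)); // one complete partition
        return;
    }
    for (int len = 1; len <= s.Length; len++)
    {
        string prefix = s.Substring(0, len);
        if (isPalindrome(prefix))
        {
            current.Add(prefix);
            PrintPartitions(s.Substring(len), current);
            current.RemoveAt(current.Count - 1); // backtrack
        }
    }
}
// e.g. PrintPartitions("nitin", new List<string>()) prints n|i|t|i|n, n|iti|n and nitin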

Tuesday, November 28, 2017

Detecting accounts owned by a user and displaying last signed in.
We were talking about networking devices in the earlier post. While shared devices seem to be shrinking in number and personal devices seem to be growing in number, we face all the more questions about whether those devices are secured. An identity is used to log in to a device or site and usually consists of a username and a credential. With devices becoming personal, a user may remain signed in to the device since it is physically secured by her.

An owner may want to sign in with different credentials if she wants to separate the concerns by authenticating and performing actions under a different login. This may not be typical in every day use but having more than one account to your blog or email provider or another company website is altogether very common.

When an owner wants more than one identity created, usually she gives them different names. While many take the precaution of using a common prefix or suffix to recognize these different accounts, such conventions are not required. Consequently, grouping the different accounts for the same person is not easy, except by the owner when she recalls all that she created. On the other hand, when the account is created, if we can leave an annotation or allow the account creation process to read the previously used account from the device, some associations may be set up. This is very helpful to group the accounts. The step does not need to be taken only at registration time but can be taken at any time after that, should the owner want to tag different accounts. Otherwise we are left to discovering these related accounts by matches in first name, last name and other fields such as recovery email.

Discovering related accounts is one thing. Presenting them to the user for actions such as deletion is another. Just as we display last-signed-in activity for different devices as a security measure for the user, we could also display related accounts for account hygiene by the owner.

For businesses, the use case for displaying the last-signed-in-on-device activity is perhaps more relevant than showing related accounts but this may change quickly with the ability to switch accounts when shown on the login page.

#codingexercise
We were discussing the bridge and torch problem and how the participants can be either on the left or the right side of the bridge.
If we use a bitmask for their presence on one side of the bridge, we can quickly calculate the other side as follows.
To get the left mask value from the right mask we can use:
int GetLeftMask(int rightmask)
{
    // n is the total number of people; set 1s in all n positions and xor with the right mask
    return ((1 << n) - 1) ^ rightmask;
}

Monday, November 27, 2017

Wireless Access Point in base station mode, relay mode and remote mode: 
Wireless access points are ubiquitous in home and office. They are often called wireless routers and usually connect to a wired cable that lets the router connect to the internet. PCs, laptops, phones and iPads connect to it wirelessly using the protocols of the 802.11 family which enable mobility for the person wanting to access the internet.  
However, the wireless routers have a limited range. When the devices can find sufficient signal from the router, connectivity to the internet is a joy. We are happy to browse and stream data over the connection. When the connectivity is poor, there are a few options available, and we discuss these. The typical course of action is to buy an upgraded router, preferably one with better antennas, and replace the existing router. This has had remarkable impact in most usages, not only from the improved hardware but also from the improved protocols. The family of data protocols used with the wireless router to establish and maintain a wireless connection, also called Wi-Fi protocols (short for Wireless Fidelity), has undergone several iterations with improvements in data transmission rates, power management and so on. These Wi-Fi protocols are labeled alphabetically, with 'b', 'g' and 'n' becoming notable revisions. Together, this alphabet soup of protocols came out of the box and gave added power to the user. The range, however, does not extend automatically.
The protocol, however, allows wireless access points to work in one of the following three modes:
  1. As a base station to connect to the internet over a LAN cable or Ethernet.
  2. As a relay base station to relay data between other base stations.
  3. As a remote base station that allows clients to connect but passes the data to 2) or 1) for connectivity.
While commercial devices provide the functionality of 1), and the protocols make it technically feasible to extend range with 2) and 3), users seldom leverage the ability of the access point to operate in relay or remote mode. Wireless companies don't make it any easier to leverage these functionalities. On the other hand, they sell separate devices for those with large homes and call them wireless extenders. These wireless extenders are not only sold separately, they are even bundled with signal amplifiers and traffic snooping capabilities. A dedicated wireless repeater is also sold separately. This contains two wireless routers, where one of them picks up the existing Wi-Fi network and then transfers the signal to the other router with boosted signals. This technique of using one network with another is called bridging, and the definition is expanded to include cases where one of the networks is wired. Bridging can even be done on networks that share similar infrastructure. If you think wired Ethernet cables are the only ones that can carry network traffic, even the electrical circuits of the house can be reused to create a link from the Wi-Fi router to your device, as for example with a Powerline Ethernet kit. While extenders and repeaters improve coverage, they still load the existing main base station. This reduces speed in some cases. Consequently, bridging is favored over extenders. If we wanted to convert existing older-model routers to bridge or repeat, we are possibly out of luck even with reconfiguration of the device. Perhaps the devices of tomorrow can be made more open to begin with in their corresponding areas of operation.

Sunday, November 26, 2017

#codingexercise
We were discussing a sample problem of crossing the bridge.
There are 4 persons (A, B, C and D) who want to cross a bridge at night.
A takes 1 minute to cross the bridge.
B takes 2 minutes to cross the bridge.
C takes 5 minutes to cross the bridge.
D takes 8 minutes to cross the bridge.
There is only one torch with them and the bridge cannot be crossed without the torch. There cannot be more than two persons on the bridge at any time, and when two people cross the bridge together, they must move at the slower person’s pace.
Can they all cross the bridge in 15 minutes ?
Solution: A and B cross the bridge. A comes back. Time taken 3 minutes. Now B is on the other side.
C and D cross the bridge. B comes back. Time taken 8 + 2 minutes. Now C and D are on the other side.
A and B cross the bridge. Time taken is 2 minutes. All are on the other side.
Total time spent is 3 + 10 + 2 = 15 minutes.
Next we wanted to generalize this.
The combination chosen works for this example because of a tradeoff: we keep at least one fast member available on the right to come back, and we pair the slow folks on the left to cross to the right so that their times are not counted individually.
We noted that they have overlapping subproblems.
We have left and right sides. The number of people on the left side can vary between 0 and a large number. The next move can either be from the left side to the right side or vice versa. Therefore, we can maintain a dynamic programming table with that many rows and two columns. At any time, this table stores the minimum time it takes for that many people on the left side given that move, so we need not recalculate it. Also, we don't just store the number of people; we actually store the bitmask so as to give the positions of the people present on the left side. With this we immediately know who is on the right side. Given that any one of the n people can make the move, we pick the one that yields the minimum time on recursion. We evaluate this for both moves separately, since two can go in one direction and only one on the return. We try every pair, and given that we have already exhausted the cases for numbers less than the current iteration i from 0 to n-1, we try pairing with numbers between i+1 and n-1. Finally, we return the minimum time.

To get the right mask value we can use:
int GetRightMask(int leftmask)
{
    // n is the total number of people; set 1s in all n positions and xor with the left mask
    return ((1 << n) - 1) ^ leftmask;
}
For the return path we are only selecting one person. We can find this one by iterating through those on the right side to find the recursive minimum time. For the forward path, we have to pick a pair. The pair can be formed by selecting any one of the n people together with any candidate whose index is greater than i but less than n.
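Putting the pieces together, here is a minimal memoized sketch of the recursion just described; the state (left-side bitmask plus direction) follows the post, while the class and method names and the console harness are my own additions:
using System;

class BridgeCrossing
{
    static int n;          // total number of people
    static int[] t;        // crossing time of each person
    static int[,] memo;    // memo[leftMask, dir]; dir 0 = pair crosses, 1 = one returns

    static int MinTime(int leftMask, int dir)
    {
        if (leftMask == 0) return 0;                       // everyone is across
        if (memo[leftMask, dir] != -1) return memo[leftMask, dir];
        int best = int.MaxValue;
        if (dir == 0)
        {
            // forward: pick a pair from the left; they move at the slower pace
            for (int i = 0; i < n; i++)
            {
                if ((leftMask & (1 << i)) == 0) continue;
                if (leftMask == (1 << i))                  // last person crosses alone
                {
                    best = Math.Min(best, t[i]);
                    continue;
                }
                for (int j = i + 1; j < n; j++)
                {
                    if ((leftMask & (1 << j)) == 0) continue;
                    int next = leftMask & ~(1 << i) & ~(1 << j);
                    int cost = Math.Max(t[i], t[j]) + (next == 0 ? 0 : MinTime(next, 1));
                    best = Math.Min(best, cost);
                }
            }
        }
        else
        {
            // return: one person on the right brings the torch back
            int rightMask = ((1 << n) - 1) ^ leftMask;     // as in GetRightMask
            for (int k = 0; k < n; k++)
            {
                if ((rightMask & (1 << k)) == 0) continue;
                best = Math.Min(best, t[k] + MinTime(leftMask | (1 << k), 0));
            }
        }
        return memo[leftMask, dir] = best;
    }

    static void Main()
    {
        t = new[] { 1, 2, 5, 8 };                          // A, B, C, D from the puzzle
        n = t.Length;
        memo = new int[1 << n, 2];
        for (int m = 0; m < (1 << n); m++) { memo[m, 0] = -1; memo[m, 1] = -1; }
        Console.WriteLine(MinTime((1 << n) - 1, 0));       // prints 15 for this example
    }
}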