Friday, March 17, 2017

Today we start reading the paper "Big Data and Cloud Computing: A survey of the State-of-the-Art and Research Challenges" by Skourletopoulos et al. This paper compares the data warehouse and big data as cloud offerings. As Gartner mentioned, more than 20 billion connected devices are expected by the year 2020, and the amount of data exchanged by sensors is going to far exceed the amount of data exchanged by human beings. The size of data is only growing, and many find it easier to work directly on data at such a large scale. Big Data refers to very large and complex data sets that traditional data processing applications are incapable of handling. For a more detailed comparison, I refer to an earlier blog post. The main takeaway is that Big Data is not only about storage but also about a different type of algorithms. These load, store and query data at massive scale in batches using a technique called MapReduce, and they run in parallel across a distributed cluster. Social networks are one example of Big Data. Many cloud providers have established new datacenters for hosting social networking, business media content or scientific applications and services. In fact, storage from cloud providers is measured by the gigabyte-month and compute is priced by the CPU-hour.
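As a minimal, single-process sketch of the map and reduce steps (the sample documents are made up and LINQ stands in for the framework), the map phase emits a (word, 1) pair for every word and the reduce phase groups by key and sums; a real MapReduce framework distributes both phases across a cluster.

using System;
using System.Linq;

class WordCount
{
    static void Main()
    {
        var documents = new[] { "big data in the cloud", "data in the warehouse" };
        // map: emit (word, 1) for every word in every document
        var mapped = documents.SelectMany(d => d.Split(' '))
                              .Select(word => new { Key = word, Value = 1 });
        // shuffle + reduce: group by key and sum the values
        var reduced = mapped.GroupBy(p => p.Key)
                            .Select(g => new { Word = g.Key, Count = g.Sum(p => p.Value) });
        foreach (var r in reduced)
            Console.WriteLine("{0}: {1}", r.Word, r.Count);
    }
}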
IBM data scientists argue that the key dimensions of big data are: volume, velocity, variety and veracity. The size and type of existing deployments range along these dimensions. Many of these deployments get data from external providers. A Big Data as a Service stack may get data from other big data sources, operational data stores, staging databases, data warehouses and data marts. Typically the operational data stores, staging databases and warehouses hold relational data. Data marts allow analysis over the dimensions of a cube. Big Data sources can include source systems in Compliance, Trading, CRM, Research, Finance, MDM, Pricing and other IoT data sources.

Zheng et al described a big data as a service offering for service-generated data. They showed that the stack for this service includes three layers, analytics, platform and infrastructure, in that hierarchy. The data feeding into this service comes from service-generated big data, which includes service logs, service QoS and service relationships. Log analysis is useful for visualization and diagnosis. The QoS data supports fault tolerance and prediction. The service relationships support service identification and migration.

#codingexercise
Count all Palindromic subsequences in a given string 
int GetCountPalin(string A, int start, int end)
{
    if (String.IsNullOrEmpty(A)) return 0;
    // Assert(start >= 0 && start < A.Length && end >= 0 && end < A.Length);
    if (start > end) return 0;
    if (start == end) return 1;
    int count = 0;
    if (A[start] == A[end]) {
        count += GetCountPalin(A, start + 1, end);
        count += GetCountPalin(A, start, end - 1);
        count += 1;
    } else {
        count += GetCountPalin(A, start + 1, end);
        count += GetCountPalin(A, start, end - 1);
        count -= GetCountPalin(A, start + 1, end - 1);   // subtract the doubly counted middle
    }
    return count;
}

void Combine(string A, StringBuilder b, int start, List<string> palindromeCombinations)
{
    for (int i = start; i < A.Length; i++)
    {
        b.Append(A[i]);
        // collect every subsequence built so far that reads as a palindrome
        if (IsPalindrome(b.ToString()))
            palindromeCombinations.Add(b.ToString());
        Combine(A, b, i + 1, palindromeCombinations);
        b.Remove(b.Length - 1, 1);   // backtrack
    }
}

Thursday, March 16, 2017

Endpoint protection

Endpoint protection is a core concept of any company's digital security. It helps a company defend against Internet-based breaches and data losses. It provides barriers against malware, data loss and theft. Technologies behind endpoint protection also mitigate network intrusion. Generally a company has many services offered via its endpoints. The type and number of endpoints, how they are hosted (on site, in a virtualized environment or in the cloud), the management tools required (on-site, remote or mobile), the performance expectations and the support determine the choice of vendor for endpoint protection. Reviewers of endpoint protection technologies indicate that the size of a company does not matter for the endpoints being protected. Also, endpoint protection typically scales to hundreds or thousands of endpoints.
Let us take a look at the architecture behind this offering. An endpoint device is an Internet-capable computer hardware device on a TCP/IP network with an address and port that clients can connect to. Typically it hosts a web service or application of any size that can be reached over, say, http or https. The devices hosting the applications can be cloud-based servers, on-site web farms, desktop computers, laptops, smart phones, tablets, thin clients, printers or other specialized hardware such as POS terminals and smart meters.
Policies are associated with endpoints and these are managed as network rules and firewalls within an organization. A system administrator may divide the network, secure access via firewalls, disable ports and establish static rules to prevent undesirable access to devices hosting endpoints. There are software tools that can manage these centrally for a spread of computers - be it on premise or in the cloud. Such tools are usually associated with system center management platforms.
Another technique is to use an http proxy. As a man in the middle, it does not require any changes to the server offering the services and is capable of performing the same mitigations that could have been taken on that server. An http proxy like Mashery can also monitor and measure incoming traffic, with the advantage of providing detailed statistics on a variety of metrics. Proxies support not only relay behavior but also filtering, and they support promiscuous mode listening. In our case we have a transparent proxy that does not modify the requests or responses. Proxies can also be forward or reverse. A forward proxy helps with anonymity in that it retrieves resources from the web on behalf of the users behind the proxy. A reverse proxy is one that secures the resources of the corporate network from outside access. This comes in handy to maintain quarantine lists that stage access to the protected resources. Network address translation is a useful concept in this regard; it is also referred to as fencing when used with virtual machines. A reverse proxy can do several things such as load balancing, authentication, decryption or caching. By treating the proxy as another server, the clients don't know which server processed the request. The reverse proxy can distribute the load to several servers, which means the configuration is not just a failover but a load-balancing one. SSL acceleration is another option, where the proxy enables hardware acceleration and a central place for SSL connectivity for clients.
The proxy can also choose to serve or cache static content, facilitate compression, or communicate with the firewall server, since the rules on the firewall server can now be simplified. Firewalls can also use proxies, and these can be transparent or opaque. Thus a proxy is a very useful technique to protect endpoints.
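As a minimal sketch of the load-balancing role described above, the following forwards GET requests round-robin to two backends. It is illustrative only: the backend addresses and listening port are hypothetical, and a production reverse proxy would also handle other verbs, headers, streaming, TLS and failures.

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class MiniReverseProxy
{
    // hypothetical backend servers behind the proxy
    static readonly string[] Backends = { "http://10.0.0.1:8080", "http://10.0.0.2:8080" };
    static readonly HttpClient Client = new HttpClient();
    static int next = 0;

    static async Task Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://+:8000/");
        listener.Start();
        while (true)
        {
            var ctx = await listener.GetContextAsync();
            // round-robin selection: the client never learns which backend served the request
            var backend = Backends[next++ % Backends.Length];
            var upstream = await Client.GetAsync(backend + ctx.Request.RawUrl);
            var body = await upstream.Content.ReadAsByteArrayAsync();
            ctx.Response.StatusCode = (int)upstream.StatusCode;
            await ctx.Response.OutputStream.WriteAsync(body, 0, body.Length);
            ctx.Response.Close();
        }
    }
}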
Another technique is to use a variety of intelligent, lightweight sensors which capture and record all relevant endpoint activity, ensuring true visibility across the environment. They may come with a small footprint, no reboot, no daily AV definitions, no user alerts, no impact on the endpoints, and protection both offline and online. CrowdStrike is well known for using this kind of technique. The use of distributed sensors also implies a centralized analysis service that can be hosted in the cloud so that it can scale arbitrarily. Together with the use of sensors and services, this kind of platform can crunch large amounts of data and form the so-called ThreatGraph. By correlating billions of events in real time and applying graph-based techniques, a ThreatGraph can draw links between events and adversary activity quickly. It is a powerful and massively scalable graph database that can be used with machine learning techniques to detect patterns. More on some of the machine learning techniques can be read in my blog post here.
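As a toy sketch of how graph-based correlation links events (this is not CrowdStrike's implementation; the event names and edges below are made up), events can be modeled as nodes, correlations as edges, and a breadth-first traversal surfaces the chain of activity reachable from a suspicious starting event.

using System;
using System.Collections.Generic;

class EventGraphDemo
{
    static void Main()
    {
        // hypothetical correlated events: node -> events linked to it
        var edges = new Dictionary<string, List<string>>
        {
            ["login-anomaly"] = new List<string> { "new-process" },
            ["new-process"] = new List<string> { "outbound-connection", "registry-change" }
        };
        var related = new HashSet<string>();
        var queue = new Queue<string>();
        queue.Enqueue("login-anomaly");
        while (queue.Count > 0)
        {
            var evt = queue.Dequeue();
            if (!related.Add(evt)) continue;          // already visited
            if (edges.ContainsKey(evt))
                foreach (var next in edges[evt]) queue.Enqueue(next);
        }
        Console.WriteLine(string.Join(" -> ", related));  // chain of linked activity
    }
}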
Survey research in network and intrusion analysis shows that many assessment techniques are reactive and often too late because attacks quickly escalate into a deluge. Sensors are also not always able to discern the good from the bad. Therefore predictive algorithms and precautionary load balancing and throttling between regions become an active area of study.

Conclusion: Endpoint protection is no longer a set of static rules but a field of work in itself.

#codingexercise
Find the number of Palindromic paths in a matrix of alphabets.

int GetPalinPathCount(char[,] A, int R, int C, int rs, int re, int cs, int ce)
{
    // (rs, cs) walks forward from the top-left, (re, ce) walks backward from the bottom-right
    if (rs < 0 || rs >= R || re < 0 || re >= R || cs < 0 || cs >= C || ce < 0 || ce >= C) return 0;
    if (A[rs, cs] != A[re, ce]) return 0;
    // the two frontiers have met or are adjacent: the path reads the same both ways
    if (Math.Abs(rs - re) + Math.Abs(cs - ce) <= 1)
        return 1;
    int ret = 0;
    ret += GetPalinPathCount(A, R, C, rs + 1, re - 1, cs, ce);
    ret += GetPalinPathCount(A, R, C, rs + 1, re, cs, ce - 1);
    ret += GetPalinPathCount(A, R, C, rs, re, cs + 1, ce - 1);
    ret += GetPalinPathCount(A, R, C, rs, re - 1, cs + 1, ce);
    return ret;
}

Wednesday, March 15, 2017

Today we start reading the paper "Big Data and Cloud Computing: A survey of the State-of-the-Art and Research Challenges" by Skourletopoulos et al. This paper compares the data warehouse and big data as cloud offerings.
We could calculate the TCO as per the model described here:

https://1drv.ms/w/s!Ashlm-Nw-wnWrxZ9WCvR2TEdRruC

#codingexercise
Find the n'th super ugly number, which is a positive number whose prime factors are all in the given prime list.
For example, primes = [2,5]
The super ugly numbers are 1, 2, 4, 5, 8, ... and the 5th super ugly number is therefore 8.

int GetUgly(List<int> primes, int n)
{
    var ugly = new List<int> { 1 };
    var multiples = primes.ToArray();        // next candidate multiple for each prime
    var operands = new int[primes.Count];    // index into ugly that produced each candidate
    while (ugly.Count < n)
    {
        var nextUgly = multiples.Min();
        ugly.Add(nextUgly);
        // advance every prime whose candidate was just consumed, so duplicates are skipped
        for (int j = 0; j < primes.Count; j++)
        {
            if (nextUgly == multiples[j])
            {
                operands[j]++;
                multiples[j] = primes[j] * ugly[operands[j]];
            }
        }
    }
    return ugly.Last();
}

Tuesday, March 14, 2017

#codingexercise
Find the shortest path with exactly k edges in a directed and weighted graph
int GetShortestPath(int[,] graph, int u, int v, int k)
{
    // graph[i,j] == int.MaxValue means there is no edge from i to j
    if (k == 0 && u == v) return 0;
    if (k == 1 && graph[u, v] != int.MaxValue) return graph[u, v];
    if (k <= 0) return int.MaxValue;
    int res = int.MaxValue;
    for (int i = 0; i < graph.GetLength(0); i++)
    {
        if (graph[u, i] != int.MaxValue && u != i && v != i)
        {
            int rec = GetShortestPath(graph, i, v, k - 1);
            if (rec != int.MaxValue)
                res = Math.Min(res, graph[u, i] + rec);
        }
    }
    return res;
}
Given a positive integer N, count the number of binary strings without consecutive 1s
int GetCount(int n)
{
    var a = new int[n]; // number of required strings of length i+1 ending in 0
    var b = new int[n]; // number of required strings of length i+1 ending in 1
    a[0] = 1;
    b[0] = 1;
    for (int i = 1; i < n; i++)
    {
        a[i] = a[i - 1] + b[i - 1];
        b[i] = a[i - 1];
    }
    return a[n - 1] + b[n - 1];
}

Monday, March 13, 2017

Today we surveyed literature on TCO models and drew the following examples: 
  1. Strabel and Stage developed an economic decision model that compares costs for the internal IT infrastructure (server and storage expenses) and the external provisioning by means of Cloud Computing Services (fees for CPU hour, time contingent, storage, internet service provider costs and inbound and outbound data transfer costs). Theirs is a regression model focused on the hybrid usage of internal and external infrastructure sources. 
  1. Kondo et al applied a cost benefit analysis that focused on Infrastructure as a Service aka IaaS. They compared cloud computing services to volunteer computing services and found the latter to be more economical in the long run but preferred the former for performance. 
  1. Li et al were probably groundbreaking in applying the company perspective of Cloud Computing to TCO because they focus on the provider. They discuss setting up and maintenance costs for a cloud such as costs of hardware, software, power, cooling, staff and real estate. Instead of focusing on physical hardware, they try to maximize the number of virtual machines that can be deployed within a datacenter so that the organization is flexible to compute demands. 
  1. Martens et al detailed the costing of cloud computing services and elaborated on the listing and calculation of cost factors.  
This is summarized as follows: 
  1. The TCO of a cloud computing service equals the sum total of all cost types 
  1. The total amount of a cost type equals the sum total of all involved cost factors 
  1. The involved cost for a factor is determined by accumulating usage over a set of time periods 
  1. The cost within a unit period is the product of the required quantity and the unit cost or price. 
Without going into their comprehensive elaboration of cost types and cost factors, we just take away their notion of aggregating costs and observe that it seems to work well for Infrastructure as a Service, Platform as a Service and Software as a Service. A small sketch of this aggregation follows.
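Below is a minimal sketch of that aggregation with made-up quantities and unit prices (the storage and compute factors are hypothetical, not figures from any of the surveyed papers): the cost within a period is quantity times unit price, a cost factor accumulates its periods, and the TCO sums the factors across cost types.

using System;
using System.Linq;

class TcoSketch
{
    static double Cost((double quantity, double unitPrice)[] periods)
        => periods.Sum(p => p.quantity * p.unitPrice);   // accumulate the usage periods

    static void Main()
    {
        // hypothetical cost factors: storage in GB-months and compute in CPU-hours, over two months
        var storage = new[] { (500.0, 0.05), (650.0, 0.05) };
        var compute = new[] { (720.0, 0.12), (800.0, 0.12) };
        double tco = Cost(storage) + Cost(compute);       // sum over all cost types and factors
        Console.WriteLine(tco);                           // 57.5 + 182.4 = 239.9
    }
}

The same shape extends to any number of cost types, cost factors and periods.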

#codingexercise
Greedy algorithm for Egyptian Fraction:
An Egyptian Fraction is a representation of a fraction as a sum of unit fractions. For example, to represent 6/14, we find the ceiling of 14/6, which is 3, and write down one component of the Egyptian fraction as 1/3. Then we recur for (6/14 - 1/3) = 2/21.
void PrintEgyptianFraction(int nr, int dr)
{
    if (dr == 0 || nr == 0) return;
    if (nr % dr == 0)
    {
        // the fraction is a whole number
        Console.WriteLine("{0}", nr / dr);
        return;
    }
    if (dr % nr == 0)
    {
        // already a unit fraction
        Console.WriteLine("1/{0}", dr / nr);
        return;
    }
    if (nr > dr)
    {
        // peel off the whole part and recur on the remainder
        Console.WriteLine("{0}", nr / dr);
        PrintEgyptianFraction(nr % dr, dr);
        return;
    }
    // greedy step: the largest unit fraction not exceeding nr/dr is 1/ceil(dr/nr)
    int n = (dr + nr - 1) / nr;
    Console.WriteLine("1/{0}", n);
    PrintEgyptianFraction(nr * n - dr, dr * n);
}


Count binary strings of length n in which adjacent set bits appear exactly k times:
// Counts binary strings of length n with exactly k adjacent pairs of set bits.
// lastBit is the bit placed in the previous position; use 0 for the initial call.
long GetCountPairBits(int n, int k, int lastBit = 0)
{
    if (k < 0) return 0;
    if (n == 0) return (k == 0) ? 1 : 0;
    long count = 0;
    // place a 0: no new adjacent pair regardless of the previous bit
    count += GetCountPairBits(n - 1, k, 0);
    // place a 1: a new adjacent pair forms only if the previous bit was also 1
    count += GetCountPairBits(n - 1, (lastBit == 1) ? k - 1 : k, 1);
    return count;
}

Sunday, March 12, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We reviewed the partition layer which translates objects such as blobs, tables or queues to storage. We saw how the partition manager maintains a massive Object Table that is sharded on different partition servers which maintain their consistency across concurrent transactions using a lock service. We read the partition layer data model and the supported data types and operations, the partition layer architecture and the range partition data structures. We reviewed the load balancing, split and merge operations. Then we reviewed the intrastamp and interstamp replication. We also took stock of TCO, performance, application profiles and their comparison study.
Today we conclude reviewing the list of design decisions taken for the Windows Azure Storage service. This is primarily an append-only system, which simplifies snapshots and replication because snapshots of previous states can now be kept at little or no cost. To keep the space overhead low, a garbage collection system is crucial.  
To maintain integrity of user data, WAS needed to use end-to-end checksums. A checksum is appended to the data from the moment it is received by the system.  
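As a purely illustrative sketch (WAS's actual checksum format and algorithm are not described here; MD5 is just a stand-in), the idea is to attach a checksum when a block enters the system and to verify it at every later hop:

using System.Linq;
using System.Security.Cryptography;

static class ChecksumSketch
{
    // attach a 16-byte MD5 digest to the block when it is first received
    public static byte[] WithChecksum(byte[] data)
    {
        using (var md5 = MD5.Create())
            return data.Concat(md5.ComputeHash(data)).ToArray();
    }

    // verify the trailing digest before serving or re-replicating the block
    public static bool Verify(byte[] block)
    {
        var data = block.Take(block.Length - 16).ToArray();
        var stored = block.Skip(block.Length - 16).ToArray();
        using (var md5 = MD5.Create())
            return md5.ComputeHash(data).SequenceEqual(stored);
    }
}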
Another design decision was to spread servers evenly across fault and upgrade domains. This way, if a fault domain goes down, only a fraction of the servers is lost. Upgrade domains may cover a different fraction.
Yet another decision was to support multiple data abstractions from a single stack. Blobs, Tables and Queues are all supported on the same stack.
We will recall that blobs, tables and queues were all tracked with the same master Object Table. Using a single system-defined abstraction underneath reduces the amount of management the system has to do.
Storage is handled in buckets of 100 TB. WAS limits the amount of storage for an account to no more than 100 TB, which allows all the storage data pertaining to an account to fit within a given storage stamp. To obtain more storage capacity within a single data center, customers use more than one account within that location. This was considered acceptable for large customers because they already had multiple accounts.
WAS mentions that it seems to defy the CAP theorem: it provides high availability with strong consistency guarantees, while the theorem states that high availability, consistency and partition tolerance cannot all be achieved at the same time.
Lastly, WAS uses extensive debug logging infrastructure and the logs are maintained on local disks.
The WAS System has also been stress tested using pressure points.
#codingexercise
To count the total number of N-digit numbers whose digit sums at even and odd positions are both even, we modify it as follows:
long GetCountEqual(int digits, int even, int odd, int n, bool isOdd)
{
    // even and odd accumulate the digit sums at even and odd positions respectively
    if (digits == n) return (even % 2 == 0 && odd % 2 == 0) ? 1 : 0;
    long count = 0;
    if (isOdd) {
        for (int i = 0; i <= 9; i++)
            count += GetCountEqual(digits + 1, even, odd + i, n, false);
    } else {
        for (int i = 0; i <= 9; i++)
            count += GetCountEqual(digits + 1, even + i, odd, n, true);
    }
    return count;
}

Saturday, March 11, 2017

We continue with a detailed study of the Microsoft Azure stack as inferred from an introduction of Azure by Microsoft. We reviewed some more features of Azure storage. We saw how erasure coding works in Windows Azure Storage aka WAS and how it is different from Pelican. We saw what steps WAS takes to improve durability. We also saw the benefits of journaling. We reviewed the partition layer which translates objects such as blobs, tables or queues to storage. We saw how the partition manager maintains a massive Object Table that is sharded on different partition servers which maintain their consistency across concurrent transactions using a lock service. We read the partition layer data model and the supported data types and operations, the partition layer architecture and the range partition data structures. We reviewed the load balancing, split and merge operations. Then we reviewed the intrastamp and interstamp replication. We also took stock of TCO, performance, application profiles and their comparison study.
We reviewed some of their design choices. They separated nodes dedicated to computation from nodes dedicated to storage. They decided to use range-based partitioning/indexing instead of hash-based indexing for the partition layer's Object Table. They used throttling and isolation, which proved very useful when accounts are not well behaved. They built automatic load balancing on top of the range-based partitioning approach and the account-based throttling. They used separate log streams for each RangePartition to isolate the load time of each RangePartition. They used journaling to overcome the read/write contention on the same drive. This also helps optimize small writes, which are the norm when the log files are separated per RangePartition. 
Arguably the single most beneficial design choice was the append-only system. The data is never overwritten once committed to the replica. If there are failures, the extent is immediately sealed. Since the data doesn't change, its commit length can be used to enforce consistency across all the replicas. Another application of the append-only system is its ability to provide snapshot or versioning features at little or no cost. With versions, erasure coding can also be done. Moreover, the append-only system has been a tremendous benefit for diagnosing issues as well as repairing/recovering the system on failures. With versioning, snapshots and the history of changes, we can now have tools to repair and recover corrupted state to a prior consistent state. The way this works is that the original is untouched and a copy is made for every write and appended after the previous such record. Since all changes are consistent, we get enough performance for scaling while still providing diagnosis and troubleshooting.
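As a minimal sketch of the idea (this is not the WAS implementation; the key/value log below is just an illustration), an append-only log never modifies earlier records, so every prior version remains readable as a snapshot:

using System.Collections.Generic;
using System.Linq;

class AppendOnlyStore
{
    private readonly List<(string key, string value)> log = new List<(string, string)>();

    // writes append a new record; earlier records are never touched
    public int Write(string key, string value)
    {
        log.Add((key, value));
        return log.Count;            // position usable as a snapshot/version handle
    }

    // latest value: the last record appended for the key
    public string Read(string key) => log.LastOrDefault(r => r.key == key).value;

    // snapshot read: replay the log only up to the given position
    public string ReadAsOf(string key, int version) =>
        log.Take(version).LastOrDefault(r => r.key == key).value;
}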
An append-only system comes with certain costs. Since each copy takes more space, efficient garbage collection is required to keep the space overhead low. Similarly, end-to-end checksums are a hallmark of this kind of service so that we keep the critical information safe.
#codingexercise
To count the total number of N-digit numbers whose digit sums at even and odd positions are equal (i.e., their difference is zero), we modify it as follows:
long GetCountEqual(int digits, int even, int odd, int n, bool isOdd)
{
    if (digits == n) return (even == odd) ? 1 : 0;
    long count = 0;
    if (isOdd) {
        for (int i = 0; i <= 9; i++)
            count += GetCountEqual(digits + 1, even, odd + i, n, false);
    } else {
        for (int i = 0; i <= 9; i++)
            count += GetCountEqual(digits + 1, even + i, odd, n, true);
    }
    return count;
}