Monday, April 8, 2019

Uses of object storage with Sequences: 
Sequence databases are niche storage. They do not find everyday use in commercial systems because traditional relational and non-relational databases already provide the ability to store large data sets and their indexes. The availability of those databases in the cloud has enabled the notion of unlimited data in representations such as BigTable. Sequences, however, do not expand along the columns of a table. Instead they run to the order of billions of rows.
The processing for large sets of rows has remained somewhat similar, so traditional databases have served for all data, including sequences in their tables. Sequence processing, however, has deviated from this conventional analytical stack. Sequences involve a prefix tree. The processing stages tend to prune, clean and canonicalize the data before the sequence patterns are discovered. Even a dynamic bit vector data structure or a Bloom filter may be used to determine whether an element is part of a sequence or not.
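The membership test mentioned above can be sketched with a small Bloom filter: no false negatives, and a false-positive rate tuned by the bit-vector size and hash count. This is a minimal illustration assuming string-valued sequence elements; the class name and sizing are invented for the example:

```java
import java.util.BitSet;

// Minimal Bloom filter over sequence elements: answers set membership
// without storing the elements themselves.
public class SequenceBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public SequenceBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive k bit positions from two hashes of the element (double hashing).
    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String element) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(element, i));
        }
    }

    // True means "possibly present"; false means "definitely absent".
    public boolean mightContain(String element) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A dynamic bit vector generalizes this by letting the underlying BitSet grow as elements arrive.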
Generation of sequences is also multi-stage processing. It involves discovering elements for the sequences prior to collecting the sequences. Such extraction of elements requires cleaning, stemming and even running neural nets so that the elements can be weighted before they are extracted. Sequences merely help with the formation of groups. Sometimes the ordering is important, and at other times sequences degenerate to groups.
Groups have had limited application in analysis because groups proliferate and there is no good way to determine what is important and what isn't. There is also no easy way to tell how many elements should remain in a group or what to exclude. This makes groups difficult for analysis, as opposed to vectorization of elements, where their latent power and associations more readily form patterns with data mining techniques. Statistical and other forms of analysis also prefer vectorization. Graphs and page ranks likewise work better on elements that are vectors rather than scalars.
However, groups do have the ability to form pseudo elements, and these elements can also participate in the formation of graphs and their analysis via page ranks. The use of a search engine over web resources demonstrates that page ranking can assign weights to elements. With groups as pseudo elements, there is some categorization, which can lead to hierarchies or levels. These hierarchies or levels add value to the otherwise flat representation of resource rankings, where only the top few ever get noticed and the rest are ignored.
The meaningfulness of unordered groups or ordered sequences can improve search as well as the prediction of queries. This has been the underlying basis for collaborative filtering, where users are genuinely interested in viewing items that others have viewed while trying to find something similar. Therefore, groups and sequences hold a lot of promise.
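As an illustration of the collaborative-filtering idea, the following item-based sketch ranks items by how often they co-occur with a given item across users' viewing histories. The class and method names are invented for the example, and a real system would normalize by popularity rather than use raw counts:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of item-based collaborative filtering: "users who viewed this
// item also viewed...", ranked by co-occurrence count.
public class CoViewRecommender {
    public static List<String> alsoViewed(List<Set<String>> histories, String item) {
        Map<String, Integer> coCounts = new HashMap<>();
        for (Set<String> history : histories) {
            if (!history.contains(item)) continue;   // only users who saw the item
            for (String other : history) {
                if (!other.equals(item)) {
                    coCounts.merge(other, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(coCounts.keySet());
        ranked.sort((a, b) -> coCounts.get(b) - coCounts.get(a));
        return ranked;
    }
}
```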
Object storage is a representation of infinite, web-accessible storage of key-value collections in a hierarchical namespace, which is well suited to groups and collections. With compute resources able to access the storage directly over the web, and with object storage demonstrating the best practice of the storage industry, analysis using groups and sequences becomes much more agile on such storage and compute.

Sunday, April 7, 2019

Today we take a break from discussing the best practice from storage engineering.

Sequence Analysis:
Data is increasing more than ever and at a fast pace. Algorithms for data analysis are required to become more robust, efficient and accurate. Specializations in databases and higher-end processors suitable for artificial intelligence have contributed to improvements in data analysis. Data mining techniques discover patterns in the data and are useful for predictions, but they tend to require traditional databases.
Sequence databases are highly specialized, and even though they can be supported by the B-Tree data structure that many contemporary databases use, they tend to be larger than many commercial databases. In addition, algorithms for mining sequential rules tend to generate all such rules, producing an enormous number of redundant ones. The large number not only makes mining inefficient, it also hampers iteration. Such algorithms depend on patterns obtained from earlier frequent-pattern mining algorithms. However, if the rules are normalized and the redundancies removed, they become efficient to store and use with a sequence database.
The data structures used for sequence rules have evolved. The use of a dynamic bit vector data structure is now an alternative. The data mining process involves a prefix tree. Early data processing stages tend to prune, clean and canonicalize, and these have reduced the number of rules.
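A minimal sketch of the prefix tree mentioned above, assuming sequences of string items and keeping a count per node so that infrequent prefixes can be pruned during mining (the class name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Prefix tree (trie) over sequences of items. Each node counts how many
// inserted sequences pass through it, which supports frequency pruning.
public class PrefixTree {
    private final Map<String, PrefixTree> children = new HashMap<>();
    private int count;

    public void insert(String[] sequence) {
        PrefixTree node = this;
        for (String item : sequence) {
            node = node.children.computeIfAbsent(item, k -> new PrefixTree());
            node.count++;
        }
    }

    // Frequency of a given prefix across all inserted sequences.
    public int prefixCount(String[] prefix) {
        PrefixTree node = this;
        for (String item : prefix) {
            node = node.children.get(item);
            if (node == null) return 0;
        }
        return node.count;
    }
}
```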
In the context of text mining, sequences have had limited application because the ordering of words has never been important for determining topics. However, salient keywords that are coherent enough for a topic, regardless of their locations, tend to form sequences rather than groups. Finding semantic information with word vectors does not help with this ordering; the two are independent variables. Moreover, word vectors are formed only with a predetermined set of dimensions, and these dimensions do not increase significantly with progressive text. There is no method for vectors to be redefined with increasing dimensions as the text progresses.
The number and scale of dynamic groups of word vectors can be arbitrarily large. The ordering of the words can remain alphabetical. These words can then map to a word vector table where the features are predetermined, giving the table a rather fixed number of columns instead of leaving it to be a big table.
Since there is a lot of content with similar usage of words and phrases, and almost everyone uses the language in simple, day-to-day English, there is a higher likelihood that some of these groups will stand out in terms of usage and frequency. When we have exhaustively collected and persisted frequently co-occurring words in a groupopedia, as interpreted from a large corpus with no limit to the number of words in a group, and the groups are persisted in sorted order of their frequencies, then we have a two-fold ability: first, to shred a given text into predetermined groups, thereby instantly recognizing topics; and second, to add pseudo word vectors, where groups translate to vectors in the vector table.
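A toy sketch of collecting co-occurring word groups and ordering them by frequency. Here a "group" is simply the set of words in a sentence, canonicalized by lowercasing and sorting; the groupopedia persistence is out of scope and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Count canonicalized word groups across a corpus and return them in
// descending order of frequency.
public class GroupCounter {
    public static List<Map.Entry<String, Integer>> topGroups(List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            // Canonicalize the group: lowercase, sorted, de-duplicated words.
            TreeSet<String> words = new TreeSet<>(
                Arrays.asList(sentence.toLowerCase().split("\\W+")));
            words.remove("");
            String key = String.join(" ", words);
            counts.merge(key, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> groups = new ArrayList<>(counts.entrySet());
        groups.sort((a, b) -> b.getValue() - a.getValue());
        return groups;
    }
}
```

Because the group key ignores word order, "the cat sat" and "sat the cat" collapse into one group, which is exactly the degeneration from sequences to groups described above.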

#codingexercise
Yesterday's coding question continued:
Since we have A2 as a small array, we can directly use it to sort the elements
// assumes java.util.List, ArrayList and Collections are imported
public List<Integer> relativeSort(List<Integer> A1, List<Integer> A2) {
       List<Integer> result = new ArrayList<>();
       for (Integer a : A2) {
               int start = 0;
               for (int index = findFirst(A1, a, start); index != -1; index = findFirst(A1, a, start)) {
                       result.add(A1.get(index));
                       start = index + 1;
               }
       }
       // Append, in sorted order, the elements of A1 that do not appear in A2.
       if (result.size() < A1.size()) {
               List<Integer> remaining = new ArrayList<>();
               for (Integer a : A1) {
                       if (findFirst(A2, a, 0) == -1) {
                               remaining.add(a);
                       }
               }
               Collections.sort(remaining);
               result.addAll(remaining);
       }
       return result;
}

public int findFirst(List<Integer> list, Integer a, int start) {
       for (int i = start; i < list.size(); i++) {
               if (list.get(i).equals(a)) {
                       return i;
               }
       }
       return -1;
}


Saturday, April 6, 2019

Today we continue discussing the best practice from storage engineering:

680) There is a lot in common between installer packages on any desktop platform and the containerization deployment operators.

681) The setup of service accounts is part of the operator actions for deployment and it remains one of the core requirements for the deployment of the product.

682) The dependencies of the software deployment may include such things as an identity provider. The identity provider facilitates the registration of service accounts.

683) Identity providers issue tokens for accounts, and these can be interpreted for users and roles.

684) Identity providers do not mandate the use of one protocol such as OAuth or JWT, and it is easy to switch from one to the other with HTTP requests.
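The token interpretation in 683) can be sketched for the JWT case: a JWT is three base64url segments separated by dots, and the middle segment is a JSON object of claims such as the subject and roles. The snippet below decodes only that payload using the JDK and deliberately skips signature verification, which any real deployment must add; the class name is invented for the example:

```java
import java.util.Base64;

// Sketch: extract the claims section of a JWT (header.payload.signature).
// Decodes the payload only; does NOT verify the signature.
public class TokenPayload {
    public static String claimsJson(String jwt) {
        String[] parts = jwt.split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("not a JWT");
        }
        return new String(Base64.getUrlDecoder().decode(parts[1]));
    }
}
```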

685) The accounts can be made suitable for use with any Identity provider.
686) Storage products keep a journal with the data which helps with data collection.

687) Storage products also keep a ledger for the accesses to the data.
688) Storage products keep their indexes for their data so that it can be looked up faster.

689) Storage products use the journal, ledger and index to maintain data but they are eventually just collections of serialized data structures and they do not look very different at the lower levels.

690) Storage products do have to honor user organizations, which means data has to be copied as many times as the user indicates.

#codingexercise
Given two arrays A1[] and A2[] of size N and M respectively. The task is to sort A1 in such a way that the relative order among the elements will be the same as in A2. For the elements not present in A2, append them at the end in sorted order. It is also given that the number of elements in A2[] is smaller than or equal to the number of elements in A1[], and A2[] has all distinct elements.

    public class RelativeComparator implements java.util.Comparator<Integer> {

        private final Integer[] A2;

        public RelativeComparator(Integer[] A2) {
            this.A2 = A2;
        }

        @Override
        public int compare(Integer a1, Integer a2) {
            int indexA = first(A2, a1);
            int indexB = first(A2, a2);
            if (indexA != -1 && indexB != -1) {
                return indexA - indexB; // both in A2: order by position in A2
            } else if (indexA != -1) {
                return -1;              // only a1 in A2: it comes first
            } else if (indexB != -1) {
                return 1;               // only a2 in A2: it comes first
            } else {
                return a1 - a2;         // neither in A2: natural order
            }
        }

        private int first(Integer[] A2, Integer key) {
            for (int index = 0; index < A2.length; index++) {
                if (A2[index].equals(key)) {
                    return index;
                }
            }
            return -1;
        }
    }

    public void sortA1ByA2(Integer[] A1, Integer[] A2) {
        java.util.Arrays.sort(A1, new RelativeComparator(A2));
    }


Friday, April 5, 2019

Today we continue discussing the best practice from storage engineering:
674) There is no limit to the number of operators run during deploy. It is preferred to run them sequentially one after the other. The more granular the operators the better they are for maintenance.
675) The diagnosability of operators improves with each operator being small and transparent.
676) The upgrade operator can perform rolling upgrades. Then the operator does not require scaling up or down.
677) The operators can invoke methods defined in other operators as long as they are accessible.
678) Frequently storage applications are written with dependencies on other applications. Those applications may also support operators. Operators can be chained one way or the other.
679) Requirements on operator logic change depending on whether the product is an analysis package or a storage product.
680) There is a lot in common between installer packages on any desktop platform and the containerization deployment operators.
681) The setup of service accounts is part of the operator actions for deployment and it remains one of the core requirements for the deployment of the product.
682) The dependencies of the software deployment may include such things as an identity provider. The identity provider facilitates the registration of service accounts.
683) Identity providers issue tokens for accounts, and these can be interpreted for users and roles.
684) Identity providers do not mandate the use of one protocol such as OAuth or JWT, and it is easy to switch from one to the other with HTTP requests.
685) The accounts can be made suitable for use with any Identity provider.

#codingexercise

// topN: take the N largest values from two arrays that are already
// sorted in descending order, merging with bounds checks.
    List<Double> topN(Double[] first, Double[] second, int N) {
        List<Double> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        for (int k = 0; k < N && (i < first.length || j < second.length); k++) {
            Double value;
            if (i < first.length && (j >= second.length || first[i] > second[j])) {
                value = first[i];
                i++;
            } else {
                value = second[j];
                j++;
            }
            result.add(value);
        }
        return result;
    }


Thursday, April 4, 2019

Today we continue discussing the best practice from storage engineering :

670) There is no special requirement for performance or security from the containerization framework beyond what the application wants from the host, because this is internal and not visible to the user.

671) Operators like update involve scaling down the nodes, performing the update and then scaling back up. There is no restriction on reusing logic between operators.

672) Operators generally work one at a time on the same cluster. This prevents states from being mixed and allows each reconciliation between state and deployment to happen sequentially.

673) Operators do not have to retain any information between invocations. Anything that needs to be persisted has to be part of the state.

674) There is no limit to the number of operators run during deploy. It is preferred to run them sequentially one after the other. The more granular the operators the better they are for maintenance.

675) The diagnosability of operators improves with each operator being small and transparent.

676) The upgrade operator can perform rolling upgrades. Then the operator does not require scaling up or down.

677) The operators can invoke methods defined in other operators as long as they are accessible.

678) Frequently storage applications are written with dependencies on other applications. Those applications may also support operators. Operators can be chained one way or the other.

679) Requirements on operator logic change depending on whether the product is an analysis package or a storage product.

680) There is a lot in common between installer packages on any desktop platform and the containerization deployment operators.

Wednesday, April 3, 2019

Today we continue discussing the best practice from storage engineering:

667) The type of features from the operator SDK used by the product depends on the automation within the container-specific operator specified by the product. If the product wants to restrict its usage of the container, it can take advantage of just the minimum.

668) One of these features is metering and it works the same way as in the public cloud. Containers are still catching up with the public cloud on this feature.

669) The operators can be written in a variety of languages depending on the SDK; however, in many cases a barebones application without heavy interpreters or compilers is preferred. The Go language is used for these purposes, particularly in DevOps.

670) There is no special requirement for performance or security from the containerization framework beyond what the application wants from the host, because this is internal and not visible to the user.

671) Operators like update involve scaling down the nodes, performing the update and then scaling back up. There is no restriction on reusing logic between operators.

672) Operators generally work one at a time on the same cluster. This prevents states from being mixed and allows each reconciliation between state and deployment to happen sequentially.

673) Operators do not have to retain any information between invocations. Anything that needs to be persisted has to be part of the state.

674) There is no limit to the number of operators run during deploy. It is preferred to run them sequentially one after the other. The more granular the operators the better they are for maintenance.

675) The diagnosability of operators improves with each operator being small and transparent.



Tuesday, April 2, 2019

Today we continue discussing the best practice from storage engineering:

661) The operator used to build and deploy the storage server can take care of all the human-oriented administrative tasks such as upgrades, scaling, and backups.

662) The logic of these administrative tasks remains the same across versions, size and nature.

663) The parameters for the task are best described in the declarative syntax associated with the so-called custom resource required for these deployment operators.

664) The number of tasks and their types vary from application to application and can become sophisticated and customized. There is no restriction on this kind of automation.

665) The containerized image built has to be registered in the hub so that it is made available everywhere.

666) There are many features on the container framework that the storage product can leverage. Some of these features are available via the SDK. However, container technologies continue to evolve in terms of following the Platform as a service layer and the public cloud example.

667) The type of features from the operator SDK used by the product depends on the automation within the container-specific operator specified by the product. If the product wants to restrict its usage of the container, it can take advantage of just the minimum.

668) One of these features is metering and it works the same way as in the public cloud. Containers are still catching up with the public cloud on this feature.

669) The operators can be written in a variety of languages depending on the SDK; however, in many cases a barebones application without heavy interpreters or compilers is preferred. The Go language is used for these purposes, particularly in DevOps.

670) There is no special requirement for performance or security from the containerization framework beyond what the application wants from the host, because this is internal and not visible to the user.