Sunday, May 2, 2021

Special purpose algorithms


1.       Speech recognition - This is used to convert voice to text. It requires an analog-to-digital converter. The conversion can be visualized in a graph known as a spectrogram. A sound wave is captured and its amplitude is plotted over time, in units of decibels. The wave is then cut into small slices and stored in quantitative form. Frequency, intensity, and time are required to plot the spectrogram, which divides the audio into frames that are 20 to 40 milliseconds long. Phonemes are used to disambiguate the sound, and variations due to accents are measured as allophones.
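As a minimal sketch of the framing step described above, here is how a PCM signal might be cut into short analysis frames; the 25 ms frame length and 10 ms hop are typical values assumed for illustration, not taken from the text:

    // Sketch: slice a PCM signal into short, overlapping analysis frames of the
    // kind a spectrogram is computed from. The 25 ms frame and 10 ms hop are
    // assumed, typical values within the 20-40 ms range mentioned above.
    public class Framer {
        public static double[][] frame(double[] samples, int sampleRate) {
            int frameLen = (int) (0.025 * sampleRate); // ~25 ms per frame
            int hop = (int) (0.010 * sampleRate);      // 10 ms step between frames
            int count = samples.length < frameLen ? 0 : (samples.length - frameLen) / hop + 1;
            double[][] frames = new double[count][frameLen];
            for (int i = 0; i < count; i++) {
                System.arraycopy(samples, i * hop, frames[i], 0, frameLen);
            }
            return frames;
        }
    }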

The Hidden Markov Model, often combined with neural network models, is also used in speech recognition. It comprises several layers. The first layer assigns the probability that each detected phoneme is the correct one. The second layer checks the co-occurrence of phonemes and assigns probabilities to those pairs. The third layer performs the same check at the word level, assigning probabilities to word co-occurrences.

The model checks and rechecks all the probabilities to come up with the most likely text that was spoken.

The advantage of using neural networks is that they learn by training on data, are flexible, and can change over time. A neural network keeps improving as long as it knows the desired state and the actual state and can correct the error between them. It grasps a variety of phonemes and can detect the uniqueness of sounds that originate from accents and emotions, which complements the Hidden Markov Model. When the output variables are arranged in a sequence or a linear chain, we get a sequence model. This is the approach taken by a Hidden Markov Model. An HMM models a sequence of observations by assuming that there is an underlying sequence of states. Each state depends only on the previous state and is independent of all of its other ancestors. An HMM also assumes that each observation variable depends only on the current state. The model is therefore specified by three probability distributions: the distribution p(y) over initial states, the transition distribution from one state to the next, and the observation distribution of each observation given the current state.
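A minimal sketch of how those three distributions combine, assuming small toy matrices indexed by state and observation ids (nothing here is tied to a specific speech toolkit):

    // Sketch: the joint probability an HMM assigns to a state sequence and an
    // observation sequence, built from the three distributions named above.
    public class Hmm {
        private final double[] initial;      // initial[s]       = p(y1 = s)
        private final double[][] transition; // transition[s][t] = p(y_next = t | y = s)
        private final double[][] emission;   // emission[s][o]   = p(x = o | y = s)

        public Hmm(double[] initial, double[][] transition, double[][] emission) {
            this.initial = initial;
            this.transition = transition;
            this.emission = emission;
        }

        // Each state depends only on the previous state; each observation
        // depends only on the current state.
        public double jointProbability(int[] states, int[] observations) {
            double p = initial[states[0]] * emission[states[0]][observations[0]];
            for (int t = 1; t < states.length; t++) {
                p *= transition[states[t - 1]][states[t]]
                   * emission[states[t]][observations[t]];
            }
            return p;
        }
    }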

When we include interdependence between features in a generative model, it is usually done in one of two ways: enhance the model to represent dependencies among the inputs, or make simplifying independence assumptions. The first approach is difficult because we must maintain tractability. The second approach can hurt performance. The difference in their behavior is large since one family is generative and the other is discriminative.

Generative means the model is based on the joint distribution of the state and the observation. Discriminative means it is based on the conditional distribution of the state given the observation. We can also form models as generative-discriminative pairs.
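In symbols, with y the state and x the observation, this is the distinction between modeling the joint and modeling the conditional (a minimal sketch in the notation of the paragraph):

    p(y, x) = p(y)\, p(x \mid y)                      % generative
    p(y \mid x) = \frac{p(y, x)}{\sum_{y'} p(y', x)}  % discriminative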

2.       Image recognition – Some image recognition algorithms make use of pattern recognition. Pattern recognition refers to the classification or description of objects or patterns. The patterns themselves can range from characters in an image of printed text to biological waveforms. The recognition involves identifying the patterns and assigning labels for categories. We start with a set of training patterns. The main difference between pattern recognition and cluster analysis is the role of pattern class labels: in pattern recognition, we use the labels to formulate decision rules; in cluster analysis, we use them only to verify the results. Pattern recognition therefore requires extrinsic information, while cluster analysis uses only the data.

There are two basic paradigms for classifying a pattern into one of K different classes; the first is the geometric or statistical approach (the second is the syntactic or structural approach). In this approach, a pattern is represented in terms of d features, and the pattern features are chosen to be as independent of one another as possible. Then, given training patterns for each pattern class, the objective is to separate the patterns belonging to different classes.

In statistical pattern recognition, the features are assumed to have a probability density function that is conditioned on the pattern class. A pattern vector x belonging to a class wj is a data point drawn from the class-conditional probability distribution P(x|wj), where j is one of the K different classes. Concepts from statistical decision theory and discriminant analysis are utilized to establish decision boundaries between the pattern classes. If the class-conditional densities are known, Bayes decision theory gives the optimal decision rule. Since they are generally not known, a classifier is designed based on the nature of the information available, using either supervised or unsupervised learning. When the form of the class-conditional densities is known, we use parametric decision rules; when it is not, the density functions must be estimated from training samples and non-parametric decision rules are used. In supervised learning, the label on each training pattern represents the category to which the pattern belongs. In unsupervised learning, no labels are given, and the categories may be known beforehand or they may be unknown.
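For reference, the Bayes decision rule mentioned above can be written in the paragraph's notation as follows (a brief sketch, with P(wj) the class prior):

    P(w_j \mid x) = \frac{P(x \mid w_j)\, P(w_j)}{\sum_{k=1}^{K} P(x \mid w_k)\, P(w_k)},
    \qquad \text{assign } x \text{ to } w_j \text{ if } P(w_j \mid x) \ge P(w_k \mid x) \text{ for all } k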

When the number of pattern classes is unknown, the task becomes one of finding natural groupings in the data.

Saturday, May 1, 2021

Data mining techniques

 

Introduction: Data mining techniques add insights to a database that are typically not available from standard query operators, and IT operations rely largely on some form of data store or CMDB for their inventory and associated operations. OLE DB for Data Mining standardized data mining language primitives and became an industry standard. Prior to OLE DB, it was difficult to integrate data mining products: if one product was written using decision tree classifiers and another was written with support vector machines and they did not share a common interface, the application had to be rebuilt from scratch. Furthermore, the data that these products analyzed was not always in a relational database, which required data porting and transformation operations.
OLE DB for DM consolidates all of this. It was designed to allow data mining client applications to consume data mining services from a wide variety of data mining software packages. Clients communicate with data mining providers via SQL.
The OLE DB for Data Mining stack uses Data Mining Extensions (DMX), a SQL-like data mining query language, to talk to different DM providers. DMX statements can be used to create, modify, and work with different data mining models. DMX also contains several functions that can be used to retrieve statistical information. Furthermore, the data, and not just the interface, is unified: OLE DB connects the data mining providers to the data stores, so a cube, a relational database, or another miscellaneous data source can be used to supply data to the models and to retrieve and display statistical information.
The three main operations performed are model creation, model training and model prediction and browsing.
Model creation: A data mining model object is created just like a relational table. The model has a few input columns, one or more predictable columns, and the name of the data mining algorithm to be used when the model is later trained by the data mining provider.
Model training: The data are loaded into the model and used to train it. The data mining provider uses the algorithm specified during the creation to search for patterns. These patterns are the model content.
Model prediction and browsing: A select statement is used to consult the data mining model content in order to make model predictions and browse statistics obtained by the model.
An example of a model can be seen with a nested table for customer id, gender, age, and purchases. The purchases are associations between item_name and item_quantity, and a customer can make more than one purchase. Models can be created with attribute types such as ordered, cyclical, sequence_time, probability, variance, stdev, and support.
Model training involves loading the data into the model. The openrowset statement supports querying data from a data source through an OLE DB provider. The shape command enables loading of nested data.

When data mining alone is not sufficient for grouping, ranking, and sorting, AI techniques are used. The Microsoft ML package provides:
fast linear for binary classification or linear regression
one class SVM for anomaly detection
fast trees for regression
fast forests for churn detection and building multiple trees
neural net for binary and multi-class classification
logistic regression for classifying sentiments from feedback

#codingexercise: Canonball.docx 

Friday, April 30, 2021

Mobile data

There are types of data storage that are specific to mobile applications. Generally, a SQLite database is popular for mobile devices. It does not provide encryption by itself, but SQLCipher can be used to add it. An example of a key-value store, on the other hand, is the Oracle Berkeley database. Couchbase allows native JSON artifacts to be stored, while Mongo Realm and ObjectBox can store objects.
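As a minimal sketch of the SQLite option above, here is a bare-bones Android helper in Java; the database and table names are assumptions for illustration, and SQLCipher for Android exposes a near drop-in equivalent of these classes when encryption is needed:

    import android.content.Context;
    import android.database.sqlite.SQLiteDatabase;
    import android.database.sqlite.SQLiteOpenHelper;

    // Minimal local store: one table created on first open, dropped and
    // recreated on a version upgrade.
    public class NotesDbHelper extends SQLiteOpenHelper {
        public NotesDbHelper(Context context) {
            super(context, "notes.db", null, 1);
        }

        @Override
        public void onCreate(SQLiteDatabase db) {
            db.execSQL("CREATE TABLE notes (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)");
        }

        @Override
        public void onUpgrade(SQLiteDatabase db, int oldVersion, int newVersion) {
            db.execSQL("DROP TABLE IF EXISTS notes");
            onCreate(db);
        }
    }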

Mobile applications are increasingly becoming smarter with the two popular mobile platforms that have their own app stores and find universal appeal among their customers. The Android platform comes with support for Java programming and the tools associated with software development in this well-established language. Android Studio supports running and debugging the application in emulator mode, which vets the application just as thoroughly as testing it on a physical device. Modern Android development tools include Kotlin, Coroutines, Dagger-Hilt, Architecture Components, MVVM, Room, Coil and Firebase. The last of these also offers pre-packaged, open-source bundles of code called extensions that automate common development tasks.

The Firebase services that need to be enabled in the Firebase console for our purpose include Phone Auth, Cloud Firestore, Realtime Database, Storage and Composite Indexes. The Android Architecture Components include the following (a minimal Room sketch follows this list):
Navigation Component, which handles in-app navigation with a single Activity.
LiveData, whose data objects notify views when the underlying database changes.
ViewModel, which stores UI-related data that isn't destroyed on UI changes.
DataBinding, which generates a binding class for each XML layout file present in that module and allows you to more easily write code that interacts with views and declaratively binds observable data to UI elements.
WorkManager, an API that makes it easy to schedule deferrable, asynchronous tasks that are expected to run even if the app exits or the device restarts.
Room, an object-relational mapping between SQLite tables and POJOs.
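Here is the minimal Room sketch referenced above, in Java; the Note entity and its fields are assumptions for illustration, and a complete setup would also declare a @Database class:

    import androidx.room.Dao;
    import androidx.room.Entity;
    import androidx.room.Insert;
    import androidx.room.PrimaryKey;
    import androidx.room.Query;
    import java.util.List;

    // An entity mapped to a SQLite table and a DAO that reads and writes it.
    @Entity
    public class Note {
        @PrimaryKey(autoGenerate = true)
        public long id;
        public String body;
    }

    @Dao
    interface NoteDao {
        @Insert
        void insert(Note note);

        @Query("SELECT * FROM Note")
        List<Note> getAll();
    }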

Dependency injection is handled via Dagger-Hilt, which incorporates inversion of control through Dagger dependency injection, and the Hilt-ViewModel, which injects dependencies into the ViewModel.

The Firebase extensions support Cloud Messaging for sending notifications to the client application, Cloud Firestore as a flexible, scalable NoSQL cloud database to store and sync data, Cloud Storage for storing and serving user-generated content, and Authentication for creating an account with a mobile number.

The Kotlin serializer converts specific classes to and from JSON; the runtime library provides the core serialization API, and supporting libraries handle different serialization formats. Coil-Kt is used as an image loading library for Android, backed by Kotlin Coroutines.

#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWzA9GsoSyfDDANrXf?e=vu6Pah

Thursday, April 29, 2021

Synchronization of state with remote (continued...)

 

The key features of data synchronization include the following:

1.      Data scoping and partitioning: Typically, the enterprise stores contain a lot more data than what is needed by client devices and their applications. This calls for scoping the data that needs to be synchronized. There are two ways to do this. First, we restrict the data to only those tables that pertain to the user on that client; any data that does not pertain to the user can be skipped. Second, the data that needs to be synchronized from those tables is minimized.
Partitioning can also be used to reduce the data that is synchronized. As is typical of partitioning, it can be horizontal or vertical. Usually, vertical partitioning is done prior to horizontal because it trims the columns that need to be synchronized. The set of columns is easily found by comparing what is required for that user with what is used by the application. After the columns are decided, a filter can be applied to reduce the rowset, again with the help of predicates that involve a user clause (see the query sketch after this list). Reducing the scope and partitioning of the data also reduces the errors introduced by conflicts, which improves performance.

2.      Data compression: Another way to reduce already scoped and partitioned data is to reduce the number of bytes used for its transfer. Data compression reduces this size. It is useful for reducing both time and cost, although it incurs some overhead in the compression and decompression routines. Some of these routines may be more expensive for a mobile device than for the server. Also, some data types are easy to compress while others aren't, so compression helps only where those data types are used (see the gzip sketch after this list).

3.      Data transformation: By the same argument as above, it is easier for the server to process and transform the data because of its compute and storage resources. Therefore, conversion to a format that is suitable for mobile devices is more easily done on the server side. Such transformations might even include conversion to data types that are compression friendly. Also, numerical data might be converted to strings if the mobile devices find strings easier to handle.

4.      Transactional integrity: This means that either all the changes are committed or none of them are. Transactional changes occur in isolation and should not affect others. Once the transaction is committed, its effects persist even across failures. Maintaining transactional behavior over the network involves retries, which is not efficient; it is easier to enforce transactions within the store. If the synchronization involves remote transactions, then a rollback on the remote requires rolling back the entire synchronization and retrying during the next synchronization. When databases allow net changes to be captured, there is an option to skip the transaction log if the individual transactions are not important and only the initial and final states matter. If the order of the changes is also important, then the transaction log matters as well: it can be read in chronological order and executed on the destination database (see the transaction sketch after this list).
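Item 1 above can be sketched with a plain JDBC query in which the column list is the vertical partition and the user predicate is the horizontal one; the table and column names (orders, order_id, status, user_id) are assumptions for illustration:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    // Scope a sync query: project only the columns the client needs and filter
    // rows down to the ones that pertain to this user.
    public class ScopedSyncQuery {
        public static List<String> fetchForUser(Connection conn, long userId) throws SQLException {
            String sql = "SELECT order_id, status FROM orders WHERE user_id = ?";
            List<String> rows = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, userId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows.add(rs.getString("order_id") + ":" + rs.getString("status"));
                    }
                }
            }
            return rows;
        }
    }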
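The compression in item 2 can be sketched with the standard java.util.zip API; whether this pays off depends on the data types involved, as noted above:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    // Gzip a payload before transfer to reduce the bytes sent over the network.
    public class PayloadCompressor {
        public static byte[] gzip(String payload) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
                gz.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        }
    }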
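Item 4 can be sketched as applying a batch of synchronized statements inside a single JDBC transaction so that either every change commits or none of them do:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Apply all changes atomically: commit only if every statement succeeds.
    public class AtomicApply {
        public static void applyAll(Connection conn, String[] statements) throws SQLException {
            boolean previous = conn.getAutoCommit();
            conn.setAutoCommit(false);
            try (Statement stmt = conn.createStatement()) {
                for (String sql : statements) {
                    stmt.executeUpdate(sql); // every change joins the same transaction
                }
                conn.commit();               // all changes become visible together
            } catch (SQLException e) {
                conn.rollback();             // none of the changes persist
                throw e;
            } finally {
                conn.setAutoCommit(previous);
            }
        }
    }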

#codingexercise

Given clock-hand positions for different points of time as pairs A[i][0] and A[i][1], where the order of the hands does not matter but the angle they enclose does, count the number of pairs of points of time at which the enclosed angles are the same.

    import java.util.Arrays;

    public class ClockHandAngles {

        // Enclosed angle for each point of time, taken here as the absolute
        // difference of the two hand positions.
        public static int[] getClockHandsDelta(int[][] A) {
            int[] angles = new int[A.length];
            for (int i = 0; i < A.length; i++) {
                angles[i] = Math.max(A[i][0], A[i][1]) - Math.min(A[i][0], A[i][1]);
            }
            return angles;
        }

        // n choose k via factorials; adequate for the small counts used here.
        public static int NChooseK(int n, int k) {
            if (k < 0 || k > n || n == 0) return 0;
            if (k == 0 || k == n) return 1;
            return Factorial(n) / (Factorial(n - k) * Factorial(k));
        }

        public static int Factorial(int n) {
            if (n <= 1) return 1;
            return n * Factorial(n - 1);
        }

        // Sort the deltas, then for every run of identical values add C(run, 2) pairs.
        public static int countPairsWithIdenticalAnglesDelta(int[] angles) {
            Arrays.sort(angles);
            int count = 1;
            int result = 0;
            for (int i = 1; i < angles.length; i++) {
                if (angles[i] == angles[i - 1]) {
                    count += 1;
                } else {
                    result += NChooseK(count, 2);
                    count = 1;
                }
            }
            result += NChooseK(count, 2);
            return result;
        }

        public static void main(String[] args) {
            int[][] A = new int[5][2];
            A[0][0] = 1;    A[0][1] = 2;
            A[1][0] = 2;    A[1][1] = 4;
            A[2][0] = 4;    A[2][1] = 3;
            A[3][0] = 2;    A[3][1] = 3;
            A[4][0] = 1;    A[4][1] = 3;
            int[] angles = getClockHandsDelta(A); // 1 2 1 1 2 -> sorted: 1 1 1 2 2
            System.out.println(countPairsWithIdenticalAnglesDelta(angles)); // 4
        }
    }




Wednesday, April 28, 2021

Synchronization of state with remote (continued...)

Efficiency in data synchronization in these configurations and architectures comes from determining what data changes, how to scope it, and how to reduce the traffic associated with propagating the change. It is customary to have a synchronization layer on the client, synchronization middleware on the server, and a network connection during the synchronization process that supports bidirectional updates. The basic synchronization process involves the initiation of synchronization, either on demand or on a periodic basis; the preparation of data and its transmission to a server with authentication; the execution of the synchronization logic on the server side to determine the updates and the transformations; the persistence of the changed data over a data adapter to one or more data stores; the detection and resolution of conflicts; and finally the relaying of the results of the synchronization back to the client application.
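A rough sketch of that flow as a Java interface; every name here is hypothetical, invented for illustration rather than taken from any particular synchronization framework:

    import java.util.List;

    // Hypothetical contract mirroring the steps of one synchronization round.
    public interface SyncSession<T> {
        // Initiation: on demand or on a schedule, with authentication.
        void start(String clientId, String authToken);

        // Preparation: collect the local changes to transmit to the server.
        List<T> collectLocalChanges();

        // Server-side logic determines updates and transformations, persists them
        // through a data adapter, and returns the changes the client must apply.
        List<T> pushAndPull(List<T> localChanges);

        // Conflict detection and resolution, e.g. last-writer-wins.
        T resolveConflict(T local, T remote);

        // Relay the results back to the client application's local store.
        void applyServerChanges(List<T> serverChanges);
    }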

The choice of synchronization technique depends on the situation. One of the factors that plays into this is the synchronization mode. There are two main modes of synchronization: snapshot and net change. A snapshot is the data as of a point in time; the data in a snapshot does not change, so it is useful for comparison. This mode makes it possible to move a large amount of data from one system to another, which suits the case where the data has not changed at the remote location. Since snapshots might contain a large amount of data, a good network connection is required to transfer them. Updates to a product catalog or price list are a great use case for snapshot synchronization because the updates are collected in a snapshot, transferred, and loaded at once on the destination store.

The net-change mode of synchronization can be considered somewhat more efficient than snapshot synchronization. In this case, only the changed data is sent between the source and destination data stores, which reduces network bandwidth and connection times. If the data changes quite often on one server, only the initial and final states are required to compute the changes that are then made on the destination. Both modes are bidirectional, so changes made in a local store can be propagated to the enterprise store. The net-change mode does not take into consideration the changes made in individual transactions; if those are important, transaction-log-based synchronization may work better.
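A minimal sketch of the net-change idea, assuming the source table keeps a last-modified timestamp column; the names orders, modified_at, order_id, and status are illustrative assumptions:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.util.ArrayList;
    import java.util.List;

    // Read only the rows that changed since the last synchronization watermark.
    public class NetChangeReader {
        public static List<String> changedSince(Connection conn, Timestamp lastSync) throws SQLException {
            String sql = "SELECT order_id, status FROM orders WHERE modified_at > ?";
            List<String> rows = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setTimestamp(1, lastSync);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows.add(rs.getString("order_id") + ":" + rs.getString("status"));
                    }
                }
            }
            return rows;
        }
    }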

Transmission of data is also important to the effectiveness of a synchronization technique. If an application is able to synchronize without the involvement of the user, it will work on any network, wired or wireless; otherwise, the latter usually requires human intervention to set up a connection. There are two types of data propagation methods: session based and message based. The session-based synchronization method requires a direct connection. The updates can be made both ways and they can be acknowledged. The synchronization resumes even after a disruption; the point from which it resumes is usually the last committed transaction. The connection for this data propagation method can be established well in advance.

Message-based synchronization requires the receiver to take the message and perform the changes. When this is done, a response is sent back. Messages help when there is no reliable network connection. The drawback is that there is no control over when the messages will be acted upon and responded to.