Friday, May 14, 2021

Write once mobile applications

Introduction: Mobile applications were traditionally written once for Android and then rewritten for the iPhone; expertise grew around each ecosystem even though the business logic remained platform-independent. This duplication can now be avoided by writing the business logic once and running it platform-independently. This article describes how to write such an application using the Codename One framework, which provides the features required from the native stacks.

Description:  

Codename One emphasizes the “write once, deploy everywhere” model of mobile application development. While the Android platform has become synonymous with Java development and Apple device development with Objective-C, developers struggle to port business logic between the two. Codename One instead allows the application to be written in Java and released to multiple platforms, including iOS. The binaries appear native to the platform on which they are released, so the experience is seamless for end-users, who do not care which language the application was written in. A sample program is included below for reference.

Conclusion: The application can be written once and deployed everywhere by targeting the build for each platform and writing installers. Setting up the environment might take several attempts due to dependencies, but writing the application is easy once the environment is up and running.

 

package com.shrinktext.precis;

import static com.codename1.ui.CN.*;

import com.codename1.ui.Button;
import com.codename1.ui.Form;
import com.codename1.ui.Dialog;
import com.codename1.ui.Display;
import com.codename1.ui.Label;
import com.codename1.ui.TextComponent;
import com.codename1.ui.plaf.UIManager;
import com.codename1.ui.util.Resources;
import com.codename1.io.Log;
import com.codename1.io.NetworkEvent;
import com.codename1.ui.Toolbar;
import com.codename1.ui.layouts.BoxLayout;
import com.codename1.ui.layouts.TextModeLayout;

/**
 * This file was generated by <a href="https://www.codenameone.com/">Codename One</a> for the purpose
 * of building native mobile applications using Java.
 */
public class Precis {

    private Form current;
    private Resources theme;

    public void init(Object context) {
        // use two network threads instead of one
        updateNetworkThreadCount(2);

        theme = UIManager.initFirstTheme("/theme");

        // Enable Toolbar on all Forms by default
        Toolbar.setGlobalToolbar(true);

        // Pro only feature
        Log.bindCrashProtection(true);

        addNetworkErrorListener(err -> {
            // prevent the event from propagating
            err.consume();
            if (err.getError() != null) {
                Log.e(err.getError());
            }
            Log.sendLogAsync();
            Dialog.show("Connection Error", "There was a networking error in the connection to "
                    + err.getConnectionRequest().getUrl(), "OK", null);
        });
    }

    public void start() {
        if (current != null) {
            current.show();
            return;
        }
        TextModeLayout tl = new TextModeLayout(3, 2);
        Form f = new Form("Text Summarization form", tl);

        TextComponent input = new TextComponent().label("Input").multiline(true);
        TextComponent output = new TextComponent().label("Output").multiline(true);
        Label description = new Label("Summarizes the input text"); // assumed caption for the referenced description label

        f.add(tl.createConstraint().widthPercentage(60), input);
        f.add(tl.createConstraint().widthPercentage(40), output);
        f.add(tl.createConstraint().horizontalSpan(2), description);
        f.setEditOnShow(input.getField());

        Button summarize = new Button("Summarize");
        f.add(summarize);
        summarize.addActionListener((e) -> OnClickSummarize(summarize));

        Button reset = new Button("Reset");
        f.add(reset);
        reset.addActionListener((e) -> OnClickReset(reset));

        f.show();
    }

    // ...
}

Thursday, May 13, 2021

A comparison of libraries in Java for machine learning applications

 

 

Problem statement: Machine learning algorithms are computation-intensive and usually involve heavy matrix operations. Statistical and numerical analysis was traditionally done with interpreted environments such as Matlab and Python, whose libraries grew over time to include many models and algorithms. Data science made several of these popular through libraries such as NumPy, spaCy, and Transformers, and machine learning applications continue to evolve with Python as the preferred language. This article investigates the gap between what is available in Python and what is available in Java for common machine learning tasks. We use an example from text analysis to implement services in the Java language.

Description: The example taken for writing a service in Java performs extractive summarization of text. It requires a natural language processing algorithm called BERT that generates embeddings: matrices of doubles representing probabilities of sequences. The algorithm is part of a pre-trained model that must be loaded before the embeddings are requested; the embeddings are then used to score and evaluate the sequences into which the text is shredded. Given this example, we have requirements for:
1. a natural language library that represents the model and can load it, such as transformers and spacy;
2. a machine learning library that performs clustering, decomposition, and other analysis, such as sklearn;
3. a library for the data structures holding the sequences and vectors from which the embeddings are derived, such as tensors; and
4. operations over collections, which may not be part of the programming language, and structures for holding intermediary data and computations, such as numpy.
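The scoring step can be illustrated in plain Java. Below is a minimal sketch that ranks sentence embeddings by their cosine similarity to a reference vector; the class and method names are illustrative assumptions, not taken from any particular library.

    // Illustrative only: ranks sentence embeddings against a reference vector.
    public class EmbeddingScore {
        // cosine similarity between two embedding vectors
        public static double cosineSimilarity(double[] a, double[] b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        // index of the sentence whose embedding is closest to the reference
        public static int argmax(double[][] sentenceEmbeddings, double[] reference) {
            int best = 0;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < sentenceEmbeddings.length; i++) {
                double score = cosineSimilarity(sentenceEmbeddings[i], reference);
                if (score > bestScore) {
                    bestScore = score;
                    best = i;
                }
            }
            return best;
        }
    }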

We see that this application has just a handful of requirements, and they are well served by existing libraries in the Python language such as transformers, spacy, sklearn, torch, and numpy. The example we took has very little customization and performs only scoring over the results from the model to produce the output. This lets us look for alternatives to the mentioned libraries in the Java programming language. The natural language processing algorithm has been implemented in both Python and Java. However, the application would do well with a higher-level library, since much of the processing is common across applications. Transformers serves this purpose in Python, but there is no equivalent Java library. Firebase offers machine learning for Android, and there is a gallery of custom implementations, but none serve this purpose. In contrast, the sklearn library, which provides computations for clustering, decompositions, and mixtures, has an equivalent in the Deep Java Library. The torch library is required for vectorization, and its Tensor type serves this purpose very well. These libraries are also well-documented and easy to use, requiring only minimal conversion to a format on which the library can perform the said operations. Numpy does not have an equivalent Java library per se, but many of its operations can be implemented for specific purposes with out-of-the-box primitives from the Java programming language.
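For the tensor and matrix operations, a minimal sketch of numpy-style work using the Deep Java Library's NDArray is shown below. It assumes the ai.djl.ndarray package is on the classpath; the shapes and values are made up for illustration.

    import ai.djl.ndarray.NDArray;
    import ai.djl.ndarray.NDManager;
    import ai.djl.ndarray.types.Shape;

    public class NdArraySketch {
        public static void main(String[] args) {
            // the manager owns the native memory backing the arrays
            try (NDManager manager = NDManager.newBaseManager()) {
                // a 2 x 3 matrix of "embeddings" and a 3 x 1 weight vector (made-up values)
                NDArray embeddings = manager.create(new float[]{1f, 2f, 3f, 4f, 5f, 6f}, new Shape(2, 3));
                NDArray weights = manager.create(new float[]{0.2f, 0.3f, 0.5f}, new Shape(3, 1));
                NDArray scores = embeddings.matMul(weights);    // matrix product, like numpy's dot
                NDArray normalized = scores.div(scores.sum());  // elementwise ops broadcast as in numpy
                System.out.println(normalized);
            }
        }
    }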

Conclusion: We see that most of the gap is in a higher-level natural language processing library for Java. All other operations are doable with existing implementations.

 Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWzHzJoGjFeZdbJ8J3?e=Iex4dk

Wednesday, May 12, 2021

Problem statement: Given a floor of size M x N units and tiles of sizes 1 x 1 and 2 x 2 square units, count the number of ways in which the tiles can be arranged on the floor such that each square unit of the floor is covered by a tile. Two arrangements are different if any square unit is covered by a different tile. The number of unit squares in the column-wise direction can be at most 7.


Solution:
Some sample cases, with floor dimensions M x N:

1 x 2 => 1 arrangement
2 x 2 => 2 arrangements
3 x 3 => 5 arrangements
2 x 4 => 5 arrangements
3 x 4 => 11 arrangements ( 1 all-small-tiles + 6 one-big-tile + 4 two-big-tiles )
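
These counts can be sanity-checked with a small recursive brute force: at the first uncovered cell, either place a 1 x 1 tile or, when it fits, a 2 x 2 tile, and recurse. A minimal sketch, holding the floor as a boolean matrix of covered cells:

    public static int countBruteForce(boolean[][] covered, int m, int n) {
        // find the first uncovered cell in row-major order
        int r = -1, c = -1;
        outer:
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                if (!covered[i][j]) { r = i; c = j; break outer; }
            }
        }
        if (r < 0) return 1; // everything covered: one complete arrangement
        int count = 0;
        // option 1: a small tile on this cell
        covered[r][c] = true;
        count += countBruteForce(covered, m, n);
        // option 2: a big tile with its top-left corner on this cell
        if (r + 1 < m && c + 1 < n && !covered[r][c + 1] && !covered[r + 1][c] && !covered[r + 1][c + 1]) {
            covered[r][c + 1] = covered[r + 1][c] = covered[r + 1][c + 1] = true;
            count += countBruteForce(covered, m, n);
            covered[r][c + 1] = covered[r + 1][c] = covered[r + 1][c + 1] = false;
        }
        covered[r][c] = false;
        return count;
    }

For example, countBruteForce(new boolean[3][4], 3, 4) returns 11, matching the sample above.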

Scan the floor row by row and encode each horizontal line between rows as a bitmask in which a set bit in column j means a big tile straddles that line in column j. Because a 2 x 2 tile spans two adjacent columns, the isValid method requires the set bits to occur in adjacent pairs, which keeps big tiles from overlapping along a line. Two consecutive lines with masks i and j are compatible only when both are valid and (i & j) == 0, since a big tile crossing one line would overlap, in the same column, a big tile crossing the next line. Writing these compatibilities as a matrix gives the transition matrix of a linear recurrence, the same building block used to compute the Fibonacci sequence, so we use matrix exponentiation to raise it to the Mth power. At the end, we look up the entry for starting and ending at mask 0, which corresponds to the requirement that the top and bottom edges of the floor do not cross any big tiles while tiles can be adjacent to them.

    // A mask is valid when its set bits occur in adjacent pairs, i.e. every
    // big tile straddling this line occupies two neighboring columns.
    public static boolean isValid(int x) {
        boolean valid = true;
        while (x > 0){
            if ((x & 1) == 1) {
                valid = !valid;
            } else if (!valid) {
                return false;
            }
            x >>= 1;
        }
        return valid;
    }
    // Build the transition matrix: A[i][j] = 1 when masks i and j are both
    // valid and disjoint, so they can occur on consecutive lines.
    public static int[][] getMatrix(int column) {
        int length = (int)Math.pow(2,column);
        int[][] A = new int[length][length];
        for (int i = 0; i < length; i++) {
            if (isValid(i)) {
                for (int j = 0; j < length; j++){
                    if (isValid(j) && (i & j) == 0) {
                        A[i][j] = 1;
                    }
                }
            }
        }
        return A;
    }
    public static int[][] identity(int size) {
        int[][] I = new int[size][size];
        for (int i = 0; i < size; i++) {
            I[i][i] = 1;
        }
        return I;
    }
    public static int[][] matrixMultiplication(int[][] A, int[][] B, int modulo) {
        int[][] C = new int[A.length][A.length];
        for (int i = 0; i < A.length; i++) {
            for (int j = 0; j < A.length; j++) {
                // accumulate in a long: the product of two ints near the modulus overflows int
                long value = 0;
                for (int k = 0; k < A.length; k++) {
                    value = (value + (long)(A[i][k] % modulo) * (B[k][j] % modulo)) % modulo;
                }
                C[i][j] = (int) value;
            }
        }
        return C;
    }
    // square-and-multiply: computes A^row under the modulus
    public static int[][] matrixExponentiation(int[][] A, int row, int modulo) {
        int [][] B = identity(A.length);
        while (row > 0) {
            if ((row & 1) == 1) {
                B = matrixMultiplication(A, B, modulo);
            }
            A = matrixMultiplication(A, A, modulo);
            row >>= 1;
        }
        return B;
    }
    // A^row [0][0] counts the tilings: walks of length row that start and
    // end at mask 0, i.e. no big tile crosses the top or bottom edge.
    public static int getFillCountFromMatrix(int row, int col) {
        int[][] B = matrixExponentiation(getMatrix(col), row, 10000007);
        return B[0][0];
    }
getFillCountFromMatrix(3,4) => 11
getFillCountFromMatrix(4,4) => 35

Tuesday, May 11, 2021

DBScan vs K-means

 

Clustering algorithms:

A clustering algorithm identifies the relationships in a dataset and then generates clusters. The model specified by the user identifies the relationships, so no predictable column is required. Taking the example of chip buyers, we can see that the kids form a separate cluster from the others; further splitting the clusters into age groups yields smaller clusters. The clustering behavior can be tuned with parameters such as the maximum number of clusters or the amount of support required to create a cluster.
Data for clustering usually has a single key column, one or more input columns, and possibly predictable columns. Some tools also ship with a cluster viewer that shows the clusters in a diagram.

A partitioning clustering algorithm partitions the data into k groups such that some criterion evaluating the clustering quality is optimized, where k is specified by the user. A hierarchical clustering algorithm generates a sequence of partitions of the records: starting with a partition in which each cluster consists of a single record, the algorithm merges two clusters in each step until a single partition remains at the end. DBScan and K-means both produce flat partitions of the data, but they differ in how the clusters are formed.

K-Means is based on a distance or similarity measure, and the clusters are formed to minimize the sum of squared errors. The number of clusters, k, is pre-defined by the user. Clustering involves grouping data points together based on their measures; when we cluster based on a similarity measure, we are looking at the similarity of a vector to the cluster centers. The clustering begins with a finite set of clusters, and the centroids of these clusters are adjusted as data points are added to them. This kind of algorithm is preferred for sparse data where we can determine the number of partitions we want to make.
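A minimal sketch of this procedure in Java follows; seeding the centroids from random data points and running a fixed number of iterations are simplifying assumptions.

    public static int[] kMeans(double[][] points, int k, int iterations, long seed) {
        java.util.Random rnd = new java.util.Random(seed);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[rnd.nextInt(points.length)].clone(); // seed from the data
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < iterations; iter++) {
            // assignment step: attach each point to its nearest centroid
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int d = 0; d < points[p].length; d++) {
                        double diff = points[p][d] - centroids[c][d];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                assignment[p] = best;
            }
            // update step: move each centroid to the mean of its points
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < points[p].length; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave an empty cluster's centroid in place
                for (int d = 0; d < sums[c].length; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }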

DBScan is a simple and effective density-based clustering algorithm that illustrates several important concepts. We first consider the notion of density. There are several methods to measure density, and they are distinct from measuring similarity. In the center-based approach, density is estimated for a particular point by counting the number of points within a certain radius, including the point itself. This method is simple, but the density measured this way depends on the radius: if the radius is large enough, all points will have a density of m, the number of points in the dataset; if the radius is too small, all points will have a density of 1.
By this approach, a point will either lie in the core of a density-based cluster, lie on the border between the neighborhoods of several core points, or be an outlier that is neither part of a cluster nor in the neighborhood of any core point. Thus, points in this approach are classified as core points, border points, and noise points.
The algorithm proceeds as follows. Any two core points that are close enough, i.e. within the radius of one another, are put in the same cluster. Any border point that is close enough to a core point is included in the same cluster as that core point. If there are ties between two or more clusters, they are resolved by choosing the closest one. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label the points as core, border, and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make a cluster out of each connected group of core points.
5. Assign each border point to one of the clusters of its associated core points.
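A minimal sketch of these steps in Java follows; eps (the radius) and minPts (the minimum neighborhood size that makes a core point) are the conventional parameter names rather than the article's. Noise points are labeled -1.

    public static int[] dbscan(double[][] points, double eps, int minPts) {
        int[] label = new int[points.length]; // 0 = unvisited, -1 = noise, >0 = cluster id
        int clusterId = 0;
        for (int p = 0; p < points.length; p++) {
            if (label[p] != 0) continue;
            java.util.List<Integer> neighbors = regionQuery(points, p, eps);
            if (neighbors.size() < minPts) { label[p] = -1; continue; } // tentatively noise
            clusterId++;
            label[p] = clusterId;
            // grow the cluster outward from this core point
            for (int i = 0; i < neighbors.size(); i++) {
                int q = neighbors.get(i);
                if (label[q] == -1) label[q] = clusterId; // border point reclaimed from noise
                if (label[q] != 0) continue;
                label[q] = clusterId;
                java.util.List<Integer> qNeighbors = regionQuery(points, q, eps);
                if (qNeighbors.size() >= minPts) neighbors.addAll(qNeighbors); // q is core too
            }
        }
        return label;
    }

    // linear-scan neighborhood query: the O(m^2) step; a spatial index reduces it to O(m log m)
    private static java.util.List<Integer> regionQuery(double[][] points, int p, double eps) {
        java.util.List<Integer> result = new java.util.ArrayList<>();
        for (int q = 0; q < points.length; q++) {
            double dist = 0;
            for (int d = 0; d < points[p].length; d++) {
                double diff = points[p][d] - points[q][d];
                dist += diff * diff;
            }
            if (dist <= eps * eps) result.add(q);
        }
        return result;
    }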
For each point, we find the points that are within its neighborhood, so the complexity of this algorithm is O(m^2). However, there are data structures that can reduce the cost of retrieving the points adjacent to a given center to O(m log m). The space complexity is linear because we only persist a small amount of metadata for each point, namely the cluster label and its classification as a core, border, or noise point.
The parameters for this algorithm are the radius and the minimum number of neighboring points required to distinguish core points from border points.
This algorithm has trouble when the density of the clusters varies widely. It also has difficulty with high-dimensional data, but it is one of the simplest ways of clustering.