Monday, May 10, 2021

Transfer learning in text summarization.

Problem statement: Transfer learning is a machine learning technique in which a model trained on one task is repurposed for a second, related task. In deep learning, it works only when the features the model learned from the first task are general enough to suit both the base and the target tasks. This form of transfer is also called inductive transfer: the scope of allowed hypotheses is narrowed in a beneficial way because the model is fit on a different but related task. This focus is especially efficient when data is sparse or expensive. This article describes how text summaries can be generated with transfer learning.


Solution: spaCy is a newer natural language processing library than many of its predecessors, some of which were written in Java. It is written in Python, provides word vectors, and performs tokenization, sentence boundary detection, part-of-speech tagging, syntactic parsing, text alignment, and named-entity recognition reliably, accurately, and in a production-ready manner. It begins by preparing a language object with a pipeline, where the pipeline carries the configuration needed to load the model data and weights. There are several prepackaged pipelines to choose from, and each returns a language object that is ready to try on a given text sample. This pre-trained model approach is an example of transfer learning: the model is used as a starting point on the given text, and it can also be tuned with different configuration options.
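As a minimal sketch, the language object and its pipeline can be prepared as below. A blank English pipeline with a rule-based sentencizer is used here so the example runs without downloading the weights of a packaged pipeline such as en_core_web_sm:

```python
# Sketch: prepare a spaCy language object with a pipeline.
# spacy.blank("en") avoids downloading pre-trained weights; a packaged
# pipeline would be loaded with spacy.load("en_core_web_sm") instead.
import spacy

nlp = spacy.blank("en")          # language object with an empty pipeline
nlp.add_pipe("sentencizer")      # rule-based sentence boundary detection

doc = nlp("spaCy prepares a language object. Each pipeline component annotates the Doc.")
sentences = [sent.text for sent in doc.sents]
print(len(sentences))            # 2
```

Each component added to the pipeline annotates the shared Doc object, which is what makes the pre-trained pipelines reusable as a starting point.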


Word embeddings are already part of these pre-trained models. An embedding is a mapping of words to a high-dimensional continuous vector space in which different words with a similar latent meaning have similar vector representations. These are typically learned by neural networks run against a large corpus of text; the more training data, the more the model learns. When a pre-trained model is run against the target data, it tends to show a higher start, a higher slope, and a higher asymptote. The start refers to the initial skill of the model before any learning from the target dataset: a pre-trained model has a non-zero start compared to a model starting from scratch. The rate of learning is also higher for the pre-trained model because of the narrowed hypothesis space, and convergence on the target dataset is better than for a control model because of the improved learning. The only caution is that these benefits are general observations across pre-trained models; they do not mean that a particular model is the right choice for a given dataset. There may be alternatives to choose from, and the training regime and underlying corpus matter to the results.
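The "similar meaning, similar vector" property is usually measured with cosine similarity. The toy 4-dimensional vectors below are made up purely for illustration; real embeddings (for example, spaCy's token.vector) have hundreds of dimensions:

```python
# Sketch: cosine similarity over made-up word vectors.
# The vectors are hypothetical stand-ins for real embeddings.
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king  = np.array([0.9, 0.8, 0.1, 0.0])
queen = np.array([0.8, 0.9, 0.2, 0.1])
apple = np.array([0.1, 0.0, 0.9, 0.8])

# Words with similar latent meaning end up closer in the vector space.
print(cosine_similarity(king, queen) > cosine_similarity(king, apple))  # True
```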


Transformers also comes with such pre-trained models. While spaCy is a general natural language processing library, Transformers is geared toward natural language understanding. It comes with several general-purpose models, including BERT and its variants, and can be used in conjunction with spaCy (for example, through the spacy-transformers integration), so a single pipeline can encapsulate both.


Conclusion: Transfer learning is not just a convenience. For sparse datasets it can be a requirement, and it shortens the path to useful predictions.

 

 

Sunday, May 9, 2021

 A comparison of libraries in Java for machine learning applications: 

 

Problem statement: Machine learning algorithms are computation-intensive and usually involve heavy matrix operations. Statistical and numerical analysis were traditionally developed with interpreted environments such as MATLAB and Python. These ecosystems grew over time to include many models and algorithms, and data science made them popular through libraries such as NumPy, spaCy, and Transformers. Machine learning applications continue to evolve with Python as the preferred language. This article investigates the gap between what is available in Python and what is available in Java for the common tasks used in machine learning. We use an example from text analysis to implement services in the Java language.

Description: The example taken for writing a service in Java performs extractive summarization of text. It requires a natural language processing algorithm called BERT that can generate embeddings: matrices of doubles that represent the sequences numerically. The algorithm is part of a pre-trained model that must be loaded before the embeddings are requested, and the embeddings help score and evaluate the sequences into which the text is shredded. Given this example, we have requirements for: a natural language library that represents the model and can load it, such as transformers and spacy; a machine learning library that performs clustering, decomposition, and other analysis, such as sklearn; a library providing the data structures for the sequences and vectors from which the embeddings are derived, such as torch tensors; and lastly, operations over collections, which may not be part of the programming language, and structures for holding intermediary data and computations, such as numpy.

We see that this application has just a handful of requirements, and they are well served by existing libraries in the Python language such as transformers, spacy, sklearn, torch, and numpy. The example we took has very little customization and performs only scoring over the results from the model to produce the output. This lets us look for alternatives to the mentioned libraries in the Java programming language. The natural language processing library relies on an algorithm that has been implemented in both Python and Java. However, the application would do well with a higher-level library, since much of the processing is common across applications. Transformers serves this purpose, but there is no equivalent Java library. There is ML Kit from Firebase for Android, and there is a gallery of custom implementations, but none serves this purpose. In contrast, the sklearn library, which provides computations for clustering, decompositions, and mixtures, has an equivalent in the “Deep Java Library” (DJL). The torch library is required for vectorization, and its tensor abstraction serves this purpose very well. These libraries are also well documented and easy to use, requiring only minimal conversions to a format on which they can perform the said operations. Numpy does not have an equivalent Java library per se, but much of its functionality can be implemented for specific purposes with out-of-the-box primitives from the Java programming language.
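The Python side of the pipeline described above can be sketched end to end. The sentence embeddings below are random stand-ins for BERT output (in a real service they would come from a pre-trained model via transformers), so only the clustering-and-scoring shape of the approach is shown:

```python
# Sketch: extractive summarization by clustering sentence embeddings.
# Embeddings here are random stand-ins for real BERT output.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
sentences = ["Sentence %d." % i for i in range(10)]
embeddings = rng.normal(size=(10, 16))   # 10 sentences, 16-dim stand-in vectors

k = 3                                    # number of sentences in the summary
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

# Score each cluster: keep the sentence closest to the cluster centroid.
summary_idx = []
for c in range(k):
    members = np.where(kmeans.labels_ == c)[0]
    center = kmeans.cluster_centers_[c]
    closest = members[np.argmin(np.linalg.norm(embeddings[members] - center, axis=1))]
    summary_idx.append(int(closest))

summary = [sentences[i] for i in sorted(summary_idx)]
print(len(summary))  # 3
```

A Java port of this sketch would need equivalents for the model loading, the clustering, and the array arithmetic, which is exactly the gap the article surveys.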

Conclusion: We see that most of the gap is in a higher-level natural language processing library for Java. All other operations are doable with existing implementations.

 

Saturday, May 8, 2021

Find the n-th Fibonacci number using matrix exponentiation:

We start with the matrix F = [[1, 1], [1, 0]] and compute the matrix power F^n by repeated squaring:

    public static int[][] identity(int size) {
        int[][] I = new int[size][size];
        for (int i = 0; i < size; i++) {
            I[i][i] = 1;
        }
        return I;
    }

    public static int[][] matrixMultiplication(int[][] A, int[][] B, int modulo) {
        int[][] C = new int[A.length][A.length];
        for (int i = 0; i < A.length; i++) {
            for (int j = 0; j < A.length; j++) {
                long value = 0; // long avoids overflow of the intermediate product
                for (int k = 0; k < A.length; k++) {
                    value = (value + (long) (A[i][k] % modulo) * (B[k][j] % modulo)) % modulo;
                }
                C[i][j] = (int) value;
            }
        }
        return C;
    }

    public static int[][] matrixExponentiation(int[][] A, int exponent, int modulo) {
        int[][] B = identity(A.length); // accumulates the result
        while (exponent > 0) {
            if ((exponent & 1) == 1) {  // odd exponent: multiply the result in
                B = matrixMultiplication(A, B, modulo);
            }
            A = matrixMultiplication(A, A, modulo); // square the base
            exponent >>= 1;
        }
        return B;
    }

 

int[][] result = matrixExponentiation(F, n, 1000007);

With the convention F(0) = F(1) = 1, the n-th Fibonacci number (modulo 1000007) is result[0][0].

For n = 4, result = [[5, 3], [3, 2]], so the 4th Fibonacci number is 5 (the sequence being 1, 1, 2, 3, 5).


Friday, May 7, 2021

 

BeeWare mobile application development lessons learnt

Lessons Learnt: BeeWare is a mobile application development framework with a strong emphasis on the “write once, run everywhere” mode of writing applications. It leverages the Python language for writing applications and cross-compiles application logic for the targeted platform, such as Android and iOS devices. The framework supports pythonnet so that applications can be run and tested on the Windows desktop without targeting any platform, which enables independent testing. It supports Briefcase, a tool for converting Python projects into standalone native applications; Briefcase scaffolds boilerplate code for application bootstrap as well as building, running, and packaging the application into native formats for targeted platforms. The framework supports Toga for user-interface development, with the ability to compose an interface from cross-platform widgets. These widgets provide basic functionality so that forms can be written and user input can be collected and validated. The framework supports Rubicon, which allows the application to be built for Android and iOS, with a version of Rubicon for each platform. This article goes through some of the idiosyncrasies involved in using this framework and the lessons learnt.

Details: BeeWare offers the convenience of writing applications in Python, which is also machine-learning friendly given the well-known libraries available in the language. These applications are written with dependencies on several modules, which are imported into the application logic. When the BeeWare build command is run, the dependencies are also imported into the built application if they are specified as required in the project file at the base directory of the project. Even machine learning libraries can be imported, built, and included in the application package. However, the Python modules must conform to well-known formats of dependency publication so that BeeWare can include them; proprietary and customized modules might not be included by default. When BeeWare includes the modules, both the compiled forms of the Python code and the built binary for the targeted platform appear under the project file layout. When the application is run on the desktop, these dependencies are loaded from the project layout, although dependencies found outside the application layout will also be resolved. When the application is run on an emulator with debug output visible, the console will show an exception stack if a module is missing. The emulator provides a kind of sandbox, so dependencies from outside the application package cannot be loaded. These are some of the ways this issue is detected and resolved.
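As a sketch, dependencies are declared in the `requires` list of the Briefcase section of the project's pyproject.toml; the app name `helloworld` and the package entries below are hypothetical placeholders:

```toml
# Hypothetical Briefcase project-file fragment (pyproject.toml).
# The app name and the listed packages are placeholders for illustration.
[tool.briefcase.app.helloworld]
formal_name = "Hello World"
sources = ["src/helloworld"]
requires = [
    "numpy",        # third-party dependencies listed here are bundled
    "toga>=0.3.0",  # into the built application package
]
```

Packages not listed here may still resolve on the desktop from the surrounding environment, which is why a missing entry often only surfaces as an exception on the emulator.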

One of the considerations in cross-compiling application logic is that the dependencies are not only those of the application but also those of the framework and of the targeted platform. These layers of dependencies require versions, and version compatibility can quickly get out of hand even for modules that appear within the project layout after a build. A common resolution to this dependency mismatch is to roll back the Python version to an older one or, in some cases, to apply overrides. When applications are developed on Windows, there is an additional dependency on Visual Studio compiler binaries and the Windows .NET runtime. Unfortunately, there is no panacea for dependency conflict resolution, and it must be tackled on a case-by-case basis; a significant fraction of the application development time goes into this. One common error encountered is "Error while dexing".

Another important consideration is that machine learning models take input data and make predictions. They are computation-heavy, and a model must be trained with a lot of data before it can predict with high precision and recall. This calls for models to be developed on the desktop or on devices with ample resources. The model can be created and trained elsewhere and then packaged and run with the mobile application and the emulator, which makes it easy to write machine learning applications with BeeWare. The flip side is that development of user-interface-heavy applications is somewhat limited. The Toga library provides a basic set of widgets and containers, which makes it hard to use custom or imported widgets. Even the provided widgets are not all universally usable: some cannot be used in Android applications, while others cannot be used on iOS devices. The documentation for each widget includes callouts for these limitations, but a widget missing on a targeted platform has no workaround without significant investment. The bright side is that machine learning application development on BeeWare requires only a form with minimal widgets to capture manual entry of the data on which predictions are made; in such cases, the library is sufficient.

Conclusion: Among the lessons learnt, the callouts are for budgeting time for setting up the framework as well as for investigating the deployment and smooth running of the application. The ability to write application logic with minimal code appears unmatched with this framework.

 

Thursday, May 6, 2021

Introduction: Mobile applications can bring machine learning models closer to end-users so that the models can run against local data. They are also efficient because the model has already been trained and does not require large datasets where it runs. There are a few frameworks that allow machine learning models to be ported to mobile devices, including TensorFlow Lite and the BeeWare development framework. Both frameworks allow Python-based development and make it easy to use the machine learning libraries available in that language. The differences between TensorFlow Lite and BeeWare are called out below.

Description: BeeWare is a write-once, run-everywhere framework that works very well for writing the business logic once, irrespective of the platform targeted for the mobile application. Popular platforms include Android and iOS: the former requires Java bytecode, and the latter is written with Objective-C. BeeWare allows the Python code to be reinterpreted for Java so that the logic runs natively on the Android platform. Similarly, the conversion for the iOS platform is performed during the build, and a suitable installer binary is generated during the packaging stage. This gives developers the opportunity to write little or no platform code and focus entirely on the business logic. When a machine learning model is used, this logic usually makes a prediction against data in real time.

TensorFlow is a dedicated machine learning framework for authoring models. TensorFlow makes it easy to construct a model for mobile applications using the TensorFlow Lite Model Maker. The model can make predictions only after it is trained; in this case, the model must be run after the training data has labels assigned, which might be done by hand. The model works better with fewer parameters. The model is compiled with an optimizer such as tf.keras.optimizers.Adam (tf.train.AdamOptimizer in TensorFlow 1.x) and a loss function, and a metric such as top-k categorical accuracy helps tune the model. The summary of the model can be printed for viewing. With a set of epochs and batches, the model training can be controlled. Annotations help the TensorFlow Lite converter fuse TF.Text APIs, and this fusion leads to a significant speedup over conventional models. The architecture of the model can also be tweaked to include a projection layer along with the usual convolutional layer and attention-encoder mechanism, which achieves similar accuracy with a much smaller model size. There is native support for HashTables for NLP models.
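The compile-and-train loop above can be sketched with a tiny Keras model. This assumes TensorFlow 2.x, and the layer sizes, the random training data, and the choice of k = 2 are all illustrative placeholders:

```python
# Sketch: compile a small model with an Adam optimizer, a loss function,
# and a top-k accuracy metric, then control training with epochs/batches.
# Layer sizes and the random data are placeholders for illustration.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4),  # 4 classes, raw logits
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    # counts a prediction as correct if the label is within the top 2 scores
    metrics=[tf.keras.metrics.SparseTopKCategoricalAccuracy(k=2)],
)
model.summary()  # print the model for viewing

# Epochs and batch size control the training loop.
x = np.random.rand(32, 8).astype("float32")
y = np.random.randint(0, 4, size=(32,))
history = model.fit(x, y, epochs=2, batch_size=8, verbose=0)
```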

On the other hand, all the steps of building and testing a BeeWare application can be performed as if it were written for the desktop. Packaging the binaries creates a redistributable, which can be tested with a suitable emulator. When the emulator shows launch failures, there might be nothing to see on the emulator itself, but the debug console on certain frameworks provides additional details. The proper SDK and debug symbols must be provided to such a framework for use with the package on the emulator. A debug build of the package is better for diagnosis than a release build. Switching the framework to load and run a simulator allows more visibility into the execution of the application on the targeted platform.

Differences in application behavior between the desktop and the emulator can be attributed to application lifecycle routines on the targeted platform. These can be exercised on the emulator once all the dependencies and their versions have been resolved for a successful launch.

#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWzGb-l7RO6fnpMcTH?e=g72pCx

Wednesday, May 5, 2021

 Write once mobile applications: 

 

Introduction: BeeWare is a framework that allows applications to be written once and deployed to different platforms, such as Android phones and Apple devices. Previously, these applications were written once for Android and then rewritten for iPhones, and expertise grew around each ecosystem while the business logic remained the same. This has now been addressed with write-once business logic that can run platform-independently. This article describes the methods involved.

Description:  

BeeWare emphasizes the “write once, deploy everywhere” mode of mobile application development. While the Android platform has become synonymous with Java development and Apple device development with Objective-C, developers struggle to port business logic to either platform. Instead, BeeWare allows the application to be written in Python and released to multiple platforms. The binaries appear native to the platform on which they are released, so the experience for the end-user is seamless; users do not need to know which language the application was written in. BeeWare is open source, and the process of creating an application takes the following steps:

  1. The prerequisites are installed. These are also affectionately called pre-beeware. On Windows machines this requires the installation of pythonnet and Briefcase. Pythonnet depends on .NET, and the corresponding wheel might have a different version requirement than the latest Python. Briefcase allows easy packaging of all code and provides a framework to run the application on the desktop. Toga, a cross-platform widget toolkit, and the Rubicon libraries for platform-specific code can also be counted as prerequisites.

  2. The first application is simply an empty placeholder. It is created and run with the briefcase dev command, which launches the application with an interface that has no business logic.

  3. Since the application is a Python class, it can be enhanced with just the business logic, without any operational concerns.

  4. The packaging for distribution is done with scaffolding and writing an installer. We leverage Briefcase for this step, which makes it a breeze. The package must be signed before it can be deployed.

  5. The application code can be updated and run in one step. The dependencies must be updated prior to the run.

  6. All the steps until this point can be performed on the desktop. The application can be built for phones by targeting that platform.

  7. Once the application has been built for that device, it can be put on the web as a website.

  8. Once the installer has been written, the application can be published with the briefcase command.

Conclusion: The application can be written once and deployed everywhere by targeting the build for those platforms and writing an installer. The setup of the environment might take several attempts due to dependencies, but writing the application is easy once the environment is up and running.