Problem statement: Machine learning algorithms
are computation-intensive tasks and usually heavy matrix operations.
Statistical and numerical analysis were traditionally developed with
interpreters such as Matlab and Python. These libraries grew over time to
include several models and algorithms. Data science made many of these popular
with the use of libraries such as NumPy, spaCy, and Transformers. Machine
learning applications continue to evolve with Python as the preferred choice of
language. This article investigates the
gap between what’s available in Python versus those available in Java for the
common tasks used in machine learning. We use an example from text analysis to
implement services in the Java language.
Description: The example taken for writing
service in Java to perform extractive summarization of text requires the use of
a natural language processing algorithm called BERT that can generate
embeddings or matrix of doubles representing probabilities of sequences. The
algorithm is part of a pre-trained model that must be loaded before the
embeddings are requested which are helpful to score and evaluate the sequences
into which the text is shredded. Given this example, we have requirements for a
natural language library that represents the model and can be loaded such as
transformers and spacy, a machine learning library that performs clustering,
decomposition, and other analysis such as sklearn, a library to hold the data structures for
representing the sequences and vectors from which the embeddings are derived
such as tensors and lastly, operations
using collections which may not be part of the programming language and
structures for holding intermediary data and computations such as numpy.
We see that this application has just a handful of
requirements and they are well-served by existing libraries in the Python
language such as transformers, spacy, sklearn, torch, and numpy. The example we
took has very little customization and performs only scoring over the results
from the model to produce the output. This lets us look for alternatives to the
mentioned libraries in the Java programming language. The natural language
processing library relies on an algorithm that has been implemented in both
Python and Java. However, the requirements from the application would do well
with a higher-level library since much of the processing is common across
applications. Transformers serve this purpose but there is no equivalent Java
library. There are ML from Firebase for Android and there is a gallery of
custom implementations, but none serve this purpose. On the contrary, sklearn library that
provides computations for clustering, decompositions, and mixtures has an
equivalent “Deep Java Library”. Torch library is required for vectorization and
Tensor serves this purpose very well. They are also well-documented as well as
easy to use requiring just the minimum conversions to a format that the library
can perform the said operations on. Numpy does not have an equivalent java
library per se, but much of the operations can be implemented with out-of-box
primitives from the Java programming language for specific purposes.
Conclusion: We see that most of the gap is in a
higher programming level library for the natural language processing in Java.
All other operations are doable with existing implementations.
No comments:
Post a Comment