Cluster computing: A comparison of libraries in Java for machine learning applications

Problem statement: Machine learning algorithms are computation-intensive and usually involve heavy matrix operations. Statistical and numerical analysis has traditionally been done in interpreted environments such as Matlab and Python, whose libraries grew over time to include many models and algorithms. Data science popularized several of these libraries, such as NumPy, spaCy, and Transformers, and machine learning applications continue to evolve with Python as the preferred language. This article investigates the gap between what is available in Python and what is available in Java for the common tasks in machine learning. We use an example from text analysis, implemented as a service in Java.

Description: The example is a Java service that performs extractive summarization of text. It relies on a natural language processing algorithm called BERT, which generates embeddings, that is, matrices of doubles that represent the sequences. The algorithm is part of a pre-trained model that must be loaded before embeddings can be requested. The embeddings are then used to score and evaluate the sequences into which the text is split. Given this example, we need four things: a natural language library that represents the model and can load it, such as transformers and spacy; a machine learning library that performs clustering, decomposition, and other analysis, such as sklearn; a library that holds the data structures representing the sequences and the vectors from which the embeddings are derived, such as tensors; and lastly, collection operations that may not be part of the programming language, together with structures for holding intermediate data and computations, such as numpy.
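The scoring step described above can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the service's actual method: the class CentroidSummarizer, its method names, and the hard-coded embedding values are all hypothetical, and ranking sentences by cosine similarity to the centroid of all embeddings is one simple scoring choice swapped in for illustration. In a real service the embeddings would come from the loaded BERT model rather than from literals.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

/** Hypothetical sketch: rank sentences by cosine similarity to the embedding centroid. */
final class CentroidSummarizer {

    /** Column-wise mean of the embedding matrix (one row per sentence). */
    static double[] centroid(double[][] embeddings) {
        int dim = embeddings[0].length;
        double[] mean = new double[dim];
        for (double[] row : embeddings) {
            for (int j = 0; j < dim; j++) {
                mean[j] += row[j] / embeddings.length;
            }
        }
        return mean;
    }

    /** Cosine similarity of two vectors; returns 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the k sentences closest to the centroid, in their original order. */
    static List<String> summarize(String[] sentences, double[][] embeddings, int k) {
        double[] center = centroid(embeddings);
        int[] top = IntStream.range(0, sentences.length)
                .boxed()
                // Highest similarity first.
                .sorted(Comparator.comparingDouble((Integer i) -> -cosine(embeddings[i], center)))
                .limit(k)
                .mapToInt(Integer::intValue)
                .sorted() // restore document order
                .toArray();
        List<String> summary = new ArrayList<>();
        for (int i : top) {
            summary.add(sentences[i]);
        }
        return summary;
    }

    public static void main(String[] args) {
        String[] sentences = {"First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."};
        // Placeholder values only: these are NOT real BERT embeddings.
        double[][] embeddings = {{1, 0}, {0.7, 0.7}, {0, 1}, {0.6, 0.8}};
        System.out.println(summarize(sentences, embeddings, 2));
    }
}
```

Keeping the selected sentences in document order is a design choice for readability of the summary; the ranking itself only decides which sentences are kept.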

This application has just a handful of requirements, and they are well served by existing Python libraries: transformers, spacy, sklearn, torch, and numpy. The example involves very little customization and only scores the results from the model to produce the output, which lets us look for alternatives to these libraries in Java. The natural language processing library relies on an algorithm that has been implemented in both Python and Java. However, the application would be better served by a higher-level library, since much of the processing is common across applications. Transformers serves this purpose in Python, but there is no equivalent Java library. There is ML from Firebase for Android, and there is a gallery of custom implementations, but none serves this purpose. By contrast, the sklearn library, which provides computations for clustering, decompositions, and mixtures, has a counterpart in the “Deep Java Library”. The torch library is required for vectorization, and Tensor serves this purpose very well. These libraries are well documented and easy to use, requiring only minimal conversion of the data into a format on which the library can perform the operations. Numpy does not have an equivalent Java library per se, but many of its operations can be implemented for specific purposes with out-of-the-box primitives from the Java programming language.
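To illustrate the point about numpy, a few of its common operations can be written with nothing but Java arrays and the standard library. The following is a minimal sketch: the class ArrayOps and its method names are hypothetical, the inputs are assumed to be rectangular and non-empty, and only a small fraction of what numpy offers is covered.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.IntStream;

/** Hypothetical helper class: a few NumPy-style operations on plain Java arrays. */
final class ArrayOps {

    /** Matrix product of a (n x p) and b (p x m), analogous to numpy.matmul on 2-D arrays. */
    static double[][] matmul(double[][] a, double[][] b) {
        if (a[0].length != b.length) {
            throw new IllegalArgumentException("inner dimensions do not match");
        }
        int n = a.length, p = b.length, m = b[0].length;
        double[][] out = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < p; k++) {
                for (int j = 0; j < m; j++) {
                    out[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return out;
    }

    /** Swaps rows and columns, analogous to numpy.transpose on a 2-D array. */
    static double[][] transpose(double[][] a) {
        double[][] out = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                out[j][i] = a[i][j];
            }
        }
        return out;
    }

    /** Element-wise sum of two equal-length vectors, analogous to a + b in numpy. */
    static double[] add(double[] a, double[] b) {
        return IntStream.range(0, a.length).mapToDouble(i -> a[i] + b[i]).toArray();
    }

    /** Indices that order the values from largest to smallest. */
    static int[] argsortDescending(double[] values) {
        return IntStream.range(0, values.length)
                .boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> values[i]).reversed())
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(matmul(a, b)));
        System.out.println(Arrays.deepToString(transpose(a)));
        System.out.println(Arrays.toString(add(a[0], a[1])));
        System.out.println(Arrays.toString(argsortDescending(new double[] {0.2, 0.9, 0.5})));
    }
}
```

Helpers like these are adequate for small, specific computations such as scoring; they make no attempt at the broadcasting, slicing, or performance that numpy provides.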

Conclusion: Most of the gap is the lack of a higher-level library for natural language processing in Java. All other operations are doable with existing implementations.

Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWzHzJoGjFeZdbJ8J3?e=Iex4dk


Thursday, May 13, 2021

