Tuesday, May 4, 2021

 Writing a Textract service for Azure 

 

Problem statement: Data is not always readily available in the form of plain text. Such data must be interpreted from documents, slides, spreadsheets, and the other file formats used for collaboration, including web pages. Text, in these cases, is enclosed within XML or HTML attributes. This is not a problem for end-users who see the text on the rendered page, but it must be extracted from such markup programmatically. This is where the textract library comes in handy. It provides a single interface to extract text from any file type, hence its name. The library is available in Python; on the Java side, the closest equivalent comes from the Amazon SDK. There is no equivalent library or service on the Azure side. Textract is also known for screen scraping of text via Optical Character Recognition. This article explores the use case of calling Textract from Java.
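Before reaching for a service, the markup-stripping part of the problem can be handled with the standard library alone. The sketch below is illustrative (the class and function names are made up for this example); it walks an HTML document and keeps only the text nodes, skipping script and style content:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes from markup, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(markup: str) -> str:
    """Return the visible text of an HTML fragment, joined with spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    return " ".join(parser.parts)
```

This covers the common case of text hidden inside markup; libraries like textract go further by supporting binary formats such as PDF and DOCX.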

Solution: A sample program to use this library involves something like: 
 

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpClient.Redirect;
import java.net.http.HttpClient.Version;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.textract.AmazonTextract;
import com.amazonaws.services.textract.AmazonTextractClientBuilder;
import com.amazonaws.services.textract.model.DetectDocumentTextRequest;
import com.amazonaws.services.textract.model.DetectDocumentTextResult;
import com.amazonaws.services.textract.model.Document;

public static String textract(String url) {
    String text = "";
    try {
        // Fetch the source document over HTTP
        HttpClient client = HttpClient.newBuilder()
            .version(Version.HTTP_1_1)
            .followRedirects(Redirect.NORMAL)
            .build();
        HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        text = response.body();

        // Arrange: build a Textract client against the regional endpoint
        EndpointConfiguration endpoint = new EndpointConfiguration(
            "https://textract.us-east-1.amazonaws.com", "us-east-1");
        AmazonTextract tclient = AmazonTextractClientBuilder.standard()
            .withEndpointConfiguration(endpoint).build();

        // Act: submit the bytes for synchronous text detection
        DetectDocumentTextRequest drequest = new DetectDocumentTextRequest()
            .withDocument(new Document().withBytes(
                ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8))));
        DetectDocumentTextResult result = tclient.detectDocumentText(drequest);

        // Assert: concatenate the text from the detected blocks
        if (result != null && result.getBlocks() != null && !result.getBlocks().isEmpty()) {
            StringBuilder sb = new StringBuilder();
            result.getBlocks().forEach(x -> sb.append(x.getText()));
            text = sb.toString();
        }
    } catch (Exception e) {
        System.out.println(e);
    }
    return text;
}

An UnsupportedDocumentException is thrown if the input document is not in JPEG or PNG format. Asynchronous calls can also process PDF documents.

Equivalent code using the Python textract library for text detection from markup would look something like this:

# some python file 

import textract 

text = textract.process("path/to/file.extension") 

Java Textract requires a Document, which can be built from S3 objects in addition to ByteBuffers.

 

 

 

 

Monday, May 3, 2021

Mobile application data management.

We continue with our discussion of mobile application data from our previous article here. Not all data must be accessed from the enterprise data stores. Some data remains local to the mobile applications. This includes personal information management and mobile device management. The former is all about the end-user's data, such as her email, calendar, task lists, address books, and notepads. These might sync up with data from enterprise data stores, which does not mean that they do not store local data. In fact, the emphasis is on personal information, and the sync is also referred to as the PIM sync.

Local data is limited by size but not by its lifetime. Applications may keep local data for the duration that the application is installed, and even beyond. It is critical that this data is managed with the same caution as the data on the enterprise servers. Security is somewhat more involved on the enterprise server, and it is somewhat more challenging to secure the local data on mobile devices, but best practices like encryption can still be enforced. The size of local data used to be limited, but newer devices have increased capacities to the order of gigabytes. Mobile applications certainly have the flexibility to reduce their storage footprint, but it is not as much a priority as saving and persisting all that is relevant to the end-user, including some application state and statistics. Fast data structures like skip lists can enable faster computation and iteration over the data stored on mobile devices.
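To make the skip-list mention concrete, here is a minimal sketch of one (the class names and the 0.5 level-promotion probability are illustrative choices, not from the article). A skip list keeps sorted keys in layered linked lists so that search and insert run in expected O(log n):

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # next pointer per level

class SkipList:
    MAX_LEVEL = 8

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel head node
        self.level = 0

    def _random_level(self):
        # Promote a node to the next level with probability 1/2.
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [None] * (self.MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node  # last node before the insert point per level
        lvl = self._random_level()
        if lvl > self.level:
            for i in range(self.level + 1, lvl + 1):
                update[i] = self.head
            self.level = lvl
        new = Node(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def search(self, key):
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```

The express-lane levels are what make iteration and lookup cheap on constrained devices, at the cost of extra forward pointers per node.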

The latter, storage for mobile device management, is required because a growing number of companies are asking their IT departments to keep end-users' devices up to date and in working condition. While end-users can certainly take action themselves on their devices and not depend on the IT department, the point here is the economy of scale and the convenience of automation. Both the deployment and management of mobile devices and applications fall under IT's responsibility. Companies might ship applications to the application store for mobile device end-users to download and install on their devices. Such applications may even allow the packaging, export, and backup of mobile local data for use with another device or at a later point in time. These applications are usually lightweight and dedicated to a single purpose. The IT department always has the luxury of publishing more than one application to the application store, each of which may access some of the device's local data.

#codingexercise: TextractAzure.docx



Sunday, May 2, 2021

Special purpose algorithms


1.       Speech recognition - This is used to convert voice to text. It requires an analog-to-digital converter. The conversion can be visualized in a graph known as a spectrogram. A sound wave is captured, and the amplitude-over-time plot is drawn in units of decibels. The wave is then cut into slices and stored in quantitative form. Frequency, intensity, and time are required to plot the spectrogram, and the audio is divided into frames of 20 to 40 milliseconds. Phonemes are used to disambiguate the sound. Variations in accents are measured as allophones.
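The framing-and-frequency step can be sketched with the standard library alone. The function names and the 25 ms default below are illustrative; a direct DFT stands in for the FFT a real pipeline would use:

```python
import cmath

def frames(samples, rate, frame_ms=25):
    """Slice a signal into fixed-length frames (20-40 ms is typical)."""
    n = int(rate * frame_ms / 1000)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

def dft_magnitudes(frame):
    """Magnitude spectrum of one frame via a direct DFT (O(n^2), for illustration)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]
```

Stacking the magnitude spectra of successive frames side by side, with intensity mapped to color, is exactly what a spectrogram plot shows.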

The Hidden Markov Model, popular alongside many neural network models, is also used in speech recognition. It comprises several layers. The first layer assigns the probability that each detected phoneme is the correct one. The second layer checks the co-occurrence of phonemes and assigns probabilities. The third layer does the same at the word level, assigning probabilities for word co-occurrence.

The model checks and rechecks all the probabilities to come up with the most likely text that was spoken.

The advantage of using neural networks is that they learn by training on data, are flexible, and can change over time. Neural networks keep improving as long as they can compare the desired state against the actual state and correct the error. They can grasp a variety of phonemes and detect the uniqueness of sounds originating from accents and emotions, which complements the Hidden Markov Model. When the output variables are arranged in a sequence or a linear chain, we get a sequence model. This is the approach taken by a Hidden Markov Model. An HMM models a sequence of observations by assuming that there is an underlying sequence of states. Each state depends only on the previous state and is independent of all its earlier ancestors. An HMM also assumes that each observation variable depends only on the current state. The model is therefore specified by three probability distributions: the distribution p(y) over initial states, the transition distribution from one state to another, and the observation distribution of an observation given the current state.
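Decoding the most likely state sequence from those three distributions is typically done with the Viterbi algorithm. A minimal sketch follows; the toy weather states and probabilities in the usage below are made up for illustration, not taken from any speech model:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for obs under an HMM
    given initial (start_p), transition (trans_p), and emission (emit_p)
    probability tables."""
    # V[t][s] = (best probability of reaching s at time t, predecessor state)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))
```

In a speech system the states would be phonemes or words and the observations acoustic frames; real decoders also work in log space to avoid underflow.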

When we include interdependence between features, we use a generative model. This is usually done in one of two ways - enhance the model to represent dependencies among the inputs or make simplifying independence assumptions. The first approach is difficult because we must maintain tractability.  The second approach can hurt performance. The difference in their behaviors is large since one is generative and the other is discriminative.

Generative means the model is based on the joint distribution of states and observations. Discriminative means it is based on the conditional distribution of the states given the observations. We can also form models based on generative-discriminative pairs.

2.       Image recognition – Some image recognition algorithms make use of pattern recognition. Pattern recognition refers to the classification or description of objects or patterns. The patterns themselves can range from characters in an image of printed text to biological waveforms. The recognition involves identifying the patterns and assigning labels for categories. We start with a set of training patterns. The main difference between pattern recognition and cluster analysis is the role of pattern class labels. In pattern recognition, we use the labels to formulate decision rules. In cluster analysis, we use them to verify the results. Pattern recognition requires extrinsic information; in cluster analysis, we use only the data.

There are two basic paradigms to classify a pattern into one of K different classes. The first is a geometric or statistical approach.  In this approach, a pattern is represented in terms of d features and the pattern features are as independent of one another as possible. Then given training patterns for each pattern class, the objective is to separate the patterns belonging to different classes.

In statistical pattern recognition, the features are assumed to have a probability density function that is conditioned on the pattern class. A pattern vector x belonging to class wj is a data point drawn from the class-conditional probability distribution P(x|wj), where j is one of the K different classes. Concepts from statistical decision theory and discriminant analysis are utilized to establish decision boundaries between the pattern classes. If the class-conditional densities are known, then Bayes decision theory gives the optimal decision rule. Since they are generally not known, a classifier is trained based on the nature of the information available, using either supervised or unsupervised learning. In supervised learning, the label on each training pattern represents the category to which the pattern belongs, and parametric or non-parametric decision rules can be used. In unsupervised learning, the density functions are estimated from the training samples alone. The categories may be known beforehand, or they may be unknown.
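The Bayes decision rule above reduces to picking the class wj that maximizes P(x|wj)·P(wj). A minimal one-dimensional sketch, assuming Gaussian class-conditional densities (the function names and training samples are illustrative):

```python
import math

def gaussian_pdf(x, mean, var):
    """Class-conditional density P(x|wj) under a Gaussian assumption."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_class(samples):
    """Estimate (mean, variance) of one class from labeled training patterns."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var

def classify(x, classes, priors):
    """Bayes rule: argmax over j of P(x|wj) * P(wj)."""
    return max(classes, key=lambda c: gaussian_pdf(x, *classes[c]) * priors[c])
```

The decision boundary falls where the weighted densities of the two classes cross; with equal priors and equal variances, that is simply the midpoint between the class means.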

When the number of pattern classes is unknown, the task becomes finding natural groupings in the data, as in cluster analysis.

Saturday, May 1, 2021

Data mining techniques

 

Introduction: Data mining techniques add insights to a database that are typically not obtainable from standard query operators, and IT operations rely largely on some form of data store or CMDB for their inventory and associated operations. OLE DB for Data Mining (OLE DB for DM) standardized data mining language primitives and became an industry standard. Prior to OLE DB, it was difficult to integrate data mining products. If one product was written using decision tree classifiers and another with support vectors and they did not share a common interface, then the application had to be rebuilt from scratch. Furthermore, the data that these products analyzed was not always in a relational database, which required data porting and transformation operations.
OLEDB for DM consolidates all these. It was designed to allow data mining client applications to consume data mining services from a wide variety of data mining software packages. Clients communicate with data mining providers via SQL.
The OLE DB for DM stack uses Data Mining Extensions (DMX), a SQL-like data mining query language, to talk to different DM providers. DMX statements can be used to create, modify, and work with different data mining models. DMX also contains several functions that can be used to retrieve statistical information. Furthermore, the data, and not just the interface, is unified: data stores such as a cube, a relational database, or miscellaneous other data sources can all be used by the data mining providers to retrieve and display statistical information.
The three main operations performed are model creation, model training and model prediction and browsing.
 Model creation A data mining model object is created just like a relational table. The model has a few input columns and one or more predictable columns, and the name of the data mining algorithm to be used when the model is later trained by the data mining provider.
Model training: The data are loaded into the model and used to train it. The data mining provider uses the algorithm specified during the creation to search for patterns. These patterns are the model content.
Model prediction and browsing: A select statement is used to consult the data mining model content in order to make model predictions and browse statistics obtained by the model.
An example of a model can be seen with a nested table for customer id, gender, age, and purchases. The purchases are associations between item_name and item_quantity, and a customer may make more than one purchase. Models can be created with attribute types such as ordered, cyclical, sequence_time, probability, variance, stdev, and support.
Model training involves loading the data into the model. The openrowset statement supports querying data from a data source through an OLE DB provider. The shape command enables loading of nested data.
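The three operations can be sketched in DMX roughly as follows. This is an illustrative outline from memory, not a tested statement: the model and column names are hypothetical, the OPENROWSET arguments are elided, and the exact clauses vary by provider.

```sql
-- Create: declare columns, a nested table, and the algorithm to use.
CREATE MINING MODEL [PurchasePrediction]
(
    [CustomerId] LONG KEY,
    [Gender]     TEXT DISCRETE,
    [Age]        LONG CONTINUOUS,
    [Purchases]  TABLE PREDICT
    (
        [ItemName]     TEXT KEY,
        [ItemQuantity] LONG CONTINUOUS
    )
)
USING Microsoft_Decision_Trees

-- Train: load data into the model; SHAPE relates the nested rows.
INSERT INTO [PurchasePrediction]
    ([CustomerId], [Gender], [Age], [Purchases]([ItemName], [ItemQuantity]))
SHAPE { OPENROWSET(...) }
APPEND ( { OPENROWSET(...) } RELATE [CustomerId] TO [CustomerId] ) AS [Purchases]

-- Predict and browse: join new cases against the trained model.
SELECT t.[CustomerId], Predict([PurchasePrediction].[Purchases])
FROM [PurchasePrediction]
PREDICTION JOIN OPENROWSET(...) AS t
ON [PurchasePrediction].[CustomerId] = t.[CustomerId]
```

The patterns found during training become the model content, which the SELECT then consults.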

When the data mining is not sufficient in terms of grouping, ranking and sorting, AI techniques are used. The Microsoft ML package provides:
fast linear for binary classification or linear regression
one class SVM for anomaly detection
fast trees for regression
fast forests for churn detection and building multiple trees
neural net for binary and multi-class classification
logistic regression for classifying sentiments from feedback

#codingexercise: Canonball.docx