Cluster computing

Tuesday, May 4, 2021

Writing a Textract service for Azure

Problem statement: Data is not always readily available as text. Such hidden data must be extracted from documents, slides, spreadsheets, web pages, and the other file formats used for collaboration. In these cases the text is enclosed within XML or HTML elements and attributes. This is not a problem for end users, who see the text on the rendered page, but a program has to extract it from the markup. This is where the textract library comes in handy. It provides a single interface to extract text from many file types, hence its name. The library is written in Python; on the Java side, the similarly named Amazon Textract service is available through the AWS SDK. There is no equivalent library or service on the Azure side. Textract is also known for screen scraping of text via Optical Character Recognition (OCR). This article explores the use case for Textract with Java libraries.

Solution: A sample program that uses this library looks something like this:

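// Imports assume Java 11 or later and the AWS SDK for Java 1.x (aws-java-sdk-textract).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpClient.Redirect;
import java.net.http.HttpClient.Version;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.textract.AmazonTextract;
import com.amazonaws.services.textract.AmazonTextractClientBuilder;
import com.amazonaws.services.textract.model.Block;
import com.amazonaws.services.textract.model.DetectDocumentTextRequest;
import com.amazonaws.services.textract.model.DetectDocumentTextResult;
import com.amazonaws.services.textract.model.Document;

// The class name is illustrative; the method is the sample itself.
public class TextractSample {

    public static String textract(String url) {
        String text = "";
        try {
            // Fetch source as raw bytes; decoding an image into a String would corrupt it
            HttpClient client = HttpClient.newBuilder()
                    .version(Version.HTTP_1_1)
                    .followRedirects(Redirect.NORMAL)
                    .build();
            HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).build();
            HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            byte[] body = response.body();
            // Keep the fetched source as the fallback if text detection fails
            text = new String(body, StandardCharsets.UTF_8);

            // Arrange
            EndpointConfiguration endpoint = new EndpointConfiguration(
                    "https://textract.us-east-1.amazonaws.com", "us-east-1");
            AmazonTextract tclient = AmazonTextractClientBuilder.standard()
                    .withEndpointConfiguration(endpoint).build();

            // Act
            DetectDocumentTextRequest drequest = new DetectDocumentTextRequest()
                    .withDocument(new Document().withBytes(ByteBuffer.wrap(body)));
            DetectDocumentTextResult result = tclient.detectDocumentText(drequest);

            // Assert
            if (result != null && result.getBlocks() != null && !result.getBlocks().isEmpty()) {
                // Keep LINE blocks only: PAGE blocks carry no text and WORD blocks repeat the lines
                text = result.getBlocks().stream()
                        .filter(b -> "LINE".equals(b.getBlockType()))
                        .map(Block::getText)
                        .collect(Collectors.joining("\n"));
            }
        } catch (Exception e) {
            System.out.println(e);
        }
        return text;
    }
}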

An UnsupportedDocumentException is thrown if the input document is not in JPEG or PNG format. Asynchronous calls, such as StartDocumentTextDetection, can also process PDF documents.
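Since the synchronous call rejects anything that is not a JPEG or PNG, a cheap local check on the leading bytes can avoid a wasted round trip. The following is a minimal sketch, not part of the original program: the class and method names are hypothetical, and it only inspects the well-known file signatures (the 8-byte PNG signature and the JPEG start-of-image marker), so it does not guarantee that the service will accept the file.

```java
import java.util.Arrays;

// Hypothetical helper: sniffs the file signature before calling the synchronous API.
final class ImageFormatCheck {

    // Every PNG file starts with this fixed 8-byte signature.
    private static final byte[] PNG_SIGNATURE = {
        (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    /** Returns true if the data begins with the PNG signature. */
    static boolean isPng(byte[] data) {
        return data != null
                && data.length >= PNG_SIGNATURE.length
                && Arrays.equals(Arrays.copyOf(data, PNG_SIGNATURE.length), PNG_SIGNATURE);
    }

    /** Returns true if the data begins with the JPEG start-of-image marker FF D8 FF. */
    static boolean isJpeg(byte[] data) {
        return data != null
                && data.length >= 3
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8
                && (data[2] & 0xFF) == 0xFF;
    }

    /** Returns true if the data looks like one of the two formats named above. */
    static boolean looksSupportedForSyncCall(byte[] data) {
        return isPng(data) || isJpeg(data);
    }
}
```

A caller could run looksSupportedForSyncCall on the fetched bytes and skip the Textract request, or switch to an asynchronous call, when it returns false.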

The equivalent code for plain text detection from markup, using the Python textract library, looks something like this:

# some python file
import textract

# process() returns the extracted text as bytes
text = textract.process("path/to/file.extension")
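The same idea, pulling the text out of markup, can be sketched in Java with only the standard library. This is an illustration under stated assumptions rather than a replacement for textract: the class and method names are hypothetical, and the input must be well-formed XML or XHTML, since the JDK's DOM parser rejects the loosely structured HTML found on many web pages.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts the visible text from well-formed XML/XHTML.
final class MarkupText {

    /** Returns the text content of the markup with runs of whitespace collapsed. */
    static String extract(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            // getTextContent() concatenates all text nodes and drops tags and attributes
            String raw = doc.getDocumentElement().getTextContent();
            return raw.trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }
}
```

For example, MarkupText.extract("<html><body><h1>Title</h1> <p>Hello <b>world</b></p></body></html>") returns "Title Hello world".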

Java Textract requires a Document, which can be built from S3 objects in addition to ByteBuffers.

Complete Program: https://1drv.ms/u/s!Ashlm-Nw-wnWzEE_wF1QYdDJS4sj?e=vvYPy0

Monday, May 3, 2021

Mobile application data management.

We continue our discussion of mobile application data from the previous article. Not all data must be accessed from the enterprise data stores; some data remains local to the mobile application. This includes data for personal information management (PIM) and for mobile device management. The former covers the end user's own data, such as email, calendar, task lists, address books, and notepads. These may sync with enterprise data stores, but they still store data locally. In fact, the emphasis is on personal information, and the sync is referred to as the PIM sync. Local data is limited in size but not in lifetime: applications may keep local data for as long as the application is installed, and even beyond. It is critical that this data is managed with the same caution as the data on the enterprise servers. Security measures are more elaborate on the enterprise servers, and local data on mobile devices is harder to secure, but best practices such as encryption can still be enforced. Local storage used to be tightly limited, but newer devices have raised the capacity to the order of gigabytes. Mobile applications can certainly reduce their storage footprint, but that is less of a priority than persisting everything relevant to the end user, including some application state and statistics. Fast ordered data structures such as skip lists can speed up lookups and iteration over the data stored on mobile devices.
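As an illustration of the skip list point, the sketch below keeps calendar entries in a java.util.concurrent.ConcurrentSkipListMap, the skip list implementation that ships with the Java standard library. The class name, the choice of calendar entries, and the use of epoch milliseconds as keys are assumptions made for this example; it is held in memory only and says nothing about how a real application would persist the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical local store for calendar entries, backed by a skip list.
final class LocalCalendarStore {

    // Keys are start times in epoch milliseconds; the skip list keeps them sorted.
    private final ConcurrentSkipListMap<Long, String> entries = new ConcurrentSkipListMap<>();

    void put(long startMillis, String title) {
        entries.put(startMillis, title);
    }

    /** Titles of entries with fromMillis <= start < toMillis, in chronological order. */
    List<String> between(long fromMillis, long toMillis) {
        return new ArrayList<>(entries.subMap(fromMillis, true, toMillis, false).values());
    }

    /** Title of the first entry starting at or after the given time, or null if there is none. */
    String next(long nowMillis) {
        Map.Entry<Long, String> entry = entries.ceilingEntry(nowMillis);
        return entry == null ? null : entry.getValue();
    }
}
```

Because the keys stay ordered, range queries such as "everything this week" and "the next appointment" do not need a full scan of the stored entries.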

The latter, storage for mobile device management, is required because a growing number of companies ask their IT departments to keep end users' devices up to date and in working condition. End users can certainly maintain their devices themselves without depending on the IT department; the point here is the economy of scale and the convenience of automation. Both the deployment and the management of mobile devices and applications fall under the IT department's responsibility. Companies may ship applications to the application store for end users to download and install on their devices. Such applications may even allow the packaging, export, and backup of local data for use on another device or at a later point in time. They are usually lightweight and dedicated to a single purpose. The IT department can always publish more than one application to the application store, and these may access some of the device's local data.

#codingexercise: TextractAzure.docx