Cluster computing

Writing a Textract service for Azure

Problem statement: Data is not always readily available as plain text. It often has to be extracted from documents, slides, spreadsheets, web pages and the other file formats used for collaboration. In these cases, the text is enclosed within XML or HTML markup. This is not a problem for end users, who see the text on the rendered page, but a program has to extract it from the markup. This is where the textract library helps. It provides a single interface for extracting text from any file type, hence its name. The library is written in Python; for Java, a Textract client is available primarily through the Amazon SDK. There is no equivalent library or service on the Azure side. Textract is also known for screen scraping of text via Optical Character Recognition (OCR). This article explores the use case for using Textract with Java libraries.
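To make the markup problem concrete, here is a minimal sketch of pulling the visible text out of well-formed XML or XHTML using only the DOM parser that ships with the JDK. It is not part of textract or of the Amazon SDK; the class name `MarkupText` and the method `extract` are hypothetical names chosen for this illustration, and the sketch assumes the input is well-formed (real-world HTML often is not and would need a lenient parser).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of textract or the AWS SDK.
final class MarkupText {

    /** Returns the text of a well-formed XML/XHTML string with runs of whitespace collapsed. */
    static String extract(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations to avoid external-entity attacks.
            // Side effect: input that starts with a DOCTYPE is rejected.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            // getTextContent() concatenates the text of all descendant nodes.
            String raw = doc.getDocumentElement().getTextContent();
            return raw.replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }
}
```

For example, calling `MarkupText.extract` on a paragraph element that contains a nested bold element returns the sentence without any tags. A single call like this hides the parsing details from the caller, which is the same idea textract applies across many file types.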

Solution: A sample program that uses this library looks something like this:

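import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.textract.AmazonTextract;
import com.amazonaws.services.textract.AmazonTextractClientBuilder;
import com.amazonaws.services.textract.model.DetectDocumentTextRequest;
import com.amazonaws.services.textract.model.DetectDocumentTextResult;
import com.amazonaws.services.textract.model.Document;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpClient.Redirect;
import java.net.http.HttpClient.Version;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public static String textract(String url) {
    String text = "";
    try {
        // Fetch source as raw bytes so that image content is not corrupted by a charset round trip
        HttpClient client = HttpClient.newBuilder().version(Version.HTTP_1_1).followRedirects(Redirect.NORMAL).build();
        HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        byte[] body = response.body();
        // Returned as-is if Textract fails or finds no text
        text = new String(body, StandardCharsets.UTF_8);

        // Arrange
        EndpointConfiguration endpoint = new EndpointConfiguration(
                "https://textract.us-east-1.amazonaws.com", "us-east-1");
        AmazonTextract tclient = AmazonTextractClientBuilder.standard()
                .withEndpointConfiguration(endpoint).build();

        // Act
        DetectDocumentTextRequest drequest = new DetectDocumentTextRequest()
                .withDocument(new Document().withBytes(ByteBuffer.wrap(body)));
        DetectDocumentTextResult result = tclient.detectDocumentText(drequest);

        // Assert
        if (result != null && result.getBlocks() != null && !result.getBlocks().isEmpty()) {
            StringBuilder sb = new StringBuilder();
            // PAGE blocks carry no text and WORD blocks repeat the LINE text, so keep LINE blocks only
            result.getBlocks().stream()
                    .filter(b -> "LINE".equals(b.getBlockType()))
                    .forEach(b -> sb.append(b.getText()).append('\n'));
            text = sb.toString();
        }
    } catch (Exception e) {
        System.out.println(e);
    }
    return text;
}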

An UnsupportedDocumentException is thrown if the input document is not in JPEG or PNG format, so the bytes sent to the service must be an image rather than raw markup. Asynchronous calls can also use PDF documents.
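One way to avoid a round trip that ends in this exception is to check the leading bytes of the payload before calling the service. The following is a minimal sketch under that assumption; the class `ImageFormat` and its methods are hypothetical names, not part of the Amazon SDK, and the check looks only at the standard PNG and JPEG file signatures rather than validating the whole image.

```java
// Hypothetical pre-flight check for illustration only; not part of the AWS SDK.
final class ImageFormat {

    // Standard 8-byte PNG file signature.
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    // JPEG files start with the SOI marker (FF D8) followed by another marker byte (FF).
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    static boolean isPng(byte[] data) {
        return startsWith(data, PNG);
    }

    static boolean isJpeg(byte[] data) {
        return startsWith(data, JPEG);
    }

    /** True if the bytes look like one of the two formats the synchronous call accepts. */
    static boolean isSupported(byte[] data) {
        return isPng(data) || isJpeg(data);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could run `ImageFormat.isSupported` on the fetched bytes and fall back to plain markup handling when it returns false, instead of relying on the exception.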

Equivalent code for using textract with just text detection from markup would look something like this:

# some python file
import textract
text = textract.process("path/to/file.extension")

Java Textract requires a Document, which can be built from S3 objects in addition to ByteBuffers.

Complete Program: https://1drv.ms/u/s!Ashlm-Nw-wnWzEE_wF1QYdDJS4sj?e=vvYPy0

Tuesday, May 4, 2021
