Tuesday, May 4, 2021

 Writing a Textract service for Azure 

 

Problem statement: Data is not always readily available as plain text. It often has to be extracted from documents, slides, spreadsheets, web pages, and the other file formats people use to collaborate. In these cases the text is enclosed in XML or HTML markup. That is not a problem for end users, who see the text on the rendered page, but a program has to extract it from the markup. This is where the textract library helps: it provides a single interface for extracting text from many file types, hence its name. The library is written in Python. For Java, the closest counterpart is Amazon Textract, a separate AWS service available through the AWS SDK for Java that detects text in images and scanned documents via Optical Character Recognition. There is no equivalent library or service on the Azure side. This article explores the use case for using Textract with Java libraries.

Solution: A sample program that uses the AWS SDK for Java looks something like this:
 

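import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpClient.Redirect;
import java.net.http.HttpClient.Version;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.textract.AmazonTextract;
import com.amazonaws.services.textract.AmazonTextractClientBuilder;
import com.amazonaws.services.textract.model.DetectDocumentTextRequest;
import com.amazonaws.services.textract.model.DetectDocumentTextResult;
import com.amazonaws.services.textract.model.Document;

public static String textract(String url) {
    String text = "";
    try {
        // Fetch the source as raw bytes so that binary image data is not corrupted
        HttpClient client = HttpClient.newBuilder().version(Version.HTTP_1_1).followRedirects(Redirect.NORMAL).build();
        HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).build();
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        byte[] body = response.body();
        // Fall back to the raw response body if Textract fails or returns nothing
        text = new String(body, StandardCharsets.UTF_8);

        // Arrange
        EndpointConfiguration endpoint = new EndpointConfiguration(
                "https://textract.us-east-1.amazonaws.com", "us-east-1");
        AmazonTextract tclient = AmazonTextractClientBuilder.standard()
                .withEndpointConfiguration(endpoint).build();

        // Act
        DetectDocumentTextRequest drequest = new DetectDocumentTextRequest()
                .withDocument(new Document().withBytes(ByteBuffer.wrap(body)));
        DetectDocumentTextResult result = tclient.detectDocumentText(drequest);

        // Assert
        if (result != null && result.getBlocks() != null && !result.getBlocks().isEmpty()) {
            StringBuilder sb = new StringBuilder();
            // PAGE blocks carry no text, and WORD blocks repeat what the LINE blocks already hold
            result.getBlocks().stream()
                    .filter(x -> "LINE".equals(x.getBlockType()))
                    .forEach(x -> sb.append(x.getText()).append('\n'));
            text = sb.toString();
        }
    } catch (Exception e) {
        System.out.println(e);
    }
    return text;
}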

An UnsupportedDocumentException is thrown if the input document is not in JPEG or PNG format. Asynchronous calls can also take PDF documents.
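One way to avoid the exception is to check the format before calling the service. The sketch below guesses the format from the leading "magic" bytes of the payload. The class FormatSniffer and its methods are hypothetical names made up for this illustration and are not part of the AWS SDK. It only inspects the first few bytes, so treat it as a rough pre-check and not as validation of the whole file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not part of the AWS SDK): guesses a document's
// format from its leading "magic" bytes.
final class FormatSniffer {

    enum Format { JPEG, PNG, PDF, UNKNOWN }

    // JPEG files start with FF D8 FF
    private static final byte[] JPEG_MAGIC = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    // PNG files start with the 8-byte signature 89 'P' 'N' 'G' 0D 0A 1A 0A
    private static final byte[] PNG_MAGIC =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    // PDF files start with the ASCII text "%PDF"
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    static Format detect(byte[] data) {
        if (startsWith(data, JPEG_MAGIC)) return Format.JPEG;
        if (startsWith(data, PNG_MAGIC)) return Format.PNG;
        if (startsWith(data, PDF_MAGIC)) return Format.PDF;
        return Format.UNKNOWN;
    }

    // True only for the formats the text above lists for the synchronous call.
    static boolean supportedBySyncCall(byte[] data) {
        Format format = detect(data);
        return format == Format.JPEG || format == Format.PNG;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A caller could run FormatSniffer.supportedBySyncCall(bytes) on the downloaded payload and skip the synchronous request when it returns false, for example when the URL points to an HTML page.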

The equivalent code with the Python textract library, which only detects text in markup and other file types, looks something like this:

# some python file
import textract

# process() returns the extracted text as bytes
text = textract.process("path/to/file.extension")
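For comparison, the markup-only case from the problem statement can be sketched in Java with nothing but the JDK's built-in XML parser. This is a minimal sketch under clear assumptions: the class MarkupText and its methods are hypothetical names for this illustration, and the input must be well-formed XML or XHTML without a DOCTYPE declaration. Real-world HTML is often not well-formed and would need a dedicated HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: pulls the visible text out of well-formed XML/XHTML.
final class MarkupText {

    static String extract(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations to avoid resolving external entities
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            org.w3c.dom.Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Depth-first walk that joins every non-blank text node with a single space.
    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) out.append(' ');
                out.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }
}
```

Calling MarkupText.extract("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>") returns "Title Hello world". Joining text nodes with a space is a simplification: it also puts a space between inline elements and any punctuation that follows them.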

Java Textract requires a Document, which can be built from an S3 object in addition to a ByteBuffer.

 

 

 

 
