Cluster computing

Sunday, July 9, 2017

Planning a cloud-hosted web service for text summarization.

Introduction: We introduce a service that summarizes text from documents supplied by users. Users will want to create, update and retrieve summaries; the service does not allow summaries to be deleted. Every version of the original document generates its own summary. Users need not upload the documents themselves. They can point the service at, and authorize it for, document libraries such as OneDrive, Google Drive, Amazon Drive, SharePoint and S3. Documents are identified by the same criteria that these document stores use. The summaries are maintained by the summarization service.
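The operations described above (create, update and retrieve, with no delete, and one summary per document version) can be sketched as a small in-memory store. This is only an illustration under stated assumptions: the class name SummaryStore, its method names and the 1-based version numbering are hypothetical and not part of the planned service.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SummaryStore:
    """In-memory stand-in for the summary store (hypothetical, for illustration)."""

    # Maps a document identifier to its summaries, one entry per document version.
    _summaries: Dict[str, List[str]] = field(default_factory=dict)

    def _index(self, doc_id: str, version: int) -> int:
        """Translate a 1-based version number into a list index, validating it."""
        versions = self._summaries[doc_id]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{doc_id} has no version {version}")
        return version - 1

    def create(self, doc_id: str, summary: str) -> int:
        """Store the summary of a new document version and return its version number."""
        versions = self._summaries.setdefault(doc_id, [])
        versions.append(summary)
        return len(versions)

    def update(self, doc_id: str, version: int, summary: str) -> None:
        """Replace the summary stored for an existing document version."""
        self._summaries[doc_id][self._index(doc_id, version)] = summary

    def retrieve(self, doc_id: str, version: Optional[int] = None) -> str:
        """Return the summary for a version, defaulting to the latest one."""
        versions = self._summaries[doc_id]
        if version is None:
            return versions[-1]
        return versions[self._index(doc_id, version)]

    # Deliberately no delete(): the service does not allow deleting summaries.
```

A caller would invoke create once per new document version and retrieve without a version to get the most recent summary.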

System design: We describe the planning involved in designing the system that supports the summarization service. The summaries are documents themselves, so the system must allow documents to be stored and retrieved. The document store does not need to support online transaction processing, but it does need to scale to large numbers and volumes of documents per user. This calls for a document store such as MongoDB rather than a relational database. Since the service is planned for deployment to the cloud, the database is also likely to scale better if it is native to the cloud. For example, we can choose Azure Cosmos DB or DynamoDB. Both are designed to scale to very large sizes, especially for JSON documents, and both facilitate query and aggregation.
A proprietary database, here called TextDB, could be written to store the documents in a format conducive to natural language processing, but this is less of an immediate necessity than the summarization logic and the final summary of each document, which are what operations need. The database chosen for the summary store should meet or exceed an LMbench-style criterion of a 512 MB memory read during retrieval of a single document, with strict latency and throughput targets for thousands of concurrent accesses. That said, the micro-benchmark for the service is held to a higher standard, since text analysis is known to be memory intensive. To support hundreds of documents and their summaries for each user, based on the user's personalization, we could budget 40 GB of storage per user for summaries and statistics.
The difference we make here is that the summaries are JSON data rather than files, so we want to keep them in the database instead of in S3 or file storage. The service itself is REST based and can therefore be accessed from clients on mobile devices or desktops around the world. The service could offer the same service level agreements as the document libraries cited above.
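As a rough illustration of keeping summaries as JSON data within a per-user budget, the sketch below serializes a summary record and checks the total size against the 40 GB figure. The record layout, the field names and the reading of 40 GB as 40 * 1024^3 bytes are all assumptions made for this example, not part of the design.

```python
import json
from typing import List

# Assumption: the 40 GB per-user budget is read as binary gigabytes.
USER_BUDGET_BYTES = 40 * 1024**3


def record_size_bytes(record: dict) -> int:
    """Size of a summary record once serialized as UTF-8 encoded JSON."""
    return len(json.dumps(record, ensure_ascii=False).encode("utf-8"))


def fits_budget(records: List[dict], budget_bytes: int = USER_BUDGET_BYTES) -> bool:
    """True if all of a user's records together stay within the storage budget."""
    return sum(record_size_bytes(r) for r in records) <= budget_bytes


# Hypothetical record layout; the field names are illustrative only.
example_record = {
    "user_id": "user-123",
    "document_url": "https://example.com/docs/report.docx",
    "document_version": 3,
    "summary": ["First selected sentence.", "Second selected sentence."],
    "statistics": {"sentence_count": 42, "word_count": 900},
}
```

A real document database would add its own storage overhead per record, so a check like this only gives a lower bound on actual usage.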
Since the documents and their summaries are both user specific, it helps to organize resources per user rather than in a shared pool. Each summary keeps an MD5 hash of the original document, so we can tell when the document has changed and the summary is no longer valid. Each document in a document library is assumed to be identifiable by a URL, which is also stored with the summary. Documents uploaded directly to the summarization service will make their way to S3, and a corresponding URL will be generated for read-only access. Previous versions of the same document in the external document library are likewise assumed to be available by URL. The summarization service has an input collection and an output flow. The input collection accesses documents in external document libraries through their REST-based services, using user authentication delegated by each library. Authorization for documents at different libraries can be tied to membership in the summarization service, with access to these libraries granted at registration or on demand later. Summaries can be thought of as a selection of a few sentences from the original text and therefore depend on the input text.
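The hash-based invalidation described above can be sketched with Python's standard hashlib module. The function names and the record fields (document_url, document_md5, summary) are hypothetical and chosen only for illustration. MD5 is used here purely for change detection, as the text proposes, not for security.

```python
import hashlib
from typing import Iterable


def md5_of(document_bytes: bytes) -> str:
    """Hex MD5 digest of the document contents, used only to detect changes."""
    return hashlib.md5(document_bytes).hexdigest()


def make_summary_record(document_url: str, document_bytes: bytes,
                        summary_sentences: Iterable[str]) -> dict:
    """Build a summary record that remembers which document contents it describes."""
    return {
        "document_url": document_url,            # URL stored with the summary
        "document_md5": md5_of(document_bytes),  # hash of the summarized version
        "summary": list(summary_sentences),
    }


def is_stale(record: dict, current_document_bytes: bytes) -> bool:
    """True if the document no longer matches the contents that were summarized."""
    return record["document_md5"] != md5_of(current_document_bytes)
```

When is_stale returns True, the service would generate a new summary for the new document version rather than overwrite the old one, since every version keeps its own summary.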

Conclusion: This article merely suggests a cloud-based architecture for a microservice that provides a REST-based API for integration by different clients. The model is very similar to how existing document libraries operate in the cloud, with the difference that the summarization service maintains its own database of summaries.
#codingexercise
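// Recursive binary search over the sorted range nums[start..end] (inclusive).
// Returns the index of val, or -1 if val is not present.
int binary_search(List<int> nums, int start, int end, int val)
{
    if (start > end) return -1;          // empty range, e.g. an empty list
    int mid = start + (end - start) / 2; // avoids overflow of start + end
    if (nums[mid] == val) return mid;
    if (nums[mid] < val)
        return binary_search(nums, mid + 1, end, val);
    else
        return binary_search(nums, start, mid - 1, val);
}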
