Planning for a cloud-hosted web service for text summarization.
Introduction: We introduce a service for
summarizing text from documents uploaded by users. Users will want to create,
update and retrieve summaries. The service does not allow deletion of
summaries. Every version of the original document generates its own summary. Users
need not upload the documents themselves. They can point to, and authorize access to,
document libraries such as OneDrive, Google Drive, Amazon Drive, SharePoint and
S3. Documents are identified by the same criteria that these document
stores use. The summaries are maintained by the summarization service.
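The constraints above (create, update and retrieve but no delete, and one summary per document version) can be sketched as a small in-memory store. This is only an illustration under stated assumptions, not the service's implementation: the names `Summary` and `SummaryStore`, the method names, and the `(document_id, version)` key are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    document_id: str  # identifier as used by the source document library
    version: str      # version label of the original document
    text: str         # the summary itself


class SummaryStore:
    """Hypothetical in-memory store: create, update, retrieve, no delete."""

    def __init__(self):
        # Maps (document_id, version) -> Summary, one entry per document version.
        self._summaries = {}

    def create(self, document_id, version, text):
        key = (document_id, version)
        if key in self._summaries:
            raise KeyError(f"summary already exists for {key}")
        self._summaries[key] = Summary(document_id, version, text)
        return self._summaries[key]

    def update(self, document_id, version, text):
        summary = self._summaries[(document_id, version)]  # KeyError if missing
        summary.text = text
        return summary

    def retrieve(self, document_id, version):
        return self._summaries[(document_id, version)]

    def versions(self, document_id):
        # All versions of a document that currently have a summary.
        return sorted(v for (d, v) in self._summaries if d == document_id)
```

Keying on the version as well as the document identifier is one way to make "every version generates its own summary" explicit. The absence of a delete method mirrors the stated policy.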
System design: We describe the planning involved in designing
the system that supports the summarization service. The summaries are documents themselves, so
the service must allow documents to be stored and retrieved. The document store does
not need to support online transaction processing, but it does need to scale
to a large number and volume of documents per user. This
calls for a document store such as MongoDB rather than a relational
database.

Since our service is planned for deployment to the cloud, the
database is likely to scale better when it is native to that cloud. For example, we can choose
Azure Cosmos DB or Amazon DynamoDB. Both of these
databases are designed to scale to very large data sizes and can store JSON
documents. Both also support queries over the stored documents, with aggregation support varying by product. A proprietary
database, here called TextDB, could be written to store the documents in a
format conducive to natural language processing, but this is less of an immediate
necessity than the summarization logic and the final summary of each
document, which are what operations depend on.

The database chosen for the
summary store should meet or exceed an LMbench-style criterion of a 512 MB memory read during
retrieval of a single document, with strict latency and throughput targets for thousands
of concurrent accesses. That said, the
micro-benchmark for the service is held to a higher standard, since text analysis is
known to be memory intensive. To support hundreds of documents and their summaries
for each user, based on the user's personalization, we could budget 40 GB of
storage per user for their summaries and statistics. The
distinction we make here is that the summaries are JSON data rather than files,
so we keep them in the database instead of in S3 or file
storage.

The service itself is REST
based and can therefore be accessed from clients on mobile devices or desktops
around the world. The service could offer the same service level agreements as
the document libraries cited above. Since the documents and their summaries
are both user specific, it is helpful to organize resources per user
rather than as a shared pool.

Each summary keeps an MD5
hash of the original document, so that we can tell when the document has changed and the
summary is no longer valid. Each document in a document library is assumed to be identifiable
by a URL, which is also stored with the summary. Documents uploaded directly to the
summarization service make their way to S3, and a corresponding URL is
generated for read-only access. Previous
versions of the same document in an external document library are also
assumed to be available by URL.

The summarization service has an input
collection and an output flow. The input collection accesses documents
in external document libraries through their REST based services, relying on delegated
user authentication at each library. Authorization for documents
at different libraries can be granted as part of membership in the summarization
service, either at the time of registration or on
demand later. Summaries can be thought of as a selection of a few sentences
from the original text, and therefore depend on the input text.
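The idea of storing an MD5 hash and a URL alongside each summary, and using the hash to detect a changed document, could look roughly like the following. This is a sketch under assumptions: the field names (`document_url`, `document_md5`, `summary`), the helper functions, and the example URL are hypothetical and not part of the design described above. Only Python's standard `hashlib` and `json` modules are used.

```python
import hashlib
import json


def md5_of(document_bytes):
    """Hex MD5 digest of the raw document content."""
    return hashlib.md5(document_bytes).hexdigest()


def make_summary_record(document_url, document_bytes, sentences):
    """Build a JSON-serializable summary record (hypothetical field names)."""
    return {
        "document_url": document_url,            # where the original lives
        "document_md5": md5_of(document_bytes),  # fingerprint at summary time
        "summary": sentences,                    # sentences selected from the text
    }


def is_stale(record, current_bytes):
    """True when the document no longer matches the hash stored with the summary."""
    return record["document_md5"] != md5_of(current_bytes)


original = b"First sentence. Second sentence. Third sentence."
record = make_summary_record(
    "https://example.com/docs/report.txt",  # placeholder URL for illustration
    original,
    ["First sentence."],
)
print(json.dumps(record, indent=2))
print(is_stale(record, original))              # False
print(is_stale(record, original + b" More."))  # True
```

MD5 is used here only as a change detector, as in the text. It is not suitable where resistance to deliberate collisions matters.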
Conclusion: This article suggests a cloud-based
architecture for a microservice that provides a REST based API for integration by
different clients. The model is very similar to how existing document libraries
operate in the cloud, with the difference that the summarization service
maintains a database of summaries.
#codingexercise
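// Recursive binary search over the sorted list nums within [start, end] (inclusive).
// Returns the index of val, or -1 if val is not present.
int binary_search(List<int> nums, int start, int end, int val)
{
    if (start > end) return -1; // empty range; also covers an empty list
    int mid = start + (end - start) / 2; // avoids overflow of start + end
    if (nums[mid] == val) return mid;
    if (nums[mid] < val)
        return binary_search(nums, mid + 1, end, val);
    else
        return binary_search(nums, start, mid - 1, val);
}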