We reviewed the writeup on storage requirements for the
summarization service, which indicated that it can be hosted on top of a document
library such as SharePoint, OneDrive, Google Drive or S3, as long as we can keep
track of the summaries generated and the documents from which each summary was
generated.

Today we discuss the compute requirements for the summarization
service. Summarization relies on a latent word vector space and
a Markov random walk from word vector to word vector. This translates to about two
two-dimensional matrices of floating-point values with about three
hundred features. Selecting keywords and ranking the sentences
that contain those keywords takes only linear storage. A back-of-the-envelope
calculation therefore puts each of these two-dimensional matrices at roughly half
a megabyte: assuming 300 x 300 values at 4 to 8 bytes each, that is about 0.36 to
0.72 MB. If we set a benchmark target of processing ten thousand
documents per second, we will require a total of about 10 GB of memory for the
service, in addition to all operational requirements for memory. Therefore, we
consider the service able to scale out on any cloud compute platform
or on a Marathon stack. Because storage is shared and work is isolated per document
and per user, nothing prevents us from keeping the service elastic as demand
grows. At the same time, we hold the web service to the same
production standards as any public-facing web service. These include
troubleshooting, logging, monitoring, health checks, notifications for
performance degradation and other such services. Most platform-as-a-service
deployment models come with these benefits automatically or can be extended
easily for our needs. Similarly, the public cloud has features for
monitoring and billing that are very useful to plan and scale out the service
as appropriate.

The service may also be hosted in one region or another. Web application hosting
best practices recommend deploying across one or more availability zones, each
with the traditional model of hosting a web service, to improve the reliability
and availability of the service for its users. They also recommend using available
cloud-based features such as routing and caching to improve the performance and
security of the web application. For example, a
global network of edge locations can be used to deliver dynamic, static and
streaming content. Requests for the content are then routed to the nearest edge
location.

Network traffic can also be filtered and security
improved not just at the edge routers but also at the host level. Security groups
can be used to manage access to the ports on the host. Similarly, identity and
access management services can be used to authenticate and authorize users. The
data access layer that interacts with the database or document library can also
be secured with internal service accounts and privileges. Data entering
the data tier can be secured at rest and in transit. Routine backups and
maintenance activities can be planned to safeguard and protect the data. Scaling
and failover can be provisioned automatically for these services.
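
The memory estimate given earlier can be checked with a few lines of Python. This is only a rough sketch under stated assumptions that are not in the original measurements: each matrix is taken to be a dense 300 x 300 square, floats are 4 or 8 bytes, each document needs two matrices, and all documents arriving within one second are held in memory together. The function names are made up for illustration.

```python
# Back-of-the-envelope memory estimate for the summarization service.
# Assumptions (illustrative, not measured): dense 300 x 300 matrices,
# two matrices per document, and one second's worth of documents
# resident in memory at the same time.


def matrix_bytes(features: int = 300, bytes_per_float: int = 4) -> int:
    """Size in bytes of one dense features x features matrix."""
    return features * features * bytes_per_float


def service_bytes(docs_per_second: int = 10_000,
                  matrices_per_doc: int = 2,
                  per_matrix_bytes: int = 500_000) -> int:
    """Memory for one second of documents, using the rounded
    half-megabyte figure per matrix."""
    return docs_per_second * matrices_per_doc * per_matrix_bytes


if __name__ == "__main__":
    print(f"one matrix, 4-byte floats: {matrix_bytes() / 1e6:.2f} MB")
    print(f"one matrix, 8-byte floats: "
          f"{matrix_bytes(bytes_per_float=8) / 1e6:.2f} MB")
    print(f"service total: {service_bytes() / 1e9:.1f} GB")
```

Under these assumptions a single matrix comes to 0.36 MB or 0.72 MB depending on float width, which brackets the half-megabyte figure, and ten thousand documents with two matrices each come to 10 GB.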
#word vector match http://ideone.com/2NUwvu