Tuesday, July 11, 2017

We reviewed the writeup on storage requirements for the summarization service, which indicated that it can be hosted on top of a document library such as SharePoint, OneDrive, Google Drive or S3, as long as we can keep track of the summaries generated and the documents from which each summary was generated. Today we discuss the compute requirements for the summarization service. The tasks for summarization involve using a latent word vector space and a Markov random walk from word vector to word vector. This translates to about two two-dimensional matrices of floating point values per document, with about three hundred features. The selection of keywords and the ranked representation of sentences having those keywords take only linear-order storage. Therefore, a back-of-the-envelope calculation puts the allocation for these two-dimensional matrices at roughly half a MB each, or about 1 MB per document. If we introduce a microbenchmark of processing ten thousand documents per second, and assume the matrices for all of those documents are held in memory at once, we will require a total of about 10 GB of space for the service, in addition to all operational requirements for memory.
Therefore, we consider the service to operate on a scale-out basis on any cloud compute capability or a Marathon stack. With the storage being shared and the isolation being per document per user, we have no restrictions in keeping our service elastic to growing needs. At the same time, we hold the web service to the same production standards as any public web service. These include troubleshooting, logging, monitoring, health checks and notifications for performance degradation. Most platform-as-a-service deployment models come with these benefits automatically or can be extended easily for our needs. Similarly, the public cloud has features for monitoring and billing that are very useful for planning and scaling out the service as appropriate. The service may also be hosted in one region or another.
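The back-of-the-envelope memory estimate above can be reproduced with a few lines of Python. This is only a sketch: the 300 x 300 matrix shape, the 32-bit float size and the idea that all ten thousand documents are in memory at once are assumptions read into the prose, and the names are made up for illustration. Under these assumptions each matrix comes to 0.36 MB, which rounds up to the "roughly half a MB" figure, and the total comes to 7.2 GB, which is in line with the "about 10 GB" figure.

```python
# Back-of-the-envelope memory estimate for the summarization service.
# Every constant below is an assumption, not a measurement.

FEATURES = 300              # assumed dimension of the latent word vector space
BYTES_PER_FLOAT = 4         # assumed 32-bit floating point values
MATRICES_PER_DOCUMENT = 2   # "about two" two-dimensional matrices per document
DOCUMENTS_IN_FLIGHT = 10_000  # assumed: one second's worth of documents held at once


def matrix_bytes(features=FEATURES, bytes_per_float=BYTES_PER_FLOAT):
    """Size in bytes of one square features x features matrix."""
    return features * features * bytes_per_float


def service_bytes(documents=DOCUMENTS_IN_FLIGHT, matrices=MATRICES_PER_DOCUMENT):
    """Total bytes needed to hold the matrices for all in-flight documents."""
    return documents * matrices * matrix_bytes()


if __name__ == "__main__":
    print(f"per matrix:   {matrix_bytes() / 1e6:.2f} MB")
    print(f"per document: {MATRICES_PER_DOCUMENT * matrix_bytes() / 1e6:.2f} MB")
    print(f"service:      {service_bytes() / 1e9:.2f} GB")
```

Changing BYTES_PER_FLOAT to 8 doubles every figure, so the estimate is best read as an order of magnitude for sizing the scale-out, not as a precise budget.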
Web application hosting best practices recommend creating one or more availability zones, each with the traditional model of hosting a web service, to improve the reliability and availability of the service to the users. Furthermore, they recommend the use of available cloud-based features such as routing and caching to improve performance and security for the web application. For example, a global network of edge locations can be used to deliver dynamic, static and streaming content. In that case, requests for the content are routed to the nearest edge location. Network traffic can also be filtered, and security improved, not just at the edge routers but also at the host level. Security groups can be used to manage access to the ports on the host. Similarly, identity and access management services can be used to authenticate and authorize users. The data access layer that interacts with the database or document library can also be secured with internal service accounts and privileges. Data entering the data tier can be secured at rest and in transit. Routine backups and maintenance activities can be planned to safeguard and protect the data. Scaling and failover can be provisioned automatically for these services.
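To make the idea of routing a request to the nearest edge location concrete, here is a minimal sketch that picks an edge by great-circle distance. It is an illustration only: the edge names and coordinates are hypothetical, and real content delivery networks typically choose an edge from DNS and latency measurements rather than from geographic distance alone.

```python
import math

# Hypothetical edge locations as (latitude, longitude) in degrees.
EDGE_LOCATIONS = {
    "london": (51.5074, -0.1278),
    "new-york": (40.7128, -74.0060),
    "tokyo": (35.6762, 139.6503),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_edge(client, edges=EDGE_LOCATIONS):
    """Return the name of the edge location closest to the client."""
    return min(edges, key=lambda name: haversine_km(client, edges[name]))


if __name__ == "__main__":
    paris = (48.8566, 2.3522)
    print(nearest_edge(paris))
```

With the three edges above, a client in Paris would be served from the London edge.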
#word vector match http://ideone.com/2NUwvu
