Sunday, July 9, 2017

Planning for a cloud hosted web service for text summarization.
Introduction: We introduce a service for summarizing text from documents uploaded by users. Users will want to create, update and retrieve summaries. The service does not allow deletion of summaries. Every version of the original document generates its own summary. Users need not upload the documents themselves; they can point to and authorize any document library such as OneDrive, Google Drive, Amazon Drive, SharePoint or S3. The documents are identified by the same criteria as in these document stores. The summaries are maintained by the summarization service.
System design: We describe the planning involved in designing the system supporting the summarization service. The summaries are documents themselves, so we allow documents to be stored and retrieved. The document store does not need to support online transaction processing, but it does need to scale to large numbers and volumes per user. This calls for a document store such as MongoDB rather than a relational database. Since the service is planned for deployment to the cloud, the database can also scale better when it is native to the cloud; for example, we can choose Cosmos DB or DynamoDB. Both databases can scale to arbitrary size, especially for JSON documents, and both facilitate query and aggregation. A proprietary database named TextDB could be written to store the documents in a format conducive to natural language processing, but this is not as immediate a necessity as the summarization logic and the final summary from each document for operational purposes.
The choice of database for the summary store should meet or exceed an LMbench criterion of 512MB memory read during retrieval of a single document, with strict latency and throughput targets for thousands of concurrent accesses. That said, the micro-benchmark for the service is held to a higher standard since text analysis is known to be memory intensive. To support hundreds of documents with a summary for each, personalized per user, we could budget 40GB of storage per user for summaries and statistics. The distinction we make here is that the summaries are JSON data rather than files, so we keep them in the database instead of in S3 or file storage.
The service itself is REST based and can therefore be accessed from clients on mobile devices or desktops around the world. It could offer the same service level agreements as the document libraries cited above. Since the documents and their summaries are both user specific, it helps to take a user-centric approach to the resources rather than a shared one. Each summary keeps an MD5 hash of the original document so we know when the data has changed and the summary is invalidated. Each document in a document library is assumed to be identifiable by a URL, which is also stored with the summary. Documents uploaded to the summarization service make their way to S3, and a corresponding URL is generated for read-only access. Previous versions of the same document in the external document library are also assumed to be available by URL.
The summarization service has an input collection and an output flow. The input collection accesses documents in external document libraries via their REST based services and by delegating user authentication to the respective document libraries. Authorization for documents at different libraries can be granted when the user registers with the summarization service, or on demand afterwards. Summaries can be thought of as a selection of a few sentences from the original text and therefore depend on the input text.
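To make the data model concrete, here is a minimal sketch of what a summary record and its REST surface could look like. The class name, property names and route shapes below are illustrative assumptions for this plan, not a finalized schema.
using System;

// A minimal sketch of the summary record kept in the document store.
// All names here (Summary, DocumentUrl, DocumentMD5, ...) are illustrative assumptions.
public class Summary
{
    public string Id { get; set; }              // key of the summary document
    public string UserId { get; set; }          // summaries are user-centric resources
    public string DocumentUrl { get; set; }     // URL of the original in its library (OneDrive, S3, ...)
    public string DocumentVersion { get; set; } // every version of the original gets its own summary
    public string DocumentMD5 { get; set; }     // hash to detect when the original has changed
    public string SummaryText { get; set; }     // the selected sentences
    public DateTime CreatedUtc { get; set; }
}

// The REST surface could then follow the usual resource conventions, e.g.
//   POST   /users/{userId}/summaries          create a summary
//   GET    /users/{userId}/summaries/{id}     retrieve a summary
//   PUT    /users/{userId}/summaries/{id}     update a summary
// (no DELETE, since the service does not allow deletion of summaries)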
Conclusion: This article merely suggested a cloud based architecture for a microservice that provides a REST based API for integration by different clients. The model is very similar to how existing document libraries operate in the cloud with the difference that the summarization service maintains a database.
#codingexercise
int binary_search(List<int> nums, int start, int end, int val)
{
    if (start > end) return -1;      // guard against an empty or invalid range
    int mid = (start + end) / 2;
    if (nums[mid] == val) return mid;
    if (start == end) return -1;     // range narrowed to one element and it did not match
    if (nums[mid] < val)
        return binary_search(nums, mid + 1, end, val);
    else
        return binary_search(nums, start, mid, val);
}

Saturday, July 8, 2017

The appeal of text summarization software:
Here are a few use cases of text summarization for savvy readers.
First, users go to a web site where they can sign in and upload any text document in well known formats such as doc, pdf, text, html, etc., and these will be instantly summarized and archived for perusal. The summary, if not the original content, is never deleted and is maintained forever. The summary can also be shared with shortened urls or via one of the many social networking applications. A user's collection of texts can be automatically grouped and maintained by day, week, month or year so that it can be retrieved or scrolled through. The summaries could appear as thumbnail images of their respective original documents and be rendered in a readable manner with enhancements for clear reading.
Second, users can download an application on their mobile devices that looks like a recycle bin but is intended for dragging and dropping text documents for summarization. Since the summaries are maintained in the cloud and served from a central web service, they are accessible both via the browser on the mobile devices and in the application itself. Users download this application because it makes it handy to transfer documents from the downloads folder to their home in their personal summaries cloud. The mobile application could be made available on a variety of platforms such as Android, iOS and Windows.
Third, users can install browser-based plugins that provide a one-click push-button service to summarize the current document and upload it to their summaries cloud space. This lets them generate as many summaries as necessary instead of otherwise copious notes, and to view them after visiting the different web pages that were the sources of the original text. Furthermore, they can view all the summaries, along with their organization, on one page. In all displays of the summary, the rendering is done from a single copy in the cloud.
Fourth, this software can be bundled with various document editing software such as Microsoft Word and PDF writers. Not only does this software enable summarization of the content made available, it can also be offered in any editor where content is produced. This includes books, magazines and other forms of publication, and the summaries could not only stay current for every version of the document but also be compared with previous related publications.
Fifth, this software can be made available on any device that can recognize text from document formats that are otherwise not directly readable. For example, optical character recognition devices and software can now not only recognize text from images but also instantly summarize it. It might help to translate the images to a different file format that is more conducive to selecting and summarizing text. This may even be automated as part of the text recognition workflow so that the output from the devices also includes a short link to the summary.

Finally, text is ubiquitous across space, domain and time. Summarization enables users to focus on the salient ideas in the text rather than merely retrieving the original document by indexing. In that sense, it is a friend to the user, aiding comprehension and retention rather than just locating the original document for the sometimes time consuming task of reading.


Friday, July 7, 2017

Today we get back to our discussion of NoSQL design. These systems involve key-value stores run on a large number of commodity machines with data partitioned and replicated between machines. We discussed the infrastructure, which consists of physical and virtual nodes. Data is partitioned based on consistent hashing: cachepoints are arranged along a circle depicting the key range, and each caches the objects corresponding to its range. With multiple copies of the data, we need consistency models. The choice of consistency model determines how a client request is dispatched to a replica and how the replicas propagate and apply updates. The topology that helps with this can be classified as a single-master model, a multi-master or no-master model, or a quorum-based 2PC model that provides strict consistency.
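As a rough sketch of that partitioning step, the snippet below places virtual nodes (cachepoints) on a hash ring and routes each key to the first cachepoint clockwise from its hash. The use of MD5 and the number of virtual nodes per physical node are assumptions made only for the example.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// A minimal consistent-hashing sketch: virtual nodes (cachepoints) are placed on a
// circle of hash values and a key is owned by the first node clockwise from its hash.
public class ConsistentHashRing
{
    private readonly SortedDictionary<uint, string> ring = new SortedDictionary<uint, string>();

    private static uint Hash(string s)
    {
        using (var md5 = MD5.Create())
        {
            var bytes = md5.ComputeHash(Encoding.UTF8.GetBytes(s));
            return BitConverter.ToUInt32(bytes, 0);
        }
    }

    public void AddNode(string node, int virtualNodes = 4)
    {
        // each physical node appears several times on the ring as virtual nodes
        for (int i = 0; i < virtualNodes; i++)
            ring[Hash(node + "#" + i)] = node;
    }

    public string GetNode(string key)
    {
        uint h = Hash(key);
        // first cachepoint clockwise from the key's hash; wrap around to the start
        foreach (var point in ring.Keys)
            if (point >= h) return ring[point];
        return ring[ring.Keys.First()];
    }
}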
The syncing of states is achieved with techniques such as the traditional 2PC. To improve availability, quorum based 2PC was introduced. In this model, the coordinator asks all the replicas to write the data but waits only for the responses from a select number of them. When reading the data, a minimum set of replicas is read and the copy with the latest timestamp is returned. This technique provides strict consistency.
If we don't require strict consistency, we can go with a gossip protocol to propagate the updates via message exchanges. A vector clock is used to help with the messaging. Both the client and the replicas maintain vector clocks, and each local vector clock merges a received timestamp by picking out the maximum value per entry. The client does not gossip its vector clock to other clients; only the replicas gossip with each other. When the client sends its vector clock, the replicas check whether the client state precedes any of their current versions and discard it accordingly. When a replica receives updates from other replicas via gossip, it merges the version trees.
Replicas may also exchange their logs and check whether certain operations can be applied because their dependencies have already been applied. Since the operations are sorted by vector clock in causal order, simultaneous operations can also be serialized. The replica performing the first update gets to pick a sequence number higher than all others, and everyone follows this sequence number in their vector clocks. Since operations are in different stages of processing on different replicas, a replica will not discard an operation it has completed until it sees that the vector clocks from all others have preceded it.
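A minimal sketch of how such a vector clock could be represented, merged by taking per-entry maximums, and compared for causal precedence follows. The dictionary representation keyed by replica name is an assumption of the example.
using System;
using System.Collections.Generic;
using System.Linq;

// A minimal vector clock sketch: one counter per replica, merged by taking the
// maximum of each entry, with a partial order used to detect causal precedence.
public class VectorClock
{
    public Dictionary<string, long> Entries = new Dictionary<string, long>();

    public void Advance(string replica)
    {
        // advance the local entry when this replica performs an operation
        Entries[replica] = Entries.TryGetValue(replica, out var v) ? v + 1 : 1;
    }

    public void Merge(VectorClock other)
    {
        // merge by picking the maximum value for each entry
        foreach (var kv in other.Entries)
            Entries[kv.Key] = Math.Max(Entries.TryGetValue(kv.Key, out var v) ? v : 0, kv.Value);
    }

    public bool Precedes(VectorClock other)
    {
        // this clock causally precedes other if no entry is larger and at least one is smaller
        bool strictlySmaller = false;
        foreach (var key in Entries.Keys.Union(other.Entries.Keys))
        {
            long a = Entries.TryGetValue(key, out var x) ? x : 0;
            long b = other.Entries.TryGetValue(key, out var y) ? y : 0;
            if (a > b) return false;
            if (a < b) strictlySmaller = true;
        }
        return strictlySmaller;
    }
}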
Deletes are somewhat different because when an object is deleted, its timestamp may also be lost. Therefore to handle this case, the associated metadata and timestamp information of the deleted objects is kept around long enough until confirmed by all other replicas.
Storage for the map-reduce may also be in the form of distributed files or even databases. The advantage of using databases in such cases is that they not only provide copy-on-write but also multi-version concurrency control. A bloom filter or an index can be used to determine if a key exists in a set. All operations are logged. All data and logs are flushed to disk to handle recovery for that node. 
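For the key-existence check mentioned above, a bloom filter answers "possibly present" or "definitely absent" using a few bit probes. The sketch below derives its k probe positions from two simple hashes, which is one common construction; the hash choices here are illustrative only.
using System;
using System.Collections;
using System.Collections.Generic;

// A minimal bloom filter sketch: k bit positions are derived from two hashes of the key;
// a hit on all positions means "possibly present", any miss means "definitely absent".
public class BloomFilter
{
    private readonly BitArray bits;
    private readonly int k;

    public BloomFilter(int sizeInBits, int hashCount)
    {
        bits = new BitArray(sizeInBits);
        k = hashCount;
    }

    private IEnumerable<int> Positions(string key)
    {
        int h1 = key.GetHashCode();
        int h2 = key.Length * 31 + (key.Length > 0 ? key[0] : 0); // simple second hash for the sketch
        for (int i = 0; i < k; i++)
        {
            long combined = ((long)h1 + (long)i * h2) & long.MaxValue; // keep it non-negative
            yield return (int)(combined % bits.Length);
        }
    }

    public void Add(string key)
    {
        foreach (var p in Positions(key)) bits[p] = true;
    }

    public bool MightContain(string key)
    {
        foreach (var p in Positions(key))
            if (!bits[p]) return false;
        return true;
    }
}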
The API model for a large key-value store can include the following operations; a code sketch follows the list.
get(key) which extracts the value for a given key
put(key, value) which creates or updates the value for a key
delete(key) which removes the key and its associated value
execute(key, operation, parameters) which invokes an operation on the value
mapreduce(keylist, mapFunc, reduceFunc) which invokes map-reduce over a key range
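Expressed in code, the same API might look like the interface sketch below; the generic parameters and the delegate shapes for execute and mapreduce are illustrative assumptions rather than a fixed contract.
using System;
using System.Collections.Generic;

// A sketch of the key-value store API listed above. The delegate shapes used for
// Execute and MapReduce are assumptions made for illustration only.
public interface IKeyValueStore<TKey, TValue>
{
    TValue Get(TKey key);                 // extract the value for a key
    void Put(TKey key, TValue value);     // create or update the value
    void Delete(TKey key);                // remove the key and its associated value
    TValue Execute(TKey key, string operation, params object[] parameters); // invoke an operation on the value
    IDictionary<TKey, TValue> MapReduce(IEnumerable<TKey> keys,
                                        Func<TKey, TValue, IEnumerable<KeyValuePair<TKey, TValue>>> mapFunc,
                                        Func<TKey, IEnumerable<TValue>, TValue> reduceFunc);
}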
#codingchallenge http://ideone.com/SjlLWd
deserialization http://ideone.com/CoZYi0
#palindromes in a string : http://ideone.com/2WA4tn 

Thursday, July 6, 2017

We know how to build graphs from keywords, which is popular in text analysis. In this post, I discuss an IdeaMap: a graph of ideas at a personal level, one that can be determined from monitoring a person's activity in the form of email or other written communication.
Use Case: Personal communications tend to be client-centric. A person sends out emails with responses that carry his signature thoughts and ideas. On a day-to-day basis, a person in a particular vocation may have very specialized communication and deal with recurrent themes and ideas on a continuous basis. Such ideas can be captured in a pictorial representation with graphs.
Giving the user this representation of her emphasis on a daily basis helps remind her of her impact areas and how she may need to tune her efforts so as not to miss outliers and to channel her energy better. This, in a nutshell, is the purpose of discovering ideas or themes and representing them in graphs on a daily basis as a form of personalization for the user.
Today users can install a variety of machine data analysis toolsets such as Splunk, feed all the documents received by the user as inputs, and have charts drawn to highlight the most frequently used topics. However, the analysis suggested in this write-up goes far beyond that. To illustrate a few differences: first, we are interested in capturing a graph of topics for the day, with edges and weights assigned to show interdependence and similarities; these are not the same as bubble charts in a visualization. Second, we are interested not just in the illustration but also in analysis that goes deeper than frequency counts of text chosen from input formatters. Specifically, we refer to text analysis methods such as keyword detection and topic analysis. For example, we normalize all communications from the user into free text, which we parse and analyze as a bag of words with a weighted matrix of keywords. Furthermore, we review the word embeddings and graph embeddings of the presented text to decipher the salient semantic content. We do not stop there, but also consider further semantic analysis such as analogies, sequence completion and classification. Together, these represent the best in the industry for deep learning over the communications from a user to present the topics of the day.
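As a rough sketch of such a weighted keyword graph, the snippet below counts, for each pair of keywords, the number of messages in which they occur together and uses those counts as edge weights. The whitespace tokenization and the externally supplied keyword set are simplifying assumptions for the illustration.
using System;
using System.Collections.Generic;
using System.Linq;

// A minimal sketch: build a keyword co-occurrence graph from a day's messages.
// Edge weight = number of messages in which the two keywords appear together.
// Tokenization here is a naive split on whitespace and punctuation, just to illustrate the idea.
public static class IdeaGraph
{
    public static Dictionary<(string, string), int> Build(IEnumerable<string> messages, ISet<string> keywords)
    {
        var edges = new Dictionary<(string, string), int>();
        foreach (var message in messages)
        {
            var present = message.ToLowerInvariant()
                                 .Split(new[] { ' ', ',', '.', ';', ':' }, StringSplitOptions.RemoveEmptyEntries)
                                 .Where(keywords.Contains)
                                 .Distinct()
                                 .OrderBy(w => w)
                                 .ToList();
            // every pair of keywords seen in the same message strengthens their edge
            for (int i = 0; i < present.Count; i++)
                for (int j = i + 1; j < present.Count; j++)
                {
                    var edge = (present[i], present[j]);
                    edges[edge] = edges.TryGetValue(edge, out var w) ? w + 1 : 1;
                }
        }
        return edges;
    }
}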
The daily graphs can be a pictorial representation embedded in any of the documents or software used by the user for collaboration, or referred to on a public website dedicated to this purpose through shortened hypertext or QR codes.
Finally, the daily graphs can also be collected over time to be rolled up for monthly, yearly or other suitable periods. Since the graphs are accumulated over time, they present the opportunity to the user to drill down as appropriate.  
In conclusion, topic analysis of personalized communications, with a visualization presented as a summary, is expected to delight the customer by conveying at a glance the impact of the efforts spent over time.
#codingexercise
#check divisibility by 9 for any permutation of a given number: http://ideone.com/tvGBxH 

Wednesday, July 5, 2017

Today we continue to review the NoSQL design. These systems involve key-value stores run on a large number of commodity machines with data partitioned and replicated between machines. We discussed the infrastructure, which consists of physical and virtual nodes. Data is partitioned based on consistent hashing: cachepoints are arranged along a circle depicting the key range, and each caches the objects corresponding to its range. With multiple copies of the data, we need consistency models. The choice of consistency model determines how a client request is dispatched to a replica and how the replicas propagate and apply updates. The topology that helps with this can be classified as a single-master model, a multi-master or no-master model, or a quorum-based 2PC model that provides strict consistency.
The syncing of states is achieved with techniques such as the traditional 2PC. To improve availability, quorum based 2PC was introduced. In this model, the coordinator asks all the replicas to write the data but waits only for the responses from a select number of them. When reading the data, a minimum set of replicas is read and the copy with the latest timestamp is returned. This technique provides strict consistency.
If we don't require strict consistency, we can go with a gossip protocol to propagate the updates via message exchanges. A vector clock is used to help with the messaging. Both the client and the replicas maintain vector clocks, and each local vector clock merges a received timestamp by picking out the maximum value per entry. The client does not gossip its vector clock to other clients; only the replicas gossip with each other.
We now discuss the gossip protocol, which comes in useful to propagate the changes.
In the state transfer model, each replica maintains a state version tree which contains all the conflicting updates. When the client sends its vector clock, the replicas check whether the client state precedes any of their current versions and discard it accordingly. When a replica receives updates from other replicas via gossip, it merges the version trees.
In the operation transfer model, each replica has to first apply all operations corresponding to the cause before those corresponding to the effect. This is necessary to keep the operations in the same sequence on all replicas and is achieved by adding another entry in the vector clock, a V-state, that represents the time of the last updated state. To maintain this causal order, each replica buffers an update operation until it can be applied to the local state. A tuple of two timestamps, one from the client's view and another from the replica's local view, is associated with every submitted operation. This comes in useful for the replica to collect all the pending operations in its queue; once its vector clock is larger than that of the client, it can proceed to execute the operations. The same is true for background operations when the replicas exchange updates along with each other's vector clocks. In this case, they exchange their logs, and each replica checks whether certain operations can be applied because their dependencies have already been applied. Since the operations are sorted by vector clock in causal order, simultaneous operations can also be serialized. The replica performing the first update gets to pick a sequence number higher than all others, and everyone follows this sequence number in their vector clocks. Since operations are in different stages of processing on different replicas, a replica will not discard an operation it has completed until it sees that the vector clocks from all others have preceded it.
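A small sketch of that buffering step: each submitted operation carries the vector clock its submitter had seen, and the replica holds it in a pending queue until its own clock has caught up, so causes are applied before effects. The dictionary-based clock and the Action payload are assumptions made for the illustration.
using System;
using System.Collections.Generic;
using System.Linq;

// A sketch of operation-transfer buffering: an update is held back until the replica's
// own clock has caught up with the clock the submitter had seen.
public class Replica
{
    public Dictionary<string, long> Clock = new Dictionary<string, long>();
    private readonly Queue<(Dictionary<string, long> seen, Action apply)> pending
        = new Queue<(Dictionary<string, long> seen, Action apply)>();

    private bool Dominates(Dictionary<string, long> a, Dictionary<string, long> b)
    {
        // true if clock a has seen at least as much as clock b in every entry
        return b.All(kv => (a.TryGetValue(kv.Key, out var v) ? v : 0) >= kv.Value);
    }

    public void Submit(Dictionary<string, long> seenClock, Action apply)
    {
        pending.Enqueue((seenClock, apply));
        Drain();
    }

    private void Drain()
    {
        bool progress = true;
        while (progress)
        {
            progress = false;
            int n = pending.Count;
            for (int i = 0; i < n; i++)
            {
                var op = pending.Dequeue();
                if (Dominates(Clock, op.seen))
                {
                    op.apply();
                    // reflect the applied operation in the local clock by merging its timestamps
                    foreach (var kv in op.seen)
                        Clock[kv.Key] = Math.Max(Clock.TryGetValue(kv.Key, out var v) ? v : 0, kv.Value);
                    progress = true;
                }
                else
                {
                    pending.Enqueue(op); // dependencies not yet applied; keep buffering
                }
            }
        }
    }
}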
As a specific example, let us look at a map-reduce operation over an input key list. The map and reduce operations are sent to all the nodes whose replicas own the key ranges. When the map operations complete, the reduce operations are executed to aggregate the results. This is different from scatter-gather because the latter requires a single point of consolidation and map-reduce does not; moreover, that single point is usually the node servicing the originating request, whereas in map-reduce it may vary. Scatter-gather is a pattern for multiprogramming or parallel programming where the data is sent along with the split requests. In map-reduce, only the computation is sent to the nodes where the data resides, and they apply it to their local key ranges. All operations, including map-reduce, occur in causal order.
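To make the contrast concrete, here is a toy sketch in which the map and reduce functions are applied against each partition's local keys and only the grouped partial results are combined. The single-process layout stands in for nodes owning key ranges and is an assumption of the example.
using System;
using System.Collections.Generic;
using System.Linq;

// A toy sketch of map-reduce over partitioned key ranges: the functions are applied
// at each partition against its local keys, and only the partial results are combined.
public static class PartitionedMapReduce
{
    public static Dictionary<string, int> Run(
        IEnumerable<Dictionary<string, string>> partitions,          // each node's local key range
        Func<string, string, IEnumerable<KeyValuePair<string, int>>> mapFunc,
        Func<string, IEnumerable<int>, int> reduceFunc)
    {
        // "send the computation to the data": each partition maps over its own keys
        var intermediate = partitions
            .SelectMany(partition => partition.SelectMany(kv => mapFunc(kv.Key, kv.Value)))
            .GroupBy(kv => kv.Key, kv => kv.Value);

        // reduce aggregates the mapped values per key
        return intermediate.ToDictionary(g => g.Key, g => reduceFunc(g.Key, g));
    }
}

// Example usage: word counts across partitions
// var result = PartitionedMapReduce.Run(partitions,
//     (k, v) => v.Split(' ').Select(w => new KeyValuePair<string, int>(w, 1)),
//     (k, counts) => counts.Sum());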
Deletes are somewhat different because when an object is deleted, its timestamp may also be lost. Therefore to handle this case, the associated metadata and timestamp information of the deleted objects is kept around long enough until confirmed by all other replicas.
Storage for the map-reduce may also be in the form of distributed files or even databases. The advantage of using databases in such cases is that they not only provide copy-on-write but also multi-version concurrency control. A bloom filter or an index can be used to determine if a key exists in a set. All operations are logged. All data and logs are flushed to disk to handle recovery for that node.
It may be interesting to note that distributed system concepts such as Lamport timestamps and Byzantine generals are on a level different from the data store and data-related operations at a local node. For example, a distributed query can work with other instances of databases over a network and with OLE DB data sources; in other words, this can be internalized as a database-specific feature. The charm of NoSQL is the ability to replicate, scale out and add or remove commodity servers without impacting ongoing operations. There is more than one way to access data in a distributed manner; the block chain from Bitcoin is also a distributed database. Also, the scale-out capability of NoSQL, together with the requirements for transactions and strict consistency, were the motivating factors behind the NewSQL data architecture.
Courtesy horicky blog and Ricky Ho
#codingexercise http://ideone.com/2pBEvk

Tuesday, July 4, 2017

Today we continue to review the NoSQL design. These systems involve key-value stores run on a large number of commodity machines with data partitioned and replicated between machines. We discussed the infrastructure, which consists of physical and virtual nodes. Data is partitioned based on consistent hashing: cachepoints are arranged along a circle depicting the key range, and each caches the objects corresponding to its range. With multiple copies of the data, we need consistency models. The choice of consistency model determines how a client request is dispatched to a replica and how the replicas propagate and apply updates. The topology that helps with this can be classified as a single-master model, a multi-master or no-master model, or a quorum-based 2PC model that provides strict consistency.
The single-master model is helpful for scaling out reads, such as with replication where the master applies the updates and they are asynchronously propagated to the slaves, so reads can be serviced by the slaves. The updates propagate by a state transfer model that overwrites the state or an operation transfer model that re-executes the operations. Generally the full object is not sent over the network; only the part of it determined by a hash tree is sent as the delta of the changes.
The multi-master model is also called the no-master model because clients can issue updates to any server and each replica will eventually get to the same state. This is very helpful when we want to scale out writes, such as when there are hot spots in the key range circle of consistent hashing.
The syncing of states is achieved with techniques such as the traditional 2PC, which provides strict consistency when there are N replicas for the data. It is done in two phases. During the prepare phase, the coordinator asks every replica whether it is ready to perform the update, and the replicas flush to disk and respond with success. Next, during the commit phase, each replica writes another log entry to confirm the update. This technique suffers from poor availability because it waits on disk I/O at every replica, so a technique named quorum-based 2PC is preferred. In this model, the coordinator asks all the replicas to write the data but waits only for the responses from a select number of them, which improves the odds in favor of availability. When reading the data, a minimum set of replicas is read and the copy with the latest timestamp is returned. For strict consistency, the important condition is that the read set and the write set overlap: the number of replicas participating in reads plus the number participating in writes must exceed the total number of replicas.
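As a sketch of that overlap condition, the toy store below writes to the first W replicas and reads from the last R; because R + W > N is enforced, the read set always intersects the most recent write set. The in-memory layout and timestamp handling are purely illustrative.
using System;
using System.Collections.Generic;
using System.Linq;

// A sketch of quorum reads and writes: with N replicas, writes wait for W acks and
// reads consult R replicas; R + W > N guarantees the read set overlaps the latest write.
public class QuorumStore
{
    private readonly List<Dictionary<string, (long ts, string value)>> replicas;
    private readonly int W, R;

    public QuorumStore(int n, int w, int r)
    {
        if (r + w <= n) throw new ArgumentException("R + W must exceed N for strict consistency");
        replicas = Enumerable.Range(0, n)
                             .Select(_ => new Dictionary<string, (long ts, string value)>())
                             .ToList();
        W = w; R = r;
    }

    public void Put(string key, string value, long timestamp)
    {
        // in a real system only W acknowledgements are awaited; here we simply write W copies
        foreach (var replica in replicas.Take(W))
            replica[key] = (timestamp, value);
    }

    public string Get(string key)
    {
        // read R replicas and return the value carrying the latest timestamp
        var versions = replicas.Skip(replicas.Count - R)
                               .Where(rep => rep.ContainsKey(key))
                               .Select(rep => rep[key]);
        return versions.OrderByDescending(v => v.ts).Select(v => v.value).FirstOrDefault();
    }
}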
If we don't require strict consistency, we can go with a gossip protocol to propagate the updates via message exchanges. A vector clock is used to help with the messaging. It is a timestamp mechanism that helps each replica order its operations based on a notion of time. Each replica keeps its own vector clock and advances it whenever there is an operation: when an internal operation happens, when a message is sent to another replica, or when a message is received from another replica. The timestamp is also included in the message, and each local vector clock merges a received timestamp by picking out the maximum value per entry. A partial ordering among vector clocks helps with causal relationships. The state always flows unidirectionally from the replica to the client, and the vector clock is represented by one entry per replica. The client keeps a vector clock for the last replica it accessed and submits it in the requests it makes; the replicas ensure that this clock is exceeded in their operations. Gossip itself is the model for state and operation transfer. Each replica maintains a vector clock as well as a state version tree that contains all the conflicting updates. For queries, the client sends its vector clock and the replicas send back the subsets of the state tree preceding the client's vector clock, which the client merges. Similarly, the replicas update only if the client's vector clock is more advanced than their own. Replicas merge their sets using the gossip protocol by maintaining a queue of update requests and merging the clocks.
Courtesy:nosql patterns from horicky
#codingexercise
# robot in a circle problem and solution:  http://ideone.com/vHFOJ9
# get similarities based on permutations: http://ideone.com/nunvC1

Monday, July 3, 2017

Today we continue to review the NoSQL design. These systems involve key-value stores run on a large number of commodity machines with data partitioned and replicated between machines. We discussed the infrastructure, which consists of physical and virtual nodes. Data is partitioned based on consistent hashing: cachepoints are arranged along a circle depicting the key range, and each caches the objects corresponding to its range. With multiple copies of the data, we need consistency models. These include strict consistency, read-your-write consistency, session consistency, monotonic read consistency and eventual consistency. The choice of consistency model determines how a client request is dispatched to a replica and how the replicas propagate and apply updates. The topology that helps with this can be classified as a single-master model, a multi-master or no-master model, or a quorum-based 2PC model that provides strict consistency. The single-master model is one that has each data partition managed by a single master and multiple slaves. All update requests go to the master, where the update is applied and propagated to the slaves asynchronously. This scales out reads to the slaves unless the most recent updates are required, in which case they go to the master. The master is still a virtual node; when a physical node crashes, some of the masters may be lost, in which case the most up-to-date slave may be chosen as the new master. This model works very well for a high read/write ratio, which is the case for most data replication. Replication happens by way of state or operation transfer, which is similar to the well known log shipping and mirroring options: state gets overwritten on the destination, whereas operations are re-executed on the destination.
The multi-master model comes in useful for high-write scenarios or when a key range has hot spots. It allows the updates to be spread more evenly because an update can happen at any replica, which is why it is sometimes called a no-master model. With updates happening at any replica, state synchronization becomes important. One such method is quorum-based 2PC, which uses the traditional 2PC protocol to bring all replicas to the same state at every update. There is a prepare phase when all the replicas flush to disk and a commit phase when each replica writes a log entry to commit the update.
Courtesy:nosql patterns from horicky
#codingexercise 

1. Check divisibility by 8 for any permutation of a number 

    // Check divisibility by 8 for any permutation of the digits of number.
    // A number is divisible by 8 exactly when its last three digits are, so it is enough
    // to find a multiple of 8, padded to min(3, length) digits, whose digit counts fit
    // within the digit counts of the input. (Uses System, System.Collections.Generic, System.Linq.)
    static string CheckDivisibilityBy8(string number)
    {
        // count the digits available in the input
        var d = new Dictionary<char, int>();
        for (char c = '0'; c <= '9'; c++) d.Add(c, 0);
        for (int i = 0; i < number.Length; i++)
            d[number[i]]++;

        // only the trailing min(3, length) digits decide divisibility by 8
        var reqLen = number.Length < 3 ? number.Length : 3;

        // candidate endings: every multiple of 8 below 10^reqLen, padded with leading zeros
        var divs = Enumerable.Range(0, (int)Math.Pow(10, reqLen))
                             .Where(x => x % 8 == 0)
                             .Select(x => x.ToString().PadLeft(reqLen, '0'))
                             .ToList();

        foreach (var candidate in divs)
        {
            // the candidate is usable if the input has enough of each of its digits
            if (candidate.All(x => d[x] >= candidate.Count(t => t == x)))
                return "YES";
        }
        return "NO";
    }
2. If similarity is defined as the length of the common prefix of two strings, find the sum of the similarities of a string with each of its suffixes.
    static int GetSimilarity(string A, string B) 
    { 
        int count = 0; 
        for (int i = 0; i < Math.Min(A.Length, B.Length); i++) 
            if (A[i] != B[i]) 
            { 
                // Console.WriteLine(A.Substring(0, i)); 
                return count; 
            } 
            else 
            { 
                count++; 
            } 
        // if (count != 0) Console.WriteLine(A); 
        return count; 
    } 
    static int GetSimilarities(string A) 
    { 
        // sum up similarity for each suffix of A beginning with A itself 
        int sum = 0; 
        for (int k = 0; k < A.Length; k++) 
        { 
            var suffix = A.Substring(k, A.Length-k); 
            var count = GetSimilarity(A, suffix); 
            if (count != 0) 
            { 
                Console.WriteLine("suffix={0},count={1}", suffix, count); 
                sum += count; 
            } 
            else  
            { 
                Console.WriteLine("suffix={0}", suffix); 
            } 
        } 
        return sum; 
    }