Sunday, July 16, 2017

We were discussing document libraries, including OneDrive and OneDrive for Business. OneDrive for Business is different from the consumer OneDrive. The former is an integral part of Office 365 or SharePoint Server and provides a place in the cloud where users can store, share, and sync their work files. It is managed by the organization with the help of SharePoint services and is virtually isolated from the users' personal storage, such as their personal OneDrive accounts. That said, files can easily be moved from one to the other if the user has set up access with the corresponding accounts.
SharePoint Online is, however, different from OneDrive for Business. While both are offered through Office 365 business plans, OneDrive for Business evolved from SharePoint Workspace, and before that Groove, whereas SharePoint Online is a cloud-based version of SharePoint Server, which dates back to the Office XP era. Both are powered by SharePoint. The former is presented as a personal storage location while the latter is a team site. All files in the former default to private, while those in the latter inherit the permissions of the folder they are uploaded to. The interfaces also differ: the former is exclusive to the user, while the latter carries a theme shared by the organization.
File synchronization services allow files to be synced between the local desktop and the cloud. The Microsoft Sync Framework is well known in this space. It is a comprehensive synchronization platform that can synchronize any type of data, using any protocol, over any network. It uses a powerful metadata model that enables peer-to-peer synchronization of file data with support for arbitrary topologies. One of the main advantages for developers is that they can use it to build file synchronization and roaming scenarios without having to interact with the file system directly.
Some of the features of this system include incremental synchronization of changes between two file system locations specified via a local or UNC path, and synchronization of file contents, file and folder names, timestamps, and attributes. It supports optional filtering of files based on file names and extensions, sub-directories, or file attributes, and can optionally use file hashes to detect changes to file contents when timestamps are not reliable. It reliably detects conflicting changes to the same file and resolves conflicts automatically with a no-data-loss policy. It allows a limited user undo by optionally moving deleted and overwritten files to the recycle bin, and offers a preview mode that reports the incremental synchronization operation without committing changes to the file system. It lets users start synchronization with equal or partially equal file hierarchies on more than one replica, and it supports graceful cancellation of an ongoing synchronization so that the remaining changes can be synchronized later without re-synchronizing changes that were already applied.
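As a rough sketch of how these features come together in code, a two-way folder synchronization with the Sync Framework's FileSyncProvider and SyncOrchestrator might look like the following; the replica paths and the *.tmp exclusion are hypothetical placeholders, and this is an illustration rather than production code:
using Microsoft.Synchronization;
using Microsoft.Synchronization.Files;

// Two-way incremental sync between two folders; deleted and overwritten
// files are moved to the recycle bin, and *.tmp files are filtered out.
static void SyncFolders(string replicaA, string replicaB)
{
    var filter = new FileSyncScopeFilter();
    filter.FileNameExcludes.Add("*.tmp"); // optional filename-based filtering

    var options = FileSyncOptions.RecycleDeletedFiles
                | FileSyncOptions.RecyclePreviousFileOnUpdates;

    using (var providerA = new FileSyncProvider(replicaA, filter, options))
    using (var providerB = new FileSyncProvider(replicaB, filter, options))
    {
        var agent = new SyncOrchestrator
        {
            LocalProvider = providerA,
            RemoteProvider = providerB,
            Direction = SyncDirectionOrder.UploadAndDownload // changes flow both ways
        };
        agent.Synchronize();
    }
}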

Saturday, July 15, 2017

We were discussing document libraries. SharePoint is an implementation of Content Databases. OneDrive is also a document library; in fact, it is one of the earliest file-hosting services, operated by Microsoft. Every user gets a quota, which can be increased with subscriptions. The service was initially named SkyDrive and was made available in many countries. Later, photos and videos were allowed on SkyDrive via Windows Live Photos, which let users access their photos and videos stored on SkyDrive. It was thereafter expanded to include Office Live Workspace. Files and folders became accessible to Windows Live users and groups, which made sharing and file management easier. Subsequently, SkyDrive client applications were released through the App Store and the Windows Phone Store. APIs are also available for OneDrive.
OneDrive for Business is different from the consumer OneDrive. The former is an integral part of Office 365 or SharePoint Server and provides a place in the cloud where users can store, share, and sync their work files. It is managed by the organization with the help of SharePoint services and is virtually isolated from the users' personal storage, such as their personal OneDrive accounts. That said, files can easily be moved from one to the other if the user has set up access with the corresponding accounts.
SharePoint Online is, however, different from OneDrive for Business. While both are offered through Office 365 business plans, OneDrive for Business evolved from SharePoint Workspace, and before that Groove, whereas SharePoint Online is a cloud-based version of SharePoint Server, which dates back to the Office XP era. Both are powered by SharePoint. The former is presented as a personal storage location while the latter is a team site. All files in the former default to private, while those in the latter inherit the permissions of the folder they are uploaded to. The interfaces also differ: the former is exclusive to the user, while the latter carries a theme shared by the organization.

Friday, July 14, 2017

We discussed the similarity measure between a skills vector from a resume and a role to be matched. We could also consider using an ontology of skills for measuring similarity. For example, we can list all the skills a software engineer must have and connect the skills that have some degree of similarity, using domain knowledge and human interpretation, or a weighted skills collocation matrix as resolved from a variety of resumes in a training set. With the help of this skills graph, we can now determine similarity as a measure of distance between vertices, as sketched below. This enables the translation of skills into semantics-based similarity.
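A minimal sketch of distance-based similarity over such a graph follows; the adjacency list here is hypothetical sample data standing in for the real skills graph, and 1/(1+d) is just one plausible way to turn a hop distance into a similarity:
using System.Collections.Generic;

class SkillsGraph
{
    // skill -> related skills, built from domain knowledge or a
    // weighted collocation matrix (hypothetical sample data).
    static Dictionary<string, List<string>> adj = new Dictionary<string, List<string>>
    {
        { "C#",    new List<string> { ".NET", "SQL" } },
        { ".NET",  new List<string> { "C#", "Azure" } },
        { "SQL",   new List<string> { "C#" } },
        { "Azure", new List<string> { ".NET" } }
    };

    // Breadth-first search gives the hop distance between two skill vertices.
    static int Distance(string from, string to)
    {
        var visited = new HashSet<string> { from };
        var queue = new Queue<(string skill, int depth)>();
        queue.Enqueue((from, 0));
        while (queue.Count > 0)
        {
            var (skill, depth) = queue.Dequeue();
            if (skill == to) return depth;
            if (!adj.TryGetValue(skill, out var neighbors)) continue;
            foreach (var next in neighbors)
                if (visited.Add(next))
                    queue.Enqueue((next, depth + 1));
        }
        return int.MaxValue; // unreachable
    }

    // Map distance to a similarity in (0, 1]; closer vertices are more similar.
    static double Similarity(string a, string b)
    {
        int d = Distance(a, b);
        return d == int.MaxValue ? 0.0 : 1.0 / (1 + d);
    }
}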
The collocation-based weights matrix we have come up with so far can also be represented as a graph, which we can use with PageRank to determine the most important features.
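A minimal power-iteration sketch of PageRank over such a weights matrix, assuming w has already been row-normalized so that each row sums to 1:
// Power iteration for PageRank over a row-stochastic collocation
// matrix w, where w[i][j] is the transition weight from skill i to skill j.
static double[] PageRank(double[][] w, double damping = 0.85, int iterations = 50)
{
    int n = w.Length;
    var rank = new double[n];
    for (int i = 0; i < n; i++) rank[i] = 1.0 / n; // uniform starting distribution

    for (int iter = 0; iter < iterations; iter++)
    {
        var next = new double[n];
        for (int j = 0; j < n; j++)
        {
            double sum = 0;
            for (int i = 0; i < n; i++)
                sum += rank[i] * w[i][j]; // rank mass flowing from i to j
            next[j] = (1 - damping) / n + damping * sum;
        }
        rank = next;
    }
    return rank; // higher rank = more important skill
}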
This concludes the text-analysis-as-a-service discussion, and we now look into the store discussions for text content. In this regard, we briefly mentioned content libraries such as SharePoint, but we are going to discuss their cloud-based versions. A systems design for cloud-based text analysis as a service can make use of such document libraries as an alternative to using S3. We discussed cloud-native technologies; let us now take a look at cloud versions of document libraries.
SharePoint is an implementation of Content Databases. OneDrive is also a document library; in fact, it is one of the earliest file-hosting services, operated by Microsoft. Every user gets a quota, which can be increased with subscriptions. The service was initially named SkyDrive and was made available in many countries. Later, photos and videos were allowed on SkyDrive via Windows Live Photos, which let users access their photos and videos stored on SkyDrive. It was thereafter expanded to include Office Live Workspace. Files and folders became accessible to Windows Live users and groups, which made sharing and file management easier. Subsequently, SkyDrive client applications were released through the App Store and the Windows Phone Store. APIs are also available for OneDrive.
#codingexercise
We discussed methods for finding the length of the longest subsequence of one string that occurs as a substring of another.
Let us compare the performance:
1) The iterative brute-force approach enumerates the O(N^2) substrings of one string and checks each against the other with a linear scan, so it is O(N^2 * M) overall; it works acceptably when the substring source is small and the subsequence source is large.
2) The dynamic programming approach runs in O(N * M) and is more efficient because it reuses overlapping subproblems, building the answer bottom-up over prefixes of the two strings up to the current indices. A sketch of the DP follows.
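Here is a sketch of that bottom-up DP, where dp[i, j] holds the length of the longest subsequence of X[0..i-1] that is a substring of Y ending exactly at Y[j-1]:
static int GetMaxDP(string X, string Y)
{
    int n = X.Length, m = Y.Length;
    var dp = new int[n + 1, m + 1]; // dp[0, *] and dp[*, 0] stay 0
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            dp[i, j] = (X[i - 1] == Y[j - 1])
                     ? dp[i - 1, j - 1] + 1  // extend the match ending at Y[j-1]
                     : dp[i - 1, j];         // skip X[i-1] in the subsequence
    int max = 0;
    for (int j = 1; j <= m; j++)
        max = Math.Max(max, dp[n, j]); // the best substring can end anywhere in Y
    return max;
}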

Thursday, July 13, 2017

Domain Specific Text Classification:
The writeup here introduced domain-specific text summarization which, unlike the general approach to text summarization, utilizes the desired outcome as a vector in itself. Specifically, we took the example of matching a resume to a given skill set required for a job. We said that the role can be described in the form of a vector of features based on skill sets, and that for a given candidate we can determine the match score between the candidate's skill sets and those of the role.
We could also extend this reasoning to cluster the resumes of more than one candidate as potential matches for the role. Since we compute the similarity score between vectors, we can treat all the resumes as vectors in a given skills matrix. Then we can use a range-based separation to draw out resumes whose similarity scores fall into k ranges between 0 and 1. This helps us determine the set of resumes that are the closest match.
We could also extend this technique to many resumes and many positions. For example, we can match candidates to roles based on k-means clustering, with the roles as the centroids of the clusters we want to form. Each resume is matched against all the centroids to determine the cluster closest to it, as in the sketch below. All resumes are then separated into clusters surrounding the available roles.
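A sketch of the assignment step, assuming resumes and role centroids have already been expressed as vectors over the same skills features:
// Cosine similarity between two skill vectors.
static double Cosine(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12); // guard against zero vectors
}

// Assign a resume to the role (cluster centroid) it is most similar to.
static int NearestRole(double[] resume, double[][] roleCentroids)
{
    int best = 0;
    double bestScore = double.MinValue;
    for (int r = 0; r < roleCentroids.Length; r++)
    {
        double score = Cosine(resume, roleCentroids[r]);
        if (score > bestScore) { bestScore = score; best = r; }
    }
    return best;
}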
By representing the skills in the form of a graph based on similarity, we can even run PageRank on those skills. Typically we need training data and test data for this; the training data helps build the skills weight matrix.
Conclusion: With the help of a skill-set match and a similarity score, it is possible to perform matches between jobs and candidates as a narrow-domain text classification. The similarity measure here is the cosine similarity between the vectors.
#exercise
Yesterday we were discussing the length of the longest substring of one string that is a subsequence of another. We could do this with dynamic programming in a bottom-up approach, where the required length is one more than that computed for the previous prefixes when the characters match, and otherwise the same as what was computed for the shorter prefix of the subsequence string.

Wednesday, July 12, 2017

Domain Specific Text Summarization:
The writeup here introduced text summarization, a general approach to reduce content so we can get the gist of a text with fewer sentences to read; it is available here. This would have been great if it translated to narrow domains, where the jargon and the format also matter and the text is not necessarily arranged in the form of a discourse. For example, software engineering recruiters find it hard to read through resumes because a resume does not necessarily read as prose. Further, even the savviest technical recruiter may find that the candidate meant one thing when the resume says another. This is especially true when technical terms or buzzwords are missing from the resume. On the other hand, recruiters want to sort candidates into roles. Given a resume, the goal is to label it with one of the labels the recruiter has come up with for the jobs available to her. If that were possible, it would avoid the task of reading and translating a resume to see whether it fits a role. Such translations are not obvious even in a conversation with the candidate.
How do we propose to solve this? When we have predetermined labels for the open and available positions, we can automate the steps a recruiter takes to decide whether a label fits a candidate. The steps are quite comprehensive and rely on a semantic network to correlate the resume text with an available vocabulary, producing a score for the match between the candidate's resume and the role requirements for the label. If the score exceeds a threshold, we determine that the candidate can be assigned the label and given the green signal to take the screening test; a small sketch of this step follows. Both general text summarization and narrow-domain resume matching rely on treating the document as a bag of words. The key differences, however, are the features used with the documents. For example, we use features that include the skill sets in terms of technologies, languages, and domain-specific terms. By translating the words in the resume into vectors of features, we are able to better match a resume to a role, where the role is also described in terms of the features required to do the job. This tremendously improves the reliability of a match and works behind the scenes.
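A small sketch of the thresholding step, reusing the Cosine helper sketched earlier; the 0.75 threshold and the role dictionary are arbitrary placeholders:
// Assign the best-scoring label if it clears the threshold; otherwise leave unlabeled.
static string LabelResume(double[] resume, Dictionary<string, double[]> roles, double threshold = 0.75)
{
    string best = null;
    double bestScore = threshold; // anything below the threshold stays unlabeled
    foreach (var role in roles)
    {
        double score = Cosine(resume, role.Value); // cosine match score, as above
        if (score >= bestScore) { bestScore = score; best = role.Key; }
    }
    return best; // null => no label fits; the candidate is not forwarded to screening
}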

Conclusion: With the help of text summarization but with a predetermined word-vector space for narrow domains, it is possible to avoid a lot of work for recruiters while relying on latent knowledge about what is mentioned in the resume.
#word vector match http://ideone.com/2NUwvu
#codingexercise
Find the maximum length of the subsequence of string X which is a substring of another string Y
int GetMax(string X, string Y)
{
    int max = 0;
    // Try every substring of Y and keep the longest one that is a subsequence of X.
    for (int i = 0; i < Y.Length; i++)
    {
        for (int j = i; j < Y.Length; j++)
        {
            var sub = Y.Substring(i, j - i + 1);
            if (sub.Length > max && IsSubsequence(sub, X))
                max = sub.Length;
        }
    }
    return max;
}

// Checks whether candidate occurs in X as a (not necessarily contiguous) subsequence.
bool IsSubsequence(string candidate, string X)
{
    int k = 0;
    foreach (var c in X)
        if (k < candidate.Length && c == candidate[k])
            k++;
    return k == candidate.Length;
}
http://ideone.com/RNERDx

Tuesday, July 11, 2017

We reviewed the writeup on storage requirements for the summarization service, indicating that it can be hosted on top of a document library such as SharePoint, OneDrive, Google Drive, or S3, as long as we can keep track of the summaries generated and the documents from which each summary was generated. Today we discuss the compute requirements for the summarization service. The tasks for summarization involve using a latent word-vector space and a Markov random walk from word vector to word vector. This translates to roughly two two-dimensional matrices of floating-point values with about three hundred features. The selection of keywords and the ranked representation of sentences having those keywords take only linear-order storage. Therefore, we can make a back-of-the-envelope calculation that allocates roughly half a MB for each of these two-dimensional matrices. If we introduce a microbenchmark of processing ten thousand documents per second, we will require a total of about 10 GB for the service in addition to all operational memory requirements.
Therefore, we consider the service to operate on a scale-out basis on any cloud compute capability or a Marathon stack. With the storage being shared and isolation maintained per document per user, we have no restrictions in keeping our service elastic to growing needs. At the same time, we hold the web service to the same production-level standards as any public-domain web service. These include troubleshooting, logging, monitoring, health checks, and notifications for performance degradations and other such conditions. Most platform-as-a-service deployment models automatically come with these benefits or can be extended easily for our needs. Similarly, the public cloud also has features for monitoring and billing that are very useful for planning and scaling out the service as appropriate.
The service may also be hosted in one region or another. Web application hosting best practices recommend creating one or more availability zones, each with the traditional model of hosting a web service, to improve the reliability and availability of the service to users. They further recommend the use of available cloud features such as routing and caching to improve performance and security for the web application. For example, a global network of edge locations can be used to deliver dynamic, static, and streaming content, with requests routed to the nearest edge location. Network traffic can also be filtered and security improved not just at the edge routers but also at the host level. Security groups can be used to manage access to ports on the host, and identity and access management services can be utilized to authenticate and authorize users. The data access layer that interacts with the database or document library can be secured with internal service accounts and privileges. Data entering the data tier can be secured at rest and in transit. Routine backups and maintenance activities can be planned for safeguarding and protecting the data, and scaling and failover can be provisioned automatically for these services.
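To spell out that back-of-the-envelope arithmetic under assumed parameters: with 4-byte floats, three hundred features, and on the order of four hundred words per document, each matrix is about 400 x 300 x 4 bytes, or roughly 0.5 MB. Two such matrices make about 1 MB per in-flight document, and ten thousand documents in flight at a time gives the quoted 10 GB.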
#word vector match http://ideone.com/2NUwvu

Monday, July 10, 2017

Content Databases

In the writeup, we describe the storage requirements of the text summarization service. We said that this is equivalent to using a cloud-based NoSQL document store, because our summaries are not files but JSON documents, which we generate and maintain for all users of our service and intend to use for analysis. And we referred to the original documents from which the summaries were created as being made available via document libraries such as SharePoint, OneDrive, or Google Drive. When users upload a document to the summarization service for processing, it could be stored the same way as in, say, SharePoint, which is backed by Microsoft SQL Server. SharePoint uses an HTTP routing mechanism and integrated Windows authentication.
SharePoint services maintain multiple databases: system databases, which include configuration, administration, and content-related data; search service databases; user-profile databases; and many other miscellaneous data stores. The SharePoint system databases include the configuration database, which contains data about all SharePoint databases, web services, sites, applications, solutions, packages, and templates, application and farm settings specific to SharePoint Server, default quotas, and blocked file types. Content databases are separate from configuration, and one specific content database is earmarked for the central administration web site. The content databases otherwise store all the site content, including documents, libraries, web part properties, audit logs, applications, user names, rights, and Project Server data. Usually the size of a content database is kept under 200 GB, but sizes up to 1 TB are also feasible.
The search service databases include the search service application configuration and the access control list for the crawl. The crawl database stores the state of the crawled data and the crawl history; crawl databases are typically scaled out for every twenty million items crawled. The link database stores the information that is extracted by the content-processing component along with click-through information. It might be relevant to note that the crawl database is read-heavy whereas the link database is write-heavy.
The user profile service databases can scale up and out because they store and manage users and their social information. These databases also include social tagging information, that is, the notes created by users along with their respective URLs; their size is determined by the number of ratings created and used. The synchronization database is also a user profile database, used when profile data is synchronized with directory services such as Active Directory; its size is determined by the number of users and groups. Miscellaneous services include those that store app licenses and permissions, SharePoint and Access apps, external content types and related objects, managed metadata and syndicated content types, temporary objects and persisted user comments and settings, account names and passwords, pending and completed translations, data refresh schedules, state information from InfoPath forms, web parts and charts, features and settings information for hosted customers, usage and health data collection, and document conversions and updates. The tasks and the databases associated with content management indicate the planning required for the summarization service.
It might therefore help if the content management service can be used as a layer below the summarization service so that the storage is unified. At cloud scale, we plan for such stores in cloud databases or use a Big Table and file-storage-based solution.
Courtesy: msdn
#codingexercise
Check if the nth bit from last is set in the binary representation of a given number
bool IsSet(int number, int pos)
{
    // Binary representation of the number, most significant bit first.
    var result = Convert.ToString(number, 2);
    if (pos < 1 || pos > result.Length)
        return false;
    // The pos-th bit from the last is the pos-th character from the end.
    return result[result.Length - pos] == '1';
}
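The same check can be done without constructing the string, since the pos-th bit from the last corresponds to bit position pos - 1:
// Equivalent bitwise check: test bit (pos - 1) directly.
bool IsSetBitwise(int number, int pos)
{
    if (pos < 1 || pos > 32)
        return false;
    return (number & (1 << (pos - 1))) != 0;
}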