Friday, March 24, 2017

Yesterday we were talking about enterprise search. Enterprise search, such as the one from Google, does site indexing and is very helpful for the pages an organization serves on the web. The difference between offerings lies in the index generated and in the topology of the search and indexing pipeline. Traditionally this was maintained on premises with dedicated hardware. That is now being sunset in favor of cloud services. Hosting the indexing service in the cloud brings a few remarkable features that are often missing from on-premises deployments. First, there is a formal separation between indexing and analytics. Second, the service creates an index pipeline instead of a store, so that different sources can contribute to the pipeline for indexing. Third, a staging XML or JSON cache is used, so that configuration changes are made against the staging cache rather than per data source. This makes rebuilding the index unnecessary when data sources are added or their configuration changes. Fourth, by hosting the service in the cloud or on a cluster from the cloud, the service can be elastic and expand to as many resources as necessary, with automatic rollover, load balancing and differentiation in terms of components. Finally, cloud-based indexing uses Big Data rather than a simple index. Not just Google Enterprise Search services but also HP TRIM and KeyView are hosted in the cloud, and most of them work with Big Data for their index. Connectors can be implemented for a range of popular search engines. Companies pay for the SLAs from the connector.
Connectors are available for a growing range of repositories: file systems, including CIFS and NFS shares, Box.com, S3, etc.; relational databases, including tables and their snapshots; content management systems such as Documentum and SharePoint; collaboration tools such as Confluence, Jira, ServiceNow and TeamForge; CRM software such as Salesforce and RightNow; websites crawled with the Aspider web crawler, the Heritrix web crawler and a staging repository; and a variety of feeds such as FTP, RSS, etc. The Service Level Agreements from the connector cover security with sign-on such as LDAP, NTLM and CAS, performance goals such as not degrading the target repository, and content enrichment such as cleansing, parsing, entity extraction and categorization.
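To make the pipeline-plus-staging-cache idea concrete, here is a minimal sketch in Java. Everything in it is an assumption for illustration: the `Connector` interface, the `StagedIndexPipeline` class, and the in-memory map standing in for the XML or JSON staging cache are hypothetical names, not the API of any product mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class StagedIndexPipeline {
    // Hypothetical connector: any source that can emit documents as id -> text.
    interface Connector {
        Map<String, String> fetch();
    }

    // Stand-in for the XML/JSON staging cache: every source lands here first.
    private final Map<String, String> staging = new LinkedHashMap<>();
    // A toy inverted index: term -> ids of the documents containing it.
    private final Map<String, Set<String>> index = new TreeMap<>();

    // Sources only ever write to the staging cache.
    void ingest(Connector source) {
        staging.putAll(source.fetch());
    }

    // The indexer only ever reads from the staging cache, never from a source.
    void indexStaged() {
        for (Map.Entry<String, String> doc : staging.entrySet()) {
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
    }

    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Set.of());
    }
}
```

Because the indexer depends only on the staging cache, adding another source in this sketch is just one more `ingest` call; the indexing code does not change.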
Analytics is done with a proprietary query language, which usually varies depending on the platform. Search Technologies has mentioned a variety of techniques related to search.
I believe that desktop search can be included with System Center automations. This lets the data to be indexed follow the same route as logs and metrics, with the convenience of central management.
Personal document indexing as a productivity tool then merely requires running queries against a cloud service.

#codingexercise
Break a paragraph into lines that are as balanced as possible, respecting typographical considerations such as avoiding excessive white space, widows and orphans.
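// optimals[k] is the minimum total penalty for the first k words; n is the number of words.
void BreakLines(int[] optimals, int n)
{
    optimals[0] = 0; // no words, no penalty
    for (int k = 1; k <= n; k++)
    {
        optimals[k] = Integer.MAX_VALUE;
        // try every position j at which the last line could start
        // (Penalty must stay well below Integer.MAX_VALUE to avoid overflow)
        for (int j = 0; j <= k - 1; j++)
            optimals[k] = Math.min(optimals[k], optimals[j] + Penalty(j, k));
    }
}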
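where Penalty(i, j) is the penalty of starting a line at position i and ending it at position j.
The above method could also be written recursively; without memoization that version recomputes the same subproblems and takes exponential time.
BreakLines gives the minimum penalty, which ends up in optimals[n].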
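As a sketch of the recursive form with memoization added, the same recurrence can be written top-down in Java. The class name `RecursiveBreaks` and the use of `IntBinaryOperator` to pass in the penalty are assumptions made for this example; the penalty is assumed to be non-negative and small enough that the sums do not overflow an `int`.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

class RecursiveBreaks {
    // Minimum total penalty for the first n positions, computed top-down.
    static int minPenalty(int n, IntBinaryOperator penalty) {
        int[] memo = new int[n + 1];
        Arrays.fill(memo, -1); // -1 marks "not computed yet"
        return best(n, penalty, memo);
    }

    private static int best(int k, IntBinaryOperator penalty, int[] memo) {
        if (k == 0) return 0;              // nothing to lay out
        if (memo[k] >= 0) return memo[k];  // reuse an already solved subproblem
        int result = Integer.MAX_VALUE;
        // the last line runs from j to k; everything before j is a smaller subproblem
        for (int j = 0; j < k; j++) {
            result = Math.min(result, best(j, penalty, memo) + penalty.applyAsInt(j, k));
        }
        memo[k] = result;
        return result;
    }
}
```

With the memo table each of the n subproblems is solved once, so the work is the same O(n^2) penalty evaluations as the iterative loop.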
To lay out the text, we record the break position whenever we find a lower cost:
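// Best[k] records where the last line starts in the best layout of the first k words.
void LayoutText(int[] optimals, int[] Best, int n)
{
    optimals[0] = 0; // no words, no penalty
    for (int k = 1; k <= n; k++)
    {
        optimals[k] = Integer.MAX_VALUE;
        for (int j = 0; j <= k - 1; j++)
        {
            var temp = optimals[j] + Penalty(j, k);
            if (temp < optimals[k])
            {
                optimals[k] = temp;
                Best[k] = j; // the last line covers positions j..k
            }
        }
    }
}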
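For a runnable illustration, here is a self-contained Java sketch that fills in the pieces the exercise leaves open. The penalty used (squared trailing space, with a free last line and an "infinite" cost for lines that do not fit) is an assumption chosen for the example, as are the `LineBreaker` class and its method names. It also assumes every single word fits within the width.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineBreaker {
    // Large sentinel for lines that do not fit; small enough that INF + INF cannot overflow.
    static final long INF = Long.MAX_VALUE / 4;

    // Assumed penalty for one line holding words[j..k-1] separated by single spaces.
    static long penalty(String[] words, int j, int k, int width) {
        int len = k - j - 1; // spaces between the words
        for (int i = j; i < k; i++) len += words[i].length();
        if (len > width) return INF;     // line is too long
        if (k == words.length) return 0; // the last line is not penalized
        long slack = width - len;
        return slack * slack;            // squared trailing white space
    }

    static List<String> layout(String[] words, int width) {
        int n = words.length;
        long[] optimals = new long[n + 1];
        int[] best = new int[n + 1];
        optimals[0] = 0;
        for (int k = 1; k <= n; k++) {
            optimals[k] = INF;
            for (int j = 0; j < k; j++) {
                long temp = optimals[j] + penalty(words, j, k, width);
                if (temp < optimals[k]) {
                    optimals[k] = temp;
                    best[k] = j;
                }
            }
        }
        // Walk the recorded break positions backwards to rebuild the lines in order.
        List<String> lines = new ArrayList<>();
        for (int k = n; k > 0; k = best[k]) {
            lines.add(0, String.join(" ", Arrays.copyOfRange(words, best[k], k)));
        }
        return lines;
    }
}
```

Calling `LineBreaker.layout(new String[] {"aaa", "bb", "cc", "ddddd"}, 6)` yields the lines "aaa", "bb cc", "ddddd". A greedy first-fit would instead put "aaa bb" on the first line and leave "cc" alone on the second, which costs more under this penalty.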


Dynamic programming is about expressing the unknown in terms of smaller, recursive subproblems. For example, to count the number of permutations of N elements with exactly k inversions, we can say:

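int GetInv(int N, int k)
{
    if (N == 0) return 0;
    if (k == 0) return 1; // sorted order is the only permutation with no inversions
    int count = 0;
    // placing the largest element i positions from the end adds i inversions (0 <= i <= N-1),
    // so the remaining N-1 elements must contribute k-i inversions
    for (int i = 0; i <= Math.min(k, N - 1); i++)
    {
        count += GetInv(N - 1, k - i);
    }
    return count;
}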
This can also use memoization.
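A memoized version might look like the following Java sketch. It uses the recurrence count(N, k) = sum of count(N-1, k-i) for i from 0 to min(k, N-1), which comes from choosing where the largest element goes. The `Inversions` class name and the `long` return type are choices made for this example.

```java
import java.util.Arrays;

class Inversions {
    // Number of permutations of n elements with exactly k inversions.
    static long count(int n, int k) {
        long[][] memo = new long[n + 1][k + 1];
        for (long[] row : memo) Arrays.fill(row, -1); // -1 marks "not computed yet"
        return count(n, k, memo);
    }

    private static long count(int n, int k, long[][] memo) {
        if (n == 0) return 0; // same base cases as the exercise
        if (k == 0) return 1; // only the sorted order has no inversions
        if (memo[n][k] >= 0) return memo[n][k];
        long total = 0;
        // the largest element placed i positions from the end contributes i inversions
        for (int i = 0; i <= Math.min(k, n - 1); i++) {
            total += count(n - 1, k - i, memo);
        }
        memo[n][k] = total;
        return total;
    }
}
```

For instance, `Inversions.count(4, 3)` returns 6, and `Inversions.count(4, 6)` returns 1 because only the fully reversed order of four elements has six inversions.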
