Wednesday, February 27, 2013

Text Search and keyword lookup

Keyword lookup is a form of search that includes cases where the topic words for a given text may not appear in the text itself. It is usually done with the help of a semantic dictionary or thesaurus. Unfortunately, such a thesaurus is specific to the domain for which the text is written. As an example, consider the keywords used on the internet. Search engine optimization professionals use keyword research to find the actual search terms people enter into search engines. They look up the relevancy of keywords to a particular site. They find the rankings of those keywords across search engines for competing sites. They buy sample campaigns for the keywords from search engines and predict the click-through rate. Although search engine optimization is not the same as automated keyword discovery, it serves to indicate that the relevancy and choice of keywords is a complex task and concordance alone is not enough. Additional metadata is required, in the form of distance vectors, connectivity between words, or their hierarchy in WordNet. Building a better dictionary or thesaurus for the domain is one consideration. Selecting the metrics to evaluate the keywords is another. Using statistics for the keywords encountered so far is a third consideration.
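To make this concrete, here is a minimal sketch of a dictionary-backed lookup using WordNet through the nltk package. It assumes nltk is installed and the wordnet corpus has been downloaded (nltk.download('wordnet')); the function name related_terms is just illustrative. It surfaces candidate topic words that may never appear in the provided text:

from nltk.corpus import wordnet as wn

def related_terms(word, max_terms=10):
    """Collect synonyms and hypernyms of a word as candidate topic
    keywords, even when those words never occur in the source text."""
    terms = set()
    for synset in wn.synsets(word):
        # Synonyms: other lemma names in the same synset.
        terms.update(lemma.name() for lemma in synset.lemmas())
        # Hypernyms: more general concepts one level up the hierarchy.
        for hyper in synset.hypernyms():
            terms.update(lemma.name() for lemma in hyper.lemmas())
    terms.discard(word)
    return sorted(terms)[:max_terms]

print(related_terms("search"))  # candidate topic words drawn from the thesaurus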

Tuesday, February 26, 2013

FullText revisited

Let's take a look at full text search again. Full text search is about indexing and searching. It is executed on a document or a full-text database. It is differentiated from searches based on metadata, or on parts of the original texts represented in the databases, because it tries to match all of the words in all the documents against the pattern given by the user. It builds a concordance of all the words ever encountered and then executes the search against this catalog. The catalog is refreshed in a background task. Words may be stemmed or filtered before being pushed into the database. This also means there can be many false positives for a search, an expression used to denote results that are returned but are not relevant to the intended search. Clustering techniques based on Bayesian algorithms can help reduce false positives.
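As a minimal sketch of the concordance idea, the following builds an inverted index over stemmed words and answers conjunctive queries against it. The toy stemmer and documents are illustrative only; a real system would use a proper stemmer and stop-word filtering:

from collections import defaultdict

def stem(word):
    # Deliberately naive suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_index(docs):
    # The "concordance": each stemmed word maps to the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index

def search(index, query):
    # Conjunctive query: documents containing every stemmed query term.
    results = [index.get(stem(w.lower()), set()) for w in query.split()]
    return set.intersection(*results) if results else set()

docs = {1: "indexing and searching documents",
        2: "full text search builds a concordance",
        3: "the catalog is refreshed in the background"}
index = build_index(docs)
print(search(index, "searched documents"))  # {1}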
Depending on the occurrences of words relevant to the categories, a search term can be placed in one or more of the categories. Two metrics are commonly used to describe search results: precision and recall. Precision measures the fraction of the returned results that are relevant, and recall measures the fraction of all relevant documents that were actually returned. Some tools aim to improve querying so as to improve the relevancy of the results. These tools utilize different forms of searches such as keyword search, field-restricted search, boolean queries, phrase search, concept search, concordance search, proximity search, regular expressions, fuzzy search and wildcard search.
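Both metrics reduce to simple set arithmetic. A small self-contained sketch, with made-up document ids:

def precision(returned, relevant):
    # Fraction of returned results that are actually relevant.
    return len(returned & relevant) / len(returned) if returned else 0.0

def recall(returned, relevant):
    # Fraction of all relevant documents that were returned.
    return len(returned & relevant) / len(relevant) if relevant else 0.0

returned = {1, 2, 3, 4}
relevant = {2, 4, 5}
print(precision(returned, relevant))  # 0.5   (2 of the 4 returned are relevant)
print(recall(returned, relevant))     # ~0.67 (2 of the 3 relevant were found)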

Forensics

Forensic science has a number of examples of statistical approaches to text search. These include:
Factor analysis: This method describes variability among observed, correlated variables in terms of a potentially smaller number of unobserved variables called factors. The variations in three or four observed variables mainly reflect the variations in fewer unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables.
Bayesian statistics: This method uses degrees of belief, also called Bayesian probabilities, to express a model of the world. Beliefs are stated as probabilities and updated as new evidence is interpreted.
Poisson distribution: This is used to model the number of events occurring within a fixed interval of time.
Multivariate analysis: This involves the simultaneous observation and analysis of more than one statistical outcome variable.
Discriminant function analysis: This involves predicting a grouping variable based on one or more independent variables.
Cusum analysis: This is a cumulative sum method that tracks the per-sentence rate of two- and three-letter words, a habit characteristic of the writer, and can therefore be used to detect tampered sections (see the sketch after this list).
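Here is a minimal sketch of the cusum idea on text, under the habit assumption described above. The statistic follows the description; any decision threshold is left out because it varies by practitioner:

def cusum(sentences):
    # Per-sentence rate of two- and three-letter words, the habit statistic.
    rates = []
    for s in sentences:
        words = s.split()
        short = sum(1 for w in words if len(w.strip(".,;!?")) in (2, 3))
        rates.append(short / len(words) if words else 0.0)
    if not rates:
        return []
    mean = sum(rates) / len(rates)
    # Cumulative sum of deviations from the writer's average; a pronounced
    # bend in this series marks a candidate tampered or inserted section.
    cum, series = 0.0, []
    for r in rates:
        cum += r - mean
        series.append(cum)
    return series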
 

Sunday, February 24, 2013

Integrated access to multiple data sources

Large organizations typically have several databases, and users may want to access data from more than one source. For example, an organization may have one data store for the product catalog (also called master data), another for billing and payments, and yet another for reporting. These databases may contain some common information, and determining the exact relationship between tables in different databases can be difficult. For example, prices in one database might be dollars per dozen items while in another they might be dollars per item. XML DTDs offer the promise that such semantic mismatches can be avoided if all parties conform to a single standard DTD; however, there are many legacy databases, and most domains do not yet have an agreed-upon DTD. Semantic mismatches can instead be resolved and hidden from users by defining relational views over the tables from the two databases. Defining a collection of views to give users a uniform presentation of relevant data from multiple databases is called semantic integration. The task of defining these views can be challenging when there is little or no documentation for the existing databases.
If the underlying databases are managed by different DBMSs, some kind of middleware may be used to evaluate queries over the integrating views, retrieving data at query execution time. Alternatively, the integrating views can be materialized and stored in a data warehouse; queries are then run over the warehoused data instead of against the source DBMSs at run time.
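As a small sketch of such an integrating view, here is the price-unit mismatch from the example above resolved with sqlite3; the table and column names are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog_a (sku TEXT, price_per_dozen REAL);
    CREATE TABLE catalog_b (sku TEXT, price_per_item REAL);
    INSERT INTO catalog_a VALUES ('pen', 12.0);
    INSERT INTO catalog_b VALUES ('pad', 2.5);

    -- The view hides the semantic mismatch: everything is dollars per item.
    CREATE VIEW unified_catalog AS
        SELECT sku, price_per_dozen / 12.0 AS price_per_item FROM catalog_a
        UNION ALL
        SELECT sku, price_per_item FROM catalog_b;
""")
for row in conn.execute("SELECT * FROM unified_catalog"):
    print(row)  # ('pen', 1.0) and ('pad', 2.5)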

new venture

http://indexer.cloudapp.net

schema design 2

Normal forms are guidance for avoiding known problems during schema design. Given a schema, whether we decompose it into smaller schemas is determined based on these normal forms. The normal forms are based on functional dependencies (FDs) and are first normal form, second normal form, third normal form and Boyce-Codd normal form. Each of these forms has increasingly restrictive requirements. A relation is in first normal form if every field contains atomic values and not lists or sets. 2NF is mainly of historical interest. 3NF and BCNF are important from a design standpoint. Boyce-Codd normal form holds for a relation R if, for every FD X -> A that holds over R, one of the following statements is true (a small checker sketch follows these definitions):
A belongs to X; that is, it is a trivial FD, or
X is a superkey.
Third normal form holds when, for every FD X -> A that holds over R, one of the following statements is true:
A belongs to X; that is, it is a trivial FD, or
X is a superkey, or
A is part of some key for R.
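To make the BCNF test concrete, here is a small sketch that uses the standard attribute-closure algorithm to decide whether the left side of each FD is a superkey. The relation and dependencies are made up for illustration:

def closure(attrs, fds):
    # Attribute closure: everything determined by attrs under the given FDs.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(relation, fds):
    for lhs, rhs in fds:
        if rhs <= lhs:                        # trivial FD: A belongs to X
            continue
        if closure(lhs, fds) >= relation:     # X is a superkey
            continue
        return False
    return True

R = {"sku", "warehouse", "qty", "address"}
fds = [({"sku", "warehouse"}, {"qty"}),
       ({"warehouse"}, {"address"})]          # warehouse is not a superkey
print(is_bcnf(R, fds))  # False: warehouse -> address violates BCNF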

Saturday, February 23, 2013

Setting up a website

If you want to create a website for yourself, here are some of the things you need to do:
1) Your company must have a name and logo. This is central to the theme on any page that displays information about your company.
2) Your company website must explain the company's value proposition in a simple and clear manner, starting from the home page. A picture, a slogan or a paragraph that conveys this to the user should be on the front page of your website.
3) Your company website must capture user attention with achievements, partners or, better yet, examples of real-life usage.
4) Your company website must have business contact information if users are expected to get in touch.
5) Your company must hold the copyright to all information on the website.