Friday, January 24, 2014

In this post, we will begin with a way to differentiate between LSI and PLSI. Latent Semantic Indexing (LSI) is an approach where we try to associate terms with the same main hidden semantic component, as found from co-occurrences in similar contexts; it works by decomposing the term-document matrix with linear algebra, typically singular value decomposition. In Probabilistic LSI (PLSI), we model the same occurrence and co-occurrence of terms, but with non-negative weights whose sum is 1, i.e. conditional probabilities. Comparing and clustering these co-occurrence distributions gives us the latent components, represented by the centers of the clusters.
We use similar clustering methods for document classification as well. Formally, we use the vector space model, which represents the documents of a collection as vectors with weights according to the terms appearing in each document.
The vector space model can be either boolean or term-weighted. In the boolean model, the presence or absence of each term is given a boolean value. In the term-weighted model, the same term may be assigned different weights by different weighting methods, such as tf-idf or frequency ratio. When the weights add up to one, they have been normalized; typically probabilities or conditional probabilities summing to one are used.
One of the advantages of the vector space model is that it enables relevance ranking of documents of heterogeneous formats with respect to user queries, as long as the attributes are well-defined characteristics of the documents.
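As a minimal sketch of the boolean variant with cosine-similarity ranking (the documents and query here are invented for illustration):

```python
from math import sqrt

def boolean_vector(doc_terms, vocabulary):
    # Boolean model: 1.0 if the term occurs in the document, else 0.0
    return [1.0 if t in doc_terms else 0.0 for t in vocabulary]

def cosine(u, v):
    # Cosine similarity between two weight vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "d1": {"index", "term", "weight"},
    "d2": {"query", "term", "rank"},
    "d3": {"cooking", "recipe"},
}
vocab = sorted(set().union(*docs.values()))
query_vec = boolean_vector({"term", "rank"}, vocab)
# Rank documents by similarity to the query
ranking = sorted(docs, reverse=True,
                 key=lambda d: cosine(boolean_vector(docs[d], vocab), query_vec))
```

A term-weighted variant would only change boolean_vector to emit tf-idf or normalized weights instead of 0/1.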

Thursday, January 23, 2014

In continuation of our discussion on keywords and documents:
We looked at a few key metrics for selecting keywords based on content, frequency and distribution.
However keywords can also be selected based on co-occurrence of similar terms in the same context.
i.e. words with similar meanings will occur with similar neighbors.

This model proposes that the distribution of a word together with other words and phrases is highly indicative of its meaning. The method represents the meaning of a word with a vector where each feature corresponds to a context and its value to the frequency of the word occurring in that context.
This vector is referred to as the term profile. The profile P(t) of the term t is the set of terms from the list T that co-occur in sentences together with t, that is P(t) = {t' : t' belongs to s(t)}, where s(t) is the set of terms appearing in sentences that contain t.
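A small sketch of building such profiles from sentences (the whitespace tokenizer and the sample sentences are assumptions for illustration):

```python
from collections import defaultdict

def term_profiles(sentences):
    # P(t): the set of terms that co-occur with t in at least one sentence
    profiles = defaultdict(set)
    for sentence in sentences:
        terms = set(sentence.split())
        for t in terms:
            profiles[t] |= terms - {t}
    return dict(profiles)

sentences = [
    "the bank raised rates",
    "the bank of the river",
    "rates fell today",
]
profiles = term_profiles(sentences)
```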

Once such a representation is built, machine learning techniques can be used to automatically classify or cluster words according to their meaning. However, these are often noisy representations, due to, say, polysemy and synonymy. Synonymy refers to the existence, in most languages, of several terms that can express the same idea or object. Polysemy refers to the fact that some words have multiple unrelated meanings. If we ignore synonymy, we end up with many small disjoint clusters, some of which could have been merged. If we ignore polysemy, unrelated documents may be clustered together.

Another challenge in applying this technique to massive databases is the efficient and accurate computation of the largest few singular values and their vectors of a highly sparse matrix. In practice, scalability beyond a few hundred thousand documents is difficult.

Note that the technique above inherently tries to find keywords based on latent semantics, and hence is referred to as latent semantic indexing. The counterpart of clustering in this approach is to find conditional probability distributions such that the observed word distributions can be decomposed into a few latent distributions plus a noisy remainder. The latent components are expected to be the same for co-occurring words.
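The decomposition idea can be illustrated with a toy mixture, where each document's word distribution is P(w|d) = sum over z of P(w|z)·P(z|d); the components and mixing weights below are invented purely for illustration:

```python
# Two invented latent components over a 4-word vocabulary
p_w_given_z = {
    "z_finance": {"bank": 0.5, "rates": 0.5, "river": 0.0, "water": 0.0},
    "z_nature":  {"bank": 0.3, "rates": 0.0, "river": 0.4, "water": 0.3},
}
# Invented mixing weights P(z|d) for two documents
p_z_given_d = {
    "d1": {"z_finance": 0.8, "z_nature": 0.2},
    "d2": {"z_finance": 0.1, "z_nature": 0.9},
}
VOCAB = ["bank", "rates", "river", "water"]

def word_distribution(doc):
    # P(w|d) = sum over z of P(w|z) * P(z|d)
    return {w: sum(p_w_given_z[z][w] * p_z_given_d[doc][z]
                   for z in p_z_given_d[doc])
            for w in VOCAB}

d1 = word_distribution("d1")
```

In PLSI proper, the components and mixing weights are not given; they are estimated from the observed co-occurrence counts.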

Next, we will also look at a popular way to represent documents in a collection which is with Vector Space Model.

Material read from survey of text mining, papers by Pinto, papers by Wartena

Today we will recap some keyword selection techniques:

In this article, we look at different feature selection algorithms. Feature selection metrics are based on content, frequency and  distribution.

Content based feature selection can be based on say mutual information or information gain.

MI measures how much information a feature contains about a class and is given by log P(f,c)/(P(f)P(c)).

Information Gain measures the number of bits of information obtained about the presence and absence of a class by knowing the presence or absence of a feature.

It is given by Sum-class Sum-feature P(f,c) . log P(f,c)/(P(f)P(c)), summing over the presence and absence of the class c and the feature f.
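A sketch of both metrics computed from document counts (the counts are hypothetical, and logarithms are taken base 2):

```python
from math import log2

def mutual_information(n_fc, n_f, n_c, n):
    # Pointwise MI: log P(f,c) / (P(f) P(c)), estimated from document counts
    return log2((n_fc / n) / ((n_f / n) * (n_c / n)))

def information_gain(n_fc, n_f, n_c, n):
    # IG: sum over presence/absence of feature f and class c
    ig = 0.0
    for f_present in (True, False):
        for c_present in (True, False):
            # joint count for this (feature, class) cell
            if f_present and c_present:
                joint = n_fc
            elif f_present:
                joint = n_f - n_fc
            elif c_present:
                joint = n_c - n_fc
            else:
                joint = n - n_f - n_c + n_fc
            pf = n_f / n if f_present else 1 - n_f / n
            pc = n_c / n if c_present else 1 - n_c / n
            pj = joint / n
            if pj > 0:
                ig += pj * log2(pj / (pf * pc))
    return ig

# 100 documents: feature occurs in 30, class holds for 40, both in 25
mi = mutual_information(25, 30, 40, 100)
ig = information_gain(25, 30, 40, 100)
```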

Gain Ratio improves IG by including features with low entropy that may be strongly correlated with a class. It does this by normalizing IG with the entropy of the class.

Term Strength is based on the idea that most valuable features are shared by related documents.
The weight of a feature is defined as the probability of finding it in some document d given that it also appears in a document d' similar to d.
To calculate TS for feature f, we need a threshold on the similarity measure used to judge whether two documents are sufficiently related. This is done by first deciding how many documents can be related to a given one, and then finding the average minimum similarity for this number of neighbors over all documents in the collection.
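A minimal sketch of estimating TS once the related pairs have been selected with such a threshold (the document pairs below are hard-coded for illustration):

```python
def term_strength(term, related_pairs):
    # TS(t) = P(t appears in d | t appears in d'), over ordered pairs (d, d')
    conditioning = [(d, d2) for d, d2 in related_pairs if term in d2]
    if not conditioning:
        return 0.0
    hits = sum(1 for d, d2 in conditioning if term in d)
    return hits / len(conditioning)

# Pairs of related documents (as term sets); in practice these come from the
# similarity threshold described above -- here they are invented
pairs = [
    ({"tax", "rate"}, {"tax", "income"}),
    ({"tax", "law"}, {"tax", "court"}),
    ({"river", "bank"}, {"tax", "water"}),
]
ts_tax = term_strength("tax", pairs)
```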

Frequency based keyword selection metrics include tf-idf and Transition point.

tf-idf is term frequency times inverse document frequency. It is quite a simple and effective measure for keywords.
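A minimal tf-idf sketch (the toy corpus is invented; real systems usually also normalize and smooth the weights):

```python
from math import log

def tf_idf(term, doc, corpus):
    # Term frequency in this document times log inverse document frequency
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * log(len(corpus) / df) if df else 0.0

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "and", "the", "cat"],
]
w_cat = tf_idf("cat", corpus[2], corpus)  # frequent here, rarer elsewhere
w_the = tf_idf("the", corpus[2], corpus)  # appears everywhere, so idf is 0
```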

Transition Point is based on the idea that medium-frequency terms are closely related to the conceptual content of a document. A term is given a higher weight the closer its frequency is to a particular frequency called the transition point. The transition point is determined by inspecting the vocabulary frequencies of each text: scanning down from the highest frequency, it is the lowest frequency that is not repeated, since repeating frequencies are more common among low-frequency words.
Distribution-wise, it is debatable whether a term spread throughout the document should be included or excluded. On one hand it is a good candidate for an index, and on the other hand it can be considered overly general.
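Under one common reading of that definition, finding the transition point can be sketched as follows (the token list is invented for illustration):

```python
from collections import Counter

def transition_point(tokens):
    # Scan distinct frequencies from highest to lowest; the transition point is
    # the lowest frequency reached before frequencies start repeating
    freqs = Counter(tokens)                  # term -> frequency
    freq_counts = Counter(freqs.values())    # frequency -> number of terms with it
    tp = None
    for f in sorted(freq_counts, reverse=True):
        if freq_counts[f] == 1:
            tp = f
        else:
            break  # repeated frequencies begin (typical of low-frequency terms)
    return tp

tokens = ["a"] * 7 + ["b"] * 5 + ["c"] * 3 + ["d"] * 3 + ["e", "f"]
tp = transition_point(tokens)  # frequencies 7 and 5 are unrepeated; 3 repeats
```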

Wednesday, January 22, 2014

In this post, we revert to the discussion on Teradata:
We will cover the help and show commands. Help is a command to list database objects. For example, we have HELP DATABASE, HELP USER, HELP TABLE, HELP VOLATILE TABLE, HELP VIEW, HELP MACRO, HELP PROCEDURE, HELP TRIGGER, HELP COLUMN, etc. These display information about the corresponding database objects. The different database object kinds are coded as: aggregate function (A), combined/analytical function (B), external stored procedure (E), function (F), trigger (G), method (H), join index (I), journal table (J), macro (M), hash index (N), procedure (P), queue table (Q), table function (R), ordered analytical function (S), table (T), UDT (U), view (V) and authorization (X). There is also corresponding help available for index, statistics, constraint, session, SQL, SPL, etc.
Comments can also be added to database objects. These are displayed on help.
Runtime information such as sessions is also displayed; the parsing engine assigns the session.
The show command displays the SQL used to create the corresponding database object. For example, SHOW TABLE <table-name> returns the CREATE TABLE statement needed to re-create the table. Table, view, macro, trigger and procedure can all be shown.
Aggregates such as COUNT, when used with a column specifier, ignore null values; when used with a star to denote all columns, they include rows with null values. The common aggregates are MIN, MAX, AVG, SUM and COUNT. GROUP BY delivers one row per group. Rows can be restricted with a predicate, and this improves performance; the predicate also helps skip calculations in aggregation. The HAVING clause acts as a filter on the aggregates.
 Grouping Sets, rollup and cube are three grouping options.
The Nexus tool could be used for all groupings.
 Joins can be INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN and FULL OUTER JOIN. A PRODUCT JOIN is a Cartesian product and often not intended.
In continuation of our discussion on improving performance in SQL Server, we now look at the I/O affinity option. This is useful on multiprocessor computers with 16 or more CPUs. The option supports only disk I/O and does not provide hardware affinity for individual disks or controllers.
The option is set with sp_configure, using the affinity mask configuration option to specify which CPUs the threads of SQL Server should run on. The affinity mask can exclude processors that are to be reserved for operating system processes.
When running an instance of SQL Server on a large enterprise-level multiprocessor computer with more than 16 CPUs, IO_affinity_mask can be used in conjunction with affinity_mask. It specifies which processors are to be used for SQL Server disk operations, while the remaining CPUs service everything else.
Care must be taken to specify just the right number of affinitized cpus for disk IO. This is because we don't want to provide more than what the system needs or we might degrade performance on the other CPUs.
The IO_affinity_mask cannot be changed  when the server is running.
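The masks themselves are just bitmasks with one bit per CPU; computing them can be sketched as below (the particular 12/4 CPU split is a hypothetical example, and sp_configure expects the resulting values as integers):

```python
def affinity_mask(cpus):
    # Bitmask with one bit set per CPU index (bit 0 = CPU 0)
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

# Hypothetical split on a 16-CPU machine: CPUs 0-11 run queries,
# CPUs 12-15 are dedicated to disk I/O
query_mask = affinity_mask(range(12))    # 0x0FFF
io_mask = affinity_mask(range(12, 16))   # 0xF000
```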
Occasionally, we may encounter slow-running queries or even suspended queries. This may have nothing to do with performance tuning and instead be symptomatic of a resource issue such as a deadlock.
To identify a deadlock, you must first obtain log information. SQL server supports different startup parameters, dbcc traceon, and dynamic management views that help with more visibility on locks.
For example, we can add the -T1204 and -T3605 trace flags at server startup. The former collects information about the processes and resources when the deadlock detection algorithm encounters a deadlock, and the latter sends that trace output to the error log. If a server restart is not an option, the flags can be turned on with DBCC TRACEON.
A SQL Profiler trace can also be collected. It can show full statements in addition to the execution plans of the statements, and it may include lock events for deadlock and deadlock chain. Turning on the deadlock trace flags and then running a SQL Profiler trace during the occurrence of a deadlock should provide the data needed to troubleshoot it. However, running the profiler could change the timing of execution enough to prevent a deadlock; therefore, capture the information with the trace flags first and run the profiler afterwards.


Tuesday, January 21, 2014

This continues on our discussion to improve SQL Server performance with normalizing logical database design, using efficient index design and using efficient query.
The next step is efficient application design. The role application design plays is critical because clients control the queries: the type of queries, when they are submitted and how the results are processed. This affects the type and duration of locks, the amount of I/O and CPU load on the server, and whether performance is good or bad.
For this reason, the sooner application design is corrected, the better the performance. At the same time, even for turnkey applications, performance cannot always be improved without changing the application. Queries from clients are often referred to as the workload.
Workloads should use small result sets; this reduces CPU and I/O load.
Applications should allow cancellation of a query in progress. No application should force reboot.
Applications should process all results to completion.
Always implement a query timeout; do not allow queries to run indefinitely.
Do not use generated SQL unless you have control over, or transparency into, the statements. Query cancellation, query timeout and complete transactional control are still required from generated SQL.
Decision support and OLTP queries should be separated as much as possible.
Do not allow orphaned connections.
Use trace feature wherever possible. Isolate the slow query wherever possible.
After the slow query is isolated, do the following:
run it in isolation with minimal buffering and I/O redirection.
set statistics IO on to examine IO consumed by the query.
If the query is nested or wrapped in a stored procedure, extract it.
If there are triggers on the table that can generate I/O when run, silence them.
Examine the indexes of the table used by the slow query.
Using the queries previously mentioned, examine the data uniqueness and distribution for each column.
Based on this study, make changes to application, query and index as appropriate. Run the query again to see the improvement.
 
The following are some of the tips to improve SQL Server performance based on what has worked before:
1)Normalize logical database design:
Reasonable normalization of the logical database design yields the best performance. A normalized database usually has a greater number of narrow tables, while a de-normalized database has fewer, wider tables. Having many joins to make in a normalized database can hurt performance; however, the optimizer is efficient at selecting joins when indexes are available.
Note that normalization brings fewer and narrower indexes, fewer NULLs and less redundant data, allows more clustered indexes, and facilitates sorting and index creation. It also requires fewer locks, because less data is locked. Typically, four-way or greater joins are indicative of too much normalization.
2) Use Effective Index design
Indexes are usually left out of logical database design and are associated with physical database design instead. Indexes can be dropped, added and changed without affecting the database schema. The optimizer chooses the best index; by designing indexes properly and letting the optimizer select the best one, we can improve performance. We do this as follows:
a) Examine the WHERE clause of the SQL queries, because this is the primary focus of the optimizer.
Pick out the slow queries and, for each column in the WHERE clause, consider it a possible candidate for an index.
b) Use narrow indexes. These are often more effective than multi-column compound indexes, have fewer index levels, and provide better performance over a wide range of queries.
c) use clustered indexes
UPDATE and DELETE operations are often accelerated by clustered indexes because they involve a lot of reading. Queries that return a range of values are often improved by a clustered index, whereas a non-clustered index helps a narrower set of queries.
d) Examine column uniqueness, i.e. the number of unique values in a column. If there are only a handful, the column should not be indexed. If the number is far less than the number of rows, consider a clustered index. If it approaches the number of rows, a non-clustered index can be used.
e) examine data distribution in indexed columns:
A long-running query can occur because a column with few unique values is indexed, or because a join is performed on such a column. When the data is not distributed well, page I/O increases, typically one page I/O per non-clustered index lookup, at which point it becomes more efficient to scan the entire table. Distribution can be inspected with, say, GROUP BY and HAVING COUNT(*) > 1.
The availability of an index helps the optimizer decide whether indexed retrieval will improve performance.
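The uniqueness heuristics in step d) can be sketched as a small rule of thumb; the numeric thresholds below are illustrative assumptions, not values from the source:

```python
def index_recommendation(n_unique, n_rows):
    # Rule of thumb following step d); thresholds are illustrative only
    if n_unique <= 5:
        return "no index"            # only a handful of unique values
    if n_unique < 0.5 * n_rows:
        return "clustered index"     # far fewer unique values than rows
    return "non-clustered index"     # uniqueness approaches the row count

rec = index_recommendation(200, 100000)
```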
3) Use efficient query design
Some types of queries are inherently resource intensive such as
Large result sets
IN, NOT IN, and OR queries
highly non-unique where clauses.
!= comparison operators
certain column functions, such as SUM
expressions or data conversions in where clause
local variables in where clause
complex views with Group by
To mitigate the resource usage, consider restricting the result set. For example, we can add a predicate such as WHERE zip = '98052'.