Tuesday, January 21, 2014

The following are some tips to improve SQL Server performance, based on what has worked in practice:
1) Normalize the logical database design:
Reasonable normalization of the logical database design yields the best performance. A normalized database usually has more, narrower tables, while a de-normalized database has fewer, wider tables. Many joins in a normalized database can hurt performance; however, the optimizer is efficient at selecting joins when indexes are available.
Note that normalization brings narrower and fewer indexes per table, fewer NULLs and less redundant data, allows more clustered indexes, and facilitates sorting and index creation. It also requires fewer locks because less data is locked. Typically, joins of four or more tables indicate that there has been too much normalization.
2) Use effective index design
Indexes are usually left out of the logical database design and are associated with the physical database design. Indexes can be dropped, added and changed without affecting the database schema. The optimizer chooses the best available index, so by designing indexes properly we help the optimizer improve performance. We do this as follows:
a) Examine the where clause of the SQL queries, because this is the primary focus of the optimizer.
Pick out the slow queries and consider each column in their where clauses as a candidate for an index (see the sketch after this list).
b) Use narrow indexes. They have fewer index levels and are often more effective than multi-column compound indexes, providing better performance over a wide range of queries.
c) Use clustered indexes.
Update and delete operations are often accelerated by clustered indexes because they involve a lot of reading. Queries that return a range of values are often improved with a clustered index, whereas a non-clustered index helps queries that return fewer rows.
d) Examine column uniqueness, i.e., the number of unique values in a column. If there are only a handful, the column should not be indexed. If the number of unique values is far less than the number of rows, consider a clustered index. If it approaches the number of rows, a non-clustered index can be used.
e) Examine data distribution in indexed columns:
A long-running query often occurs because a column with few unique values is indexed, or because a join is performed on such a column. When the data is not distributed well, page I/O increases, typically one page I/O per row retrieved through a non-clustered index, at which point it becomes more efficient to scan the entire table. The distribution can be examined with, say, a GROUP BY on the column and HAVING COUNT(*) > 1.
The availability of an index helps the optimizer decide whether to use indexed retrieval or a table scan for better performance.
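As a sketch of (a) through (c), assuming a hypothetical Orders table whose slow queries filter on CustomerID and OrderDate (the table and column names are illustrative, not from the original):

-- Slow query found in the workload:
-- SELECT OrderID, Total FROM Orders WHERE CustomerID = 42 AND OrderDate >= '2014-01-01'

-- A narrow non-clustered index on the selective where-clause column:
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON Orders (CustomerID);

-- A clustered index on the column used for range predicates (only one per table):
CREATE CLUSTERED INDEX CIX_Orders_OrderDate ON Orders (OrderDate);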
3) Use efficient query design
Some types of queries are inherently resource intensive, such as:
large result sets
IN, NOT IN, and OR queries
highly non-unique where clauses
!= (not equal) comparison operators
certain column functions, such as SUM
expressions or data conversions in the where clause
local variables in the where clause
complex views with GROUP BY
To mitigate the resource usage, consider restricting the result set, for example by adding a predicate such as WHERE zip = '98052', as sketched below.
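A sketch, assuming a hypothetical member table like the one in the example above (only the zip predicate comes from the original; the other names are illustrative):

-- Restrict the result set with a selective predicate instead of returning everything:
SELECT member_id, lastname
FROM member
WHERE zip = '98052';

-- Prefer positive, index-friendly predicates over NOT IN / != where the logic allows:
SELECT member_id, lastname
FROM member
WHERE state IN ('WA', 'OR');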

Monday, January 20, 2014

There are three types of joins: nested loop joins, hash joins and sort-merge joins.
In a nested loop join, one of the tables is scanned and, for every row, a lookup is performed on the second table and the matching rows are retrieved. The tables can be labeled as the outer and inner tables. The inner table lookup can be fast when indexes are present. Let us take two tables, employees and contractors, where employees has 100,000 rows and contractors has 40,000 rows.
Then a table scan on contractors together with a clustered index seek on employees is an example of a nested loop join.
For a join on id, the scan count of the employees table corresponds to one seek per contractor row (40,000 here) and that of the contractors table is 1.
In a hash join, a hash table is created in memory for one of the inputs, called the build input, and it is searched for matching values from the other input, the probe input. If the build input does not fit in memory, it is partitioned recursively until it fits. The probe input is hashed and used to find matches. Hash joins are better when there is a significant difference between the sizes of the tables.
In a hash join of the example tables, the contractors table is used as the build input and the employees table is used with a clustered index scan.
The scan count of the employees table is 1 and that of the contractors is also 1.
In a sort-merge join, the tuples of one input are examined, then the second, sorted input is scanned until a row with a match is found, and this is repeated for every row of the first input. Here the larger input dominates the cost. In a merge join, the inputs are sorted on the join column and the sorted pair is merged into one, eliminating the column values that do not match from the sorted result set.
This type of join requires both inputs to be in memory, as opposed to a hash join, where only the smaller input is in memory. The merge join is more efficient if the inputs are already sorted. Note that inequality joins are not handled by this method.
For the given example, the sorting is done on the ids of the contractors table and it is scanned together with a clustered index scan of the employees table.
The scan count for the employees table is 1 and that for the contractor table is also 1.
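To see the three join strategies on the employees/contractors example, the join type can be hinted. This is a sketch assuming SQL Server syntax and illustrative column names, not the original queries:

SELECT e.name, c.agency
FROM employees e
INNER JOIN contractors c
ON c.id = e.id
OPTION (LOOP JOIN);   -- forces the nested loop join plan

-- OPTION (HASH JOIN) forces the hash join plan
-- OPTION (MERGE JOIN) forces the sort-merge join plan
-- SET STATISTICS IO ON reports the scan counts discussed above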
In Teradata, the SQL syntax is very similar to that of other databases. For example, the TOP command can be used with ORDER BY, and TOP WITH TIES can be used to see the ties together; this is similar to a RANK or DENSE_RANK operation. However, in Teradata, TOP cannot be used with DISTINCT, QUALIFY, WITH, WITH ... BY, or SAMPLE. By the way, GROUP BY is preferred over DISTINCT for queries that return the same results, because DISTINCT can skew the data across AMPs whereas GROUP BY is faster.
Also note that when using where clauses, it is preferable to specify IS NULL or IS NOT NULL explicitly. This is particularly true when testing for the exclusion of some column values in the result set. Parentheses in the where clause can be used to change the order of precedence. The IN and NOT IN operators can also help with inclusion and exclusion. Note that NOT= ALL works just like the NOT IN operator.
The IN list is an excellent way to look up multiple column values. BETWEEN is inclusive and also works for character data. LIKE can be used for string comparison and takes % and _ as wildcard characters. LIKE, LIKE ALL and LIKE ANY can be used in combination with multiple conditions. A few of these clauses are sketched below.
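These are sketches, assuming a hypothetical Employee_Table (column names are illustrative):

SELECT TOP 10 WITH TIES Last_Name, Salary
FROM Employee_Table
ORDER BY Salary DESC;

SELECT Last_Name, Salary
FROM Employee_Table
QUALIFY RANK() OVER (ORDER BY Salary DESC) <= 10;

SELECT Last_Name
FROM Employee_Table
WHERE Dept_No IS NOT NULL
AND Last_Name LIKE 'Sm%'
AND Salary BETWEEN 40000 AND 60000;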
Data manipulation commands are also similar. For example, we can write INSERT INTO My_Table (Column1, Column2) VALUES (124.56, 'My character data')
A bulk insert is done with, say, INSERT INTO My_Table SELECT Column1, SUM(Column2) FROM Original_Table GROUP BY Column1
An update is done with, say, UPDATE My_Table SET Column2 = Column2 + 256 WHERE Column1 = 'my data'
A fast path INSERT/SELECT (into an empty table) can be done with INSERT INTO My_Table_Copy SELECT Column1, Column6 * 1.05 FROM My_Table.
A MERGE updates a matching row or inserts a new one. For example (column names are illustrative):
MERGE INTO Employee_Table
USING VALUES (200000, NULL, 'Jones') AS E (Emp, Dept, Las)
ON Employee_No = Emp
WHEN MATCHED THEN
UPDATE SET Last_Name = Las
WHEN NOT MATCHED THEN
INSERT VALUES (Emp, Dept, Las);

Sunday, January 19, 2014

interview questions
If one table has students and another table has classes offered at a school, how do we find students who are not enrolled in any class?
We maintain another table, StudentClass, with non-null foreign keys StudentID and ClassID.
SELECT s.name FROM StudentClass sc
RIGHT OUTER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.StudentID IS NULL
If we wanted a count of classes for all students enrolled in more than one class:
SELECT s.name, COUNT(*) AS NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
GROUP BY s.name
HAVING COUNT(*) > 1
If we wanted to exclude a particular ClassID, say 1:
SELECT s.name, COUNT(*) AS NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.ClassID <> 1
GROUP BY s.name
HAVING COUNT(*) > 1

Saturday, January 18, 2014

In this post, we talk about creating temporal tables in Teradata.
There are three types of temporal tables.
- valid time temporal tables
- transaction time temporal tables
- bi-temporal tables that are a combination of both.
These tables are created with the PERIOD data type that introduces a begin and end timestamp.
The only difference is that the ValidTime period can be based on a date or a timestamp, while the TransactionTime period has to be a timestamp.
In bi-temporal tables, the timestamps change on update so that both the before and after versions are kept. A non-sequenced ValidTime query gives the data the way things were, while the current ValidTime gives the way the data is now.
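A rough sketch of such a definition, assuming a hypothetical Policy table (the exact PERIOD options vary by Teradata release, so treat this as illustrative):

CREATE MULTISET TABLE Policy
( Policy_ID     INTEGER,
  Customer_ID   INTEGER,
  Policy_Term   PERIOD(DATE) NOT NULL AS VALIDTIME,
  Policy_Change PERIOD(TIMESTAMP(6) WITH TIME ZONE) NOT NULL AS TRANSACTIONTIME
)
PRIMARY INDEX (Policy_ID);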
We will now look at how joins work internally in Teradata.
Teradata requires that, for two rows to be joined together, both rows be physically on the same AMP.
To make that happen, one or both tables are redistributed, or the smaller table is duplicated across all AMPs.
Based on the join column, the matching rows from different AMPs are brought together on the same AMP in spool; hashing is used to bring the rows together.
If two tables were joined a thousand times a day and they share the same primary index on the PK/FK join columns, there will be no data movement because the rows will already be hashed to the same AMP. The EXPLAIN output shows this as a row hash match join. Where this is not the case, it is preferable to have a where clause on the primary index. Volatile tables, by the way, are usually deleted on logoff.
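A sketch of the PK/FK case described above, with hypothetical tables that share the same primary index on the join column:

CREATE TABLE Employee_Table
( Employee_No INTEGER,
  Dept_No     INTEGER,
  Last_Name   VARCHAR(30)
) PRIMARY INDEX (Employee_No);

CREATE TABLE Employee_Phone
( Employee_No INTEGER,
  Phone_No    VARCHAR(20)
) PRIMARY INDEX (Employee_No);

-- Matching rows hash to the same AMP, so EXPLAIN should show a row hash match join
-- with no redistribution:
SELECT e.Last_Name, p.Phone_No
FROM Employee_Table e
INNER JOIN Employee_Phone p
ON p.Employee_No = e.Employee_No;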
There are three types of temporary tables (a combined SQL sketch follows this list):
1) Derived tables
These exist only within a query and are materialized by a SELECT statement inside the query.
The space used comes from the user's spool space and is released at the end of the query.
Derived tables are usually specified within parentheses in the SQL or with a WITH clause. The columns can be aliased, or they default to the column names of the underlying select. Most derived tables are joined to other tables.
2) Volatile tables are created by the user and materialized with an INSERT/SELECT.
The space used comes from the user's spool space, and the table and data are deleted when the user logs off.
ON COMMIT DELETE ROWS can be explicitly specified so that the rows are deleted at the end of the transaction; ON COMMIT PRESERVE ROWS keeps them for the rest of the session.
The HELP VOLATILE TABLE command shows the session's volatile tables.
Volatile tables can have a primary index, so that when they are used in joins between tables there is no movement of data.
Statistics can also be collected on volatile tables to improve performance, for example on all non-unique primary indexes (NUPI), unique primary indexes of small tables (i.e., fewer than a thousand rows per AMP), columns that appear frequently in WHERE search conditions, non-indexed columns used in joins, and the partitioning column of a PPI table.
3) Then there are global temporary tables.
Here the table definition is specified by the user and is permanent; the table is materialized with an INSERT/SELECT.
The space used comes from the user's TEMP space. When the user logs off the session, the data is deleted, but the table definition stays. The definition can be shared by many users, each of whom populates their own copy of the table.
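The combined sketch mentioned above, assuming a hypothetical Employee_Table (clause details vary by release, so treat this as illustrative):

-- 1) Derived table: lives in spool only for the duration of the query.
SELECT d.Dept_No, d.Total_Sal
FROM (SELECT Dept_No, SUM(Salary) FROM Employee_Table GROUP BY Dept_No) AS d (Dept_No, Total_Sal);

-- 2) Volatile table: definition and data live in spool until logoff
--    (or until end of transaction with ON COMMIT DELETE ROWS).
CREATE VOLATILE TABLE vt_dept_sal AS
( SELECT Dept_No, SUM(Salary) AS Total_Sal
  FROM Employee_Table
  GROUP BY Dept_No
) WITH DATA
ON COMMIT PRESERVE ROWS;

COLLECT STATISTICS ON vt_dept_sal COLUMN (Dept_No);

-- 3) Global temporary table: permanent definition, per-session data in TEMP space.
CREATE GLOBAL TEMPORARY TABLE gtt_dept_sal
( Dept_No   INTEGER,
  Total_Sal DECIMAL(12,2)
) PRIMARY INDEX (Dept_No)
ON COMMIT PRESERVE ROWS;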
 

Friday, January 17, 2014

We return to the discussion on Teradata.
Teradata has an admin user called DBC. This comes with each system.
User DBC has always been at the top of the Teradata hierarchy. An admin user is explicitly provisioned to facilitate the addition of other users, and some space is reserved for this DBA user. Space on the disk for each AMP is divided into PERM space and SPOOL space. PERM space is reserved for tables; secondary indexes, join indexes and permanent journals are also stored in PERM. SPOOL is used for user queries and answer sets.
 Disks are not shared between AMPs.
Every user is assigned spool space so they can submit SQL and retrieve answer sets. If the user exceeds this space, the query is aborted.
The spool limits exist so that runaway queries can be terminated and individual users don't hog the system.
Each user's spool limit is actually enforced per AMP: if the limit is exceeded on any AMP, the query is aborted.
The DBC data dictionary shows how much space is reserved. When the space per AMP shows skew, this is typically due to NULL data in the primary index column (all NULLs hash to the same AMP).
Access is usually granted to users to read views, and the views are in turn granted access to read the underlying tables.
A database or a user can be assigned  PERM space  and spool space.
Objects that take up PERM space include Table rows, fallback tables, secondary index subtables, stored procedures, user defined functions, permanent journals etc.
Access rights can be granted to users. There are three types of access rights:
Automatic rights, which are granted by default to the creators and owners of created objects;
Implicit rights, which are ownership rights and belong implicitly to parents and anybody above in the user hierarchy;
Explicit rights, which are privileges a user explicitly grants to another user with the GRANT statement.
Explicit rights may also be removed with REVOKE.
Profiles and roles are different: profiles are for people and roles are for rights. A profile can be assigned to a group of users so that they operate under a common set of attributes; for example, a setting in the profile can be used to override that of the individual users.
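A sketch of roles and profiles, with hypothetical names (the exact attribute list depends on the site):

-- A role is a named bundle of rights that can be granted to many users.
CREATE ROLE hr_read;
GRANT SELECT ON HR_Views TO hr_read;
GRANT hr_read TO user1;

-- A profile is a common set of attributes (spool, default database, and so on)
-- whose settings override those of the individual users.
CREATE PROFILE analyst_profile AS
  SPOOL = 5000000000,
  DEFAULT DATABASE = HR_Views;
MODIFY USER user1 AS PROFILE = analyst_profile;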
Account strings can be expanded to include the logon time, date, hour, session ID, and so on.
Teradata Active System Management (TASM) controls the query traffic: user accounts are given priorities, and rules are set up to control system resources.
Secondary indexes can be created. If the index is unique (a USI), its subtable rows are hash-distributed on the index value.
A USI access is typically a two-AMP operation, because one AMP is assigned to read the index subtable and a second reads the base table. Records are looked up in both via binary searches.
A non-unique secondary index (NUSI) subtable is AMP-local.
A value-ordered NUSI can also be created, which helps range queries on the indexed column.
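A sketch of these index types on a hypothetical Employee_Table (column names are illustrative):

-- Unique secondary index (USI): hash-distributed subtable, typically a two-AMP lookup.
CREATE UNIQUE INDEX (Social_Security_No) ON Employee_Table;

-- Non-unique secondary index (NUSI): AMP-local subtable.
CREATE INDEX (Last_Name) ON Employee_Table;

-- Value-ordered NUSI, useful for range predicates on the indexed column.
CREATE INDEX (Salary) ORDER BY VALUES ON Employee_Table;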
Today we take a short break from our discussion of Teradata.
Using Kullback-Leibler divergence to extract keywords from a single document:
This approach is a simple way to extract keywords using term weighting.
Terms are weighted based on the Kullback-Leibler divergence D(avg probability of term t || q) from the background distribution q.
The terms are evaluated one by one and if their divergence is above a cutoff, they are selected.
Initially, we calculate the expected probability of a term by counting its occurrence in the document. We could later refine this to be a lookup against a dictionary of words and their expected probabilities as determined independently from a corpus of text.
We calculate nw as the number of times the term w occurs in the document,
and pw as that count divided by the total number of terms in the document,
and we measure P(tk | q) as nw / (the sum of nx over all terms x in q).
By setting an adjustable threshold for the cutoff, we extract a variable number of keywords.
Pseudo-code in Python syntax for finding the KLD components from a single document:

import nltk

tokens = nltk.word_tokenize(document)
words = sorted(set(tokens))
porter = nltk.PorterStemmer()
words = [porter.stem(t) for t in words]

wnl = nltk.WordNetLemmatizer()
lemma = [wnl.lemmatize(t) for t in words]

# frequency distribution of the raw tokens in the document
fdist_document = nltk.FreqDist(tokens)

# probability of a term within the document
def P_tk_d(term):
    return fdist_document[term] / sum(fdist_document[t] for t in lemma)

The KLD is given as D(p||q) = Σ over t from 1 to n of pt · log(pt / qt),
where pt comes from the distribution consisting of only the one keyword and is a constant = 1 for that term,
and qt is the distribution nt/n for that term.
At first glance we don't need to aggregate the KLD component pt · log(pt/qt), since we consider only one term; however, the distribution is based on the same set of terms for both p and q, and when a term doesn't appear in p or q it is given a small epsilon probability.
We select the terms where the contribution pt · log(pt/qt) to KLD(p||q) is greater than 1.

KLD-term = 1 · log(1 / (nt/n)) if we considered only one term
KLD-term = 1 · log(1 / (nt/n)) + Σ over all other terms t' of epsilon · log(epsilon / (nt'/n))

Alternatively, we could use Bigi's equation, which normalizes the divergence:

KLD_norm(d, t) = KLD(d, t) / KLD(d, 0)

KLD(d, t) = Σ (P(t, d) − P(t, t)) · log(P(t, d) / P(t, t))

= (nt/n − 1) · log(nt/n) if we considered only one term
= (nt/n − 1) · log(nt/n) + Σ ((nt/n) − epsilon) · log((nt/n) / epsilon) over all the other terms

Note that this considers the occurrence and non-occurrence of terms, but not their density, for which we would want to add weights in the probability computation above.
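Restating the selection rule above in a compact form (this is only a restatement of the formulas already given, with qt = nt/n, an epsilon probability for terms missing from either distribution, and θ standing for the adjustable cutoff that the text sets to 1; the sum in Bigi's form is interpreted as running over the document's terms):

\[ \mathrm{score}(t) = p_t \log\frac{p_t}{q_t}, \qquad q_t = \frac{n_t}{n}, \qquad \text{select } t \text{ if } \mathrm{score}(t) > \theta \]

\[ \mathrm{KLD_{norm}}(d,t) = \frac{\mathrm{KLD}(d,t)}{\mathrm{KLD}(d,0)}, \qquad \mathrm{KLD}(d,t) = \sum_{t'} \big(P(t',d) - P(t',t)\big)\,\log\frac{P(t',d)}{P(t',t)} \]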