Monday, January 20, 2014

There are three types of joins: nested loop joins, hash joins, and sort-merge joins.
In a nested loop join, one of the tables is scanned and, for every row, a lookup is performed on the second table to retrieve the matching rows. The tables can be labeled as the outer and inner tables. The inner table lookup can be fast depending on the presence of indexes. Let us take two tables, employees and contractors, where employees has 100,000 rows and contractors has 40,000 rows.
Then a table scan on contractors together with a clustered index seek on employees for each contractor row is an example of a nested loop join.
For a join on id, the scan count of the employees table equals the number of outer rows, 40,000 seeks, and that of the contractors table is 1.
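As a sketch, assuming both tables expose an id column and employees has a clustered index on it, the query behind all of the plans discussed here is simply:
SELECT e.id, c.id
FROM contractors c
INNER JOIN employees e
ON e.id = c.id;
With contractors as the small outer input and an index on employees.id, the optimizer is likely to pick the nested loop plan just described.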
In a hash join, a hash table is created in memory for one of the inputs, called the build input, and searched for matching values from the other input, the probe input. If the build input does not fit in memory, it is partitioned recursively until it does. The probe input is hashed and used to find matches. Hash joins work well when there is a significant difference between the sizes of the tables.
In a hash join of the example tables, the contractors table is used as the build input and the employees table is used with a clustered index scan.
The scan count of the employees table is 1 and that of the contractors is also 1.
In a sort-merge join, the inputs are sorted on the join column and the sorted pair is merged into one, eliminating the rows whose join-column values do not match. The tuples of one input are examined, then the second sorted input is scanned until a row with a match is found, and this is repeated for every row of the first input. Here the larger input dominates the cost.
This type of join requires both inputs to be sorted, as opposed to a hash join, where only the smaller input needs to be held in memory. The merge join is more efficient if the inputs are already sorted. Note that inequality joins are not handled by this method.
For the given example, the contractors table is sorted on its ids and scanned together with a clustered index scan (already ordered) of the employees table.
The scan count for the employees table is 1 and that for the contractors table is also 1.
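These strategies can be forced for experimentation. A sketch in SQL Server syntax, which matches the scan-count and clustered-index terminology used above (same assumed tables and columns):
SET STATISTICS IO ON;
SELECT e.id FROM contractors c JOIN employees e ON e.id = c.id OPTION (LOOP JOIN);
SELECT e.id FROM contractors c JOIN employees e ON e.id = c.id OPTION (HASH JOIN);
SELECT e.id FROM contractors c JOIN employees e ON e.id = c.id OPTION (MERGE JOIN);
The STATISTICS IO output reports the scan counts discussed above.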
In Teradata, the SQL syntax is very similar to anywhere else. For example, TOP can be used with ORDER BY, and TOP WITH TIES can be used to see the ties together, much like a RANK or DENSE_RANK operation. However, in Teradata, TOP cannot be used with DISTINCT, QUALIFY, WITH, WITH BY, or SAMPLE. By the way, GROUP BY is preferred over DISTINCT for queries that return the same results: DISTINCT redistributes the rows before removing duplicates and can skew the spool, whereas GROUP BY removes duplicates locally on each AMP first and is therefore usually faster.
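For instance, a minimal sketch, assuming a hypothetical Employee_Table with a Salary column:
SELECT TOP 5 WITH TIES Last_Name, Salary
FROM Employee_Table
ORDER BY Salary DESC;
If several employees tie at the fifth-highest salary, all of them are returned.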
Also note that when using WHERE clauses, it is preferable to specify IS NULL or IS NOT NULL explicitly. This is particularly true when testing for the exclusion of some column values in the result set, since NULLs never satisfy an ordinary comparison. Parentheses in the WHERE clause can be used to change the order of precedence. The IN and NOT IN operators can also help with inclusion and exclusion. Note that NOT= ALL works just like the NOT IN operator.
The IN list is an excellent way to look up multiple column values. BETWEEN is inclusive, and BETWEEN works for character data. LIKE can be used for string comparison. It can take % and _ as wildcard characters. LIKE, LIKE ALL, and LIKE ANY can be used in conjunction with conditions.
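Putting these together, a sketch against the same hypothetical Employee_Table, assuming Dept_No, Last_Name, and Phone_Number columns:
SELECT Last_Name, Dept_No
FROM Employee_Table
WHERE Dept_No IN (10, 20, 30)
AND Last_Name LIKE 'Sm%'
AND Phone_Number IS NOT NULL;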
Data Manipulation commands are also similar. For example, we can write INSERT INTO My_Table (Column1, Column2) VALUES (124.56, 'My character data')
A bulk insert is done with, say, INSERT INTO My_Table SELECT Column1, SUM(Column2) FROM Original_Table GROUP BY Column1
An update is done with, say, UPDATE My_Table SET Column2 = Column2 + 256 WHERE Column1 = 'my data'
Fast path updates can be done with an INSERT/SELECT into an empty copy, e.g. INSERT INTO My_Table_Copy SELECT Column1, Column6 * 1.05 FROM My_Table
MERGE INTO Employee_Table
USING VALUES (200000, NULL, 'Jones') AS E(Emp, Dept, Las)
ON Employee_No = Emp
WHEN MATCHED THEN
UPDATE SET Last_Name = Las
WHEN NOT MATCHED THEN
INSERT VALUES (Emp, Dept, Las);

Sunday, January 19, 2014

Interview questions
If a table has students and another table has classes offered at a school, how do we find students who are not enrolled in any class?
We maintain another table, StudentClass, with non-null foreign key columns StudentID and ClassID.
SELECT s.name FROM StudentClass sc
RIGHT OUTER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.ClassID IS NULL
If we wanted a count of all students enrolled in more than one class
SELECT s.name, count(*) as NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
GROUP BY s.name
HAVING COUNT(*) > 1
If we wanted to exclude a particular ClassID, say 1:
SELECT s.name, count(*) as NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.ClassID IS NOT NULL AND sc.ClassID <> 1
GROUP BY s.name
HAVING COUNT(*) > 1

Saturday, January 18, 2014

In this post, we talk about creating temporal tables in Teradata.
There are three types of temporal tables.
- valid time temporal tables
- transaction time temporal tables
- bi-temporal tables that are a combination of both.
These tables are created with the PERIOD data type, which introduces a begin and an end bound. The only difference is that the ValidTime period can be based on a date or a timestamp, while the TransactionTime period has to be a timestamp.
The timestamps are maintained on update, preserving both the before and after images, as in the case of bi-temporal tables. A nonsequenced ValidTime query treats the period as ordinary data and shows the rows as they were recorded, while a current ValidTime query gives the way the data is now.
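A minimal sketch, assuming a hypothetical Employee_Salary table and standard Teradata temporal syntax:
CREATE MULTISET TABLE Employee_Salary
(
Emp_No INTEGER NOT NULL,
Salary DECIMAL(10,2),
Validity PERIOD(DATE) NOT NULL AS VALIDTIME,
Recorded PERIOD(TIMESTAMP(6) WITH TIME ZONE) NOT NULL AS TRANSACTIONTIME
) PRIMARY INDEX (Emp_No);
With both period columns this is a bi-temporal table; keeping only one of the two clauses gives a valid-time or a transaction-time table.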
We will now look at how joins work internally in Teradata.
Teradata requires that for two rows to be joined together, both rows must be physically on the same AMP.
One or both tables are redistributed, or the smaller table is duplicated across all AMPs.
Based on the join column, the matching rows from different AMPs are brought together on the same AMP; spool space is used for this purpose. On all joins, the matching rows must be on the same AMP, and hashing is used to bring the rows together.
If two tables were to be joined a thousand times a day, and both have the same primary index on the PK/FK join columns, there will be no data movement because the rows will already be hashed to the same AMP. EXPLAIN shows this as a row hash match join. Where this is not the case, it is preferable to have a WHERE clause on the primary index. Volatile tables, for their part, are usually deleted on logoff.
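As a sketch, assuming hypothetical Employee_Table and Department_Table both created with Dept_No as their primary index:
EXPLAIN
SELECT e.Emp_No, d.Dept_Name
FROM Employee_Table e
INNER JOIN Department_Table d
ON e.Dept_No = d.Dept_No;
The plan text should describe a merge join using a row hash match scan, confirming that no redistribution was needed.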
There are three types of temporary tables:
1)Derived Table
This exists only within a query and is materialized by the SELECT statement inside the query.
The space comes out of the user's spool space, and the table is deleted at the end of the query.
Derived tables are usually specified within parentheses in the SQL or with a WITH clause. Columns can be aliased and otherwise default to the names of the underlying columns. Most derived tables are joined to other tables.
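A minimal sketch, assuming a hypothetical Employee_Table with Last_Name, Dept_No, and Salary columns:
SELECT e.Last_Name, T.Avg_Sal
FROM Employee_Table e
INNER JOIN (SELECT Dept_No, AVG(Salary) FROM Employee_Table GROUP BY Dept_No) AS T(Dept_No, Avg_Sal)
ON e.Dept_No = T.Dept_No;
The derived table T lives in spool only for the duration of this query.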
2)The Volatile tables are created by the user and materialized with an INSERT/SELECT.
The space comes out of the user's spool space, and the table and data are deleted when the user logs off.
ON COMMIT DELETE ROWS is the default and removes the rows at the end of the transaction; ON COMMIT PRESERVE ROWS must be specified explicitly to keep the rows until logoff.
The HELP VOLATILE TABLE command shows the volatile tables in the session.
Volatile tables can have a primary index, such that when they are used for joins to tables with the same primary index, there is no movement of data.
Statistics can also be collected on volatile tables to improve performance, for example on all non-unique primary indexes (NUPI), on unique primary indexes of small tables (fewer than a thousand rows per AMP), on columns that appear frequently in WHERE search conditions, on non-indexed columns used in joins, and on the partitioning column of a PPI table.
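A minimal sketch of the full workflow, reusing the hypothetical Employee_Table:
CREATE VOLATILE TABLE VT_Emp
(
Emp_No INTEGER,
Dept_No INTEGER
) PRIMARY INDEX (Emp_No)
ON COMMIT PRESERVE ROWS;

INSERT INTO VT_Emp SELECT Emp_No, Dept_No FROM Employee_Table;
COLLECT STATISTICS ON VT_Emp COLUMN (Dept_No);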
3)Then there are global temporary tables
Here the table definition is specified by the user and the definition is permanent. The table is materialized with an INSERT/SELECT.
The space comes out of the user's TEMP space. When the user logs off the session, the data is deleted, but the table definition stays. The definition can be shared by many users, each of whom populates the same table with their own copy of the data.
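A minimal sketch, with hypothetical names:
CREATE GLOBAL TEMPORARY TABLE GT_Emp
(
Emp_No INTEGER,
Dept_No INTEGER
) PRIMARY INDEX (Emp_No)
ON COMMIT PRESERVE ROWS;

INSERT INTO GT_Emp SELECT Emp_No, Dept_No FROM Employee_Table;
The CREATE is needed only once; each session that runs the INSERT materializes its own copy in TEMP space.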
 

Friday, January 17, 2014

We return to the discussion on Teradata.
Teradata has an admin user called DBC. This comes with each system.
User DBC has always been at the top of the Teradata hierarchy. An admin user is explicitly provisioned to facilitate the addition of other users, and some space is reserved for this admin user. Space on the disk for each AMP is divided into PERM space and SPOOL space. PERM space is reserved for tables; secondary indexes, join indexes, and permanent journals are also stored in PERM. SPOOL is used for user queries and answer sets.
 Disks are not shared between AMPs.
Every user is assigned spool space so they can submit SQL and retrieve answer sets. If the user exceeds this space, the query is aborted.
The spool limits are so that long queries can be terminated and individual users don't hog the system.
Each user's spool limit is actually done per AMP. If the space is exceeded on any AMP, the query is aborted.
The DBC data dictionary shows how much space is reserved. When the space per AMP shows skew, this is typically due to skewed primary index values such as NULLs.
Access is usually granted to users to read views, which are in turn granted access to read the underlying tables.
A database or a user can be assigned  PERM space  and spool space.
Objects that take up PERM space include table rows, fallback tables, secondary index subtables, stored procedures, user defined functions, permanent journals, etc.
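A minimal sketch of assigning space, with hypothetical names and sizes:
CREATE USER Analyst01 FROM DBC
AS PERM = 100000000,
SPOOL = 500000000,
PASSWORD = secret123;
PERM caps the user's tables and indexes; SPOOL caps intermediate results and answer sets.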
Access rights can be granted to a user. There are three types of access rights:
Automatic rights, which are default rights granted to the owners of created tables.
Implicit rights, which are ownership rights and belong implicitly to parents and anybody above in the user hierarchy.
Explicit rights, which are privileges a user explicitly grants to another user with a GRANT ... TO statement.
Explicit rights may likewise be explicitly removed with REVOKE.
Profiles and roles are different: profiles are for people and roles are for rights. A profile can be assigned to a group of users so that they operate under a common set of attributes; for example, a profile setting can be used to override the corresponding setting of the individual users.
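A minimal sketch, with hypothetical names:
CREATE PROFILE Analyst_P AS SPOOL = 500000000;
MODIFY USER Analyst01 AS PROFILE = Analyst_P;
CREATE ROLE Report_R;
GRANT SELECT ON Sales_DB TO Report_R;
GRANT Report_R TO Analyst01;
The profile's spool setting overrides the user's own; the role bundles rights that are granted as a unit.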
Account strings can be expanded to include logon time, date, hour, session ID, number etc.
The Teradata Active System Management TASM controls the query traffic. User accounts are given priority and rules setup to control system resources.
A secondary index can be created. If it is unique (a USI), its subtable rows are hash-distributed by the index value.
A USI access is a two-AMP operation because one AMP reads the index subtable and a second AMP reads the base table. Records are looked up in both via binary searches.
A non-unique secondary index (NUSI) subtable is AMP-local.
A value-ordered NUSI can also be created.
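A minimal sketch, assuming a hypothetical Order_Table with Order_Number and Order_Date columns:
CREATE UNIQUE INDEX (Order_Number) ON Order_Table;
CREATE INDEX (Order_Date) ORDER BY VALUES (Order_Date) ON Order_Table;
The first creates a USI; the second creates a value-ordered NUSI, useful for range predicates on Order_Date.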
Today we take a short break from our discussion on Teradata
Using Kullback-Leibler divergence to extract keywords from  a single document:
This approach describes a simple way to extract keywords using term weighting.
Terms are weighted based on the Kullback-Leibler divergence D(p || q) of the term's average probability p in the document from the background distribution q.
The terms are evaluated one by one and if their divergence is above a cutoff, they are selected.
Initially, we calculate the expected probability of a term by counting its occurrence in the document. We could later refine this to be a lookup against a dictionary of words and their expected probabilities as determined independently from a corpus of text.
We calculate nw as the total number of times the term w appears,
and pw as that count divided by the total number of terms in the document,
and we measure P(tk | q) as nw / Sum-over-all-terms-x-in-q(nx).
By setting an adjustable threshold for the cutoff, we extract a variable number of keywords.
pseudo-code in Python syntax for finding the KLD inputs from a single document:
import nltk

tokens = nltk.word_tokenize(document)
words = sorted(set(tokens))
porter = nltk.PorterStemmer()
words = [porter.stem(t) for t in words]

wnl = nltk.WordNetLemmatizer()
lemma = [wnl.lemmatize(t) for t in words]

# frequency distribution over the document's tokens
fdist_document = nltk.FreqDist(tokens)

# P(tk|d): the term's share of all lemma occurrences in the document
def p_tk_d(term):
    return fdist_document[term] / sum(fdist_document[t] for t in lemma)

The KLD is given as D(p||q) = Sum-over-t[pt * log(pt/qt)]
where pt is for the distribution consisting of only the one keyword and is a constant = 1 for that term,
and qt is the distribution nt/n for that term.
In our case we don't need to aggregate the KLD component pt * log(pt/qt) since we consider only one term. I take this back: the distribution is based on the same set of terms for both p and q, and when a term doesn't appear in p or q, it is given an epsilon probability.
We select the terms where the component pt * log(pt/qt) of KLD(p||q) is > 1.

KLD-term = 1 * log(1 / (nt/n)) if we considered only one term
KLD-term = 1 * log(1 / (nt/n)) + Sum-over-t'[epsilon * log(epsilon / (nt'/n))] over all other terms t'

Alternatively, we could use Bigi's equation, normalizing each term's divergence by the divergence of the document from the empty distribution:

KLD-norm(d, t) = KLD(d, t) / KLD(d, 0)

KLD(d, t) = Sum-over-w[(P(w, d) - P(w, t)) * log(P(w, d) / P(w, t))]

= (nt/n - 1) * log(nt/n) if we considered only one term
= (nt/n - 1) * log(nt/n) + Sum-over-t'[((nt'/n) - epsilon) * log((nt'/n) / epsilon)] for all other terms t'

Note that this considers the occurrence and non-occurrence of terms but not their density, for which we would want to add weights in the probability computation above.

Thursday, January 16, 2014

We now look at Partition Primary Index tables in the Teradata server.
Each table in Teradata has a primary index unless it's a NoPI table. Partitioning tells the AMPs to sort the table rows by partition first, and then by Row-ID within each partition. Partitioning is a way to prevent full table scans.
Partitions can be specified with Range_N options such as for creating a partition each day/week/month/year.
The query would look like:
CREATE TABLE Order_Table_PPI
(
Order_Number INTEGER NOT NULL,
Order_Date DATE
) PRIMARY INDEX (Order_Number)
PARTITION BY RANGE_N (Order_Date BETWEEN DATE '2013-01-01' AND DATE '2013-12-31'
EACH INTERVAL '7' DAY);
This would create about 52 weekly partitions. Or the partitions could be based on the size of each partition.
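As a sketch, a query that constrains the partitioning column lets the optimizer read only the relevant partitions:
SELECT Order_Number
FROM Order_Table_PPI
WHERE Order_Date BETWEEN DATE '2013-06-01' AND DATE '2013-06-07';
Only the weekly partitions covering that range are scanned, not the full table.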
Partitions can be of arbitrary number; Teradata supports a very large maximum.
Older data and newer data in PPI can be combined.
Partitions can be nested too; the top partition level is usually the only one that can be altered. Up to 15 levels of RANGE_N or CASE_N partitions can be nested.
A primary index must be non-unique on all partitioned tables unless the primary index column is also used to define the partition. Most PPI tables are defined with a NUPI.
Character-based PPIs are also common.
Partitioning can be rolled forward, archiving the rows that fall out of range: ALTER TABLE Order_Table TO CURRENT WITH INSERT INTO Order_Table_History;
Columnar table partitions are also used often. Columnar tables must be NoPI tables. Since the partitions are along columns, the columns are placed inside their own containers. All containers have the same number of rows. Each container looks like a small table for I/O purposes. Columnar tables make sense when users query only certain columns.
In a columnar table, when a row is deleted, it is not physically deleted but just marked deleted.
Columnar I/O can move just one container to memory. Containers match up row for row to rebuild a row.
Containers can have more than one column.
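A minimal sketch, assuming the columnar feature is available and using hypothetical names:
CREATE TABLE Sales_Columnar
(
Sale_Date DATE,
Product_Id INTEGER,
Amount DECIMAL(10,2)
) NO PRIMARY INDEX
PARTITION BY COLUMN;
A query touching only Amount then reads just that column's containers.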