Cluster computing

Sunday, January 19, 2014

interview questions
If a table has students and another table has classes offered at a school, how do we find students who are not enrolled in any class.
We maintain another table StudentClass with foreign keys and non-null values StudentID and ClassID.
SELECT s.name FROM StudentClass sc
RIGHT OUTER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.ClassID = NULL
If we wanted a count of all students enrolled in more than one class
SELECT s.name, count(*) as NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
GROUP BY s.name
HAVING NumClasses > 0
if we wanted to exclude a particular Class ID say 1
SELECT s.name, count(*) as NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
GROUP BY s.name
WHERE sc.ClassID IS NOT NULL AND sc.ClassID <> 1
HAVING NumClasses > 0

Saturday, January 18, 2014

In this post, we talk about temporal tables create function in Teradata
There are three types of temporal tables.
- valid time temporal tables
- transaction time temporal tables
- bi-temporal tables that are a combination of both.
These tables are created with the PERIOD data type that introduces a begin and end timestamp.
The only difference is that the ValidTime can be a date or a timestamp. The Transaction Time has to be a timestamp.
The timestamp changes on update both before and after updates as in the case of Bi-Temporal tables. The non-sequenced ValidTime gives the data as the way things were while the current valid time gives the way the data is.
We will now look at how joins work internally in Teradata.
Teradata requires that for two rows to be joined together, both rows are physically on the same AMP.
One or both tables are redistributed or the smaller table is duplicated across all AMPs
Based on the column for the join, the matching rows from different AMPs are brought together on the same AMP. A volatile table is used for this purposes. On all joins, the matching rows must be on the same AMP and hashing is used to bring the rows together.
If two tables were to be joined thousand times a day and given that there will be the same primary index on the PK/FK join condition, there will be no data movement because the rows will already be hashed to the same AMP. The EXPLAIN operator shows this as Row Hash Match join. In fact, it is preferable to have a where clause with the primary index where this is not the case. The volatile tables are usually deleted on logoff.
There are three types of temporary tables:
1)Derived Table
This exists only within a query and is materialized by the SELECT statement inside a query.
The space is used on the users spool space and it is deleted at the end of the query.
Derived tables are usually specified within a parenthesis in the SQL or with a WITH clause. Columns can be aliased and can default to normal columns. Most derived tables are used join to other tables.
2)The Volatile tables are created by the user and materialized with an INSERT/SELECT
The space is used on the users spool space and the table and data are deleted when the user logs off.
The ON COMMIT DELETE ROWS can be explicitly specified so that the volatile table is deleted at the end of the query.
The HELP command shows the volatile tables.
The volatile tables can have primary index such that when used for joins between tables, there is no movement of data.
Statistics can also be collected on volatile table so as to improve performance. For example, we can have this for all non-unique primary indexes (NuPI), unique primary indexes of small tables i.e with less than thousand rows per AMP, columns that appear frequently in WHERE search conditions and non-indexed columns used in joins, and partitioning column of a PPI table.
3)Then there are global temporary tables
Here the table definition is specified by the user and the table definition is permanent. It is materialized with an INSERT/SELECT.
The space is used on the user's TEMP space. When the user logs off the session, the data is deleted, but the table definition stays. This can be shared by many users who can then populate the same table but with their own copy.

Friday, January 17, 2014

We return to the discussion on Teradata.
Teradata has an admin user called DBC. This comes with each system.
User DBC has always been at the top of the Teradata hierarchy. An admin user is explicitly provisioned to facilitate addition of other users. Some space is reserved for user DBA. Space on the disk for each AMP is divided into PERM space and SPOOL space. PERM space is reserved for tables. Secondary indexes, join indexes and permanent journals are also stored in PERM. SPOOL is used for user queries and answer sets.
Disks are not shared between AMPs.
Every user is assigned spool space so they can submit SQL and retrieve answer sets. If the user exceeds this space, the query is aborted.
The spool limits are so that long queries can be terminated and individual users don't hog the system.
Each user's spool limit is actually done per AMP. If the space is exceeded on any AMP, the query is aborted.
The DBC data dictionary shows how much space is reserved. When the space per AMP shows skew, typically this is due to NULL data.
The access is usually granted to users to read views which are in turn granted access to read tables.
A database or a user can be assigned PERM space and spool space.
Objects that take up PERM space include Table rows, fallback tables, secondary index subtables, stored procedures, user defined functions, permanent journals etc.
Access rights can be granted to user. There are three types of access rights:
automatic rights which are default and granted to owners of created tables.
Implicit rights which are ownership rights and are implicit to parents and anybody above in the user hierarchy,
Explicit rights which are privileges a user explicitly grants to another user with the grant to statement.
These rights may be explicitly removed
Profiles and roles are different. Profiles are for people and roles are for rights. A profile can be assigned to a group of users so that they operate under a common set of attributes. for example a setting the can be used to override that of the users.
Account strings can be expanded to include logon time, date, hour, session ID, number etc.
The Teradata Active System Management TASM controls the query traffic. User accounts are given priority and rules setup to control system resources.
Secondary index can be created. If it is unique, it can be hashed.
A USI is a two AMP operation because the first AMP is assigned to read the subtable and the second the base table. Records are looked up in both via binary searches.
A non-unique secondary Index table is AMP_Local.
A value ordered NUSI can also be created.

Today we take a short break from our discussion on Teradata
Using Kullback-Leibler divergence to extract keywords from a single document:
This approach talks about a simple way to extract keywords using term weighting.
Terms are weighted based on the Kullback-Leibler divergence D(avg probability of term t/q) from the background distribution q.
The terms are evaluated one by one and if their divergence is above a cutoff, they are selected.
Initially, we calculate the expected probability of a term by counting its occurrence in the document. We could later refine this to be a lookup against a dictionary of words and their expected probabilities as determined independently from a corpus of text.
We calculate nw as the total number of terms where w appears.
and pw as the total number of terms where w appears divided by the total number of terms in the document
and we measured the P(tk/q) as [nw / Sum-for-all-terms-x-in-q (nw)] .
By setting an adjustable threshold for the cutoff, we extract variable number of keywords.
pseudo-code in python syntax for finding KLD from a single document:
tokens = nltk.word_tokenize(document)
words = sorted(set(tokens))
porter = nltk.PorterStemmer()
words = [porter.stem(t) for t in words]

wnl = nltk.WordNetLemmatizer()
lemma = [wnl.lemmatize(t) for t in words]

fdist-document = nltk.FreqDist(tokens)
def Ptk-d(term): return fdist-document[term] / sum([fdist-document[term] for term in lemma])

The KLD is given as D(p||q) = Sigma-t-from1ton[pt.log(pt/qt)]
where pt is for the distribution consisting of only the one keyword and is a constant = 1 for that term
qt is the distribution nt/n for that term
In our case we don't need to aggregate the KLD component pt.log(pt/qt) since we consider only one term. I take this back, the distribution is based on the same set of terms for both p and q. When a term doesn't appear in p or q, then it is given an epsilon probability.
we select the terms where pt.log(pt/qt) = KLD(p|q) > 1

KLD-term = 1 * log (1/(nt/n) for term in lemma)]) if we considered only one term
KLD-term = 1 * log (1/(nt/n)) + Sum(epsilon.log(epsilon/(nt'/n))) for all other terms t'

Alternatively, we could use Bigi's equation as

KLD(d, t) = KLD(d,t) / KLD(d,0)

KLD(d,t) = SUM((P(t,d) – P(t,t)) x log(P(t,d)/P(t,t)))

= (Nt/n – 1) * log(nt/n) if we considered only one term
= (Nt/n – 1) * log(nt/n) + SUM(((nt/n)-epsilon) x log((nt/n)/epsilon)) for all terms

Note that this considers the occurrence and non-occurrence of terms but not their density for which we want to add weights in the probability computation above.

Thursday, January 16, 2014

We now look at Partition Primary Index tables in the Teradata server.
Each table in Teradata has a primary Index unless its a NoPI table. Partitioning tells the AMPs to sort the table rows by partition first, and then sort the rows of a table with the Row-ID. Partitioning is a way to prevent FULL table scans.
Partitions can be specified with Range_N options such as for creating a partition each day/week/month/year.
The query would look like CREATE TABLE Order_Table_PPI
(
Order_Number INTEGER NOT NULL,
Order_Date Date
) PRIMARY INDEX (Order_Number)
Partition by Range_N(OrderDate BETWEEN date '2013-01-01' AND DATE '2013-12-31'
EACH INTERVAL '7' DAY);
this would create about 52 partitions. Or the partitions could be based on each partition size.
Partitions can be of arbitrary number. Teradata supports a very very large number.
Older data and newer data in PPI can be combined.
Partitions can be nested too. The top partition is usually the only one that can be altered. Upto 15 levels of Range_N or Case_N partitions can be nested.
A primary index must be non-unique on all partitioned tables unless the primary index column is also used to define the partition. Most PPI tables are defined as NUPI.
Character based PPI are also common.
Alter Table Order_Table to current with insert into Order_table_History
Columnar table partitions are also used often. Columnar tables must be NoPI tables. Since the partitions are along columns, the columns are placed inside their own container. All containers have the same amount of rows. Each container looks like a small table for I/O purposes. Columnar tables make sense when users query only certain data.
In a columnar table when a row is deleted, it is not physically deleted but just marked deleted.
Columnar can move just one container to memory. Containers match up perfectly to rebuild a row.
Containers can have more than one columns.

We continue our discussion on Teradata by examining the AMPs. Rows are stored in Data blocks which are stored in cylinders. AMPs store the data block header, the data block and the row reference array for each table. AMP may have many cylinders and each cylinder may have many data blocks. An AMPs master index is used to find the right cylinder. The master index is maintained in memory.
An AMP's master index may point to thousands of cylinder indexes. Each cylinder index is on that cylinder and is usually not in FSG cache.
When the AMPs data blocks are split, the cylinders grow. The blocks can be maximum size of 255 or 2047 sectors (1MB) and the maximum row size can be 64,255 bytes.
Seeing how the data is arranged in an AMPs disk, let us understand how the Teradata stores data in blocks and moves it inside FSG Cache.
Cylinders are arranged in tracks on the disk. The hottest cylinders are arranged on the external perimeter so that they can be read fast in one disk spin. The coolest cylinders are stored at the center where they may be rarely accessed.
When data is loaded into a table, it is automatically deemed hot since the table is most likely staging. There can be an option to specify the intial temperature with a session setting called Query_Band
Cylinders are also used for Perm, Spool, Temp data and Journals.
There is a free cylinder list where new data is inserted if the block cannot be accommodated in an existing cylinder.
Note that the full table scan now has these detailed steps from the organization we have presented:
The user runs a query.
The PE checks, the syntax and access rights and generates a plan
The plan is full table scan so its broadcasted to all the BYNETs
Each AMP already has their Master Index inside their memory
Each AMP brings in the table header into FSG Cache.
Each AMP then reads the Master index to find the cylinder that holds the data blocks
Each AMP reads the cylinder index by bringing it into FSG Cache
The Table Header and the data blocks are both in FSG Cache
The AMP places the information inside spool and reports to PE
The PE then requests the spools and delivers the MasterIndex, cylinderindex, table header and data block to the user.

Wednesday, January 15, 2014

We mentioned hashing of rows in Teradata. We also sort the rows based on this row id. AMPs sort their rows by Row-ID to Group like Data . Since the Row-IDs are sorted, the AMPs use binary search. NULL values all hash to the same AMP. Note that this means data is skewed. So handling NULLs can be avoided by having Unique Primary Index or a multi-column primary index. A Unique Primary Index will spread the data perfectly evenly. A multi-column can overcome the data skewing by NULL values.
A full table scan is performed by the participation of all AMPs. All AMPs read all of their rows because there is no Primary Index.
When data is loaded, each AMP holds a portion of the rows for every table. When queries are run, each AMP simultaneously retrieves the rows. The table header and data rows are stored on each AMP. The table header is created on each AMP when the table is created.
Each AMP stores the rows they own inside a data block. Data is read into each AMPs cache called the First System Generating Cache (FSG Cache). When the header and the data is in memory, the AMP can read and process.
When a proper index is chosen, a single AMP needs to move only one block of data into the FSG cache. When the rows are inserted into the block and the block grows, the block is split. If a block moves into the FSG Cache
Synchronized Scan (Sync Scan) is another feature. When there are multiple users reading different blocks, the system shares reads between multiple scans at the block level. New query joins scans at the current scan point.
The system wraps around after reading all the blocks so the late comer gets the blocks they missed. Scan could have also started before the query ran.
Teradata has a measure for data temperature i.e how hot or frequently used it is. The hottest data is kept in memory automatically. There is no DBA intervention or application changes required. While FSG Cache is used to move data from disk temporarily, this intelligent memory doesn't purge the data in memory but keeps it there. The data can stay there for days.
Both are complimentary to each other. Most frequently queried tables, dictionaries and anything that requires sub-second data are all valid candidates. Data deemed very hot stays in each AMPs intelligent memory.
The physical database is also designed to help with queries.
First, the primary index is chosen such that it is most often used by users to find rows.
Second, the primary index will reduce the data transfers from disk.
Third, the secondary indexes will limit the number of AMPs used and data blocks transferred. Upto 32 secondary indexes can be used.
Fourth, the tables are partitioned to sort the rows horizontally.
Fifth, the tables can be partitioned vertically so that the transfer of columnar blocks can be efficient.
Sixth, the modeled tables can be denormalized.