Friday, January 17, 2014

We return to the discussion on Teradata.
Teradata has an admin user called DBC. This comes with each system.
User DBC sits at the top of the Teradata hierarchy. An administrative user is explicitly provisioned beneath it to facilitate the addition of other users, and some space is reserved for this admin user. The disk space on each AMP is divided into PERM space and SPOOL space. PERM space is reserved for tables; secondary indexes, join indexes and permanent journals are also stored in PERM. SPOOL space is used for user queries and answer sets.
 Disks are not shared between AMPs.
Every user is assigned spool space so they can submit SQL and retrieve answer sets. If the user exceeds this space, the query is aborted.
Spool limits exist so that runaway queries can be terminated and individual users cannot hog the system.
Each user's spool limit is actually enforced per AMP: if the limit is exceeded on any single AMP, the query is aborted.
The DBC data dictionary shows how much space is reserved. When the space per AMP is skewed, this is typically due to NULL data.
Access is usually granted to users on views, and the views in turn are granted the right to read the underlying tables.
A database or a user can be assigned PERM space and SPOOL space.
Objects that take up PERM space include table rows, fallback tables, secondary index subtables, stored procedures, user-defined functions, permanent journals and so on.
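As a sketch of how these allocations look in DDL (the user name, password and byte counts here are illustrative, not from the text):

CREATE USER Sample_User FROM DBC
AS PERMANENT = 100000000,  -- PERM space for tables, in bytes
   SPOOL = 500000000,      -- SPOOL space for answer sets, in bytes
   PASSWORD = Secret123;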
Access rights can be granted to a user. There are three types of access rights:
Automatic rights, which are granted by default to the owner/creator of a created object;
Implicit rights, which are ownership rights and belong implicitly to the parents and anybody above them in the user hierarchy;
Explicit rights, which are privileges one user explicitly grants to another with the GRANT statement.
Explicit rights may also be explicitly removed, as sketched below.
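A minimal sketch of granting and then revoking an explicit right (the object and user names are made up):

GRANT SELECT ON Sales_Views.Order_View TO Sample_User;
REVOKE SELECT ON Sales_Views.Order_View FROM Sample_User;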
Profiles and roles are different: profiles are for people and roles are for rights. A profile can be assigned to a group of users so that they operate under a common set of attributes; for example, a profile setting such as the spool limit can override the corresponding setting on the individual users.
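A hedged sketch of both, with all names and values illustrative:

CREATE PROFILE Analyst_Profile AS SPOOL = 500000000;  -- common spool limit for the group
MODIFY USER Sample_User AS PROFILE = Analyst_Profile; -- the profile setting overrides the user's own

CREATE ROLE Read_Only_Role;                           -- a named bundle of rights
GRANT SELECT ON Sales_Views TO Read_Only_Role;
GRANT Read_Only_Role TO Sample_User;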
Account strings can be expanded to include the logon time, date, hour, session ID and so on.
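For illustration, the expansion is done with substitution variables in the account string; the account name below is made up, and &D and &H are the date and hour variables:

MODIFY USER Sample_User AS ACCOUNT = '$M_Marketing_&D&H';
-- At logon, &D expands to the date and &H to the hour, so each
-- session's account string records when the user logged on.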
Teradata Active System Management (TASM) controls query traffic. User accounts are given priorities, and rules are set up to control system resources.
Secondary indexes can also be created. For a unique secondary index, the subtable rows are hash-distributed like base table rows.
A USI access is a two-AMP operation: one AMP reads the index subtable and a second AMP reads the base table. Rows are looked up in both via binary search.
A non-unique secondary index (NUSI) subtable is AMP-local.
A value-ordered NUSI can also be created, as sketched below.
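A quick sketch of the three variants (table and column names are illustrative):

CREATE UNIQUE INDEX (Customer_Number) ON Customer_Table;   -- USI: hash-distributed subtable
CREATE INDEX (Last_Name) ON Customer_Table;                -- NUSI: AMP-local subtable
CREATE INDEX (Order_Date) ORDER BY VALUES ON Order_Table;  -- value-ordered NUSI for range queries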
Today we take a short break from our discussion on Teradata.
Using Kullback-Leibler divergence to extract keywords from a single document:
This approach describes a simple way to extract keywords using term weighting.
Terms are weighted by the Kullback-Leibler divergence D(p || q) of a term's document distribution p from the background distribution q.
The terms are evaluated one by one and if their divergence is above a cutoff, they are selected.
Initially, we calculate the expected probability of a term by counting its occurrence in the document. We could later refine this to be a lookup against a dictionary of words and their expected probabilities as determined independently from a corpus of text.
We calculate n_w as the number of times the term w appears in the document,
and p_w as n_w divided by the total number of terms in the document,
and we measure P(t_k | q) as n_w / Σ_x n_x, where the sum runs over all terms x in the background distribution q.
By setting an adjustable threshold for the cutoff, we extract a variable number of keywords.
Pseudo-code in Python syntax for finding the KLD from a single document:

import nltk

# Tokenize the document, then normalize each token by stemming and lemmatizing.
tokens = nltk.word_tokenize(document)
porter = nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()
normalized = [wnl.lemmatize(porter.stem(t)) for t in tokens]
lemma = sorted(set(normalized))

# Count frequencies over the normalized tokens, so the lookups below
# use the same forms that were counted.
fdist_document = nltk.FreqDist(normalized)
total = float(sum(fdist_document[t] for t in lemma))

def p_tk_d(term):
    # Expected probability of a term from its occurrence count in the document.
    return fdist_document[term] / total

The KLD is given as D(p || q) = Σ_t p_t · log(p_t / q_t), with the sum over the terms t = 1..n,
where p_t comes from the distribution consisting of only the one keyword and is a constant = 1 for that term,
and q_t is the background distribution n_t / n for that term.
At first glance we don't need to aggregate the KLD components p_t · log(p_t / q_t) since we consider only one term; on second thought, the distribution must be over the same set of terms for both p and q, so when a term doesn't appear in p or q it is given a small epsilon probability ε.
We select the terms where p_t · log(p_t / q_t) = KLD(p || q) > 1.

KLD_term = 1 · log(1 / (n_t / n)) if we considered only the one term
KLD_term = 1 · log(1 / (n_t / n)) + Σ_t' ε · log(ε / (n_t' / n)) over all other terms t'

Alternatively, we could use Bigi's equation, which normalizes the divergence:

KLD_norm(d, t) = KLD(d, t) / KLD(d, 0)

KLD(d, t) = Σ_w (P(w, d) − P(w, t)) · log(P(w, d) / P(w, t))

= (n_t/n − 1) · log(n_t/n) if we considered only the one term
= (n_t/n − 1) · log(n_t/n) + Σ_t' (n_t'/n − ε) · log((n_t'/n) / ε) over all other terms t'

Note that this considers the occurrence and non-occurrence of terms but not their density, for which we would add weights in the probability computation above.

Thursday, January 16, 2014

We now look at Partitioned Primary Index (PPI) tables in the Teradata server.
Each table in Teradata has a primary index unless it's a NoPI table. Partitioning tells the AMPs to sort the table rows by partition first and then, within each partition, by Row-ID. Partitioning is a way to prevent full-table scans.
Partitions can be specified with RANGE_N options, such as creating a partition for each day, week, month or year.
The query would look like:
CREATE TABLE Order_Table_PPI
(
Order_Number INTEGER NOT NULL,
Order_Date DATE
) PRIMARY INDEX (Order_Number)
PARTITION BY RANGE_N(Order_Date BETWEEN DATE '2013-01-01' AND DATE '2013-12-31'
EACH INTERVAL '7' DAY);
This would create about 52 partitions. Alternatively, the ranges could be chosen based on the desired partition size.
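To see why this prevents full-table scans, consider a range predicate on the partitioning column (the query itself is illustrative); the optimizer can eliminate every partition outside the range:

SELECT Order_Number, Order_Date
FROM Order_Table_PPI
WHERE Order_Date BETWEEN DATE '2013-01-01' AND DATE '2013-01-31';
-- Only the handful of weekly partitions covering January are scanned.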
Partitions can be of arbitrary number; Teradata supports a very large number of them.
Older data and newer data can coexist in the same PPI table.
Partitions can be nested too. The top-level partition is usually the only one that can be altered. Up to 15 levels of RANGE_N or CASE_N partitions can be nested.
A primary index must be non-unique on all partitioned tables unless the primary index column is also used to define the partition. Most PPI tables are defined as NUPI.
Character based PPI are also common.
Older data can be rolled off with a statement like: ALTER TABLE Order_Table TO CURRENT WITH INSERT INTO Order_Table_History; rows that no longer fall within a defined partition are moved into the history table.
Columnar table partitions are also used often. Columnar tables must be NoPI tables. Since the partitions are along columns, each column is placed inside its own container. All containers have the same number of rows. Each container looks like a small table for I/O purposes. Columnar tables make sense when users query only certain columns.
In a columnar table when a row is deleted, it is not physically deleted but just marked deleted.
Columnar can move just one container to memory. Containers match up perfectly to rebuild a row.
Containers can hold more than one column.
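A minimal sketch of a columnar table, with illustrative names:

CREATE TABLE Sales_Columnar
(
Sale_Id INTEGER,
Sale_Date DATE,
Amount DECIMAL(10,2)
) NO PRIMARY INDEX
PARTITION BY COLUMN;
-- Each column goes into its own container; a query touching only
-- Amount reads just that container.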
We continue our discussion on Teradata by examining the AMPs. Rows are stored in data blocks, which are stored in cylinders. For each table, an AMP stores the data block header, the data block and the row reference array. An AMP may have many cylinders, and each cylinder may have many data blocks. An AMP's master index is used to find the right cylinder. The master index is maintained in memory.
An AMP's master index may point to thousands of cylinder indexes. Each cylinder index resides on its cylinder and is usually not in FSG cache.
When an AMP's data blocks split, the cylinders grow. Blocks can have a maximum size of 255 or 2,047 sectors (1 MB), and the maximum row size is 64,255 bytes.
Having seen how data is arranged on an AMP's disk, let us understand how Teradata stores data in blocks and moves it into the FSG cache.
Cylinders are arranged in tracks on the disk. The hottest cylinders are arranged on the external perimeter so that they can be read fast in one disk spin. The coolest cylinders are stored at the center where they may be rarely accessed.
When data is loaded into a table, it is automatically deemed hot, since the table is most likely a staging table. There is an option to specify the initial temperature with a session setting called a query band, as sketched below.
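For illustration, the general query band syntax is shown here; the specific TVSTemperature band name is my assumption for the temperature hint, so treat it as a placeholder:

SET QUERY_BAND = 'TVSTemperature=HOT;' FOR SESSION;  -- hypothetical band name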
Cylinders are also used for Perm, Spool, Temp data and Journals.
There is a free cylinder list where new data is inserted if the block cannot be accommodated in an existing cylinder.
Note that the full-table scan now has these detailed steps, given the organization we have presented:
The user runs a query.
The PE checks the syntax and access rights and generates a plan.
The plan is a full-table scan, so it is broadcast over the BYNET to all the AMPs.
Each AMP already has its master index in memory.
Each AMP brings the table header into FSG cache.
Each AMP then reads its master index to find the cylinders that hold the data blocks.
Each AMP reads the cylinder index by bringing it into FSG cache.
The table header and the data blocks are now both in FSG cache.
Each AMP places its qualifying rows into spool and reports to the PE.
The PE then collects the spool results and delivers the final answer set to the user.


Wednesday, January 15, 2014

We mentioned hashing of rows in Teradata. The rows are also sorted by this Row-ID; AMPs sort their rows by Row-ID to group like data together. Since the Row-IDs are sorted, the AMPs can use a binary search. NULL values all hash to the same AMP, which means the data is skewed. Such NULL-driven skew can be avoided with a unique primary index or a multi-column primary index: a unique primary index spreads the data perfectly evenly, and a multi-column primary index can overcome the skew caused by NULL values in a single column.
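As a sketch (the names are illustrative), a multi-column primary index hashes the combination of columns, so a NULL in one column no longer sends every such row to the same AMP:

CREATE TABLE Claim_Table
(
Claim_Id INTEGER,
Member_Id INTEGER,
Claim_Date DATE
) PRIMARY INDEX (Member_Id, Claim_Date);
-- Rows with a NULL Member_Id still spread out, because Claim_Date
-- also feeds the hash.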
A full-table scan involves the participation of all AMPs: every AMP reads all of its rows for the table because no index narrows the search.
When data is loaded, each AMP holds a portion of the rows for every table. When queries are run, each AMP simultaneously retrieves the rows. The table header and data rows are stored on each AMP. The table header is created on each AMP when the table is created.
Each AMP stores the rows it owns inside data blocks. Data is read into each AMP's cache, called the FSG (File SeGment) cache. When the table header and the data are in memory, the AMP can read and process them.
When a proper index is chosen, a single AMP needs to move only one block of data into the FSG cache. When rows are inserted and a block grows beyond its maximum size, the block is split.
Synchronized scan (sync scan) is another feature: when multiple users are scanning the same table, the system shares block-level reads between the scans, and a new query joins at the current scan point.
The system wraps around after reading all the blocks, so the latecomer eventually gets the blocks it missed; from that query's perspective, the scan effectively started before the query ran.
Teradata has a measure of data temperature, i.e., how hot or frequently used the data is. The hottest data is kept in memory automatically; no DBA intervention or application changes are required. While the FSG cache holds data from disk only temporarily, this intelligent memory doesn't purge the data but keeps it resident, where it can stay for days.
The two are complementary. The most frequently queried tables, dictionary tables and anything that requires sub-second answers are all valid candidates; data deemed very hot stays in each AMP's intelligent memory.
The physical database is also designed to help with queries.
First, the primary index is chosen such that it is most often used by users to find rows.
Second, the primary index will reduce the data transfers from disk.
Third, the secondary indexes will limit the number of AMPs used and the data blocks transferred. Up to 32 secondary indexes can be defined.
Fourth, the tables are partitioned to sort the rows horizontally.
Fifth, the tables can be partitioned vertically so that the transfer of columnar blocks can be efficient.
Sixth, the modeled tables can be denormalized.
 
We looked at Teradata's PE tasks. We will now look into what the AMP does.
The AMP is responsible for the following:
storing and retrieving rows from its assigned disks,
locking tables and rows,
sorting rows and doing all aggregation,
handling all join processing,
handling all space management and space accounting.
The AMP executes the plan from the PE. With a WHERE clause on the primary index, there is only a single-AMP retrieve.
By using the keyword EXPLAIN, we can see the parsing engine's plan. A non-unique primary index will have duplicates grouped together on the same AMP. Up to 64 columns can be combined in a multi-column primary index.
A table can have only one primary index. If the table is specified with NO PRIMARY INDEX, it will receive no primary index, and a full-table scan is likely on such a table.
When we type EXPLAIN with a SQL query on a table with no primary index, the plan shows the full-table scan as an all-AMPs retrieve. When the primary index clause is simply omitted, the first column is used as the primary index by default.
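As an illustration (the table name is made up), prefixing any query with EXPLAIN returns the plan rather than the rows:

EXPLAIN SELECT * FROM Staging_Orders WHERE Order_Total > 100;
-- For a table with no usable index, the plan text describes an
-- all-AMPs RETRIEVE step, i.e. a full-table scan.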
NoPI tables are designed to be staging tables. Data from a mainframe or server can be loaded into Teradata quickly with perfect distribution. Data is moved out of staging tables rather than queried in place, so an index is not necessary.
In Teradata, a row's primary index value is hashed, and the output (the row hash) is stored as part of the Row-ID. Since hashing the same index value always produces the same row hash, the row can be located again. There is only one hashing formula.
The hash map determines which AMP will own the row: the row hash is used to look up the AMP in the hash map. Different rows hash to different AMPs, and the hash map gives a global view of where all the rows live.
If we create an index on a column that allows duplicate values, each AMP will append a uniqueness value as a sub-identifier after the row hash to keep track of the duplicates.
The row hash and the uniqueness value together make up the Row-ID.
For a unique primary index, the uniqueness value is 1. Each AMP sorts its rows by Row-ID.
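For illustration, Teradata exposes this hashing chain through the HASHROW, HASHBUCKET and HASHAMP functions, which makes it easy to check how a candidate primary index would distribute rows (the table and column are made up):

SELECT HASHAMP(HASHBUCKET(HASHROW(Customer_Number))) AS Amp_No,
       COUNT(*) AS Row_Count
FROM Customer_Table
GROUP BY 1
ORDER BY 2 DESC;
-- A heavily skewed Row_Count on one AMP signals a poor primary index choice.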

Tuesday, January 14, 2014

In this post we talk about Teradata: its architecture and SQL, as read from the book by Tom Coffing and William Coffing. Teradata is about parallel processing. With each query, a single step is performed in parallel by each AMP. A Teradata system consists of a series of AMPs that work in parallel. Teradata systems usually involve several AMPs for greater ROI; they usually start small and grow.
Nexus is the tool used to load the logical model and display the tables and views visually. It has a Super Join Builder that guides joins and builds the SQL on the joins specified. Nexus works with Teradata and Hadoop.
Data models are inexpensive to test and alter. ERwin from CA is a data modeling tool: it allows you to model your tables, views and the physical model. When the data model is shared with business users, there is more insight, and its use is facilitated by the tool building the SQL.
Teradata's advantages include using Massively Parallel Processing (MPP) to analyze very large data.
In a table of a trillion rows, Teradata can find a single record in under a second using just one AMP. Teradata has sophisticated query traffic control, with rules, times, delays, green lights to run and red lights to wait. Long-running and high-volume queries are not placed in the same category as critical queries.
Teradata gives users their own views of the query processing with Viewpoint.
The Teradata environment enables a variety of queries from large massive joins, sub second tactical queries, partitioned tables for range queries, columnar designs for specific columns combined with secondary indexes, joins and hash indexes etc.
The rows of data are spread across the AMPs so that each user query can be executed over a similar number of rows on each AMP. In parallel processing we distribute the work across the AMPs. Communication is done over the BYNET. The Parsing Engine (PE) takes the user's SQL and builds a plan for each AMP.
All Teradata tables spread their rows equally across the AMPs. Teradata systems can add AMPs for linear scalability. Teradata can scale to arbitrary size.
AMPs and PEs are called virtual processors. Each SMP (symmetric multiprocessing) node has both kinds, and typically many of each. Each node is attached via a network to a disk farm, and each AMP in the node is assigned its own virtual disk.
When nodes are connected to the BYNETs, they become part of one large MPP system. Each node addition brings in additional PEs, AMPs, and disks.
Nodes, disks and BYNETs are usually housed inside what is called a cabinet. Gateway and channel driver software runs as processes: one controls communication with the LAN and the other connects to the mainframe. The access module processors (AMPs) and PEs live in memory and are controlled by the Parallel Database Extensions (PDE), usually running on a Linux operating system. While the BYNETs enable communication between nodes, there is also a boardless BYNET within each node, over which the PEs and AMPs communicate. Each PE can manage up to 120 individual sessions. When a user logs into Teradata, a PE logs them in and manages the session. The PE also checks the SQL syntax, creates the EXPLAIN, checks security and builds a plan; it uses COLLECTED STATISTICS to build the best plan, and it delivers the final answer set to the user.