Thursday, January 16, 2014

We continue our discussion on Teradata by examining the AMPs. Rows are stored in data blocks, which are stored in cylinders. For each table, an AMP stores the data block header, the data block, and the row reference array. An AMP may have many cylinders, and each cylinder may have many data blocks. An AMP's master index is used to find the right cylinder; the master index is maintained in memory.
An AMP's master index may point to thousands of cylinder indexes. Each cylinder index resides on its cylinder and is usually not in the FSG cache.
As rows are inserted and data blocks split, cylinders fill and grow. A block can have a maximum size of 255 or 2,047 sectors (about 1 MB), and the maximum row size is 64,255 bytes.
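As a rough mental model of this two-level lookup (a memory-resident master index pointing at per-cylinder indexes), here is a hypothetical Java sketch using sorted maps; the cylinder and block numbers are invented for illustration:

```java
import java.util.TreeMap;

// Hypothetical model: the master index maps the lowest row-ID on a cylinder
// to that cylinder; each cylinder index maps the lowest row-ID in a data
// block to that block. Real Teradata structures are far richer than this.
public class AmpIndexSketch {
    static final TreeMap<Integer, Integer> masterIndex = new TreeMap<>();
    static final TreeMap<Integer, TreeMap<Integer, Integer>> cylinderIndexes = new TreeMap<>();

    static {
        masterIndex.put(0, 1);      // row-IDs 0..999 live on cylinder 1
        masterIndex.put(1000, 2);   // row-IDs 1000 and up live on cylinder 2
        TreeMap<Integer, Integer> c1 = new TreeMap<>();
        c1.put(0, 10);              // blocks 10 and 11 on cylinder 1
        c1.put(500, 11);
        TreeMap<Integer, Integer> c2 = new TreeMap<>();
        c2.put(1000, 20);           // block 20 on cylinder 2
        cylinderIndexes.put(1, c1);
        cylinderIndexes.put(2, c2);
    }

    // master index (in memory) -> cylinder index (read from the cylinder) -> block
    static int findBlock(int rowId) {
        int cylinder = masterIndex.floorEntry(rowId).getValue();
        return cylinderIndexes.get(cylinder).floorEntry(rowId).getValue();
    }

    public static void main(String[] args) {
        System.out.println(findBlock(650));   // row 650 falls in block 11 on cylinder 1
    }
}
```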
Having seen how data is arranged on an AMP's disk, let us look at how Teradata stores data in blocks and moves it into the FSG cache.
Cylinders are arranged in tracks on the disk. The hottest cylinders are placed on the outer perimeter, where they can be read fastest in one disk revolution. The coolest cylinders are stored near the center, where they may be rarely accessed.
When data is loaded into a table, it is automatically deemed hot, since the table is most likely a staging table. The initial temperature can also be specified with a session setting called Query_Band.
Cylinders are also used for Perm, Spool, Temp data and Journals.
There is a free cylinder list from which a new cylinder is taken when a block cannot be accommodated in an existing cylinder.
Note that a full table scan now has these detailed steps under the organization we have presented:
The user runs a query.
The PE checks the syntax and access rights and generates a plan.
The plan is a full table scan, so it is broadcast over the BYNET to all AMPs.
Each AMP already has its Master Index in memory.
Each AMP brings the table header into the FSG cache.
Each AMP then reads its Master Index to find the cylinders that hold the data blocks.
Each AMP reads each cylinder index by bringing it into the FSG cache.
The table header and the data blocks are now both in the FSG cache.
Each AMP places the qualifying rows in spool and reports to the PE.
The PE then collects the spools and delivers the final answer set to the user.
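The scan steps above can be sketched as parallel per-AMP work followed by a merge. This is a toy Java illustration, not the actual engine; the row values and the greater-than predicate are invented:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy illustration of the scan steps above: every AMP scans all of the
// rows it owns in parallel, keeps the qualifying ones in a per-AMP "spool",
// and the PE merges the spools into the final answer set.
public class FullTableScanSketch {
    static Set<Integer> fullTableScan(List<List<Integer>> ampRows, int greaterThan) {
        return ampRows.parallelStream()                    // all AMPs work at once
                .map(rows -> rows.stream()                 // each AMP reads every row it owns
                        .filter(r -> r > greaterThan)
                        .collect(Collectors.toList()))     // per-AMP spool
                .flatMap(List::stream)                     // PE collects the spools
                .collect(Collectors.toCollection(HashSet::new));
    }

    public static void main(String[] args) {
        // three hypothetical AMPs, each owning a few rows
        List<List<Integer>> amps = List.of(List.of(1, 8), List.of(5, 12), List.of(3, 9));
        System.out.println(fullTableScan(amps, 4));        // rows greater than 4
    }
}
```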



 

Wednesday, January 15, 2014

We mentioned the hashing of rows in Teradata. The rows are also sorted by this Row-ID: AMPs sort their rows by Row-ID to group like data together. Because Row-IDs are sorted, the AMPs can use a binary search. NULL values all hash to the same AMP, which means the data is skewed. This skew can be avoided with a unique primary index or a multi-column primary index: a unique primary index spreads the data perfectly evenly, and a multi-column index can overcome the skewing caused by NULL values.
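A tiny Java sketch shows why every NULL lands on the same AMP; the hash function and AMP count here are invented stand-ins, not Teradata's actual hashing formula:

```java
import java.util.Objects;

// Hypothetical sketch: a row goes to the AMP given by hashing its primary
// index value. Every NULL produces the same hash, so all NULL rows skew
// onto a single AMP. The hash and AMP count are invented.
public class NullSkewSketch {
    static int ampFor(Object primaryIndexValue, int ampCount) {
        int rowHash = Objects.hashCode(primaryIndexValue); // NULL hashes to 0 here
        return Math.floorMod(rowHash, ampCount);           // stand-in for the HashMap lookup
    }

    public static void main(String[] args) {
        System.out.println(ampFor(null, 4));      // every NULL row lands on this AMP
        System.out.println(ampFor(null, 4));      // ...the same AMP again: skew
        System.out.println(ampFor("alice", 4));   // non-NULL values spread out
    }
}
```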
A full table scan is performed with the participation of all AMPs: each AMP reads all of its rows, because there is no Primary Index to narrow the search.
When data is loaded, each AMP holds a portion of the rows for every table. When queries are run, each AMP simultaneously retrieves the rows. The table header and data rows are stored on each AMP. The table header is created on each AMP when the table is created.
Each AMP stores the rows it owns inside data blocks. Data is read into each AMP's cache, called the FSG (File Segment) cache. When the table header and the data are in memory, the AMP can read and process them.
When a proper index is chosen, a single AMP needs to move only one block of data into the FSG cache. When rows are inserted and a block grows past its maximum size, the block is split. If a block is already in the FSG cache, the AMP can read it without another disk I/O.
Synchronized Scan (Sync Scan) is another feature. When multiple users are reading different blocks of the same table, the system shares reads between the scans at the block level: a new query joins an in-progress scan at its current scan point.
After all the blocks have been read, the scan wraps around so the latecomer gets the blocks it missed. The scan may also have started before the new query ran.
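The wrap-around order can be sketched as modular arithmetic over the block numbers. This is only an illustration of the idea, with invented block counts:

```java
import java.util.Arrays;

// Illustration only: a query that joins a synchronized scan while it is at
// block `joinAt` reads forward to the last block, then wraps around to the
// beginning to pick up the blocks it missed.
public class SyncScanSketch {
    static int[] scanOrder(int blockCount, int joinAt) {
        int[] order = new int[blockCount];
        for (int i = 0; i < blockCount; i++) {
            order[i] = (joinAt + i) % blockCount;  // wrap past the last block
        }
        return order;
    }

    public static void main(String[] args) {
        // join a 5-block scan while it is reading block 3
        System.out.println(Arrays.toString(scanOrder(5, 3)));  // [3, 4, 0, 1, 2]
    }
}
```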
Teradata has a measure for data temperature, i.e., how hot or frequently used the data is. The hottest data is kept in memory automatically, with no DBA intervention or application changes required. While the FSG cache holds data moved from disk only temporarily, this intelligent memory does not purge the data but keeps it in memory; the data can stay there for days.
The two are complementary. The most frequently queried tables, dictionaries, and anything that requires sub-second access are all valid candidates. Data deemed very hot stays in each AMP's intelligent memory.
The physical database is also designed to help with queries.
First, the primary index is chosen such that it is most often used by users to find rows.
Second, the primary index will reduce the data transfers from disk.
Third, the secondary indexes will limit the number of AMPs used and the data blocks transferred. Up to 32 secondary indexes can be used.
Fourth, the tables are partitioned to sort the rows horizontally.
Fifth, the tables can be partitioned vertically so that the transfer of columnar blocks can be efficient.
Sixth, the modeled tables can be denormalized.
 
We looked at Teradata's PE tasks. We will now look into what the AMP does.
The AMP is responsible for the following:
storing and retrieving rows on its assigned disks
locking the tables and the rows
sorting rows and doing all aggregation
handling all join processing
handling all space management and space accounting
The AMP executes the plan from the PE. With a WHERE clause on the primary index, the plan is a single-AMP retrieve.
By using the keyword EXPLAIN, we can see the parsing engine's plan. A non-unique primary index will have its duplicates grouped together on the same AMP. Up to 64 columns can be grouped together in a multi-column primary index.
A table can have one primary index. If the table is specified with NO PRIMARY INDEX, it will receive no primary index. A full table scan is likely on a table with no primary index.
When we type EXPLAIN with a SQL query on a table with no primary index, it shows the full table scan as an all-AMPs retrieve. When the primary index specification is omitted, the first column is used by default.
NoPI tables are designed to be staging tables. Data from a mainframe or server can be loaded into Teradata quickly with perfect distribution. Data is moved out of staging tables rather than queried in place, so an index is not necessary.
In Teradata, a row's primary index value is hashed, and the output (the row hash) is stored as part of the Row-ID. Since the same index value always produces the same hash value, the row hash retrieves the same row again. There is only one hashing formula.
The HashMap determines which AMP will own the row: the RowHash is used to look up the AMP in the HashMap. Different rows hash to different AMPs, and the HashMap gives a global perspective of all the rows.
If we create an index on a column that allows duplicate values, each AMP will append a uniqueness value as a sub-identifier after the RowHash to track the duplicates.
The RowHash and the uniqueness value together make up the Row-ID.
For a unique primary index, the uniqueness value is always 1. Each AMP sorts its rows by Row-ID.
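A hypothetical Java sketch of how uniqueness values could be appended to duplicate row hashes to form sortable Row-IDs; the actual on-disk Row-ID is a packed binary format, and the hash values here are invented:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: duplicate row hashes get an increasing uniqueness
// value, and (RowHash, uniqueness) together form a sortable Row-ID. The
// real Row-ID layout is a packed binary value, not a pair of longs.
public class RowIdSketch {
    static List<long[]> assignRowIds(int[] rowHashes) {
        Map<Integer, Integer> perHashCount = new HashMap<>();
        List<long[]> rowIds = new ArrayList<>();
        for (int hash : rowHashes) {
            int uniqueness = perHashCount.merge(hash, 1, Integer::sum); // 1, 2, 3... per duplicate
            rowIds.add(new long[]{hash, uniqueness});
        }
        // AMPs keep rows sorted by Row-ID, which enables binary search
        rowIds.sort(Comparator.<long[]>comparingLong(r -> r[0]).thenComparingLong(r -> r[1]));
        return rowIds;
    }

    public static void main(String[] args) {
        for (long[] id : assignRowIds(new int[]{7, 3, 7})) {
            System.out.println(id[0] + "-" + id[1]);   // 3-1, then 7-1, then 7-2
        }
    }
}
```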

Tuesday, January 14, 2014

In this post we talk about Teradata: its architecture and SQL, as read from the book by Tom Coffing and William Coffing. Teradata is about parallel processing. With each query, every step of the plan is performed in parallel by the AMPs. A Teradata system consists of a series of AMPs that work in parallel. Teradata systems usually involve several AMPs for greater ROI; they usually start small and grow.
Nexus is the tool used to load the logical model and display the tables and views visually. It has a Super Join Builder that guides joins and builds the SQL on the joins specified. Nexus works with Teradata and Hadoop.
Data models are inexpensive to test and alter. ERwin from CA is a data modeling tool; it allows you to model your tables, views, and the physical model. When the data model is shared with the business users, there is more insight, and its use is facilitated by the tool building the SQL.
Teradata's advantages include using Massively Parallel Processing (MPP) to analyze very large data.
In a table of a trillion rows, Teradata can find a single record in under a second using just one AMP. Teradata has sophisticated query traffic control with rules, times, delays, green lights to query, and red lights to wait. Long-duration and high-volume queries are not placed in the same category as critical queries.
Teradata gives users their own views of the query processing with Viewpoint.
The Teradata environment enables a variety of queries from large massive joins, sub second tactical queries, partitioned tables for range queries, columnar designs for specific columns combined with secondary indexes, joins and hash indexes etc.
The rows of data are spread across the AMPs so that each user query can be executed on a similar number of rows by each AMP. In parallel processing we distribute the work across the AMPs. Communication is done over the BYNET. The Parsing Engine (PE) takes the user's SQL and builds a plan for the AMPs.
All Teradata tables spread their rows evenly across the AMPs. Teradata systems can add AMPs for linear scalability, and Teradata can scale to an arbitrary size.
AMPs and PEs are called virtual processors. Each SMP node has both processors and typically many of both. SMP stands for symmetric multiprocessing. Each node is attached via a network to a disk farm. Each AMP in the node is assigned its own virtual disk.
When nodes are connected to the BYNETs, they become part of one large MPP system. Each node addition brings in additional PEs, AMPs, and disks.
Nodes, disks, and BYNETs are usually kept inside what is called a cabinet. Gateway and channel driver software runs as processes: one controls communication with the LAN, and the other connects to the mainframe. The Access Module Processors (AMPs) and PEs live in memory and are controlled by the Parallel Database Extensions (PDE), usually running on a Linux operating system. While the BYNETs enable communication between nodes, there is also a boardless BYNET within each node, over which the PEs and AMPs can communicate. Each PE can manage up to 120 individual sessions. When a user logs into Teradata, a PE logs them in and manages the session. The PE is also responsible for checking the SQL syntax, creating the EXPLAIN, checking security, and building a plan. The PE uses the COLLECTED STATISTICS to build the best plan, and it delivers the final answer set to the user.
 
We discuss PHP today, from the LAMP stack. PHP is embedded in HTML; it is translated to HTML on the server before it reaches the client. PHP code is wrapped with <?php , <? , or <script language="php">.
PHP works with virtually all web server software and most databases. PHP configuration is handled with the php.ini file. Single-line comments start with #. Multi-line printing uses a heredoc label such as print <<<END, or quotes. PHP is whitespace insensitive but case sensitive in variable names. The data types include integers, doubles, booleans, NULL, strings, etc. Variable scope includes local variables, function parameters, and global and static variables. Like C, we have __LINE__, __FILE__, and __FUNCTION__, and the loops are similar as well.
Arrays can be numeric, associative, or multidimensional. The dot operator is used to concatenate strings. fopen, fclose, fread, and filesize are some of the file functions. Functions are declared with the function keyword, for example:
function writeMessage()
{
echo "Hello";
}
Reference passing is done with the & operator in the function declaration (call-time pass-by-reference at the invocation is deprecated). Default values for function parameters can be set.
<?php
function cURLcheckBasicFunctions()
{
  // return false if ANY of the required cURL functions is missing
  if( !function_exists("curl_init") ||
      !function_exists("curl_setopt") ||
      !function_exists("curl_exec") ||
      !function_exists("curl_close") ) return false;
  else return true;
}
// verify the cURL extension is available
if( !cURLcheckBasicFunctions() ) print_r('UNAVAILABLE: cURL Basic Functions');
$apikey = 'your_api_key_here';
$clientId = 'your_client_id_here';
$clientSecret = 'your_client_secret_here';
$url = 'https://my.api.com?api_key='.$apikey;
$ch = curl_init($url);
$fields = array(
'user' => urlencode('user'));
//url-ify the data for the POST
$fields_string = '';
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
$fields_string = rtrim($fields_string, '&'); // drop the trailing '&' (rtrim returns a new string)
curl_setopt($ch, CURLOPT_URL, $url);
// curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC ) ;
// curl_setopt($ch, CURLOPT_USERPWD, $credentials);
// curl_setopt($ch, CURLOPT_SSLVERSION, 3);
// SSL verification disabled here for testing only -- do not do this in production
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
//execute post
if(curl_exec($ch) === false)
{
    echo 'Curl error: ' . curl_error($ch);
}
else
{
    echo 'Operation completed without any errors';
}
//close connection
curl_close($ch);
?>
This is a sample REST call using cURL and PHP.



Monday, January 13, 2014

We continue our discussion on EJB in this post. We would like to talk about persistence.
Persistence can be via entity beans; it can also be via Hibernate and ORM tools. We will discuss both. Bean persistence can be managed in two ways: container-managed persistence (CMP) and bean-managed persistence (BMP). In CMP, the container manages the persistence of the bean; in BMP, the developer implements the lower-level persistence mechanism. BMP is typically used when the limits of CMP have been exceeded.
In CMP, entity bean data is managed automatically by the container with a mechanism of its choosing. A container implemented on top of an RDBMS may manage persistence by storing each bean's data as a row in a table. Serialization may also be built in.
A persistence manager is used to separate the management of bean relationships from the container. The container manages security, transactions etc while the persistence manager manages different databases via different containers. Using this architecture allows entity beans to become more portable across EJB vendors.
Entity beans are mapped to a database using a bean-persistence manager contract called the abstract persistence schema. The persistence manager implements and executes find method based on EJB QL.
The persistence manager generates a mapping of the CMP objects to a persistent data store object. Persistent data stores can range from relational databases and flat files to Enterprise Resource Planning (ERP) systems. The mapping is based on the information provided in the deployment descriptor and the bean's abstract persistence schema.
The CMP entity bean and the persistence manager use a contract to define bean to bean, bean to dependent and even dependent to dependent object relationships within an entity bean. When EJB is deployed, the persistence manager is used to generate an instantiated implementation of the EJB class and its dependents using the XML deployment descriptor and the bean class. The instantiated implementation will include the data access code that will read and write the state of the EJB to the database at runtime. The persistence manager also manages the state. The persistence manager generates subclasses for use with the container instead of the abstract classes defined by the bean provider.
CMP's pros are database independence and container-specific features. The cons are that the algorithms are limited to what the container supports, portability to other EJB containers is limited, the developer has no access to the generated code, and the generated SQL is not always the most efficient.
BMP's pros are that it is container independent, standards based (using the EJB and JDBC APIs), able to support nonstandard data types, flexible for validation logic, and able to take advantage of nonstandard SQL features. The cons are that it is database specific, requires knowledge of SQL, and takes longer to develop.
In this post, we continue our discussion on Java EJB.
The bean class methods must be declared public and must not be final or static. The throws clause may define arbitrary application exceptions, and the bean class must not define the finalize() method.
Beans can be specified as stateless, stateful, or message-driven. The EJB container injects resource dependencies into a variable or setter method. A PersistenceContext can also be injected using the EntityManager data type and the corresponding imports. An entity can then be found using the EntityManager object:
Person p = em.find(Person.class, 10);
New entities can be added as em.persist(p);
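As a toy illustration of this find/persist contract (not the JPA API itself), a minimal in-memory stand-in might look like the following; the class and method shapes are simplified assumptions:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for the find/persist contract shown above. This
// is NOT the JPA API: the real em.persist takes only the entity, and the
// key comes from the entity's @Id field. Names here are invented.
public class MiniEntityManager {
    private final Map<List<Object>, Object> store = new HashMap<>();

    public void persist(Class<?> type, Object id, Object entity) {
        store.put(List.of(type, id), entity);          // roughly: em.persist(p)
    }

    public <T> T find(Class<T> type, Object id) {      // roughly: em.find(Person.class, 10)
        return type.cast(store.get(List.of(type, id)));
    }

    public static void main(String[] args) {
        MiniEntityManager em = new MiniEntityManager();
        em.persist(String.class, 10, "Alice");
        System.out.println(em.find(String.class, 10)); // prints Alice
    }
}
```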
Specifying interceptors is optional. The Interceptors annotation associates the interceptor class with the bean class. The method in the interceptor class annotated with AroundInvoke then becomes the interceptor method for the bean class.
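The around-invoke idea can be sketched in plain Java, with the interceptor wrapping the business method the way proceed() does. This is a simplified stand-in, not the real interceptor API:

```java
import java.util.function.Supplier;

// Simplified stand-in for the AroundInvoke idea, not the real interceptor
// API: the interceptor runs before the business method, calls through to
// it (like InvocationContext.proceed()), and runs again after.
public class InterceptorSketch {
    static final StringBuilder log = new StringBuilder();

    static <T> T aroundInvoke(Supplier<T> businessMethod) {
        log.append("before;");              // e.g. auditing or a security check
        T result = businessMethod.get();    // proceed() in a real interceptor
        log.append("after;");
        return result;
    }

    public static void main(String[] args) {
        String answer = aroundInvoke(() -> "hello");
        System.out.println(answer + " " + log);   // prints: hello before;after;
    }
}
```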
The PostConstruct and PreDestroy annotations provide notifications immediately after dependency injection and before the first business method, and for releasing resources before the bean is removed, respectively.
EJB Access can be secured with the DeclareRoles, RolesAllowed, DenyAll, PermitAll and RunAs annotations.
Once the Java source code is compiled, deployment descriptors can be created or edited.
Exception handling can be done with try-catch blocks, and methods can declare the exceptions they throw.
There are several things to consider for a good Java EJB program. Some of these are:
Modular code
Design patterns
Layering
Performance
Security