Wednesday, January 22, 2014

In continuation of our discussion on improving performance in SQL Server, we now look at the I/O affinity option. This is useful on multiprocessor computers with 16 or more CPUs. The option applies only to disk I/O and does not support hardware affinity for individual disks or controllers.
The affinity is set with the sp_configure affinity_mask configuration option, which specifies which CPUs the threads of SQL Server should run on. The affinity mask can exclude processors that are to be reserved for operating system processes.
When running an instance of SQL Server on a large enterprise-level multiprocessor computer with more than 16 CPUs, the IO_affinity_mask option can be used in conjunction with affinity_mask. IO_affinity_mask specifies which processors are used for SQL Server disk operations, leaving the remaining CPUs to service the rest of the workload.
Care must be taken to affinitize just the right number of CPUs for disk I/O. Assigning more than the system needs can degrade performance on the other CPUs.
The IO_affinity_mask cannot be changed while the server is running.
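The CPU affinity mask itself can be changed through sp_configure. A minimal sketch follows, assuming an instance that exposes 'affinity mask' as an advanced option; the mask value 15 (0x0F), which binds the first four CPUs, is only an illustration.
-- make the advanced affinity options visible
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
-- bind SQL Server worker threads to CPUs 0 through 3 (bitmask 0x0F = 15)
EXEC sp_configure 'affinity mask', 15;
RECONFIGURE;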
Occasionally, we may encounter slow-running queries or even suspended queries. This may have nothing to do with performance tuning and is more symptomatic of a resource issue such as a deadlock.
To identify a deadlock, you must first obtain log information. SQL Server supports startup trace flags, DBCC TRACEON, and dynamic management views that give more visibility into locks.
For example, we can add the -T1204 and -T3605 trace flags at startup. The former collects information about the processes and resources involved when the deadlock detection algorithm finds a deadlock, and the latter writes that output to the SQL Server error log. If adding the flags at server startup is not an option, they can be turned on at runtime with DBCC TRACEON.
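If the flags were not added at startup, a short sketch of turning them on at runtime is below; the -1 argument enables them globally for all sessions.
-- report deadlock participants (1204) and route trace output to the error log (3605)
DBCC TRACEON (1204, 3605, -1);
-- verify which trace flags are currently active
DBCC TRACESTATUS (-1);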
A SQL Profiler trace can also be collected. It can show full statements in addition to the execution plans of the statements. A SQL Profiler trace may also include the lock events for deadlocks and deadlock chains. Turning on the deadlock trace flags and running a SQL Profiler trace while the deadlock occurs should provide the data needed to troubleshoot it. However, running SQL Profiler can change the timing of execution enough to prevent a deadlock, so the information should be captured with the trace flags first and the profiler run afterwards.


Tuesday, January 21, 2014

This continues our discussion on improving SQL Server performance through normalized logical database design, efficient index design and efficient query design.
The next step is efficient application design. The role that application design plays is critical because clients control the queries: what they look like, when they are submitted and how the results are processed. This affects the type and duration of locks, the amount of I/O and CPU load on the server, and whether the performance is good or bad.
For this reason, the sooner we correct the application design, the better the performance. At the same time, even for turnkey applications, performance cannot always be improved without changing the application. Queries submitted by clients are often referred to as workloads.
Workloads should use small result sets, which reduces CPU and I/O load.
Applications should allow cancellation of a query in progress. No application should force reboot.
Applications should process all results to completion.
Always implement a query timeout. Do not allow queries to run indefinitely.
Do not use generated SQL unless you have control over, or at least visibility into, the statements. Query cancellation, query timeout and complete transactional control are still required from generated SQL.
Decision support and OLTP queries should be separated as much as possible.
Do not allow orphaned connections.
Use the trace feature wherever possible and isolate the slow query.
After the slow query is isolated, do the following:
Run it in isolation, with minimal buffering and I/O redirection.
Use SET STATISTICS IO ON to examine the I/O consumed by the query.
If the query is nested or wrapped in a stored procedure, extract it.
If there are triggers on the table that can generate I/O when run, silence them.
Examine the indexes of the table used by the slow query.
Using the queries previously mentioned, examine the data uniqueness and distribution for each column.
Based on this study, make changes to application, query and index as appropriate. Run the query again to see the improvement.
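As a sketch of the first two steps above, assuming a hypothetical slow query against an Orders table, the statistics settings report reads and CPU time per statement:
-- measure I/O and elapsed time while running the isolated query
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT OrderID, CustomerID, OrderDate
FROM Orders
WHERE OrderDate >= '2014-01-01';
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;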
 
The following are some of the tips to improve SQL Server performance based on what has worked before:
1) Normalize the logical database design:
Reasonable normalization of the logical database design yields the best performance. A normalized database usually has a greater number of narrow tables, while a de-normalized database has fewer, wider tables. A normalized design can require many joins, which can hurt performance; however, the optimizer is efficient at selecting joins when indexes are available.
Note that normalization brings fewer and narrower indexes, fewer NULLs and less redundant data, allows more clustered indexes and facilitates sorting and index creation. It also requires fewer locks because less data is locked. Typically, four-way or greater joins are a sign that the design has been normalized too far.
2) Use effective index design
Indexes are usually left out of the logical database design and are associated with the physical database design. Indexes can be dropped, added and changed without affecting the database schema. The optimizer chooses the best available index, so by designing indexes properly we can improve performance. We do this as follows:
a) Examine the WHERE clause of the SQL queries, because this is the primary focus of the optimizer.
Pick out the slow queries and, for each column in the WHERE clause, consider it as a possible candidate for an index.
b) Use narrow indexes. They have fewer index levels and are often more effective than multi-column compound indexes, providing better performance over a wide range of queries.
c) Use clustered indexes.
UPDATE and DELETE operations are often accelerated by clustered indexes because they involve a lot of reading. Queries that return a range of values are often improved with a clustered index, whereas a non-clustered index helps fewer such queries.
d) Examine column uniqueness - find the number of unique values in a column. If there are only a handful, the column should not be indexed. If the number of unique values is far less than the number of rows, consider a clustered index. If there are many unique values, a non-clustered index can be used.
e) Examine data distribution in indexed columns:
A long-running query can occur when a column with few unique values is indexed, or when a join is performed on such a column. When the data is not distributed well, a non-clustered index costs roughly one page I/O per qualifying row, at which point it becomes more efficient to scan the entire table. Distribution can be checked with, say, a GROUP BY on the column and HAVING COUNT(*) > 1.
The availability of an index helps the optimizer decide whether indexed retrieval will improve performance.
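A sketch of the uniqueness and distribution checks, assuming a hypothetical Orders table with a Status column that is an index candidate:
-- how many distinct values does the candidate column have relative to the row count?
SELECT COUNT(DISTINCT Status) AS DistinctValues, COUNT(*) AS TotalRows
FROM Orders;
-- how are the rows spread across those values?
SELECT Status, COUNT(*) AS RowsPerValue
FROM Orders
GROUP BY Status
HAVING COUNT(*) > 1
ORDER BY RowsPerValue DESC;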
3) Use efficient query design
Some types of queries are inherently resource intensive, such as:
Large result sets
IN, NOT IN, and OR queries
highly non-unique where clauses.
!= comparison operators
certain column functions, such as SUM
expressions or data conversions in where clause
local variables in where clause
complex views with Group by
To mitigate the resource usage, consider restricting the result set. For example, we can add a predicate such as WHERE zip = '98052'.
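For instance, a small sketch with a hypothetical Customers table shows how the predicate keeps the result set small:
-- return only the rows for one zip code instead of the whole table
SELECT name, address
FROM Customers
WHERE zip = '98052';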

Monday, January 20, 2014

There are three types of joins: nested loop joins, hash joins and sort-merge joins.
In a nested loop join, one of the tables is scanned and, for every row, a lookup is performed on the second table and the matching rows are retrieved. The tables can be labeled as the outer and inner tables. The inner table lookup can be fast depending on the presence of indexes. Let us take two tables, employees and contractors; employees has 100,000 rows and contractors has 40,000 rows.
Then a table scan on contractors together with a clustered index seek on employees is an example of a nested loop join.
For a join on id, the scan count of the employees table is 40,000 (one seek per contractor row) and that of the contractors table is 1.
In a hash join, a hash table is created in memory for one of the inputs, called the build input, and is probed for matching values from the other input, the probe input. If the build input does not fit in memory, it is partitioned recursively until it does. The probe input is hashed and used for matches. Hash joins are better when there is a significant difference between the sizes of the tables.
In a hash join of the example tables, the contractors table is used as the build input and the employees table is used with a clustered index scan.
The scan count of the employees table is 1 and that of the contractors is also 1.
In a sort-merge join, both inputs are sorted on the join column and the sorted pair is merged into one result: a row of one input is examined, the second sorted input is scanned until a matching row is found, and this is repeated for every row of the first input, eliminating the column values that do not match. The larger input dominates the cost.
This type of join requires both inputs to be in memory, as opposed to a hash join, where only the smaller input is kept in memory. The merge join is more efficient if the inputs are already sorted. Note that inequality joins are not handled by this method.
For the given example, the contractors table is sorted on its ids and scanned together with a clustered index scan of the employees table.
The scan count for the employees table is 1 and that for the contractor table is also 1.
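To observe these differences on a concrete pair of tables, the join strategy can be forced with SQL Server query hints; a sketch assuming the hypothetical employees and contractors tables joined on id:
-- compare the scan counts and reads reported for each join strategy
SET STATISTICS IO ON;
SELECT e.id FROM employees e INNER JOIN contractors c ON e.id = c.id OPTION (LOOP JOIN);   -- nested loop join
SELECT e.id FROM employees e INNER JOIN contractors c ON e.id = c.id OPTION (HASH JOIN);   -- hash join
SELECT e.id FROM employees e INNER JOIN contractors c ON e.id = c.id OPTION (MERGE JOIN);  -- sort-merge join
SET STATISTICS IO OFF;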
In Teradata, the SQL syntax is very similar to other databases. For example, the TOP command can be used with ORDER BY, and TOP WITH TIES can be used to see the ties together, much like a RANK or DENSE_RANK operation. However, in Teradata, TOP cannot be used with DISTINCT, QUALIFY, WITH and WITH BY, or SAMPLE. By the way, GROUP BY is preferred over DISTINCT for queries that return the same results, because DISTINCT can skew the data in spool across AMPs whereas GROUP BY aggregates locally first and is faster.
Also note that when using WHERE clauses, it is preferable to specify IS NULL or IS NOT NULL explicitly. This is particularly true when testing for the exclusion of some column values from the result set. Parentheses in the WHERE clause can be used to change the order of precedence. The IN and NOT IN operators can also help with inclusion and exclusion. Note that NOT = ALL works just like the NOT IN operator.
The IN list is an excellent way to look up multiple column values. BETWEEN is inclusive and works for character data. LIKE can be used for string comparison and takes % and _ as wildcard characters. LIKE, LIKE ALL and LIKE ANY can be used in conjunction with multiple conditions.
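A few examples of these predicates, assuming a hypothetical Employee_Table:
-- exclude missing departments explicitly rather than comparing with NULL
SELECT Last_Name FROM Employee_Table WHERE Dept_No IS NOT NULL;
-- IN list for multiple values; BETWEEN includes both end points
SELECT Last_Name FROM Employee_Table WHERE Dept_No IN (100, 200, 300);
SELECT Last_Name FROM Employee_Table WHERE Last_Name BETWEEN 'A' AND 'M';
-- LIKE with % (any string) and _ (any single character) wildcards
SELECT Last_Name FROM Employee_Table WHERE Last_Name LIKE 'Jo_es%';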
Data manipulation commands are also similar. For example, we can write INSERT INTO My_Table (Column1, Column2) VALUES (124.56, 'My character data')
A bulk insert is done with, say, INSERT INTO My_Table SELECT Column1, SUM(Column2) FROM Original_Table GROUP BY Column1
An update is done with, say, UPDATE My_Table SET Column2 = Column2 + 256 WHERE Column1 = 'my data'
Fast path updates can be done by inserting into an empty copy of the table: INSERT INTO My_Table_Copy SELECT Column1, Column6 * 1.05 FROM My_Table.
MERGE INTO Employee_Table
USING VALUES (200000, NULL, 'Jones') AS E(Emp, Dept, Las)
ON Employee_No = Emp
WHEN MATCHED THEN
UPDATE SET Last_Name = Las
WHEN NOT MATCHED THEN
INSERT VALUES (Emp, Dept, Las);

Sunday, January 19, 2014

interview questions
If one table has the students and another table has the classes offered at a school, how do we find students who are not enrolled in any class?
We maintain a third table, StudentClass, with non-null foreign key columns StudentID and ClassID.
SELECT s.name FROM StudentClass sc
RIGHT OUTER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.ClassID IS NULL
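An equivalent formulation of the same question, sketched with the same hypothetical tables, uses NOT EXISTS:
SELECT s.name
FROM Student s
WHERE NOT EXISTS (SELECT 1 FROM StudentClass sc WHERE sc.StudentID = s.ID)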
If we wanted a count of classes for all students enrolled in more than one class:
SELECT s.name, count(*) as NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
GROUP BY s.name
HAVING count(*) > 1
If we wanted to exclude a particular class ID, say 1:
SELECT s.name, count(*) as NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.ClassID IS NOT NULL AND sc.ClassID <> 1
GROUP BY s.name
HAVING count(*) > 1

Saturday, January 18, 2014

In this post, we talk about temporal tables and how to create them in Teradata.
There are three types of temporal tables.
- valid time temporal tables
- transaction time temporal tables
- bi-temporal tables that are a combination of both.
These tables are created with the PERIOD data type that introduces a begin and end timestamp.
The difference between the two is that a ValidTime period can be based on a date or a timestamp, while a TransactionTime period must be a timestamp.
The transaction-time timestamp changes on update, capturing the row both before and after the update, as in the case of bi-temporal tables. A non-sequenced ValidTime query gives the data the way things were, while a current ValidTime query gives the data the way it is now.
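A minimal sketch of a valid-time temporal table, assuming a hypothetical Policy_Table and a Teradata release with the temporal feature enabled:
CREATE MULTISET TABLE Policy_Table (
    Policy_ID INTEGER NOT NULL,
    Customer_ID INTEGER,
    Policy_Validity PERIOD(DATE) NOT NULL AS VALIDTIME  -- PERIOD column marked as the valid-time dimension
)
PRIMARY INDEX (Policy_ID);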
We will now look at how joins work internally in Teradata.
Teradata requires that for two rows to be joined together, both rows are physically on the same AMP.
Either one or both tables are redistributed, or the smaller table is duplicated across all AMPs.
Based on the join column, the matching rows from different AMPs are brought together on the same AMP; spool space is used for this purpose. On all joins, the matching rows must be on the same AMP, and hashing is used to bring the rows together.
If two tables are joined a thousand times a day and both have the same primary index on the PK/FK join columns, there will be no data movement because the rows are already hashed to the same AMP. The EXPLAIN output shows this as a Row Hash Match join. Where this is not the case, it is preferable to have a WHERE clause on the primary index.
There are three types of temporary tables:
1) Derived tables
These exist only within a query and are materialized by a SELECT statement inside the query.
The space comes from the user's spool space and the table is deleted at the end of the query.
Derived tables are usually specified within parentheses in the SQL or with a WITH clause. Columns can be aliased or default to the names of the underlying columns. Most derived tables are used to join to other tables.
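A short sketch of a derived table joined back to its base table, assuming a hypothetical Employee_Table:
SELECT e.Employee_No, e.Dept_No, e.Salary, d.Avg_Sal
FROM Employee_Table e
INNER JOIN (
    SELECT Dept_No, AVG(Salary) AS Avg_Sal
    FROM Employee_Table
    GROUP BY Dept_No
) AS d
ON e.Dept_No = d.Dept_No
WHERE e.Salary > d.Avg_Sal;  -- employees paid above their department's average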
2) Volatile tables are created by the user and materialized with an INSERT/SELECT.
The space comes from the user's spool space, and the table and its data are deleted when the user logs off.
ON COMMIT DELETE ROWS can be explicitly specified so that the rows of the volatile table are deleted at the end of the transaction.
The HELP VOLATILE TABLE command shows the volatile tables in the session.
Volatile tables can have a primary index so that, when they are used in joins with other tables, there is no movement of data.
Statistics can also be collected on volatile tables to improve performance. For example, collect statistics on all non-unique primary indexes (NUPI), on unique primary indexes of small tables (fewer than about a thousand rows per AMP), on columns that appear frequently in WHERE search conditions, on non-indexed columns used in joins, and on the partitioning column of a PPI table.
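A sketch of creating and analyzing a volatile table, assuming a hypothetical Sales_Table; the COLLECT STATISTICS form shown is one of the accepted syntaxes:
CREATE VOLATILE TABLE vt_sales AS (
    SELECT Cust_ID, SUM(Amount) AS Total_Amount
    FROM Sales_Table
    GROUP BY Cust_ID
) WITH DATA
PRIMARY INDEX (Cust_ID)
ON COMMIT PRESERVE ROWS;  -- keep the rows until logoff instead of end of transaction

-- statistics on the primary index column help the optimizer plan joins against the volatile table
COLLECT STATISTICS ON vt_sales COLUMN (Cust_ID);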
3) Global temporary tables
Here the table definition is specified by the user and is permanent. The table is materialized with an INSERT/SELECT.
The space comes from the user's TEMP space. When the user logs off the session, the data is deleted but the table definition stays. The definition can be shared by many users, each of whom populates their own copy of the table.
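A minimal sketch of a global temporary table, with hypothetical column names; the definition persists in the dictionary while each session materializes its own rows:
CREATE GLOBAL TEMPORARY TABLE gt_dept_totals (
    Dept_No INTEGER,
    Total_Salary DECIMAL(12,2)
)
PRIMARY INDEX (Dept_No)
ON COMMIT PRESERVE ROWS;

-- each session populates its own copy with an INSERT/SELECT
INSERT INTO gt_dept_totals
SELECT Dept_No, SUM(Salary)
FROM Employee_Table
GROUP BY Dept_No;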