Friday, January 31, 2014

In this post, we look at a few of the examples from the previous post:
    source = job_listings | where salary > industry_average
uses the predicate to filter the results.
    dedup source sortby -delay
keeps only the first event for each unique source after sorting by the delay field in descending order
    head (action="startup")
returns the first events until the one matching action="startup" is found
   transaction clientip maxpause=5s
groups events that share the same client IP address and have no pauses longer than five seconds between them.
The results often include the duration and the event count for each transaction.
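To make the transaction semantics concrete, here is a minimal Python sketch (not Splunk itself) that groups already-parsed events by client IP and starts a new group whenever the pause between consecutive events exceeds five seconds; the event dicts and the 'clientip' and '_time' field names are assumptions for illustration.
from collections import defaultdict

def group_transactions(events, maxpause=5.0):
    # events: list of dicts with 'clientip' and '_time' (epoch seconds)
    groups = defaultdict(list)          # clientip -> currently open group
    transactions = []
    for e in sorted(events, key=lambda e: e['_time']):
        current = groups[e['clientip']]
        if current and e['_time'] - current[-1]['_time'] > maxpause:
            transactions.append(current)             # close the group on a long pause
            current = groups[e['clientip']] = []
        current.append(e)
    transactions.extend(g for g in groups.values() if g)
    return [{'clientip': g[0]['clientip'],
             'eventcount': len(g),
             'duration': g[-1]['_time'] - g[0]['_time']} for g in transactions]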
The stats command calculates statistical values on events grouped by the values of specified fields.
    stats dc(host) returns the distinct count of host values
    stats count(eval(method="GET")) as GET by host returns the number of GET requests for each web server. Other functions, such as percentile and range, can also be used with stats.
The span argument applies to timechart only; it is not applicable to chart or stats.
   chart max(delay) over host
returns max(delay) for each value of host
   timechart span=1m avg(CPU) by host
charts the average value of CPU usage each minute for each host.
Filtering, modifying and adding fields can be done with commands such as eval, rex, and lookup.
The eval command calculates the value of a new field based on existing fields.
The rex command creates new fields by using regular expressions.
The lookup command adds fields by matching values against a lookup table.
Fields can be specified in a col1-colN range format or with wildcard characters.
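As a rough illustration of what rex- and lookup-style commands do, here is a small Python sketch (not SPL) that extracts a new field with a regular expression and then enriches the event from a lookup table; the raw-event format, pattern and lookup values are assumptions.
import re

lookup_table = {'200': 'OK', '404': 'Not Found', '500': 'Server Error'}   # assumed lookup
status_pattern = re.compile(r'HTTP/1\.\d"\s+(?P<status>\d{3})')           # assumed raw format

def enrich(event):
    match = status_pattern.search(event['_raw'])
    if match:
        event['status'] = match.group('status')                  # rex: new field from a regex
        event['status_description'] = lookup_table.get(event['status'], 'unknown')   # lookup
    return event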

Thursday, January 30, 2014

We will now describe the Search Processing Language (SPL). We mentioned earlier that Splunk shifts the focus from organizing data to writing useful queries. The end result may be only a few records from a mountain of original data, and it is the ease of use of the query language that lets us retrieve that result.
Search commands are separated by the pipe operator, the well-known operator that redirects the output of one command as input to another. For example, we could retrieve selected columns of the top few rows of an input data set by chaining the search | top | fields commands with their qualifiers.
If you are not sure what to filter on, you can list all events, group them and even cluster them. There are some data mining tips available as well. This method of exploration has been termed 'spelunking', hence the name of the product.
Some tips for using the search commands: use quotation marks around phrases, rely on the case-insensitivity of the arguments specified, remember that AND is the default boolean between search terms unless OR, which has higher precedence, is specified explicitly, and use subsearches, specified with square brackets, where one search command is the argument to another.
The common SPL commands include the following:
Sorting results - ordering the results and optionally limiting the number of results with the sort command
Filtering results - selecting only a subset of the original set of events, executed with one or more of the following commands: search, where, dedup, head, tail etc.
Grouping results - grouping the events based on some pattern, as with the transaction command
Reporting results - displaying a summary of the search results, as with top/rare, stats, chart, timechart etc.
Filtering, modifying and adding fields - enhancing or transforming the results by removing, modifying or adding fields, as with the fields, replace, eval, rex and lookup commands.
Commands often work with about 10,000 events at a time by default unless explicitly overridden to include all. There is no support for C-like statements as with dtrace, and it is not as UI-oriented as Instruments. However, a variety of arguments, including operators such as startswith and endswith and key-value operands, can be passed to each of the search commands, and the language is platform agnostic. Perhaps it should also support indexing and searching its own logs.
Statistical functions are available with the stats command, which supports a variety of built-in functions.
The chart and timechart commands are used with report builders.
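The pipe idea described above is easy to emulate: each command consumes a stream of events and yields another stream. Here is a hedged Python sketch of a search | top | fields style pipeline, with made-up event dicts; it is only a toy model of the concept, not Splunk's implementation.
from collections import Counter

def search(events, **kv):          # keep events whose fields match the given key-value pairs
    return (e for e in events if all(e.get(k) == v for k, v in kv.items()))

def top(events, field, n=3):       # most common values of a field, as (value, count) rows
    return Counter(e[field] for e in events).most_common(n)

def fields(rows, *names):          # project each row onto the named columns
    return [dict(zip(names, row)) for row in rows]

# events = [...]   # a list of event dicts
# result = fields(top(search(events, status='404'), 'clientip'), 'clientip', 'count')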

In this post we look at how to get data into Splunk. Splunk divides raw discrete data into events, and when you run a search, it looks for matching events. Events can be visualized as structured data with attributes, or as a set of keyword/value pairs. Since events are timestamped, Splunk's indexes can efficiently retrieve events in time-series order. However, events need to be textual, not binary; image, sound and other binary files cannot be indexed directly, although something like a core dump can be converted to a stack trace, and the user can specify custom transformations before indexing. Data sources can include files, network inputs and scripted inputs. Downloading, installing and starting Splunk is easy, and when you reach the welcome screen there is an Add Data button to import the data. Indexing is unique and efficient in that it associates the time with the words in the event without touching the raw data. With this map of time-based words, the index looks up the corresponding events. A stream of data can thus be divided into individual events, and the timestamp field enables Splunk to retrieve events within a time range.
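A toy sketch of the idea behind such an index: map each word, together with a time bucket, to the identifiers of the events that contain it, so a search over a time range touches only the relevant buckets. Everything here (bucket size, event structure) is an assumption for illustration, not Splunk's actual index format.
from collections import defaultdict

index = defaultdict(set)   # (time_bucket, word) -> set of event ids

def add_event(event_id, timestamp, raw_text, bucket=3600):
    for word in raw_text.lower().split():
        index[(int(timestamp) // bucket, word)].add(event_id)

def search(word, start, end, bucket=3600):
    hits = set()
    for b in range(int(start) // bucket, int(end) // bucket + 1):
        hits |= index[(b, word.lower())]
    return hits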
Splunk has a user interface called the Summary Dashboard, which gives you a quick overview of the data. It has a search bar, a time range picker, a running total of the indexed data, and three panels - one each for sources, source types, and hosts. The sources panel shows which sources (files, network or scripted inputs) the data comes from, the source types panel shows the types of those sources, and the hosts panel shows which hosts the data comes from. The contents of the search dashboard include the following:
Timeline - indicates the matching events for the search over time.
Fields sidebar - the relevant fields extracted along with the events.
Field discovery switch - turns automatic field discovery on or off.
Results area - events are ordered by timestamp, and each event's raw text is shown along with the fields selected in the fields sidebar and their values.

Wednesday, January 29, 2014

Today I'm going to talk about Splunk.
 And perhaps I will first delve into one of the features. As you probably know, Splunk enables great analytics over machine data. It treats data as key-value pairs that can be looked up just as niftily and as fast as with any Big Data system. This is the crux of Splunk: it allows search over machine data to find the relevant information when the data is otherwise difficult to navigate because of its volume. Notice that it eases the transition from organizing data to writing better queries, and the queries can be expressed in a concise, select-like language.
 While I will go into these in detail, including the technical architecture, shortly, I want to cover regular expressions over the data first. Regex is powerful because it allows matching and extracting data. The patterns can be specified separately and use the same metacharacters as anywhere else.
 The indexer can selectively filter out events based on these regular expressions. This is specified via two configuration files, props.conf and transforms.conf - one for configuring Splunk's processing properties and the other for configuring data transformations.
props.conf is used for line breaking multiline events, setting up character set encoding, processing binary files, recognizing timestamps, setting up rule-based source type recognition, anonymizing or obfuscating data, routing selected data, creating new index-time field extractions, creating new search-time field extractions and setting up lookup tables for fields from external sources. transforms.conf is used for configuring similar data transformations; its settings require corresponding entries in props.conf.
This feature adds a powerful capability to the user by transforming the events, selectively filtering the events and adding enhanced information. Imagine not only working with original data but working on something that can be transformed to more meaningful representations. Such a feature not only helps with search and results but also helps better visualize the data.
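As an illustration of the kind of transformation these configuration files drive (and not Splunk's actual implementation), here is a Python sketch that drops events matching a filter pattern and masks credit-card-like numbers before the event would be indexed; both patterns are assumptions.
import re

FILTER = re.compile(r'DEBUG|healthcheck')                 # events to discard, assumed pattern
MASK = re.compile(r'\b\d{4}(?:[- ]?\d{4}){3}\b')          # credit-card-like numbers

def preprocess(raw_events):
    for raw in raw_events:
        if FILTER.search(raw):
            continue                                      # selectively filter out the event
        yield MASK.sub('XXXX-XXXX-XXXX-XXXX', raw)        # anonymize before indexing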
 
In today's post, I want to briefly discuss the differences between SOAP and REST based services.
SOAP follows a convention where the data is exchanged as messages and the operations are usually methods on objects. Data can be exchanged in both text and binary form, and specific tools are needed to inspect the messages.
REST is a lightweight style in that it describes resources and operations with a set of predefined verbs, usually HTTP verbs. Instead of a big XML format, it can use JSON. Traffic can be inspected with standard HTTP tools such as Fiddler.
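For example, a REST call is just an HTTP verb against a resource URL, typically with a JSON body; the endpoint below is hypothetical and the sketch assumes the requests library is installed.
import requests

response = requests.get('https://api.example.com/orders/42', headers={'Accept': 'application/json'})
order = response.json()                               # plain JSON, readable in any HTTP proxy
requests.post('https://api.example.com/orders', json={'item': 'book', 'quantity': 1})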

Tuesday, January 28, 2014

In this post, I wanted to start on a book for Splunk, but I will have to defer that to the next few posts. Instead I want to add a few more items on AWS, specifically Amazon EC2 instance store volumes. These are called ephemeral drives and provide block-level storage. The storage is preconfigured and pre-attached to the same physical server that hosts the EC2 instance, and the instance type determines its size. Some instance types, such as the micro instance, have no instance storage, and Amazon EBS storage is used instead.
Instances can have very fast HI1 solid state drives. In contrast, HS1 instances are optimized for very high storage density and sequential I/O.
Local instance store volumes are ideal for temporary storage of information that is continually changing, such as buffers, caches, scratch data and other temporary content, or for data that is replicated across a fleet of instances, such as a load-balanced pool of web servers.
The instance owns the data and storage, unlike EBS, where store volumes can be attached or detached. High I/O instances typically use instance store volumes backed by SSD and are ideally suited for many high-performance database workloads such as Cassandra and MongoDB.
 Data warehouses, Hadoop storage nodes, cluster file systems involve a lot of high sequential IO.
 These are better supported by high storage instances. EC2 instance storage has the same performance characteristics as Amazon EBS volumes, RAID striping is supported, and the bandwidth to the disks is not limited by the network.
Unlike EBS volume data, data on instance store volumes persists only for the lifetime of the instance; for durability, the main concern is that the data does not persist beyond the instance's life, although it does survive an orderly reboot. The cost of the Amazon EC2 instance includes any local instance store volumes.
Storage volumes are fixed  and defined by the instance type, so scaling the number of store volume is not an option but the overall instance and data can be instantiated multiple times to scale.
Local instance store volumes are tied to a particular Amazon EC2 instance and are fixed in number and size
I'm going to pick up a book on Splunk next but I will complete the discussion on AWS and Amazon S3 (simple storage service) first. We talked about usage patterns, performance, durability and availability. We will look at Cost Model, Scalability and Elasticity, Interfaces and AntiPatterns next. Then we will review Amazon Glacier.
S3 supports a virtually unlimited number of files in any folder. Unlike a disk drive, which restricts the amount of data it can hold before it must be partitioned, S3 can store an unlimited number of bytes. Objects are stored in a single bucket, and S3 will scale and distribute redundant copies of the information.
In terms of interfaces, standard REST and SOAP based interfaces are provided. These support both management and data operations. Objects are stored in buckets, and each object has a unique key. Objects are addressed over the web rather than through a file system, although keys can mimic a file-system-like hierarchy.
SDKs that wrap these interfaces are available for most popular programming languages.
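For instance, with a Python SDK such as boto3 the bucket/key model looks roughly like this; the bucket and key names are made up and credentials are assumed to be configured.
import boto3

s3 = boto3.client('s3')
s3.put_object(Bucket='my-example-bucket', Key='reports/2014/01/summary.txt', Body=b'hello S3')
obj = s3.get_object(Bucket='my-example-bucket', Key='reports/2014/01/summary.txt')
print(obj['Body'].read())     # the object is retrieved by bucket name and key, not by a query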
S3 is not the right choice in the following cases: S3 is not a standalone filesystem.
An object cannot be queried and retrieved unless you know the bucket name and key.
It does not suit rapidly changing data, and it is also not intended as backup or archival storage (Amazon Glacier serves that purpose). While it is ideal for websites, it is used to store the static content, with the dynamic content served from EC2.
Amazon Glacier is an extremely low cost storage service for backup and archival. Customers can reliably store their data for as little as 1 cent per gigabyte per month.  You store data in Amazon glacier as archives. Archives are limited to 4 TB but there is no limit on their number.

Monday, January 27, 2014

In this post, we continue our discussion on AWS particularly the storage options. As we have seen, AWS is a flexible cost-effective easy to use cloud computing platform. The different choices for storage are memory based storage such as file caches, object caches, in-memory databases, and RAM, message queues that provide temporary durable storage for data sent asynchronously between computer systems or application components etc. Other options include storage area networking (SAN) where virtual disk LUNs often provide the highest level of disk performance and durability, Direct attached storage where local hard disk drives or arrays residing in each server provide higher performance than a SAN, Network attached storage which provide a file level interface that can be shared across multiple systems. Finally, we have the traditional databases where structured data resides as well as a NoSQL non-relational database or a data warehouse. The backup and archive include non-disk media such as tapes or optical media.

These options differ in performance, durability and cost, as well as in their interfaces. Architects consider all these factors when making choices, and sometimes these combinations form a hierarchy of data tiers. The Amazon Simple Storage Service (S3) is storage for the internet. It can store any amount of data, at any time, from within the compute cloud or from anywhere on the web, and objects of any size can be written, read and deleted. It is also highly scalable, allowing concurrent read and write access. Amazon S3 is commonly used for large-scale analytics workloads such as financial transaction analysis, clickstream analytics and media transcoding.

The performance of Amazon S3 from within the Amazon EC2 in the same region is fast. It is also built to scale storage, requests and users. To speed access to the relevant data, Amazon S3 is also often paired with a database such as DynamoDB or Amazon RDS. Amazon S3 stores the actual information while the database serves as a  repository for metadata. This metadata can be easily indexed and queried. In this case it helps to locate an object's reference with the help of a query.

Durability is guaranteed via automatic and synchronous saving of data across multiple devices and multiple facilities. Availability for mission critical data is designed for such a high percentage that there is miniscule or no downtime. For non-critical data, reduced redundancy storage in Amazon S3 can be used.

Sunday, January 26, 2014

I tested my KLD clusterer with real data. There are some interesting observations I wanted to share.
First lets consider a frequency distribution of words from some text as follows:
categories, 4
clustering, 3
data, 2
attribute, 1

Next let's consider their co-occurrences as follows:
categories clustering 1
data categories 1
data clustering 2

Since the terms 'categories' and 'clustering' occur together and with the same other terms, they should belong to the same cluster. The clustering is based on the KLD divergence measure, and each term's measure falls into one of several K-means clusters chosen for the different ranges of that measure.

Furthermore the term 'attribute' occurs only once and does not co-occur. Therefore it must be different from 'categories' and 'clustering' terms and should fall in a cluster different from theirs.

The cluster for the term 'data' is ambiguous. If the K-means clustering chose say two partitions only, then it would be merged with one of the clusters. If the K-means clustering had many partitions, it could end up in its own cluster.

Items are clustered together based on their KLD measure. Similar KLD measures between pairs of terms should end up in the same cluster.
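A minimal Python sketch of one reading of this method on the toy data above: score each term by the KL divergence of its smoothed co-occurrence profile from the background co-occurrence distribution, then cluster the scores with 1-D K-means (the epsilon smoothing and the scoring-against-background simplification are my assumptions; the K-means step itself is omitted).
import math

freq = {'categories': 4, 'clustering': 3, 'data': 2, 'attribute': 1}
cooc = {('categories', 'clustering'): 1, ('data', 'categories'): 1, ('data', 'clustering'): 2}
terms = list(freq)
EPS = 1e-6

def profile(term):
    # smoothed distribution of the term's co-occurrences over the vocabulary
    counts = {t: EPS for t in terms}
    for (a, b), c in cooc.items():
        if term in (a, b):
            counts[b if a == term else a] += c
    total = sum(counts.values())
    return {t: v / total for t, v in counts.items()}

counts = {t: sum(c for pair, c in cooc.items() if t in pair) + EPS for t in terms}
background = {t: v / sum(counts.values()) for t, v in counts.items()}

def kld(p, q):
    return sum(p[t] * math.log(p[t] / q[t]) for t in terms)

scores = {t: kld(profile(t), background) for t in terms}
print(scores)   # 'attribute' scores far from the rest and lands in its own cluster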
In this post, we will continue the discussion on AWS. The AWS cloud services platform consists of Database, Storage and CDN, Cross service, Analytics, Compute and networking, Deployment and management services and Application Services.
The AWS global physical infrastructure consists of geographical regions, availability zones and edge locations.
The databases offered by AWS include DynamoDB, a predictable and scalable NoSQL store; ElastiCache, an in-memory cache; RDS, a managed relational database; and Redshift, a managed petabyte-scale data warehouse. Storage and CDN consist of S3, scalable storage in the cloud; EBS, a network-attached block device; CloudFront, a global content delivery network; Glacier, archive storage in the cloud; Storage Gateway, which integrates on-premises IT with cloud storage; and Import/Export, which ships large datasets.
The cross-service offerings include Support; Marketplace, to buy and sell software apps; the Management Console, a UI to manage AWS services; SDKs, IDE kits and CLIs. Analytics include Elastic MapReduce, a popular managed Hadoop framework; Kinesis, real-time data processing; and Data Pipeline, orchestration for data-driven workflows.
Compute and networking include EC2, the most popular way to access a large array of virtual servers directly, including remote desktop with admin access; VPC, a virtual private network based on NAT or IPv6 fencing; ELB, a load balancing service; WorkSpaces, a virtual desktop in the cloud; Auto Scaling, which automatically scales capacity up and down; Direct Connect, a dedicated network connection to AWS; and Route 53, a scalable domain name system.
The deployment and management services include CloudFormation, templated AWS resource creation; CloudWatch, resource and application monitoring; Elastic Beanstalk, an AWS application container; IAM, secure AWS access control; CloudTrail, user activity logging; OpsWorks, a DevOps application management service; and CloudHSM, hardware-based key storage for compliance.
In this post, we review white paper on AWS from their website.
 AWS solves IT infrastructure needs. Applications have evolved from a desktop centric installation to client/server models, to loosely coupled web services and then to service oriented Applications. This increases the scope and dimension of the infrastructure. Cloud computing now builds on many of the advances such as virtualization, failovers etc. As Gartner mentions, it is known for scalable and elastic IT enabled capabilities as a service to external customers using internet technologies.
The capabilities include compute power, storage, databases, messaging and other building block services that run business applications.
When coupled with a utility style pricing and business model,  cloud computing delivers an enterprise grade IT infrastructure in a reliable timely and cost-effective manner.
Cloud computing is about outsourcing the infrastructure services and keeping it decentralized. Development teams are able to access compute and storage resources on demand. Using AWS, you can request compute power, storage and services in minutes.
AWS is known to be
Flexible - different programming models, operating systems, databases and architectures can be enabled. Application developers don't have to learn new skills, and SOA-based solutions with heterogeneous implementations can be built.
Cost-effective - with AWS, organizations pay only for what they use. Costs such as power, cooling, real estate and staff are taken off the organization's hands. There is no up-front investment or long-term commitment, and ongoing spend is minimal.
Scalable and elastic - organizations can quickly add and remove AWS resources in order to meet customer demand and manage costs. A spike in traffic, or several, can be handled without hampering normal business operations.
Secure - it provides end-to-end security and end-to-end privacy. Confidentiality, integrity and availability of your data are of the utmost importance to AWS, as is maintaining trust and confidence. AWS takes the following approaches to secure the cloud infrastructure:
certifications and accreditations - in the realm of public sector certifications
physical security - it has many years of experience designing, constructing and operating large scale data centers.
secure services - unauthorized access or usage is restricted without sacrificing the flexibility of customer demand.
privacy - encrypt personal and business data in the AWS cloud.
Experienced - when using AWS, organizations can leverage Amazon's years of experience in this field.
AWS provides a low friction path to cloud computing. Scaling on demand is an advantage.

Saturday, January 25, 2014

In this post, we complete the discussion on Teradata and its SQL. We briefly cover stored procedure SQL. Teradata follows the SQL conventions, so these are similar to stored procedures written elsewhere; however, Teradata includes a comprehensive stored procedure language. Stored procedures are stored in either DATABASE or USER PERM space and are invoked using the CALL statement. They may return one or more values to the client as parameters, and IN, OUT and INOUT parameters can be used.
Triggers can similarly be authored. A trigger is an event-driven maintenance operation: when a table is modified, the trigger is executed. The original data modification, the trigger and all subsequent triggers are all part of the same transaction. Triggers can be specified as 'FOR EACH STATEMENT' as well as for each modified row.
The hashing functions on Teradata are also very popular. The HASHROW function produces the 32-bit binary row hash that is stored as part of the data row, and it can be executed on data column values. When each row hash occurs only once, it is a great sign that the data is perfectly distributed or even unique.
The HASHBUCKET function produces the 16-bit binary hash bucket that is used with the hash map to determine the AMP that should store and retrieve the data row. It can return a maximum of just over 1,000,000 unique values. The HASHAMP function returns the identification number of the primary AMP for any hash bucket number, and the HASHBAKAMP function returns the identification number of the fallback AMP for any hash bucket number - a great way to see the distribution of primary and fallback rows.
BTEQ, the Batch Teradata Query tool, has been around for some time. It is a report writer and imports and exports data from a Teradata server one row at a time. Queries can be run interactively or in a batch script; after you log on, you can directly execute queries in interactive mode. The WITH and WITH BY commands are available for totaling and sub-totaling. Scripts can be batched and saved in a text file. This concludes our reading of the book on Teradata.

 
Having looked at occurrence and co-occurrence based keyword selection, let's recap: both Zipf's law of word occurrences and the refined studies of Booth demonstrate that mid-frequency terms are closely related to the conceptual content of a document. Therefore, it is possible to hypothesize that terms closer to TP can be used as index terms. TP = (-1 + square-root(8 * I + 1)) / 2, where I represents the number of words with frequency 1. Alternatively, TP can be found as the first frequency that is not repeated in a non-increasing frequency-sorted vocabulary.
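A quick Python sketch of computing the transition point both ways; the sample frequency counts are made up, and the repetition-based variant follows the "lowest frequency that appears exactly once" reading of the criterion.
import math

def transition_point(frequencies):
    I = sum(1 for f in frequencies if f == 1)        # number of words occurring exactly once
    return (-1 + math.sqrt(8 * I + 1)) / 2

def transition_point_by_repetition(frequencies):
    # lowest frequency that appears exactly once in the frequency-sorted vocabulary
    counts = {f: frequencies.count(f) for f in frequencies}
    unique = [f for f, c in counts.items() if c == 1]
    return min(unique) if unique else None

sample = [12, 9, 9, 7, 5, 5, 5, 3, 2, 2, 1, 1, 1, 1, 1, 1]
print(transition_point(sample), transition_point_by_repetition(sample))   # both give 3 here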
 We will now attempt to explain a simple clustering based method for co-occurrence based keyword selection.
We start with a co-occurrence matrix where we count the occurrences of pairwise terms. The matrix is N x M, where N is the total number of terms and M is the number of words selected as keywords using the TP criterion above, with M << N. We ignore the diagonal of this matrix. The row-wise total for a term is the sum of all its co-occurrences with the known set.
If terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster. Similarity is measured with the Kullback-Leibler divergence, calculated as KLD(p||q) = sum over all terms of p.log(p/q). We use the standard k-means clustering method: the initial cluster centroids are chosen far from each other, the terms are assigned to the nearest cluster, the centroids are recomputed, and the cycle is repeated until the cluster centers stabilize. We take p and q to be the attribute-wise co-occurrences from the matrix we populated and aggregate them to compute the KLD.
 The distribution of co-occurring terms in particular is given by
Pz(t) = weighted average of the attribute-wise Pz(t)
      = Sum-over-z ((nz/n) * (n(t,z)/nz))
      = (1/n) * Sum-over-z (n(t,z))

 

Friday, January 24, 2014


We continue with the Vector Space Model next. In this model, we represent the documents of a collection by vectors with weights according to the terms appearing in each document. The similarity between two documents d1 and d2, or between a document and a query q, is measured as the cosine of the angle between their vectors. A document is said to be closer to the query when that angle is smaller, that is, when the cosine is larger. Each query is modeled as a vector using the same attribute space as the documents.
Groups of documents that are similar with respect to the users' information needs can be obtained by clustering, the documents representing a collection of points in a multidimensional space whose dimensions are given by the selected features. There are different clustering algorithms and almost all of them will produce clusters, so their fitness or quality has to be measured.
Sometimes it is better to reduce the number of dimensions without significant loss of information. This transformation to a subspace helps with clustering and reduces the computation. Another technique uses a matrix for principal component analysis: the document and query vectors are projected onto a different subspace spanned by k principal components, and the matrix, for example a covariance matrix, defines the principal values.
Notice that merely reducing the number of dimensions does not shift the origin, so the vectors keep their relative spacing. Principal component analysis, on the other hand, shifts the origin to the center of the basis vectors in the subspace, so the documents are better differentiated.

import numpy as np
def LSI(vectors, k):
    # project the document vectors onto the k largest singular values (truncated SVD)
    U, s, _ = np.linalg.svd(vectors, full_matrices=False)
    return U[:, :k] * s[:k]
def COV(vectors, k):
    # principal component analysis: center the vectors, then project onto the top k components
    centered = vectors - vectors.mean(axis=0)
    Vt = np.linalg.svd(centered, full_matrices=False)[2]
    return centered.dot(Vt[:k].T)
In this post, we will begin with a way to differentiate between LSI and PLSI. Latent Semantic Indexing is an approach where we try to associate terms with the same main hidden semantic component, as found from co-occurrences in similar contexts; it works on the term-document matrix with linear algebra. In PLSI, we use the same occurrence and co-occurrence of terms but with non-negative weights whose sum is 1, that is, with probabilities. Comparing and clustering these co-occurrence distributions gives us the latent components, represented by the centers of the clusters.
We use similar clustering methods with document classification as well. Formally, we use the vector space model. This model represents documents of a collection by using vectors with weights according to the terms appearing in each document.
Vector Space model can be both boolean and term-weighted models. In the boolean model, the presence or occurrence of each term is given a boolean value. In the term-weighted model, the same term may be assigned different weights by different weighting methods such as tf-idf, frequency ratio. When the weights add up to one, they have been normalized. Typically probabilities or conditional probabilities adding up to one are used.
One of the advantages of using vector space model is that it enables relevance ranking of documents of heterogeneous format with respect to user input queries as long as the attributes are well-defined  characteristics of the document.
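A small sketch of the term-weighted model with tf-idf weights and cosine ranking; scikit-learn is assumed to be available and the documents and query are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock markets fell sharply today"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)              # one tf-idf weighted vector per document
query_vector = vectorizer.transform(["cat on a mat"])     # query in the same attribute space
scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = sorted(enumerate(scores), key=lambda x: -x[1])  # larger cosine = closer to the query
print(ranking)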

Thursday, January 23, 2014

In continuation of our discussion on keywords and documents:
We looked at a few key metrics for selecting keywords based on content, frequency and distribution.
However keywords can also be selected based on co-occurrence of similar terms in the same context.
i.e similar meanings will occur with similar neighbors.

This model proposes that the distribution of a word together with other words and phrases is highly indicative of its meaning. The method represents the meaning of a word with a vector where each feature corresponds to a context and its value to the frequency of the word occurring in that context.
This vector is referred to as the term profile. The profile P(t) of the term t is the set of terms from the list T that co-occur in sentences together with t, that is, P(t) = {t' : t' belongs to s(t)}, where s(t) denotes the sentences containing t.

Once such a representation is built, machine learning techniques can be used to automatically classify or cluster words according to their meaning. However, these are often noisy representations. This is due to say polysemy and synonymy. Synonymy refers to the existence of similar terms that can be used to express an idea or object in most languages. Polysemy refers to the fact that some words have multiple unrelated meanings. If we ignore synonymy, the clusters we create will have many small disjoint clusters while some of these could have been clustered together. If we ignore polysemy, it can lead to clustering of unrelated documents.

Another challenge in using this technique to massive databases is efficient and accurate computation of the largest few singular values and their vectors in the highly sparse matrix. In fact scalability beyond a few hundred thousand documents is difficult.

Note that the technique above inherently tries to find the keywords based on the latent semantics, and hence it is referred to as latent semantic indexing. The counterpart of clustering in this approach is to find conditional probability distributions such that the observed word distributions can be decomposed into a few latent distributions and a noisy remainder. The latent components are expected to be the same for co-occurring words.

Next, we will also look at a popular way to represent documents in a collection which is with Vector Space Model.

Material read from survey of text mining, papers by Pinto, papers by Wartena

Today we will recap some keyword selection techniques:

In this article, we look at different feature selection algorithms. Feature selection metrics are based on content, frequency and  distribution.

Content based feature selection can be based on say mutual information or information gain.

MI measures how much information a feature contains about a class; it is given by log [P(f,c) / (P(f)P(c))].

Information Gain measures the number of bits of information obtained about the presence and absence of a class by knowing the presence or absence of a feature.

It is given by IG = Sum over classes g, Sum over feature values d of P(g,d) * log [P(g,d) / (P(g)P(d))], where g denotes class presence or absence and d denotes feature presence or absence.

Gain Ratio improves IG by including features with low entropy that may be strongly correlated with a class. It does this by normalizing IG with the entropy of the class.
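A hedged Python sketch of these two computations from raw document counts; the contingency counts are made up and the logarithms are base 2.
import math

def mutual_information(n_fc, n_f, n_c, n):
    # n_fc: docs containing feature f with class c; n_f: docs with f; n_c: docs in c; n: all docs
    return math.log2((n_fc / n) / ((n_f / n) * (n_c / n)))

def information_gain(joint, n):
    # joint[(g, d)]: count of documents with class presence g and feature presence d
    p_g = {g: sum(c for (gg, _), c in joint.items() if gg == g) / n for g, _ in joint}
    p_d = {d: sum(c for (_, dd), c in joint.items() if dd == d) / n for _, d in joint}
    return sum((c / n) * math.log2((c / n) / (p_g[g] * p_d[d])) for (g, d), c in joint.items() if c)

joint = {('pos', 1): 40, ('pos', 0): 10, ('neg', 1): 5, ('neg', 0): 45}   # made-up counts
print(mutual_information(40, 45, 50, 100), information_gain(joint, 100))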

Term Strength is based on the idea that most valuable features are shared by related documents.
The weight of a feature is defined as the probability of finding it in some document d given that it also appears in a document d' similar to d.
To calculate TS for feature f, we need a threshold on the similarity measure used to judge two documents to be sufficiently related. This is done by first deciding how many documents can be related to a given one and then finding the average minimum similarity measure for this number of neighbors over all documents in the collection.

Frequency based keyword selection metrics include tf-idf and Transition point.

tf-idf is term frequency combined with inverse document frequency. It is quite a simple and effective measure for keywords.

Transition Point is based on the idea that medium-frequency terms are closely related to the conceptual content of a document. A higher weight can be given to a term the closer its frequency is to a frequency named the transition point. This point is determined by inspecting the vocabulary frequencies of each text and identifying, going down from the highest frequency, the lowest frequency that is not repeated; repeated frequencies are more common for low-frequency words.
Distribution-wise, it is debatable whether a term spread throughout the document should be included or excluded: on one hand it is a good candidate for an index term, and on the other it can be considered overly general.

Wednesday, January 22, 2014

In this post, we revert to the discussion on Teradata:
We will cover help and show commands. Help is a command to list the database objects. For example, we have Help Database, help user, help table, help volatile table, help view, help macro, help procedure, help trigger, help column etc. These display information of the corresponding database objects. The different database objects are aggregate functions (A), combined/analytical functions(B), External stored procedure(E), Function (F), Trigger(G), Method(H), Join Index(I), Journal Table(J), Macro(M), Hash Index(N), Procedure (P), Queue Table (Q), Table Function (R), Ordered Analytical Function (S), Table (T), UDT(U), View(V) and Authorization(X)  There's also corresponding help available for index, statistics, constraint, session, SQL, SPL etc.
Comments can also be added to database objects. These are displayed on help.
Runtime information such as sessions is also displayed; the parsing engine assigns the session.
The show command is used to show the SQL used to create the corresponding database object. For example, SHOW TABLE <table-name> displays the CREATE TABLE statement needed to recreate the table. Table, view, macro, trigger and procedure definitions can all be shown.
 Aggregates such as COUNT ignore null values when used with a column specifier; when used with a star to denote all columns, rows with null values are included. The common aggregates are MIN, MAX, AVG, SUM and COUNT. GROUP BY delivers one row per group. Rows can be restricted with a predicate, which improves performance, and the predicate clause also helps skip calculations in aggregation. The HAVING clause acts as a filter on the aggregates.
 Grouping Sets, rollup and cube are three grouping options.
The Nexus tool could be used for all groupings.
 Joins can be INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN and FULL OUTER JOIN. A PRODUCT JOIN is a Cartesian product and often not intended.
In continuation of our discussion on improving performance in SQL Server, we now look at the I/O affinity option. This is useful on multiprocessor computers with 16 CPUs or more. The option supports only disk I/O and does not provide any hardware affinity for individual disks or controllers.
The option is set with sp_configure switch with affinity_mask configuration option to specify which CPUs the threads of the SQL Server should run on. The affinity mask can exclude processors that are to be reserved for operating system processes.
When running an instance of SQL Server on a large enterprise-level multiprocessor computers with more than 16 cpus, the IO_affinity_mask can be used in conjunction with affinity_mask. This option is used to specify which processors are to be used for SQL Server disk operations and which CPUs service the remaining.
Care must be taken to specify just the right number of affinitized cpus for disk IO. This is because we don't want to provide more than what the system needs or we might degrade performance on the other CPUs.
The IO_affinity_mask cannot be changed  when the server is running.
Occasionally, we may encounter slow-running queries or even suspended queries. This may have nothing to do with performance tuning and instead be symptomatic of a resource issue such as a deadlock.
To identify a deadlock, you must first obtain log information. SQL Server supports different startup parameters, dbcc traceon, and dynamic management views that provide more visibility into locks.
For example, we can add the -T1204 and -T3605 trace flags. The former collects information about the processes and resources when the deadlock detection algorithm encounters a deadlock, and the latter collects information every time the deadlock detection algorithm is run. If changing the server startup parameters is not an option, the traceon facilities can be used instead.
A SQLProfiler trace can also be collected. These can show full statements in addition to execution plans of the statements.  A SQLProfiler trace may also have a lock event for deadlock and deadlock chain. Turning on the deadlock trace flags during the occurrence of a deadlock and running a SQL profiler trace should provide the data to troubleshoot the deadlock. However, running the SQLProfiler could change the timing of execution enough to prevent a deadlock. Therefore, the information must be captured with trace flags first and then the profiler should be run.


Tuesday, January 21, 2014

This continues on our discussion to improve SQL Server performance with normalizing logical database design, using efficient index design and using efficient query.
 The next step is to use efficient application design. The role that application design plays is critical because clients control the queries, the type of queries and when they are submitted and how the results are processed. This affects the type and duration of the locks, the amount of I/O and CPU load on the server and whether the performance is good or bad.
For this reason, the sooner we correct the application design, the better the performance. At the same time, even for turnkey applications, improvements to performance cannot be done in all cases without changing the application. Queries from clients are often referred to as workloads.
Workloads should use small result sets, which reduces CPU and I/O load.
Applications should allow cancellation of a query in progress. No application should force reboot.
Applications should process all results to completion.
Always implement a query timeout. Do not allow queries to run indefinitely
Do not use generated SQL unless you have control over the statements or transparency. Query cancellation, query timeout and complete transactional control are still required from such.
Decision support and OLTP queries should be separated as much as possible.
Do not allow orphaned connections.
Use trace feature wherever possible. Isolate the slow query wherever possible.
After the slow query is isolated, do the following:
run it in isolation with minimal buffering and I/O redirection.
set statistics IO on to examine IO consumed by the query.
If the query is nested or wrapped in a stored procedure, extract it.
If there are triggers on the table that can generate I/O when run, silence them
Examine the indexes of the table used by the slow query.
Using the queries previously mentioned, examine the data uniqueness and distribution for each column.
Based on this study, make changes to application, query and index as appropriate. Run the query again to see the improvement.
 
The following are some of the tips to improve SQL Server performance based on what has worked before:
1)Normalize logical database design:
Reasonable normalization of the logical database design yields the best performance. A normalized database usually has a greater number of narrow tables, while a de-normalized database has fewer, wider tables. Making many joins in a normalized database can hurt performance, but the optimizer is efficient at selecting joins when indexes are available.
Note that normalization brings fewer and narrower indexes, fewer NULLs and less redundant data, allows more clustered indexes and facilitates sorting and index creation. It also requires fewer locks because less data is locked. Typically, four-way or greater joins are indicative of too much normalization.
2) Use Effective Index design
Indexes are usually left out of logical database design and are usually associated with physical database design. Indexes can be dropped, added and changed without affecting the database schema. The optimizer chooses the best index, so by properly designing the indexes and letting the optimizer select among them, we can improve performance. We do this as follows:
a) Examine the where clause of the SQL queries, because this is the primary focus of the optimizer
Pick out the slow queries and for each column in the where clause, consider a possible candidate for index.
b) Use narrow indexes, which have fewer index levels and are often more effective than multi-column compound indexes. Narrower indexes provide better performance over a wide range of queries.
c) use clustered indexes
Update and delete operations are often accelerated by clustered indexes because they involve a lot of reading. Queries that return a range of values are often improved by a clustered index, whereas a non-clustered index helps queries that return fewer rows.
d) Examine column uniqueness - find the number of unique values in a column. If there are only a handful, the column should not be indexed. If the number is far less than the number of rows, consider a clustered index. If there are many more, a non-clustered index can be used.
e) examine data distribution in indexed columns:
A long-running query often occurs because a column with few unique values is indexed, or because a join is performed on such a column. When the data is not distributed properly, page I/O increases, typically to one page I/O per row retrieved through the non-clustered index, at which point it becomes more efficient to scan the entire table. Distribution can be examined with, say, GROUP BY and HAVING COUNT(*) > 1.
The availability of an index helps an optimizer in its job to make a decision whether to use indexed retrieval or not for improved performance.
3) Use efficient query design
Some types of queries are inherently resource intensive such as
Large result sets
IN, NOT IN, and OR queries
highly non-unique where clauses.
!= comparison operators
certain column functions, such as SUM
expressions or data conversions in where clause
local variables in where clause
complex views with Group by
To mitigate the resource usage, consider restricting the result set. For example, we can add a predicate saying where zip = '98052'

Monday, January 20, 2014

There are three types of joins : Nested Loop Joins, Hash Joins and Sort-Merge joins.
In a nested loop join, one of the tables is scanned and for every row, a look up is performed on the second table and the matching rows are retrieved. The tables can be labeled as outer and inner tables. The inner table lookup can be fast based on the presence of indexes. Let us take two tables employees and contractors, employees has 100,000 rows and contractors are 40,000 rows.
Then a table scan on the contractors together with a clustered index seek on employees is an example of a nested join.
The scan count of the employees table is 100,000 and that of the contractors is 1 for a join on id.
In a hash join, a hash table is created in memory for one of the inputs called the build input and searched for matching values from the other input, the probe input.  If the build input does not fit in memory, it is partitioned recursively until it fits in memory.  The probe input is hashed and used for matches. Hash joins are better when there is a significant difference between the size of the tables.
In a hash join of the example tables, the contractors table is used as the build input and the employees table is used with a clustered index scan.
The scan count of the employees table is 1 and that of the contractors is also 1.
In a sort-merge join, the tuples of one sorted input are examined, the second sorted input is scanned until a matching row is found, and this is repeated for every row of the first input. Here the larger input dominates the cost. In a merge join, the inputs are sorted on the join column and the sorted pair is merged into one, eliminating the column values that do not match from the sorted result set.
This type of join requires both inputs to be in memory as opposed to hash join, where only a smaller input is in memory. The merge join is more efficient if the inputs are already sorted.  Note that inequality joins are not handled by this method.
For the given example, the sorting is done with the ids of the contractor table and it is scanned together with clustered index scan of the employees table.
The scan count for the employees table is 1 and that for the contractor table is also 1.
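A toy Python sketch contrasting the two lookup strategies, using the same employee and contractor row counts as above; representing rows as dicts keyed by 'id' is an assumption for illustration.
def nested_loop_join(outer, inner_index):
    # the outer table is scanned once; for every outer row we seek into an index on the inner table
    return [(o, inner_index[o['id']]) for o in outer if o['id'] in inner_index]

def hash_join(build, probe):
    # the smaller (build) input is hashed in memory, then the probe input is scanned once
    hashed = {row['id']: row for row in build}
    return [(p, hashed[p['id']]) for p in probe if p['id'] in hashed]

contractors = [{'id': i} for i in range(40000)]
employees = [{'id': i} for i in range(100000)]
emp_index = {e['id']: e for e in employees}            # stands in for a clustered index
matched = nested_loop_join(contractors, emp_index)
matched2 = hash_join(contractors, employees)
assert len(matched) == len(matched2) == 40000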
In Teradata, the SQL syntax is very similar to anywhere else. For example, the TOP option can be used with ORDER BY, and TOP WITH TIES can be used to see the ties together, much like a RANK or DENSE_RANK operation. However, in Teradata, TOP cannot be used with DISTINCT, QUALIFY, WITH and WITH BY, or SAMPLE. By the way, GROUP BY is preferred over DISTINCT for queries that return the same results, because DISTINCT can skew spool space while GROUP BY is faster.
Also note that when using where clauses, it is preferable to specify IS NULL or IS NOT NULL explicitly. This is particularly true when testing for the exclusion of some column values in the result set. Parentheses in the WHERE clause can be used to change the order of precedence, and the IN and NOT IN operators can help with inclusion and exclusion. Note that NOT=ALL works just like the NOT IN operator.
The IN list is an excellent way to look up multiple column values. BETWEEN is inclusive and works for character data. LIKE can be used for string comparison and takes % and _ as wildcard characters. LIKE, LIKE ALL and LIKE ANY can be used in conjunction with conditions.
Data Manipulation commands are also similar. For example we can give INSERT INTO My_Table (Column1, Column2) VALUES (124.56, 'My character data')
Bulk inserts are done with, say, INSERT INTO My_Table SELECT Column1, SUM(Column2) FROM Original_Table GROUP BY Column1
An update is done with, say, UPDATE My_Table SET Column2 = Column2 + 256 WHERE Column1 = 'my data'
Fast path updates can be done with INSERT INTO My_Table_Copy SELECT Column1, Column6*1.05 FROM My_Table.
MERGE INTO Employee_Table USING VALUES (200000, NULL, 'Jones') AS E(Emp, Sal, Las)
ON Employee_No = Emp
WHEN MATCHED THEN
UPDATE SET Salary = Sal
WHEN NOT MATCHED THEN
INSERT VALUES (Emp, Sal, Las);

Sunday, January 19, 2014

interview questions
If a table has students and another table has classes offered at a school, how do we find students who are not enrolled in any class.
We maintain another table StudentClass with foreign keys and non-null values StudentID and ClassID.
SELECT s.name FROM StudentClass sc
RIGHT OUTER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.ClassID IS NULL
If we wanted a count of classes for all students enrolled in more than one class
SELECT s.name, count(*) as NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
GROUP BY s.name
HAVING count(*) > 1
if we wanted to exclude a particular ClassID, say 1
SELECT s.name, count(*) as NumClasses
FROM StudentClass sc
INNER JOIN Student s
ON s.ID = sc.StudentID
WHERE sc.ClassID IS NOT NULL AND sc.ClassID <> 1
GROUP BY s.name
HAVING count(*) > 0

Saturday, January 18, 2014

In this post, we talk about creating temporal tables in Teradata.
There are three types of temporal tables.
- valid time temporal tables
- transaction time temporal tables
- bi-temporal tables that are a combination of both.
These tables are created with the PERIOD data type that introduces a begin and end timestamp.
 The only difference is that the ValidTime can be a date or a timestamp. The Transaction Time has to be a timestamp.
The timestamp changes both before and after updates, as in the case of bi-temporal tables. A non-sequenced ValidTime query gives the data the way things were, while a current ValidTime query gives the data the way it is.
We will now look at how joins work internally in Teradata.
Teradata requires that for two rows to be joined together, both rows are physically on the same AMP.
One or both tables are redistributed or the smaller table is duplicated across all AMPs
Based on the column used for the join, the matching rows from different AMPs are brought together on the same AMP; a volatile table is used for this purpose. On all joins, the matching rows must be on the same AMP, and hashing is used to bring the rows together.
If two tables were to be joined a thousand times a day, and the same primary index is on the PK/FK join condition, there will be no data movement because the rows will already be hashed to the same AMP. The EXPLAIN output shows this as a Row Hash Match join. Where this is not the case, it is preferable to have a WHERE clause on the primary index. The volatile tables are usually deleted on logoff.
There are three types of temporary tables:
1)Derived Table
This exists only within a query and is materialized by a SELECT statement inside the query.
The space used comes from the user's spool space, and the table is deleted at the end of the query.
Derived tables are usually specified within parentheses in the SQL or with a WITH clause. Columns can be aliased or default to the names of the underlying columns. Most derived tables are used to join to other tables.
2)The Volatile tables are created by the user and materialized with an INSERT/SELECT
The space is used on the users spool space and the table and data are deleted when the user logs off.
The ON COMMIT DELETE ROWS can be explicitly specified so that the volatile table is deleted at the end of the query.
The HELP command shows the volatile tables.
The volatile tables can have primary index such that when used for joins between tables, there is no movement of data.
Statistics can also be collected on volatile table so as to improve performance. For example, we can have this for all non-unique primary indexes (NuPI), unique primary indexes of small tables i.e with less than thousand rows per AMP, columns that appear frequently in WHERE search conditions and non-indexed columns used in joins, and partitioning column of a PPI table.
3)Then there are global temporary tables
Here the table definition is specified by the user and the table definition is permanent. It is materialized with an INSERT/SELECT.
The space is used on the user's TEMP space. When the user logs off the session, the data is deleted, but the table definition stays. This can be shared by many users who can then populate the same table but with their own copy.
 

Friday, January 17, 2014

We return to the discussion on Teradata.
Teradata has an admin user called DBC. This comes with each system.
 User DBC has always been at the top of the Teradata hierarchy. An admin user is explicitly provisioned to facilitate addition of other users. Some space is reserved for user DBA. Space on the disk for each AMP is divided into PERM space and SPOOL space. PERM space is reserved for tables. Secondary indexes, join indexes and permanent journals are also stored in PERM. SPOOL is used for user queries and answer sets.
 Disks are not shared between AMPs.
Every user is assigned spool space so they can submit SQL and retrieve answer sets. If the user exceeds this space, the query is aborted.
The spool limits are so that long queries can be terminated and individual users don't hog the system.
Each user's spool limit is actually done per AMP. If the space is exceeded on any AMP, the query is aborted.
 The DBC data dictionary shows how much space is reserved. When the space per AMP shows skew, typically this is due to NULL data.
The access is usually granted to users to read views which are in turn granted access to read tables.
A database or a user can be assigned  PERM space  and spool space.
Objects that take up PERM space include Table rows, fallback tables, secondary index subtables, stored procedures, user defined functions, permanent journals etc.
Access rights can be granted to user. There are three types of access rights:
automatic rights which are default and granted to owners of created tables.
Implicit rights which are ownership rights and are implicit to parents and anybody above in the user hierarchy,
Explicit rights which are privileges a user explicitly grants to another user with the grant to statement.
These rights may be explicitly removed
Profiles and roles are different: profiles are for people and roles are for rights. A profile can be assigned to a group of users so that they operate under a common set of attributes, for example a setting that can be used to override that of the individual users.
Account strings can be expanded to include logon time, date, hour, session ID, number etc.
The Teradata Active System Management TASM controls the query traffic. User accounts are given priority and rules setup to control system resources.
Secondary indexes can be created. If a secondary index is unique (a USI), it is hashed.
A USI lookup is a two-AMP operation because the first AMP is assigned to read the secondary index subtable and the second reads the base table. Records are looked up in both via binary searches.
A non-unique secondary index (NUSI) subtable is AMP-local.
A value ordered NUSI can also be created.
Today we take a short break from our discussion on Teradata
Using Kullback-Leibler divergence to extract keywords from  a single document:
This approach talks about a simple  way to extract keywords using term weighting.
Terms are weighted based on the Kullback-Leibler divergence D(avg probability of term t/q) from the background distribution q.
The terms are evaluated one by one and if their divergence is above a cutoff, they are selected.
Initially, we calculate the expected probability of a term by counting its occurrence in the document. We could later refine this to be a lookup against a dictionary of words and their expected probabilities as determined independently from a corpus of text.
We calculate nw as the number of occurrences of the term w in the document,
and pw as that count divided by the total number of terms in the document,
and we measure P(tk|q) as [nw / Sum-for-all-terms-x-in-q (nx)].
By setting an adjustable threshold for the cutoff, we extract variable number of keywords.
Pseudo-code in Python syntax for finding the KLD inputs from a single document:
 import nltk
 # document: the input text
 tokens = nltk.word_tokenize(document)
 porter = nltk.PorterStemmer()
 words = [porter.stem(t) for t in sorted(set(tokens))]

 wnl = nltk.WordNetLemmatizer()
 lemma = [wnl.lemmatize(t) for t in words]

 fdist_document = nltk.FreqDist(tokens)
 def Ptk_d(term):
     # probability of the term within the document's distribution
     return fdist_document[term] / sum([fdist_document[t] for t in lemma])

The KLD is given as D(p||q) = Sum-over-terms-t [pt * log(pt/qt)]
where pt comes from the distribution consisting of only the one keyword and is a constant 1 for that term,
and qt is the distribution nt/n for that term.
In our case we would not need to aggregate the KLD components pt * log(pt/qt) if we considered only one term. I take this back: the distribution is based on the same set of terms for both p and q, and when a term doesn't appear in p or q, it is given an epsilon probability.
We select the terms where the contribution pt * log(pt/qt) to KLD(p||q) is greater than 1.

KLD-term = 1 * log(1/(nt/n)) if we considered only one term
KLD-term = 1 * log(1/(nt/n)) + Sum-over-all-other-terms-t' [epsilon * log(epsilon/(nt'/n))] otherwise

Alternatively, we could use Bigi's equation, which normalizes the divergence:

KLD-normalized(d, t) = KLD(d, t) / KLD(d, 0)

KLD(d, t) = Sum over terms ((P(t,d) - P(t,t)) * log(P(t,d)/P(t,t)))
          = (nt/n - 1) * log(nt/n) if we considered only one term
          = (nt/n - 1) * log(nt/n) + Sum(((nt'/n) - epsilon) * log((nt'/n)/epsilon)) over all other terms t'

Note that this considers the occurrence and non-occurrence of terms but not their density for which we want to add weights in the probability computation above.

Thursday, January 16, 2014

We now look at Partition Primary Index tables in the Teradata server.
 Each table in Teradata has a primary index unless it is a NoPI table. Partitioning tells the AMPs to sort the table rows by partition first, and then by Row-ID within each partition. Partitioning is a way to prevent full table scans.
Partitions can be specified with Range_N options such as for creating a partition each day/week/month/year.
The query would look like CREATE TABLE Order_Table_PPI
(
Order_Number INTEGER NOT NULL,
Order_Date Date
) PRIMARY INDEX (Order_Number)
Partition by Range_N(Order_Date BETWEEN date '2013-01-01' AND DATE '2013-12-31'
EACH INTERVAL '7' DAY);
This would create about 52 partitions. Alternatively, the partition boundaries could be chosen based on partition size.
The number of partitions can be arbitrary; Teradata supports a very large number of them.
Older data and newer data in a PPI table can be combined.
Partitions can be nested too; the top-level partition is usually the only one that can be altered. Up to 15 levels of RANGE_N or CASE_N partitions can be nested.
A primary index must be non-unique on all partitioned tables unless the primary index column is also used to define the partition. Most PPI tables are defined as NUPI.
Character-based PPIs are also common.
Partitioning can be maintained with a statement such as: ALTER TABLE Order_Table TO CURRENT WITH INSERT INTO Order_Table_History;
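To make the range-partitioning idea concrete, here is a small conceptual sketch (in Python, not Teradata) of how a weekly RANGE_N definition maps an order date to a partition number. Teradata's internal partition numbering and elimination logic are its own; treat this only as an illustration of why a date predicate avoids scanning the whole table.

from datetime import date

START, END = date(2013, 1, 1), date(2013, 12, 31)

def partition_number(order_date):
    # Outside the defined ranges there is no matching partition.
    if not (START <= order_date <= END):
        return None
    # EACH INTERVAL '7' DAY: one partition per week, numbered from 1.
    return (order_date - START).days // 7 + 1

# A predicate on a single week touches a single partition:
print(partition_number(date(2013, 3, 15)))   # 11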
Columnar table partitions are also used often. Columnar tables must be NoPI tables. Since the partitions are along columns, each column is placed inside its own container. All containers have the same number of rows, and each container looks like a small table for I/O purposes. Columnar tables make sense when users query only certain columns.
In a columnar table, when a row is deleted it is not physically deleted but just marked deleted.
Columnar can move just one container into memory, and containers match up perfectly to rebuild a row.
Containers can hold more than one column.
We continue our discussion on Teradata by examining the AMPs. Rows are stored in data blocks, which are stored in cylinders. AMPs store the data block header, the data block, and the row reference array for each table. An AMP may have many cylinders, and each cylinder may have many data blocks. An AMP's master index is used to find the right cylinder; the master index is maintained in memory.
An AMP's master index may point to thousands of cylinder indexes. Each cylinder index resides on its cylinder and is usually not in FSG cache.
When an AMP's data blocks split, the cylinders grow. Blocks can have a maximum size of 255 or 2,047 sectors (about 1 MB for the larger setting), and the maximum row size is 64,255 bytes.
Having seen how the data is arranged on an AMP's disk, let us look at how Teradata stores data in blocks and moves it into the FSG cache.
Cylinders are arranged in tracks on the disk. The hottest cylinders are placed on the outer perimeter so that they can be read quickly in one disk spin; the coolest cylinders are stored at the center, where they may be rarely accessed.
When data is loaded into a table, it is automatically deemed hot, since the table is most likely a staging table. There can be an option to specify the initial temperature with a session setting called Query_Band.
Cylinders are also used for Perm, Spool, Temp data, and Journals.
There is a free cylinder list; new data goes to a free cylinder if the block cannot be accommodated in an existing one.
Given the organization we have presented, a full table scan has these detailed steps:
The user runs a query.
The PE checks the syntax and access rights and generates a plan.
The plan is a full table scan, so it is broadcast over the BYNETs to all AMPs.
Each AMP already has its master index in memory.
Each AMP brings the table header into FSG cache.
Each AMP then reads the master index to find the cylinder that holds the data blocks.
Each AMP reads the cylinder index by bringing it into FSG cache.
The table header and the data blocks are both in FSG cache.
Each AMP places its results into spool and reports to the PE.
The PE then requests the spool contents and delivers the answer set to the user.



 

Wednesday, January 15, 2014

We mentioned hashing of rows in Teradata. We also sort the rows based on this Row-ID. AMPs sort their rows by Row-ID to group like data. Since the Row-IDs are sorted, the AMPs can use binary search. NULL values all hash to the same AMP, which means the data is skewed. This can be avoided by having a unique primary index or a multi-column primary index: a unique primary index will spread the data perfectly evenly, and a multi-column primary index can overcome the skew caused by NULL values.
A full table scan is performed with the participation of all AMPs; each AMP reads all of its rows because the primary index is not used to narrow the search.
When data is loaded, each AMP holds a portion of the rows for every table. When queries are run, each AMP simultaneously retrieves the rows. The table header and data rows are stored on each AMP; the table header is created on each AMP when the table is created.
Each AMP stores the rows it owns inside data blocks. Data is read into each AMP's memory cache, the FSG cache. When the header and the data are in memory, the AMP can read and process them.
When a proper index is chosen, a single AMP needs to move only one block of data into the FSG cache. When rows are inserted into a block and the block grows, the block is split.
Synchronized scan (sync scan) is another feature. When there are multiple users reading different blocks, the system shares reads between multiple scans at the block level; a new query joins the scan at the current scan point.
The scan wraps around after reading all the blocks, so the latecomer gets the blocks it missed. The scan could also have started before the query ran.
Teradata has a measure for data temperature, i.e., how hot or frequently used the data is. The hottest data is kept in memory automatically, with no DBA intervention or application changes required. While the FSG cache is used to move data from disk temporarily, this intelligent memory does not purge the data but keeps it there; the data can stay there for days.
The two are complementary. The most frequently queried tables, dictionaries, and anything that requires sub-second access are all valid candidates. Data deemed very hot stays in each AMP's intelligent memory.
The physical database is also designed to help with queries.
First, the primary index is chosen such that it is most often used by users to find rows.
Second, the primary index will reduce the data transfers from disk.
Third, the secondary indexes will limit the number of AMPs used and data blocks transferred. Up to 32 secondary indexes can be used.
Fourth, the tables are partitioned to sort the rows horizontally.
Fifth, the tables can be partitioned vertically so that the transfer of columnar blocks can be efficient.
Sixth, the modeled tables can be denormalized.
 
We looked at Teradata's PE tasks. We will now look into what the AMP does.
The AMP is responsible for the following:
storing and retrieving rows from its assigned disks
locking the tables and the rows
sorting rows and doing all aggregation
handling all join processing
handling all space management and space accounting
The AMP executes the plan from the PE. With a WHERE clause on the primary index, there is a single-AMP retrieve.
By using the keyword EXPLAIN, we can see the parsing engine's plan. A non-unique primary index will have duplicates grouped together on the same AMP. Up to 64 columns can be combined in a multi-column primary index.
A table can have only one primary index. If the table is created with NO PRIMARY INDEX, it will receive no primary index, and a full table scan is likely on such a table.
When we run EXPLAIN with a SQL query on a table with no primary index, it shows the full table scan as an all-AMPs retrieve. When the primary index clause is simply omitted, the first column is used as the primary index by default.
NoPI tables are designed to be staging tables. Data from a mainframe or server can be loaded onto Teradata quickly with perfect distribution. Data is moved out of staging tables rather than queried in place, so an index is not necessary.
In Teradata, a row's primary index value is hashed and the output (the row hash) becomes part of the Row-ID. Because the same index value always produces the same hash value, the row hash can be used to retrieve the same row again. There is only one hashing formula.
The hash map determines which AMP will own the row: the row hash is used to look up the AMP in the hash map. Different rows hash to different AMPs, and the hash map gives a global perspective of all the rows.
If we create an index on a column that allows duplicate values, each AMP places a uniqueness value as a sub-identifier after the row hash to track the duplicates.
The row hash and the uniqueness value make up the Row-ID.
For a unique primary index, the uniqueness value is always 1. Each AMP sorts its rows by Row-ID.
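As a toy sketch of this distribution scheme, here is a small Python simulation; the hash function and the eight-AMP hash map are made up for illustration and are not Teradata's actual hashing formula or hash map.

import hashlib
from collections import defaultdict

NUM_AMPS = 8
# Toy hash map: bucket number -> owning AMP. The real hash map has far more buckets.
HASH_MAP = {bucket: bucket % NUM_AMPS for bucket in range(65536)}

def row_hash(primary_index_value):
    # One hashing formula for every table (illustrative only).
    digest = hashlib.md5(str(primary_index_value).encode()).hexdigest()
    return int(digest, 16) % 65536

amps = defaultdict(list)          # AMP number -> Row-IDs it owns
uniqueness = defaultdict(int)     # (AMP, row hash) -> counter for duplicate PI values

for pi_value in ["order-1", "order-2", "order-2", "order-3"]:
    h = row_hash(pi_value)
    amp = HASH_MAP[h]                              # the hash map decides the owning AMP
    uniqueness[(amp, h)] += 1                      # NUPI duplicates get increasing uniqueness values
    amps[amp].append((h, uniqueness[(amp, h)]))    # Row-ID = (row hash, uniqueness)

for amp in sorted(amps):
    print(amp, sorted(amps[amp]))                  # each AMP keeps its rows sorted by Row-ID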

Tuesday, January 14, 2014

In this post we talk about Teradata, its architecture and SQL, as read from the book by Tom Coffing and William Coffing. Teradata is about parallel processing. With each query, a single step is performed in parallel by each AMP. A Teradata system consists of a series of AMPs that work in parallel. Teradata systems usually involve several AMPs for greater ROI; they usually start small and grow.
Nexus is the tool used to load the logical model and display the tables and views visually. It has a Super Join Builder that guides joins and builds the SQL on the joins specified. Nexus works with Teradata and Hadoop.
Data models are inexpensive to test and alter. ERwin from CA is a data modeling tool that lets you model your tables, views, and the physical model. When the data model is shared with business users, there is more insight, and its use is made easier by the tool building the SQL.
Teradata's advantages include using Massively Parallel Processing (MPP) to analyze very large data.
In a table of a trillion rows, Teradata can find a single record in under a second with just one AMP. Teradata has sophisticated query traffic control with rules, times, delays, green lights to query, and red lights to wait. Long-running, high-volume queries are not placed in the same category as critical queries.
Teradata gives users their own views of the query processing with Viewpoint.
The Teradata environment enables a variety of queries from large massive joins, sub second tactical queries, partitioned tables for range queries, columnar designs for specific columns combined with secondary indexes, joins and hash indexes etc.
The rows of data are spread across the AMPs so that each user query can be executed on a similar number of rows by each AMP. In parallel processing, we distribute the work across the AMPs. Communication is done over the BYNET. The Parsing Engine (PE) takes the user's SQL and builds a plan for each AMP.
All Teradata tables spread their rows equally across the AMPs. Teradata systems can add AMPs for linear scalability. Teradata can scale to arbitrary size.
AMPs and PEs are called virtual processors. Each SMP node has both processors and typically many of both. SMP stands for symmetric multiprocessing. Each node is attached via a network to a disk farm. Each AMP in the node is assigned its own virtual disk.
When nodes are connected to the BYNETs, they become part of one large MPP system. Each node addition brings in additional PEs, AMPs, and disks.
Nodes, disks and BYNETs are usually kept inside what is called a cabinet. Gateway and channel driver software run as processes; one controls the communication with the LAN and the other connects to the mainframe. The Access Module Processors (AMPs) and PEs live in memory and are controlled by the Parallel Database Extensions (PDE), usually running on a Linux operating system. While BYNETs enable communication between nodes, there is also a boardless BYNET within each node, over which PEs and AMPs can communicate. Each PE can manage up to 120 individual sessions. When a user logs into a Teradata system, a PE logs them in and manages the session. The PE is also responsible for checking the SQL syntax, creating the EXPLAIN, checking security, and building a plan. The PE uses the collected statistics to build the best plan and delivers the final answer set to the user.
 
We discuss PHP today from the LAMP stack. PHP is embedded in HTML and is translated to HTML on the server before it reaches the client. PHP code is wrapped with <?php, <?, or <script language="php"> tags.
PHP works with virtually all web server software and most databases. PHP configuration is handled with the php.ini file. Single-line comments start with # or //. Multi-line printing can use a heredoc label such as print <<<END, or quoted strings. PHP is insensitive to whitespace; variable names are case-sensitive while function names are not. The data types include integers, doubles, booleans, NULL, strings, etc. Variable scope includes local variables, function parameters, and global and static variables. Like C, we have __LINE__, __FILE__, and __FUNCTION__, and loops are similar as well.
Arrays can be numeric, associative, and multidimensional. The dot operator is used to concatenate strings. fopen, fclose, fread, and filesize are some of the file functions. Functions are declared with the function keyword, such as:
function writeMessage()
{
echo "Hello";
}
Reference passing can be done with the & operator in the function declaration (older PHP versions also allowed it at the call site). Default values for function parameters can be set.
<?php
function cURLcheckBasicFunctions()
{
  // Return false if any of the basic cURL functions is missing.
  if( !function_exists("curl_init") ||
      !function_exists("curl_setopt") ||
      !function_exists("curl_exec") ||
      !function_exists("curl_close") ) return false;
  else return true;
}

if( !cURLcheckBasicFunctions() ) print_r('UNAVAILABLE: cURL Basic Functions');

$apikey = 'your_api_key_here';
$clientId = 'your_client_id_here';
$clientSecret = 'your_client_secret_here';
$url = 'https://my.api.com?api_key='.$apikey;
$ch = curl_init($url);

// url-ify the data for the POST
$fields = array('user' => urlencode('user'));
$fields_string = '';
foreach($fields as $key=>$value) { $fields_string .= $key.'='.$value.'&'; }
$fields_string = rtrim($fields_string, '&');

curl_setopt($ch, CURLOPT_URL, $url);
// curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
// curl_setopt($ch, CURLOPT_USERPWD, $credentials);
// curl_setopt($ch, CURLOPT_SSLVERSION, 3);
// Disabling SSL verification is for illustration only; do not do this in production.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);

// execute the POST
if(curl_exec($ch) === false)
{
    echo 'Curl error: ' . curl_error($ch);
}
else
{
    echo 'Operation completed without any errors';
}

// close the connection
curl_close($ch);
?>
This is a sample REST call using cURL and PHP.



Monday, January 13, 2014

We continue our discussion on EJB in this post. We would like to talk about persistence.
Persistence can be via entity beans, and it can also be via Hibernate and ORM tools; we will discuss both. Bean persistence can be managed in two ways: container-managed persistence (CMP) and bean-managed persistence (BMP). In CMP, the container manages the persistence of the bean; in BMP, the developer implements the lower-level persistence mechanism, typically when CMP's limits have been exceeded.
In CMP, entity bean data is managed automatically by the container with a mechanism of its choosing. A container implemented on top of an RDBMS may manage persistence by storing each bean's data as a row in a table. Serialization may also be built in.
A persistence manager is used to separate the management of bean relationships from the container. The container manages security, transactions etc while the persistence manager manages different databases via different containers. Using this architecture allows entity beans to become more portable across EJB vendors.
Entity beans are mapped to a database using a bean-persistence manager contract called the abstract persistence schema. The persistence manager implements and executes find methods based on EJB QL.
The persistence manager generates a mapping of the CMP objects to a persistent data store object. Persistent data stores can range from relational databases to flat files and Enterprise Resource Planning (ERP) systems. The mapping is based on the information provided in the deployment descriptor and the bean's abstract persistence schema.
The CMP entity bean and the persistence manager use a contract to define bean-to-bean, bean-to-dependent, and even dependent-to-dependent object relationships within an entity bean. When the EJB is deployed, the persistence manager generates an instantiated implementation of the EJB class and its dependents using the XML deployment descriptor and the bean class. The instantiated implementation includes the data access code that reads and writes the state of the EJB to the database at runtime. The persistence manager also manages that state, and it generates subclasses for use with the container instead of the abstract classes defined by the bean provider.
CMP's pros are database independence and container-specific features. The cons are reliance on container-supported algorithms, limited portability to other EJB containers, no developer access to the generated persistence code, and generated SQL that is not always the most efficient.
BMP's pros are that it is container independent, standards based (using the EJB and JDBC APIs), able to support nonstandard data types, flexible for validation logic, and able to take advantage of nonstandard SQL features. The cons are that it is database specific, requires knowledge of SQL, and takes longer to develop.
In this post, we continue our discussion on Java EJB.
The bean class methods must be declared public and must not be final or static. The throws clause may define arbitrary application exceptions, and the bean class must not define the finalize() method.
Beans can be specified as stateless, stateful, or message-driven. Resource dependencies are injected into a variable or setter method by the EJB container. A persistence context can also be injected, using the EntityManager data type and the corresponding imports. An entity can then be found using the EntityManager object:
Person p = em.find(Person.class, 10);
New entities can be added as em.persist(p);
Specifying interceptors is optional. The Interceptors annotation is used to associate the interceptor class with the bean class; the method in the interceptor class annotated with the AroundInvoke annotation then becomes the interceptor method for the bean class.
The PostConstruct and PreDestroy annotations are for notifications immediately after dependency injection and before the first business method, and for releasing resources, respectively.
EJB Access can be secured with the DeclareRoles, RolesAllowed, DenyAll, PermitAll and RunAs annotations.
Once the Java source code is compiled, deployment descriptors can be created or edited.
Exception handling can be done with try-catch blocks, and the methods can declare the exceptions they throw.
There are several things to consider for a good Java EJB program. Some of these are:
Modular code
Design patterns
Layering
Performance
Security

Sunday, January 12, 2014

This blog post is on study material for Sun certification for J2EE:
Architecture considers the functions of components, their interfaces, and their interactions. The architecture specification forms the basis for the application design and implementation steps. This book mentions that flexible systems minimize the need to adapt by maximizing their range of normal situations. In a J2EE environment, there could be a JNDI agent, a Java Naming and Directory Interface agent that knows what system elements are present, where they are, and what services they offer.
Classes and interfaces for an Enterprise JavaBeans component include the following: the Home (EJBHome) interface, the Remote (EJBObject) interface, the XML deployment descriptor, the bean class, and the context objects. The Home interface provides the lifecycle operations (create, remove, find) for an EJB. The JNDI agent is used by the client to locate an EJBHome object.
The remote interface provides access to the business methods within the EJB. An EJBObject represents a client view of the EJB and acts as a proxy for it. It exposes the application-related interfaces for the object but not the interfaces that allow the container to manage and control the object. The container implements state management, transaction control, security, and persistence services transparently. For each EJB instance, there is a SessionContext object or an EntityContext object; the context object is used to coordinate transactions, security, persistence, and other system services.
package examples;

import javax.ejb.Remote;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.interceptor.ExcludeDefaultInterceptors;

// In practice the interface and the bean would live in separate source files.
public interface Service {
   public void sayBeanExample();
}

@Stateless
@TransactionAttribute(TransactionAttributeType.NEVER)
@Remote({examples.Service.class})
@ExcludeDefaultInterceptors
public class ServiceBean implements Service {
   public void sayBeanExample() {
      System.out.println("Hello From Service Bean!");
   }
}
The import statements are used to import, for example, the metadata annotations and the InvocationContext that maintains state between interceptors. The @Stateless or @Stateful annotation specifies the bean type, the @Remote annotation specifies the remote interface, and the @EJB annotation is used for dependency injection, for example to inject the dependent "ServiceBean" stateless session bean's context. The @Interceptors and @ExcludeClassInterceptors annotations specify that the bean is associated with an interceptor class and that the class-level interceptor methods should not fire for the annotated method, respectively. A @PreDestroy method is used for cleanup.
We looked at a Node.js Backbone solution from the book JumpStart Node.js. The book continues with an example of real-time trades appearing in the browser. Here we add a function to store the exchange data every time it changes. The first task is to store the data; then it is transformed and sent to the client.
Instead of transmitting the entire data set, it is transformed. When the client makes the initial request, we transmit this data. Client-side filters could change, hence it is better to use templates. We can use jQuery's get function to retrieve the template and send an initial 'requestData' message to the server so that the initial data can be sent.
As before, we use initialize to call the render function. We iterate through all the models and render each row individually with a separate view for the real-time trades. With the static template, this is now easier to render than when a string was used. With the data loaded, it is easier to handle just the updates with a separate method.
Heroku can be used to deploy the Node.Js application.
Express.js supports both production and development settings. One of the differences between the two is how errors are handled: in development we want as much error information as possible, while in production we lock it down.
We also provide a catch-all handler for any request not processed by prior routes.
When it comes to hosting the application, there are several options available, such as IaaS, PaaS, or Amazon's EC2. The cost of all this convenience is some loss of control; in general, this is not a problem and the convenience is well worth it.
For those choosing to deploy on either a dedicated server or EC2, it is better to use an option that automatically restarts the application upon any crash or file change. The node-supervisor package helps in this regard, but for production it is better to use the forever package since it has minimal overhead.
Version control and Heroku deployment should go together so that we can roll back unstable changes. With incremental changes, 'git push heroku master' can then become a habit.
We did not cover Socket.IO and scoop.it.

 

Saturday, January 11, 2014

Today I'm going to read from a book called Jumpstart Node.js by Don Nguyen.
Just a quick reminder that Node.js is a platform for writing server-side applications. It achieves high throughput via non-blocking I/O and a single-threaded event loop. Node.js contains a built-in HTTP server library, so Apache or lighttpd is not required.
The book cites the application WordSquared as an introduction to applications in Node.js. This is an online, real-time, infinite game of Scrabble.
Node.js is available from GitHub or via a package manager.
Over the HTTP server is a framework called Connect that provides support for cookies, sessions, logging, and compression, to name a few. On top of Connect is Express, which has support for routing, templates, and a view rendering engine.
Node.js is minimalistic. Access to web server files is provided via the fs module; Express and routing are available via the express and routes modules.
Node.js allows callback functions, and these are used widely since Node.js is asynchronous. For example:
setTimeout(function(){ console.log('example'); }, 500);
The line after this statement is executed immediately, while the example is logged after the timeout.
Node.js picks up changes to code only on restart. This can become tedious after a while, so node-supervisor is installed to automatically restart the system upon changes to files.
MongoLab is a cloud-based NoSQL provider and can come in useful for applications requiring a database.
Backbone is an MVC framework that can be combined with Node.js to provide a rich real-time user interface.
To create a custom stock ticker, a filter for the code of the stock could be implemented. When the user submits a request, Backbone makes a request to the server-side API. The data is placed into a model on the client side. Subsequent changes are made to the model, and bindings specify how those changes should be reflected in the user interface. To display the view, we have initialize and setVisibility functions. In the initialize function, a change in the model is bound to the setVisibility function; in the latter, we query the properties and set the view accordingly. When the filtering is applied, the stock list is thus updated.
In the previous post, we examined some tools on Linux to troubleshoot system issues. We continue our discussion with high CPU utilization issues. One approach is to read logs; another is to take a core dump and restart the process. The ps and kill commands come in very useful for taking a dump. By logs we mean performance counter logs; on Linux, these could come from the sar or vmstat tools, which can run in sampling mode. The logs help identify which core in a multicore processor is utilized and whether there is any processor affinity for the code of the process running on that core.
User workloads are also important to analyze, if present. High CPU utilization could be triggered by a workload. This is important to identify not merely because the workload gives insight into which component is being exercised, but also because it gives an idea of how to reproduce the problem deterministically. Narrowing down the scope of the occurrence throws a lot of light on the underlying issue in the application: knowing when the problem occurs, which components are likely affected, what is on the stack, and which frame is likely at the top of the stack. If there are deterministic steps to reproduce the problem, we can repeatedly trigger the situation for a better study. In such cases the frame, and the source code in terms of module, file, and line, can be identified, and then a resolution can be found.
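As a rough sketch of the sampling approach (the kind of per-core numbers sar or vmstat would report), the following Python snippet reads /proc/stat twice and computes per-core utilization; it assumes a Linux machine and the standard /proc/stat field layout.

import time

def read_cpu_times():
    # Return {cpu_name: (busy_ticks, total_ticks)} parsed from /proc/stat.
    times = {}
    with open("/proc/stat") as f:
        for line in f:
            if not line.startswith("cpu"):
                continue
            name, *fields = line.split()
            fields = [int(x) for x in fields]
            idle = fields[3] + fields[4]        # idle + iowait
            total = sum(fields)
            times[name] = (total - idle, total)
    return times

def per_core_utilization(interval=1.0):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    util = {}
    for name in after:
        busy = after[name][0] - before[name][0]
        total = after[name][1] - before[name][1]
        util[name] = 100.0 * busy / total if total else 0.0
    return util

if __name__ == "__main__":
    for core, pct in sorted(per_core_utilization().items()):
        print(f"{core}: {pct:.1f}%")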
Memory utilization is also a very common issue, and there are two approaches here as well. One approach is to add instrumentation, either via the linker or via tracing, to see the call sequences that lead to memory allocations. Another approach is to use external tools that capture stack traces at all allocations, so that the application's memory footprint shows which allocations have not been freed and the corresponding offending code. Heap allocations are generally tagged to identify memory corruption issues. These tools work on the principle that the tags at the beginning and the end of an allocation are not expected to be overwritten by the process code, since the allocations are wrapped in tags by the tool. Any write access to the tags is likely from memory-corrupting code, and a stack trace at such a time will point to the offending code path. This is useful for all sizes of allocations and de-allocations.
Leaks and corruptions are two different syndromes that need to be investigated and resolved differently.
In the case of leaks, a code path may continuously leak memory when invoked. Tagging all allocations and capturing the stack at each allocation, or reducing the scope to a few components and tracking the objects created by each component, can give insight into which object or allocation is missed. Corruption, on the other hand, is usually non-deterministic and can be caused by such things as timing issues, and the place of corruption may also be random. Hence, it is important to identify from the pattern of corruption which component is likely involved and whether minimal instrumentation can be introduced to track all objects that have such a memory footprint.
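As one concrete example of the allocation-stack-capturing idea (the technique itself is language-agnostic; for native code, tools in the valgrind family play a similar role), here is a minimal sketch using Python's standard tracemalloc module; the workload function is a hypothetical stand-in for the suspected code path.

import tracemalloc

tracemalloc.start(25)                 # keep up to 25 frames per allocation

baseline = tracemalloc.take_snapshot()

leaky = []
def workload():
    # Stand-in for the code path suspected of leaking.
    leaky.extend(bytearray(1024) for _ in range(1000))

workload()

snapshot = tracemalloc.take_snapshot()
# Diff against the baseline: allocations that appeared and were never freed.
for stat in snapshot.compare_to(baseline, "traceback")[:5]:
    print(stat.size_diff, "bytes in", stat.count_diff, "blocks")
    for line in stat.traceback.format():
        print("   ", line)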