Cluster computing

Monday, January 6, 2014

The red-black tree insert is very much like a binary search tree insert, except that the new node is colored red before the fix up.
The red-black tree delete considers four cases in the fix up, based on the sibling w of the node x being fixed (shown for x as a left child; the right-child cases are symmetric):
x's sibling w is red => color w black, color x's parent red and left-rotate at the parent.
x's sibling w is black and both of w's children are black => color w red and move x up to its parent.
x's sibling w is black, w's left child is red and its right child is black => exchange the colors of w and its left child and right-rotate at w.
x's sibling w is black and w's right child is red => give w its parent's color, color the parent and w's right child black, and left-rotate at the parent.
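The rotations used by these cases can be sketched as below. This is a minimal illustration, assuming a hypothetical node shape of { key, color, left, right, parent } and a tree object with a root field; neither name comes from the notes above. The right rotation is the mirror image.

```javascript
// Hypothetical node shape used only for this sketch.
function makeNode(key, color = "red") {
  return { key, color, left: null, right: null, parent: null };
}

// Left-rotate at x: x's right child y moves up and x becomes y's left child.
function leftRotate(tree, x) {
  const y = x.right;               // assumes x has a right child
  x.right = y.left;                // y's left subtree becomes x's right subtree
  if (y.left !== null) y.left.parent = x;
  y.parent = x.parent;             // link x's old parent to y
  if (x.parent === null) tree.root = y;
  else if (x === x.parent.left) x.parent.left = y;
  else x.parent.right = y;
  y.left = x;                      // put x on y's left
  x.parent = y;
}
```

A rotation keeps the in-order sequence of keys unchanged, which is why the fix up can recolor and rotate without breaking the search tree property.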
Now on to networking technologies:
SNMP - manages state such as address translation tables, routing tables and TCP connection states using the MIB (Management Information Base).
Resolution occurs across different levels of identifiers: domain names, IP addresses and physical network addresses. First, users specify domain names when interacting with the application. Second, the application engages DNS to translate the domain name to an IP address. Lastly, IP engages ARP to translate the next hop IP address to a physical address.
A TLS session involves client and server communication in which the server sends the client its certificate, which contains its public key; the client sends session key material, initialization vectors etc. encrypted with that public key; and the server decrypts the messages with its private key.
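A certificate is a document that binds an identity to a public key and carries the digital signature of a Certification Authority. Keyed MD5 produces a cryptographic checksum for a message as m + MD5(m + k), that is, the message followed by the MD5 digest of the message concatenated with the shared key k.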
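The keyed MD5 checksum can be sketched with the Node.js crypto module. This is only an illustration of m + MD5(m + k); the function names protect and verify are made up for the example, and MD5 is no longer considered safe, so an HMAC over a modern hash is the usual choice today.

```javascript
const crypto = require("crypto");

// MD5 digest of a string, as a hex string.
function md5Hex(data) {
  return crypto.createHash("md5").update(data).digest("hex");
}

// Sender: transmit the message together with MD5(m + k).
function protect(message, key) {
  return { message, checksum: md5Hex(message + key) };
}

// Receiver: recompute MD5(m + k) with the shared key and compare.
function verify(packet, key) {
  return md5Hex(packet.message + key) === packet.checksum;
}
```

Only a party that knows k can produce a checksum that verifies, so a change to the message in transit shows up as a mismatch.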
Public Key Authentication happens with A sending E(x, Public-B) to B and B sending back the decrypted x.
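The challenge and response above can be sketched with the Node.js crypto module. This is a sketch under assumptions: the key pair is generated on the spot rather than taken from a certificate, and the function names are invented for the example.

```javascript
const crypto = require("crypto");

// B's key pair. In practice A would get the public half from B's certificate.
const { publicKey: publicB, privateKey: privateB } =
  crypto.generateKeyPairSync("rsa", { modulusLength: 2048 });

// A picks a random challenge x and sends E(x, Public-B).
function makeChallenge() {
  const x = crypto.randomBytes(16);
  return { x, encrypted: crypto.publicEncrypt(publicB, x) };
}

// B recovers x with its private key and sends it back.
function answerChallenge(encrypted) {
  return crypto.privateDecrypt(privateB, encrypted);
}

// A accepts B only if the returned value matches the x it chose.
function authenticate() {
  const { x, encrypted } = makeChallenge();
  return x.equals(answerChallenge(encrypted));
}
```

Only the holder of B's private key can recover x, which is what makes the returned value a proof of identity.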
Kerberos provides trusted third-party authentication: a key distribution center that both parties trust issues the ticket and session key used in the handshake.
Transmission Control Protocol provides ordered, reliable, error-free transmission with flow control and congestion management. It does this with sequence numbers, a sliding window and window scaling. Sequence numbers, selective acknowledgements, the receive window field and persist timers help manage the flow.
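The sender side of a sliding window can be sketched as follows. This is a simplified model, not TCP itself: it counts whole segments instead of bytes, uses a fixed window, and ignores loss, retransmission and congestion control. The names makeSender, sendable and onAck are made up for the sketch.

```javascript
// base is the oldest unacknowledged sequence number,
// nextSeq is the next sequence number to send.
function makeSender(windowSize) {
  return { base: 0, nextSeq: 0, windowSize };
}

// Sequence numbers the sender may transmit right now:
// it can run at most windowSize segments ahead of base.
function sendable(sender, total) {
  const out = [];
  while (sender.nextSeq < sender.base + sender.windowSize && sender.nextSeq < total) {
    out.push(sender.nextSeq++);
  }
  return out;
}

// Cumulative ACK: everything below ackNo has been received,
// so the window slides forward to start at ackNo.
function onAck(sender, ackNo) {
  if (ackNo > sender.base) sender.base = ackNo;
}
```

With a window of 3 the sender emits segments 0, 1 and 2 and then stalls until an acknowledgement slides the window forward.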

Sunday, January 5, 2014

I'm preparing for an interview so I will post frequently as a recap of the things I revised.
A revisit of the Active Directory configurations and DNS and networking technologies.
Active Directory site topology and replication -
Replication usually flows from a single master server to subordinate servers.
Active Directory offers multimaster replication instead, which avoids a single point of failure.
The KCC (Knowledge Consistency Checker) sets up and manages the replication connections.
The KCC works in two modes - intrasite and intersite.
Intrasite mode is designed to create a minimum-latency ring topology between DCs.
Intersite mode uses a spanning tree algorithm with site link metrics.
Replication flows are set up between sites and DFS shares.
By default there's one site created automatically.
Multiple sites can be defined for a single location to segregate resources.
AD sites are defined in terms of a collection of well-connected AD subnets.
Site links connect sites, and a DC uses them to cover additional sites besides its own for user logons.
If not all site links are available, site link bridges are used instead.
A domain controller replicates naming contexts by maintaining a high watermark table
- one each for the schema, configuration and domain NCs.
The table is based on the highest USNs (update sequence numbers) of the updates received.
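The high watermark idea can be sketched with a toy in-memory model. This is not the actual Active Directory replication protocol; the source object, its update records and the function names are all hypothetical and exist only to show how a highest-USN mark limits what gets pulled.

```javascript
// Stand-in for a source DC: every update gets the next USN and
// is tagged with the naming context (nc) it belongs to.
function makeSource() {
  return { nextUsn: 1, updates: [] };
}

function recordUpdate(source, nc, change) {
  source.updates.push({ usn: source.nextUsn++, nc, change });
}

// The destination keeps, per naming context, the highest USN it has
// seen from this source and asks only for updates above that mark.
function pullChanges(source, watermarks, nc) {
  const last = watermarks[nc] || 0;
  const changes = source.updates.filter(u => u.nc === nc && u.usn > last);
  for (const u of changes) {
    watermarks[nc] = Math.max(watermarks[nc] || 0, u.usn);
  }
  return changes;
}
```

A second pull with no new updates returns nothing, because every USN on the source is at or below the stored mark.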
Conditional Forwarding, delegation options and Dynamic DNS.
Conditional forwarding (CF) is the feature that lets name resolution for a given domain be passed to a DNS server other than the local one.
DNS servers can be primary or secondary
primary stores all the records
secondary gets the contents from primary
The contents of a zone file are stored hierarchically
This structure can be replicated among all the DCs.
It is updated via LDAP operations or DDNS (must have AD integration)
A common misconfiguration is the island issue, which arises when the IP address of a DNS server changes
and is updated only locally. To make the update global instead, the servers must point to a root server other than themselves.
Delegation options are granted to DNS servers or DCs.
The simple option is when a DNS namespace is delegated to DCs and a DC hosts the DNS zone.
The records in a DNS server as opposed to DC are autonomously managed.
DNS servers need to allow DDNS by DC
DC does DDNS to prevent updates to the DNS records in the server.
Support and maintenance are minimal with DDNS.
Standalone AD is used to create test or lab networks.
A forest is created, a DC is assigned, DNS Service is installed.
DNS zone is added, unresolved requests are forwarded to an existing corporate server
The primary DNS for all clients points to the DC.
Background loading of DNS Zones makes it even easier to load DNS zones while keeping the zone available for dns updates / queries.

Algorithms and data structures:
1) Quicksort - defined as
Partition
Quicksort one side
Quicksort other side

Partition works something like this:
x is the value of the partition candidate A[r] in A[p..r]
two indexes i and j are maintained
j iterates from the first element to the last but one
i lags behind j
between i and j lie the values higher than the partition candidate x
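The partition described above, with A[r] as the pivot, can be sketched as follows. This is one common way to write it, sorting the array in place; the default parameter values are a convenience added for the example.

```javascript
// Partition A[p..r] around x = A[r]; returns the pivot's final index.
function partition(A, p, r) {
  const x = A[r];
  let i = p - 1;                        // A[p..i] holds values <= x
  for (let j = p; j < r; j++) {         // A[i+1..j-1] holds values > x
    if (A[j] <= x) {
      i++;
      [A[i], A[j]] = [A[j], A[i]];
    }
  }
  [A[i + 1], A[r]] = [A[r], A[i + 1]];  // place the pivot between the two groups
  return i + 1;
}

function quicksort(A, p = 0, r = A.length - 1) {
  if (p < r) {
    const q = partition(A, p, r);
    quicksort(A, p, q - 1);             // one side
    quicksort(A, q + 1, r);             // other side
  }
  return A;
}

quicksort([2, 8, 7, 1, 3, 5, 6, 4]);    // -> [1, 2, 3, 4, 5, 6, 7, 8]
```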

Radix sort - based on significant digits starting from right to left.
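A minimal sketch of the right-to-left digit passes, assuming non-negative integers and base 10. Each pass distributes the numbers into ten buckets by the current digit and reads them back in order, which keeps the pass stable.

```javascript
function radixSort(nums) {
  let out = nums.slice();                       // leave the input untouched
  const max = Math.max(0, ...out);
  // One pass per decimal digit, least significant first.
  for (let exp = 1; Math.floor(max / exp) > 0; exp *= 10) {
    const buckets = Array.from({ length: 10 }, () => []);
    for (const n of out) {
      buckets[Math.floor(n / exp) % 10].push(n);
    }
    out = [].concat(...buckets);                // stable: bucket order, then arrival order
  }
  return out;
}

radixSort([170, 45, 75, 90, 802, 24, 2, 66]);   // -> [2, 24, 45, 66, 75, 90, 170, 802]
```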

Insertion sort - think sorted list or arranging a deck of cards.
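The deck of cards picture can be sketched like this: everything to the left of position j is the sorted hand, and the next card is slid left until it fits.

```javascript
function insertionSort(A) {
  for (let j = 1; j < A.length; j++) {
    const key = A[j];               // the card being inserted
    let i = j - 1;
    while (i >= 0 && A[i] > key) {  // shift larger cards one place right
      A[i + 1] = A[i];
      i--;
    }
    A[i + 1] = key;                 // drop the card into the gap
  }
  return A;
}

insertionSort([5, 2, 4, 6, 1, 3]);  // -> [1, 2, 3, 4, 5, 6]
```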
Merge sort -
Mergesort A,p,q
Mergesort A,q+1,r
Merge A,p,q,r
Merging happens bottom up, and each merge leaves the range A[p..r] sorted. L and R are copies of A[p..q] and A[q+1..r], each ending with an infinity sentinel:
for k from p to r
    if L[i] <= R[j]
        A[k] = L[i]; i = i + 1
    else
        A[k] = R[j]; j = j + 1
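A runnable sketch of the same steps, assuming an array of finite numbers so that Infinity can serve as the sentinel at the end of each half. The midpoint q is computed inside mergeSort.

```javascript
// Merge the sorted ranges A[p..q] and A[q+1..r] back into A[p..r].
function merge(A, p, q, r) {
  const L = A.slice(p, q + 1).concat(Infinity);      // sentinel ends the left half
  const R = A.slice(q + 1, r + 1).concat(Infinity);  // sentinel ends the right half
  let i = 0, j = 0;
  for (let k = p; k <= r; k++) {
    if (L[i] <= R[j]) A[k] = L[i++];                 // <= keeps equal keys in order
    else A[k] = R[j++];
  }
}

function mergeSort(A, p = 0, r = A.length - 1) {
  if (p < r) {
    const q = Math.floor((p + r) / 2);
    mergeSort(A, p, q);
    mergeSort(A, q + 1, r);
    merge(A, p, q, r);
  }
  return A;
}

mergeSort([5, 2, 4, 7, 1, 3, 2, 6]);                 // -> [1, 2, 2, 3, 4, 5, 6, 7]
```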

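HeapSort O(n log n)
uses a heap stored in an array with 1-based indexes
Parent(i) = floor(i/2)
Left(i) = 2i
Right(i) = 2i + 1
Build the heap first:
for i from floor(length(A)/2) downto 1
    do Max-Heapify(A,i)
then repeatedly swap A[1] with the last element of the heap, shrink the heap by one and call Max-Heapify(A,1)
Max-Heapify is recursive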
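A sketch of the same procedure in JavaScript. Because arrays here are 0-based, the child formulas shift to 2i + 1 and 2i + 2, and the heap size is passed explicitly instead of being stored on the array.

```javascript
// Sift A[i] down until the subtree rooted at i is a max-heap.
function maxHeapify(A, i, heapSize) {
  const l = 2 * i + 1;
  const r = 2 * i + 2;
  let largest = i;
  if (l < heapSize && A[l] > A[largest]) largest = l;
  if (r < heapSize && A[r] > A[largest]) largest = r;
  if (largest !== i) {
    [A[i], A[largest]] = [A[largest], A[i]];
    maxHeapify(A, largest, heapSize);      // recurse into the child that changed
  }
}

function heapSort(A) {
  // Build the heap from the last parent down to the root.
  for (let i = Math.floor(A.length / 2) - 1; i >= 0; i--) {
    maxHeapify(A, i, A.length);
  }
  // Move the maximum to the end, shrink the heap, restore the heap property.
  for (let end = A.length - 1; end > 0; end--) {
    [A[0], A[end]] = [A[end], A[0]];
    maxHeapify(A, 0, end);
  }
  return A;
}

heapSort([4, 1, 3, 2, 16, 9, 10, 14, 8, 7]); // -> [1, 2, 3, 4, 7, 8, 9, 10, 14, 16]
```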

Tree-Successor : return minimum on the right sub tree or keep climbing the parents until the given node is descended from the left

Tree - predecessor : return maximum on left subtree or keep climbing until the given node is descended from the right.
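Both walks can be sketched as below, assuming nodes that carry a parent pointer. The node shape and the bstInsert helper are hypothetical and only there to build a tree to walk.

```javascript
function bstNode(key) {
  return { key, left: null, right: null, parent: null };
}

// Helper for the sketch: plain BST insert that maintains parent pointers.
function bstInsert(tree, key) {
  const z = bstNode(key);
  let y = null, x = tree.root;
  while (x !== null) { y = x; x = key < x.key ? x.left : x.right; }
  z.parent = y;
  if (y === null) tree.root = z;
  else if (key < y.key) y.left = z;
  else y.right = z;
  return z;
}

function treeMinimum(x) { while (x.left !== null) x = x.left; return x; }
function treeMaximum(x) { while (x.right !== null) x = x.right; return x; }

function treeSuccessor(x) {
  if (x.right !== null) return treeMinimum(x.right);
  let y = x.parent;
  while (y !== null && x === y.right) { x = y; y = y.parent; } // climb while coming from the right
  return y;                                                    // null if x held the largest key
}

function treePredecessor(x) {
  if (x.left !== null) return treeMaximum(x.left);
  let y = x.parent;
  while (y !== null && x === y.left) { x = y; y = y.parent; }  // climb while coming from the left
  return y;                                                    // null if x held the smallest key
}
```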

Tree-delete uses tree-successor.
Tree-Insert : walk down the tree, going left when the key is less than the current node's key and right otherwise, until an empty spot is found, then insert there.
Tree-Delete : depends on how many children the target z has. If z has no children, we just remove it. If z has only one child, we splice out z. If z has two children, we splice out its successor y, which has at most one child, and copy y's key into z.
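Insert and delete can be sketched recursively without parent pointers, assuming distinct keys. Each function returns the (possibly new) root of the subtree it was given; the names bstAdd, bstRemove and inorder are made up for the example.

```javascript
function bstAdd(root, key) {
  if (root === null) return { key, left: null, right: null }; // empty spot found
  if (key < root.key) root.left = bstAdd(root.left, key);
  else root.right = bstAdd(root.right, key);
  return root;
}

function bstRemove(root, key) {
  if (root === null) return null;                 // key not present
  if (key < root.key) root.left = bstRemove(root.left, key);
  else if (key > root.key) root.right = bstRemove(root.right, key);
  else {
    if (root.left === null) return root.right;    // no child or right child only: splice out z
    if (root.right === null) return root.left;    // left child only: splice out z
    let y = root.right;                           // two children: successor is the
    while (y.left !== null) y = y.left;           // minimum of the right subtree
    root.key = y.key;                             // copy y's key into z
    root.right = bstRemove(root.right, y.key);    // splice out y (it has no left child)
  }
  return root;
}

// In-order walk, used to check that the keys stay sorted.
function inorder(root, out = []) {
  if (root !== null) {
    inorder(root.left, out);
    out.push(root.key);
    inorder(root.right, out);
  }
  return out;
}
```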
Red-black tree insert and delete are even more interesting.

We cover the data warehouse design review checklist in this post. Design reviews are very helpful for ensuring quality in the operational environment. They identify errors before coding begins, which saves cost. A design review considers such things as transaction performance, batch window adequacy, system availability, capacity, project readiness and user requirements satisfaction. The benefits of a design review become obvious when there is less code churn. A design review applies to both operational systems and data warehouses, but there are some differences.
A data warehouse is not built using the SDLC, as operational systems are.
In the operational environment, development is done one application at a time. In the data warehouse environment, development is done one subject area at a time.
In the operational environment, there are firm requirements, whereas in the data warehouse environment the processing requirements are not known at the outset of DSS development.
In the operational environment, transaction response time is critical.
In the operational environment, the input comes from external systems.
In the operational environment, the data is current-value, whereas in the warehouse it is time-variant.
A design review in the data warehouse is done as soon as a major subject area has been designed.
Participants of a design review typically include data administration, database administration, programmers, DSS analysts, end users other than DSS analysts, operations, system support and management. End users and DSS analysts matter more than others.
The design review can table any item for discussion, especially the controversial ones. The review helps ensure the project's success.
The data warehouse design review should result in the following:
A list of the issues encountered, and recommendations for actions.
Documentation of where the system is in the design as of the moment of the review.
A list of action items that are specific and precise.
Typically a review includes both a facilitator and a recorder. The facilitator is not the leader, so that the review can have maximum input. The facilitator brings in an external perspective and can offer criticism constructively.
The items on the design review checklist include all of the points discussed above and more. The complete list is available in the book Building the Data Warehouse.

Saturday, January 4, 2014

We continue our post on data warehouses with a discussion of the end user community. The end users have a lot of say in how the data warehouse is shaped. They are a diverse group, so we recognize four types - the farmers, the explorers, the miners and the tourists. The farmer is the most predominant type of user found in the data warehouse environment. This is a predictable user: the queries submitted are short, go directly for the data, recur at the same time of the week and usually succeed in finding the data.
The explorer is the user who does not know what he or she wants before the exploration begins, and hence takes more time and covers a larger volume of data. The exploration proceeds in a heuristic mode. In many cases the exploration looks for something and never finds it, but there are also cases where the discoveries are especially interesting.
The miner is the user who digs into piles of data to test assertions. The strength of an assertion is judged from the data. This user community usually works with statistical tools and needs mathematical skills. The miner may work closely with the explorer: the explorer creates assertions and hypotheses and the miner determines their strength.
The tourist is the user who knows what to find where. This user has breadth of knowledge as opposed to depth. This user is familiar with both formal and informal systems. He or she knows the metadata and the indexes, the structured data and the unstructured data, the source code and how to read and interpret it.
These end users target different types of data. If the data were arranged in bands by the probability of its use in the data warehouse, the farmers would be very predictable and target only the small top band, while the explorers would reach all over the data.
Cost justification and ROI analysis could be described for these user communities as follows:
The farmer's value and probability of success are very high. His queries are useful in decision support. The explorer's success rate is not that high, although his finds are much more valuable than the regular queries performed by the farmers. The cost justification for the warehouse should therefore present the ROI from the farmer community rather than from the explorers.

Friday, January 3, 2014

We cover Corporate Information Compliance and data warehousing.
Corporate information compliance is compliance mandated by law. Some examples are Sarbanes-Oxley, Basel II and HIPAA. These rules came about because corporations such as Enron, WorldCom and Global Crossing engaged in accounting fraud. The Sarbanes-Oxley Act, for instance, was introduced to enforce proper and honest accounting.
The data warehouse plays an important role in compliance. Compliance is implemented over financial transactions and corporate communications. The financial transactions are checked for completeness and for legitimacy of routing and classification, and are then separated into past and present data. The past data makes its way to the data warehouse. On the corporate communications side, the communications and the compliance terms and phrases are fed into a sorter that builds a word and phrase context, which is pushed into the data warehouse together with a simple index and the actual messages.
The two basic activities for the corporations are
to comply with financial requirements and controls and
to comply with organizational communications aspect of regulation.
Financial compliance deals with the recording, procedures and reporting of financial transactions. These translate to a whole set of approvals and formalized procedures that corporations had usually never encountered before. Compliance scrutinizes the financial transactions not just at the micro level but also at the macro level.
Most corporations start with the present aspect of the financial transactions audit. They use mini-audits up front to make sure the systems comply. Once the audit and the procedures are finished, the data makes its way into the warehouse.
However, compliance also requires looking at past data. Since the warehouse holds historical, granular and integrated data, it becomes useful for financial auditing. Looked at broadly, corporate finance deals with two aspects: the what and the why.
The what is answered by the details of all financial transactions - amount, from, date, control number, classification etc. It covers questions such as: Are all transactions included? Are the transactions recorded at the lowest level of granularity? Is the relevant information for each financial transaction recorded properly? Is the recording accurate? Have the transactions been classified?
One difference between data stored for compliance and data in the warehouse is that the former is seldom used, although both are large.
Another difference is that compliance data cannot be lost because it is always needed for audits. The data in the warehouse is not as sensitive to loss.
Another difference is the responsiveness to queries. For the warehouse, query response times can range widely. Audit queries are usually completed in days.
Yet another difference between data warehouse data and data stored for compliance is content. The length of time that compliance data needs to be stored depends on legislation, company comfort and physical storage.
The why of the financial transactions is about the activities that take place before a transaction occurs. These include things such as proposals, commitments, terms, delivery and guarantee.

Thursday, January 2, 2014

This post returns to the discussion on data warehouses, starting with the cost justification and return on investment for a data warehouse. We look at cost justification at the macro level before we turn to the micro level. The macro level refers to a discussion at a high level, such as what the increase in profits or stock price was. However, the macro level is affected by many factors and not just by improvements from the warehouse, so the specific association with the warehouse may have to be determined.
For the micro level cost justification, each data pull and integration from the operational systems is compared against the ease of use of a data warehouse. Information from the legacy environment is hard to obtain: the data may not be in proper shape, undocumented APIs may be involved, guesses have to be made, and the process in general is very convoluted. Even when it isn't, there is still integration involved. Further, the data may not all be available at the same time. A staging area may also be required. Finally, a report might be published.
The difference in the cost of information with or without a data warehouse is the basis for its cost justification.
The steps involved in building a warehouse are similar to those just described, with the difference that there are far fewer redundancies and more efficiency. Hence the cost of building a warehouse should be less than the cost of the same operations without it. Further, the cost of building a warehouse is incurred once, whereas in its absence the operations may need to be repeated every now and then.
Very little time is required to get information from the data warehouse. This saving in time also translates to a saving in cost.
The speed of information is also appreciated for decision support. Sometimes this is critical for new business. There is a time up to which the information may be very valuable and a point in time after which it may even be worthless. This is called the time value of information, and it helps in recognizing the significance of the warehouse, although it is difficult to quantify.
Integrated information is best available from the warehouse. For example, customer centric data may be very helpful in exploring new opportunities.
The historical data is also a real value. It becomes another dimension in the usefulness of a data warehouse.
Thus we see that a better way to proceed with a cost justification is at the micro level.
The data warehouse doesn't provide real time information but something near that can be provided by an operational data store. The data flows between the ODS and the warehouse in a bidirectional manner. Profile records are often created and placed in the ODS.

Wednesday, January 1, 2014

This is a short break to do UI automation testing with iOS. Here we look at some sample code (untested) for automating the Notes app after we set it as the target in Instruments.

var testName = "Notes application testing";
var target = UIATarget.localTarget();
var app = target.frontMostApp(); // assumes Notes has been specified in Instruments and starts on a fresh page
var sentence = "The big brown bear ate the brownies in the big paper bag.";
var added = "";

// Register the alert handler before any step that can raise the delete confirmation.
UIATarget.onAlert = function onAlert(alert) {
    alert.buttons()["Delete Note"].tap();
    return true; // the alert was handled here, so the default handler is skipped
};

UIALogger.logStart(testName);

UIALogger.logMessage("Create a new note");
UIALogger.logMessage("Repeat entry of : " + sentence);
for (var i = 0; i < 6; i++)
{
    // The first pass creates the note; later passes reopen it from the list.
    if (i == 0) app.navigationBar().buttons()["+"].tap();
    else app.mainWindow().textFields()[0].tap();
    saveSentence(target, app, sentence);
    added += sentence; // strings are immutable, so concat() alone discards its result
}

UIALogger.logMessage("Save a log of the inserted lines in the note.");
saveLog(target, app);

UIALogger.logMessage("Retrieve the note created in #5 and confirm the lines are exactly the same.");
var page = app.mainWindow().textViews()[0]; // saveLog leaves the note open
if (page.value().indexOf(added) != -1) {
    UIALogger.logPass(testName);
} else {
    UIALogger.logFail(testName);
}

UIALogger.logMessage("Delete the note and confirm it is no longer stored in Notes.");
deleteNotes(target, app, sentence, testName);

UIALogger.logMessage("Test completed.");

// Opens the note on screen, appends the sentence and returns to the list.
function saveSentence(target, app, sentence)
{
    UIALogger.logMessage("Open Notes");
    target.delay(1);
    var page = app.mainWindow().textViews()[0];
    page.tap();
    UIALogger.logMessage("The new note must have the following line in it - " + sentence);
    page.setValue(page.value() + sentence);
    UIALogger.logMessage("Close Notes");
    app.navigationBar().buttons()["Done"].tap();
    app.navigationBar().buttons()["Notes"].tap();
}

// Dumps the note's element tree to the log, then reopens the note.
function saveLog(target, app)
{
    target.delay(1);
    app.mainWindow().textFields()[0].tap();
    var page = app.mainWindow().textViews()[0];
    page.tap();
    page.logElementTree(); // writes to the Instruments log; it returns nothing to append to the note
    app.navigationBar().buttons()["Done"].tap();
    app.navigationBar().buttons()["Notes"].tap();
    app.mainWindow().textFields()[0].tap(); // reopen the note for the comparison that follows
}

// Deletes the open note and fails the test if its title is still listed.
function deleteNotes(target, app, title, testName)
{
    target.delay(1);
    target.deactivateAppForDuration(10);
    app.navigationBar().buttons()["Delete"].tap();
    app.navigationBar().buttons()["Notes"].tap();
    if (app.mainWindow().textFields()[0].value() == title)
        UIALogger.logFail(testName);
}