Cluster computing

Wednesday, January 8, 2014

A recap of on five progressive steps by Barry Wise to database normalization
We start out with an example where we store user's name, company, company address, and personal urls - say url1 and url2
The zero form is when all of this is in a single table and no normalization has occurred.
The first normal form is achieved by
1) eliminating repeating groups in individual tables
2) creating a separate table for each set of data
3) identify each set of related data with a primary key
so this yields a table where the user information is repeated for each url so url field limitation is solved
The Second normal form is achieved by
1) Creating separate tables for a set of values that apply to multiple records
2) Relate these tables with a foreign key
Basically, we break the url values into a separate table so we can add more in the future
The third normal form is achieved by
1) eliminating fields that do not depend on the key
Company name and address have nothing to do with the user id, so they are broken off into their own table
The fourth and higher form depend on data relationships involving one-one, one-to-many and many-to-many.
The Fourth normal form is
1) In many to many relationship, independent entities cannot be stored in the same table.
To get many users related to many urls, we define a url_relations where they user id and url id are paired.
The next normal form is the Fifth normal form which suggests that
1) The original table must be reconstructed from the tables into which it has been broken down. This is a way to check that no new columns have been added.
As always, remember that denormalization has its benefits as well.
Also, Litt's tips additionally mentions the following :
1) create a table for each list. More than likely every list will have additional information
2) create non-meaningful identifiers.
This is to make sure that business rule changes do not affect the primary identifier

Barry Wise on five progressive steps to database normalization
We start out with an example where we store user's name, company, company address, and personal urls - say url1 and url2
The zero form is when all of this is in a single table and no normalization has occured.
The first normal form is achieved by
1) eliminating repeating groups in individual tables
2) creating a separate table for each set of data
3) identify each set of related data with a primary key
so this yields a table where the user information is repeated for each url so url field limitation is solved
The Second normal form is achieved by
1) Creating separate tables for a set of values that apply to multiple records
2) Relate these tables with a foreign key
Basically, we break the url values into a separate table so we can add more in the future
The third normal form is achieved by
1) eliminating fields that do not depend on the key
Company name and address have nothing to do with the user id, so they are broken off into their own table
The fourth and higher form depend on data relationships involving one-one, one-to-many and many-to-many.
The Fourth normal form is
1) In many to many relationship, independent entities cannot be stored in the same table.
To get many users related to many urls, we define a url_relations where they user id and url id are paired.
The next normal form is the Fifth normal form which suggests that
1) The original table must be reconstructed from the tables into which it has been broken down. This is a way to check that no new columns have been added.
As always, remember that denormalization has its benefits as well.
Also, Litt's tips additionally mentions the following :
1) create a table for each list. More than likely every list will have additional information
2) create non-meaningful identifiers.
This is to make sure that business rule changes do not affect the primary identifier

Some T-SQL queries
SELECT t.name as tour_name, COUNT(*)
FROM Upfall u INNER JOIN trip t
on u.id = t.stop
GROUP BY t.name
HAVING COUNT(*) > 6
Aggregate funcions - AVG(), MAX(), MIN(), MEDIAN(), COUNT(), STDEV(), SUM(), VARIANCE()

--Summarizing rows with rollup
SELECT t.name AS tour_name, c.name as county_name COUNT(*) as falls_count
FROM upfall u INNER JOIN trip t
ON U.id = t.stop
INNER JOIN county c ON u.county_id = c.id
GROUP BY t.name, c.name with ROLLUP

SELECT t.name as tour_name,
c.name as county_name
COUNT(*) as falls_count
GROUPING(t.name) as n1 -- test null from cube
GROUPING(t.name) as n2 -- test null from cube
from upfall u INNER JOIN trip t
ON u.id = t.stop
INNER JOIN county c
ON u.county_id = c.id
WHERE t.name = 'Munising'
GROUP BY t.name,c.name WITH CUBE

--RECURSIVE QUERIES
WITH recursiveGov
(level, id, parent_id, name, type) AS
(SELECT 1, parent.id, parent.parent_id, parent.name, parent.type
FROM gov_unit parent
WHERE parent.parent_id IS NULL
UNION ALL
SELECT parent.level + 1, child.id, child.parent_id, child.name, child.type
FROM recursiveGov parent, gov_unit child
WHERE child.parent_id = parent.id)
SELECT level, id, parent_id, name, type
FROM recursiveGov

CREATE TABLE COUNTRY(
ID int identity (1,1),
NAME varchar(15) NOT NULL,
CODE Varchar(2) DEFAULT 'CA'
CONSTRAINT code_not_null NOT NULL
CONSTRAINT code_check
CHECK (country IN ('CA', 'US')),
indexed_name VARCHAR(15),
CONSTRAINT country_pk
PRIMARY KEY(ID)
CONSTRAINT country_fk01
FOREIGN KEY (name,code)
REFERENCES parent_example (name,country),
CONSTRAINT country_u01
UNIQUE(name,country)
CONSTRAINT country_index_upper
CHECK(indexed_name = UPPER(name))
);

Tuesday, January 7, 2014

This post is about one of the interview questions and is a coding problem :
bool IsMatch(string input, string pattern);
string input can be "ABCDBDXYZ"
string pattern can be "A*B?D*Z"
* and ? are the usual wild card for 0 or more and only one char respectively.
Here are some possible implementation:
bool IsMatch(string input, string pattern)
{
return Regex.Matches(input, pattern).Count > 0;
}

A brief review of programming interviews exposed book.
Bitwise operations - OR(any), AND(both) and XOR(same=0,different=1)
Shift operations - Base 2 right shift => divide by 2 and left shift => multiply by 1
Rectangle overlap is written as a.ul.x <= b.lr.x && a.ul.y >= b.lr.y && a.lr.x >= b.ul.x && a.lr.y <= b.ul.y
union{
int theInteger
char singleByte;
} endianTest;
endianTest.theInteger = 1
return endianTest.singleByte;
Permutations proceeds with an array of booleans for each element that denotes whether it is used
Combinations proceeds with varying start
Tackle graphical and spatial problems with pictures and over time.
Tackle Nodes and lists with previous, current and next variables during traversal.
Trees and Graphs are best tackled with traversals.

Monday, January 6, 2014

The red-black tree insert is very much like a tree insert except that its colored red before fix up.
The red-black tree delete considers four cases corresponding to
the fix up sibling w is red => color w to black and left-rotate
x's sibling w is black and both of w's children are black => color w to red.
x's sibling w is black and w's left child is red and right child is black => exchange color of w and its left child and right-rotate
x's sibling w is black and w's right child is red => change w's color and its parent and perform a left rotation
Now on to networking technologies:
SNMP - manages states such as address translation tables, routing tables, TCP connection states etc using MIB
Resolution occurs with different levels of identifiers : domain names, IP addresses, and physical network addresses. First, users specify domain names when interacting with the application. Second, application engages DNS to translate domain names to IP address and lastly IP engages ARP to translate the next hop IP address to physical address.
TLS session involves Client and Server communication where server sends certificate to client with its public key, client sends session keys, initialization vectors etc with encryption using the public key, server decrypts messages with its private key.
A certificate is a document with a digital signature and is signed by a Certification Authority. Keyed MD5 produces a cryptographic checksum for a message as m + MD5(m + k)
Public Key Authentication happens with A sending E(x, Public-B) to B and B sending back the decrypted x.
Kerberos provides a third party authentication by initiating, intercepting and closing the handshake.
Transmission Control Protocol provides ordered reliable error free transmission with flow control and congestion management. This it does with sequence numbers, sliding window and window scaling. Sequence numbers, selective acknowledgements, receive window fields and persist timers help manage the flow.

Sunday, January 5, 2014

I'm preparing for an interview so I will post frequently as a recap of the things I revised.
A revisit of the Active Directory configurations and DNS and networking technologies.
Active Directory site topology and replication -
Replication usually from single master server to subordinate servers
Active directory offers multimaster replication; avoids single point of failure
KCC tool sets up and manages the replication connections.
KCC uses two modes - intrasite and intersite.
intrasite is designed to create a minimum latency ring topology between DCs
the intersite uses a spanning tree algorithm with site link metrics.
Replications flows are setup between sites and DFS shares.
By default there's one site created automatically.
Multiple sites can be defined for a single location to segregate resources.
AD sites are defined in terms of a collection of well-connected AD subnets.
Site links connect and DC uses them to cover additional ones including the current site for user logons.
If not all site links are available, bridges are used instead.
Naming contexts are replicated by a domain controller by maintaining a high watermark table
- one each for schema, configuration and domain NCs.
This is based on the highest USNs of the updates.
Conditional Forwarding, delegation options and Dynamic DNS.
CF is the feature that lets name resolution for an ip address to be passed other than the local dns
DNS servers can be primary or secondary
primary stores all the records
secondary gets the contents from primary
The contents of a zone file are stored hierarchically
This structure can be replicated among all the DCs.
It is updated via LDAP operations or DDNS (must have AD integration)
A common misconfiguration issues is the island issue when ip address for a DNS changes
and it is updated only locally. To do a global update instead, they must point to a root server other than themselves.
Delegation options are granted to DNS servers or DCs.
Simple is when DNS namespaces are delegated to DCs and DC hosts a DNS zone.
The records in a DNS server as opposed to DC are autonomously managed.
DNS servers need to allow DDNS by DC
DC does DDNS to prevent updates to the DNS records in the server.
Support and maintenance is minimal with DDNS.
Standalone AD is used to create test or lab networks.
A forest is created, a DC is assigned, DNS Service is installed.
DNS zone is added, unresolved requests are forwarded to an existing corporate server
The primary DNS for all clients point to the DC.
Background loading of DNS Zones makes it even easier to load DNS zones while keeping the zone available for dns updates / queries.

Algorithms and data structures:
1) Quicksort - defined as
Partition
Quicksort one side
Quicksort other side

Partition works something like this
x is the value of the partition candidate A[r] in A[p,r]
i,j indexes are maintained
j iterates from first to the last but one
i lags behind j
i and j bound the values higher than the partition candidate x

Radix sort - based on significant digits starting from right to left.

Insertion sort - think sorted list or arranging a deck of cards.
Merge sort - -
Mergesort A,p,q,r
Mergesort A,q +1, r
Merge A,p,q,r
bottom up merge and at each step sort the contents on merge
for k from p to r
if L[i] < R[j]
A[k] = L[i] i = i + 1
else
A[k] = R[j] j = j + 1

HeapSort O(nlogn)
uses a heap
Parent(i) = i/2
Left(i) = 2i
Right(i) = 2i + 1
for i from length(a)/2 downto 1
do Max-Heapify(A,i)
Max-Heapify is recursive

Tree-Successor : return minimum on the right sub tree or keep climbing the parents until the given node is descended from the left

Tree - predecessor : return maximum on left subtree or keep climbing until the given node is descended from the right.

Tree-delete uses tree-successor.
Tree-Insert walk down the tree to find the value less than the key, then insert there
Tree - Delete depends on how many children the target z has. if z has no children, we just remove it. If z has only one child, we splice out z If z has two children, we splice out its successor y which has at most one child.
Red-black tree insert and delete is even more interesting.