Sunday, June 30, 2013

How are replication conflicts resolved?
There can be conflicts during a replication cycle. For example, server A creates an object with a particular name at roughly the same time that server B creates an object with the same name. The conflict reconciliation process kicks in at the next replication cycle. The servers compare the version numbers of the conflicting updates, and the higher version number wins. If the version numbers are the same, the attribute that was changed at the later time wins. If the timestamps are also equal, the originating server GUIDs are compared and the higher GUID wins.
If an object is moved under a parent that has since been deleted, that object is placed in the LostAndFound container.
If two objects with the same RDN are created, then after the usual resolution process mentioned above, the relative distinguished name of the losing object is modified with a known unique value so that both objects can coexist. With three servers, when one server receives updates from the other two, the conflict resolution is worked out at that server and repeated at the other two when they request the updates from it.
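As a rough illustration of that precedence order, here is a minimal C# sketch; ReplicaMetadata and its field names are hypothetical, not Active Directory's actual internal structures, which store equivalent values as replication metadata on each attribute.

        // Hypothetical metadata carried with an attribute update; the directory keeps
        // equivalent values (version, timestamp, originating server) per attribute.
        class ReplicaMetadata
        {
            public int Version;              // incremented on each originating write
            public DateTime TimeChanged;     // timestamp of the originating write
            public Guid OriginatingDsa;      // GUID of the server where the write originated
        }

        // Returns true if update a wins the conflict against update b, following the
        // precedence described above: version first, then timestamp, then GUID.
        static bool Wins(ReplicaMetadata a, ReplicaMetadata b)
        {
            if (a.Version != b.Version) return a.Version > b.Version;
            if (a.TimeChanged != b.TimeChanged) return a.TimeChanged > b.TimeChanged;
            return a.OriginatingDsa.CompareTo(b.OriginatingDsa) > 0;
        }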
Active Directory uses DNS for name resolution; WINS is no longer required. DNS is a hierarchical name resolution system and, by some measures, one of the largest directory services in existence. DNS nomenclature involves zones, resource records and dynamic DNS. A zone is a delegated portion of the DNS namespace that a name server maintains; a resource record is the unit of information in DNS, so a zone is essentially a collection of resource records. Common record types include the address (A) record, pointer (PTR) record, alias (CNAME) record, mail exchange (MX) record, name server (NS) record, start of authority (SOA) record and service (SRV) record. The last is used by domain controllers and AD clients to locate servers for a particular service; for example, clients query names such as _ldap._tcp.<domain> to find a domain controller offering LDAP. Service records depend on address records: the computers running a service are registered with A records, to which the SRV records refer.

Saturday, June 29, 2013

Active Directory replication continued ...
When replicating a naming context, a domain controller maintains a high-watermark table so that it can pick up where it left off. There is one table for every naming context, which totals three if we include the schema, configuration and domain NCs. Each table stores the highest USN of the updates received from each replication partner so that only new information is requested.
This is different from the up-to-dateness vector, another table that the DC maintains to make replication of a naming context more efficient by filtering out redundant updates and preventing endless replication loops. The two tables are used together to improve the efficiency of replication.
By filtering out the same changes arriving from multiple sources, only the updates that have not yet been applied are requested. This is called propagation dampening. Thus we have seen that Active Directory is split into separate naming contexts, each of which is replicated independently, and that within each naming context a variety of replication metadata is held. Up-to-dateness vector entries consist of an originating-DSA GUID, an originating USN and a timestamp indicating the last successful replication with that originating domain controller. These values are updated only during a replication cycle.
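To make the two tables concrete, here is a minimal sketch with hypothetical types and field names (not the actual NTDS structures): the high-watermark table records how far we have read from each partner, while the up-to-dateness vector lets us dampen changes already received through another path.

        // Hypothetical update descriptor: every change carries the GUID of the DC where
        // it originated and the USN it was assigned there.
        class ReplicatedUpdate
        {
            public Guid OriginatingDsa;
            public long OriginatingUsn;
        }

        // High-watermark table: per replication partner, the highest partner-local USN received.
        static Dictionary<Guid, long> highWatermark = new Dictionary<Guid, long>();

        // Up-to-dateness vector: per originating DC, the highest originating USN already applied here.
        static Dictionary<Guid, long> upToDateness = new Dictionary<Guid, long>();

        // Propagation dampening: skip an update if its originating write has already been applied,
        // regardless of which partner is offering it now.
        static bool ShouldApply(ReplicatedUpdate u)
        {
            return !upToDateness.TryGetValue(u.OriginatingDsa, out long seen)
                   || u.OriginatingUsn > seen;
        }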
As an example of replication, consider the following scenario from the book: Step 1) a user is created on DC A. Step 2) that object is replicated to DC B. Step 3) the object is subsequently modified on DC B, and Step 4) the new changes to that object are replicated back to DC A. The Active Directory database transaction representing step 1 consists of a USN and a timestamp. Replicating the originating write to DC B allocates a different, DC B local, USN, and the user's USNCreated and USNChanged attributes are updated there. In step 3, the password change for the user on DC B again modifies the user's USNChanged attribute; in addition, the password attribute itself is modified and its corresponding USN and timestamp are updated. Step 4 is similar to step 2: a change transaction is issued on DC A and the attributes are updated. To see how a replication cycle proceeds, consider the following five steps: Step 1) replication with a partner is initiated; Step 2) the partner works out what updates to send; Step 3) the partner sends the updates to the initiating server; Step 4) the initiating server processes the updates; and Step 5) the initiating server checks whether it is up to date.
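The USN bookkeeping in that example can be sketched as follows; the class and member names are made up for illustration, but the behavior mirrors the description above: every DC stamps each write, whether originating or replicated, with its own next local USN.

        // Hypothetical per-DC state: a monotonically increasing local USN counter.
        class DomainController
        {
            public long CurrentUsn;
            public long NextUsn() { return ++CurrentUsn; }   // every local transaction consumes a USN
        }

        // Hypothetical object record with the USN bookkeeping described above.
        class DirectoryObject
        {
            public long UsnCreated;    // local USN of the transaction that created the object here
            public long UsnChanged;    // local USN of the most recent change applied here
        }

        // Step 1 (or step 2 on the replica): creating or receiving the object stamps both values.
        static DirectoryObject CreateOrReceive(DomainController dc)
        {
            long usn = dc.NextUsn();
            return new DirectoryObject { UsnCreated = usn, UsnChanged = usn };
        }

        // Steps 3 and 4: any subsequent change, originating or replicated, bumps USNChanged locally.
        static void ApplyChange(DomainController dc, DirectoryObject obj)
        {
            obj.UsnChanged = dc.NextUsn();
        }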

Friday, June 28, 2013

Active Directory Site topology and replication

While data is usually replicated from a single master server to subordinate servers, Active Directory offers multimaster replication. Single-master replication has the following drawbacks: it has a single point of failure, there may be a large geographic distance between the master and the clients performing updates, and replication is less efficient because all updates originate from a single location. With multimaster replication you avoid these problems, but you first have to create a site topology and define how the domain controllers replicate with each other. The Knowledge Consistency Checker (KCC) sets up and manages the replication connections. Subnets, using 32-bit (IPv4) or 128-bit (IPv6) addressing, are added to sites to determine relative locations on a network. AD sites are defined as collections of well-connected subnets. Replication flows are set up between sites, and DFS shares or DCs are located using sites. Sites are also used by the DC locator service, which performs DNS queries to find the nearest DC or global catalog. Most members of a domain dynamically determine their site when they start up. By default, one site is created automatically. Multiple sites can be defined for a single location to segregate resources.
Site links allow you to define which sites are connected to each other and the cost associated with each connection. Site links are used for replication and can use either IP or SMTP as the transport. Replication normally happens over IP, but SMTP can be used in cases where connectivity is poor or unreliable. Site links also help a DC determine which other sites to cover in addition to its own, so that clients in sites without a DC can still log on. As an aside, if the network is not fully routed, that is, not all sites can reach each other over the defined site links, then site link bridges have to be defined. Connection objects specify which DCs replicate with which other DCs and are generally managed by the DCs themselves, although it isn't always possible to let AD manage all of these connections.
The Knowledge Consistency Checker (KCC) automatically generates and maintains the connection objects and decides what to replicate and when. It uses two algorithms, one for intrasite and one for intersite topology. The intrasite algorithm creates a minimal-latency ring topology that guarantees no more than three hops between any two DCs in the site. The intersite algorithm keeps the sites connected via a spanning tree so that replication can occur, using the site link costs to choose the connections. RepAdmin is a command-line tool for administering replication; Replmon is a graphical utility for managing and monitoring replication.
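As a loose sketch of the intrasite idea only: the following orders a site's DCs into a ring by GUID and connects neighbors. The helper name and the simplification (no shortcut connections) are mine; the real KCC additionally creates extra connections in larger sites so that no two DCs end up more than three hops apart.

        // (uses System, System.Collections.Generic and System.Linq)
        // Minimal sketch: order the site's DCs by GUID into a ring and connect each one
        // to its neighbors in both directions.
        static List<(Guid from, Guid to)> BuildIntrasiteRing(List<Guid> dcs)
        {
            var connections = new List<(Guid from, Guid to)>();
            if (dcs.Count < 2) return connections;            // a lone DC has no partners
            var ring = dcs.OrderBy(g => g).ToList();
            for (int i = 0; i < ring.Count; i++)
            {
                var next = ring[(i + 1) % ring.Count];
                connections.Add((ring[i], next));             // replicate forward around the ring
                connections.Add((next, ring[i]));             // and in the reverse direction
            }
            return connections.Distinct().ToList();           // avoid duplicate pairs in tiny rings
        }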
Replication is tracked with update sequence numbers (USNs); timestamps come into play mainly when conflicts must be resolved. Each DC maintains its own highest combined USN across all naming contexts.
Constructing a web image graph uses the VIPS algorithm, which separates a web page into visual blocks organized in a tree. Different blocks relate to different topics, so we use the hyperlinks from block to page rather than from page to page. The page-to-block and block-to-page relationships are extracted first. The block-to-page matrix Z, with dimensions n x k, holds in each cell the inverse of the number of pages to which that block links, or zero otherwise. The page-to-block matrix X, with dimensions k x n, is populated with a normalized importance value for each block, based on its size and its distance from the center of the page, or zero otherwise. These two matrices are combined to form the new graphs: a page graph WP = XZ and a block graph WB = ZX. Let Y be a block-to-image matrix with dimensions n x m such that each cell holds the inverse of the number of images contained in the block, or zero otherwise. Using block-level link analysis, the block graph is refined as WB = (1 - t)ZX + tD^-1U, where t is a constant, U is a block-to-block matrix whose entry for blocks i and j is zero if the two blocks are contained in different web pages and otherwise the degree of coherence assigned by the VIPS algorithm, and D is a diagonal matrix that normalizes U. The block graph attempts to capture the probability of jumping from one block to another. The image graph is then constructed as WI = Y^T WB Y. This image graph better reflects semantic relationships between images and can be used with the data mining techniques discussed earlier.
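A minimal numeric sketch of the combination step, assuming small dense arrays produced elsewhere (the helper names are hypothetical); it computes the basic block graph WB = ZX and the image graph WI = Y^T WB Y with plain array arithmetic, omitting the coherence term for brevity.

        // Naive dense matrix multiply: (p x q) * (q x r) -> (p x r).
        static double[,] Multiply(double[,] a, double[,] b)
        {
            int p = a.GetLength(0), q = a.GetLength(1), r = b.GetLength(1);
            var c = new double[p, r];
            for (int i = 0; i < p; i++)
                for (int j = 0; j < r; j++)
                    for (int k = 0; k < q; k++)
                        c[i, j] += a[i, k] * b[k, j];
            return c;
        }

        static double[,] Transpose(double[,] a)
        {
            int rows = a.GetLength(0), cols = a.GetLength(1);
            var t = new double[cols, rows];
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < cols; j++)
                    t[j, i] = a[i, j];
            return t;
        }

        // Z: block-to-page (n x k), X: page-to-block (k x n), Y: block-to-image (n x m).
        static double[,] BuildImageGraph(double[,] Z, double[,] X, double[,] Y)
        {
            var Wb = Multiply(Z, X);                            // block graph WB = ZX (coherence term omitted)
            return Multiply(Multiply(Transpose(Y), Wb), Y);     // image graph WI = Y^T * WB * Y, an m x m matrix
        }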
 
A frequently occurring interview question is to print the nodes of a tree using breadth-first traversal. The breadth-first traversal of a tree visits each node from left to right at each level of the tree, starting from the root and descending to the leaves. The order in which the nodes at one level are visited is the order in which their parents were visited. The visits are ordered by maintaining a queue. The root is added to the queue; there are no more nodes at that level. When the root is dequeued, its children, if any, are added to the queue. When one of those children is dequeued, its own children are enqueued. When we dequeue the last node at any level and add its children to the queue, the next node to be dequeued is the first node of the next level. We repeat this until the queue is empty, which happens once all of the nodes have been added and removed. Since each node is enqueued and dequeued exactly once, the time complexity is linear.

        // Prints the tree level by level; a null sentinel in the queue marks the
        // end of each level so that a newline is emitted between levels.
        // Assumes BinaryTreeNode exposes data, left and right members.
        static void PrintBreadthNodes(BinaryTreeNode root)
        {
            if (root == null) return;
            var queue = new Queue<BinaryTreeNode>();
            queue.Enqueue(root);
            queue.Enqueue(null);                  // end-of-level marker
            while (queue.Count > 0)
            {
                var item = queue.Dequeue();
                if (item != null)
                {
                    Console.Write(item.data + " ");
                    if (item.left != null) queue.Enqueue(item.left);
                    if (item.right != null) queue.Enqueue(item.right);
                }
                else
                {
                    Console.WriteLine();          // finished this level
                    if (queue.Count == 0) return; // no nodes left to visit
                    queue.Enqueue(null);          // mark the end of the next level
                }
            }
        }

Thursday, June 27, 2013

Mining multimedia data on the web
Multimedia can be embedded in web pages and associated with link information. This surrounding text and link information can be regarded as features. Using web page layout mining techniques, a web page can be partitioned into a set of semantic blocks, and the block that contains a multimedia object can then be treated as a whole; searching and organizing multimedia data becomes searching and organizing the blocks that contain it. VIPS, discussed earlier, can help identify the surrounding text, and this text can be used to build an image index. Image search is then partially completed using traditional text search techniques. To construct a web image search, a block-to-image relation is used in addition to the block-to-page and page-to-block relations.
For the automatic classification of web documents, different vendors maintain their own taxonomies of web documents, and new documents are classified based on these taxonomies. Otherwise, the procedures described earlier for keyword-based document classification and keyword-based association analysis also apply here.
We discussed web content and web structure mining in the previous posts; we now discuss web usage mining. Here web log records are used to discover who accesses what pages and with what access patterns. This can give clues to who the potential customers are, help enhance the quality and delivery of internet information services to end users, and improve server performance. Web logs grow quickly and can amount to a huge data size. There is a lot of information that can be mined, but making sure it is valid and reliable is equally important. So we begin with cleaning, condensing and transforming as preprocessing steps. Next, with the available URL, time, IP address and web page content information, a weblog database is populated. This is used with a typical data-warehouse-based multidimensional OLAP analysis. Then we use association mining, as discussed earlier, to find patterns and trends in web access. For example, user browsing sequences of web pages can be helpful in improving their web experience. Systems can be improved with better web caching, web page prefetching, web page swapping, understanding the nature of web traffic, and understanding user reaction and motivation. Web sites that improve themselves based on observed access patterns are referred to as adaptive sites.
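As a toy illustration of the preprocess-then-mine idea, the sketch below assumes a made-up log line format of "IP timestamp URL" already sorted by time; it groups requests by visitor and counts consecutive page pairs, a crude stand-in for the association and sequence mining step.

        // (uses System.Collections.Generic)
        // Count how often page A is immediately followed by page B within one visitor's requests.
        // Hypothetical log line format: "10.0.0.1 2013-06-29T10:15:00 /products"
        static Dictionary<(string from, string to), int> MinePagePairs(IEnumerable<string> logLines)
        {
            var byVisitor = new Dictionary<string, List<string>>();
            foreach (var line in logLines)
            {
                var parts = line.Split(' ');
                if (parts.Length < 3) continue;              // cleaning: drop malformed records
                string ip = parts[0], url = parts[2];
                if (!byVisitor.ContainsKey(ip)) byVisitor[ip] = new List<string>();
                byVisitor[ip].Add(url);                      // keep the visitor's browsing sequence
            }

            var pairCounts = new Dictionary<(string from, string to), int>();
            foreach (var sequence in byVisitor.Values)
            {
                for (int i = 0; i + 1 < sequence.Count; i++)
                {
                    var pair = (sequence[i], sequence[i + 1]);
                    pairCounts.TryGetValue(pair, out int count);
                    pairCounts[pair] = count + 1;
                }
            }
            return pairCounts;                               // frequent pairs hint at prefetching candidates
        }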
Web usage, together with web content and web structure, helps with web page ranking. Hence the quality of search improves, because search becomes contextualized and personalized.
Having talked about the VIPS algorithm and structural relationships, it is worth noting that we extract page-to-block and block-to-page relationships to construct a graph model consisting of a page graph and a block graph. Based on this graph model, new link analysis algorithms can be developed that are capable of discovering the intrinsic semantic structure of the web. The block-to-page (link structure) and page-to-block (page layout) relationships are used in block-level link analysis. The block-to-page relationship, captured as a matrix, is obtained from link analysis and gives a more accurate and robust representation of the link structure of the web. The page-to-block relationship is obtained from page layout analysis, which segments each page into blocks. We then construct the block graph and, from it, the image graph. This image graph better reflects the semantic relationships between the images.

Wednesday, June 26, 2013

Mining the world wide web
The world wide web serves as a huge, widely distributed global information service center for a variety of markets such as finance, consumer goods, education, government, e-commerce, and many others. In addition to the content on web pages, web page access and usage information and hyperlinks provide additional sources for data mining. The challenges are: 1) the web is too huge for effective data warehousing and data mining; 2) web pages are complex, lack a unifying structure and vary widely; 3) the web is a highly dynamic information source; 4) the web serves a broad diversity of user communities; and 5) the bulk of the information is irrelevant to any given query.
Web search engines serve up resources on the internet; they usually index the web pages and store huge keyword-based indices. However, this approach has limitations. First, a topic can contain hundreds of thousands of documents, which can lead to an enormous number of entries returned by the search engine. Second, relevant documents may not be retrieved because they do not contain the query keywords. Since keyword-based search is not sufficient for web resource discovery, web mining has to address searching over web structures, ranking content, discovering patterns, and so on. This work is typically categorized as web content mining, web structure mining, and web usage mining. Web pages have a DOM tree structure and a layout that is often shared by many pages, so vision-based page segmentation is a common technique: the page layout is analyzed and the blocks and the separators between them are extracted, so that the web page can be represented as a semantic tree.
Search engines also automatically identify authoritative web pages. This is done based on the collective endorsement of web pages by way of hyperlinks pointing from one page to another; these hyperlinks confer a notion of authority. In addition, a so-called hub provides a collection of links to authorities. Hub pages may not themselves be prominent, and few links may point to them, but they can be mined to find authoritative pages. The HITS (Hyperlink-Induced Topic Search) algorithm uses the query terms to collect a starting set of pages, also called the root set. Since many of these pages are presumably relevant, the root set is expanded to a base set by including, up to a threshold, additional pages that link to or from it. A weight propagation process is then iterated, and finally HITS outputs a short list of pages with large hub weights and a short list with large authority weights. To work around hub pages that cover many topics, the weights of the links can be adjusted and the hub pages broken down into smaller units. Google's PageRank uses a similar link-based idea. Other newer algorithms include block-level link analysis and the block-to-page relationship. Next we will look at mining multimedia data on the web.
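A minimal sketch of the HITS weight-propagation loop, assuming the base set's link structure has already been gathered into an adjacency list keyed by page id; the method names are illustrative rather than taken from any library.

        // (uses System, System.Collections.Generic and System.Linq)
        // Iteratively: authority(p) = sum of hub(q) over pages q linking to p;
        //              hub(p)       = sum of authority(r) over pages r that p links to.
        static (Dictionary<string, double> hubs, Dictionary<string, double> authorities)
            Hits(Dictionary<string, List<string>> links, int iterations = 20)
        {
            var pages = links.Keys.Union(links.Values.SelectMany(v => v)).ToList();
            var hub = pages.ToDictionary(p => p, p => 1.0);
            var auth = pages.ToDictionary(p => p, p => 1.0);

            for (int it = 0; it < iterations; it++)
            {
                var newAuth = pages.ToDictionary(p => p, p => 0.0);
                foreach (var kv in links)
                    foreach (var target in kv.Value)
                        newAuth[target] += hub[kv.Key];          // endorsed by the hubs pointing at it

                var newHub = pages.ToDictionary(p => p, p => 0.0);
                foreach (var kv in links)
                    foreach (var target in kv.Value)
                        newHub[kv.Key] += newAuth[target];       // good hubs point at good authorities

                Normalize(newAuth); Normalize(newHub);           // keep the weights bounded
                auth = newAuth; hub = newHub;
            }
            return (hub, auth);
        }

        static void Normalize(Dictionary<string, double> w)
        {
            double norm = Math.Sqrt(w.Values.Sum(v => v * v));
            if (norm == 0) return;
            foreach (var key in w.Keys.ToList()) w[key] /= norm;
        }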