Saturday, January 31, 2015

Today we continue our discussion on the detector distribution algorithm. If there are N intelligent disks, they are clustered into k categories as follows:
Step 1: Initialize each of the N disks as its own cluster, with each cluster center pointing to the disk itself, and set K = N.
Step 2: Stop if the number of clusters K has reached the target k.
Step 3: Merge the two clusters with the shortest distance into a new cluster and adjust its center.
Step 4: Decrement K and repeat from Step 2 with the new cluster in place.
The distance between clusters is derived from the similarity of the disks (the fraction of shared file descriptors, discussed in the previous post). If the top layer holds num detectors and the disks are clustered into k categories, num/k detectors are randomly selected and stored in the top access control module; the rest are stored in the lower access control modules.
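To make the clustering step concrete, here is a minimal C# sketch of the agglomerative merge loop, written to match the steps above rather than the paper's code. It assumes each disk is represented by the set of file descriptor ids it stores, that the distance between clusters is one minus the fraction of shared descriptors, and that the cluster "center" is approximated by the union of descriptors.

using System;
using System.Collections.Generic;
using System.Linq;

// A minimal sketch of the agglomerative clustering step (not the paper's code).
class DiskClustering
{
    // Distance = 1 - (shared descriptors / total descriptors); an assumption of this sketch.
    static double Distance(HashSet<int> a, HashSet<int> b)
    {
        int common = a.Intersect(b).Count();
        int total = a.Union(b).Count();
        return total == 0 ? 1.0 : 1.0 - (double)common / total;
    }

    // Merge the closest pair of clusters until only k remain.
    static List<HashSet<int>> Cluster(List<HashSet<int>> disks, int k)
    {
        var clusters = disks.Select(d => new HashSet<int>(d)).ToList();
        while (clusters.Count > k && clusters.Count > 1)
        {
            int bi = 0, bj = 1;
            double best = double.MaxValue;
            for (int i = 0; i < clusters.Count; i++)
                for (int j = i + 1; j < clusters.Count; j++)
                {
                    double d = Distance(clusters[i], clusters[j]);
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            // The merged cluster's "center" is approximated by the union of descriptors.
            clusters[bi].UnionWith(clusters[bj]);
            clusters.RemoveAt(bj);
        }
        return clusters;
    }

    static void Main()
    {
        // Three toy disks described by file descriptor ids; cluster them into k = 2 categories.
        var disks = new List<HashSet<int>>
        {
            new HashSet<int> { 1, 2, 3 },
            new HashSet<int> { 2, 3, 4 },
            new HashSet<int> { 7, 8 }
        };
        Console.WriteLine(Cluster(disks, 2).Count);   // 2
    }
}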
Let us take a closer look at the access control function distribution. This function is not computed by the metadata server alone but also by the intelligent disks, thereby avoiding a single point of failure and a performance bottleneck. However, the number type detectors are generated only by the metadata server, which helps with the accuracy of the access request inspection. The detectors are distributed between the top and lower access control modules, and the two layers can inspect access requests in parallel. The lower layer stores a smaller number of detectors so that it does not get in the way of I/O. During detector generation, all substrings in the legal access requests are extracted once and each is converted to a single integer value. The numerical values are then indexed with a B-Tree. This avoids converting the same substring from different sources (access requests) repeatedly and improves detector selection. The numerical interval found this way avoids mistakenly judging a legal access request as an abnormal one: anything that lies outside this numerical range is neither a detector nor a valid access request.
The move from comparing binary strings to comparing numerical values reduces the inspection overhead considerably. Further, with clustering, the detectors are distributed in a way that improves the accuracy of the inspection without adding overhead.

Friday, January 30, 2015

Today we continue our discussion on two layered access control in Storage Area Networks. We mentioned the matching rule used in mature detector selection and access request inspection. The substrings in the initial detector and the legal access request are compared bit by bit, and this accounts for most of the cost in terms of time and space. Instead, the substring in the detector can be extracted and converted to a single integer value. A matching threshold, say r, is defined for the length of the substring. This is then used in the two main processes of access request inspection: analyzing the legal access request and selecting the integer value for the number type detector. The extraction and conversion to a single integer value is done with a formula that takes the location of the substring relative to the left end of the binary string, enumerates the possible choices for the length of the substring, and then, for each remaining position up to the length of the binary string, counts the choices obtained by setting the current bit and the permutations possible with the rest, accumulating this over a length equal to that of the substring. This calculation of permutations yields a unique integer index. All these integer values lie in the range from 0 to the number of permutations possible with the remainder of the binary string beyond r. This is a one-dimensional bounded interval and can be efficiently indexed with a B-Tree. The B-Tree is searched for an entry that does not coincide with any existing number type detector. We now review the detector distribution algorithm. It was designed keeping in mind the discordance between the metadata server and the intelligent disks in processing capacity and function, even though they may co-operate on a single access request. The negative selection algorithm was improved based on this relation between the metadata server and the intelligent disks. The processing capacity of the metadata server is strong, so it can absorb the overhead of access control with less loss of I/O performance; the processing capacity of an intelligent disk is poor, so the same overhead would cause a large loss of I/O performance. The strategy used was to divide a file into several segments stored on different intelligent disks, so that any one lower access control module on these disks can complete the access control function for the file. Using the file segmentation information, a shortest distance algorithm is used to cluster the intelligent disks.
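The exact permutation-counting formula is in the paper; the sketch below only illustrates the overall idea of number type detectors, under two stated assumptions: each r-bit substring is folded together with its position into one integer (a stand-in for the paper's index, not its formula), and a SortedSet plays the role of the B-Tree. It also assumes the classical negative selection rule, i.e. a candidate becomes a mature detector only if none of its encoded substrings coincides with a value derived from the legal access requests.

using System;
using System.Collections.Generic;

// A hedged sketch of the "number type detector" idea, not the paper's exact formula.
class NumberTypeDetectors
{
    // Encode the substring starting at 'pos' of length r as one integer. The position is
    // folded into the high bits so that equal substrings at different offsets get
    // distinct indices (an assumption of this sketch).
    static long Encode(string bits, int pos, int r)
    {
        long value = 0;
        for (int i = 0; i < r; i++)
            value = (value << 1) | (bits[pos + i] == '1' ? 1L : 0L);
        return ((long)pos << r) | value;
    }

    // Build the index of legal values from the legal access requests
    // (SortedSet stands in for the B-Tree described in the paper).
    static SortedSet<long> BuildLegalIndex(IEnumerable<string> legalRequests, int r)
    {
        var index = new SortedSet<long>();
        foreach (var req in legalRequests)
            for (int pos = 0; pos + r <= req.Length; pos++)
                index.Add(Encode(req, pos, r));
        return index;
    }

    // A candidate detector is kept only if none of its encoded substrings matches a
    // legal value, i.e. negative selection performed over integers instead of bits.
    static bool IsMatureDetector(string candidate, SortedSet<long> legalIndex, int r)
    {
        for (int pos = 0; pos + r <= candidate.Length; pos++)
            if (legalIndex.Contains(Encode(candidate, pos, r)))
                return false;
        return true;
    }

    static void Main()
    {
        var legal = new[] { "101100", "110011" };
        var index = BuildLegalIndex(legal, 3);
        Console.WriteLine(IsMatureDetector("000000", index, 3));  // True: no 3-bit window coincides
        Console.WriteLine(IsMatureDetector("101100", index, 3));  // False: identical to a legal request
    }
}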
#codingexercise
Double GetAlternateEvenNumberRangeSumProductpower(Double[] A, int n)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSumProductPower(n);
}
One more
#codingexercise

Double GetAlternateEvenNumberRangeSqrtSumProductpower(Double[] A, int n)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSqRtSumProductPower(n);
}

Now back to the discussion we were having on the detector distribution algorithm. The majority of the detectors reside in the metadata server and the remainder in the access control modules of the intelligent disks. A data segmentation strategy is used in the storage area network: one file is divided into several segments and stored on different disks, so the access control module of any one of those disks can make the access control decision for the file. Since disks share the same file descriptors for common files, they can be clustered based on a similarity measure defined as the fraction of common file descriptors over the total file descriptors. If the top layer access control module stores num detectors, then assuming we cluster into k categories, we select num/k detectors and store them in the top layer. The rest are distributed to the lower access control modules of the corresponding categories.
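A small illustrative sketch of this distribution rule follows. It is not the paper's code; it assumes the num/k split is applied per category and uses a simple shuffle for the random selection.

using System;
using System.Collections.Generic;
using System.Linq;

// Split each category's detectors: num/k to the top-layer module, the rest to the
// lower-layer module of the corresponding category (an illustration only).
class DetectorDistribution
{
    static (List<string> top, Dictionary<int, List<string>> lower) Distribute(
        Dictionary<int, List<string>> detectorsByCategory, int num)
    {
        var rng = new Random();
        int k = detectorsByCategory.Count;
        int perCategory = Math.Max(1, num / k);   // at least one per category, an assumption
        var top = new List<string>();
        var lower = new Dictionary<int, List<string>>();

        foreach (var kv in detectorsByCategory)
        {
            // Shuffle the category's detectors and split them.
            var shuffled = kv.Value.OrderBy(_ => rng.Next()).ToList();
            top.AddRange(shuffled.Take(perCategory));
            lower[kv.Key] = shuffled.Skip(perCategory).ToList();
        }
        return (top, lower);
    }

    static void Main()
    {
        var byCategory = new Dictionary<int, List<string>>
        {
            { 0, new List<string> { "d1", "d2", "d3" } },
            { 1, new List<string> { "d4", "d5" } }
        };
        var (top, lower) = Distribute(byCategory, 2);
        Console.WriteLine($"top: {top.Count}, lower categories: {lower.Count}");  // top: 2, lower categories: 2
    }
}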

Thursday, January 29, 2015

Today we continue to discuss the paper Two layered access control for Storage Area Network by Tao, DeJiao, ShiGuang. The paper describes access control in the context of an artificial immune algorithm that can efficiently detect abnormal access. The two layers consist of a central metadata server, which forms the top layer, and the intelligent disks, numbering 1 to n, which form the lower layer. Detectors are used in both layers to receive, analyze and inspect the access request. The inspection is the new addition to the tasks of the metadata server or intelligent disk, and it intercepts the access request prior to the execution of the command and the return of the data. The immune algorithm first performs negative selection to generate detectors for access request inspection. If a detector matches a request, the request is considered abnormal and denied. The algorithm also decides the generation and distribution of detectors to intercept access requests. We now look at the detector generation algorithm. It uses the notions of an antigen and a detector, both represented by binary strings, where the latter represents a space vector. All non-repeating binary strings are generated as initial detectors. The initial detectors that did not match any legal access request were selected as mature detectors. There is more than one detector generation algorithm, namely the enumeration generation algorithm, the linear generation algorithm, and the greedy generation algorithm. In these algorithms, the initial detectors are enumerated and the mature detectors are randomly selected, which incurs large time and space overhead. For selecting mature detectors and for inspecting access requests, matching is done based on a set of matching rules: r-contiguous matching rules, r-chunk matching rules and the Hamming distance matching rule. Matching involves comparing binary substrings of up to r bits between the detectors and the legal access request. All substrings of more than r bits in the legal access requests were traversed, and there was no index for them. This study used the Hamming distance matching rule. The binary string matching could be improved: the number type detector and a matching threshold of r bits are defined for the length of the substring, and the substring in the detector is converted to a single integer value. Access request inspection then involves analyzing the legal access request and selecting the integer value for the number type detector.
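For reference, here is a short sketch of the bit-by-bit Hamming matching that the number type detectors are meant to replace. The threshold convention (match when the Hamming distance is at most r) and the sample strings are assumptions of this sketch, not taken from the paper.

using System;

// A sketch of the bit-by-bit Hamming matching rule.
class HammingMatching
{
    // A detector matches an access request if their Hamming distance
    // (number of differing bits) is at most the threshold r.
    static bool Matches(string detector, string request, int r)
    {
        if (detector.Length != request.Length)
            throw new ArgumentException("strings must have equal length");
        int distance = 0;
        for (int i = 0; i < detector.Length; i++)
            if (detector[i] != request[i]) distance++;
        return distance <= r;
    }

    static void Main()
    {
        // With r = 2, a request differing from the detector in one bit is matched
        // (and would be denied by the access control module).
        Console.WriteLine(Matches("101100", "101110", 2));  // True
    }
}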
#codingexercise
Double GetAlternateEvenNumberRangePowerRtProductpower(Double[] A, int n, int m)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangecubePowerRtProductPower(n, m);
}

Wednesday, January 28, 2015

Today we continue our discussion on access control. We were reviewing RBAC models, specifically the IRBAC2000 model. We now review the MDRBAC model. We noticed that it was an improvement over IRBAC2000 because it introduced the notion of time restricted mappings and minimized the role mapping restrictions. Still, it suffered from the problem that each participant of the grid now had to implement this model.
We next look at the CBAC model. This is also a mutual operation model; it is based on the premise that the associated multi-domain can be dynamically defined. The dynamic relationship is called an alliance, and information can be exchanged over an alliance relationship. The shortcoming of this model is that the authorization is not clear, because the dynamic relationship only helps with the role mapping.
These are some of the models that were presented in the paper.
We next look at two layered access control for Storage Area Networks as written by Tao et al. This is a slightly different topic from the access control models we have been discussing, but it is informative to look at access control in storage area networking. First, access control is relevant in storage area networking, and second, there is an immunity algorithm in this case. However, it incurs a large space and time overhead, which has performance implications for large I/O. The structure of the two layered access control is as already described: the top layer maintains the metadata and the lower layer maintains the disks. The distribution strategy for two layer access control is presented. The top layer generates all the detectors and preserves a majority of them, while the lower layer maintains a small number of detectors. A network access request is inspected with the help of the top layer access control module. The problem of protecting the storage area network contains several parts, such as data and communication encryption, certification and access control. To block illegal requests and pass valid requests are the two main functions. Numerical detectors are used, and their indices are found using a B-Tree. The detectors are used to inspect the access request; if a detector matches the access request, the control module denies the request. The distribution of the detectors is the main concern here.
#codingexercise
Double GetAlternateEvenNumberRangecubeRtProductpower(Double[] A, int n)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangecubeRtProductPower(n);
}

Tuesday, January 27, 2015

#codingexercise
Double GetAlternateEvenNumberRangecubeRtProductSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangecubeRtProductSquares();
}
Today we continue the discussion on Access Control. We alluded to how RBAC does not adequately address a grid computing environment. We now look at a few specific models of RBAC.
IRBAC2000 was proposed by Kapadia. It is called a role-based multi domain mutual operation model. By the dynamic mapping of roles among domains, it can solve the mutual operation between two reciprocal domains to some extent. Roles are mapped to the corresponding local system definitions. The mapping is dynamic, and this provides flexibility. The shortcoming is that once the mapping is done, access cannot be restricted further because that could violate mutually exclusive roles. For example, if an external access got mapped to one role and we then added another, mutually exclusive role to the same external access, the system would be confused.
The MDRBAC model was introduced to solve this specific problem. It introduces notions such as the domain agency, role mapping with a time attribute, and minimized role mapping restrictions, and it is applied in access control among reciprocal domains.
The time attribute restricts the duration for which a mapping is valid, so that mappings can be renewed, replaced or refreshed.
#codingexercise
Double GetAlternateEvenNumberRangecubeRtProduct(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangecubeRtProduct();
}

Monday, January 26, 2015

Today we continue discussing the RBAC model. We discussed that it could implement both DAC and MAC. It is based on the premise of roles, so user migrations are easier to handle. Further, a user has to be assigned a role, the role has to be active and authorized, and the permission for the object must also be authorized.
RBAC supports three security principles: the minimum authority principle, the responsibility separation principle, and the data abstraction principle. The minimum authority principle means that the system assigns only the minimum running authority to a role. The responsibility separation principle means that mutually exclusive roles must jointly participate to complete a sensitive task, so that no single user can complete it alone. The data abstraction principle means that authority is abstracted, so that permissions are organizational operations rather than explicit low-level operations such as read, write, create and delete.
RBAC became popular in the enterprise world for its ease of deployment and the controls it gave. For example, the user, role, access authority, role class, mutual exclusion, and restriction of roles simplified deployment and management. RBAC provides flexibility, convenience and security.
Access control in the grid computing environment differs from enterprise access control in that there is no longer a centralized entity that can support a unified and central access control mechanism. Grid computing might involve peer to peer networks or other distributed technologies where a decentralized multi domain management mode is better suited. Therefore the access control strategy should be studied based on the traditional access control models.
#codingexercise
Double GetAlternateEvenNumberRangeSqRtProductSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSqRtProductSquares();
}

Sunday, January 25, 2015

We continue our reading of the Study of Access Control Model in Information Security by Qing-Hai, Ying et al. Today we review the discussion of Role Based Access Control (RBAC). DAC and MAC were not flexible enough to handle business changes such as adding, canceling, and merging departments, employee promotions, duty changes, etc. Instead, role based access control was favored. In RBAC, authorization is mapped to roles, and a user can take different roles. This effectively handles changes in the organization. Since users are not assigned rights directly but only acquire them through roles, management of individual user rights becomes a matter of assigning appropriate roles to user accounts. The roles are classified based on the set of stable duties and responsibilities in the management. There are three primary rules for RBAC.
Role assignment - A subject can exercise a permission only if the subject has been selected or assigned a role.
Role authorization - A subject's active role must be authorized for the subject. i.e User cannot take any or all roles.
Permission authorization - A subject can exercise a permission only if the permission is authorized for the subject's active role. i.e the user can exercise only those permissions assigned to the role.
Roles can be hierarchical in which a higher level role assumes all that comes with the lower level role.
With a hierarchical role and constraints, an RBAC can be controlled to simulate a Lattice Based Access Control. Thus RBAC can also be used to implement DAC and MAC.
A constraint places a restrictive rule on the potential inheritance of permissions from opposing roles.
RBAC is not the same as ACLs. RBAC differs from access control lists in that RBAC assigns permissions to specific operations with meaning in the organization, rather than to low level data objects. An ACL may control whether a file can be read or written, but it cannot say how the file can be changed. RBAC lends more meaning to the operations in the organization. It can be used to achieve separation of duties (SoD), which ensures that two or more people must be involved in authorizing critical operations. SoD is used so that no individual can effect a breach of security through dual privilege.
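As an illustration of these rules, here is a small RBAC sketch. The users, roles and permission names are invented for the example; the point is that a permission check passes only through an assigned, active and authorized role, and that permissions are organization-level operations rather than raw reads and writes.

using System;
using System.Collections.Generic;

// An illustrative RBAC sketch (names and structures are assumptions, not from the paper).
class RbacSketch
{
    static readonly Dictionary<string, HashSet<string>> UserRoles = new Dictionary<string, HashSet<string>>
    {
        { "alice", new HashSet<string> { "Accountant" } }
    };

    static readonly Dictionary<string, HashSet<string>> RolePermissions = new Dictionary<string, HashSet<string>>
    {
        { "Accountant", new HashSet<string> { "PostInvoice", "ViewLedger" } },
        { "Auditor",    new HashSet<string> { "ViewLedger" } }
    };

    // Permission check: the user must hold an active role that is authorized for the permission.
    static bool CanPerform(string user, string activeRole, string permission)
    {
        return UserRoles.TryGetValue(user, out var roles)
            && roles.Contains(activeRole)                        // role assignment
            && RolePermissions.TryGetValue(activeRole, out var perms)
            && perms.Contains(permission);                       // permission authorization
    }

    static void Main()
    {
        Console.WriteLine(CanPerform("alice", "Accountant", "PostInvoice")); // True
        Console.WriteLine(CanPerform("alice", "Auditor", "ViewLedger"));     // False: role not assigned
    }
}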

Saturday, January 24, 2015

After reviewing the DAC data structures, today we read about Mandatory Access Control. Because DAC could not effectively control the information flow direction of the system, a lattice model was proposed and extended to form the MAC strategy. It can be considered stricter than DAC. Each user and resource is attributed a security class. A user cannot change his or her security class; only the admin can. Access is granted based on the security class of the user and the security class of the resource: the user's security level must be greater than or equal to that of the object. The security classes are top secret (TS), secret (S), confidential (C), and restricted (U), progressively less strict in that order. When two objects are to be accessed by a user, the objects are combined to form a third object whose security level is set to the join of the levels of the two individual objects, hence the use of a lattice. A lattice is a partially ordered set in which every two elements have a unique least upper bound (join) as well as a unique greatest lower bound (meet). An example is the natural numbers ordered by divisibility, where the join is the least common multiple and the meet is the greatest common divisor. Every non-empty finite subset of a lattice has a join and a meet. Using the commutative, associative, absorption and idempotent laws, a lattice can be used to combine two elements. A bounded lattice is one that has a least element 0 and a greatest element 1. By constraining the information flow to be acyclic and single-direction, the model attempts to prevent security holes arising from conflicting or overly permissive policy. But the lattice model is a coarse security control scheme: it lacks flexibility and is not convenient in practice, and the deployment and documentation of the security classes can be confusing. It is generally applied in scenarios where the security rules are definite and the flow is relatively fixed.
For example, in the military, the Bell-LaPadula model is often used. It is based on several properties. The star property, also called the confinement property, states that an untrustworthy user can only write to objects whose security level is greater than or equal to their own. This prevents information leakage in which someone with a high clearance writes data down to a level where less cleared users could read it. The model includes what is known as the simple property, where a user can read data only if the data's security level is as sensitive as or less sensitive than the user's clearance level. It also includes a tranquility property, which states that the security level of an object cannot be changed while it is in use by the computer system.
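A minimal sketch of these two properties follows, using a totally ordered set of levels as a simplified stand-in for the lattice; the level names follow the classes listed above, and everything else is an assumption of the sketch.

using System;

enum Level { Restricted = 0, Confidential = 1, Secret = 2, TopSecret = 3 }

// A hedged sketch of the Bell-LaPadula read/write rules over totally ordered levels.
class BellLaPadula
{
    // Simple property ("no read up"): the subject's clearance must dominate the object's level.
    static bool CanRead(Level clearance, Level objectLevel) => clearance >= objectLevel;

    // Star property ("no write down"): the object's level must dominate the subject's clearance.
    static bool CanWrite(Level clearance, Level objectLevel) => objectLevel >= clearance;

    // Combining two objects: the result is classified at the join (least upper bound) of the two.
    static Level Combine(Level a, Level b) => a >= b ? a : b;

    static void Main()
    {
        Console.WriteLine(CanRead(Level.Secret, Level.Confidential));   // True
        Console.WriteLine(CanWrite(Level.Secret, Level.Confidential));  // False: would leak downward
        Console.WriteLine(Combine(Level.Confidential, Level.Secret));   // Secret
    }
}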
#codingexercise
Double GetAlternateEvenNumberRangeProductCubes(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeProductCubes();
}

Friday, January 23, 2015

Today we discuss the paper on the Study of the Access Control Model in Information Security by Qing-hai, Ying et al. This paper compares and contrasts different access control mechanisms, models and instruments, specifically access control lists, access control capabilities lists, the mandatory access control policy, the role-based control model and access control in Grid environments. They discuss these in the context of network security. Access control here is about the principals involved, their access, and the permissions associated with a resource. The paper talks about three different models: discretionary access control (DAC), mandatory access control (MAC) and the role based access control model (RBAC). DAC permits legal users to access the regulated objects by the identity of the user or user group. In addition, users may delegate their authority to other users in a discretionary manner. This used to exist on all flavors of UNIX, NT/Server etc.: the system identifies the user and then limits access to the resources that the user may access, and the resources a user can access can be changed by any member of a privileged user group. DAC is implemented using an access control matrix, an access control list, or an access control capabilities list. The access control matrix is a two-dimensional matrix of principals such as users, programs and user-agents versus resources such as documents and services. The cells are filled with the authorization permissions. This matrix is very flexible for implementing DAC, but it suffers from the downsides that it cannot easily be transmitted and it may affect performance if its size is too big; space and speed degrade as the matrix grows. An access control list is a linked list of permissions for each principal against a resource. This is probably the most prevalent mechanism, and it is simple, convenient and practical. An access control capabilities list is also a linked list, but it subjoins the users list with the objects list so that, for a given user, its ACCL describes the objects the user has permissions to. Note that an ACL is about an object and the users that have access to it, while an ACCL determines the capabilities of a user. Capabilities can be transferred. The capabilities list is generally considered insecure because by transferring capabilities the resource is not consulted, and this may lead to unauthorized access to a resource. Both lists suffer from the problem that they can grow arbitrarily large depending on the number of users and resources.
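To make the contrast concrete, here is a small sketch of the two list structures with invented objects and users: the ACL is indexed by object and the capabilities list by user, so together they carry the same information as the access control matrix sliced two different ways.

using System;
using System.Collections.Generic;

// An illustrative sketch of the DAC structures described above (names are assumptions).
class DacStructures
{
    // Access control list: object -> (user -> permissions).
    static readonly Dictionary<string, Dictionary<string, HashSet<string>>> Acl =
        new Dictionary<string, Dictionary<string, HashSet<string>>>
        {
            { "report.doc", new Dictionary<string, HashSet<string>>
                { { "alice", new HashSet<string> { "read", "write" } },
                  { "bob",   new HashSet<string> { "read" } } } }
        };

    // Capabilities list: user -> (object -> permissions).
    static readonly Dictionary<string, Dictionary<string, HashSet<string>>> Capabilities =
        new Dictionary<string, Dictionary<string, HashSet<string>>>
        {
            { "bob", new Dictionary<string, HashSet<string>>
                { { "report.doc", new HashSet<string> { "read" } } } }
        };

    static bool AclAllows(string obj, string user, string op) =>
        Acl.TryGetValue(obj, out var entries)
        && entries.TryGetValue(user, out var perms)
        && perms.Contains(op);

    static bool CapabilityAllows(string user, string obj, string op) =>
        Capabilities.TryGetValue(user, out var objects)
        && objects.TryGetValue(obj, out var perms)
        && perms.Contains(op);

    static void Main()
    {
        Console.WriteLine(AclAllows("report.doc", "bob", "write"));       // False
        Console.WriteLine(CapabilityAllows("bob", "report.doc", "read")); // True
    }
}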

Thursday, January 22, 2015

We were discussing the Hearst paper on Text Data Mining. In the section on the LINDI project, we talked about a facility for users to build and reuse sequences of query operations. In the gene example, this lets the user specify a sequence of operations for one co-expressed gene and then iterate this sequence over a list of other co-expressed genes. The interface allows the following operations: iteration, transformation, ranking, selection and reduction. The ability to record and modify sequences of actions is important because there is no predetermined exploration strategy in this new area. When strategies are found to be successful, they can be considered for automation. Before that, if there are enough strategies, they can be used in an advisory or assistant mode. The emphasis of this system is to help automate the tedious parts of the text manipulation process and to integrate the underlying computationally driven text analysis with human guided decision making.
To summarize, large online text collections can be used to discover new facts and trends.
#codingexercise
Double GetAlternateEvenNumberRangeProductSqRtCubes(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeProductSqRtCubes();
}
Tomorrow we discuss Two-layered Access control for Storage Area Network.

Wednesday, January 21, 2015

#codingexercise
Double GetAlternateOddNumberRangeProductSqRtCubes(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeProductSqRtCubes();
}
Today we continue our discussion of Hearst's paper on text data mining. We were discussing the LINDI project. The objectives of the LINDI project are to investigate how researchers can use large text collections in the discovery of new important information, and to build software systems to support this process. The main tools for discovering new information are of two types: 1) support for issuing sequences of queries and related operations across text collections, and 2) tightly coupled statistical and visualization tools for examining associations among concepts that co-occur within the retrieved documents. Both sets of tools make use of attributes associated with text collections and their metadata, which is why integration between the tools is recommended.
User and system interact iteratively. System proposes new hypotheses and strategies for investigating these hypotheses, and the user either uses or ignores these suggestions and decides on the next move.
In this project, newly sequenced genes are discovered by automation. Human genome researchers perform experiments in which they analyze co-expression of many new and known genes simultaneously. Given this, the goal is to determine which of the new ones are interesting. The strategy is to explore biomedical literature and come up with hypotheses about which genes are of interest.
In tasks like this, the user has to execute and keep track of tactical moves, and repeating them often distracts from the reasoning. The project provides a facility for users to build and reuse sequences of query operations via a drag and drop interface, allowing them to repeat the same sequence of actions for different queries.
The operations include (a sketch follows the list):
Iteration of an operation over the items in a set.
Applying an operation and returning a transformed item.
Applying an operation and returning an ordered set of items.
Applying an operation and returning a selected set of items.
Applying an operation and returning a singleton result.
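A hedged sketch of how such a recorded sequence might be composed and iterated over a list of genes is shown below; the delegates and the toy documents are assumptions of the sketch, not LINDI's actual interface.

using System;
using System.Collections.Generic;
using System.Linq;

// Model the operation types as reusable delegates over sets of items, so that one
// recorded sequence can be replayed (iterated) for each co-expressed gene.
class QuerySequence
{
    static IEnumerable<string> RunForGene(string gene, IEnumerable<string> documents)
    {
        // Transformation: item -> item; Selection: set -> subset; Ranking: set -> ordered set.
        Func<string, string> transform = doc => doc.ToLowerInvariant();
        Func<IEnumerable<string>, IEnumerable<string>> filter = docs => docs.Where(d => d.Contains(gene.ToLowerInvariant()));
        Func<IEnumerable<string>, IEnumerable<string>> rank = docs => docs.OrderByDescending(d => d.Length);
        return rank(filter(documents.Select(transform)));
    }

    static void Main()
    {
        var documents = new List<string> { "GeneX regulates pathway A", "Unrelated abstract", "GeneX and GeneY co-expression" };
        // Iteration: replay the same recorded sequence over a list of co-expressed genes.
        foreach (var gene in new[] { "GeneX", "GeneY" })
            Console.WriteLine($"{gene}: {RunForGene(gene, documents).Count()} hits");
    }
}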

Tuesday, January 20, 2015

Today we continue our discussion on text data mining with another example of exploratory data analysis: using text to uncover social impact. Narin et al studied the effects of publicly financed research on industrial advances. After years of preliminary studies and building special purpose tools, the authors found that the technology industry relies more heavily than ever on government sponsored research results. The relationship between patent text and published research literature was explored using the following procedure. The text from the front pages of the patents issued in two consecutive years was analyzed, and from the hundreds of thousands of references, those published in the preceding 11 years were retained. These references were linked to known journals and authors' addresses. Redundant citations were eliminated and articles with no American author were discarded, leaving a core collection of 45,000 papers. For these papers, the source of funding was found from the closing lines of the research paper, revealing an extensive reliance on publicly financed science.
Patents awarded to the industry were further filtered by excluding those awarded to schools and governments. From these, the study examined the peak year of literature references and found 5,217 citations to science papers, 73.3% of which were written at public institutions.
The example above illustrates the number of steps needed to perform this kind of exploratory data analysis. Most of these steps at that time had to rely on data that was not available online and therefore had to be done by hand. Together with the earlier hypothesis-forming example, this shows that exploratory data analysis uses text in a direct manner.
We next discuss the LINDI project to investigate how researchers can use large text collections in the discovery of new important information, and to build software systems to help support the process.
There are two ways to discover new information. Sequences of queries and related operations are issued across text collections, and concepts that co-occur within the retrieved documents are statistically and visually examined for associations.
There are tools for both, and these make use of attributes associated especially with text collections and their metadata. The steps in the second example should be tightly integrated with the analysis and integration tools needed in the first example.
#codingexercise
Double GetAlternateOddNumberRangeProductCubes(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeProductCubes();
}

Monday, January 19, 2015

#codingexercise
Double GetAlternateEvenNumberRangeProduct(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeProduct();
}
Today we will continue to discuss Hearst's paper on untangling text data mining.
We had cited an example from the previous paper about using text to form hypotheses about disease. Today we look at this example straight from one of the sources. The idea that new hypotheses can be drawn from text has been alluring for a long time but remains virtually untapped. Experts can only read a small subset of what is published in their fields and are often unaware of developments in related fields. Thus it should be possible to find useful linkages between information in related literatures, even though the authors of those literatures rarely refer to one another's work.
The example of tying migraine headaches to magnesium deficiency comes from Swanson and Smalheiser's efforts. Swanson extracted various titles of articles in the biomedical literature and paraphrased them as we had seen earlier:
stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression is implicated in some migraines
high levels of magnesium inhibit spreading cortical depression
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability

This led to the hypothesis and its confirmation via experimental means. This approach has been only partially automated. By that Hearst means that there are many more possibilities, as hinted at by the combinatorial explosion of candidate linkages. Beeferman explored certain links via lexical relations using WordNet. However, sophisticated new algorithms are needed to help in the pruning process, and such a process needs to take into account various kinds of semantic constraints. This therefore falls in the domain of computational linguistics.

#codingexercise
Double GetAlternateOddNumberRangeProduct(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeProduct();
}

Sunday, January 18, 2015

Today we continue to read from Hearst: Untangling Text Data Mining. (TDM)
We saw the mention that TDM can yield tools that indirectly aid in the information access process. Aside from providing these tools, Hearst mentions that it can also provide tools for exploratory data analysis. Hearst compares TDM with computational linguistics: empirical computational linguistics computes statistics over large text collections in order to discover useful patterns, and these patterns then inform the algorithms for various sub problems within natural language processing. For example, words such as "prices", "prescription" and "patent" are highly likely to co-occur with the medical sense of "drug", while "abuse", "paraphernalia" and "illicit" are likely to co-occur with the illegal-use sense of the word. This kind of information can also be used to improve information retrieval algorithms. However, these are different from TDM. TDM can instead help with efforts to automatically augment existing lexical structures such as the WordNet relations mentioned by Fellbaum. Hearst also gives an example of identifying lexicosyntactic patterns that help with WordNet relations, and Manning gives an example of automatically acquiring sub-categorization data from large text corpora.
We now review TDM and category metadata. Hearst notes that text categorization is not TDM, arguing that text categorization reduces the document to a set of labels but does not add new data. However, an approach that compares distributions of category assignments within subsets of the document collection to find interesting or unexpected trends can be considered TDM. For example, the distribution of commodities in country C1 can be compared with those of C2, constituting an economic unit. Another effort that can be considered TDM is the DARPA Topic Detection and Tracking initiative, in which the system receives a stream of news stories in chronological order and marks yes or no on arrival for whether a story is the first reference to an event.
By citing these approaches, Hearst argues that TDM tells us something about the world outside the text collection.
TDM can be considered a process of exploratory data analysis that results in new and hitherto undiscovered information, or answers to questions for which the answer is not currently known. The effort here is to use text for discovery in a more direct manner than manual inferencing. Two examples are provided: the first is using text to form hypotheses about disease, and the second is using text to uncover social impact.
#codingexercise
Double GetAlternateEvenNumberRangeProductOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeProductOfSquares();
}
Double GetAlternateEvenNumberRangeSqRtProductOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSqRtProductOfSquares();
}

Saturday, January 17, 2015

#codingexercise
Double GetAlternateEvenNumberRangeSqRtSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSqRtSumOfSquares();
}
Today we read a paper from Hearst: Untangling Text Data Mining. Hearst reminds us that the possibilities of extracting information from text are virtually untapped, because text expresses a vast range of information but encodes it in a way that is difficult to decipher automatically. We recently reviewed the difference between Text Knowledge Mining and Text Data Mining; in this paper, the focus is on text data mining. The paper calls out some of the new problems encountered in computational linguistics and outlines ideas about how to pursue exploratory data analysis over text.
Hearst differentiates between TDM and information access. The goal of information access is to help users find documents that satisfy their information needs. The standard procedure is akin to looking for needles in a needlestack: the information the user wants coexists with many other valid pieces of information, and the difficulty is homing in on what the user is actually interested in. As per Hearst, the goal of data mining, on the other hand, is to derive new information from data, find patterns across datasets, and/or separate signal from noise. The fact that an information retrieval system can return a document containing the information the user requested implies that no new discovery is made.
Hearst points out that text data mining is sometimes discussed together with search on the web. For example, the KDD-97 panel on data mining stated that the two predominant challenges for data mining are finding useful information on the web and discovering knowledge about a domain that is represented by a collection of web documents, as well as analyzing the transactions run in a web based system. This search-centric view misses the point that the web can be considered a knowledge base from which new, never before encountered information can be extracted.
The results of certain types of text processing can yield tools that indirectly aid in the information access process. Examples include text clustering to create thematic overviews of text collection.
#codingexercise
Double GetAlternateEvenNumberRangeSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSumOfSquares();
}
 

Friday, January 16, 2015

#codingexercise
Double GetAlternateOddNumberRangeSqRtSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeSqRtSumOfSquares();
}

We continue our discussion on Text Knowledge Mining. We were discussing reasoning. Complexity is one of the important aspects of automatic inference in knowledge based systems, and systems may trade off expressiveness of knowledge representation against complexity of reasoning. While this is true for both TDM and TKM, TKM does not face as much difficulty. First, both TDM and TKM are hard problems, and there are exponential algorithms as well as many strategies for designing efficient mining algorithms. Second, TKM is not intended to be exhaustive, so it can limit its data sources.
Now let us look at how to assess the results. In data mining, there are two phases for evaluating the results. The first phase is one in which statistical significance of the results is assessed. The second phase is one in which there is subjective assessment for the novelty and usefulness of the patterns.
For TKM, we have the following:
The results are considered reliable if the texts themselves are valid and reliable and sound inference procedures are applied.
The non-triviality of the results is evaluated by the expert user, since results obtained deductively can easily turn out to be trivial.
The novelty of the results is evaluated against the background knowledge (BK).
The usefulness is evaluated by the experts.
The authors conclude that TKM is a particular case of knowledge mining. Knowledge mining deals with knowledge while data mining deals with data in which the knowledge is implicit. The operations are therefore deductive and abductive inference.

Thursday, January 15, 2015

We continue with our discussion on Text Knowledge Mining. We were discussing knowledge representation. Text mining requires translating text into a computationally manageable intermediate form. This step is crucial and poses several challenges to TDM and TKM, like obtaining intermediate forms, defining structures for information extraction, identifying semantic links between new concepts, relating different identifiers, etc. A key problem with obtaining intermediate forms is that the current methods require human interaction. The objective of an intermediate form is not to represent every possible aspect of the semantic content of a text, but only those aspects related to the kind of inference we are interested in. That is why, even if we are not able to fully analyze the text, the only impact is some missing pieces of knowledge, and TKM is not expected to be exhaustive in obtaining new knowledge. That said, if the analyzer obtains an inexact representation of the text in an intermediate form, this can affect the knowledge discovered, even so far as reporting false discoveries. The knowledge in a knowledge based system is assumed to be reliable and consistent, but the same cannot be said for a collection of texts. This poses another challenge for TKM.
We now look at the role of background knowledge. The value of including background knowledge in text mining applications is widely recognized. Similarly, in TKM, text does not contain the common sense knowledge and specific domain knowledge that are necessary in order to perform TKM. As an example, a text stating that A is the father of B and another stating that B is the father of A are not recognized as contradictory without background knowledge. This knowledge can contribute to TKM in the following ways. First, it allows us to create new knowledge that is not fully contained in the collection of texts but can be derived from a combination of the texts and background knowledge; this was a requirement of intelligent text mining. Another interesting application of background knowledge is in the assessment of the discovered knowledge, in aspects like novelty and importance.
Reasoning and complexity are also important aspects of automatic inference in knowledge based systems.
#codingexercise

Double GetAlternateOddNumberRangeSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeVarianceSumOfSquares();
}

Wednesday, January 14, 2015

#codingexercise
Double GetAlternateOddNumberRangeStdDev(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeStdDev();
}

We continue our discussion on Text Knowledge Mining. We discussed the algorithm for finding contradictions. Looking for contradictions is very useful. It helps with the consistency of a text collection, or with assessing the validity of a new text to be incorporated in terms of the knowledge already contained in the collection. In addition, we can use it to group texts, such as when we take a collection of papers expressing opinions about topics and the opinions fall into different groups. Finally, it also lowers the overhead of reasoning with ontologies, because we can check consistency by way of non-contradiction, and this check can become a preliminary requirement for reasoning.
We also looked at the challenges of TKM. As in data mining, there are many existing techniques that can be applied, but they have to be adapted, and this may not always be easy. In addition, some areas require new research. Existing techniques can be drawn from areas like knowledge representation, reasoning algorithms for performing deductive and abductive inference, and knowledge based systems, as well as natural language processing.
There are several differences between knowledge based systems and TKM.
First, a knowledge based system is built to contain as much knowledge as possible for a specific purpose, whereas TKM treats the texts as reports, relying on no single one in particular except as a dedicated information collection.
Second, a knowledge based system tries to answer all questions, whereas TKM assumes no such complete set of knowledge pieces and instead looks for new hypotheses.
Third, a knowledge based system does reasoning as part of query processing, whereas TKM reasons to find new knowledge without a specified query, though it can also choose to use one.
We also look at knowledge representation. Text mining requires translating text into a computationally manageable intermediate form. This step of text mining is crucial and poses several challenges common to TDM and TKM. A key problem in obtaining intermediate forms for TKM is that the techniques currently used to translate texts into intermediate forms are mainly semi-automatic, involving human interaction. On the other hand, many domains are expressing knowledge directly in knowledge representation models. In the Semantic Web, for example, not only are ontologies used to represent knowledge, but efficient deductive inference techniques like graph based search are also available.

#codingexercise

Double GetAlternateOddNumberRangeVariance(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeVariance();
}


Tuesday, January 13, 2015

Today we continue discussing TKM techniques. Let us review the discovery of contradictions. Let res(ti) be a procedure performing resolution on the set of clauses obtained from ti, returning false in case it finds a contradiction. Let titj denote the concatenation of texts (that is, of their corresponding clauses) ti and tj. Let PT be the set of all possible concatenations of subsets of T of any size, and let V represent the set of concatenations that do not contain contradictions. The algorithm for discovering contradictions is then as follows. We first find the candidate set V1 of single texts that contain no contradiction and add it to V; this is our initialized set. In each of the remaining iterations, we find a candidate text t' and an already initialized concatenation t'' such that they come from disjoint sets and their combination does not contain a contradiction. We check all the subsets of texts containing the candidate to rule out contradictions and add the non-contradictory combinations to V by union. Finally, from PT, the set of all possible concatenations of subsets of T, and V, the set of those without contradictions, we can identify the concatenations that do contain contradictions.
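The following sketch captures this level-wise exploration. The resolution procedure res is abstracted behind a delegate supplied by the caller, since the clausal representation of the texts is outside the scope of this note; the sets returned are the consistent concatenations, and the contradictory subsets are exactly those pruned along the way.

using System;
using System.Collections.Generic;
using System.Linq;

// A sketch of level-wise contradiction discovery over texts indexed 0..n-1.
class ContradictionDiscovery
{
    // isConsistent receives a set of text indices and returns false if the union
    // of their clauses yields a contradiction under resolution.
    static List<HashSet<int>> FindConsistentSubsets(int n, Func<HashSet<int>, bool> isConsistent)
    {
        // Level 1: single texts that are not self-contradictory.
        var current = Enumerable.Range(0, n)
            .Select(i => new HashSet<int> { i })
            .Where(isConsistent)
            .ToList();
        var all = new List<HashSet<int>>(current);

        while (current.Count > 0)
        {
            var next = new List<HashSet<int>>();
            // Candidate generation as in frequent-itemset mining: extend each
            // consistent set with one more text and keep the non-contradictory ones.
            foreach (var subset in current)
                for (int t = subset.Max() + 1; t < n; t++)
                {
                    var candidate = new HashSet<int>(subset) { t };
                    if (isConsistent(candidate))
                        next.Add(candidate);
                }
            all.AddRange(next);
            current = next;
        }
        // 'all' is the set V of consistent concatenations; the subsets generating a
        // contradiction are those eliminated along the way.
        return all;
    }

    static void Main()
    {
        // Toy consistency check: texts 0 and 2 contradict each other, everything else is fine.
        Func<HashSet<int>, bool> ok = s => !(s.Contains(0) && s.Contains(2));
        Console.WriteLine(FindConsistentSubsets(3, ok).Count);  // {0},{1},{2},{0,1},{1,2} -> 5
    }
}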
We now look at the challenges of text knowledge mining. There is a good analogy with data mining: just as existing techniques worked for data mining, there are existing techniques that can work for TKM. These include techniques for knowledge representation, reasoning algorithms for performing deductive and abductive inference, and knowledge based systems, as well as natural language processing techniques. As in the case of data mining, these techniques must be adapted for TKM, and new techniques may be needed. We look at some of these challenges as well.
Knowledge based systems and TKM systems have very different objectives, and this affects the way techniques from knowledge representation and reasoning can be adapted to TKM. In a knowledge based system, a knowledge base is built containing the knowledge needed by the system to solve a specific problem. In TKM, the knowledge managed by the system is collected from texts, each of which is treated as a knowledge base in its own right (for example a historical or normative report), rather than as material for building a knowledge system. Another difference is that knowledge based systems are intended to give an answer to every possible question, or to solve any possible problem posed to the system, while in TKM new hypotheses and potentially useful knowledge are derived from a collection of texts. In fact, TKM may exclude or simply not know about a set of knowledge pieces.
#codingexercise
Double GetAlternateOddNumberRangeMode(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeMode();
}

Monday, January 12, 2015

Today we will continue our discussion on Text Knowledge Mining. We discussed a variety of techniques and how they differ from or relate to TKM. Some advanced data mining techniques that can be considered TKM are related to literature-based discovery. The idea behind literature-based discovery is this. Let A1, A2, ..., An, B1, B2, ..., and C be concepts, and suppose that we are interested in finding relations between C and some of the Ai. Suppose that no Ai and C appear simultaneously in the collection, but each can appear simultaneously with some of the Bi. If the set of shared Bi is large enough, then a transitive relation between Ai and C can be hypothesized. This can therefore be used to create new hypotheses that can later be confirmed by experimental studies.
In this technique, the identification of concepts to obtain intermediate forms of documents is a key point. The main problem is that a lot of background knowledge about the text is necessary, and it varies from one domain to another. In addition, the authors state that the relation between concepts is based, in most cases, on their simultaneous appearance in the same title or abstract. Also, the generation of hypotheses does not follow a formal procedure. That said, some formal models for conjectures, consequences and hypotheses are available in the literature, and these can become tools for literature-based discovery and TKM.
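Here is a minimal sketch of the A-B-C pattern itself, under the simplifying assumption that each title or abstract is reduced to the set of concepts it mentions and that a shared-B count above an illustrative threshold is enough to propose a hypothesis.

using System;
using System.Collections.Generic;
using System.Linq;

// A hedged sketch of literature-based discovery: an A term never co-occurring with C
// but sharing many intermediate B terms with it becomes a candidate hypothesis.
class LiteratureBasedDiscovery
{
    static List<string> CandidateLinks(List<HashSet<string>> documents, string c, int minSharedB)
    {
        // Concepts co-occurring with C anywhere in the collection.
        var coOccursWithC = new HashSet<string>(
            documents.Where(d => d.Contains(c)).SelectMany(d => d).Where(t => t != c));

        // For each concept A that never co-occurs with C, count how many of its
        // co-occurring concepts B also co-occur with C.
        var allConcepts = documents.SelectMany(d => d).Distinct()
                                   .Where(a => a != c && !coOccursWithC.Contains(a));
        var candidates = new List<string>();
        foreach (var a in allConcepts)
        {
            var bTerms = documents.Where(d => d.Contains(a)).SelectMany(d => d).Where(t => t != a);
            int shared = bTerms.Distinct().Count(b => coOccursWithC.Contains(b));
            if (shared >= minSharedB)
                candidates.Add(a);   // hypothesize a transitive A - B - C relation
        }
        return candidates;
    }

    static void Main()
    {
        var titles = new List<HashSet<string>>
        {
            new HashSet<string> { "stress", "migraine" },
            new HashSet<string> { "stress", "magnesium" },
            new HashSet<string> { "calcium channel blockers", "migraine" },
            new HashSet<string> { "magnesium", "calcium channel blockers" }
        };
        // magnesium never co-occurs with migraine but shares two B terms with it.
        Console.WriteLine(string.Join(", ", CandidateLinks(titles, "migraine", 2)));  // magnesium
    }
}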
A text mining technique that can be considered TKM is one involving the discovery of contradictions in a collection of texts. The starting point is a set of text documents T = t1, ..., tn of arbitrary length. A first order logic representation in clausal normal form of the content of each text is employed as the intermediate form, and a deductive inference procedure based on resolution is performed. The intermediate form is obtained by a semi-automatic procedure in two steps. First, an interlingua representation is obtained from the text. Then the first order clausal form is obtained by means of information extraction techniques, possibly involving background knowledge.
The mining process here is similar to the level-wise, candidate-generation exploration used in frequent itemset discovery, but with the minimum support criterion replaced by non-contradiction of the knowledge base obtained from the union of the intermediate forms of the texts. Texts that do not contain a contradiction are identified first; texts containing a contradiction are not considered at the following level. Candidate generation for the next level is performed in the same way as in frequent itemset mining at successive levels. The final result is the collection of subsets of texts in the collection that generate a contradiction.
#codingexercise
Double GetAlternateOddNumberRangeMedian(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeMedian();
}

One of the interesting things about the discussion of TKM is that it seems applicable to domains other than text.

Sunday, January 11, 2015

We continue the discussion on Text Knowledge Mining. We now look at what knowledge mining does not mean, that is, at other ways people use the term. For example, knowledge mining is sometimes employed to specify that the intermediate form contains knowledge richer than simple data structures, even though inductive data mining techniques are used on the intermediate forms and the final patterns contain only those interesting to the user. In another usage, knowledge mining indicates that background knowledge and user knowledge are incorporated into the mining process in order to ensure that the intermediate form and the final patterns contain only the concepts interesting to the user. In yet another, knowledge mining refers to using background knowledge to evaluate the novelty and interestingness of patterns after an inductive process.
Sometimes the term deductive mining is used instead, but it is vague and even it differs from the proposal made so far. Deductive mining is used for a group of text mining techniques whose best known example is said to be information extraction. This is described as the mapping of natural language texts into predefined, structured representations, or templates, which when filled represent an extract of the key information from the original text, or alternatively as the process of filling the fields and records of a database from unstructured text. This implies that no new knowledge is found. Information extraction therefore falls in the deductive category based on its starting point, but the difference is that the deductive inference performed is a translation from one representation to another. One specific application, the automatic augmentation of ontology relations, is a good example of information extraction. TKM can use such techniques.
Some have added that deductive data mining can be performed by adding deductive capabilities to mining tools in order to improve the assessment of the knowledge obtained; the knowledge is supposed to be founded mathematically and to take hidden data assumptions into account.
A deductive data mining paradigm has also been proposed, where the term refers to the fact that the discovery process is performed on the result of user queries, which limits the possibility of working on corrupt data. However, in these cases the generation of new knowledge is purely inductive.
Having discussed what TKM differs from, we now see what TKM is related to. Some have alluded to non-novel investigation, like information retrieval; semi-novel investigation, like standard text mining and KDD for pattern discovery; and novel investigation or knowledge creation, like intelligent text mining, which implies interaction between users and text mining tools and/or artificial intelligence techniques. In this last category, there is no indication of the kind of inference used to get the new knowledge.
Some advanced text mining techniques that can be considered TKM are related to literature-based discovery, where these techniques suggest possible relations between concepts appearing in the literature on a specific topic, although some authors prefer not to call this discovery and instead speak of assisting human experts in formulating new hypotheses through an interactive process.
#codingexercise
Double GetAlternateOddNumberRangeMax(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeMax();
}
#codingexercise

Double GetAlternateOddNumberRangeAvg(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeAvg();
}

#codingexercise

Double GetAlternateOddNumberRangeSum(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeSum();
}

Saturday, January 10, 2015

We continue the discussion on the paper Text Knowledge Mining: An Alternative to Text Data Mining. We saw that the authors propose TKM as an approach based on deduction and abduction that includes techniques for generating new hypotheses, whereas data mining techniques are based on inductive inference. TDM can be both subjective and objective: when TDM is subjective, domain knowledge is typically involved, and when it is objective, statistics is typically used. Together they represent simple facts in data structures from which new knowledge can be obtained. This is mining, and even though there is a strong association with inductive learning, the word induction is generally not used with reference to it. The prevalence of this approach implies that data mining has been useful. Data mining, however, differs from text mining in that the former depends on a data model while the latter depends on natural language. Data models have limited expressive power; natural language is richer, able to represent not only simple facts but also general knowledge, from relation-like rules to complex procedures. The disadvantages of natural language are computationally expensive structures, vagueness, ambiguity and so on. There is no restriction on what kind of techniques, whether deductive and/or abductive, can be used with mining, and by proposing text mining based on non-inductive inference the authors propose a new direction. TKM is mining by reasoning using knowledge contained in text. TKM is a particular case of what the authors call knowledge mining, which is defined as obtaining non-trivial, previously unknown and potentially useful knowledge from knowledge repositories. The main difference between the definition of knowledge mining and the definition of data mining is that the former intends to discover knowledge from knowledge repositories while the latter attempts to discover knowledge from data repositories. The difference is the starting point, the basic material from which new knowledge is to be obtained, and the starting point also makes the kind of inference clear: deductive or abductive in the former, inductive in the latter. Reading this, it might feel like we are taking existing techniques and applying them to a higher layer; in fact there can be other layers as well. Knowledge and data are not the only layers: for example, actions are based on knowledge, and having actions recommended from knowledge is yet another stratum that we have not reached yet. Coming back to text mining, the phases of the TKM process are exactly the same as those of TDM: first, text refining to obtain a computationally manageable intermediate form representing the text; then a mining procedure to obtain new knowledge; and a final step for assessing and filtering the knowledge obtained.
#codingexercise
Double GetAlternateOddNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeMin();
}

Friday, January 9, 2015

Today we discuss a paper called Text Knowledge Mining: An Alternative to Text Data Mining. The authors draw a contrast with contemporary mining techniques, which they characterize as inductive inference. They term their approach deductive inference and include some existing techniques in this category. They also discuss the application of existing theories to possible future research in this category.
They say that text mining has essentially been data mining on unstructured data, performed by obtaining structured datasets called intermediate forms. Some examples are:
A text unit of a word translates to an intermediate form of a bag of words or N-grams.
A concept translates to a concept hierarchy, conceptual graph, semantic graph, or conceptual dependence.
A phrase translates to N-phrases, multi-term text phrases, trends, etc.
A paragraph translates to a paragraph, N-phrases, multi-term text phrases, and trends.
A text unit of a document is retained as such.
They argue that text data is not inherently unstructured. Rather, it is characterized by a very complex implicit structure with very rich semantics, one that has defeated many representation attempts. The use of intermediate forms in fact loses the semantics of the text, because the chosen text unit captures only a very small part of its expressive power.
The authors quote the study of causal relationships in medical literature, where the attempt to piece together causes from the titles in the MEDLINE database in order to generate previously unknown hypotheses actually produced good results. For example, the hypothesis that migraine is linked to a deficiency of magnesium was established from the following:
stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression is implicated in some migraines
high levels of magnesium inhibit spreading cortical depression
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability
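As an illustration of mining by reasoning over such facts, here is a small sketch (my own, not from the paper) that chains pairwise "A suggests B" statements, like the ones listed above, to surface an indirect link between two terms; the facts, names, and method are purely illustrative.
using System;
using System.Collections.Generic;
using System.Linq;

class KnowledgeChaining
{
    // Each fact links one term to another, loosely in the spirit of the migraine example above.
    static readonly List<(string From, string To)> Facts = new List<(string, string)>
    {
        ("stress", "migraine"),
        ("stress", "magnesium loss"),
        ("magnesium", "calcium channel blocking"),
        ("calcium channel blocking", "migraine prevention")
    };

    // Breadth-first search over the fact graph to find an indirect chain between two terms.
    static List<string> FindChain(string start, string goal)
    {
        var parent = new Dictionary<string, string> { [start] = null };
        var queue = new Queue<string>();
        queue.Enqueue(start);
        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            if (current == goal)
            {
                var chain = new List<string>();
                for (var n = goal; n != null; n = parent[n]) chain.Add(n);
                chain.Reverse();
                return chain;
            }
            foreach (var next in Facts.Where(f => f.From == current).Select(f => f.To))
                if (!parent.ContainsKey(next)) { parent[next] = current; queue.Enqueue(next); }
        }
        return null;
    }

    static void Main()
    {
        var chain = FindChain("magnesium", "migraine prevention");
        Console.WriteLine(chain == null ? "no chain" : string.Join(" -> ", chain));
    }
}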
#codingexercise
Double GetAlternateEvenNumberRangeMedian(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMedian();
}
#codingexercise
Double GetAlternateEvenNumberRangeStdDev(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeStdDev();
}

Thursday, January 8, 2015

Today we continue to discuss the Shasta distributed shared memory protocol. We review the effects of upgrades and data sharing. The runs were repeated just like in the previous discussion. The execution time for the base runs was measured for each application and the other times were normalized to this time. The base set of runs uses upgrade requests and does not use sharing writeback messages. The former implies that no data is fetched on a store if the processor already has a shared copy. The latter implies that home is not updated on 3-hop read operations. The base run was compared to a run that does not support upgrade messages; in this case, the processor generates a read-exclusive request whether or not there is a local shared copy of the line. The base run was also compared to a run in which sharing writeback messages were added. The runs were also compared with varying block sizes.
The results showed that support for upgrade messages is important for a number of applications. On the other hand, sharing writeback messages typically hurt performance, which is why they were not used in the base set of runs. One application was, however, an exception because several processors read the data produced by another processor. It was also seen that larger block sizes can exacerbate the cost of the writebacks. In all these cases, we see that supporting a dirty-sharing protocol is important for achieving higher performance in Shasta.
We now look at the effects of the migratory optimizations. The execution time for the base runs was recorded and the other times were normalized to this time. A comparison was made with a run involving migratory optimizations, and the set of runs was then repeated for varying block sizes. It was seen that the migratory optimization did not provide an improvement and even degraded performance in some cases. The degradations are slight and could have been worse were it not for the revert mechanism and hysteresis built into the protocol. This is due to the fact that the migratory patterns are either not present at granularities of 64 bytes or larger, or the patterns are unstable. There was only one application that detected a large number of stable patterns; in fact, the number of upgrade misses was reduced by over 90% for this application. At larger block sizes, the same application ends up having fewer and more unstable patterns, resulting in a slight loss of performance. Thus the migratory optimizations in Shasta have had little or no benefit. There are several factors that contribute to this. First, the use of upgrade messages reduces the cost of the store misses that migratory conversion might eliminate. Second, exploiting release consistency is effective in hiding the latency of the upgrades. Finally, the batching optimization also leads to the merging of load and store misses to the same line within a single batch.
To summarize the results overall, support for variable-granularity communication is by far the most important optimization in Shasta. Support for upgrade messages and a dirty-sharing protocol is also important for achieving higher performance. The optimizations from the release consistency models provide smaller performance gains because the processors are busy handling messages while they wait for their own requests to complete. And lastly, the migratory optimizations turn out not to be useful in the context of Shasta.
#codingexercise
Double GetAlternateEvenNumberRangeAvg(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeAvg();
}
#codingexercise
Double GetAlternateEvenNumberRangeMode(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMode();
}

Wednesday, January 7, 2015

Today we review the effects of exploiting release consistency in the Shasta protocol we were discussing. The change in the execution time of eight- and sixteen-processor runs with a 64-byte block size was studied for different levels of optimization. For each application, the problem size was determined beforehand and the corresponding execution time was used to normalize the other times. The base set of runs exploited all of the optimizations related to batching and release consistency except that the release operations were blocking. The batching optimizations therefore included multiple outstanding misses and merging of loads and stores to the same lines within a batch. The release consistency optimizations included non-blocking stores, eager exclusive replies, and the lockup-free optimization. The execution times were compared to a conservative implementation of sequential consistency. The execution times were also compared to a run with the addition of non-blocking releases to the optimizations in the base set of runs. The improvement in the base set of runs over the conservative sequential consistency was nearly 10%. However, the addition of non-blocking releases yielded little or no improvement in performance. The execution time was also broken down and studied. It comprised task time, read time, write time, synchronization time, and message time. Task time includes the time for executing inline miss checks and the code necessary to enter the protocol. Read time and write time represent the stall time for read and write misses that are satisfied by other processors through the software protocol. There are also stalls on a store if there are non-contiguous stores to a pending line. Synchronization time represents the stall time for application locks and barriers, including both acquire and release times. Message time represents the time spent handling messages when the processor is not already stalled; if the processor is stalled on data or synchronization, this time is hidden within the read, write, and synchronization times. Everything else was considered miscellaneous. The breakdowns indicate that the optimizations used in the base set of runs are effective in significantly reducing and sometimes eliminating the write stall time. When compared with the sequential consistency runs, the reduction in write stall time is accompanied by an increase in the other overhead categories. Even though the processors do not directly stall for stores, the pending store requests still require servicing, thereby increasing the time for read, synchronization, message, and other overheads. This meant little or no net improvement due to the increases in the other categories. The time spent by the processors handling incoming protocol messages was also collected. The contribution of the message handling time to the total execution time is less than 10%. Nevertheless, a large portion of the messages are handled while a processor is waiting for its own data and synchronization requests. The applications reported an average of 20-35% of time spent in such waiting. Therefore, the processors are heavily utilized while they wait for their own requests to complete. This also implies that when the stall time for some operations is hidden, it increases the stall time for other operations.
#codingexercise
Double GetAlternateEvenNumberRangeMax(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMax();
}
#codingexercise
Double GetAlternateEvenNumberRangeSum(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeSum();
}

Tuesday, January 6, 2015

We discuss next the performance results for the Shasta implementation. The cluster comprised four AlphaServer 4100s connected by a Memory Channel network, a memory-mapped network that allows a process to transmit data to a remote process without any operating system overhead via a simple store to a mapped page. The Shasta implementation involved a message passing layer that ran efficiently on top of the Memory Channel. By using separate message buffers between each pair of processors, locking is eliminated when adding or removing messages from the buffers. Message passing goes through shared memory segments when the communicating processors are on the same node.
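As a minimal sketch of the per-pair buffering idea, assuming a single-producer/single-consumer ring buffer per ordered (sender, receiver) pair; the types, names, and sizes here are my own, not the actual Shasta message layer.
using System.Threading;

// One buffer per ordered (sender, receiver) pair; only the sender enqueues and only
// the receiver dequeues, so no lock is needed, just head/tail indices.
class PairMessageBuffer
{
    private readonly byte[][] slots;
    private volatile int head;   // advanced only by the receiver
    private volatile int tail;   // advanced only by the sender

    public PairMessageBuffer(int capacity) { slots = new byte[capacity][]; }

    public bool TrySend(byte[] message)
    {
        int next = (tail + 1) % slots.Length;
        if (next == head) return false;        // buffer full; the caller retries or polls
        slots[tail] = message;
        tail = next;                           // publish the message to the receiver
        return true;
    }

    public byte[] TryReceive()
    {
        if (head == tail) return null;         // nothing pending from this sender
        byte[] message = slots[head];
        head = (head + 1) % slots.Length;
        return message;
    }
}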
Nine applications were run for at least a few seconds on the cluster. When the checking overheads were compared to the sequential running times, they ranged from 10 to 25%. In parallel execution, this overhead was relatively smaller because of the communication and synchronization overheads.
The baseline for parallel execution was measured with a 64-byte fixed line size, the home placement optimization, all optimizations related to exploiting release consistency, and the avoidance of sharing writeback messages. All applications achieve higher speedups with more processors.
To study the effects of variable coherence granularity, five of the applications had it set higher than 64 bytes. Variable granularity improved performance by transferring data in larger units and reducing the number of misses on the main data structures.
When some of the applications were run with slightly larger problem sizes, the miss checks were identical but the speedups improved significantly.
The base set of runs exploited all of the optimizations related to batching and release consistency. When compared with sequential consistency, the gain in performance was as much as 10%, with the gains more noticeable with fewer processors. Adding the non-blocking release optimization does not visibly improve performance beyond the base set of runs and in some cases leads to slightly lower performance.
The execution time was also broken down and studied. Task time represented the time spent executing the application, including hardware cache misses; it also included the time for executing the inline miss checks and the code necessary to enter the protocol. Read time and write time represented the stall time for read and write misses. Synchronization time represented the stall time for application locks and barriers. Message time represented the time spent handling messages.
The runs indicated a significant reduction and sometimes elimination of write stall time. The overheads increased in read, synchronization, message handling and others but this had little or no effect on performance.
For runs with variable block sizes, for the subset of applications that exploit this feature, the breakdowns indicated higher efficiency than the runs with a 64-byte block size, and the trends were similar to those described above.
To further isolate the effects of the various optimizations used in the base set of runs, the experiments were repeated while allowing the overlap of multiple misses in a batch relative to the sequential consistency runs. This showed virtually no improvement in performance. A second set of experiments was run by adding the eager exclusive reply optimization, where the reply data for a read-exclusive request is used by the requesting processor before all invalidations are acknowledged. In this case too, the performance did not improve. Therefore much of the performance difference between the sequential consistency and base runs can be attributed to the non-blocking store optimization.
#codingexercise
Double GetAlternateEvenNumberRangeMin(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMin();
}

Monday, January 5, 2015

Today we continue the discussion on the performance of the Shasta Distributed Shared Memory Protocol. We were discussing protocol optimizations with prefetch and home placement directives. Shared data in the system undergoes state transitions between only three states: invalid, shared, and exclusive. By requiring a shared miss check on each load and store of the data being referenced, Shasta maintains cache coherency. Shared data is split, based on address space ranges, into blocks and lines. The operations supported are read, read-exclusive, and exclusive/upgrade, and a state table entry is maintained for each line. Each processor maintains directory information and an owner for each line. With these data structures and the tight transitions, coherency is guaranteed via communication.
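A minimal sketch of the per-line state bookkeeping this implies; the table layout, line size, and names are assumptions for illustration, not the actual Shasta data structures.
enum LineState { Invalid, Shared, Exclusive }

class LineStateTable
{
    const int LineSize = 64;                   // assumed fixed line size for this sketch
    private readonly LineState[] states;       // one entry per line of shared memory

    public LineStateTable(int sharedBytes) { states = new LineState[sharedBytes / LineSize]; }

    // The inline check before a load or store consults the entry for the target line.
    public bool StoreWouldMiss(long offset) { return states[offset / LineSize] != LineState.Exclusive; }
    public bool LoadWouldMiss(long offset) { return states[offset / LineSize] == LineState.Invalid; }
}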
We now look at detecting migratory sharing patterns. Migratory sharing occurs when data is read and modified by different processors, leading to the migration of the data from one processor to another. By keeping extra information in each directory entry, the protocol detects whether the data in each line exhibits a migratory pattern. When migratory sharing has been observed a threshold number of times, the line is designated for migratory conversion and a read request is automatically converted to a read-exclusive request at the directory. This conversion avoids the load miss followed by a store miss to the same line that is typical for migratory shared data. The protocol also provides a mechanism to revert a line from migratory conversion. The reply data for a converted read request is cached in a special caching state called exclusive-migratory. The owner processor treats the line as exclusive and, on a subsequent store, changes the line to the ordinary exclusive state. Breaks in the migratory behavior are easy to detect: if an incoming request from another processor arrives before the owner processor writes to the line, while the line is still exclusive-migratory, then a break is observed and a message is sent to the home directory to nullify or revert the migratory conversion for that line.
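A rough sketch of the threshold-and-revert bookkeeping described above; the counters and threshold values are made up for illustration and are not the actual Shasta parameters.
// Per-line, directory-side counters for migratory detection; thresholds are illustrative.
class MigratoryInfo
{
    const int ConversionThreshold = 2;     // migratory observations before converting reads
    const int RevertLimit = 3;             // breaks after which conversion is disabled for this line

    int migratoryObservations;
    int reverts;
    public bool ConvertReadsToReadExclusive { get; private set; }

    // Called at the home when the read-then-write pattern moves from one processor to another.
    public void ObservedMigratoryHandoff()
    {
        if (reverts >= RevertLimit) return;                 // stop flip-flopping on this line
        if (++migratoryObservations >= ConversionThreshold)
            ConvertReadsToReadExclusive = true;
    }

    // Called when another processor's request arrives while the owner still holds the line
    // exclusive-migratory without having written it, i.e. the migratory pattern was broken.
    public void ObservedBreak()
    {
        ConvertReadsToReadExclusive = false;
        migratoryObservations = 0;
        reverts++;
    }
}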
#codingexercise
Double GetAlternateEvenNumberRangeCount(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeCount();
}
#codingexercise
Double GetAlternateEvenNumberRangeMin(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMin();
}

Switching to and from the migratory conversion state can be stopped for a given line if the line reverts a threshold number of times. This removes needless conversions and avoids continuous switching.

Sunday, January 4, 2015

Today we continue the discussion on the performance of the Shasta Distributed Shared Memory Protocol. We were discussing multiple coherence granularities. The block size is chosen automatically: it is the same as the allocated size up to 64 bytes, and the line size otherwise. The programmer can override this to fine-tune the application. Different granularities are associated with different virtual pages, and newly allocated data is placed on the appropriate page. The block size of each page is communicated to all the nodes at the time the pool of shared pages is allocated. A caveat with configurable granularities is that too much control may adversely affect system performance. The home processor for individual pages can be explicitly specified. Shasta also supports non-binding prefetch and prefetch-exclusive directives, enabling the programmer to issue prefetches where a large number of misses occur at a specific point in the code.
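As a small sketch of how a per-page block size might be recorded and consulted; the page size, defaults, and structure here are assumptions, not the actual Shasta implementation.
using System;
using System.Collections.Generic;

class GranularityMap
{
    const int PageSize = 8192;                       // assumed page size for this sketch
    private readonly Dictionary<long, int> blockSizeByPage = new Dictionary<long, int>();

    // Called at allocation time; the chosen size is communicated so that every node agrees.
    public void SetBlockSize(long pageNumber, int allocationSize)
    {
        // Same rule as described above: the allocated size up to 64 bytes, the line size (64) otherwise.
        blockSizeByPage[pageNumber] = Math.Min(allocationSize, 64);
    }

    // The programmer's override for fine-tuning a particular data structure.
    public void Override(long pageNumber, int blockSize) { blockSizeByPage[pageNumber] = blockSize; }

    // A miss on any address fetches the whole block that the address's page prescribes.
    public int FetchUnitFor(long sharedOffset)
    {
        int size;
        return blockSizeByPage.TryGetValue(sharedOffset / PageSize, out size) ? size : 64;
    }
}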
This protocol exploits relaxed memory consistency models. It emulates the behavior of a processor with non-blocking loads and stores and a lockup-free cache. A non-blocking store is supported by issuing a read-exclusive or exclusive request and recording where the store occurred, thus letting the operation continue; Shasta then merges the reply data with the newly written data that is already in memory. Non-blocking loads are implemented through the batching optimization. Further, non-blocking releases are supported by delaying a release operation on the side until all of the previous operations have completed, which allows the processor to continue with the operations following the release. Since the load and store operations are non-blocking, a line may be in one of two pending states: pending-invalid and pending-shared. The pending-invalid state corresponds to an outstanding read or read-exclusive request on that line. The pending-shared state corresponds to an outstanding exclusive request.
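A simplified sketch of the non-blocking store path described above, assuming a per-line record of locally written bytes that the reply data is merged around; the names and granularity are illustrative, not the actual Shasta code.
using System.Collections.Generic;

// The store goes straight to local memory, the written offsets are recorded, and the
// reply data for the outstanding request is later merged around them.
class PendingLine
{
    private readonly byte[] localCopy;
    private readonly HashSet<int> locallyWritten = new HashSet<int>();
    public bool Pending { get; private set; }   // a read-exclusive or exclusive request is outstanding

    public PendingLine(byte[] memoryForThisLine) { localCopy = memoryForThisLine; }

    public void NonBlockingStore(int offset, byte value)
    {
        localCopy[offset] = value;              // the processor continues without stalling
        locallyWritten.Add(offset);
        Pending = true;
    }

    public void OnReplyData(byte[] replyData)
    {
        // Keep the newly written bytes; fill the rest of the line from the reply.
        for (int i = 0; i < localCopy.Length; i++)
            if (!locallyWritten.Contains(i)) localCopy[i] = replyData[i];
        locallyWritten.Clear();
        Pending = false;                        // the line is now held exclusively
    }
}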
Batching reduces communication by merging load and store misses to the same line and by issuing requests to multiple lines at the same time. Shasta handles a miss corresponding to a batched load or store by jumping to the inline batch miss code if there is a miss on any single line within a batch. The inline code calls a batch miss handler that issues all the necessary miss requests. Non-stalling stores are implemented by requiring that the handler not wait for invalidation acknowledgements. One caveat with batching is that the batch miss handler may bring in all the necessary lines, but it cannot guarantee that they will be in the appropriate state once all the replies have come back. The reason is that while the handler is waiting for replies, requests from other processors will be serviced and can change the state of the lines in the batch. This does not impact a load operation, which will still get the correct value as long as the original contents of the line remain in main memory; hence the flag for invalidated lines is not set until the batch has ended. After the batch code has executed, the invalidation is completed at the next entry into the protocol. At that time, stores to lines that are no longer in the exclusive or pending-shared state are reissued.
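A rough sketch of the post-batch fix-up described above (my own structure, not the Shasta handler): after the replies return, stores whose lines are no longer in a usable state are collected for reissue at the next protocol entry.
using System.Collections.Generic;

class BatchMissHandler
{
    public enum State { Invalid, Shared, Exclusive, PendingShared }

    // One record per store in the batch, so it can be reissued if its line changed state.
    public struct BatchedStore { public int Line; }

    public static List<BatchedStore> CompleteBatch(
        Dictionary<int, State> lineStates, List<BatchedStore> storesInBatch)
    {
        var toReissue = new List<BatchedStore>();
        foreach (var store in storesInBatch)
        {
            State s;
            if (!lineStates.TryGetValue(store.Line, out s)) s = State.Invalid;
            // Another processor's request was serviced during the batch and took the line away.
            if (s != State.Exclusive && s != State.PendingShared)
                toReissue.Add(store);
        }
        return toReissue;
    }
}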
#codingexercise
Double GetAlternateEvenNumberRangeMid(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMid();
}

Saturday, January 3, 2015

Today we continue the discussion on the performance of the Shasta Distributed Shared Memory Protocol. We study the performance optimizations in Shasta that reduce the effect of the long latencies and large message overheads typical in software DSM systems. The optimizations include minimizing extraneous protocol messages, supporting prefetch and home placement directives, supporting coherence and communication at multiple granularities, exploiting a relaxed memory model, batching together requests for multiple misses, and optimizing access to migratory data. We review these one by one. The number of messages plays a significant role in the overhead associated with a DSM protocol, so the extraneous ones are trimmed. First, the Shasta protocol has the key property that the current owner node specified by the directory guarantees to service a request that is forwarded to it. Hence there are no retries, and no messages such as negative acknowledgements for retried requests. Transient states are handled by allocating queue space at the target processor to delay servicing an incoming request; this also avoids extra messages for sharing writebacks, ownership changes, and replacements, and the current owner always has a valid copy of the data. There are other ways to trim extraneous messages as well. Second, Shasta supports dirty sharing, which eliminates the need to send an up-to-date copy of the line back to the home when the home node is remote and the data is dirty in a third node. Third, supporting exclusive or upgrade requests is an important optimization since it avoids fetching data on a store if the requesting processor already has the line in the shared state. Lastly, the number of invalidation acknowledgements expected for an exclusive request is piggybacked on one of the invalidation acknowledgements to the requestor instead of being sent as a separate message.
Shasta proposes supporting multiple granularities for communication and coherence, even within a single application. Across applications this is enabled by the configurability of the granularity in the inline state check. In addition, within the same application, data that is close together can be communicated at a coarser granularity while data that leads to false sharing can be communicated at a finer granularity. Shared data in the system undergoes state transitions between only three states: invalid, shared, and exclusive. By requiring a shared miss check on each load and store of the data being referenced, Shasta maintains cache coherency. Shared data is split, based on address space ranges, into blocks and lines. The operations supported are read, read-exclusive, and exclusive/upgrade, and a state table entry is maintained for each line. Each processor maintains directory information and an owner for each line. With these data structures and the tight transitions, coherency is guaranteed via communication.
#codingexercise
Double GetAlternateNumberRangeVariance(Double[] A)
{
if (A == null) return 0;
return A.AlternateNumberRangeVariance();
}

#codingexercise
Double GetAlternateNumberRangemid(Double[] A)
{
if (A == null) return 0;
return A.AlternateNumberRangemid();
}

Friday, January 2, 2015

Today we will continue to discuss the WRL Research report on the Shasta shared memory protocol. We introduced the system in the last post, and in this post we will continue to look at some of the checks that Shasta instruments. First let us look at a basic shared miss check. This code first checks whether the target address is in the shared memory range and, if not, skips the remainder of the check. Otherwise the code calculates the address of the state table entry corresponding to the target address and checks whether the line containing the target address is in the exclusive state. While Shasta optimizes these checks, the cost of the miss checks is still high. To reduce the overhead further, some more advanced optimizations are involved. These are described next:
Invalid flag technique: whenever a line becomes invalid on a processor, the Shasta protocol stores a particular flag value in each longword of the line. The miss check code for a load can then just compare the value loaded with the flag value. If the loaded value is not equal to the flag value, the data must be valid and the application code can continue immediately. If the loaded value is equal to the flag value, a miss routine is called that first does the normal range check and state table lookup. The state check distinguishes an actual miss from a "false miss" and simply returns to the application code in the case of a false miss. A false miss is one where the application data actually contains the flag value. Another advantage of this technique is that the load of the state table entry is eliminated. The second optimization is batching miss checks. When the checks for multiple loads and stores are batched together, the overhead is reduced significantly. If the base is the same and the offsets are all less than or equal to the Shasta line size, then a sequence of these loads and stores collectively touches at most two consecutive lines in memory. Therefore, if inline checks verify that these two lines are in the correct state, all of the loads and stores can proceed without further checks. It is convenient to check both lines by just checking the beginning and ending addresses of the range. This batching technique also applies to loads and stores via multiple registers, as long as the range can be determined.
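A minimal sketch of the flag-value load check described above; the flag constant, array-based memory, and handler delegates are assumptions for illustration, not the instrumented code Shasta emits.
using System;

class MissCheckSketch
{
    // A value unlikely to occur in application data; every longword of an invalid line is set to it.
    const long InvalidFlag = unchecked((long)0xDEADBEEFDEADBEEF);

    // Inline check for a load: compare the loaded value against the flag. Only on a match do we
    // take the slow path that does the range check and state-table lookup; that path distinguishes
    // a real miss from a "false miss" where the application data happens to equal the flag value.
    static long CheckedLoad(long[] memory, int index,
                            Func<int, bool> isActualMiss, Func<int, long> fetchLineAndReload)
    {
        long value = memory[index];
        if (value != InvalidFlag) return value;     // fast path: the data must be valid
        return isActualMiss(index) ? fetchLineAndReload(index) : value;
    }
}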

Thursday, January 1, 2015

#codingexercise
Double GetAlternateNumberRangeMode(Double[] A)
{
if (A == null) return 0;
return A.AlternateNumberRangeMode();
}
Today we discuss the WRL Research report on the performance of the Shasta Distributed Shared Memory Protocol. Shasta is a software system that supports a shared address space across a cluster of computers with physically distributed memory. Shasta can keep data coherent at a fine granularity. It implements this coherence by inlining code that checks the cache state of shared data before each load or store. In addition, Shasta allows the coherence granularity to be varied across different shared data structures in the same application. This is helpful compared to similar systems that suffer inefficiencies due to a fixed, large page-size granularity. This paper talks about the cache coherence protocol in Shasta. The protocol borrows a number of optimizations from hardware systems, and since it is written in software, it provides tremendous flexibility in the design of the protocol. The protocol runs on Alpha systems connected via Digital's Memory Channel network. The performance of Shasta was studied with the different optimizations enabled.
Since Shasta supports the shared address space entirely in software and at a fine granularity of coherence, it reduces false sharing and the transmission of unneeded data, both of which are potential problems in systems with large coherence granularities. Code is inserted into the application executable before loads and stores to check whether the data being accessed is available locally in the appropriate state. These checks can further be minimized with appropriate optimizations, making Shasta a viable alternative to existing systems.
The Shasta protocol provides a number of mechanisms for dealing with the long communication latencies in a workstation cluster. Since it can support variable coherence granularities, it can exploit any potential gains from larger communication granularities for specific shared data. As for concerns over the overheads of software-based messaging, Shasta minimizes extraneous coherence messages and uses fewer messages to satisfy shared memory operations than the protocols commonly used in hardware systems.
In fact, the optimizations that attempt to hide memory latencies by exploiting a relaxed memory consistency model lead to much more limited gains. This is because a processor's wait for data or synchronization is overlapped with the handling of incoming coherence messages from other processors; the processor is busy for most of that wait, making it difficult to improve performance by reducing the wait itself. Finally, optimizations related to migratory data are not useful in Shasta because the migratory sharing patterns are unstable or even absent at block sizes of 64 bytes or higher.
Shasta divides the virtual address space of each processor into private or shared regions. Data in the shared region may be cached by multiple processors at the same time, with copies residing at the same virtual address space on each processor.
#codingexercise
Double GetAlternateNumberRangeCount(Double[] A)
{
if (A == null) return 0;
return A.AlternateNumberRangeCount();
}

Let us discuss some more on this WRL Research report. The hardware cache coherent multiprocessors and Shasta both define the same basic states:
invalid - the data is not valid on this processor.
shared - the data is valid on this processor, and other processors have copies of the data as well.
exclusive - the data is valid on this processor, and no other processors have copies of this data.
Shasta inserts checks in the code prior to each load and store to safeguard against the invalid state. When a processor attempts to write data that is in an invalid or shared state, we say there is a shared miss. The checks inserted by Shasta safeguard against this shared miss.
Also as in hardware shared systems, Shasta divides up the shared address space into ranges of memory called blocks. All data within a block is in the same state and fetched and kept coherent as a unit. Shasta allows this block size to be different for different ranges of shared address space. Further, to simplify the instrumentation, Shasta divides up the address space further into fixed-size ranges called lines and maintains state information for each line in a state table. This line size is configurable at compile time and the block size is taken to be a multiple of the fixed line size.
Coherence is maintained using a directory-based invalidation protocol. It supports three types of requests: read, read-exclusive, and exclusive (or upgrade). Supporting exclusive requests is an important optimization since it reduces message latency and overhead when the processor already has the line in the shared state. Shasta also supports three types of synchronization primitives in the protocol: locks, barriers, and event flags. These primitives are sufficient for supporting the applications run on the system.
Each virtual page of data is associated with a home processor, and each processor maintains the directory information for the shared data homed on it. Each line is assigned an owner processor, which is the last processor that held an exclusive copy of the line. The directory information comprises a pointer to the current owner processor and a full bit vector of the processors that are sharing the data. Data can be shared without requiring the home node to have an up-to-date copy; this is called dirty sharing. A request that arrives at the home is forwarded to the owner unless the home processor has a copy.
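A minimal sketch of a directory entry and the forwarding decision under dirty sharing; the field names and layout are mine, not Shasta's.
class DirectoryEntry
{
    public int Owner;            // last processor that held an exclusive copy of the line
    public ulong SharerBits;     // full bit vector: one bit per processor holding a copy
    public bool HomeHasCopy;     // with dirty sharing, the home may not hold up-to-date data

    // A request arriving at the home is serviced there when possible; otherwise it is
    // forwarded to the current owner, which guarantees to service it.
    public int ServicingProcessor(int home)
    {
        return HomeHasCopy ? home : Owner;
    }

    public void AddSharer(int processor) { SharerBits |= 1UL << processor; }
    public void ClearSharers() { SharerBits = 0; }
}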
Message passing is costly when done via interrupts, hence messages are serviced through a polling mechanism. Polling is cheaper because there is a single cacheable location that can be tested to determine whether a message has arrived. Polls are inserted at every loop backedge to ensure reasonable response times, and the protocol also polls whenever it waits for a reply. Further, no messages are handled between a shared miss check and the load or store being checked, so polling can be used to simplify the inline miss checks.
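A small sketch of the polling idea, assuming a single flag location tested at loop backedges and drained only when set; the structure is illustrative, not the actual Shasta polling code.
using System.Threading;

class PollingSketch
{
    // A single location the message layer sets when something has arrived; testing it is cheap
    // because it stays in the local cache until a sender writes to it.
    static int messagePending;                 // 0 = nothing to do, 1 = messages waiting

    static void SignalMessageArrival() { Interlocked.Exchange(ref messagePending, 1); }

    // Inserted at every loop backedge: almost always a single compare, occasionally a drain.
    static void Poll()
    {
        if (Interlocked.CompareExchange(ref messagePending, 0, 1) == 1)
            HandleIncomingProtocolMessages();
    }

    static void HandleIncomingProtocolMessages() { /* dispatch protocol handlers here */ }
}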
#codingexercise
Double GetAlternateNumberRangeStdDev(Double[] A)
{
if (A == null) return 0;
return A.AlternateNumberRangeStdDev();
}