Saturday, January 10, 2015

We continue the discussion of the paper "Text Knowledge Mining: An Alternative to Text Data Mining". We saw that the authors propose TKM as an approach based on abduction, one that includes techniques that generate new hypotheses. Data mining techniques, by contrast, are based on inductive inference. TDM can be both subjective and objective: when TDM is subjective, domain knowledge is typically involved; when it is objective, statistics is typically used. Together they represent simple facts in data structures from which new knowledge can be obtained. This is mining, and even though there is a strong association with inductive learning, the word induction is generally not used with reference to it. The prevalence of this approach implies that data mining has been useful.

Data mining, however, differs from text mining in that the former depends on a data model while the latter depends on natural language. Data models have limited expressive power; natural language is richer. It is able to represent not only simple facts but also general knowledge, from relations and rules to complex procedures. The disadvantages of natural language are that its structures are computationally expensive to manage and that it suffers from vagueness, ambiguity and so on. There is no restriction on what kind of techniques, whether deductive and/or abductive, can be used with mining. By proposing text mining based on non-inductive inference, the authors propose a new direction. TKM is mining by reasoning over the knowledge contained in text. TKM is a particular case of what the authors call knowledge mining, which they define as obtaining non-trivial, previously unknown and potentially useful knowledge from knowledge repositories. The main difference between the definition of knowledge mining and the definition of data mining is that the former intends to discover knowledge from knowledge repositories while the latter attempts to discover knowledge from data repositories. The difference is the starting point, the basic material from which new knowledge is to be obtained. The starting point also makes the approach clear: the former uses deductive or abductive inference while the latter uses inductive inference. Reading this, it might feel like we are taking existing techniques and applying them at a higher layer. In fact there can be other layers as well. Knowledge and data are not the only layers; for example, actions are based on knowledge, and having actions recommended from knowledge is yet another stratum that we haven't reached yet.

Coming back to text mining, the phases of the TKM process are exactly the same as those of TDM: first, text refining to obtain a computationally manageable intermediate form representing the text; then a mining procedure to obtain new knowledge; and a final step for assessing and filtering the knowledge obtained.
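Since the paper describes TKM and TDM as sharing these same three phases, here is a minimal sketch of that pipeline as an interface. The interface and member names are hypothetical; they only restate the phases listed above and are not prescribed by the paper.

using System.Collections.Generic;

// Hypothetical outline of the three phases: refine, mine, assess.
interface ITextKnowledgeMiner<TIntermediate, TKnowledge>
{
    // Phase 1: text refining into a computationally manageable intermediate form.
    TIntermediate Refine(string text);

    // Phase 2: the mining procedure (deductive/abductive reasoning in TKM) that obtains new knowledge.
    IEnumerable<TKnowledge> Mine(TIntermediate intermediateForm);

    // Phase 3: assessing and filtering the knowledge obtained.
    IEnumerable<TKnowledge> Assess(IEnumerable<TKnowledge> candidates);
}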
#codingexercise
Double GetAlternateOddNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeMin();
}

Friday, January 9, 2015

Today we discuss a paper called "Text Knowledge Mining: An Alternative to Text Data Mining". The authors introduce a notion that is different from contemporary mining techniques, which they characterize as inductive inference. They term their approach deductive inference and include some of the existing techniques in this category. They also discuss the application of existing theories to possible future research in this category.
They say that text mining has essentially been data mining on unstructured data, performed by obtaining structured datasets called intermediate forms. Some examples of the mapping from text unit to intermediate form are:
A text unit of a word translates to an intermediate form such as a bag of words or N-grams.
A concept translates to a concept hierarchy, conceptual graph, semantic graph, or conceptual dependence.
A phrase translates to N-phrases, multi-term text phrases, trends, etc.
A paragraph translates to a paragraph, N-phrases, multi-term text phrases, and trends.
A text unit of a document is retained as such.
They argue that text data is not inherently unstructured. Rather, it is characterized by a very complex implicit structure, with very rich semantics, that has defeated many representation attempts. The use of intermediate forms in fact loses the semantics of the text, because the chosen text unit exploits only a very small part of its expressive power.
The authors quote the study of causal relationships in the medical literature, where an attempt to piece together causes from the titles of the MEDLINE database, in order to generate previously unknown hypotheses, actually produced good results. For example, the hypothesis that migraine may be caused by a deficiency of magnesium was established from the following statements (a minimal sketch of this chaining appears after the list):
stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression is implicated in some migraines
high levels of magnesium inhibit spreading cortical depression
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability
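To make the chaining concrete, here is a small illustrative sketch of how pairs of relations that share an intermediate term (stress, magnesium, calcium channel blockers, and so on) can be joined to propose candidate hypotheses. The types and names are hypothetical; this is a toy rendering of the idea, not the method used in the study.

using System.Collections.Generic;
using System.Linq;

static class HypothesisGenerator
{
    // Relations are (source, target) pairs extracted from titles,
    // e.g. ("stress", "magnesium loss") or ("magnesium", "platelet aggregability").
    // Joining A->B with B->C yields a candidate hypothesis A->C.
    public static IEnumerable<(string Source, string Target)> Chain(
        IReadOnlyList<(string Source, string Target)> relations)
    {
        return from ab in relations
               from bc in relations
               where ab.Target == bc.Source && ab.Source != bc.Target
               select (ab.Source, bc.Target);
    }
}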
#codingexercise
Double GetAlternateEvenNumberRangeMedian(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMedian();
}
#codingexercise
Double GetAlternateEvenNumberRangeStdDev(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeStdDev();
}

Thursday, January 8, 2015

Today we continue to discuss the Shasta distributed shared memory protocol. We review the effects of upgrades and data sharing. The runs were repeated just as in the previous discussion. The execution time for the base runs was measured for each application and the other times were normalized to this time. The base set of runs uses upgrade requests and does not use sharing writeback messages. The former implies that no data is fetched on a store if the processor already has a shared copy. The latter implies that the home is not updated on 3-hop read operations. The base run was compared to a run that does not support upgrade messages; in this case, the processor generates a read-exclusive request whether or not there is a local shared copy of the line. The base run was also compared to a run where sharing writeback messages were added. The runs were also compared with varying block sizes.
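As a hedged sketch of the difference between the two configurations, the decision made on a store miss might look as follows. The state and method names are mine, following the description above rather than the actual Shasta code.

enum LineState { Invalid, Shared, Exclusive }

static class StoreMissPolicy
{
    // With upgrade support, a store to a line already held in the shared state
    // requests only ownership; without it, a read-exclusive request fetches the
    // data even though a local copy already exists.
    public static string RequestForStore(LineState state, bool upgradesSupported)
    {
        if (state == LineState.Exclusive) return "no request";
        if (state == LineState.Shared && upgradesSupported) return "upgrade (no data transfer)";
        return "read-exclusive (data + ownership)";
    }
}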
The results showed that support for upgrade messages is important for a number of applications. On the other hand, sharing writeback messages typically hurt performance, which is why they were not used in the base set of runs. One application was an exception, however, because several processors read the data produced by another processor. It was also seen that larger block sizes can sometimes exacerbate the cost of the writebacks. In all these cases, we see that supporting a dirty sharing protocol is important for achieving higher performance in Shasta.
We now look at the effects of the migratory optimizations. The execution time for the base runs was recorded and the other times were normalized to this time. A comparison was made with a run involving the migratory optimizations, and the set of runs was then repeated for varying block sizes. It was seen that the migratory optimization did not provide an improvement, and even degraded performance in some cases. The degradations are slight and could have been worse were it not for the revert mechanism and hysteresis built into the protocol. This is due to the fact that migratory patterns are either not present at the granularity of 64-byte or larger block sizes, or the patterns are unstable. There was only one application that detected a large number of stable patterns; in fact, the number of upgrade misses was reduced by over 90% for this application. At the larger block sizes, the same application ends up having fewer and more unstable patterns, resulting in a slight loss of performance. Thus the migratory optimizations in Shasta have had little or no benefit. Several factors contribute to this. First, the use of upgrade messages reduces the cost of the store misses that may be eliminated. Second, exploiting release consistency is effective in hiding the latency of the upgrades. Finally, the batching optimization also leads to the merging of load and store misses to the same line within a single batch.
To summarize the results overall, support for variable granularity communication is by far the most important optimization in Shasta. Support for upgrade messages and a dirty sharing protocol is also important for achieving higher performance. The optimizations from the release consistency models provide smaller performance gains because the processors are busy handling messages while they wait for their own requests to complete. And lastly, the migratory optimizations turn out not to be useful in the context of Shasta.
#codingexercise
Double GetAlternateEvenNumberRangeAvg(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeAvg();
}
#codingexercise
Double GetAlternateEvenNumberRangeMode(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMode();
}

Wednesday, January 7, 2015

Today we review the effects of exploiting release consistency in the Shasta protocol we have been discussing. The change in the execution time of eight- and sixteen-processor runs with a 64-byte block size was studied for different levels of optimization. For each application, the problem size was determined beforehand and the corresponding execution time was used to normalize the other times. The base set of runs exploited all of the optimizations related to batching and release consistency, except that the release operations were blocking. The batching optimizations therefore included multiple outstanding misses and the merging of loads and stores to the same lines within a batch. The release consistency optimizations included non-blocking stores, eager exclusive replies, and the lockup-free optimization. The execution times were compared to a conservative implementation of sequential consistency, and also to a run with the addition of non-blocking releases to the optimizations in the base set of runs. The improvement of the base set of runs over conservative sequential consistency was nearly 10%. However, the addition of non-blocking releases yielded little or no improvement in performance.

The execution time was also broken down and studied. It comprised task time, read time, write time, synchronization time and message time. Task time includes the time for executing the inline miss checks and the code necessary to enter the protocol. Read time and write time represent the stall time for read and write misses that are satisfied by other processors through the software protocol; there are also stalls on a store if there are non-contiguous stores to a pending line. Synchronization time represents the stall time for application locks and barriers, including both acquire and release times. Message time represents the time spent handling messages when the processor is not already stalled; if the processor is stalled on data or synchronization, the message handling is hidden within the read, write and synchronization times. Everything else was considered miscellaneous.

The breakdowns indicate that the optimizations used in the base set of runs are effective in significantly reducing and sometimes eliminating the write stall time. When compared with the sequential consistency runs, the reduction in write stall time is accompanied by an increase in the other overhead categories. Even though the processors do not directly stall for stores, the pending store requests still require servicing, thereby increasing the time for read, synchronization, message and other overheads. This meant little or no net improvement, due to the increases in the other categories. The time spent by the processors handling incoming protocol messages was also collected. The contribution of the message handling time to the total execution time is less than 10%. Nevertheless, a large portion of messages are handled while a processor is waiting for its own data and synchronization requests; the applications reported an average of 20-35% of time in such waiting. Therefore, the processors are heavily utilized while they wait for their own requests to complete. This also implies that when the stall time for some operations is hidden, it increases the stall time for other operations.
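Purely as an illustration of the breakdown just described, a run's profile could be represented as follows; the field names are mine, and the normalization mirrors how the times are reported relative to the base run.

// Execution-time components, each normalized to the base run's total time.
struct ExecutionBreakdown
{
    public double Task, Read, Write, Synchronization, Message, Miscellaneous;

    public double Total => Task + Read + Write + Synchronization + Message + Miscellaneous;

    public ExecutionBreakdown NormalizedTo(double baseRunTotal) => new ExecutionBreakdown
    {
        Task = Task / baseRunTotal,
        Read = Read / baseRunTotal,
        Write = Write / baseRunTotal,
        Synchronization = Synchronization / baseRunTotal,
        Message = Message / baseRunTotal,
        Miscellaneous = Miscellaneous / baseRunTotal
    };
}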
#codingexercise
Double GetAlternateEvenNumberRangeMax(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMax();
}
#codingexercise
Double GetAlternateEvenNumberRangeSum(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSum();
}

Tuesday, January 6, 2015

We discuss next the performance results for the Shasta implementation. The cluster comprised four AlphaServer 4100 machines connected by a Memory Channel network, a memory-mapped network that allows a process to transmit data to a remote process without any operating system overhead via a simple store to a mapped page. The Shasta implementation involved a message passing layer that ran efficiently on top of the Memory Channel. By using separate message buffers between each pair of processors, locking is eliminated when adding or removing messages from the buffers. Message passing through shared memory segments is exploited when the communicating processors are on the same node.
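A rough sketch of why per-pair buffers need no locks: with exactly one sender and one receiver per buffer, a bounded ring with separate head and tail indices suffices. This is only a conceptual illustration in ordinary memory with names of my own; the real layer works over Memory Channel mapped pages.

using System.Threading;

// One of these exists per ordered (sender, receiver) pair, so only the sender
// advances the tail and only the receiver advances the head; no lock is needed.
class ChannelBuffer
{
    private readonly byte[][] slots;
    private int head;   // advanced only by the receiver
    private int tail;   // advanced only by the sender

    public ChannelBuffer(int capacity) { slots = new byte[capacity][]; }

    public bool TrySend(byte[] message)
    {
        int t = Volatile.Read(ref tail);
        int next = (t + 1) % slots.Length;
        if (next == Volatile.Read(ref head)) return false;   // buffer full
        slots[t] = message;
        Volatile.Write(ref tail, next);                       // publish after the slot is written
        return true;
    }

    public bool TryReceive(out byte[] message)
    {
        int h = Volatile.Read(ref head);
        if (h == Volatile.Read(ref tail)) { message = null; return false; }   // buffer empty
        message = slots[h];
        Volatile.Write(ref head, (h + 1) % slots.Length);
        return true;
    }
}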
Nine applications were run for at least a few seconds on the cluster. When the checking overheads were compared to the sequential running times, they ranged from 10 to 25%. In parallel execution, this overhead was relatively less significant because of the communication and synchronization overheads.
The baseline for parallel execution was measured with a 64-byte fixed line size, home placement optimization, all optimizations related to exploiting release consistency, and the avoidance of sharing writeback messages. All applications achieve higher speedups with more processors.
To study the effects of variable coherence granularity, five of the applications had it set higher than 64 bytes. Variable granularity improved performance by transferring data in larger units and reducing the number of misses on the main data structures.
When some of the applications were run with slightly larger problem sizes, the miss checks were identical but the speedups improved significantly.
The base set of runs exploited all of the optimizations related to batching and release consistency. When compared with sequential consistency, the gain in performance was as much as 10%, with the gains more noticeable with fewer processors. The addition of the non-blocking release optimization does not visibly improve performance beyond the base set of runs and in some cases leads to slightly lower performance.
The execution time was also broken down and studied. Task time represented the time spent executing the application, including hardware cache misses; it also included the time for executing the inline miss checks and the code necessary to enter the protocol. Read time and write time represented the stall time for read and write misses. Synchronization time represented the stall time for application locks and barriers. Message time represented the time spent handling messages.
The runs indicated a significant reduction and sometimes elimination of the write stall time. The overheads increased for read, synchronization, message handling and other categories, but this had little or no effect on overall performance.
For runs with variable block sizes, for the subset of applications that exploit this feature, the breakdowns indicated higher efficiency compared to the runs with a 64-byte block size, and the trends were similar to those described above.
To further isolate the effects of the various optimizations used in the base set of runs, the experiments were repeated while allowing the overlap of multiple misses in the batches, relative to sequential consistency. This showed virtually no improvement in performance. A second set of experiments was run by adding the eager exclusive reply optimization, in which the reply data to a read-exclusive request is used by the requesting processor before all invalidations are acknowledged. In this case too, the performance did not improve. Therefore much of the performance difference between the sequential consistency and base runs can be attributed to the non-blocking store optimization.
#codingexercise
Double GetAlternateEvenNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMin();
}

Monday, January 5, 2015

Today we continue the discussion on the performance of the Shasta distributed shared memory protocol. We were discussing protocol optimizations with prefetch and home placement directives. The shared data in the system undergoes state transitions between only three states: invalid, shared and exclusive. By requiring that each load and store perform an inline miss check on the data being referenced, Shasta maintains cache coherency. Shared data is split, based on address space ranges, into blocks and lines. The operations supported are read, read-exclusive and exclusive/upgrade, and a state table entry is maintained for each line. Each processor maintains a directory and an owner for each line. With these data structures and the tight transitions, coherency is guaranteed via communication.
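For concreteness, a minimal sketch of this bookkeeping might look like the following; the type and field names are my own shorthand for the state table, directory and owner information, not Shasta's actual data structures.

using System.Collections.Generic;

// The three states shared data can be in.
enum LineState { Invalid, Shared, Exclusive }

// Directory information kept for each line.
class DirectoryEntry
{
    public int Owner;                                   // current owner of the line
    public HashSet<int> Sharers = new HashSet<int>();   // processors holding a shared copy
}

class ProcessorProtocolState
{
    // State table: the state of each line this processor has touched.
    public Dictionary<long, LineState> StateTable = new Dictionary<long, LineState>();

    // Directory entries for the lines whose home is this processor.
    public Dictionary<long, DirectoryEntry> Directory = new Dictionary<long, DirectoryEntry>();
}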
We now look at detecting migratory sharing patterns. Migratory sharing occurs when data is read and modified by different processors, leading to the migration of the data from one processor to another. By keeping extra information at each directory entry, the protocol detects whether the data in each line exhibits a migratory pattern. When migratory sharing is observed a threshold number of times, the line is designated for migratory conversion, and a read request is automatically converted to a read-exclusive request at the directory. This conversion avoids the load miss followed by a store miss to the same line that is typical for migratory shared data. The protocol provides a mechanism to revert a line from migratory conversion. The reply data for a converted read request is cached in a special caching state called exclusive-migratory. The owner processor treats the line as exclusive and, on a subsequent store, changes the line to the ordinary exclusive state. Breaks in the migratory behavior are easy to detect: if an incoming request from another processor arrives before the owner processor writes to the line, while the line is still exclusive-migratory, then a break is observed and a message is sent to the home directory to nullify or revert the migratory conversion for that line.
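Here is a hedged sketch of the detection and revert logic described above, kept per line alongside the directory entry; the counters, thresholds and method names are illustrative choices of mine rather than the values used in Shasta.

// Per-line migratory-sharing bookkeeping (illustrative).
class MigratoryState
{
    private int migratoryObservations;        // read-then-write-by-a-new-owner patterns seen
    private int reverts;                      // times the conversion was nullified
    private const int ConvertThreshold = 2;   // hypothetical threshold
    private const int RevertThreshold = 2;    // hypothetical threshold

    public bool ConversionDisabled { get; private set; }

    // Called when the directory observes the migratory pattern on this line.
    public void ObserveMigratoryPattern() => migratoryObservations++;

    // Should an incoming read request be converted to read-exclusive?
    public bool ConvertReadToReadExclusive()
        => !ConversionDisabled && migratoryObservations >= ConvertThreshold;

    // Called when a request from another processor arrives before the owner
    // writes a line that is still in the exclusive-migratory state.
    public void ObserveBreak()
    {
        migratoryObservations = 0;
        if (++reverts >= RevertThreshold)
            ConversionDisabled = true;   // stop converting this line to avoid continuous switching
    }
}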
#codingexercise
Double GetAlternateEvenNumberRangeCount(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeCount();
}
#codingexercise
Double GetAlternateEvenNumberRangeMin(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMin();
}

Switching to and from the migratory conversion state can be stopped for a given line if the line reverts a threshold number of times. This removes needless conversions and avoids continuous switching.

Sunday, January 4, 2015

Today we continue the discussion on the performance of the Shasta distributed shared memory protocol. We were discussing multiple coherence granularities. The block size is chosen automatically: it is the same as the allocated size up to 64 bytes, and the line size otherwise. The programmer can override this to fine-tune the application. Different granularities are associated with different virtual pages, and newly allocated data is placed on the appropriate page. The block size of each page is communicated to all the nodes at the time the pool of shared pages is allocated. A caveat with configurable granularities is that too much control may adversely affect system performance. The home processor for individual pages can be explicitly specified. Shasta also supports non-binding prefetch and prefetch-exclusive directives, enabling the programmer to request a prefetch where a large number of misses occur at a specific point in the code.
This protocol exploits relaxed memory consistency models. It emulates the behavior of a processor with non-blocking loads and stores and a lockup-free cache. A non-blocking store is supported by issuing a read-exclusive or exclusive request and recording where the store occurred, thus letting the operation continue. Shasta then merges the reply data with the newly written data that is already in memory. Non-blocking loads are implemented through the batching optimization. Further, non-blocking releases are supported by delaying a release operation on the side until all the previous operations have completed; this allows the processor to continue with the operations following the release. Since the load and store operations are non-blocking, a line may be in one of two pending states: pending invalid and pending shared. The pending invalid state corresponds to an outstanding read or read-exclusive request on that line. The pending shared state corresponds to an outstanding exclusive request.
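As a rough illustration of the non-blocking store path just described, a pending miss might be tracked as below. The types and names are hypothetical, and the actual data merging is only indicated in comments.

using System.Collections.Generic;

enum PendingState { PendingInvalid, PendingShared }

class PendingMiss
{
    public PendingState State;
    public List<int> WrittenOffsets = new List<int>();   // words stored while the miss is outstanding
}

class NonBlockingStoreSketch
{
    private readonly Dictionary<long, PendingMiss> pending = new Dictionary<long, PendingMiss>();

    // Record the store and return immediately instead of stalling; the request
    // for the line (exclusive if a shared copy exists, read-exclusive otherwise)
    // is assumed to be sent elsewhere.
    public void Store(long line, int offset, bool hadSharedCopy)
    {
        if (!pending.TryGetValue(line, out PendingMiss miss))
        {
            miss = new PendingMiss
            {
                State = hadSharedCopy ? PendingState.PendingShared : PendingState.PendingInvalid
            };
            pending[line] = miss;
        }
        miss.WrittenOffsets.Add(offset);
    }

    // On the reply, a real implementation merges the reply data with the newly
    // written words already in memory; here we only clear the pending entry.
    public void OnReply(long line, byte[] replyData)
    {
        pending.Remove(line);
    }
}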
Batching reduces communication by merging load and store misses to the same line and by issuing requests to multiple lines at the same time. Shasta handles a miss corresponding to a batched load or store by jumping to the inline batch miss code if there is a miss on any single line within a batch. The inline code calls a batch miss handler that issues all the necessary miss requests. Non-stalling stores are implemented by requiring the handler not to wait for invalidation acknowledgements. One caveat with batching is that the batch miss handler may bring in all the necessary lines, but it cannot guarantee that they will be in the appropriate state once all the replies have come back. The reason is that while the handler is waiting for replies, requests from other processors are serviced and can change the state of the lines in the batch. This does not impact a load operation, which will still get the correct value as long as the original contents of the line remain in main memory; hence the flag for invalidated lines is not set until the batch has ended. After the batch code has been executed, the invalidation is completed at the next entry into the protocol. At that time, stores to lines that are no longer in the exclusive or pending shared state are reissued.
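The following is a hedged sketch of the merging and issuing behavior just described; the delegate-based issueRequest and the representation of a batch are my own simplifications, not Shasta's interface.

using System;
using System.Collections.Generic;

static class BatchMissSketch
{
    // Merge load and store misses to the same line and issue one request per line,
    // without waiting for invalidation acknowledgements. Stores to lines that are
    // no longer exclusive or pending-shared are reissued at the next protocol entry.
    public static void HandleBatch(IEnumerable<(long Line, bool IsStore)> misses,
                                   Action<long, bool> issueRequest)
    {
        var wantsExclusive = new Dictionary<long, bool>();
        foreach (var (line, isStore) in misses)
        {
            wantsExclusive.TryGetValue(line, out bool existing);
            wantsExclusive[line] = existing || isStore;   // any store makes the request exclusive
        }

        foreach (var entry in wantsExclusive)
            issueRequest(entry.Key, entry.Value);   // requests to multiple lines issued together
    }
}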
#codingexercise
Double GetAlternateEvenNumberRangeMid(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeMid();
}