Monday, December 9, 2013

Sample Maven test
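The pom.xml change is the usual JUnit dependency, something like this (the version shown is only an example):

<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
    <scope>test</scope>
</dependency>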
// modify pom.xml to include the junit dependency
import org.junit.Assert;
import org.junit.Test;

public class RectangleTest
{
    @Test
    public void test() throws IllegalAccessException
    {
        Rectangle sample = new Rectangle(1, 2);
        double area = sample.Area();
        Assert.assertEquals(2.0, area, 0.0001);
    }
}
mvn test

interface IMeasurable
{
    double Area() throws IllegalAccessException;
}

abstract class Shape implements IMeasurable
{
    volatile double x;
    volatile double y;

    Shape(double a, double b)
    {
        x = a;
        y = b;
    }

    // the base class cannot compute an area of its own
    public double Area() throws IllegalAccessException
    {
        throw new IllegalAccessException();
    }
}

class Rectangle extends Shape
{
    Rectangle(double x, double y)
    {
        super(x, y);
    }

    final void PrintMe() throws IllegalAccessException
    {
        System.out.println("I'm a rectangle with length: " + x + " and breadth: " + y + " and area: " + Area());
    }

    public double Area() throws IllegalAccessException
    {
        return x * y;
    }
}

Sunday, December 8, 2013

While Matsuo's paper discusses an approach that extracts keywords by first taking the top 30% of the frequent terms and then clustering them based on pairwise co-occurrence, we strive to collect the terms that differentiate from the background by automatically finding the number of clusters with the DBSCAN algorithm and KLD distances based on the similarity of distributions. In this case we use the KLD distance as defined by Bigi:
$D_{KLD}(P \parallel Q) = \sum_{x} \left( P(x) - Q(x) \right) \log \frac{P(x)}{Q(x)}$
and the distribution is taken with the probabilities defined as:
p_g = (the total number of terms in the sentences where the term g appears) divided by (the total number of terms in the document).

In the KLD metric, when we have a limited set of sentences, this probability is normalized by multiplying with a scaling factor when the term occurs, or is set to a small default value when it doesn't.
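A minimal sketch of this distance in Java, assuming the distributions come as term-to-probability maps and using a small epsilon as the default for missing terms (the epsilon back-off here stands in for the scaling just described; the names are mine):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KLD
{
    // Bigi-style KLD distance: sum over x of (P(x) - Q(x)) * log(P(x) / Q(x))
    static double kldDistance(Map<String, Double> p, Map<String, Double> q, double epsilon)
    {
        Set<String> vocabulary = new HashSet<String>(p.keySet());
        vocabulary.addAll(q.keySet());
        double sum = 0.0;
        for (String term : vocabulary)
        {
            // back off to a small default when a term is absent from one distribution
            double px = p.containsKey(term) ? p.get(term) : epsilon;
            double qx = q.containsKey(term) ? q.get(term) : epsilon;
            sum += (px - qx) * Math.log(px / qx);
        }
        return sum;
    }
}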

We use the DBSCAN algorithm as follows. Any two core points that are close enough, i.e. within the specified radius of one another, are put in the same cluster. Any border point that is close enough to a cluster is included in the same cluster as its core point. Ties between two or more clusters are resolved by choosing the closest one. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label the points as core, border and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make a cluster out of each group of connected core points.
5. Assign each border point to one of the clusters of its associated core points.
For each point we find the points that are within its neighborhood, so the complexity of this algorithm is O(m^2). However, there are data structures that can help reduce the cost of retrieving the points adjacent to a given point, bringing the overall complexity down to O(m log m). The space complexity is linear because we only persist a small amount of metadata for each point, namely its cluster label and its classification as a core, border or noise point.
The parameters for this algorithm are the radius and the minimum number of points in a cluster to distinguish the core from the border points.
By clustering points this way, we can incrementally provide a user-defined number of keywords by taking the high-valued clusters first and enumerating the keywords in each cluster.
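A sketch of the enumerated steps in Java, assuming the points are indexed 0..n-1 and a pairwise distance is supplied (the interface name is mine, and for simplicity this sketch assigns a border point to the first cluster that reaches it rather than the closest one):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Dbscan
{
    // any pairwise distance will do; for keywords this could be the KLD distance above
    interface Distance
    {
        double between(int i, int j);
    }

    // returns a cluster label for each of the n points; -1 marks noise
    static int[] cluster(int n, Distance d, double eps, int minPts)
    {
        // step 1: label the core points by counting neighbors within eps
        boolean[] core = new boolean[n];
        for (int i = 0; i < n; i++)
        {
            int neighbors = 0;
            for (int j = 0; j < n; j++)
                if (i != j && d.between(i, j) <= eps) neighbors++;
            core[i] = neighbors >= minPts;
        }
        int[] label = new int[n];
        Arrays.fill(label, -1);
        // steps 3 and 4: grow a cluster from each not-yet-labeled core point
        int clusterId = 0;
        for (int i = 0; i < n; i++)
        {
            if (!core[i] || label[i] != -1) continue;
            List<Integer> frontier = new ArrayList<Integer>();
            frontier.add(i);
            label[i] = clusterId;
            while (!frontier.isEmpty())
            {
                int p = frontier.remove(frontier.size() - 1);
                for (int j = 0; j < n; j++)
                {
                    if (label[j] == -1 && d.between(p, j) <= eps)
                    {
                        label[j] = clusterId;         // step 5: border points join the cluster
                        if (core[j]) frontier.add(j); // but only core points keep growing it
                    }
                }
            }
            clusterId++;
        }
        return label; // step 2 is implicit: unreachable points keep the label -1
    }
}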
Selenium is a suite of tools for browser automation.
It requires two jar files: selenium-server.jar and selenium-java-client-driver.jar.
Maven can be used to include the required jar files. We can then import the Maven project into an IDE such as Eclipse. To use Maven, we define a pom.xml file.
We define a page object that the test cases can target. This page object changes only when the page changes, and many test cases can target the same page. The page object uses the WebDriver to test the page. Page objects can be written to extend a generic base page object.
The controls on the page are referred to with WebElement. These can take input, and this is how the tests can drive the testing. To look for elements in a given page, we can rely on the driver's findElement method, which recursively traverses the controls in the document tree to find the element. You can also specify an XPath to find the control. WebElements can be found by ID, which is the most efficient and preferred way, or by class name, which is an attribute on the DOM element. Other means of finding elements include by name, by tag name (the DOM tag name of the element), by link text or partial link text, and by CSS or JavaScript.
Selenium-WebDriver makes direct calls to the browser using each browser's native support for automation. Unlike the older Selenium-RC, which injected JavaScript code, WebDriver drives the browser directly using its built-in support for automation.
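For illustration, a minimal test that locates elements this way (the URL and element id are made up):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SearchTest
{
    public static void main(String[] args)
    {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.example.com");
        // finding by id is the most efficient and preferred way
        WebElement input = driver.findElement(By.id("query"));
        input.sendKeys("selenium");
        input.submit();
        // XPath works too, e.g. driver.findElement(By.xpath("//div[@class='result']"));
        driver.quit();
    }
}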
The selenium server helps in the following ways:
1) It distributes the tests over multiple machines or virtual machines
2) It connects to a remote machine that has a particular browser version
3) It allows tests written in languages other than Java to use the HtmlUnitDriver.
As with all user interface automation, tests may need to wait a few seconds for the page or its controls to load, and this has often caused reliability issues for tests. As with most web pages, tests may also have to target international sites. Each browser maintains some support for automation, and cross-browser compatibility is one of the primary concerns when writing tests. This is where WebDriver helps.
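One common way to handle the waiting explicitly, rather than with fixed sleeps, is an explicit wait; a small sketch (the element id is hypothetical):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitExample
{
    static WebElement waitForResults(WebDriver driver)
    {
        // wait up to 10 seconds for the results element to appear
        WebDriverWait wait = new WebDriverWait(driver, 10);
        return wait.until(ExpectedConditions.presenceOfElementLocated(By.id("results")));
    }
}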

Saturday, December 7, 2013

Reflections on the last assignment based on what could have gone right and what could have been done better:
But before that I want to add a disclaimer: whatever I write here is for personal improvement and has no relevance or reference to anyone or anything. You are welcome to read this as a post-mortem or lessons learned, but not as anything more, and hopefully not to be taken out of context. These notes are meant to capture my take before it is lost.
Some of the things we learned in the last assignment are as follows:
1) First among these is a thorough understanding of the task at hand. Sometimes questions are not enough, and often the questions don't even need to be asked of others. Sometimes the task involves solving a problem; other times it's straightforward to do. No matter how obvious the solution looks, it's worthwhile to ask a question of yourself. As a tester, I tried to write test cases that were meaningful. I could have spent hours writing exhaustive test cases, and sometimes some cases are only covered by adding variations, but with limited time and effort it was important to understand the feature before writing test cases. An architecture diagram helps, and even more so the customer-facing UI and the actions permitted as well as restricted. As an example, I was testing a feature for Passbook integration of digital Starbucks cards, and it took several iterations before converging on a test case that was both meaningful and mattered to the end user.
2) As a follow-up to 1), it mattered that the process of thinking through showed in a document such as a test plan, and if not, perhaps in code. I was lucky to be advised to follow this process for most of the tasks. While I agreed with this in principle, it was easier to analyze by first putting things in code, even if the code doesn't need to be shared. Moreover, I was working with APIs that could be called by themselves or used together in sequences or workflows.
3) Staying two steps ahead of the game. Things change often in a project. The first two steps helped with identifying the resources needed and finding the dependencies, whether permissions, machines or inventory.
4) Getting a review of the test plan and the test cases gets you support and confidence from others. This is very much needed, even more for the plan than for code reviews.
5) Prioritizing tasks and acting on them in a timely manner is critical. Sometimes we wait too long and become reactive when we could have detected issues earlier and been proactive. As an example, I worked on giving feedback to a developer on a card auto-load feature by providing data for unit tests, which ensured the code was in reasonably good shape prior to test cases and automation.
6) Some pressure is inevitable. For example, I wish I had not been in such haste to complete items, and had verified them before publishing. That would have mitigated a lot of discussion and gathered more support. I don't regret the partial publish; I only wish it had been incremental and verified.
7) Lastly, some things just take time, and focus helps. If we spend time on activities such as investigations, bug triages or test runs, we find less time to code. What we could have done better was to share estimates and checkpoint often, reprioritizing tasks as appropriate and leaving the grind where it didn't matter.
 
In order to complete the discussion on the test portal, we note the following:
There is very little code to be written for a page to kick off a test run and to list the test results. This works well with a model-view-controller design, and more views for the detailed results of each run can be added easily within the framework. We want the results to be published to TFS instead of to files or a database.
The runs are kicked off on a local or remote machine via a service that listens and invokes mstest. In our case, since we wanted to target the testing of queues, we could use queues for the service to listen on. That said, the overall approach is to use as much of the out-of-box solutions as we can rather than customizing or writing new components. Test controllers and runs with MTM are well known; the focus here is on our test categorizations and the use of our portal with a variety of tests.
I want to talk a little bit about the UI and database as well. The choice of MVC/MVVM lets us build out separately testable views with no limit on what can go into the views. We can even have JavaScript and Ajax for postbacks and use control toolkits. With view models, we can have more than one tab per page. This gives us the flexibility to have partial views, one for each genre of test such as web tests, API tests and queue tests. Then, with options to customize the page per user, separate views for changing test data, pages to view stats or associate TFS bugs/work items, and charts or email subscriptions, we can make it all the more sophisticated.
I also don't want to limit the discussion to any one stack. We should be able to implement this with any stack, be it VS or Selenium based.
As an aside, test is also an area where many ideas are applicable and have value elsewhere too. For example, from UI we can borrow the ideas of consolidation, customization, shared or dedicated views, resource grouping and categorization, and layout and organization; from MVC we can borrow decomposition, testability and separation of concerns. We can illustrate these with an example for testability, an example for MVC organization, and an idea for representing persistence via an organization of providers. We can even represent incremental builds.
As a developer we look for ideas and implementations everywhere. 

Friday, December 6, 2013

I want to talk about a web application for executing test runs and viewing results. Let's say that the tests are authored in VS and that they are categorized by means of attributes. The user is able to select one or more categories of tests for execution. For each run, the corresponding mstest arguments are set and invoked. The web page itself has a progress indicator for the duration of the test execution. The test run is invoked by a command line and is independent of the portal view. The results of the test run could be displayed given the availability of a trx file from the run, or from a database where the results are populated independently. The tests themselves have names, owners, categories and the path to the assembly so they can be found and executed. The test execution and history can also be displayed from files or a database. What we need is the minimal logic to kick off a run and show the results. The results for each run could then show the history of each test. Since the views are only about the test results and the input to kick off a run, the models are for the runs/results and the tests/categories to be executed.
The views are separated into a list view and a details view for each result. The list view could have additional features such as paging, sorting and searching to make it easier to navigate the test results, but the details page only requires the path to the trx file and a drill-down into the history of individual tests within the category. The web page could be personalized to display the signed-on user and the test runs associated with that user. This requires AD integration for the login if the web pages are to be displayed internally. A domain name registration could help reach this web application by name.
The reason for describing this web application is to see if the implementation can be done with minimal code, leveraging the latest MVC and Web API frameworks to get working results quickly. This allows views and pages to be added rapidly. So far we have talked about the controller kicking off the run and returning a default view. The views then get updated when the results are ready. Since completion depends on when we have a trx file or the results in the database, the two are handled separately.
In other words, the web page caters to the convenience of kicking off the run and waiting for the results. As for the process where the results are populated into a database offline, we could use either a custom executable or the mstest option to publish the results to TFS. The latter is convenient in the sense that TFS already enables web access and navigation, so our portal doesn't need to do a lot.
Finally, we display the results with jQuery progress bars, itemized row-level styles, and an overall page style and content layout. The web pages could also have themes and variations based on how the tests are populated. For example, tests that target, say, MSMQ could optionally show which queues are targeted.
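To make the minimal-code claim concrete, here is a rough sketch of what such a controller could look like in a Java MVC stack (Spring-style; the service, model class, and view names are all hypothetical stand-ins, since the post itself targets mstest and TFS):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;

// hypothetical service that shells out to mstest and reads trx files or the database
interface TestRunService
{
    TestRun start(String category);
    TestRun resultsFor(long id);
}

// hypothetical model for a run
class TestRun
{
    long id;
    String status; // e.g. "running" or "completed"
}

@Controller
public class TestRunController
{
    private final TestRunService service;

    @Autowired
    public TestRunController(TestRunService service)
    {
        this.service = service;
    }

    // kick off a run for the selected category and return the progress view
    @RequestMapping(value = "/runs", method = RequestMethod.POST)
    public String kickOff(@RequestParam("category") String category, Model model)
    {
        model.addAttribute("run", service.start(category));
        return "runProgress";
    }

    // details view for the results of a run, read from the trx file or the database
    @RequestMapping(value = "/runs/{id}", method = RequestMethod.GET)
    public String results(@PathVariable("id") long id, Model model)
    {
        model.addAttribute("results", service.resultsFor(id));
        return "runDetails";
    }
}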

Thursday, December 5, 2013

In today's post we talk about maximal Kullback-Leibler divergence cluster analysis by Anne Sheehy. In this paper, she discusses a new procedure for performing a cluster analysis, along with a proof of a consistency result for the procedure. Clustering data X1, X2, ..., Xn with distribution P on a sample space X, given a known number of partitions k, results in members Vk from the collection of k-partitions of the space X. The partition that best describes the clustering structure of the data is defined to be the one which maximizes a criterion; in this case, the criterion is a weighted sum of Kullback-Leibler divergences.
The Kullback-Leibler function K(Q, P) takes the value 0 when m = E_P X and increases as m moves away from E_P X. The rate of increase depends on many things besides the distance of m from E_P X, including the dispersion of P in the direction in which m is moving.
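For reference, the divergence itself is the standard Kullback-Leibler divergence; in my own notation (not the paper's), with the dependence on m read as a minimization over distributions with a fixed mean:

$K(Q, P) = \int \log\left(\frac{dQ}{dP}\right) dQ, \qquad m \mapsto \min\{\, K(Q, P) : \mathbb{E}_Q[X] = m \,\}$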
The value K(Qn-min(A), Pn) differentiates the subset A of {X1, X2, ..., Xn} from the rest of the sample. If the cluster A is far removed from the rest of the sample, K(Qn-min(A), Pn) will be very large.
Here the subsets A are chosen such that P(A) > 0.
The paper also mentions a reference to Blaschke's Selection Theorem.
If a clustering procedure results in k clusters on a given set of data, and all points in any one of these clusters are removed from the sample and the rest re-clustered, it should give the original clusters minus the one that was removed. This "cluster omission admissibility" is a characteristic of a good clustering procedure.
In order to select a subset of the original sample, we could consider clustering the probabilities into a set of k partitions, or ranges of values. Since the values are bounded between 0 and 1, and we can divide the space into k non-overlapping ranges, we will be able to use one or more of these clusters for different subsets. The goal is to maximize the KLD criterion, so this works to choose one from the different subsets based on the value of K(Qn-min(A), Pn).
The probabilities used in the KLD are based on term frequencies and can be taken as the ratio of the total number of terms in the sentences where a term appears to the total number of terms in the document.
Thus we can apply the term frequency at any scope, whether sentences relative to documents or documents relative to collection.
The procedure for selection is the same regardless of the scope, and the clustering too can be applied using measures based on different scopes.
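A small sketch of this probability computation at the sentence scope, assuming the sentences are already tokenized (the names are mine):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TermProbabilities
{
    // p_g = (total terms in sentences where g appears) / (total terms in the document)
    static Map<String, Double> compute(String[][] sentences)
    {
        int totalTerms = 0;
        Map<String, Integer> termsInSentencesWith = new HashMap<String, Integer>();
        for (String[] sentence : sentences)
        {
            totalTerms += sentence.length;
            // each distinct term in the sentence is credited with the sentence's length
            Set<String> distinct = new HashSet<String>(Arrays.asList(sentence));
            for (String term : distinct)
            {
                Integer soFar = termsInSentencesWith.get(term);
                termsInSentencesWith.put(term, (soFar == null ? 0 : soFar) + sentence.length);
            }
        }
        Map<String, Double> p = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : termsInSentencesWith.entrySet())
            p.put(e.getKey(), e.getValue() / (double) totalTerms);
        return p;
    }
}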