Tuesday, December 10, 2013

This post is based on the white paper on the Inrix web site: "Fueling future mobility with big data."
The paper discusses how high-quality traffic data and sophisticated analysis help get people around quickly and efficiently. High-quality traffic data helps in more than one way. First, it improves journey times: showing more accurate data to the user leads to better routes and greater satisfaction. Second, traffic data is a layer on top of which applications for connected cars and smarter cities can be built.
Mobile devices, applications and internet sites provide digital maps that improve navigation. Navigation technology has become so ubiquitous that it's no longer a differentiating factor and is expected as a built-in feature. Traffic data, however, is different from navigation, and several variables come into play. First, coverage of all available roads, not just the busy ones, is a differentiating factor for traffic data: coverage expands route choices that are otherwise unavailable, and missing coverage has frequently hurt the driver experience. Second, the exact location and dimensions of a traffic queue are critical to planning routes, and this level of detail is generally not available from most providers. Third, traffic data has been tied to specific maps and sometimes specific services, so making it available widely and consistently has been a challenge. Finally, the timeliness of incident reports is critical for the driver to know about route changes or other impact, and there is industry-wide latency in providing such data.
Inrix strives to improve traffic data on all of these fronts with its traffic data service, hoping to make an impact not only on driving but on how driving shapes city planning. The source of the traffic data is the crowd, and this is expected to grow rapidly as applications and implementations gain more penetration. Coverage is improving: an additional 1 million miles of road are now covered on top of the 3 million miles covered previously. Moreover, the traffic can now be painted on any map, on any device, and in several countries.
With the rising popularity of public transit, I'm excited to see improvements in bus traffic locally here on the east side.

Monday, December 9, 2013

I got the following interview question:

Using the following function signature, write a C# function that prints out every combination of indices using Console.WriteLine() whose values add up to a specified sum, n. Values of 0 should be ignored.


public void PrintSumCombinations(List<int> numbers, int n);


· It's okay to use additional private functions to implement the public function
· Be sure to print out the indices of numbers and not the values at those indices
· Don't worry too much about memory or CPU optimization; focus on correctness


To help clarify the problem, calling the function with the following input:

List<int> numbers = new List<int> { 1, 1, 2, 2, 4 };
PrintSumCombinations(numbers, 4);

should result in the following console output (the ordering of the different lines isn't important and may vary by implementation):

0 1 2 (i.e. numbers[0] + numbers[1] + numbers[2] = 1 + 1 + 2 = 4)
0 1 3
2 3
4

Here is my hint: generate variations as permutations of the indices, regardless of the values, then check each sequence for the expected sum.
// requires using System.Linq for Sum()
public void PrintSumCombinations(List<int> numbers, int n)
{
    Permute(numbers, new List<int>(), new bool[numbers.Count], n);
}

private void Permute(List<int> numbers, List<int> candidate, bool[] used, int n)
{
    // candidate holds indices; print them when the values they pick out add up to n
    if (candidate.Count > 0 && candidate.Sum(i => numbers[i]) == n)
    {
        candidate.ForEach(i => Console.Write(i + " "));
        Console.WriteLine();
    }

    for (int i = 0; i < numbers.Count; i++)
    {
        if (used[i] || numbers[i] == 0) continue; // values of 0 are ignored
        candidate.Add(i);
        used[i] = true;
        Permute(numbers, candidate, used, n);
        candidate.RemoveAt(candidate.Count - 1); // backtrack; Remove() could strike the wrong duplicate
        used[i] = false;
    }
}
 
For combinations, we could take subsequences of different lengths and permute them. There may be repetitions, but we process them just the same.

And here is another way to solve the problem, using combinations over numbers paired with their indices:
class IndexedNumber
{
    public int Number;
    public int Index;
}

// requires using System.Linq for Sum()
public static void Combine(List<IndexedNumber> numbers, List<IndexedNumber> candidate, List<List<IndexedNumber>> sequences, int start, int n)
{
    for (int i = start; i < numbers.Count; i++)
    {
        candidate.Add(numbers[i]);
        if (candidate.Sum(c => c.Number) == n)
            sequences.Add(new List<IndexedNumber>(candidate));
        Combine(numbers, candidate, sequences, i + 1, n); // advance past i so each index is used at most once
        candidate.RemoveAt(candidate.Count - 1);          // backtrack
    }
}
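To tie it together, here is a small driver; the wrapper name is mine and not part of the original question. It pairs each value with its index, collects the matching sequences, and prints the indices. Unlike the permutation approach, each combination prints exactly once.

public static void PrintSumCombinationsViaCombine(List<int> numbers, int n)
{
    var indexed = numbers.Select((v, i) => new IndexedNumber { Number = v, Index = i })
                         .Where(x => x.Number != 0) // values of 0 are ignored
                         .ToList();
    var sequences = new List<List<IndexedNumber>>();
    Combine(indexed, new List<IndexedNumber>(), sequences, 0, n);
    foreach (var seq in sequences)
        Console.WriteLine(string.Join(" ", seq.Select(c => c.Index)));
}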
 
 
Sample Maven test
// modify pom.xml to include junit
import org.junit.Assert;
import org.junit.Test;

public class RectangleTest
{
    @Test
    public void test()
    {
        Rectangle sample = new Rectangle(1, 2);
        double area = sample.Area();
        Assert.assertEquals(2.0, area, 0.001); // expected value first, with a delta for doubles
    }
}

Run it with: mvn test
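For reference, the pom.xml change is just the junit dependency (the version shown is illustrative):

<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
    <scope>test</scope>
  </dependency>
</dependencies>

The classes under test follow below.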

interface IMeasurable
{
    double Area() throws IllegalAccessException;
}

abstract class Shape implements IMeasurable
{
    volatile double x;
    volatile double y;

    Shape(double a, double b)
    {
        x = a;
        y = b;
    }

    // default implementation: a generic shape has no area to report
    public double Area() throws IllegalAccessException
    {
        throw new IllegalAccessException();
    }
}

class Rectangle extends Shape
{
    Rectangle(double x, double y)
    {
        super(x, y);
    }

    final void PrintMe()
    {
        System.out.println("I'm a rectangle with length: " + x + " and breadth: " + y + " and area: " + Area());
    }

    // overriding without the checked exception is allowed and keeps callers simple
    public double Area()
    {
        return x * y;
    }
}

Sunday, December 8, 2013

While Matsuo's paper discusses an approach that extracts keywords by first taking the top 30% of the frequent terms and then clustering based on pairwise co-occurrence of terms, we strive to collect the terms that differentiate from the background by automatically finding the number of clusters, using the DBSCAN algorithm and KLD distances based on the similarity of distributions. In this case we use the KLD distance as defined by Bigi:
D_KLD(P || Q) = Σ_x (P(x) − Q(x)) log(P(x) / Q(x))
and the distribution is taken with the probabilities as
P(g) = (the total number of terms in the sentences where term g appears) divided by (the total number of terms in the document).

In the KLD metric, when we have a limited set of sentences, this probability is normalized by multiplying with a scaling factor when the term occurs, or it is set to a default value when it doesn't.
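As a quick sketch, Bigi's formula above translates directly into code; the epsilon default below is my stand-in for the scaling/default-value treatment described in the text:

// p and q are term probability distributions over the same vocabulary.
static double KldDistance(double[] p, double[] q, double epsilon = 1e-9)
{
    double distance = 0;
    for (int x = 0; x < p.Length; x++)
    {
        double px = p[x] > 0 ? p[x] : epsilon; // default value when the term doesn't occur
        double qx = q[x] > 0 ? q[x] : epsilon;
        distance += (px - qx) * Math.Log(px / qx);
    }
    return distance;
}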

We use the DBSCAN algorithm like this: any two core points that are close enough, i.e. within the radius of one another, are put in the same cluster. Any border point that is close enough to a cluster is included in the same cluster as its core point. If there are ties between two or more clusters, they are resolved by choosing the one that's closest. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label the points as core, border and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make a cluster out of each connected group of core points.
5. Assign each border point to one of the clusters of its associated core points.
For each point we find the points within its neighborhood, so the complexity of this algorithm is O(m^2). However, there are data structures that can reduce the cost of retrieving the points adjacent to a given center, bringing this down to O(m log m). The space complexity is linear because we only persist a small amount of metadata for each point, namely the cluster label and its classification as a core, border or noise point.
The parameters for this algorithm are the radius and the minimum number of points in a cluster to distinguish the core from the border points.
By clustering points, we incrementally provide a user-defined number of keywords by taking the high-valued clusters first and enumerating the keywords in each cluster; a code sketch of the clustering step follows below.
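Here is a minimal C# sketch of the steps enumerated above. The generic point type, the distance delegate and the parameter names (eps for the radius, minPts for the density threshold) are my own illustrative choices:

using System;
using System.Collections.Generic;
using System.Linq;

static class Dbscan
{
    const int Unvisited = 0, Noise = -1;

    // Returns a cluster label per point: -1 for noise, 1..k for clusters.
    public static int[] Cluster<T>(IList<T> points, Func<T, T, double> dist, double eps, int minPts)
    {
        var labels = new int[points.Count];
        int clusterId = 0;
        for (int p = 0; p < points.Count; p++)
        {
            if (labels[p] != Unvisited) continue;
            var neighbors = RegionQuery(points, dist, p, eps);
            if (neighbors.Count < minPts) { labels[p] = Noise; continue; } // step 1: not a core point
            labels[p] = ++clusterId; // start a new cluster from this core point
            var queue = new Queue<int>(neighbors);
            while (queue.Count > 0) // steps 3-4: connect the reachable core points
            {
                int q = queue.Dequeue();
                if (labels[q] == Noise) labels[q] = clusterId; // step 5: border point joins the cluster
                if (labels[q] != Unvisited) continue;
                labels[q] = clusterId;
                var qNeighbors = RegionQuery(points, dist, q, eps);
                if (qNeighbors.Count >= minPts) // q is a core point too, so expand through it
                    foreach (int r in qNeighbors) queue.Enqueue(r);
            }
        }
        return labels; // points still labeled -1 are the discarded noise (step 2)
    }

    // Linear scan per point, which is what gives the O(m^2) complexity noted above.
    static List<int> RegionQuery<T>(IList<T> points, Func<T, T, double> dist, int p, double eps)
    {
        return Enumerable.Range(0, points.Count)
                         .Where(i => dist(points[i], points[p]) <= eps)
                         .ToList();
    }
}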
Selenium is a web-based suite of tools for browser automation.
It requires two jar files: selenium-server.jar and selenium-java-client-driver.jar.
Maven can be used to pull in the required jar files. We can then import the Maven project into an IDE such as Eclipse. To use Maven, we define a pom.xml file.
We define a page object that the test cases target. This page object changes only when the page changes, and many test cases can target the same page. The page object uses the WebDriver to test the page. Page objects can be written to extend a generic base page object.
The controls on the page are referred to with WebElement. These can take input, and this is how tests drive the page. To look for controls in a given page, we rely on the driver's findElement method, which traverses the document tree to find the element. You can also specify an XPath to find the control. WebElements can be found by ID, which is the most efficient and preferred way, or by class name, the attribute on the DOM element. Other means of finding elements include by name, by tag name (the DOM tag name of the element), by link text or partial link text, and by CSS or JavaScript.
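As a sketch of the page object pattern together with these locator strategies (shown here with the C# WebDriver bindings; the page and the element locators are made up):

using OpenQA.Selenium;

// A hypothetical login page object; test cases call LoginAs rather than touching controls directly.
class LoginPage
{
    private readonly IWebDriver driver;

    public LoginPage(IWebDriver driver)
    {
        this.driver = driver;
    }

    public void LoginAs(string user, string password)
    {
        driver.FindElement(By.Id("username")).SendKeys(user);            // by ID: most efficient and preferred
        driver.FindElement(By.Name("password")).SendKeys(password);      // by name
        driver.FindElement(By.XPath("//input[@type='submit']")).Click(); // by XPath
    }
}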
The Selenium-WebDriver makes direct calls to the browser using each browser's native support for automation. Unlike the older Selenium-RC, which injected JavaScript code, the WebDriver drives the browser directly through its built-in automation support.
The selenium server helps in the following ways:
1) It distributes the tests over multiple machines or virtual machines
2) It connects to a remote machine that has a particular browser version
3) It helps to use the HtmlUnitDriver instead of the Java bindings for each language.
As with all user interface automation, tests may need to wait a few seconds for the page or the controls to load, which has often caused reliability issues for tests. As with most web pages, tests may also have to target international sites. Each browser maintains some support for automation, and cross-browser compatibility is one of the primary concerns when writing tests. This is where the WebDriver helps.
Saturday, December 7, 2013

Reflections on the last assignment based on what could have gone right and what could have been done better:
But before that, I want to add a disclaimer that whatever I write here is for personal improvement and has no relevance or reference to anyone or anything. You are welcome to read this as a post-mortem or lessons learned, but not as anything more, and hopefully it will not be taken out of context. These are, in a way, meant to capture my take before it is lost.
Some of the things we learned in the last assignment are as follows:
1) First among these is a thorough understanding of the task at hand. Sometimes questions are not enough, and often the questions don't even need to be asked of others. Sometimes the task involves solving a problem; other times it's straightforward to do. No matter how obvious the solution looks, it's worthwhile to ask a question of yourself. As a tester, I tried to write test cases that were meaningful. I could have spent hours writing exhaustive test cases, and sometimes some test cases are only covered by adding variations, but with a shortage of time and effort it was important to understand the feature before writing test cases. An architecture diagram helps, and more so the customer-facing UI and the actions permitted as well as restricted. As an example, I was testing a feature for Passbook integration of digital Starbucks cards, and it took several iterations before converging on a test case that was both meaningful and mattered to the end user.
2) As a follow-up to 1), it mattered that the process of thinking through showed up in a document such as a test plan, and if not, perhaps in code. I was lucky to be advised to follow this process for most of the tasks. While I agreed with this in principle, it was easier for me to analyze by first putting things in code, even if the code didn't need to be shared. Moreover, I was working with APIs that could be called by themselves or used together in sequences or workflows.
3) Staying two steps ahead of the game. Things change often in the project. The first two steps helped with identifying the resources needed and finding the dependencies. These could be permissions, machines or inventory.
4) Getting a review of the test plan and the test cases gets you some support and confidence from others. This is very much needed and more so for the plan than even code reviews.
5) Prioritizing tasks and acting in a timely manner is critical. Sometimes we wait too long and become reactive when we could have detected the issue early and been proactive. As an example, I worked on giving feedback to a developer on a card auto-load feature by providing data for unit tests, which ensured the code was in reasonably good shape prior to test cases and automation.
6) Some pressure is inevitable. For example, I wish I had not been in haste to complete items, and had verified them before publishing. That would have mitigated a lot of discussion and gathered more support. However, I don't regret the partial publish; I only wish it had been incremental and verified.
7) Lastly, I wanted to say some things just take time, and that focus helps. However, if we spend time on activities such as investigations, bug triages or test runs, we find less time to code. What we could perhaps have done better was to share the estimates and checkpoint often, reprioritizing tasks as appropriate or leaving the grind where it didn't matter.
 
In order to complete the discussion on the test portal, we note the following: there is very little code to be written for the page that kicks off a test run and lists the test results. This works well with a model-view-controller design, and we can easily add more views, such as detailed results for each run, within the framework. We want the results to be published to TFS instead of to files or a database. The runs are kicked off on a local or remote machine via a service that listens and launches mstest. In our case, since we wanted to target the testing of queues, we could use queues for the service to listen on. That said, the overall approach is to use as much of the out-of-the-box solutions as we can rather than customize or write new components. Test controllers and runs with MTM are well known; the focus here is on our test categorizations and the use of our portal with a variety of tests.
I want to talk a little bit about the UI and database as well. The choice of MVC/MVVM lets us build separate testable views with no limit on what can go into them. We can even have JavaScript and Ajax for postbacks and use control toolkits. With view models, we can have more than one tab on a page. This gives us the flexibility to have partial views, one for each genre of test, such as webtests, apitests and queue tests. Then, with choices to customize the page per user, separate views for changing test data, pages to view stats or associate TFS bugs and work items, and charts or email subscriptions, we can make it all the more sophisticated. A sketch of a controller in this style follows below.
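Here is a minimal ASP.NET MVC sketch of that page. The controller, the store interface and all the names in it are hypothetical; persistence sits behind a provider interface so the backing store could be TFS, files or a database:

using System.Collections.Generic;
using System.Web.Mvc;

public class TestRun
{
    public int Id { get; set; }
    public string Category { get; set; } // webtests, apitests, queue tests
    public string Status { get; set; }
}

public interface ITestRunStore
{
    IEnumerable<TestRun> GetRuns();
    void EnqueueRun(string category); // e.g. drop a message on the queue the test service listens to
}

public class TestRunsController : Controller
{
    private readonly ITestRunStore store;

    public TestRunsController(ITestRunStore store)
    {
        this.store = store;
    }

    // GET /TestRuns : list the test results
    public ActionResult Index()
    {
        return View(store.GetRuns());
    }

    // POST /TestRuns/Start : kick off a run on a local or remote machine
    [HttpPost]
    public ActionResult Start(string category)
    {
        store.EnqueueRun(category);
        return RedirectToAction("Index");
    }
}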
I also don't want to limit the discussion to any one stack. We should be able to implement this with any stack, be it VS or Selenium based.
As an aside, test is also an area where many ideas are applicable and have value elsewhere too. For example, from UI we can borrow the ideas of consolidation, customization, shared or dedicated views, resource grouping and categorization, and layout and organization, and from MVC we can borrow decomposition, testability, and separation of concerns. We can illustrate these with an example for testability, an example for MVC organization, and an idea for representing persistence via an organization of providers. We can even represent incremental builds.
As a developer we look for ideas and implementations everywhere.