Wednesday, December 11, 2013

A summary of my reading of the book Professional LAMP web development:
This book is a good read even for those familiar with any part of the stack.
The changes introduced in PHP5 include:
1) Objects are now passed by reference (handle) by default, as opposed to being copied as in PHP4.
2) PHP4 offered set_error_handler() to register a user-defined error handler. PHP5 adds the same try-catch-throw semantics found in other object-oriented languages.
3) There is a built-in Exception class that carries message, code, file and line information.
4) Interfaces have been added, and a class can implement multiple interfaces.
5) The Standard PHP Library (SPL) introduces a new set of classes and interfaces. The Iterator interface, along with DirectoryIterator, RecursiveIterator and ArrayIterator, is now available.
6) Constructors and destructors are supported by way of __construct() and __destruct().
7) Access modifiers and the final, static and abstract keywords can be used to control your classes.
8) Instead of PHP4's overload(), method and property overloading is now built in via the magic methods __get(), __set() and __call().
MySQL also has some advanced features. For example, we can query multiple tables, do full-text searching, control access, and analyze and maintain the database. Tools such as phpMyAdmin and the MySQL Administrator GUI are available for some of these chores. The query text generated by these tools can give a better understanding of what the tool or the application does with the database.
The Apache coverage includes URL rewriting, URL spell checking, content compression, using MySQL with Apache, Apache and SSL, and Apache as a file repository.
Site security can be tightened by controlling access, understanding common website attacks, keeping the system current and updated, updating the PEAR and PECL packages installed with PHP, writing a cron job to do automatic updates, and reducing the likelihood of a register_globals exploit or a SQL injection attack. The register_globals exploit is mitigated by initializing all variables; SQL injection is mitigated by validating and escaping user input before it reaches a query, as in the sketch below.
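Parameterized queries are a standard defense against SQL injection. The book's stack is PHP, but the idea carries over to any language; here is a minimal sketch in Java/JDBC (the table and column names are made up for the example):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SafeQuery
{
    // The user input is bound as a parameter, never concatenated into the SQL text,
    // so the driver escapes it and it cannot alter the query structure.
    public static ResultSet findUser(Connection conn, String userName) throws SQLException
    {
        PreparedStatement stmt = conn.prepareStatement(
            "SELECT id, name FROM users WHERE name = ?");   // hypothetical table
        stmt.setString(1, userName);
        return stmt.executeQuery();
    }
}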
 
Continuing from the previous post...
We read that the traffic data and services provided by Inrix improve the coverage, detail, availability and timeliness of traffic data and create avenues for applications where none existed. For drivers, this translates to more options for planning trips and fewer blind spots. With more coverage there are more route options, and with up-to-date information there is more predictability and more time for an evasive response.
For automakers, the detail in the data enables multimodal routing. In addition, device, map and platform independence enables ubiquity without the need to obtain TMC licenses per geography.
Inrix's traffic data can benefit public sector agencies as well. Transportation agencies can better manage the road network and study historical traffic patterns. Public safety agencies can give more detailed information on road conditions. Emergency management agencies can find better routes to respond first to incidents.
Historical and present-day data can both be presented so that they can be used to plan high-occupancy lanes, traffic signal patterns, road sensor deployment and road narrowing schemes. Overall, the timeliness and reliability of the data improve, enabling a better driving experience and reduced costs. Car manufacturers become more competitive and agencies operate more safely.
Wide reach across geographies and reduced mapping costs translate to savings.
Reducing traffic congestion and improving highway conditions becomes easier. From short-term management of incidents to long-term planning of road networks, the insights from the traffic data can help. Furthermore, new avenues open up for technology providers and mobile applications.
The highlights of XD Traffic are:
four million miles of real-time coverage, including one million miles of roads never covered before.
independence from devices, maps and geography.
resolution of up to 250 meters, enabling greater detail in the data.
reach across 37 countries, including emerging markets.
improved traffic data on motorways, highways and arterials.
improved road closure reporting and traffic analysis.
 

Tuesday, December 10, 2013

This post is based on a white paper on the Inrix web site: Fueling Future Mobility with Big Data.
This paper talks about how high-quality traffic data and sophisticated analysis help get people around quickly and efficiently. High-quality traffic data helps in more than one way. First, it improves journey times: more accurate data reaches the user, which improves satisfaction. Second, traffic data is a layer on which applications for connected cars and smarter cities can be built.
Mobile devices, applications and internet sites provide digital maps that improve navigation. Navigation technology has become so ubiquitous that it is no longer a differentiating factor and is expected as a built-in feature. However, traffic data is different from navigation, and there are several variables. First, coverage of all available roads, and not just the busy ones, is a differentiating factor for traffic data; coverage expands the choice of routes, and its absence has frequently hurt the driver experience. Second, the exact location and dimensions of a traffic queue are critical to planning routes, and this level of detail is generally not available from most providers. Third, traffic data has been tied to specific maps and sometimes to specific services, so making it widely and consistently available has been a challenge. Fourth, timely incident reports are critical for the driver to know about route changes or other impacts, and there is an industry-wide latency in providing such data.
Inrix strives to improve traffic data on all of these fronts with its traffic data service, hoping to make an impact not just on driving but on how driving shapes city planning. The source of the traffic data is the crowd, and this is expected to grow rapidly as applications and implementations gain more penetration. Coverage is improving: one million miles of road are now added where only three million miles were covered before. Moreover, the traffic can now be painted on any map, on any device and in several countries.
With the rising popularity of public transit, I'm excited to see improvements in bus traffic locally here on the east side.

Monday, December 9, 2013

I got the following interview question:

Using the following function signature, write a C# function that prints out every combination of indices using Console.WriteLine() whose values add up to a specified sum, n. Values of 0 should be ignored.


public void PrintSumCombinations(List<int> numbers, int n);


- It's okay to use additional private functions to implement the public function
- Be sure to print out the indices of numbers and not the values at those indices
- Don't worry too much about memory or CPU optimization; focus on correctness


To help clarify the problem, calling the function with the following input:


List<int> numbers = new List<int> { 1, 1, 2, 2, 4 };

PrintSumCombinations(numbers, 4);


Should result in the following console output (the ordering of the different lines isn’t important and may vary by implementation):


0 1 2 (i.e. numbers[0] + numbers[1] + numbers[2] = 1 + 1 + 2 = 4)

0 1 3

2 3

4

Here is my hint: generate candidate sequences as permutations of the indices, regardless of their content, and then check each sequence for the expected sum.
public void PrintSumCombinations(List<int> numbers, int n)
{
    var candidate = new List<int>();        // indices chosen so far
    var used = new bool[numbers.Count];
    Permute(numbers, candidate, used, n);
}

// Generates permutations of indices and prints a line whenever the values at the
// chosen indices add up to n. The indices are printed, not the values. Note that
// each qualifying combination appears once per ordering; the combination-based
// approach below avoids that.
private void Permute(List<int> numbers, List<int> candidate, bool[] used, int n)
{
    if (candidate.Count > 0 && candidate.Sum(i => numbers[i]) == n)
    {
        candidate.ForEach(i => Console.Write(i + " "));
        Console.WriteLine();
    }
    for (int i = 0; i < numbers.Count; i++)
    {
        if (used[i] || numbers[i] == 0) continue;   // values of 0 are ignored
        candidate.Add(i);
        used[i] = true;
        Permute(numbers, candidate, used, n);
        candidate.RemoveAt(candidate.Count - 1);    // backtrack
        used[i] = false;
    }
}
 
For combinations, we could take subsequences of different lengths and permute them. There may be repetitions, but we process them just the same.

And here is another way to solve the problem, using combinations over (value, index) pairs. The candidate list is pre-sized with placeholder entries so it can be indexed by level, and placeholders (Index == -1) are filtered out when printing a sequence:
// IndexedNumber pairs a value with its position in the original list
public class IndexedNumber
{
    public int Number { get; set; }
    public int Index { get; set; }
}

// candidate must be pre-sized with numbers.Count placeholder entries so it can be indexed by level
public static void Combine(List<IndexedNumber> numbers, List<IndexedNumber> candidate, List<List<IndexedNumber>> sequences, int level, int start, int n)
{
    for (int i = start; i < numbers.Count; i++)
    {
        if (numbers[i].Number == 0) continue;   // values of 0 are ignored
        if (candidate.Contains(numbers[i]) == false)
        {
            candidate[level] = numbers[i];
            if (candidate.Sum(x => x.Number) == n)
                sequences.Add(new List<IndexedNumber>(candidate));
            if (i < numbers.Count - 1)
                Combine(numbers, candidate, sequences, level + 1, i + 1, n);   // advance past i, not past start
            candidate[level] = new IndexedNumber { Number = 0, Index = -1 };   // reset this slot before trying the next element
        }
    }
}
 
 
Sample Maven test:

// modify pom.xml to include junit (see the snippet below)
import org.junit.Assert;
import org.junit.Test;

public class RectangleTest
{
    @Test
    public void test()
    {
        Rectangle sample = new Rectangle(1, 2);
        double area = sample.Area();
        Assert.assertEquals(2.0, area, 0.0);   // expected, actual, delta
    }
}

Run the test with: mvn test
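For reference, the JUnit dependency in pom.xml would look something like this (a sketch; the version shown is an assumption):

<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
    <scope>test</scope>
  </dependency>
</dependencies>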

interface IMeasurable
{
    double Area() throws IllegalAccessException;
}

abstract class Shape implements IMeasurable
{
    volatile double x;
    volatile double y;

    Shape(double a, double b)
    {
        x = a;
        y = b;
    }

    // default implementation: a shape is not measurable until a subclass overrides Area()
    public double Area() throws IllegalAccessException
    {
        throw new IllegalAccessException();
    }
}

class Rectangle extends Shape
{
    Rectangle(double x, double y)
    {
        super(x, y);
    }

    final void PrintMe()
    {
        System.out.println("I'm a rectangle with length: " + x + " and breadth: " + y + " and area: " + Area());
    }

    @Override
    public double Area()
    {
        return x * y;
    }
}

Sunday, December 8, 2013

While Matsuo's paper discusses an approach to extract keywords by first taking the top 30% of the frequent terms and then clustering them based on pairwise co-occurrence of terms, we strive to collect the terms that differentiate themselves from the background by automatically finding the number of clusters, using the DBSCAN algorithm and KLD distances based on the similarity of distributions. In this case we use the KLD distance as defined by Bigi:
D_KLD(P || Q) = Σ_x (P(x) − Q(x)) log(P(x) / Q(x))
and the distribution is taken with the probabilities as
pg = (the sum of the total number of terms in the sentences where g appears) divided by (the total number of terms in the document).

In the KLD metric, when we have only a limited set of sentences, this probability is normalized by multiplying with a scaling factor when the term occurs, or is set to a default value when it doesn't.
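As an illustration, the KLD distance above can be computed directly from two term-probability maps. This is a minimal sketch; the default value used for missing terms is an assumption:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KldDistance
{
    static final double DEFAULT_PROBABILITY = 1e-9;   // assumed default for absent terms

    // D_KLD(P || Q) = sum over x of (P(x) - Q(x)) * log(P(x) / Q(x))
    public static double distance(Map<String, Double> p, Map<String, Double> q)
    {
        Set<String> terms = new HashSet<String>(p.keySet());
        terms.addAll(q.keySet());
        double d = 0.0;
        for (String term : terms)
        {
            double px = p.containsKey(term) ? p.get(term) : DEFAULT_PROBABILITY;
            double qx = q.containsKey(term) ? q.get(term) : DEFAULT_PROBABILITY;
            d += (px - qx) * Math.log(px / qx);
        }
        return d;
    }
}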

We use the DBSCAN algorithm like this: any two core points that are close enough, i.e. within the radius of one another, are put in the same cluster. Any border point that is close enough to a cluster is included in the same cluster as its nearby core point. If there are ties between two or more clusters, they are resolved by choosing the cluster that is closest. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label the points as core, border and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make a cluster out of each group of connected core points.
5. Assign each border point to one of the clusters of its associated core points.
For each point we find the points within its neighborhood, so the complexity of this algorithm is O(m^2). However, there are data structures that can reduce the cost of retrieving the points adjacent to a given center, bringing the total down to O(m log m). The space complexity is linear because we only persist a small amount of metadata for each point, namely its cluster label and its classification as a core, border or noise point.
The parameters for this algorithm are the radius and the minimum number of points in a cluster to distinguish the core from the border points.
By clustering the points, we can incrementally provide a user-defined number of keywords by taking the highest-valued clusters first and enumerating the keywords in each cluster. A sketch of the clustering steps follows.
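Here is a compact sketch of these steps in Java. It assumes the points, the radius (eps), the minimum neighborhood size (minPts) and a pluggable distance function (such as a KLD-based distance) are supplied; ties between clusters go to the first cluster reached, a simplification of choosing the closest:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class Dbscan
{
    public interface Distance { double between(double[] a, double[] b); }

    public static final int NOISE = -1;

    // Returns a cluster label for each point; points left labeled NOISE are discarded.
    public static int[] cluster(List<double[]> points, double eps, int minPts, Distance distance)
    {
        int[] label = new int[points.size()];
        Arrays.fill(label, NOISE);
        boolean[] core = new boolean[points.size()];

        // Step 1: label core points, i.e. those with at least minPts neighbors within eps
        for (int i = 0; i < points.size(); i++)
            core[i] = neighbors(points, i, eps, distance).size() >= minPts;

        // Steps 3 and 4: grow a cluster from each unlabeled core point (BFS over edges)
        int clusterId = 0;
        for (int i = 0; i < points.size(); i++)
        {
            if (!core[i] || label[i] != NOISE) continue;
            Deque<Integer> queue = new ArrayDeque<Integer>();
            queue.add(i);
            label[i] = clusterId;
            while (!queue.isEmpty())
            {
                int p = queue.poll();
                for (int q : neighbors(points, p, eps, distance))
                {
                    if (label[q] != NOISE) continue;
                    label[q] = clusterId;        // step 5: border points join the cluster
                    if (core[q]) queue.add(q);   // only core points extend the cluster
                }
            }
            clusterId++;
        }
        return label;   // step 2: noise points remain labeled NOISE
    }

    // The O(m^2) neighborhood scan; a spatial index would reduce this cost
    private static List<Integer> neighbors(List<double[]> points, int i, double eps, Distance distance)
    {
        List<Integer> result = new ArrayList<Integer>();
        for (int j = 0; j < points.size(); j++)
            if (j != i && distance.between(points.get(i), points.get(j)) <= eps)
                result.add(j);
        return result;
    }
}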
Selenium is a suite of tools for browser automation.
It requires two jar files: selenium-server.jar and selenium-java-client-driver.jar.
Maven can be used to include the required jar files, and we can then import the Maven project into an IDE such as Eclipse. To use Maven, we define a pom.xml file.
We define a page object that the test cases target. This page object changes only when the page changes, and many test cases can target the same page. The page object uses the WebDriver to test the page, and page objects can be written to extend a generic base page object.
The controls on the page are referred to with WebElement. These can take input, and this is how the tests drive the page. To look for elements on a given page, we can rely on the driver's findElement method, which recursively traverses the controls in the document tree to find the element. You can also specify an XPath to find the control. Elements can be found by ID, which is the most efficient and preferred way, or by class name, which is an attribute on the DOM element. Other means of finding elements include by name, by tag name (the DOM tag name of the element), by link text or partial link text, and by CSS selector or JavaScript.
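As an illustration, a minimal page object might look like the following; the page, the field names and the locators are made up for the example:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

// A hypothetical page object for a login page
public class LoginPage
{
    private final WebDriver driver;

    public LoginPage(WebDriver driver)
    {
        this.driver = driver;
    }

    public void loginAs(String user, String password)
    {
        // finding by id is the most efficient and preferred way
        WebElement userField = driver.findElement(By.id("username"));
        WebElement passwordField = driver.findElement(By.id("password"));
        userField.sendKeys(user);
        passwordField.sendKeys(password);
        // an XPath locator works when there is no usable id
        driver.findElement(By.xpath("//input[@type='submit']")).click();
    }
}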
Selenium-WebDriver makes direct calls to the browser using each browser's native support for automation. Unlike the older Selenium-RC, which injected JavaScript code into the page, WebDriver drives the browser directly through its built-in automation support.
The selenium server helps in the following ways:
1) It distributes the tests over multiple machines or virtual machines
2) It connects to a remote machine that has a particular browser version
3) It allows tests that are not using the Java bindings to use the HtmlUnitDriver.
As with all user interface automation, tests may need to wait a few seconds for the page or its controls to load, and this has often caused reliability issues for tests; explicit waits, as in the sketch below, are the usual remedy. As with most web pages, tests may also have to target international sites. Each browser maintains its own support for automation, so cross-browser compatibility is one of the primary concerns when writing tests. This is where WebDriver helps.
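A minimal sketch of an explicit wait, which retries until the control appears instead of sleeping for a fixed interval (the timeout is illustrative):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitHelper
{
    // Waits up to 10 seconds for the element to be present in the DOM
    public static WebElement waitFor(WebDriver driver, By locator)
    {
        WebDriverWait wait = new WebDriverWait(driver, 10);   // timeout in seconds
        return wait.until(ExpectedConditions.presenceOfElementLocated(locator));
    }
}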