Cluster computing

Extending NoSQL Databases with User Defined document types and key values

The structure of NoSQL databases and the operations they support make it easy to query against values. For example, if we have documents in XML, we can translate them to JSON and store them as documents or key-values in a NoSQL database. A query might then look like the following:

db.inventory.find( { type: "snacks" } )

This will return all the documents where the type field is “snacks”. The corresponding map-reduce function may look like this:

db.inventory.mapReduce(
    function() { emit(this.id, this.calories); },           // map
    function(key, values) { return Array.sum(values); },    // reduce
    {
        query: { type: "snacks" },                          // query
        out: "snack_calories"                               // output
    }
)
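To see what the query and the map-reduce call do without a running database, here is a minimal in-memory sketch in plain JavaScript. The sample inventory array and the mapReduce helper are hypothetical stand-ins, not the database API; they only mimic the filter, emit, group and reduce steps, and emit is passed in as an argument rather than being a global.

```javascript
// Hypothetical sample data standing in for the inventory collection.
const inventory = [
  { id: 1, type: "snacks", calories: 150 },
  { id: 1, type: "snacks", calories: 100 },
  { id: 2, type: "snacks", calories: 250 },
  { id: 3, type: "drinks", calories: 90 }
];

// Illustrative helper: filter by the query, map (emit), group by key, reduce.
function mapReduce(docs, map, reduce, options) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs
    .filter(doc => Object.keys(options.query).every(k => doc[k] === options.query[k]))
    .forEach(doc => map.call(doc, emit));
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;
}

const snackCalories = mapReduce(
  inventory,
  function (emit) { emit(this.id, this.calories); },    // map
  (key, values) => values.reduce((a, b) => a + b, 0),   // reduce
  { query: { type: "snacks" } }
);
console.log(snackCalories);
```

Only the documents whose type is "snacks" reach the map step, and the values emitted under the same id are summed by the reduce step.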

This works well for JSON data types and values. However, we are not restricted to the built-in types. We can extend the key-values with user-defined types and values, which are simply marked differently from the built-in types. When the mapper encounters such data, it loads the associated code to interpret the user-defined types and values. That code applies the same query operators, such as equality and comparison, that the mapper would have applied had the data been in native JSON format. This delegation of interpretation and execution allows NoSQL databases to be extended in forms such as computed keys and computed values.
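The delegation described here can be sketched roughly as follows. Everything in this sketch is an assumption made for illustration: the "$udt" marker, the interpreters registry and the "fraction" type are hypothetical and do not come from any particular database.

```javascript
// Hypothetical registry: type marker -> code that turns the stored form
// into a plain value the built-in operators can work with.
const interpreters = {
  fraction: v => v.num / v.den
};

// Built-in JSON values pass through unchanged; values carrying the
// (assumed) "$udt" marker are delegated to their interpreter.
function resolve(value) {
  if (value !== null && typeof value === "object" && "$udt" in value) {
    const interpret = interpreters[value.$udt];
    if (!interpret) throw new Error("no interpreter loaded for " + value.$udt);
    return interpret(value);
  }
  return value;
}

// The same equality and comparison operators apply to both kinds of value.
const operators = {
  eq: (a, b) => resolve(a) === resolve(b),
  gt: (a, b) => resolve(a) > resolve(b),
  lt: (a, b) => resolve(a) < resolve(b)
};

const docs = [
  { id: 1, share: 0.25 },
  { id: 2, share: { $udt: "fraction", num: 3, den: 4 } }
];
const matches = docs.filter(d => operators.gt(d.share, 0.5)).map(d => d.id);
console.log(matches);
```

The query code never needs to know which documents hold native values and which hold user-defined ones; only resolve does.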

Let us return to the snack example, but suppose the calories have to be computed from the ingredients.

In this case, the code would look like the following:

function (ingredients, calories) {
    // calories[index] holds the calories contributed by ingredients[index]
    var total_calories = 0;
    ingredients.forEach(function (ingredient, index) {
        total_calories += calories[index];
    });
    return total_calories;
}
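As a usage sketch, the function above can be given a name and applied to a document that stores only the inputs. The name computeCalories, the sample document and its field names are hypothetical, chosen only for illustration.

```javascript
// Hypothetical name for the function above; calories[i] is assumed to hold
// the calories contributed by ingredients[i].
function computeCalories(ingredients, calories) {
  let sum = 0;
  ingredients.forEach((ingredient, index) => { sum += calories[index]; });
  return sum;
}

// Hypothetical document: the calorie total is not stored, only its inputs.
const snack = {
  id: 7,
  type: "snacks",
  ingredients: ["oats", "honey", "almonds"],
  caloriesPerIngredient: [150, 60, 90]
};

// Computed value, evaluated when the document is read.
const total = computeCalories(snack.ingredients, snack.caloriesPerIngredient);
console.log(total);
```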

While this logic for computed key-values can be written outside the database as map-reduce jobs, it can also be stored in the database so that it stays as close as possible to the data it operates on.

Moreover, the logic can be expressed in different runtimes, and each runtime can be loaded and unloaded as needed to execute the logic.
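A rough sketch of logic that is kept beside the data and loaded or unloaded on demand might look like this. It uses a single JavaScript runtime as a stand-in for the several runtimes mentioned above, and the storedFunctions store, the load and unload helpers and the function name are all hypothetical.

```javascript
// Hypothetical store: logic kept as source text next to the data it serves.
const storedFunctions = {
  total_calories: "return calories.reduce(function (a, b) { return a + b; }, 0);"
};

// Illustrative cache of loaded logic: load compiles the source on first use,
// unload drops the compiled function again.
const loaded = new Map();

function load(name) {
  if (!loaded.has(name)) {
    loaded.set(name, new Function("ingredients", "calories", storedFunctions[name]));
  }
  return loaded.get(name);
}

function unload(name) {
  return loaded.delete(name);
}

const fn = load("total_calories");
const result = fn(["oats", "honey"], [150, 60]);
unload("total_calories");
console.log(result, loaded.size);
```

Compiling stored source text is only safe when that text is trusted; a real system would need to sandbox or validate it.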

One advantage of having a schema for some of the data is that it allows structured queries to be used seamlessly on that specific data. As an example, we can even store XML data itself, since XPath queries can be run on it. Although an XML parsing runtime has to be loaded for this data, it behaves the same as any other data type in the overall map-reduce.

Another example of a user-defined data type is the tuple. Tuples are easy to understand both in terms of representation and search. Let us use an example here:

We have a tuple called ‘Alias’ for data about a person. This tuple consists of (known_alias, use_always, alternate_alias). The first part is text, the second is a Boolean, and the third is a map<text, text>.
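The shape of this tuple can be written down as a small checker. This is only an illustrative sketch: the field order and types follow the description above, while aliasTuple, isValidTuple and the sample values are hypothetical.

```javascript
// Field names and types as described in the text; each entry pairs a field
// name with a predicate that checks the type of the value in that position.
const aliasTuple = [
  ["known_alias", v => typeof v === "string"],
  ["use_always", v => typeof v === "boolean"],
  ["alternate_alias", v =>
    v !== null && typeof v === "object" && !Array.isArray(v) &&
    Object.entries(v).every(([k, x]) => typeof k === "string" && typeof x === "string")]
];

// A tuple is valid when it has the right number of parts and every part
// passes the check for its position.
function isValidTuple(definition, values) {
  return values.length === definition.length &&
    definition.every(([, check], i) => check(values[i]));
}

const ok = isValidTuple(aliasTuple, ["Bereng", true, { work: "B. Blasi" }]);
const bad = isValidTuple(aliasTuple, ["Bereng", "yes", {}]);
console.log(ok, bad);
```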

The person data consists of id, name, friends and status.

We could still insert person data using JSON as follows:

[{"id":"1", "name":{"firstname":"Berenguer", "surname":"Blasi", "alias_data":{"know_alias":"Bereng", "use_alias_always":true}}, "friends":[{"firstname":"Sergio", "surname":"Bossa"}, {"firstname":"Maciej", "surname":"Zasada"}]}]

However, when we search, we can refer to the fields of the type explicitly, just as natively as the fields of the JSON.

The standard query operators where, select, join, intersect, distinct, contains, and SequenceEqual can be applied to these tuples.

The reason tuples become easier to understand is that each field can be qualified with dot notation, so the entire value can be exploded into its individual fields, such as alias.use_alias_always.

The above example is taken from DataStax, and it serves to highlight the seamless integration of tuples or user-defined types (UDTs) in NoSQL databases.

In addition, tuples/UDTs are read and written as one single block, not on a per-field basis. Tuples/UDTs can also participate in a map-like data model, although they are not exactly map values. For example, a collection of tuples/UDTs has a type field that represents what would have been the map key. We just have to declare a UDT that includes the tuples, and for this UDT we specify the search the same way as in a map-like data model, but using dot notation for the fields. For example, we can search with {!UDT}alias.type:Bereng AND alias.use_alias_always:True
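The idea of exploding a nested value into dot-notation fields and then matching an AND of field conditions can be sketched in plain JavaScript. This does not parse the {!UDT} query syntax shown above; explode, matchesAll and the sample person record are hypothetical helpers used only to illustrate the field addressing.

```javascript
// Explode a nested value into dot-notation fields, so that
// { alias: { type: "x" } } becomes { "alias.type": "x" }.
function explode(value, prefix = "", out = {}) {
  for (const [key, field] of Object.entries(value)) {
    const path = prefix ? prefix + "." + key : key;
    if (field !== null && typeof field === "object" && !Array.isArray(field)) {
      explode(field, path, out);
    } else {
      out[path] = field;
    }
  }
  return out;
}

// Illustrative matcher: every "path: value" pair must match (an AND query).
function matchesAll(doc, conditions) {
  const fields = explode(doc);
  return Object.entries(conditions)
    .every(([path, expected]) => fields[path] === expected);
}

// Hypothetical person record with the alias stored as one nested value.
const person = {
  id: "1",
  alias: { type: "Bereng", use_alias_always: true }
};

const exploded = explode(person);
const found = matchesAll(person, {
  "alias.type": "Bereng",
  "alias.use_alias_always": true
});
console.log(exploded, found);
```

The nested value is still stored and read as one block; the dot-notation paths exist only at query time.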

#coding question
Determine the maximum gradient in a sequence of sorted numbers
int GetMaxGradient(int[] numbers)
{
    // Largest absolute difference between adjacent elements; 0 if fewer than two.
    int max = 0;
    for (int i = 1; i < numbers.length; i++) {
        int grad = Math.abs(numbers[i] - numbers[i - 1]);
        if (grad > max) max = grad;
    }
    return max;
}

Friday, November 13, 2015
