Cluster computing

Sunday, February 21, 2016

National databases: SQL or NoSQL, a comparison
National databases have to integrate data from a variety of sources, most of them regional in nature. What are some of the challenges, and how are they overcome? We take a look in this study. We will also explore what could serve as an integrator for the various upstream databases, and compare SQL and NoSQL solutions in this regard.
Integration:
Regional databases evolve differently from the national database. They are the primary source of truth for their region. They need not follow the same conventions, syntax and semantics as the national one.
It’s also likely that they hold overlapping information about the same entity as that entity moves from region to region. Consider the name-address pair as an example: it changes for individuals when they move out of state. Fortunately, such information has a schema and the data is often indexed.
Consequently, data with a schema is extracted, transformed and loaded into central databases, with or without code. As the data flows through this process, there is no straightforward way of knowing its origin or its final destination completely, so staging tables are used to consolidate, scrub and translate the data. Periodic workflows are required to add data as and when differences arise. This is a fairly complicated process because changes in the data affect the process itself.
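To make the staging step concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (staging, central), the columns and the scrubbing rules are assumptions chosen for illustration, not something a real national database prescribes.

```python
import sqlite3


def load_via_staging(regional_rows):
    """Consolidate raw regional rows in a staging table, scrub them, then load.

    regional_rows: iterable of (region, name, address) tuples (hypothetical shape).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (region TEXT, name TEXT, address TEXT)")
    conn.execute(
        "CREATE TABLE central ("
        " name TEXT, address TEXT, region TEXT,"
        " UNIQUE (name, address))"
    )

    # Extract: land the raw rows in staging without touching them.
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", regional_rows)

    # Transform: scrub whitespace and drop rows that have no usable name.
    conn.execute("UPDATE staging SET name = TRIM(name), address = TRIM(address)")
    conn.execute("DELETE FROM staging WHERE name IS NULL OR name = ''")

    # Load: copy into the central table; the UNIQUE constraint skips duplicates.
    conn.execute(
        "INSERT OR IGNORE INTO central (name, address, region)"
        " SELECT name, address, region FROM staging ORDER BY rowid"
    )
    conn.commit()
    rows = conn.execute(
        "SELECT name, address, region FROM central ORDER BY name, address"
    ).fetchall()
    conn.close()
    return rows
```

A periodic workflow would rerun the same three steps against whatever rows arrived since the last run.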
If we look at the interface for plugging in the data sources of various regions, it resembles the standard query operators of the C# LINQ pattern. Essentially, though, we just need an iterator that fetches one record at a time and carries state from one record to the next. Some of the attributes may be different, unrecognizable or even missing, but they can be filled into a large sparse table whose columns are mapped to the attributes. A connection string, including the credentials, is required for each data source. This sparse table may not be the final destination table, but it can be used to feed the final destination.
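The iterator and sparse-table idea can be sketched in Python as follows. The region names, the COLUMN_MAP mapping and the column list are hypothetical, and in-memory lists stand in for sources that would really be opened with a connection string.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Hypothetical national columns and per-region attribute mappings.
NATIONAL_COLUMNS = ["name", "address", "phone"]
COLUMN_MAP = {
    "region_a": {"full_name": "name", "addr": "address"},
    "region_b": {"name": "name", "street_address": "address", "phone": "phone"},
}


def read_region(region: str, records: Iterable[dict]) -> Iterator[Tuple[str, int, dict]]:
    """Yield one record at a time, carrying its position as iterator state."""
    for position, record in enumerate(records):
        yield region, position, record


def to_sparse_row(region: str, position: int, record: dict) -> Dict[str, Optional[object]]:
    """Map a regional record onto the national columns; missing cells stay None."""
    row: Dict[str, Optional[object]] = {column: None for column in NATIONAL_COLUMNS}
    row["source_region"] = region
    row["source_position"] = position
    for attribute, value in record.items():
        column = COLUMN_MAP[region].get(attribute)
        if column is not None:  # unrecognized attributes are skipped
            row[column] = value
    return row


def integrate(sources: Dict[str, List[dict]]) -> Iterator[Dict[str, Optional[object]]]:
    """Walk every source in turn and emit rows for the sparse table."""
    for region, records in sources.items():
        for region_name, position, record in read_region(region, records):
            yield to_sparse_row(region_name, position, record)
```

Because integrate is a generator, rows can be streamed into the destination without holding a whole region in memory.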
On the other hand, NoSQL databases enable a homogeneous document model across heterogeneous data. Here each regional database can hold any kind of information. Since the records are documents or key-value pairs, they are easily combined or collected together. Searches over this fairly large and possibly redundant data are sped up with parallelization techniques such as the map-reduce algorithm.
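As a rough illustration of map-reduce over heterogeneous documents, the sketch below collects the distinct addresses seen for each name. The document fields are assumptions, and a thread pool on one machine stands in for the cluster that a real NoSQL store would spread the map step across.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_document(document):
    """Map step: emit (key, value) pairs; documents without a name emit nothing."""
    name = document.get("name")
    if name is None:
        return []
    return [(name, document.get("address"))]


def reduce_addresses(name, addresses):
    """Reduce step: collapse redundant values for one key into a sorted unique list."""
    return name, sorted({address for address in addresses if address is not None})


def map_reduce(documents, workers=4):
    # Run the map step in parallel over the documents.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_document, documents))

    # Shuffle: group the emitted values by key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    # Reduce each group independently.
    return dict(reduce_addresses(key, values) for key, values in grouped.items())
```

Documents need not share a shape: the map step simply ignores fields it does not know about.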
Considerations for NoSQL databases include, among others:
Whether it can search by key, by cell or for values in column families.
Whether searches can be limited to key or column ranges.
Whether it retains the last N historical values.
Whether it supports bit operations, set operations, list operations, hashes and sorted sets.
Whether it is scriptable.
Whether it supports transactions and publisher-subscriber messaging.
Whether the dataset size can be predicted.
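Two of these considerations, key-range search and retaining the last N historical values, can be pictured with a small in-memory sketch. The VersionedStore class below is hypothetical and only mimics the behavior; it is not the API of any particular NoSQL product.

```python
from bisect import bisect_left, insort
from collections import deque


class VersionedStore:
    """Toy key-value store that keeps the last N values per key, with sorted keys."""

    def __init__(self, max_versions=3):
        self._max_versions = max_versions
        self._keys = []    # keys kept in sorted order for range scans
        self._cells = {}   # key -> deque of values, oldest first

    def put(self, key, value):
        if key not in self._cells:
            insort(self._keys, key)
            # deque(maxlen=N) silently drops the oldest value once full.
            self._cells[key] = deque(maxlen=self._max_versions)
        self._cells[key].append(value)

    def get(self, key):
        versions = self._cells.get(key)
        return versions[-1] if versions else None

    def history(self, key):
        return list(self._cells.get(key, ()))

    def scan(self, start, stop):
        """Yield (key, latest value) for every key with start <= key < stop."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._cells[key][-1]
```

Prefixing keys with the region, for example "ca:ann", lets a range scan pull one region or a contiguous run of regions.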

The operations on public databases are mostly read-only. Therefore transaction semantics for create-update-delete operations and isolation levels are not as important as retrieving some or all records. Consequently, a NoSQL option combined with parallelization techniques is favored.
#merlin qa
In the previous code exercise, we showed an approach based on permutations that covers all possible orderings of spells before taking the one that yields the maximum materials. I called this the naive approach. An alternative would be to assign a score to each spell and then order the spells by increasing score. The score can be based on the maximum material produced, if any, or zero, or on a combination of the materials produced.
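A minimal sketch of the score-based ordering, assuming each spell is described by a name and a mapping from material to amount produced. That representation is hypothetical, and the sketch covers only the ordering step, not the rest of the exercise.

```python
from typing import Dict, List, NamedTuple


class Spell(NamedTuple):
    name: str
    produced: Dict[str, int]  # hypothetical: material name -> amount produced


def score(spell: Spell) -> int:
    """Score a spell by the maximum material it produces, or zero if it produces none."""
    return max(spell.produced.values(), default=0)


def order_by_score(spells: List[Spell]) -> List[Spell]:
    """Return the spells in increasing order of score (ties keep their input order)."""
    return sorted(spells, key=score)
```

Sorting costs O(n log n), against the n! orderings the permutation approach has to examine. Swapping in a different score function, such as the sum of the materials produced, only changes the key.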
