Thursday, November 5, 2015

Today we continue the discussion on NoSQL databases. HBase is written in Java, can store billions of rows times millions of columns, and is modeled after Google's BigTable. It has an Apache license. Like CouchDB, it also supports HTTP/REST. HBase is built on a framework of data storage and data processing services. The former is HDFS, short for Hadoop Distributed File System. The latter is MapReduce, a high-performance parallel data processing technique. A data warehouse named Hive and a scripting platform named Pig are also available over this framework; together these form the Hadoop ecosystem. HBase stores key-values as columns in a column family, and each row can have more than one column family. Rows need not have the same number of columns. Query predicates can be pushed down via server-side scan and get filters, and the optimizations favor real-time queries. Hadoop scales horizontally in that additional commodity hardware can be added without interruption, and node failures are compensated for with redundant data. HBase does not guarantee full ACID properties and supports forward-only scans. It supports rolling restarts for configuration changes and minor upgrades. Its random-access performance is comparable to MySQL. It is commonly used by search engines and for analyzing log data; wherever a BigTable-style store is required, HBase can be used.
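As a rough sketch of that wide-column data model (a toy illustration only, not the real HBase client API; the class and method names here are made up), a table maps a row key to column families, and each family maps column qualifiers to values, so different rows can carry different families and different column counts:

using System.Collections.Generic;

// Toy model of the layout only: rowKey -> columnFamily -> columnQualifier -> value.
class ToyWideColumnTable
{
   private readonly Dictionary<string, Dictionary<string, Dictionary<string, byte[]>>> rows =
      new Dictionary<string, Dictionary<string, Dictionary<string, byte[]>>>();

   public void Put(string rowKey, string family, string qualifier, byte[] value)
   {
      if (!rows.ContainsKey(rowKey))
         rows[rowKey] = new Dictionary<string, Dictionary<string, byte[]>>();
      if (!rows[rowKey].ContainsKey(family))
         rows[rowKey][family] = new Dictionary<string, byte[]>();
      rows[rowKey][family][qualifier] = value;
   }

   public byte[] Get(string rowKey, string family, string qualifier)
   {
      Dictionary<string, Dictionary<string, byte[]>> families;
      Dictionary<string, byte[]> columns;
      byte[] value;
      if (rows.TryGetValue(rowKey, out families) &&
          families.TryGetValue(family, out columns) &&
          columns.TryGetValue(qualifier, out value))
         return value;
      return null; // row, family, or column absent
   }
}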
Hypertable is another NoSQL database, essentially a faster, smaller HBase. It is written in C++, has a GPL license, and is sponsored by Baidu. Its protocol is Thrift, a C++ library, or the HQL shell. It implements Google's BigTable design and runs on Hadoop's HDFS. It uses its own SQL-like language, which it calls HQL. It can search by key, by cell, or for values in column families, and searches can be limited to key/column ranges. It retains the last N historical values. Tables appear in namespaces, while map/reduce continues to be handled by Hadoop.
#interviewquestion
Given an array of integers, find the minimum number of elements to remove from the array so that the maximum element of the new array is at most twice the minimum. An O(n log n) solution was required.
It helps to sort the elements first; then elements can be removed from the front and back until the remaining subarray meets the two criteria.

The two criteria are the range of the kept subarray (its maximum must be at most twice its minimum) and the count of items removed, which we want to minimize.
The elements may increase sharply or gradually at either end, so we must consider removing elements from both ends. We also cannot stop at the first subarray that satisfies the range condition, because another subarray might remove fewer elements.
For example, if we have 2 3 4 5 7 9 10 11 and we take 2, 3, 4 and 11 out, we are left with 5 7 9 10, and four is the minimum number of elements we need to remove.
In the sorted array, we consider each element in turn as the minimum of the kept subarray and take its double as the upper bound on the maximum. We then count the elements to its right that exceed that double. Since the elements are sorted, this lookup is a binary search costing O(log n), although a linear scan would also do in this case.
For each element considered this way, the removal count is the number of elements before it (removed from the front) plus the number of elements beyond its double (removed from the back); we keep the smallest such count over all candidates.
int getMinCountRemoval(int[] num)
{
   // Assumes positive integers, as in the walkthrough example above.
   if (num == null || num.Length == 0)
      throw new ArgumentException("array must be non-empty");
   Array.Sort(num);
   int n = num.Length;
   int minCount = n - 1; // worst case: keep a single element

   // Try each element as the minimum of the kept subarray: everything
   // before index i is removed from the front, and everything greater
   // than 2*num[i] is removed from the back.
   for (int i = 0; i < n; i++)
   {
      long rangeEnd = 2L * num[i];
      int index = findFirstGreater(num, i, rangeEnd); // first index with num[index] > rangeEnd
      int count = i + (n - index);
      if (count < minCount)
         minCount = count;
      if (index == n)
         break; // all of num[i..n-1] already fits; larger i only removes more
   }
   return minCount;
}

// Binary search: smallest index in [lo, num.Length) whose value exceeds target.
int findFirstGreater(int[] num, int lo, long target)
{
   int hi = num.Length; // exclusive upper bound
   while (lo < hi)
   {
      int mid = lo + (hi - lo) / 2;
      if (num[mid] > target)
         hi = mid;
      else
         lo = mid + 1;
   }
   return lo;
}
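As a sanity check, calling getMinCountRemoval(new[] { 2, 3, 4, 5, 7, 9, 10, 11 }) on the walkthrough array returns 4, matching the 2, 3, 4 and 11 we removed by hand.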

We continue our discussion on NoSQL databases:
Redis 3.0 is blazingly fast because its dataset lives in memory and is disk-backed. The dataset is therefore limited by RAM, although the RAM is not limited to one machine: a cluster pools the memory of its nodes. Redis stores simple values or data structures that are looked up by key, and it supports bit operations, set operations, list operations, hashes, and sorted sets. It even has Lua scriptability, and it supports transactions and publisher-subscriber messaging. It is useful wherever the dataset size can be predicted. It has a BSD license.
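As a quick illustration of the key-value and sorted-set operations (a minimal sketch, assuming the StackExchange.Redis client and a Redis server on localhost at the default port; the key and member names are made up):

using System;
using StackExchange.Redis;

class RedisSketch
{
   static void Main()
   {
      // Assumes a Redis server listening on localhost:6379.
      var redis = ConnectionMultiplexer.Connect("localhost:6379");
      var db = redis.GetDatabase();

      // Simple value looked up by key.
      db.StringSet("greeting", "hello");
      Console.WriteLine(db.StringGet("greeting")); // hello

      // Sorted set: members are kept ordered by score.
      db.SortedSetAdd("scores", "bob", 17);
      db.SortedSetAdd("scores", "alice", 42);
      foreach (var entry in db.SortedSetRangeByScoreWithScores("scores"))
         Console.WriteLine(entry.Element + ": " + entry.Score); // bob: 17, alice: 42
   }
}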


From the discussion of the above NoSQL databases, it looks like ElasticSearch is the one that provides the most search capabilities. However, none provides the kind of clustered search that, say, clusty.com provides. Incorporating clustering into the search techniques could improve the versatility of the search these databases provide.
