Thursday, May 31, 2018

We mentioned BigTable and BigQuery in earlier posts where we discussed the purpose of each as forms of storage and processing as well as low-latency and analytical. In this regard there are a few considerations to BigQuery that are not quite obvious:
1)  BigQuery uses columnar data storage to perform analytics over its data. But data does not need to reside in the BigQuery if performance is not a limitation.  External data sources such as log files in cloud storage and records in BigTable can be directly queried. With or without ETL jobs and support for streaming data, some of these analysis can be done. against external data sources.

2) BigQuery is used not just for slower analytical processing. Even near real-time analysis can be achieved by streaming data into BigQuery.

3) BigQuery does not use SQL only. It can support user-defined functions via Javascript.

4) Query quotas and pricing may affect the choice between the interactive vs batch but they can both be used in an assortment for say a dashboard.

5) Queries can be prioritized so that adhoc queries and prepared queries can be run separately and with performance in favor of the latter.

6) BigQuery can work based on a snapshot of the data from a point in time.

7) Since the queries serve organizational needs, they can be arranged as per the requirements of the organization and this means they reflect the well known structure represented from organizations.

8) Since SQL is available for query language, we can use joins and normalized schemas increasing the possibilities of the query

9) Similarly data can be secured with fine grained access control such as at row and column level.

10) we can monitor and audit the usage of these queries so we always know the usages.
#codingexercise
Partition an array into two contiguous subsequences such that there need to be minimum value added to both sums to make them equal 
int GetPartition(List<int> A) 
{ 
    int n = A.Count; 
    var prefixes = new int[n]; 
    var suffixes = new int[n]; 
  
    prefixes[0] = A[0]; 
    for (int i = 1; i < n; i++) { 
        prefixes[i] = prefixes[i - 1] + A[i]; 
    } 
  
    suffixes[n - 1] = A[n - 1]; 
    for (int i = n - 2; i >= 0; i--) { 
        suffixes[i] = suffixes[i + 1] + A[i]; 
    } 
     Int min = suffixes[0]; 
    int index = 0; 
  
    for (int i = 0; i < n - 1; i++) { 
        if (Math.Abs(suffixes[i + 1] - prefixes[i]) < min) { 
           min = Math.Abs(suffixes[i + 1] - prefixes[i]) ; 
           if (suffixes[i + 1] < prefixes[i]) { 
               index = i + 1; 
            } else { 
               index = i; 
            } 
        } 
    } 
    return index; 
} 
Alternatively, for every position candidate in the array, we can compare the sum before and after  
Int before = 0; 
before  = before + A[I]; 
int after = A.Sum() - before; 
#codingexercise
In a rowwise and columnwise sorted matrix of distinct increasing integers, find if a given integer exists. 
Tuple<int,intGetPositionRowWise(int[,] A, int Rows, int Cols, int X) 
{ 
var ret = new Tuple<intint>(){-1,-1}; 
// x, y coordinates 
for (int i =0; i < Rows; i++) 
{ 
int index = binary_search(A, cols, i, X); 
if ( index != -1) 
{ 
ret.first = i; 
ret.second = index; 
break; 
} 
} 
return ret; 
} 

Wednesday, May 30, 2018

We compared BigTable, BigQuery, Cloud Datastore and Firebase in previous post. Big Table is used for Analytics, Maps and GMail. BigQuery is a data warehouse but the myth that you need to get the data into the BigQuery before performing analytics is not always true. If your data is stored in the BigTable or Cloud Storage, it can still be analyzed via BigQuery.
BigQuery uses columnar data storage to perform analytics over its data. But data does not need to reside in the BigQuery if performance is not a limitation.  External data sources such as log files in cloud storage and records in BigTable can be directly queried. With or without ETL jobs and support for streaming data, some of these analysis can be done. against external data sources.
Moreover it is not necessary that BigQuery is used only for slower analytical processing. Even near real-time analysis can be achieved by streaming data into BigQuery.
Furthermore, BigQuery does not use SQL only. It can support user-defined functions via Javascript. The functions take columns as inputs and return a result.
Together with the help of streaming analysis and user defined batch operations, a  wide variety of processing is enabled. Query quotas and pricing may affect the choice between the interactive vs batch but they can be used in an assortment for say a dashboard.

Tuesday, May 29, 2018

We compared BigTable and BigQuery in previous post. Big Table is a NoSQL database where the latency for data access is kept low even in the face of petabytes of data and millions of operations per second. Data is retrieved using scan operations. It is read and written under 10 milliseconds. The limits for best practice include 4KB per key for data keys, about 100 families per table, 16KB qualifier per column, 10MB per cell, and 100MB for all values in a row. BigTable is known to power Analytics, Maps and GMail.
BigQuery is a data warehouse that can store terabytes of data and allows queries to be written in SQL. It can power a wide variety of functionalities for an analytics dashboard. It supports relational database model as primary and  key-value store as secondary and with append-only tables. It can query large amounts of data for analysis in less time but requires more time to query small specific transactional data. Query execution time can be in the order of seconds.  Big query has two forms of costs - storage cost and query cost.
We did not talk about unstructured data. Generally such data is stored for GCP in Cloud Storage. Only if you need a mobile SDK, you would use  Firebase storage Similarly for structured data that is not relational and requires a Mobile SDK, you would use a Firebase Realtime DB from GCP.
Firebase is a comprehensive mobile development platform. It has all the monitoring needed for app analytics and performance. Mobile backend code can be run without managing servers as Cloud Functions and the app data can be stored and synced at global scale.

Monday, May 28, 2018

We were discussing the use of a table and standard query operators to allowing developers to expose resources for the users to query themselves. They can pass the filter criteria directly in the url query parameters via one or more of the well known url patterns. The use of a table means we can also add rows and columns to increase data set and attributes respectively. We could make the resources even more specific by having more than one column in a composite key.
The use of a cloud database to store the table only improves its appeal because the database service becomes managed while being available from all geographical regions with high availability for large dataset. The only  focus for the developers that remains in this case, is the application optimization.
The notions of a Big Table in a multi-model database and Big Query are also evidence to the popularity of this simpler paradigm. Let us take a moment to review these concepts.
Big Table is a Google offering for the last decade and more and is a NoSQL database where the latency for data access is kept low even in the face of petabytes of data and millions of operations per second. Data is retrieved using scan operations. It is read and written under 10 milliseconds. The limits for best practice include 4KB per key for data keys, about 100 families per table, 16KB qualifier per column, 10MB per cell, and 100MB for all values in a row. BigTable is known to power Analytics, Maps and GMail.
BigQuery is Google's data warehouse offering and is less than a decade old. It can store terabytes of data and allows queries to be written in SQL. It can power a wide variety of functionalities for an analytics dashboard. It supports relational database model as primary and  key-value store as secondary and with append-only tables. It can query large amounts of data for analysis in less time but requires more time to query small specific transactional data. Query execution time can be in the order of seconds.  Big query has two forms of costs - storage cost and query cost.
The choice of storage could be driven by the following rules:
if (your_data_is_structured &&
     your_workload_is_analytics &&
     you_need_updates_or_low_latency) use BigTable;
if (your_data_is_structured &&
     your_workload_is_analytics &&
     you_do_not_need_updates_or_low_latency) use BigQuery;
if (your_data_is_structured &&
     your_workload_is_not_analytics &&
     your_data_is_relational &&
     you_need_horizontal_scalability) use Cloud Spanner;
if (your_data_is_structured &&
     your_workload_is_not_analytics &&
     your_data_is_relational &&
     you_do_not_need_horizontal_scalability) use Cloud SQL;