Wednesday, May 30, 2018

We compared BigTable, BigQuery, Cloud Datastore and Firebase in a previous post. BigTable powers Analytics, Maps and GMail. BigQuery is a data warehouse, but it is a myth that data must first be loaded into BigQuery before any analytics can be performed. If your data is stored in BigTable or Cloud Storage, it can still be analyzed via BigQuery.
BigQuery uses columnar storage to perform analytics over its data. But the data does not need to reside in BigQuery if performance is not a limitation. External data sources such as log files in Cloud Storage and records in BigTable can be queried directly. With or without ETL jobs, and with support for streaming data, much of this analysis can be done against external data sources.
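For example, log files can be queried in place with the google-cloud-bigquery Python client. The following is a minimal sketch, where the bucket path, table alias and query are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# Describe CSV log files sitting in Cloud Storage as an external table
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/logs/*.csv"]  # hypothetical bucket
external_config.autodetect = True  # let BigQuery infer the schema

job_config = bigquery.QueryJobConfig(table_definitions={"logs": external_config})

# The query runs against the files in place; no load job is required
query = "SELECT COUNT(*) AS requests FROM logs"
for row in client.query(query, job_config=job_config).result():
    print(row.requests)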
Moreover, BigQuery is not restricted to slower, batch-style analytical processing. Even near real-time analysis can be achieved by streaming data into BigQuery.
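A minimal streaming sketch with the same Python client; the table id and rows are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # hypothetical table

rows = [
    {"user": "alice", "action": "click", "ts": "2018-05-30T10:00:00"},
    {"user": "bob", "action": "view", "ts": "2018-05-30T10:00:01"},
]

# Streamed rows become visible to queries within seconds
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("insert failed:", errors)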
Furthermore, BigQuery is not limited to SQL alone. It supports user-defined functions written in JavaScript. These functions take columns as inputs and return a result.
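A sketch of such a JavaScript UDF in standard SQL, issued through the Python client; the function and table are illustrative:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE TEMP FUNCTION domainOf(url STRING)
RETURNS STRING
LANGUAGE js AS '''
  return url.split("/")[2];
''';
SELECT domainOf(page_url) AS domain, COUNT(*) AS hits
FROM `my_dataset.page_views`  -- hypothetical table
GROUP BY domain
ORDER BY hits DESC
"""
for row in client.query(sql).result():
    print(row.domain, row.hits)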
Together, streaming analysis and user-defined batch operations enable a wide variety of processing. Query quotas and pricing may affect the choice between interactive and batch modes, but the two can be combined, say for a dashboard.

Tuesday, May 29, 2018

We compared BigTable and BigQuery in a previous post. BigTable is a NoSQL database where the latency for data access is kept low even in the face of petabytes of data and millions of operations per second. Data is retrieved using scan operations, and reads and writes complete in under 10 milliseconds. The best-practice limits include 4 KB per row key, about 100 column families per table, 16 KB per column qualifier, 10 MB per cell, and 100 MB for all values in a row. BigTable is known to power Analytics, Maps and GMail.
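A minimal read/write sketch with the google-cloud-bigtable Python client; the project, instance, table and column family names are hypothetical:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("user-events")

# Write one cell; the row key stays well under the 4 KB best-practice limit
row = table.direct_row(b"user#20180529#alice")
row.set_cell("events", b"action", b"click")
row.commit()

# Point read by key, typically well under 10 milliseconds
print(table.read_row(b"user#20180529#alice"))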
BigQuery is a data warehouse that can store terabytes of data and allows queries to be written in SQL. It can power a wide variety of functionalities for an analytics dashboard. It supports a relational model primarily and a key-value store secondarily, with append-only tables. It can query large amounts of data for analysis in less time than traditional systems, but takes longer on small, specific transactional lookups. Query execution time is on the order of seconds. BigQuery has two forms of cost: storage cost and query cost.
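The query cost can be estimated before any bytes are billed; a sketch using a dry run, with an illustrative table name:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT COUNT(*) FROM `my_dataset.events`", job_config=job_config)

# A dry run processes nothing; the byte estimate drives the query cost
print("bytes that would be processed:", job.total_bytes_processed)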
We did not talk about unstructured data. On GCP, such data is generally stored in Cloud Storage. Only if you need a mobile SDK would you use Firebase Storage. Similarly, for structured data that is not relational and requires a mobile SDK, you would use the Firebase Realtime Database on GCP.
Firebase is a comprehensive mobile development platform. It has all the monitoring needed for app analytics and performance. Mobile backend code can be run as Cloud Functions without managing servers, and the app data can be stored and synced at global scale.

Monday, May 28, 2018

We were discussing the use of a table and standard query operators to allow developers to expose resources that users can query themselves. Users can pass the filter criteria directly in the URL query parameters via one or more well-known URL patterns. The use of a table means we can also add rows and columns to grow the data set and its attributes respectively. We could make the resources even more specific by having more than one column in a composite key.
The use of a cloud database to store the table only improves its appeal, because the database service becomes managed while being available from all geographical regions with high availability for large datasets. The only focus that remains for the developers in this case is application optimization.
The notions of BigTable in a multi-model database and BigQuery are also evidence of the popularity of this simpler paradigm. Let us take a moment to review these concepts.
BigTable has been a Google offering for more than a decade and is a NoSQL database where the latency for data access is kept low even in the face of petabytes of data and millions of operations per second. Data is retrieved using scan operations, and reads and writes complete in under 10 milliseconds. The best-practice limits include 4 KB per row key, about 100 column families per table, 16 KB per column qualifier, 10 MB per cell, and 100 MB for all values in a row. BigTable is known to power Analytics, Maps and GMail.
BigQuery is Google's data warehouse offering and is less than a decade old. It can store terabytes of data and allows queries to be written in SQL. It can power a wide variety of functionalities for an analytics dashboard. It supports a relational model primarily and a key-value store secondarily, with append-only tables. It can query large amounts of data for analysis in less time than traditional systems, but takes longer on small, specific transactional lookups. Query execution time is on the order of seconds. BigQuery has two forms of cost: storage cost and query cost.
The choice of storage could be driven by the following rules:
if (your_data_is_structured &&
     your_workload_is_analytics &&
     you_need_updates_or_low_latency) use BigTable;
if (your_data_is_structured &&
     your_workload_is_analytics &&
     you_do_not_need_updates_or_low_latency) use BigQuery;
if (your_data_is_structured &&
     your_workload_is_not_analytics &&
     your_data_is_relational &&
     you_need_horizontal_scalability) use Cloud Spanner;
if (your_data_is_structured &&
     your_workload_is_not_analytics &&
     your_data_is_relational &&
     you_do_not_need_horizontal_scalability) use Cloud SQL;

Sunday, May 27, 2018

We were discussing the use of a table and standard query operators to allow users to look up resources over web APIs, which can be invoked from any device or application. In fact, any other way to perform lookups of resources over the web may be rather limiting.
Query operations are invoked over the web with the help of web application programming interfaces, or web APIs for short. They conform to one or more of the following patterns for query operations:
GET /resources?offset=0&limit=100 retrieves the first hundred records.
GET /resources?field=field1,field2&ordering=field1 retrieves just those fields, ordered by field1
GET /resources?sort=-created_at retrieves the most recent resources added pertaining to the caller
GET /resources?state=closed retrieves those resources that are no longer available
GET /resources?latitude=&longitude=&radius=
GET /resources?id={"lt": 100, "gt": 30}
GET /resources?ql=location within 500 of 40.042016, -86.900749
GET /resources?ql=select * where name = 'Fred'&limit=50 (Courtesy: usergrid.apache.org)
The purpose of using URL patterns like the above is that they standardize the query operations and format for any dataset.
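As a minimal sketch of how a service might honor a few of these patterns, assuming Flask and an illustrative resource shape:

from flask import Flask, request, jsonify

app = Flask(__name__)
RESOURCES = [{"id": i, "state": "open" if i % 2 else "closed"} for i in range(500)]

@app.route("/resources")
def list_resources():
    items = RESOURCES
    # ?state=closed filters on an attribute
    state = request.args.get("state")
    if state:
        items = [r for r in items if r["state"] == state]
    # ?sort=-field orders by a field, descending with a leading '-'
    sort = request.args.get("sort")
    if sort:
        key = sort.lstrip("-")
        items = sorted(items, key=lambda r: r[key], reverse=sort.startswith("-"))
    # ?offset=0&limit=100 pages through the result
    offset = int(request.args.get("offset", 0))
    limit = int(request.args.get("limit", 100))
    return jsonify(items[offset:offset + limit])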
If we had resources described by a composite key with more than one attribute, we could make it even more granular by adding more columns to the key. This will not affect any of the existing data and will make the query operations even more specific. The convenience of a table for adding query features is already well-known. In addition, the standard created and modified timestamps and a status column might also help with passing the querying logic to the client, so that they can construct any of the URL patterns. If the clients are internal, they can even specify the entire logic in a SQL statement against the resource.
Tables are also stored in databases, where index-based lookups are very efficient. Furthermore, there is no limit to the number of records a table may have. However, it might be prudent to archive older unused records to keep the table small.

Saturday, May 26, 2018


With my experience building web applications and services for many companies, I have found that one of the easiest and most appealing engagements is to connect customers to their resources via the web. Most organizations find that they have a data source and a whole lot of customers who want to look up their data in one or more ways.
The simplicity of using a table to list resources and allowing customers to query their own data however they like is not only universally appealing but also very convenient. A mark of a mature solution in this problem space, ubiquitous across organizations, is the availability of rich querying operations over simple storage.
A table and standard query operators serve this purpose very well. When the same query operations are available on the web as methods that can be invoked with a web address identifier, it allows many devices as well as applications to consume the data in a universal way. Said another way, the absence of this manifestation indicates a potential limitation.
Query operations are invoked over the web with the help of web application programming interfaces, or web APIs for short. They conform to one or more of the following patterns for query operations:
GET /resources?offset=0&limit=100 retrieves the first hundred records.
GET /resources?field=field1,field2&ordering=field1 retrieves just those fields, ordered by field1
GET /resources?sort=-created_at retrieves the most recent resources added pertaining to the caller
GET /resources?state=closed retrieves those resources that are no longer available
GET /resources?latitude=&longitude=&radius=
GET /resources?id={"lt": 100, "gt": 30}
GET /resources?ql=location within 500 of 40.042016, -86.900749
GET /resources?ql=select * where name = 'Fred'&limit=50 (Courtesy: usergrid.apache.org)

Friday, May 25, 2018

We were discussing reduction in vectors and graphs. We now proceed to hybrid graph processing with parallel breadth-first search. The hybrid version uses intra-node multi-threading. It is called hybrid because it uses both distributed-memory graph partitioning and shared-memory traversal parallelism, which enables scalability to several thousands of cores.
The partitioning may simply be 1D before the level synchronization happens. The distance array is distributed among the processes. Every process maintains the status of the vertices it owns, and it can utilize multi-threading to enumerate the adjacencies, but only the owner process of a vertex determines whether it has been visited already. Therefore all the adjacencies of the vertices in the current frontier need to be sent to their corresponding owners. This happens in the all-to-all communication step. Most of the activity is data parallel; the only synchronization happens at the barriers. The only additions over the serial level-synchronous algorithm are the computation involved with the distributed graph, namely the preparation of messages, and the all-to-all communication.
Let FS be the stack to store the vertices at the current level which represent the frontier
and NS be the stack to store the newly visited vertices one hop away from the current level.


The serial algorithm, as a runnable Python sketch (the graph is assumed to be a dict of adjacency lists):

def serial_bfs(adj, root):
    level = {root: 0}      # initialization: the root is visited at level 0
    FS = [root]            # push the root on the stack; FS is the frontier
    depth = 0
    while FS:              # while FS not empty
        NS = []
        for u in FS:
            for v in adj[u]:            # for each neighbor v of u
                if v not in level:      # if v is not visited
                    NS.append(v)        # push v onto NS
                    level[v] = depth + 1    # set level of v
        FS, depth = NS, depth + 1       # set FS to NS, empty NS, increment level
    return level

The hybrid algorithm is:
After initialization, push the root on the stack of its owner process,
and for each of the p processes, initialize send and receive buffers and a thread-local stack tbuf_ij for thread i on process j.
while FS not empty:
     for each u in FS in parallel {
           for each neighbor v of u do
                 p_v = find_owner(v)
                 push v to tbuf_i,p_v
     }
     Thread barrier
     for each of the p processes do
           merge the thread-local tbuf_ij in parallel to form send_buf_j
     Thread barrier
     All-to-all collective step:
           send data in send_buf and aggregate newly visited vertices into recv_buf
     Thread barrier
     for each u in recv_buf in parallel do
           if level of u is infinity then
                 set level of u
                 push u to NS_i
     Thread barrier
     Perform parallel merge of the NS_i into FS
     Thread barrier
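The all-to-all collective step above can be sketched with mpi4py; the 1D round-robin owner function and vertex-id scheme here are illustrative:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

def find_owner(v):
    # 1D partitioning: vertex ids distributed round-robin across processes
    return v % p

def exchange_frontier(discovered):
    # Bucket each discovered neighbor by its owner process (the send-buf)
    send_buf = [[] for _ in range(p)]
    for v in discovered:
        send_buf[find_owner(v)].append(v)
    # Collective step: every process receives exactly the vertices it owns
    recv_buf = comm.alltoall(send_buf)
    return [v for chunk in recv_buf for v in chunk]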
#detailed discussion in https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs

Thursday, May 24, 2018

We were discussing dimensionality reduction using both linear and non-linear transformations.
This technique eliminates noise by choosing the salient features as dimensions, and it lowers cost without significant loss of information. Given a data set X with n data points that needs to be reduced to d dimensions, a linear transformation proceeds by selecting a matrix V whose d columns span the target space and multiplying its transpose with each of the points in X to get Y. Since V has d columns, the resulting transformed points have d dimensions. Non-linear dimensionality reduction techniques may even learn an internal model within the data, as in the case of manifold learning. In this case, a high-dimensional data set may be projected onto a smaller dimension while trying to preserve the structure of inter-point distances from the high-dimensional space in the lower-dimensional projection. It is called non-linear because the mapping cannot be represented as a linear combination of the original variables.
Different choices of dimensions for the reduced space result in different perspectives, which yield different visualizations.
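A worked numpy sketch of the linear case, with random data just to show the shapes: V holds the d most salient directions and Y is the reduced data set.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))      # n=1000 points in D=10 dimensions
d = 2

Xc = X - X.mean(axis=0)              # center the points
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:d].T                         # D x d matrix of salient directions

Y = Xc @ V                           # apply V^T to each point
print(Y.shape)                       # (1000, 2)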
Interestingly, graphs can also be reduced. This is called hypergraph coarsening, which produces an approximation of the original structure of the graph. Coarsening can be iterative: a succession of smaller hypergraphs makes incremental progress towards a coarser graph with less overall loss of information. There are several methods. Pairs of vertices that are similar can be merged. Skip edges may be overlaid on the same graph to represent the coarser graph. Centrality may be adjusted based on the weights of the edges removed along with removed vertices, without loss of path or connectivity among the remaining vertices.
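As a minimal sketch of the pairwise-merge method, assuming an undirected graph given as adjacency sets and a symmetric edge-weight dict (all names illustrative): each round greedily pairs a free vertex with its heaviest free neighbor and contracts the pair into a supervertex.

def coarsen_once(adj, weight):
    # Greedy heavy-edge matching; weight[(u, v)] assumed defined both ways
    matched, super_of = set(), {}
    for u in adj:
        if u in matched:
            continue
        free = [v for v in adj[u] if v not in matched and v != u]
        if free:
            v = max(free, key=lambda w: weight[(u, w)])
            super_of[u] = super_of[v] = (u, v)
            matched.update((u, v))
    # Contract matched pairs; unmatched vertices survive as singletons
    coarse = {}
    for u, nbrs in adj.items():
        cu = super_of.get(u, (u,))
        coarse.setdefault(cu, set()).update(
            super_of.get(v, (v,)) for v in nbrs if super_of.get(v, (v,)) != cu
        )
    return coarse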
#book summary : https://1drv.ms/b/s!Ashlm-Nw-wnWtnZC3FVGlkh-m47E