Cluster computing

Wednesday, December 19, 2018

Today we continue discussing the best practice from storage engineering:

195) User’s location, personally identifiable information and location-based services are required to be redacted. This involves not only parsing for such data but also doing it over and over starting from the admission control on the boundaries of integration. If the storage tier stores any of these information in the clear during or after the data transfer from the pipeline, it will not only be a security violation but also fail compliance.

196) Data Pipelines can be extended. The storage tier needs to be elastic and capable of meeting future demands. Object Storage enables this seamlessly because it virtualizes the storage. If the storage spans clusters, nodes can be added. Segregation of data is done based on storage containers.

197) When data gets connected, it expands the value. Even if the storage tier does not see more than containers, it does very well when all the data appears in its containers. The connected data has far more audience than it had independently. Consequently, the storage tier should facilitate data acquisition and connections

198) Big Data is generally accumulated from some source. Sensor data for example can be stored in NoSQL Databases. However, the data is usable when the right metadata is also recorded with the observational data. To do this continuously, the storage tier must facilitate metadata acquisition.

199) Cleaning and parsing: Raw data is usually noisy and imperfect. It has to be carefully parsed. For example, with full text analysis, we perform stemming and multi-stage pre-processing before the analysis. This applies to admission control and ingestion as well.

200) Data modeling and analysis: Data model may be described with entity-relationship diagrams, json documents, objects and graph nodes. However, the models are not final until several trials. Allowing the versions of data models to be kept also helps subsequent analysis.

#codingexercise
How does SkipList work?
SkipList nodes have multiple next pointers that point to adjacencies based on skipping say 4,2,1
In a sorted skiplist this works as follows:

SkipListNode skipAhead(SkipAheadNode a, SkipAheadNode b) {
SkipListNode cur = a
SkipListNode target = b;
If ( a == null) return a;
For ( cur; cur.next && cur.next.data <= target.data; ) {
// skip by 4, if possible
If (cur.next && cur.next.next && cur.next.next.next && cur.next.next.next.next &&
cur.next.next.next.next <= target.data)
cur = cur.next.next.next.next;
// Skip by 2, if possible
If (cur.next && cur.next.next &&
cur.next.next.next. <= target.data)
cur = cur.next.next;
// Skip by 1, if possible
If (cur.next &&
cur.next <= target.data)
cur = cur.next;
}
Return cur.next;
}
Since the SkipList already has the links at skip levels of 4,2,1 etc we avoid the checking and using next.next.next notations

The number of levels for skipping may not be restricted to using 4,2,1 only.

Cluster computing

Wednesday, December 19, 2018

No comments:

Post a Comment