Sunday, February 14, 2021

Importer-Exporter


 

Problem Statement: Data that lives in a database is hard to work with from mature command-line utilities because it is not in a declarative form, and the moment it is exported it becomes an instant legacy copy. On the other hand, when data is exported as files, objects, or events, businesses are empowered to use non-relational technologies such as Big Data analysis stacks, the Snowflake warehouse, and stream analytics. What is needed is an import-export tool, and this article discusses the options available for building one.

 

Solution: Data import/export is not a new feature. Extract-Transform-Load (ETL) operations can already export data in a variety of formats such as CSV or JSON. What has been missing is a layout on disk that mirrors the organization of the data inside the database, so that the data becomes easy to differentiate and work with instead of remaining hidden within the database. When cells, rows, and tables are laid out so that they are easy to locate in a folder/file hierarchy, they become tractable for command-line tools that understand such hierarchies and can recursively diff instances of data.
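
As a minimal sketch of this flattening, assuming a hypothetical SQLite snapshot file and output directory, each table could be exported as its own folder containing a diff-friendly CSV of its rows:

import csv
import os
import sqlite3

def export_to_folders(db_path: str, out_dir: str) -> None:
    """Flatten every table of a SQLite database into out_dir/<table>/rows.csv."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    for (table,) in cursor.fetchall():
        table_dir = os.path.join(out_dir, table)
        os.makedirs(table_dir, exist_ok=True)
        rows = conn.execute(f'SELECT * FROM "{table}"')
        headers = [col[0] for col in rows.description]
        with open(os.path.join(table_dir, "rows.csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(headers)   # column names become the first line
            writer.writerows(rows)     # one line per row, easy to diff between snapshots
    conn.close()

# export_to_folders("snapshot.db", "./snapshot-2021-02-14")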

  

The ETL wizards are only part of the solution because they export the data but not its organization. Even if they exported the schema, that would help organize the data but would still fall short of the end-to-end capability needed to differentiate instances of data between database snapshots.

 

OData popularized exposing data to the web through a conventional web Application Programming Interface (API). That data can be imported with a command-line tool such as curl, and JavaScript Object Notation (JSON) makes it universally acceptable for programmability. Yet the organizational information is largely left to the caller, and the onus of making hundreds of calls does not address the need described here. This too is an insufficient solution, but certainly a helpful step.

 

The S3 API, with its three-level hierarchy of namespace, bucket, and object, was an interesting evolution because each object is web-accessible. If data could be exported from a database into these containers, command-line tools became unnecessary, since downstream systems such as event processors could read the data directly without it having to be written to disk between processing stages. The ability to differentiate instances of data between database snapshots could then be modified to iterate over the remote web locations, which conveniently maintain a folder-like prefix organization. Whether the database, tables, and columns map to namespaces, buckets, and objects remains at the discretion of the caller and depends on the nature and size of the data. This mapping, even if predetermined, is not found in the existing partial solutions offered by ETL wizards, OData, or S3 exporters and importers.
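
A rough sketch of such an export, assuming the boto3 client, an already-existing bucket, and one possible (hypothetical) mapping of table name to key prefix:

import json
import boto3

s3 = boto3.client("s3")

def export_rows_to_s3(bucket: str, table: str, rows: list) -> None:
    """Write each row as its own web-accessible object under a table-name prefix."""
    for i, row in enumerate(rows):
        key = f"{table}/{i}.json"   # folder-like prefix per table, one object per row
        s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(row).encode("utf-8"))

# export_rows_to_s3("db-snapshot-bucket", "customers", [{"id": 1, "name": "a"}])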

 

The filesystem remains universally appealing to shell-based command-line tools. Writing C-program-based tools that download data from a database and write it to disk as folders and files is, therefore, the only complete solution for differentiating instances of data between database snapshots. This option has usually been left to individual developers to implement for their own use cases, because a production database can run to terabytes and filesystems, even stretched across clusters, were not considered practical at that scale. Yet these same developers proved that data is not stuck to relational databases, because they found and promoted batch- and stream-based processing that avoids the centralized store.

 

Conclusion: With the ability to connect to a relational store and a predetermined strategy for flattening data from database snapshots into a folder/file layout, a shell-based command-line tool can enable source control tracking on data, allowing data revisions to be visualized and compared more easily.

Saturday, February 13, 2021

The export of a versioning system:

 

Problem Statement: Versioning is a core requirement for many software products. It usually takes the form Major.Minor.Patch, where the major is incremented for breaking changes from the previous version, the minor is incremented for backward-compatible changes, and the patch denotes mere bug fixes. Yet versioning varies from system to system and occasionally requires the wheel to be reinvented across components. If we look instead at source control platforms such as GitHub, we see a much more desirable standardization that serves a two-fold purpose: it enables each version to be uniquely identified, and it manages versions so that changes can be compared effectively. Is it possible to offload the versioning to GitHub?

Solution: We first ask whether the versioning serves the same purpose everywhere. For binary dependencies of large software, versioning carries a different meaning and form than it does for text files, which are easy to diff even without a version. GitHub provides universal versioning and is ubiquitous in its adoption across companies. There is even a translation from Git history to Semantic Versioning 2.0.0: the former captures the changes to files in a workspace, while the latter standardizes the meaning and form used to represent versions. GitVersion is a tool that can translate individual Git versions into semantic versions. With the ability to give every change a unique, universal version and to bridge the gap to a form compliant with the Semantic Versioning standard, GitHub looks like a one-stop shop for versioning, tracking, and building a management system around changes.

The artifacts that Git versions are mere files. Therefore, any system that wants Git to version its changes must export those changes as a set of files that can be checked in and versioned with Git. A workflow that exports the artifacts and imports them back with their versions avoids reinvention, rewrites, inconsistencies, and bugs in dealing with those artifacts. GitVersion, for its part, has revised its own versioning strategy, moving from the fixed GitFlow and GitHubFlow modes of version 2 to configuration in v3. With configuration, many aspects of the versioning can be tweaked, which the earlier strategy did not allow.

The other approach is to mimic Git's versioning strategy within the product that requires versioning. This avoids having to export and import the artifacts as files and works with object stores such as Artifactory. It is especially helpful when versions are created using md5, since the hashes look uniform and have a fixed length of 32 characters.
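
A minimal sketch of that idea, with a hypothetical artifact path: hashing the file contents with md5 yields a fixed 32-character identifier that can serve as the internal version string:

import hashlib

def content_version(path: str) -> str:
    """Derive a 32-character md5 hex digest from an artifact's bytes to use as its version."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()   # always 32 hex characters, regardless of artifact size

# print(content_version("artifact.bin"))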

The export of data to text files also helps track revisions of data. Just as code is revisioned, data can benefit from change capture in a form that is widely accepted.

Conclusion: Versioning is increasingly becoming a commodity requirement that will likely be handed over to platforms, services, and clouds, taking the onus away from projects, organizations, and businesses. Using something like Git, which is well-known, universal, and ubiquitous, will help reduce the cost of building and maintaining products in much the same way that cloud computing has reduced the onus and ownership of applications for startups.

Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWxUODFr8nsjQ4GeYR?e=8PMCye 


Addendum: Elaboration on the specifications required for Semantic Versioning (SemVer)

1. Software using semantic versioning must declare a public API.

2. A normal version must take the form X.Y.Z, where X, Y, and Z are non-negative integers that must not contain leading zeros.

3. Once a versioned package has been released, its contents must not be modified; any change must be released as a new version.

4. A major version of zero (0.y.z) is permitted only for initial development.

5. Version 1.0.0 defines the public API.

6. The patch version x.y.Z must be incremented only when backward-compatible bug fixes are introduced.

7. The minor version x.Y.z must be incremented when new backward-compatible features are introduced or existing functionality is deprecated.

8. The major version X.y.z must be incremented when backward compatibility is broken.

9. A pre-release version may be denoted by appending a hyphen and a series of dot-separated identifiers.

10. Build metadata may be denoted by appending a plus sign and a series of dot-separated identifiers.

11. Precedence is determined by comparing major, minor, and patch numerically; a pre-release version has lower precedence than the corresponding normal version, and pre-release identifiers are compared field by field.
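
As a small illustration of rules 2 and 11, here is a hypothetical parser and precedence check; it simplifies pre-release comparison to plain string ordering rather than the full identifier-by-identifier rules:

def parse(version: str):
    """Split 'X.Y.Z[-prerelease]' into a tuple that sorts per the rules above."""
    core, _, prerelease = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    # A pre-release sorts before the normal version with the same X.Y.Z (rule 11).
    return (major, minor, patch, prerelease == "", prerelease)

assert parse("1.0.0-alpha") < parse("1.0.0")
assert parse("1.9.0") < parse("1.10.0") < parse("2.0.0")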


Friday, February 12, 2021

Generating hashes


Hash generation is an important step in the versioning and signing stages of software development and release. A hash is a fingerprint of the associated artifact. With this identifier, the artifact can be tracked or referenced, with the guarantee that if the content changes, the hash changes too. The integrity of the file is therefore demonstrated by the hash.

Popular forms of hashes are the Message Digest (MD5) and the Secure Hash Algorithm (SHA). SHA-1 hashes are 160 bits, or 20 bytes, long and are typically rendered as 40 hexadecimal digits. The design follows Rivest's approach for MD4 and MD5. The hash state is five unsigned 32-bit words, treated as big-endian. The message is first preprocessed: append the bit 1, then append zero bits until the length is congruent to 448 modulo 512, and finally append the original length as an unsigned 64-bit number. Processing then proceeds over successive 512-bit chunks. Each chunk is broken into sixteen 32-bit big-endian words, which are extended into eighty 32-bit words: each of words 16 through 79 is the XOR of the words 3, 8, 14, and 16 positions earlier, left-rotated by 1. The working variables are initialized from the current hash values h0 through h4. In the main loop, for i from 0 to 79, split into four equal ranges, a bitwise function (combinations of 'and' and 'or' defined differently for each range) is applied to the working variables, which are then shuffled by reassigning the first and left-rotating the third by 30. At the end of the loop, the working variables are added back into the chunk's hash values, and the final hash is the concatenation of h0 through h4. Keyed MD5 produces a cryptographic checksum for a message as m + MD5(m + k). Another popular use of hashes is with certificates. A certificate is a document with a digital signature, signed by a Certification Authority. Certificates enable public-key authentication: A sends E(x, Public-B) to B, and B sends back the decrypted x.
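
These digests do not need to be implemented by hand; Python's hashlib exposes both algorithms, and the keyed-MD5 checksum mentioned above can be composed from it (the message and key here are illustrative):

import hashlib

message = b"release artifact bytes"
key = b"shared-secret"

print(hashlib.sha1(message).hexdigest())   # 40 hex digits, the 160-bit SHA-1 fingerprint
print(hashlib.md5(message).hexdigest())    # 32 hex digits, the 128-bit MD5 fingerprint

# Keyed MD5 checksum in the m + MD5(m + k) form described above
checksum = message + hashlib.md5(message + key).digest()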

Like versioning with hashes, signing is the process by which a digital signature is created from the file contents. The signature proves that the contents of the file have not been tampered with. Signing does not need to encrypt the file contents in order to generate the signature. In some cases, a detached signature may be stored as a separate file; others may choose to include the digital signature along with the set of files in an archive. The signature differs from a fingerprint or hash in that verification involves decryption, which allows the content to be shown, irrefutably, to come from the purported origin. Signing uses a private-public key pair to compute the digital signature: the private key signs a file, while the public key verifies the signature. The public key can be published with the signature, or it can be made available in ways that are well known to the recipients of the signed files.
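
A minimal sketch of sign-and-verify, assuming the third-party Python 'cryptography' package rather than gpg: the private key signs the file contents and the public key verifies the detached signature:

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

data = b"contents of the file being released"

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
signature = private_key.sign(
    data,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Verification raises InvalidSignature if the data or the signature was tampered with
private_key.public_key().verify(
    signature,
    data,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)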

The process of signing can use one of many encryption methods. The stronger the cryptography, the stronger the signature and the lower the chance that the file has been tampered with. The exact process of signing varies across operating systems.

Git popularized the use of the 'gpg' tool to sign and verify files. The tool can even generate the key pair with which to sign them. The resulting signature is in the Pretty Good Privacy (PGP) format and is stored as a file with the .asc extension. Publishing the public key along with the detached signature is a common practice for many distributions of code.

With versioning and signing becoming standard features from vendors of source control, object stores, and artifact repository managers, applications can choose to offload these activities to stacks that lower their Total Cost of Ownership.


Thursday, February 11, 2021

The plan for a mobile application that monitors outbound web traffic from the device:

 

Problem statement: Mobile applications are becoming ever smarter on the two popular mobile platforms, each of which has its own app store with universal appeal among customers. Users download a variety of applications and enable notifications and settings that permit those applications to upload and download data without the user's awareness. Although most applications play nice, users have little or no tooling to analyze these application behaviors and draw insights into their mobile phone usage.

Solution: This proposal helps Android users by displaying a pie chart of the applications with the most web traffic over the last 24 hours. Applications generally tend to send or receive traffic from the same sources or destinations, and the pie chart breaks down the web traffic by these endpoints. As the user studies the chart of websites visited by the applications on their device, they can take corrective action: turning off notifications, removing privileges, or even uninstalling an application. Securing the mobile device for users and empowering them to take corrective actions reaffirms the trust they place in the smartest device closest to them. It might mean reduced functionality in some cases, but the tradeoff is in the user's hands, and it raises awareness where there was none.

Usually, applications cannot themselves declare which websites they will access, and it is hard for one application to snoop on the traffic of other applications without going through the platform or operating system. Such access requires privilege, and enabling such an application on the mobile device raises the question "who guards the guard?" Instead, the platform provides Java Native Interface (JNI) hooks that can be used to listen to the web traffic, and the application can then determine how to collect and analyze it.

Using a hash table or dictionary to collect and count accesses to web pages is a typical way to maintain running counts of the pages accessed. An inverted mapping from the highest counts to their corresponding web pages is then needed to draw the pie chart. The collection can be viewed anytime and on demand, but the accumulation covers only the last 24 hours, so a sliding window is required to discard older entries and retain newer ones.
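
A rough sketch of that bookkeeping (the function names, the 24-hour constant, and the hostnames are illustrative; on Android the equivalent logic would live inside the app):

import time
from collections import Counter, deque

WINDOW_SECONDS = 24 * 60 * 60

events = deque()     # (timestamp, destination host), oldest first
counts = Counter()   # running count per destination host

def record(host, now=None):
    """Add one observed request and evict entries older than the 24-hour window."""
    now = time.time() if now is None else now
    events.append((now, host))
    counts[host] += 1
    while events and events[0][0] < now - WINDOW_SECONDS:
        _, old_host = events.popleft()
        counts[old_host] -= 1

def top_destinations(n=5):
    """Return the hosts with the most traffic, e.g. to back the pie chart."""
    return counts.most_common(n)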

The application itself will have to provide the user controls, navigation, and experience typical of any mobile application. The navigation to the home page, the display of the pie chart, and refreshes of the chart (whether the user arrives from external applications or pages, triggers a manual refresh, or navigates within the chart itself) will have to happen automatically.

Finally, the application must demonstrate that it handles all the lifecycle and display events associated with it. If these handlers are written correctly, the user experience of viewing the report will be smooth and satisfying.


Wednesday, February 10, 2021

How to write a chatbot?


Problem statement: Many web sites provide a chatbot experience in which users can ask questions and receive answers relevant to the business. A chatbot can also be hosted in a mobile application so that it responds only to the device owner. In that case, the chatbot can be trained to be a translator, a movie-based responder, a nurse, a sentiment analyzer, or a humor bot. The implementation of a chatbot remains the same across these usages; only the training datasets differ.

Solution: Writing a chatbot starts with a deep learning model. Such a model is easier to build on well-known machine learning platforms. The model must be trained on a relevant dataset and tuned to serve satisfying responses. If the model is treated as a black box, it is hard to evaluate and will not perform well. That is why this article describes how to build and train such a model from scratch.

A chatbot is well served by a sequence-to-sequence model. More information about this type of model can be found in the documents listed in the reference section, but at a high level it works with sequences rather than with the individual symbols that constitute them. It therefore does not need to know what the parts of a sequence represent, whether they are words or video frames, and it can even infer the meaning of those symbols. When raw data is shredded into sequences, the model keeps per-sequence state that it infers from the sequence; this state is the essence of the sequence. Using this state, the model can translate or interpret input sequences (text) into output sequences (responses). One popular sequence-to-sequence architecture is the Recurrent Neural Network, or RNN for short. The RNN encoder-decoder model was proposed by Bahdanau et al. in 2014, and it can be used to write any kind of decoder that generates custom output, which makes it suitable for a wide variety of usages.

The model is built on the following premise. It dissects the text into timesteps and encodes internal state for those timesteps. The context, which is the semantic content and the basis for any follow-up, is learned from the sequence. Neurons help remember information and expose just enough of it to build a context. A sequence database helps stash slices of ordered elements, taken from the sentences, as sequences. Given a support threshold, the model finds the complete set of frequent subsequences; if adding an element to a sequence does not make it frequent, then none of its super-sequences will be frequent. With such a sequence in hand, a follow-up can be produced by interpreting it: the state is decoded and a new output sequence is generated, which forms the chatbot's response.
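
A compact Keras sketch of such an encoder-decoder, with placeholder vocabulary sizes and dimensions: the encoder reads the input sequence and passes only its internal state to the decoder, which generates the response sequence:

from tensorflow import keras
from tensorflow.keras import layers

num_encoder_tokens = 1000   # placeholder input vocabulary size
num_decoder_tokens = 1000   # placeholder output vocabulary size
latent_dim = 256            # size of the internal state that summarizes a sequence

# Encoder: read the input sequence and keep only its final internal state
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: generate the output sequence conditioned on the encoder state
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_outputs = layers.LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = layers.Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
# model.fit(...) on paired question/response sequences, then model.save(...) for later reuse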

Code for this sequence-to-sequence analysis accompanies this article, and machine learning frontends such as TensorFlow make it easy to load a saved model and use it from any client, while the Keras backend in a Colab-like environment can help train the model independently and save it for future use.

Tuesday, February 9, 2021

Writing a recommender using TensorFlow.js:

 

Introduction: TensorFlow.js is a machine learning framework for JavaScript applications. It helps us build models that can be used directly in the browser or in a Node.js server. We use this framework to build an application that makes recommendations with deep learning and can find the best results among millions of candidates.

Description: The recommender from TensorFlow is built on two core features: one that supports fast approximate retrieval and another that supports better techniques for modeling feature interactions. A SavedModel is exported for the recommender; it takes query features as input and provides recommendations as output. Feature interactions are based on deep and cross networks, which are efficient architectures for deep learning. Cross features are essential to span the large and sparse feature spaces typical of recommender datasets. ScaNN is a state-of-the-art nearest-neighbor search (NNS) library, and it integrates with TensorFlow Recommenders.

For example, we can say:

import tensorflow_recommenders as tfrs

# Build an approximate nearest-neighbor retrieval index over the candidate embeddings
scann = tfrs.layers.factorized_top_k.ScaNN(model.user_model)
scann.index(movies.batch(100).map(model.movie_model), movies)

Keras acts as the authoring backend and can run in a Colab environment. Keras can help author the model and deploy it to an environment such as Colab where it can be trained on a GPU. Once training is done, the model can be loaded and run anywhere else, including a browser. The power of TensorFlow.js is in its ability to load the model and make predictions in the browser itself.

While brute-force approaches make fewer inferences per second, ScaNN is quite scalable and efficient.

As with any machine learning example, the data is split into a 70% training set and a 30% test set. There is no order to the data, and the split is taken over a random selection.

TensorFlow makes it easy to construct this model using the Keras API. The model can only produce output after it has been trained, and in this case the training data must have labels assigned, which might be done by hand. The model works better with fewer parameters. A summary of the model can be printed for review, and with a chosen number of epochs and batch size, the model can be trained.

With the model and the training/test sets defined, it is just as easy to evaluate the model and run inference. The model can also be saved and restored, and it executes faster when a GPU is available.
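
A toy Keras sketch of that workflow, with synthetic data and a plain dense model standing in for a real recommender: train with a chosen number of epochs and batch size, evaluate on the held-out split, then save the model for later reloading:

import numpy as np
from tensorflow import keras

# Synthetic stand-in data, split 70/30 over a random permutation
features = np.random.rand(1000, 16).astype("float32")
labels = np.random.rand(1000, 1).astype("float32")
order = np.random.permutation(len(features))
split = int(0.7 * len(features))
x_train, y_train = features[order[:split]], labels[order[:split]]
x_test, y_test = features[order[split:]], labels[order[split:]]

# Placeholder model; a real recommender would use query/candidate towers from tfrs
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(16,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()                                    # print the model for inspection

model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(x_test, y_test, verbose=0))   # loss on the held-out 30%
model.save("recommender_demo.h5")                  # restore later with keras.models.load_model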

When the model is trained, it is done in batches of a predefined size, and the number of passes over the entire training dataset, called epochs, can also be set up front. These are the model's tuning parameters. Every model has a speed, a Mean Average Precision, and an output; the higher the precision, the lower the speed. It is helpful to visualize training with a chart that updates with the loss after each epoch. Usually there is a downward trend in the loss, which is described as the model converging.

Training the model might take a long time, say about 4 hours. Once the test data has been evaluated, the model's effectiveness can be described using precision and recall: precision is the fraction of the model's positive inferences that were indeed positive, and recall is the fraction of the actual positives that the model found.

Conclusion: TensorFlow.js is becoming a standard for implementing machine learning models. Its usage is simple, but the choice of model and the preparation of data take significantly more time than setting it up, evaluating it, and using it.

Similar article: https://1drv.ms/w/s!Ashlm-Nw-wnWxRyK0mra9TtAhEhU?e=TOdNXy

#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWrwRgdOFj3KLA0XSi


Monday, February 8, 2021

social graph continued ...

 

  1. Logistic Regression: individual behavior based on demographics can be used to predict the likelihood of a category of actions from that individual, and it can also be used to find repetitions in actions. This is a form of regression that supports binary outcomes. It uses statistical measures, is highly flexible, takes any kind of input, and supports different analytical tasks. This regression dampens the effects of extreme values and evaluates the several factors that affect a pair of outcomes.


  2. Neural networks: this widely used machine learning method involves neurons that have one or more gates for input and output. Each neuron assigns a weight, usually based on probability, to each feature, and the weights are normalized, resulting in a weight matrix that articulates the underlying model in the training dataset. The model can then be used with a test dataset to predict outcome probabilities. Neurons are organized in layers, each layer independent of the others, and layers can be stacked so that the output of one becomes the input of the next. All messages written by individuals on social networking applications can now be evaluated with softmax NLP classifiers to detect keywords.


  3. Naïve Bayes is widely used for cases where conditions apply, especially binary conditions such as with or without. If the input variables are independent, their states can be calculated as probabilities, and there is at least one predictable output, this algorithm can be applied. The simplicity of computing states by counting, per class, over each input variable and then displaying those states against the variables for a given value makes this algorithm easy to visualize, debug, and use as a predictor.


  4. Collaborative filtering: recommendations include suggestions for a knowledge base or for finding model service requests. To make a recommendation, first a group sharing similar tastes is found, and then the preferences of that group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure that helps keep track of people and their preferences is a nested dictionary, which can use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the people in the selected group. To find similar people to form a group, we use some form of similarity score. One way to calculate this score is to plot the items that people have ranked in common and use them as axes in a chart; people who are close together on the chart can then form a group.
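
A small sketch of that similarity score over the nested-dictionary structure described above (the names and rankings are made up): people are compared only on the items they have both ranked, and closer rankings yield a higher score:

from math import sqrt

# Nested dictionary of person -> {item: ranking on a 1-5 scale}
prefs = {
    "alice": {"kb_article_a": 5, "kb_article_b": 3, "kb_article_c": 1},
    "bob":   {"kb_article_a": 4, "kb_article_b": 3, "kb_article_c": 2},
    "carol": {"kb_article_a": 1, "kb_article_c": 5},
}

def similarity(p1, p2):
    """Euclidean-distance-based score over the items both people have ranked."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0.0
    distance = sqrt(sum((prefs[p1][item] - prefs[p2][item]) ** 2 for item in shared))
    return 1.0 / (1.0 + distance)   # 1.0 means identical rankings on the shared items

print(similarity("alice", "bob"))    # high: their rankings are close
print(similarity("alice", "carol"))  # low: their rankings diverge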


Conclusion: Several other mining algorithms could be listed alongside those above for their uses on social graphs. This merely sets a precedent for applications such as chatbots and assistants to find useful and relevant information for individuals or businesses from the social graph.