Sunday, January 31, 2021

SIEM

Event management solutions are popular in supporting early attack detection, investigation, and response. Leading products in this space focus on analyzing security data in real time. Systems for security information and event management (SIEM) collect, store, investigate, mitigate, and report on security data for incident response, forensics, and regulatory compliance.

The data collected for analysis comes from logs, metrics, and events, usually combined with contextual information about users, assets, threats, and vulnerabilities. This machine data comes from a variety of sources and usually carries a timestamp on every entry. The store for this data is typically a time-series database that consistently preserves the order of events as they arrive. Such stores also support watermarks and savepoints for their readers and writers, respectively, so that processing can resume from an earlier point in time.

Because these systems depend on timelines, analytical processing has traditionally occurred on historical batches of data. Elastic storage for big data promoted the map-reduce technique, where the computation for each batch was mapped before the partial results were reduced to a final result. This took too long to be useful for high-volume traffic such as that from the Internet of Things. Making the batches smaller reduced the latency for each batch but did not reduce the overall compute. Newer streaming algorithms compute the result continuously as events arrive one by one. Stream processing libraries support aggregation and transformation by processing one event at a time, so the logic for analytics can be written conveniently by filtering or aggregating a subset of events. Higher-level query languages such as SQL can also be supported by treating these events as rows in a table.
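
As a minimal illustration of that one-event-at-a-time style, the sketch below filters and aggregates a stream as each event arrives. The event shape, window length, and alert threshold are hypothetical placeholders, not taken from any particular SIEM product.

```javascript
// Minimal sketch of one-event-at-a-time stream processing.
// The event shape ({ type, source, timestamp }) and the alerting
// threshold are hypothetical and for illustration only.
class FailedLoginAggregator {
  constructor(windowMs = 60_000, threshold = 5) {
    this.windowMs = windowMs;     // sliding window length
    this.threshold = threshold;   // alert when the windowed count reaches this
    this.eventsBySource = new Map();
  }

  // Called once per incoming event, in arrival order.
  onEvent(event) {
    if (event.type !== 'failed_login') return null;   // filter step

    const now = event.timestamp;
    const recent = (this.eventsBySource.get(event.source) || [])
      .filter(t => now - t < this.windowMs);          // drop expired entries
    recent.push(now);
    this.eventsBySource.set(event.source, recent);

    // aggregate step: emit an alert when the windowed count crosses the threshold
    return recent.length >= this.threshold
      ? { alert: 'possible brute force', source: event.source, count: recent.length }
      : null;
  }
}

// Usage: feed events one by one as they arrive from the log stream.
const agg = new FailedLoginAggregator();
const alert = agg.onEvent({ type: 'failed_login', source: '10.0.0.7', timestamp: Date.now() });
```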

The table is a popular basis for analytics because the database industry has a rich tradition of analytics. A family of data mining techniques, such as Association Mining, Clustering, Decision Trees, Linear Regression, Logistic Regression, Naïve Bayes, Neural Networks, Sequence Clustering, and Time-Series, is available out of the box once the data appears in a table. Unfortunately, not all of these techniques apply directly to streaming data, where records arrive one at a time. The standard practice for using these techniques for forecasting, risk and probability analysis, recommendations, finding sequences, and grouping starts with a model. First the problem is defined, then the data is prepared and explored to help select a model; the models are built and trained over the data, validated, and finally kept up to date. Usually, 70% of the data is used for training and the remaining 30% for testing and predictions. A similar practice is followed with machine learning packages and algorithms.


Saturday, January 30, 2021

Writing a sequential model using TensorFlow.js:


Introduction: TensorFlow is a machine learning framework for JavaScript applications. It helps us build models that can be used directly in the browser or on a Node.js server. We use this framework to build an application that can recognize different types of drawings using a sequential model.

Description: The model chosen is a recurrent neural network (RNN) model, used here for finding groups via paths in sequences. A sequence clustering algorithm is like the clustering algorithms mentioned above, but instead of finding groups based on similar attributes, it finds groups based on similar paths in a sequence. A sequence is a series of events; for example, a series of web clicks by a user is a sequence. A sequence can also be keyed by the IDs of any sortable data maintained in a separate table. Usually there is support for a sequence column, and support, in the data mining sense, is a metric based on probabilities. The sequence data has a nested table that contains a sequence ID, which can be of any sortable data type.

The JavaScript application loads the model before using it for prediction. When enough training images have been processed, the model learns the characteristics of the drawings that correspond to their labels. Then, as it runs through the test data set, it can predict the label of a drawing using the model. TensorFlow includes the Keras API, which can help author the model and deploy it to an environment such as Colab where the model can be trained on a GPU. Once training is done, the model can be loaded and run anywhere else, including a browser. The power of TensorFlow.js is its ability to load the model and make predictions in the browser itself.
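
A minimal sketch of that load-then-predict flow, assuming the model has been exported to a hypothetical URL and the drawing has already been rasterized into a 28x28 grayscale array:

```javascript
import * as tf from '@tensorflow/tfjs';

// The model URL and the 28x28 grayscale input shape are assumptions for illustration.
const MODEL_URL = 'https://example.com/quickdraw/model.json'; // hypothetical location

let model;

async function loadModel() {
  // Loads a Keras-style layers model previously converted for TensorFlow.js.
  model = await tf.loadLayersModel(MODEL_URL);
}

function predictDrawing(pixels /* Float32Array of length 28*28, values in [0,1] */) {
  return tf.tidy(() => {
    const input = tf.tensor4d(pixels, [1, 28, 28, 1]);   // batch of one image
    const probabilities = model.predict(input);          // shape [1, numClasses]
    const classIndex = probabilities.argMax(-1).dataSync()[0];
    return classIndex;                                    // index into the label list
  });
}
```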

The labeling of drawings starts with a sample of, say, a hundred classes. The data for each class is available on Google Cloud as NumPy arrays, with some number of images, say N, for each class. The dataset is pre-processed for training: it is converted into batches, and the model outputs class probabilities.

As with any machine learning example, the data is split into a 70% training set and a 30% test set. There is no inherent order to the data, so the split is taken over a randomly shuffled set.
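
A sketch of that shuffle-and-split step, assuming the class data has already been flattened into arrays; the variable names and the 28x28 image shape are assumptions based on the drawings dataset described above:

```javascript
import * as tf from '@tensorflow/tfjs';

// Assumes `images` is a Float32Array of N * 784 normalized pixels and
// `labels` is an array of N class indices -- placeholder names for this sketch.
function prepareData(images, labels, numClasses, trainFraction = 0.7) {
  const n = labels.length;

  // Shuffle indices (Fisher-Yates) so the 70/30 split is over a random ordering.
  const indices = Array.from({ length: n }, (_, i) => i);
  for (let i = n - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }
  const order = tf.tensor1d(indices, 'int32');

  const xs = tf.tensor2d(images, [n, 784]).reshape([n, 28, 28, 1]).gather(order);
  const ys = tf.oneHot(tf.tensor1d(labels, 'int32'), numClasses).toFloat().gather(order);

  const split = Math.floor(n * trainFraction);
  return {
    trainXs: xs.slice([0, 0, 0, 0], [split, 28, 28, 1]),
    trainYs: ys.slice([0, 0], [split, numClasses]),
    testXs: xs.slice([split, 0, 0, 0], [n - split, 28, 28, 1]),
    testYs: ys.slice([split, 0], [n - split, numClasses]),
  };
}
```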

TensorFlow makes it easy to construct this model using the Keras API. It can only present output after the model is trained, so the model must be run after the training data has labels assigned, which might be done by hand. The model works better with fewer parameters. It might contain three convolutional layers and two dense layers. keras.Sequential() (tf.sequential() in TensorFlow.js) instantiates the model. The pooling size is specified for each of the convolutional layers, and the layers are stacked onto the model. The model is compiled with a loss function, the Adam optimizer (tf.train.adam()), and a metric such as top-k categorical accuracy. The model summary can be printed for inspection. With a set number of epochs and batches, the model can be trained.
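
A sketch of that construction in TensorFlow.js, keeping the three convolutional and two dense layers described above; the filter counts, kernel sizes, and class count are assumptions for illustration:

```javascript
import * as tf from '@tensorflow/tfjs';

// Builds the 3-convolutional / 2-dense layout described above.
// Filter counts, kernel sizes, and the class count are illustrative assumptions.
function buildModel(numClasses = 100) {
  const model = tf.sequential();

  model.add(tf.layers.conv2d({
    inputShape: [28, 28, 1], filters: 16, kernelSize: 3, activation: 'relu',
  }));
  model.add(tf.layers.maxPooling2d({ poolSize: 2, strides: 2 }));

  model.add(tf.layers.conv2d({ filters: 32, kernelSize: 3, activation: 'relu' }));
  model.add(tf.layers.maxPooling2d({ poolSize: 2, strides: 2 }));

  model.add(tf.layers.conv2d({ filters: 64, kernelSize: 3, activation: 'relu' }));
  model.add(tf.layers.maxPooling2d({ poolSize: 2, strides: 2 }));

  model.add(tf.layers.flatten());
  model.add(tf.layers.dense({ units: 128, activation: 'relu' }));
  model.add(tf.layers.dense({ units: numClasses, activation: 'softmax' }));

  model.compile({
    optimizer: tf.train.adam(),
    loss: 'categoricalCrossentropy',
    metrics: ['accuracy'],   // top-k categorical accuracy would need a custom metric here
  });

  model.summary();   // prints the layer-by-layer summary mentioned above
  return model;
}
```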

With the model and training/test sets defined, it is now easy to evaluate the model and run inference. The model can also be saved and restored. Execution is faster when a GPU is added to the computation.
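
A brief sketch of the evaluate/save/restore cycle, assuming the model and test tensors from the earlier sketches; the storage destinations shown are standard TensorFlow.js targets:

```javascript
// Evaluate on the held-out 30%, then persist the trained model.
// `model`, `testXs`, and `testYs` come from the earlier sketches.
async function evaluateAndSave(model, testXs, testYs) {
  const [lossTensor, accTensor] = model.evaluate(testXs, testYs);
  console.log('test loss:', lossTensor.dataSync()[0]);
  console.log('test accuracy:', accTensor.dataSync()[0]);

  // Browser: trigger a file download; Node.js: write to disk instead.
  await model.save('downloads://quickdraw-model');
  // await model.save('file://./quickdraw-model');   // Node.js variant
}

// Restoring later is symmetric:
// const restored = await tf.loadLayersModel('https://example.com/quickdraw/model.json');
```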

The model can be trained in batches of a predefined size, and the number of passes over the entire training dataset, called epochs, can also be set up front. A batch size of 256 and 5 training steps could be used. These are called model tuning parameters. Every model has a speed, a Mean Average Precision, and an output; the higher the precision, the lower the speed. It is helpful to visualize training with a chart that is updated with the loss after each epoch. Usually there will be a downward trend in the loss, which is referred to as the model converging.
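
A sketch of the training loop with the batch size mentioned above and a per-epoch loss hook; the epoch count and the chart-update callback are assumptions:

```javascript
// Batch size mirrors the number above; 5 epochs stands in for the 5 steps mentioned,
// and updateLossChart() is a hypothetical charting hook.
async function trainModel(model, trainXs, trainYs) {
  await model.fit(trainXs, trainYs, {
    batchSize: 256,
    epochs: 5,
    validationSplit: 0.1,          // small validation slice to watch for convergence
    callbacks: {
      onEpochEnd: (epoch, logs) => {
        console.log(`epoch ${epoch}: loss=${logs.loss.toFixed(4)}`);
        // updateLossChart(epoch, logs.loss);   // e.g. push the point to a live chart
      },
    },
  });
}
```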

Training the model might take a long time, say about four hours. When the test data has been evaluated, the model's effectiveness can be measured using precision and recall: precision is the fraction of the model's positive inferences that were indeed positive, and recall is the fraction of actual positives that the model inferred.
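
Those two quantities can be computed directly from the counts of true positives, false positives, and false negatives; a small helper with made-up counts makes them concrete:

```javascript
// Precision: of everything the model flagged positive, how much really was.
// Recall: of everything that really was positive, how much the model flagged.
function precisionRecall(truePositives, falsePositives, falseNegatives) {
  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  return { precision, recall };
}

// Example: 90 correct detections, 10 false alarms, 30 misses (illustrative numbers).
console.log(precisionRecall(90, 10, 30));   // { precision: 0.9, recall: 0.75 }
```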

Conclusion: TensorFlow.js is becoming a standard for implementing machine learning models. Its usage is simple, but the choice of model and the preparation of data take significantly more time than setting it up, evaluating it, and using it.

Similar article: https://1drv.ms/w/s!Ashlm-Nw-wnWxRyK0mra9TtAhEhU?e=TOdNXy 

Friday, January 29, 2021

Writing a regressor using TensorFlow.js:


Introduction: TensorFlow is a machine learning framework for JavaScript applications. It helps us build models that can be used directly in the browser or on a Node.js server. We use this framework to build an application that can detect objects in images using a regressor rather than a classifier.

Description: A classifier groups entries based on similarity to each other, and images can be compared to one another in this way; however, a classifier says nothing about the position of an object within an image. A regressor sweeps a bounding box across the image at varying sizes until it finds a portion of the image that matches an object. The object itself can be specified as a bounding box within an image. Both the training and test data are images, but the training set carries bounding boxes and labels while the test set does not.

The JavaScript application uses the labels from the images to train the model. When enough training images have been processed, the model learns the characteristics of the object to be detected. Then, as it runs through the test data set, it can predict the bounding box and the label if a similar object is found in a test image.

As with any machine learning example, the data is split into a 70% training set and a 30% test set. There is no inherent order to the data, so the split is taken over a randomly shuffled set.

The model chosen is an object detection model. This model specifies the bounding box as top-left and bottom-right coordinates using horizontal and vertical offsets. The size of the image is known beforehand in terms of width and height, and the bounding boxes are guaranteed to lie within the image. The object name, the filename, and the file type can optionally be recorded for each image so that images can be looked up in a collection. The output consists of a label and a bounding box. A label map file specifies the objects to be detected; in this case, only one object is specified.
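
A sketch of how one annotated training record and a single-entry label map might look; the field names and values here are illustrative assumptions rather than a specific API's schema:

```javascript
// Illustrative shape of one annotated training image; field names are
// assumptions, not a particular object-detection API's schema.
const labelMap = [{ id: 1, name: 'logo' }];   // single object class, as described above

const trainingExample = {
  filename: 'image_0001.jpg',
  format: 'jpeg',
  width: 640,
  height: 480,
  annotations: [
    {
      label: 'logo',                      // must appear in the label map
      box: { xMin: 120, yMin: 80,         // top-left corner, pixel offsets
             xMax: 260, yMax: 210 },      // bottom-right corner, within the image
    },
  ],
};

// A quick sanity check that every box stays inside its image.
function boxesInsideImage({ width, height, annotations }) {
  return annotations.every(({ box }) =>
    box.xMin >= 0 && box.yMin >= 0 && box.xMax <= width && box.yMax <= height);
}
console.log(boxesInsideImage(trainingExample));   // true
```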

TensorFlow makes it easy to construct this model using an API. It can only present output after the model is trained, so the model must be run after the training data has labels assigned, which might be done by hand. The API expects the data to be converted into TFRecord files, a simple format for storing a sequence of binary records.

With the model and training/test sets defined, it is now easy to evaluate the model and run inference. The model can also be saved and restored. Execution is faster when a GPU is added to the computation.

The model can be trained in batches of a predefined size, and the number of passes over the entire training dataset, called epochs, can also be set up front. A batch size of 90 and 7000 training steps could be used. These are called model tuning parameters. Every model has a speed, a Mean Average Precision, and an output; the higher the precision, the lower the speed. It is helpful to visualize training with a chart that is updated with the loss after each epoch. Usually there will be a downward trend in the loss, which is referred to as the model converging.

Training the model might take a long time, say about four hours. When the test data has been evaluated, the model's effectiveness can be measured using precision and recall: precision is the fraction of the model's positive inferences that were indeed positive, and recall is the fraction of actual positives that the model inferred.

Conclusion: TensorFlow.js is becoming a standard for implementing machine learning models. Its usage is simple, but the choice of model and the preparation of data take significantly more time than setting it up, evaluating it, and using it.

Thursday, January 28, 2021

The API layer for use by mobile applications...

  

This is a continuation from the previous post:


  • Applications that allow user-defined scripts to be run, along with a rich suite of scriptable objects and APIs, arguably improve automation. For over two decades, with varying technologies, Microsoft has shown an example of improving scriptability and automation: first there was the Component Object Model, then Visual Basic that could use those objects, and finally PowerShell. Many applications, services, and instances can allow extensibility via user-defined scripts. These can be invoked from the command line as well as from other scripts, which also helps with testing.

  • Lastly, the importance of customer feedback cannot be overstated in customer-facing clients such as mobile applications and designer interfaces. Usability engineering can make it more convenient to navigate pages, use controls, and view dashboards, but the customers dictate the workflow. The prioritization of software features via customer feedback rests primarily with the product management team, not the engineering team.

The considerations mentioned so far have been technological. The discussion that follows adds more based on the business domain. Decades of effort in streamlining the data access layer across business domains have produced expertise and maturity in web architectures. Yet businesses continue to develop home-grown stacks for their respective business applications, some under the requirement to adopt container orchestration technologies and others with the excuse of developing independently testable microservices. Financial companies are heavily invested in shared data and indexes and need extract-transform-load as a core part of their services. Retail companies, such as clothing or beverage businesses, are increasingly invested in the point-of-sale experience. Telecommunication companies are required by law to meet compliance both for subscriber records and for anti-trust regulation. Because of their different requirements, these organizations come up with a portfolio of solutions and then retrofit standardization and consistency across their applications, especially when technical debt mitigation is permitted. All the practices mentioned so far have proven useful across industries regardless of technology choices, such as which authentication to support. Therefore, customization and business domain requirements should not be allowed to circumvent or ignore the practices mentioned here.

Certain improvements are driven by their business priority and severity. Even if the technology and architecture are determined up front, their implementations may take shape differently from one another. Even the end state of the implementation at the time of release may not be on par with what was once on the whiteboard, but the techniques suggested here have stood the test of time and landscape. Perhaps the single most important contributor has been the popularity of their usage with the developer community, demonstrated, for example, by the adoption of GraphQL/REST over SOAP and containers over virtual machines. Developers also find the public cloud architecture convenient for the v1 products of many companies across sectors and verticals. Developer community forums have also been a significant source of information for this document.

The following section is a note about testing. Mobile applications are written with the help of simulators that have internet access, and it is common for the developer to test the user interface as it is being developed. Automation for mobile applications is rather cumbersome. One way to mitigate this difficulty is to separate front-end testing by using a mobile view that can be called from desktop browsers. Usually, the web interface displayed by a backend is a response to a web request that can take additional query parameters requesting the response for the mobile platform. There is also browser-driven user interface automation that can even run headless and drive through the workflows using the web controls. A combination of these approaches can thoroughly test the mobile application from a user's point of view.
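
As a hedged sketch of that combination, a headless browser (Puppeteer here) can request the mobile view and drive the same controls a user would touch; the URL, the ?view=mobile query parameter, the viewport, and the selectors are all hypothetical:

```javascript
const puppeteer = require('puppeteer');

// Drives the mobile view of the site from a desktop/headless browser.
// The URL, the ?view=mobile parameter, the viewport, and the selectors are assumptions.
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Approximate a phone by shrinking the viewport and switching the user agent.
  await page.setViewport({ width: 390, height: 844, isMobile: true, hasTouch: true });
  await page.setUserAgent('Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)');

  // The backend is assumed to return its mobile layout for this query parameter.
  await page.goto('https://example.com/dashboard?view=mobile');

  // Walk a workflow through the same controls a user would touch.
  await page.click('#login-button');
  await page.type('#username', 'test-user');

  await browser.close();
})();
```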

Lastly, standard practice for web applications, documented in enterprise application blocks or cloud computing guides, is a great place for reference and resources, but we must note that the spirit in which they are written suggests that our implementations can be lean and mean so that we incur a lower total cost of ownership in development and operations.

Wednesday, January 27, 2021

The API layer for use by mobile applications...

 

This is a continuation from the previous post:

  • The entire software consuming the dependencies, regardless of its organization, may also be considered a service with its own APIs. Writing command-line convenience tools to drive this software with an API or a sequence of APIs can become very useful for diagnostics and automation. Consider the possibility of writing automations with scripts that are not restricted to writing and testing code. Automation has its own benefits, but not having to resort to code widens the audience. Along the lines of automation we mentioned convenience, but we could also cite security and policy enforcement. Since the tools are designed to run independently and with very little encumbrance, they can participate in workflows previously unimagined with the existing codebase and processes.

  • API versioning is mentioned as a best practice for clients and consumers to upgrade. There is one very useful property of the HTTP protocol here: a request can be redirected so that users are at least kept informed, or their calls are translated to newer versions in full-service solutions (see the sketch at the end of this post). It is true, however, that versioning is probably the only way to provide an upgrade path to users. So that we don't take on the onus of backward compatibility, the choice to offer both old and new becomes clearer. It is also true that not all customers or consumers can move quickly to adopt new APIs, and the web service then takes on the onus of maintaining the earlier behavior in some fashion. As web services mature and become increasingly large and unwieldy, they have no other way to maintain all the earlier behavior without at least relieving the older code base of newer features and providing those in a new offering.

  • As services morph from old to new, they may also change their dependencies, both in version and in the kind of service. While the user interface may become part of the response body, it may be better to separate the UI as a layer from the responses so that the service behind the API can be consolidated while the UI is customized on different clients. This keeps customization logic out of the services while enabling clients to vary across plugins, browsers, applications, and devices. Some request responses may continue to include their UI because they accept no other form of data transfer, but the XMLHttpRequest that transfers the data from the browser to the server is still a client-side technology and doesn't need to be part of the server response body. Another reason servers choose to include user-facing forms and controls in their responses is to enforce same-origin policies, strict client registrations, and their redirections. By requiring some of their APIs to be internal, they also restrict others from making similar calls. APIs do have natural address, binding, and contract properties that allow their endpoints to be secured, whereas client technologies do not require such hard and fast rules. Moreover, with services relaying data via calls, strict origin registration and control may still be feasible.

  • There is more freedom with visualizations displayed by clients built for the desktop rather than for mobile applications. This allows curated libraries for charts and graphs, visualizations, analytics, and time-series reporting. While some may be client libraries, others may be dedicated stacks by themselves. If the treatment of data prior to rendering requires a dedicated stack, it can even be written as a separate microservice specific to desktop clients. Many off-the-shelf products are built to facilitate solution integration via services, which ties in well with empowering desktop clients that are not resource-starved.
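
Returning to the versioning bullet above, here is a minimal sketch of redirecting older API versions to newer ones, assuming an Express service; the routes, version numbers, and payloads are illustrative:

```javascript
const express = require('express');
const app = express();

// v2 is the current contract; route paths and payloads are illustrative.
app.get('/api/v2/orders/:id', (req, res) => {
  res.json({ id: req.params.id, version: 'v2' });
});

// Older v1 callers are redirected (or could be translated in place) so that
// clients learn about the newer version without the old path going dark.
app.get('/api/v1/orders/:id', (req, res) => {
  res.redirect(308, `/api/v2/orders/${req.params.id}`);   // 308 preserves the method and body
});

app.listen(3000);
```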