Sunday, February 7, 2021

Social graph

This essay is about data mining on social network data. Applications such as Facebook, WhatsApp, Twitter, and Instagram center on the personal connections around an individual, and these social graphs are rich in information that can be mined with well-known data mining algorithms for a variety of purposes, such as recommendations for the user, advertising, and marketing.

Data mining algorithms were originally suited to relational databases and tabular formats. Graphs have their own databases, where relationships are described by edges between nodes. The nature of the data does not change; its representation and query language change, but it has been possible to standardize the query language over a diverse set of data stores such as relational stores, big data, snowflake schemas, and graph databases. With this assumption, we proceed to list the use-case scenarios for data mining over social graph data.

1. Classification algorithms – Forming groups of individuals united by some common purpose or campaign has always been organic on social networking platforms. This application of a classification algorithm builds a statistics table of the different interactions these individuals have had over time and collects them in a vector representation for each individual, using some chosen metrics as features. Then the groups are learned, and the software can make a classification that offers insight the individual may have been too pigeon-holed to see. (A sketch combining this with items 3 and 6 follows the list.)

2. Regression algorithms – Almost any demographic data pertaining to individuals on a social graph is likely to form a scatter plot. One of the best advantages of linear regression is prediction with time as an independent variable. When data points have many factors contributing to their occurrence, a linear regression gives an immediate ability to predict where the next occurrence may happen.

3. Segmentation algorithms – A segmentation algorithm divides data into groups, or clusters, that have similar properties. Population segmentation on a social graph provides interesting insights into how individuals might react to campaigns.

4. Sequence analysis algorithms – The difference between classification algorithms and sequence algorithms is that the latter focus on paths in sequences. They do not even need to know the meaning of the constituents of the sequence; they just have to encode the sequence to a context and use a corresponding decoder to generate an output sequence. Chatbots use this to create responses to individuals' chats. The relays between individuals can be studied similarly.

5. Outliers mining algorithms – Since not everyone is a conformist, there are bound to be fringe groups and outliers whose identification alone is of valuable interest to various agencies. This calls for the use of outlier-detection algorithms to determine their identities.

6. Decision tree – Perhaps the most used data mining model is the decision tree, simply because it is easy to visualize and study as it forms branches based on decision splits of the user community. Well-trained models can easily predict the label associated with newcomers.

7. Time-series algorithms – Perhaps the most anticipated information from a social graph is how things change over time. Using historical data to predict the outcome of a variable falls within this category of analysis.
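
To make items 1, 3, and 6 concrete, here is a minimal sketch in Python, assuming each individual has already been reduced to a vector of interaction counts; the metrics, the data, and the scikit-learn calls are illustrative, not prescriptive.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Rows are individuals; columns are chosen metrics, e.g. posts, likes, shares.
features = np.array([
    [120, 30, 2],
    [115, 28, 3],
    [10, 400, 50],
    [12, 380, 45],
])

# Segmentation (item 3): discover groups of similar behavior.
segments = KMeans(n_clusters=2, n_init=10).fit_predict(features)

# Classification via a decision tree (items 1 and 6): learn the group
# labels so that a newcomer can be assigned a group from metrics alone.
tree = DecisionTreeClassifier().fit(features, segments)
print(tree.predict([[11, 390, 48]]))  # label for a new individual

The clusters discovered by KMeans double as training labels for the decision tree, which is the sense in which the software can see a grouping that the individual may not.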


Saturday, February 6, 2021

Personal sentiment analyzer


Introduction: We rely on direct messages to get immediate help. The wireless or mobile internet connection is taken for granted, just as much as the party answering the other end of the line. But there is another possibility – a chatbot. Many websites provide a chatbot experience to their users, who can ask questions and receive answers relevant to the business, as a first responder and self-service option. A chatbot can also be hosted on a mobile application so that it responds only to the mobile device owner. In such a case, the chatbot can be trained to be a translator, a movie-based responder, a nurse, a sentiment analyzer, or a humor bot. The implementation for a chatbot remains the same across these usages, except that each is trained on a different dataset.


Problem statement: There are many usages for a chatbot. It can be a first-aid counsel, a trauma responder, a pacifist, a service professional. As long as it has domain expertise, a chatbot can be a stopgap measure and a first responder, often being available when the need is critical. How do we build such a chatbot?


Solution: Writing a chatbot starts with a deep learning model. This model is easier to build on some well-known machine learning platforms and with well-known algorithms. The model must be trained on a relevant dataset. It must also be tuned to serve satisfying responses. If the model and its evaluation remain a black box, it will be hard to make it perform well. That is why this article describes how to build and train such a model from scratch.


A chatbot is well served by a sequence-to-sequence model. More information about this type of model can be found in the documents listed under the reference section, but at a high level, such models work with sequences rather than with the symbols that constitute a sequence. Therefore, the model does not need to know what the parts of the sequence represent, or whether they are words or video frames; it can even infer the meaning of those symbols. When raw data is shredded into sequences, this model keeps state information per sequence that it infers from that sequence. This state is the essence of the sequence. Using this state, the model can translate or interpret input sequences (text) into output sequences (responses). One of the popular sequence-to-sequence architectures is the Recurrent Neural Network, or RNN for short. The RNN encoder-decoder model was proposed by Bahdanau et al. in 2014, and it can be used to write any kind of decoder that generates custom output, which makes it suitable for a wide variety of usages.


This model is built on the following premise. It dissects the text into timesteps and encodes internal state pertaining to those timesteps. The context is learned from the sequence, which is the semantic content and the basis for any follow-up. Neurons help remember information and expose just enough of it to build a context. A sequence database helps stash the slices of ordered elements as sequences from the sentences. Given a support threshold as a metric, the model finds the complete set of frequent subsequences. If the addition of an element to a sequence makes it infrequent, then none of its super-sequences will be frequent. With the help of such a sequence, a follow-up can be made in the form of an interpretation using the corresponding sequence generation. The state is decoded, and a new output sequence is generated. This forms the response of the chatbot.
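
The pruning rule above is the Apriori property applied to sequences. The following toy sketch, with an assumed chat corpus and support threshold, shows how that property keeps the search small: only subsequences that are already frequent are ever extended.

def is_subsequence(pattern, seq):
    # True if pattern occurs in seq in order (not necessarily contiguously).
    it = iter(seq)
    return all(tok in it for tok in pattern)

def mine(sequences, min_support, alphabet):
    # Level 1: frequent single elements.
    frequent = [(tok,) for tok in alphabet
                if sum(is_subsequence((tok,), s) for s in sequences) >= min_support]
    results = list(frequent)
    while frequent:
        # Extend only the survivors: an infrequent subsequence can have
        # no frequent super-sequence, so everything else is pruned.
        candidates = [p + (tok,) for p in frequent for tok in alphabet]
        frequent = [c for c in candidates
                    if sum(is_subsequence(c, s) for s in sequences) >= min_support]
        results.extend(frequent)
    return results

chats = [("hi", "how", "are", "you"), ("hi", "are", "you", "ok"), ("hi", "you")]
print(mine(chats, min_support=2, alphabet={"hi", "are", "you"}))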


Code for this sequence-to-sequence analysis is available with this article. Machine learning frontends such as TensorFlow make it easy to load a saved model and use it from any client, while Keras in a Colab-like environment can help train the model independently and save it for future use.
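
As a starting point, here is a minimal sketch of such an encoder-decoder in Keras, assuming tokenized question/response pairs are already available; the vocabulary size, layer sizes, and variable names are illustrative rather than the article's published code.

from tensorflow import keras
from tensorflow.keras import layers

vocab_size, latent_dim = 5000, 256  # assumed vocabulary and state sizes

# Encoder: consume the input sequence and keep only its final state,
# the "essence" of the sequence described above.
enc_inputs = keras.Input(shape=(None,), name="encoder_tokens")
enc_emb = layers.Embedding(vocab_size, latent_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generate the response conditioned on the encoder state.
dec_inputs = keras.Input(shape=(None,), name="decoder_tokens")
dec_emb = layers.Embedding(vocab_size, latent_dim)(dec_inputs)
dec_out, _, _ = layers.LSTM(latent_dim, return_sequences=True,
                            return_state=True)(dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(vocab_size, activation="softmax")(dec_out)

model = keras.Model([enc_inputs, dec_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit([questions, responses_in], responses_out, epochs=...)
# model.save("chatbot_seq2seq")  # reload later from a serving client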


Friday, February 5, 2021

SIEM continued ...

Endpoint protection now has a recently affirmed practice of using a variety of intelligent, lightweight sensors that capture and record all relevant endpoint activity, ensuring true visibility across the environment. They may come with a small footprint, no reboot, no daily AV definitions, no user alerts, no impact on the endpoints, and protection for both offline and online access. The use of distributed sensors also implies a centralized analysis service that can be hosted in the cloud so that it can scale arbitrarily. Together, the sensors and services let this kind of SIEM crunch a large amount of data. By correlating billions of events in real time and applying graph-based techniques, it can quickly draw a link between events and adversary activity. It is a powerful and massively scalable graph database that can be used with machine learning techniques to detect patterns. This makes SIEM stand out as a special-purpose platform that is not integrated with general-purpose IT platform software-as-a-service.
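
As an illustration of the graph-based idea, here is a small sketch, assuming events that share entities such as hosts, users, or file hashes; networkx stands in for the massively scalable graph store that such products actually use.

import networkx as nx

events = [
    {"id": "e1", "host": "web01", "user": "svc"},
    {"id": "e2", "host": "web01", "hash": "abc123"},
    {"id": "e3", "host": "db02",  "hash": "abc123"},
    {"id": "e4", "host": "dev17", "user": "alice"},
]

g = nx.Graph()
for ev in events:
    for key, value in ev.items():
        if key != "id":
            g.add_edge(ev["id"], f"{key}:{value}")  # link event to entity

# Events in one connected component share a chain of entities and may
# describe a single adversary activity; e1-e3 link via web01 and abc123.
for component in nx.connected_components(g):
    print(sorted(n for n in component if n.startswith("e")))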

With the use of data mining algorithms and machine learning packages, the analytics has improved in ways that go beyond traditional processing. Operational data no longer finds appeal in relational storage, even though the analysis there is simpler. Events, and products that build on events, are increasingly taking over the analysis and providing insights that are talked about and shared with visualizations from companies that specialize in the layers that render charts and graphs.

There is no saying exactly how the intelligence over these events will evolve, but it is likely to occur in the initiatives mentioned before becoming integrated with more established platforms in IT.

Among the emerging trends, there is a shift toward machine data collection on edge servers. This provides a unique and growing field of innovation, much like the one cloud computing has provided. Whether on endpoints or in edge computing, sophisticated and connected sensors will empower centralized threat analysis and pre-emptive measures.


Thursday, February 4, 2021

SIEM continued ...

The endpoint protection technique can be elaborated this way. It helps a company defend against Internet-based breaches and data losses. It provides barriers against malware, data loss, and theft, and mitigates network intrusion. The type and number of endpoints; how they are hosted – on-site, in a virtualized environment, or in the cloud; the management tools required, whether on-site, remote, or mobile; and the performance and support expectations determine the choice of vendor for endpoint protection. Reviewers of endpoint protection technologies indicate that the size of a company does not matter to the endpoints being protected; endpoint protection typically scales to hundreds or thousands of endpoints. An endpoint device is an Internet-capable computer hardware device on a TCP/IP network with an address and port to which clients can connect. This can be any web service or application of any size that can be reached over, say, HTTP or HTTPS. The devices hosting the applications can be cloud-based servers, on-site web farms, desktop computers, laptops, smartphones, tablets, thin clients, printers, or other specialized hardware such as POS terminals and smart meters.

Policies are associated with endpoints, and these are managed as network rules and firewalls within an organization. A system administrator may divide the network, secure access via firewalls, disable ports, and establish static rules to prevent undesirable access to devices hosting endpoints. One of the techniques used to protect endpoints is an HTTP proxy. As a man in the middle, it does not require any invasion of the server offering the services and can perform the same mitigations that could have been taken on that server. This proxy monitors and measures incoming traffic to the advantage of the services behind it. The proxy can support not only relay behavior but also filtering, and proxies support a promiscuous mode of listening. Proxies can also be forward or reverse: the former helps with anonymity in that it retrieves resources from the web on behalf of the users behind the proxy, while a reverse proxy secures the resources of the corporation from outside access. A reverse proxy can do several things such as load-balancing, authentication, decryption, or caching. SSL acceleration is another option, where the proxy enables hardware acceleration and a central place for SSL connectivity for clients.
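
To make the relay behavior concrete, here is a minimal reverse-proxy sketch in Python, assuming a hypothetical internal service address; a production proxy would add filtering, authentication, TLS termination, and caching on top of this.

from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import urllib.request

UPSTREAM = "http://127.0.0.1:8080"  # assumed internal service

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Relay the request path to the upstream service; filtering
        # rules could reject the path before this call.
        with urllib.request.urlopen(UPSTREAM + self.path) as resp:
            body = resp.read()
        self.send_response(resp.status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Clients connect here; the internal service is never exposed.
    ThreadingHTTPServer(("", 8000), ProxyHandler).serve_forever()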


Wednesday, February 3, 2021

SIEM continued ...

The business sources of data have come in the majority from enterprise operations data, followed by the IT warehouse for advanced trending, business application owner data, executive dashboards, and security/audit compliance systems.

The application of AIOps to Security Information and Event Management (SIEM) requires a special mention. It is used to support early attack detection, investigation, and response. There are several approaches to meeting these requirements, including a log index store, a metrics time-series database, sensors and intelligent endpoint protection, network and security monitoring, and their corresponding intelligence in stores, collection agents, analysis, and reporting stacks. Not everyone looks for all these features to be present in a SIEM solution, and some vendors find a niche market with their offerings. Integration of SIEM products has been particularly difficult because of restrictions and limitations in the standardization of techniques and commodity software. Some deployments are forced to be software-as-a-service with multiple tenants. Solutions also span hybrid clouds and private clouds.

Technologies used in SIEM do not belong to the platform because they serve a dedicated purpose, yet platforms are commonly used to manage IT operations because they span all aspects, such as the CMDB, incidents and requests via ITSM (service management), and alerts and events via ITOM (operations management). SIEM cannot be included in the portfolio of a platform without integration, and most of these products are specialized or prefer to have their own management interface. The SIEM products are also aggressively taking on both the ingestion of events from all services monitored via a platform and the use of specialized or general-purpose machine learning algorithms.

Some of the techniques in SIEM need to be called out for the impact they make and the justification for keeping them out of a platform. These include endpoint protection techniques. Earlier, viruses used to be the major form of attack on security, exploiting the vulnerabilities existing in systems, and they were largely dealt with by firewalls, host scanning and sanitization, and policies including software control. Endpoint protection has changed that game. It is now a cloud-hosted service that is not confined to a platform or its resources, supports its own stack, and can scale to any number of events.

Tuesday, February 2, 2021

SIEM continued ...

The integration of tools also depends on cloud support. Most event processing automation, such as IT process automation, service desks for trouble ticketing, and the CMDB, requires or supports some form of cloud computing. These are predominantly hybrid cloud or virtualized infrastructure, microservices/containers, AWS, Azure, and Google Cloud, usually in that order. This is a growth area for all event processing platforms, and cloud vendors are increasingly providing a generous toolset for the integration of applications and the monitoring of resources. The landscape for advanced IT operations analytics shows a need for discovery, dependency mapping, and automation of applications with their integrations. These dependencies are discovered via a mix of both agent-based and agentless techniques. Native discovery contributors are primarily containers in a private cloud, the layer-3 logical layer, data center elements and component detail, virtualized environments in a private cloud, and microservices in a private cloud.

These challenges from the variety of data types and integrations indicate that the data is not always collected in a form in which it can readily be analyzed for insights. While preparation of the data is the predominant time-consuming factor, the discovery of the data and the investment in automation to keep it coming are even more so. On the bright side, the analysis may take much less time, and platforms are best suited to define the automation for data collection, the preparation of the data, and the heuristics to help with the analysis phase.

AIOps deals with the following requirements on triage:

1. Isolate whether the problem is within the application, server, network, or database
2. Investigate across virtualized systems
3. Isolate infrastructure issues internal to systems, those within the database, and those within the storage
4. Investigate across application tiers
5. Isolate middleware issues
6. Isolate infrastructure issues in the network
7. Isolate infrastructure issues within the public cloud
8. Gain visibility from the branch into issues such as QoS, and others

Monday, February 1, 2021

SIEM continued ...

The intelligence in event monitoring comes with a good-quality model that has been honed on a large dataset from a variety of sources and tuned to behave very well on the data in its domain. Even streams of continuous events have a historical section and an active front of new events. By periodically running on the ever-increasing historical sections, a model can be improved to provide insights into the events as they occur. This is still not real time, in that it is still executing on a historical batch of events. Streaming algorithms that catch up on the historical events and adapt to newer events while being used for predictions are not as prevalent as the batch-oriented or model-prediction approach, but they open the possibility of a new form of analysis, and much of the machine data is suitable for a stream abstraction. This does not prevent us from using a model-prediction approach, where we can make the model as sophisticated as necessary and use it to make both short-term and long-term predictions. The Microsoft Time-Series algorithm is an example of using historical data to continuously make predictions on incoming data. It has two forms, where one is used for short-term predictions and the other for long-term predictions; it is even possible to blend the two. The short-term algorithm is an autoregressive tree model for representing periodic time-series data, while the long-term algorithm is an autoregressive integrated moving average.
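
The Microsoft algorithm itself ships with SQL Server Analysis Services, so it is not reproduced here; the long-term half of the idea can be sketched with a generic ARIMA fit, assuming a pandas series of per-interval event counts (the file name and model order are illustrative).

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Assumed CSV of timestamped event counts, one row per interval.
counts = pd.read_csv("event_counts.csv", index_col=0,
                     parse_dates=True).squeeze("columns")

model = ARIMA(counts, order=(2, 1, 1)).fit()
print(model.forecast(steps=5))    # short horizon: the next five intervals
print(model.forecast(steps=60))   # long horizon: the next sixty intervals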

The challenge with event monitoring lies with the type of data and the integration of the test tool rather than with the analysis or prediction using the intelligence mentioned above. Data comes in many forms, from a variety of sources, and is received by systems that are deployed in one of several ways. Let us take a few examples of each. The systems that receive the data are deployed in one of the following options: SaaS is the predominant deployment option, with well over 80% of the cases, followed by the on-premises option and then by privately hosted deployment, which is in the same ballpark of popularity. Multi-tenancy options follow next, lagging only so much behind the others. Each of these deployment options requires versatility from the intelligence added to monitoring. Managed full-service solutions are always preferred over the complexity of the more functionally rich suites. It is even typical to see 0.5 to 1.5 full-time resources assigned to the administration of these solutions.

The type of data varies far more in number than the deployment options. Consider the following classes:

1. Data or alerts collected almost universally for monitoring the performance of applications and services in DevOps; this type pertains to performance-related events.
2. Time-series events, such as metrics, which demand their own solution stack.
3. Log files or access records, a third type that justifies its own index and analysis stacks if they are not archived or left lying around.
4. Security-related events, which are also time-series but are treated differently from others for the alerts they need to raise.
5. Transaction-related events, such as those from customers, which determine application performance but hold special significance for the business as opposed to internal-facing operational data.
6. Internal configuration or topology events.
7. Unstructured data, the default classification for anything that is not transactional; it comprises data treated in batch, micro-batch, and streaming modes.
8. Data collected from agents for systems, such as by telemetry, instrumentation, or proprietary byte codes.
9. Text data exported out of disparate systems, usually in the form of comma-separated values.
10. Web requests and responses, including those made from client-side scripts in the browser, which form their own class because they are usually in the clear or encrypted per request.
11. Data sent over the web by the Internet of Things.
12. SIEM data, which is different from the others.
13. Network traffic, such as flow-related data, packets, or wire data.
14. Web proxy data; proxies have gained a lot of popularity as a gateway to services and are of interest for both analytics and troubleshooting purposes.
15. Infrastructure-as-code or software-defined stacks, which require integrations because of the tools dedicated to them.

Only a quarter of the hundreds of tools available for these data types are put to integration with automation.