Thursday, September 28, 2023

This is a continuation of a previous article on the use of Artificial Intelligence in product development. This article discusses the bias concerns raised against AI, as outlined in reputable journals.

To summarize the concerns: some of the bias stems from inaccurate information produced by generative AI, and some from bias served up by the AI tools themselves. Both can be mitigated with a wider range of datasets; AI4ALL, for instance, works to feed AI a broad range of content so that it is more inclusive of the world. Another concern has been over-reliance on AI. A straightforward way to address this is to balance the use of AI with tasks that require skilled human supervision.

A methodical approach to managing bias involves three steps: first, decide on the data and the design; second, check the outputs; and third, monitor for problems on an ongoing basis.

Complete fairness is impossible, in part because decision-making committees are rarely diverse enough, and because choosing an acceptable threshold for fairness and deciding whom to prioritize are hard problems in themselves. That makes a single blueprint for fairness in AI, applicable across companies and situations, daunting. An algorithm can check for adequate representation or apply a weighted threshold, and both are in common use, but unless equal numbers of each class appear in the input data, these selection methods are mutually exclusive, so the choice of approach is critical. Along with choosing the groups to protect, a company must determine which issue is most important to mitigate: differences could stem from the sizes of the groups or from the accuracy rates between them. These choices can form a decision tree, and the decisions taken must align with company policy.
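
As an illustration of the representation check mentioned above, here is a minimal sketch in Python, assuming a pandas DataFrame with a protected-attribute column and a model-decision column; the column names, the sample data, and the four-fifths-style threshold are assumptions, not a prescription.

# Minimal sketch of a representation check on model decisions per group.
# The 0.8 threshold mirrors the common four-fifths rule of thumb; it is an assumption here.
import pandas as pd

df = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B", "B"],
    "selected": [1,   1,   0,   1,   0,   0,   1,   0],
})

selection_rates = df.groupby("group")["selected"].mean()
reference = selection_rates.max()

for group, rate in selection_rates.items():
    ratio = rate / reference
    flag = "OK" if ratio >= 0.8 else "REVIEW"
    print(f"group={group} selection_rate={rate:.2f} ratio_to_best={ratio:.2f} -> {flag}")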

Missteps remain common. Voice recognition, for example, can leverage AI to reroute sales calls but might be prone to failures with regional accents. In this case, fairness could be checked by creating a more diverse test group. The final algorithm and its fairness tests need to consider the whole population and not just those who made it past the early hurdles. Model designers must accept that data is imperfect.
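
A hedged sketch of that check: slice the evaluation results of a held-out, accent-diverse test set by group and compare accuracy. The labels and group names below are invented placeholders.

# Illustrative sketch: compare call-routing accuracy across accent groups on a test set.
from collections import defaultdict

y_true = ["sales", "support", "sales", "billing", "sales", "support"]
y_pred = ["sales", "support", "billing", "billing", "support", "support"]
accent = ["A",     "A",       "B",       "B",       "B",       "A"]

correct = defaultdict(int)
total = defaultdict(int)
for t, p, g in zip(y_true, y_pred, accent):
    total[g] += 1
    correct[g] += int(t == p)

for g in sorted(total):
    print(f"accent group {g}: accuracy {correct[g] / total[g]:.2f} over {total[g]} calls")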

The second step, checking outputs, involves examining fairness across the intersections and overlaps of data types. Even when companies have good intentions, an ill-considered approach can do more harm than good: an algorithm deemed neutral can still have a disparate impact on different groups. One effective strategy is a two-model solution in the style of generative adversarial networks, which balances the original model against a second model that checks fairness at the level of individuals. The two converge toward a more appropriate and fair solution.
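
One well-known instance of this two-model idea is adversarial debiasing, where a second network tries to recover the protected attribute from the first model's output and the first model is penalized whenever it succeeds. The sketch below is a hypothetical, minimal version using PyTorch and synthetic data; the layer sizes, trade-off weight, and training schedule are all assumptions rather than a recommended configuration.

# Hypothetical adversarial-debiasing sketch with synthetic data.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 512, 10
X = torch.randn(n, d)                      # synthetic features
z = (torch.rand(n, 1) > 0.5).float()       # synthetic protected attribute
y = ((X[:, :1] + 0.5 * z) > 0).float()     # synthetic task label, correlated with z

predictor = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
adversary = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
opt_p = torch.optim.Adam(predictor.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
lam = 0.5  # fairness/accuracy trade-off; an assumed value

for step in range(200):
    # Train the adversary to recover the protected attribute from the predictor's output.
    with torch.no_grad():
        scores = torch.sigmoid(predictor(X))
    loss_a = bce(adversary(scores), z)
    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    # Train the predictor to solve the task while keeping the adversary guessing.
    logits = predictor(X)
    loss_task = bce(logits, y)
    loss_fair = bce(adversary(torch.sigmoid(logits)), z)
    loss_p = loss_task - lam * loss_fair
    opt_p.zero_grad()
    loss_p.backward()
    opt_p.step()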

The third step is to create a feedback loop. It is important to examine the output frequently and look for suspicious patterns on an ongoing basis, especially where the input evolves over time; because bias usually goes unnoticed, this is often the only way to catch it. A fully diverse outcome can look surprising, so people may unintentionally reinforce bias while developing AI: with rare events, for example, people may object when one occurs but raise no objection when it fails to happen. Metrics such as precision and recall, tracked per group, can be helpful, since both the predictive factors and the error rates are affected by bias. Ongoing monitoring pays off; demand forecasting, for example, can show improved accuracy as the model adapts to changes in data and corrects for historical bias.
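
For the feedback loop, those metrics can be recomputed per group on every new batch of outcomes and compared against the previous run. A small sketch with scikit-learn, where the batch data and the 0.1 drift threshold are invented for illustration:

# Illustrative monitoring sketch: recompute precision and recall per group for each
# new batch of predictions and flag large swings.
from sklearn.metrics import precision_score, recall_score

def group_metrics(records):
    """records: list of (group, y_true, y_pred) tuples."""
    out = {}
    for group in {g for g, _, _ in records}:
        y_true = [t for g, t, _ in records if g == group]
        y_pred = [p for g, _, p in records if g == group]
        out[group] = (precision_score(y_true, y_pred, zero_division=0),
                      recall_score(y_true, y_pred, zero_division=0))
    return out

last_week = [("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("B", 1, 1), ("B", 0, 0), ("B", 1, 0)]
this_week = [("A", 1, 1), ("A", 0, 1), ("A", 1, 1), ("B", 1, 0), ("B", 0, 0), ("B", 1, 0)]

previous, current = group_metrics(last_week), group_metrics(this_week)
for group in current:
    for name, prev, curr in zip(("precision", "recall"), previous[group], current[group]):
        if abs(curr - prev) > 0.1:
            print(f"group {group}: {name} moved from {prev:.2f} to {curr:.2f} - investigate")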

A conclusion is that bias may not be eliminated but it can be managed.

 


Wednesday, September 27, 2023

 

This is a continuation of a previous article on AI for product development.  Since marketing is one of the core influences on product development, this article reviews how AI is changing marketing and driving rapid business growth.

Marketers use AI to create product descriptions. Typically, the words and phrases come from research on the target audience, but when marketers reuse them over and over, the copy becomes repetitive. AI rephrasing tools can help teams find new ways of describing the most prominent features of their products.
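
As a rough sketch of such a rephrasing step, assuming the 0.x OpenAI Python SDK that was current at the time of writing, with the API key, model choice, and product description as placeholders:

# Hypothetical rephrasing helper using the OpenAI chat API (0.x SDK interface).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

description = "A lightweight, waterproof hiking jacket with sealed seams."
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You rewrite product copy without changing facts."},
        {"role": "user", "content": f"Give three fresh ways to phrase: {description}"},
    ],
)
print(response.choices[0].message.content)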

Content marketers are often caught up in creating more content, but it is equally important to optimize the content already on the site. As content ages, it becomes dated and less useful, which lowers its ranking on search engine results pages (SERPs). Given a particular URL, AI can report which keywords that URL is ranking for and which keywords need a boost, helping marketers get more out of what they already have.

AI is used most heavily in data analytics. Measuring the performance of various content types, campaigns, and initiatives used to be time-consuming simply because the data had to be sourced from many origins and the tools varied widely. Now teams can quickly pull and analyze the data they are interested in. Business Intelligence teams continue to tackle complex data, but it has become much easier for most users to get started with data analytics.

AI can also help optimize marketing activities by providing insights into customer behavior and preferences, identifying trends and patterns, and automating processes such as content creation, customer segmentation and more. AI initiatives achieve better results and help the marketing strategy better connect with the customers.

Website building, personalized targeting, content optimization, and even chatbot assistance for customer support are well-known areas for AI-based enhancements. AI content generation can help accelerate content creation, but fact-checking the information in articles and ensuring that the messaging and tone align with the brand voice continue to require human supervision.

The adage of the right tool for the right job holds truer than ever for AI applications. Technology and infrastructure can evolve with the business as it grows, and long-term investments certainly help establish a practice. Text-to-text and text-to-image generators have been popularized by tools like ChatGPT and DALL-E 2, which make use of large language models, natural language processing, and artificial neural networks. The caveat is that different tools are trained on different models. It is also possible to mix and match, for example using ChatGPT to create a prompt and then passing that prompt to DALL-E 2 or Midjourney. Social media platforms like Facebook and Instagram offer ad targeting and audience insights, and email marketing platforms like Mailchimp provide AI-powered recommendations for subject lines and send times.
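
The mix-and-match idea can be sketched as a two-step call: ask the chat model to draft an image prompt, then pass it to the image endpoint. The calls below assume the 0.x OpenAI Python SDK; the model names, image size, and campaign brief are placeholders.

# Hypothetical two-step pipeline: a chat model drafts an image prompt, then the
# image endpoint renders it.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

brief = "Autumn sale banner for a hand-made ceramics shop, warm and minimal."
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Write a one-sentence DALL-E prompt for: {brief}"}],
)
image_prompt = chat.choices[0].message.content.strip()

image = openai.Image.create(prompt=image_prompt, n=1, size="1024x1024")
print(image["data"][0]["url"])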

Some of the bias against AI comes from inaccurate information produced by generative AI; some comes from bias served up by the AI tools themselves. Both can be mitigated with a wider range of datasets; AI4ALL, for instance, works to feed AI a broad range of content so that it is more inclusive of the world. Another concern has been over-reliance on AI. A straightforward way to address this is to balance the use of AI with tasks that require skilled human supervision.

 

Tuesday, September 26, 2023

 Continued from previous post...

Third, AI can change how customer feedback is collected. A minimum viable product is nothing more than a good start; a feedback loop with the target audience is essential to taking it to completion. Until recently, product analytics has been largely restricted to structured or numerical data. Leading AI experts argue that this is merely 20% of the data, with the remainder held by companies as unstructured content in the form of documents, emails, and social media chatter. AI is remarkably good at analyzing large amounts of such data and even benefits from being tuned with more training data. Focus groups, by contrast, are not always accurate representations of customer sentiment, which leaves the product team vulnerable to building a product that does not serve its customers well. These same experts also make a case for generative AI helping to convert customer feedback into data the business can act on.
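
As one hedged example of turning unstructured feedback into data, the Hugging Face transformers sentiment pipeline can score free-text snippets so they aggregate like structured fields; the default model and the sample snippets are assumptions.

# Illustrative sketch: score free-text customer feedback so it can be aggregated
# alongside structured data.
from transformers import pipeline

feedback = [
    "Setup took five minutes, love the new dashboard.",
    "The export feature keeps timing out and support never replied.",
    "Pricing is fair but the mobile app feels unfinished.",
]

classifier = pipeline("sentiment-analysis")
for text, result in zip(feedback, classifier(feedback)):
    print(f"{result['label']:>8} ({result['score']:.2f})  {text}")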

Fourth, AI can help redefine the way teams develop products, starting with how engineers and product managers interact with their software. In the past, professionals were trained on software product suites until they became designated experts who understood how each piece worked and passed that knowledge on through training. With AI, new team members can be onboarded rapidly by letting the AI generate the necessary boilerplate or prefabricated units, or by giving them a more interactive way to get help on software and hardware tools. What used to be wire diagrams and prototyping can now be replaced by design examples produced from constraints supplied to chatbots. Because the interface is as natural as a chat conversation, users need to know nothing about the internals of machine learning.

Finally, AI helps with creativity as well. Machine learning algorithms are already used to learn patterns that transform inputs to outputs and then apply those patterns to unseen data. The newer generative models take this a step further by encoding state across the continuous stream of inputs, which not only helps them grasp things such as sentiment but also lets them generate suitable output without necessarily interpreting each input unit of information. This is at the core of capturing how a software engineer creates software, a designer creates a design, or an artist creates art.

By participating in the thinking behind the creation, AI is poised to extend human abilities past their current limits. Terms like co-pilot are beginning to be used to describe this cooperative behavior as it comes to the aid of product managers, software engineers, and designers.

The ways in which AI and humans can improve each other in the course of developing a product form a horizon full of possibilities, and some trends are already being embraced in the industry. Customer experience is shifting toward self-service with a near-human feel via interactive chat, and industrial applications that leveraged machine learning models are actively replacing their v1.0 models with generative v2.0 models. More interactive and engaging experiences, in the form of recommendations spanning content, products, and frameworks, are certainly being envisioned. By virtue of both the data and the analysis models, AI can not only improve but redefine the product development process.

Experimentation at various scopes and levels is one way to deepen our understanding of the role AI can play, and it is getting much easier to start. It is even possible to delegate the machine learning know-how to tools that work across programmatic interfaces regardless of the purpose or domain of the application. Just as prioritizing use cases was a way to improve a product's return on investment, AI initiatives must be deliberated to identify the high-value engagements. Similarly, leadership and stakeholder buy-in is necessary to articulate the value added in the bigger picture and to answer questions that dispel concerns such as privacy and data leakage. When making the case to leadership for investment, it helps to frame AI's role as that of a trusted co-pilot. Lastly, the risks of not investing in AI can also be called out.

 


Monday, September 25, 2023

AI and Product development - Part 1.

 


This article focuses on the role of Artificial Intelligence in product development. In both business and engineering, new product development covers the complete process from concept to realization and market introduction. Many interdisciplinary endeavors go into getting a product, and thereby a venture, off the ground. A central aspect of this process is product design, which involves various business considerations and is broadly described as the transformation of a market opportunity into a product available for sale. A product is meant to generate income, and technology companies leverage innovation in a rapidly changing market. Cost, time, and quality are the main variables that drive customer needs. Business and technology professionals find product-market fit one of the most challenging aspects of starting a business, and startups are often hard-pressed to see this long and expensive process through. This is where Artificial Intelligence holds promise for startups and SMBs.

Since product design involves predicting the right product to build and investing in prototypes, experimentation, and testing, Artificial Intelligence can help us be smarter about navigating the product development course. Research studies cite that 35% of SMBs and startups fail because there is no market need. AI-powered data analysis can help them be more accurate, giving a well-rounded view of quantitative and qualitative data to determine whether the product will meet customer needs, or even whether the right audience has been selected in the first place. Collecting and analyzing data are strengths of AI, and in this case they help companies connect with their customers at a deeper level. One such technique is latent semantic analysis, which helps articulate what customers really need; techniques of this kind, along with classifiers such as softmax-based models, saw little use in mainstream product work until around 2013. The traditional way of creating software products, especially when it was technology-driven, contributed to the high failure rate, and this is an opportunity to correct that.
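
Latent semantic analysis itself is straightforward to prototype; a small sketch with scikit-learn, using a handful of invented feedback lines and an assumed two-topic decomposition:

# Minimal latent semantic analysis sketch: TF-IDF followed by truncated SVD, which
# surfaces co-occurring themes in customer comments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "battery dies too fast on long trips",
    "battery life is the main complaint",
    "love the camera quality in low light",
    "camera is great but photos fill storage quickly",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

lsa = TruncatedSVD(n_components=2, random_state=0)
lsa.fit(X)

terms = tfidf.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top = component.argsort()[::-1][:3]
    print(f"theme {i}: " + ", ".join(terms[j] for j in top))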

Second, AI shortens iteration and time-to-market cycles by plugging into CI/CD pipelines and their reports. Mockups and prototypes often take a few weeks at the least as teams work through friction and unexplored territory, which is a long time for all participants in the process to wait to see the same outcome, and the time and money spent creating and testing a prototype can end up sinking the initiative. If this period could be collapsed through better insight into what works and what doesn't, reprioritized effort toward realizing the product, closer alignment with a strategy that has a better chance of success, and the avoidance of waste or unsatisfactory returns, the net result would be shorter, faster product innovation cycles.

One specific ability of AI deserves attention in this regard: so-called generative AI can create content from scratch with high speed and even accuracy. This ability is easily seen in copywriting, which can be considered a content production strategy; in copywriting, however, the goal is to convince the reader to take a specific action, achieved through a persuasive character and triggers that arouse the reader's interest to generate conversions and sales. Copywriting is also an essential part of a digital marketing strategy, with the potential to increase brand awareness, generate higher-quality leads, and acquire new customers. Good copywriting articulates the brand's messaging and image while tuning into the target audience, a process that has parallels to product development. AI has demonstrated the potential to generate such content from scratch; the gap between content writing and copywriting remains for product developers to fill.

Sunday, September 24, 2023

 

Apache Cassandra is an open-source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance; its linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data. Azure Managed Instance for Apache Cassandra brings this database to Azure as a managed offering.

Azure Managed Instance for Apache Cassandra is a distributed database environment delivered as a managed service: it automates the deployment, management (patching and node health), and scaling of nodes within an Apache Cassandra cluster. It also supports hybrid clusters, so Apache Cassandra datacenters deployed in Azure can join an existing on-premises or third-party hosted Cassandra ring. The service is deployed using Azure Virtual Machine Scale Sets.

However, Cassandra is not limited to any one compute platform. Kubernetes, for example, runs distributed applications, and Cassandra and Kubernetes can be run together; one advantage is the use of containers and another is interactive management of Cassandra from the command line. The Azure Managed Instance for Apache Cassandra, by contrast, allows only a limited form of the connection and interactivity needed to manage the Cassandra instance. Most database administration goes through the Azure command-line interface, which uses the invoke-command option to pass the actual commands to the Cassandra instance. There is no native invocation of commands directly against a node's IP address, because the Azure Managed Instance for Apache Cassandra does not create nodes with public IP addresses; to connect to a newly created Cassandra cluster, one needs to create another resource inside the VNet. This could be an application, or a virtual machine with Apache's open-source query tool CQLSH installed. The Azure Portal also provides connection strings with the credentials needed to connect to the instance with this tool. Native access to Cassandra is therefore not limited to the nodetool and sstable commands permitted via the Azure CLI options. CQLSH is a command-line shell for interacting with Cassandra using CQL (the Cassandra Query Language); it ships with every Cassandra package, can be found in the bin/ directory, and is implemented with the Python native protocol driver, connecting to the single specified node, which greatly reduces the overhead of managing the Cassandra control and data planes.
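
From an application or virtual machine inside the same VNet, the connection can also be made with the DataStax Python driver instead of CQLSH. A hypothetical sketch, with the host IP, credentials, and certificate handling as placeholders (the managed instance typically requires an encrypted client connection, hence the SSL context):

# Hypothetical connection from a resource inside the VNet using cassandra-driver.
import ssl
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ssl_context.check_hostname = False       # placeholder: load the proper CA bundle in real use
ssl_context.verify_mode = ssl.CERT_NONE  # placeholder only, not for production

auth = PlainTextAuthProvider(username="cassandra-user", password="from-key-vault")
cluster = Cluster(["10.0.0.5"], port=9042, auth_provider=auth, ssl_context=ssl_context)

session = cluster.connect()
for row in session.execute("SELECT keyspace_name FROM system_schema.keyspaces"):
    print(row.keyspace_name)
cluster.shutdown()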

Containers are a blessing for developers deploying applications in the cloud, and Kubernetes helps with container orchestration. Unlike managed Kubernetes instances in Azure, where a client can configure the .kubeconfig file using az cli get-credentials and kubectl context-switch commands, the Azure Managed Instance for Apache Cassandra does not offer kubectl access. Containers do, however, help with adding or removing nodes in the Cassandra cluster via the cassandra.yaml file, found in the /etc/cassandra folder on each node. Nodes cannot be accessed directly from the Azure Managed Instance for Cassandra, so a shell prompt on the node is out of the question; the nodetool bootstrap option is also unavailable via invoke-command, but it is possible to edit this file. One of its most important properties is the option to set seed providers for existing datacenters, which allows a new node to become ready quickly by importing the necessary information from the existing datacenter. The seed provider must not point to the new node itself but to an existing node.
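
For illustration only, the seed_provider section follows the standard Apache Cassandra layout; the sketch below shows the edit with PyYAML, with the file path and seed IP addresses as placeholders, bearing in mind that on the managed instance such a change has to go through the supported management surface rather than a direct shell on the node.

# Hypothetical sketch of pointing a new node's seed_provider at existing datacenter nodes.
# The structure matches standard Cassandra configuration, not an Azure-specific format.
import yaml

path = "/etc/cassandra/cassandra.yaml"  # placeholder path
with open(path) as f:
    conf = yaml.safe_load(f)

conf["seed_provider"] = [{
    "class_name": "org.apache.cassandra.locator.SimpleSeedProvider",
    # Seeds are existing nodes in the ring, never the node being added.
    "parameters": [{"seeds": "10.0.0.4,10.0.0.6"}],
}]

with open(path, "w") as f:
    yaml.safe_dump(conf, f)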

The Cassandra service on a node must be stopped before some commands are executed and restarted afterwards, and the database must be set to read-write for certain commands to run. These options can be passed as command-line parameters to the Azure command-line interface's managed-cassandra set of commands.

Saturday, September 23, 2023

 

This is a continuation of the previous articles on Azure Databricks usage and Overwatch analysis. While those articles discussed the configuration and deployment of Overwatch, the data ingested for analysis was assumed to come from the event hub, which in turn collects it from the Azure Databricks resource. This article discusses the collection of cluster logs, including the output of logging and print statements in the notebooks that run on the clusters.

The default cluster logs directory is 'dbfs:/cluster-logs'; the Databricks instance collects logs every five minutes and archives them every hour. The Spark driver logs are saved here, in a sub-directory named after each cluster, and the location is managed by Databricks. When a cluster is created to attach a notebook to, the user sets the cluster's logging destination to dbfs:/cluster-logs under the advanced configuration section of the cluster creation parameters.

The policy under which the cluster is created is also chosen by the user, and it can be administered so that users only create clusters compliant with a policy. In such a policy, the logging destination can be preset to a path like 'dbfs:/cluster-logs', or substituted with a path like '/mnt/externalstorageaccount/path/to/folder' when a remote storage location is provided, although the built-in location is preferable.
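
If the logging destination is to be enforced through a cluster policy, the relevant fragment would fix the log configuration attributes. The sketch below expresses the policy as a Python dict and submits it through the Databricks REST API; the attribute names, endpoint path, host, and token are assumptions to be verified against the workspace's current policy schema.

# Hypothetical cluster-policy fragment (policies are submitted as a JSON string);
# the dotted attribute paths are an assumption about the Databricks policy schema.
import json
import requests

policy_definition = {
    "cluster_log_conf.type": {"type": "fixed", "value": "DBFS"},
    "cluster_log_conf.path": {"type": "fixed", "value": "dbfs:/cluster-logs"},
}

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "dapi-placeholder"                                    # placeholder token

# Assumed endpoint for the Cluster Policies API; verify against the workspace API version.
resp = requests.post(
    f"{host}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json={"name": "enforce-cluster-logs", "definition": json.dumps(policy_definition)},
)
resp.raise_for_status()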

The Azure Databricks instance will transmit cluster logs along with all other opted-in logs to the event hub, and for that it requires a diagnostic setting specifying the namespace and the Event Hub to send to. Overwatch can read this Event Hub data, but reading from the dbfs:/cluster-logs location is not covered in the documentation.

There are a couple of ways to do that. First, the cluster log destination can be specified in the mapped-path parameter of the Overwatch deployment CSV, so that the deployment knows about this additional location to read data from. Although the documentation suggests the parameter was introduced for workspaces with more than fifty external storage accounts, it is possible to list just the one location that Overwatch needs to read. This option is convenient for the default location, but customers or the administrator must still ensure that clusters are created to send their logs there.

While the above works for new clusters, the second option works for both new and existing clusters: a dedicated Databricks job is created to read the cluster log locations and copy them to the location Overwatch reads from. This job would use a shell command such as 'rsync' or 'rclone' to perform a copy that can resume after intermittent network failures and report progress. Running periodically, it leaves the clusters unaffected and, alongside the Overwatch jobs, ensures that all relevant logs not covered by the streams to the Event Hub are also read by Overwatch.
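
A hedged sketch of such a copy job as a small Python task that shells out to rclone, assuming rclone is installed on the job cluster and both locations are reachable as local paths; the paths themselves are placeholders:

# Illustrative periodic copy job: sync cluster logs into the location Overwatch reads.
# rclone resumes interrupted transfers and reports progress, which is why it is chosen
# over a plain copy.
import subprocess

SOURCE = "/dbfs/cluster-logs"                      # where the clusters write their logs
DESTINATION = "/dbfs/mnt/overwatch/cluster-logs"   # where Overwatch is configured to read

result = subprocess.run(
    ["rclone", "copy", SOURCE, DESTINATION, "--progress", "--transfers", "8"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    raise RuntimeError(f"rclone copy failed: {result.stderr}")
print(result.stdout)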

Finally, the dashboards that report the analysis performed by Overwatch, which are also available out-of-the-box, can be scheduled to run nightly so that all the logs collected and analyzed are included periodically.


Friday, September 22, 2023

 

This is a continuation of the previous articles on Azure Databricks and Overwatch analysis. This section focuses on the role-based access control required for the setup and deployment of Overwatch.

The use of a storage account as a working directory for Overwatch implies that it must be accessible from the Databricks workspace. There are two ways to do this: one uses Azure Active Directory credentials passthrough with 'abfss://container@storageaccount.dfs.core.windows.net' name resolution, and the other mounts the remote storage account as a folder on the local file system.

The former requires the cluster to be enabled for Active Directory credentials passthrough and works for directly resolving the deployment and reports folders, but for content whose layout is determined dynamically, the resolution is expensive each time. The abfss scheme can also fail with a 403 error when tokens are demanded for certain activities. The second approach, mounting, is a one-time setup: the mount is created with the help of a service principal that obtains OAuth tokens from Active Directory, and the mount point becomes the prefix for all the temporary files and folders.
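
A sketch of that one-time mount with a service principal, following the standard ABFS OAuth configuration keys; the secret scope, application id, tenant id, container, and mount point are placeholders, and the snippet assumes it runs in a Databricks notebook where dbutils is predefined:

# Hypothetical one-time mount of the Overwatch working storage account. The client
# secret is read from a Databricks secret scope rather than hard-coded.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="overwatch-kv", key="sp-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://overwatch@storageaccount.dfs.core.windows.net/",
    mount_point="/mnt/overwatch",
    extra_configs=configs,
)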

Using credentials with Azure Active Directory works only when there are corresponding role assignments and container/blob access control lists. The role assignment for the control plane differs from that of the data plane, so roles are needed for both; this separation allows access to certain containers and blobs without necessarily granting the ability to change the storage account or container organization and management. With ACLs applied to individual files/blobs and folders/containers, authentication, authorization, and auditing are completely covered and scoped at the finest granularity.

Kusto queries like the following, run against the StorageBlobLogs diagnostic table, can then be very helpful:

1. Frequent operations can be queried with:

StorageBlobLogs
| where TimeGenerated > ago(3d)
| summarize count() by OperationName
| sort by count_ desc
| render piechart

2. High-latency operations can be queried with:

StorageBlobLogs
| where TimeGenerated > ago(3d)
| top 10 by DurationMs desc
| project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, ClientLatencyMs = DurationMs - ServerLatencyMs

3. Operations causing the most errors can be found with:

StorageBlobLogs
| where TimeGenerated > ago(3d) and StatusText !contains "Success"
| summarize count() by OperationName
| top 10 by count_ desc

4. The number of read transactions and the number of bytes read per container can be queried with:

StorageBlobLogs
| where OperationName == "GetBlob"
| extend ContainerName = split(parse_url(Uri).Path, "/")[1]
| summarize ReadSize = sum(ResponseBodySize), ReadCount = count() by tostring(ContainerName)