Friday, March 31, 2023

 

This is a continuation of the discussion on Azure Data Platform.

GitHub Actions are highly versatile and composable workflows that can be used to respond to events in the code pipeline. These events can be a single commit or a pull request that needs to be built and tested, or the deployment of a code change that has been pushed to production. While automations for continuous integration and continuous deployment are well-established practices in DevOps, GitHub Actions goes beyond DevOps by recognizing events such as the creation of a ticket or an issue. The components of a GitHub Actions workflow are the trigger for an event and the jobs that the workflow comprises. Each job runs inside its own virtual machine runner or inside a container, and has one or more steps that either run a script or invoke a reusable action.

A workflow is authored in the .github/workflows/ directory and specified as a Yaml file. It has a sample format like so:

name: initial-workflow-0001

run-name: Trying out GitHub Actions

on: [push]

jobs:

  check-variables:

    runs-on: ubuntu-latest

    environment: nonProd

    steps:

      - name: Initialize

        uses: actions/setup-node@v3

        with:

          node-version: '16'

      - name: Display Variables

        env:

             variable: 0001

        if: github.ref == 'refs/heads/main' && github.event_name == 'push'

        run: |

            echo 000-"$variable"

            echo 001-"${{ env.variable }}"

Whenever a commit is added to the repository in which the workflows directory was created, the above workflow is triggered. It can be viewed in the Actions tab of the repository, where the named job can be selected and its steps expanded to see the activities performed.

With this introduction, let us check out a scenario for using it with the Azure Data Platform. Specifically, this scenario calls for promoting the Azure Data Factory to higher environments with the following steps:

1.       A development data factory is created and configured with GitHub.

2.       Feature/working branch is used for making changes to pipelines, etc.

3.       When changes are ready to be reviewed, a pull request is created from the feature/working branch to the main collaboration branch.

4.       Once the pull request is approved and changes merged with the main collaboration branch, the changes are published to the development factory.

5.       When the changes are published, the data factory saves its ARM templates to the publish branch (adf_publish by default). A file called ARMTemplateForFactory.json contains the parameter names used for resources like key vault, storage, linked services, etc. These names are used in the GitHub Actions workflow file to pass resource names for different environments.

6.       Once the GitHub Actions workflow file has been updated to reflect the parameter values for the upper environment and the changes are pushed back to the GitHub branch, the GitHub Actions workflow is started manually and the changes are deployed to the upper environment (a sketch of this deployment step follows the observations below).

Some observations to make are as follows:

GitHub is configured on the development data factory only.

Integration Runtime names and types need to stay the same across all environments.

Self-hosted Integration Runtime must be online in upper environments before deployment or else it will fail.

Key Vault secret names are kept the same across environments, only the vault name is parameterized.

Resource naming needs to avoid spaces.

Secrets are stored in the GitHub Secrets section.

There are two environments – dev and prod in this scenario.

Dev branch is used as collaboration branch.

Feature1 branch is used as a working branch.

Main branch is used as publish branch.
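The final deployment step can be scripted in several ways; the following is a minimal Python sketch, assuming the azure-identity and azure-mgmt-resource packages, that deploys the exported ARMTemplateForFactory.json to a resource group in the upper environment. The subscription, resource group, factory, and vault names are hypothetical placeholders, and the parameter names shown are only illustrative of what the generated template exposes.

import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import Deployment, DeploymentProperties

# Hypothetical identifiers for the upper (prod) environment.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-dataplatform-prod"

client = ResourceManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# ARM template exported by the data factory on the publish branch.
with open("ARMTemplateForFactory.json") as f:
    template = json.load(f)

# Parameter names come from ARMTemplateForFactory.json; values differ per environment.
parameters = {
    "factoryName": {"value": "adf-contoso-prod"},
    "KeyVault_properties_typeProperties_baseUrl": {
        "value": "https://kv-contoso-prod.vault.azure.net/"},
}

poller = client.deployments.begin_create_or_update(
    RESOURCE_GROUP,
    "adf-promotion",
    Deployment(properties=DeploymentProperties(
        mode="Incremental",
        template=template,
        parameters=parameters)),
)
print(poller.result().properties.provisioning_state)

In a GitHub Actions workflow, a script like this would run as a step in the manually triggered job, with the credentials supplied from the GitHub Secrets section.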

 

Thursday, March 30, 2023

 This continues from the previous post:

In addition to data security, monitoring plays an important role in maintaining the health and hygiene of data.

Azure Monitor is a centralized management interface for monitoring workloads anywhere. The monitoring data comprises metrics and logs. There are built-in capabilities to respond to this monitoring information in several ways, and monitoring everything from the code through to the platform provides holistic insights.

The key monitoring capabilities include: Metrics Explorer to view and graph small, time-based metric data; Workbooks for visualization, reporting, and analysis of monitoring data; activity logs for REST API write actions performed on Azure resources; Azure Monitor Logs for advanced, holistic analytics using the Kusto query language; monitoring insights for resource-specific monitoring solutions; and alerts and action groups for alerting, automation, and incident management.

Not all monitoring information can be treated the same way. For this reason, there are diagnostic settings available that help us differentiate the treatment we give to certain types of monitoring information. A platform diagnostic setting helps us route platform logs and metrics. Similarly, there are multiple data categories available that enable us to treat them differently. Some of the treatments involve sending the data to a storage account for retention and analysis, sending the data to a Log Analytics workspace for powerful analytics, or sending the data to Event Hubs to stream to external systems.
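As a small illustration of the Log Analytics route, here is a hedged Python sketch assuming the azure-monitor-query and azure-identity packages; the workspace ID and the Kusto query are hypothetical, and the response shape may vary slightly between package versions.

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical workspace

client = LogsQueryClient(DefaultAzureCredential())

# Kusto query over diagnostic data routed to the Log Analytics workspace.
query = "AzureDiagnostics | summarize count() by ResourceProvider | top 10 by count_"

response = client.query_workspace(WORKSPACE_ID, query, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)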

One of the most interesting aspects of Azure Monitor is that it collects metrics from applications, the guest OS, Azure resource monitoring, Azure subscription monitoring, and Azure tenant monitoring to cover the depth and breadth of the systems involved. Alerts and autoscale help determine the appropriate thresholds and actions that become part of the monitoring stack, so the data and the intelligence are together and easily navigated via the dashboard. Azure Dashboards provide a variety of eye-catching charts that illustrate the data to viewers better than the results of a query. Workbooks provide a flexible canvas for data analysis and the creation of rich visual reports in the Azure Portal. The analysis is not restricted to just these two. Power BI remains the robust solution for analysis and interactive visualizations across a variety of data sources, and it can automatically import log data from Azure Monitor. Azure Event Hubs is a streaming platform and event ingestion service which permits real-time analytics as opposed to batching or storage-based analysis. APIs from Azure Monitor help with reading and writing data as well as configuring and retrieving alerts.


Wednesday, March 29, 2023

One of the ways to secure data is to control access. This involves generating access keys, shared access signatures, and Azure AD access. Access keys provide administrative access to an entire resource such as a storage account. Microsoft recommends these are used only for administrative purposes. Shared access signatures are like tokens that can be used to grant granular access to resources within a storage account; access is provided to whoever or whatever holds the signature. Role-based access control can be used to control access to both the management and the data layer. Shared access signatures can also be generated in association with an Azure AD identity, instead of being created with a storage account access key. Access to files from domain-joined devices is secured using Azure AD identities, which can be used for authentication and authorization.
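To make the distinction concrete, the following Python sketch, assuming the azure-storage-blob package, generates a short-lived, read-only shared access signature for a single blob. The account, container, and blob names are hypothetical, and the account key would normally come from a secure store such as Key Vault rather than appearing in code.

from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

ACCOUNT_NAME = "contosodatalake"    # hypothetical storage account
ACCOUNT_KEY = "<from-key-vault>"    # administrative access key, kept secret
CONTAINER = "reports"
BLOB = "2023/03/summary.csv"

# Grant read-only access to one blob for one hour.
sas_token = generate_blob_sas(
    account_name=ACCOUNT_NAME,
    container_name=CONTAINER,
    blob_name=BLOB,
    account_key=ACCOUNT_KEY,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

url = f"https://{ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER}/{BLOB}?{sas_token}"
print(url)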

Another way to secure the data is to protect the data using storage encryption, Azure Disk Encryption, and Immutable storage. Storage encryption involves server-side encryption for data at rest and Transport Layer Security based encryption for data in transit. The default encryption involves an Account Encryption Key which is Microsoft managed but security can be extended through the use of customer-managed keys which are usually stored in a Key Vault. The volume encryption encrypts the boot OS and data volumes to further protect the data.

A checklist helps with migrating sensitive data to the cloud and provides benefits to overcome the common pitfalls regardless of the source of the data. It serves merely as a blueprint for a smooth secure transition.

Characterizing permitted use is the first step data teams need to take to address data protection for reporting. Modern privacy laws specify not only what constitutes sensitive data but also how the data can be used. Data obfuscation and redaction can help protect against exposure. In addition, data teams must classify the usages and the consumers. Once sensitive data is classified and purpose-based usage scenarios are addressed, role-based access control must be defined to protect future growth.

Devising a strategy for governance is the next step; this is meant to prevent intruders and is meant to boost data protection by means of encryption and database management. Fine grained access control such as attribute or purpose-based ones also help in this regard.

Embracing a standard for defining data access policies can help to limit the explosion of mappings between users and the permissions for data access; this gains significance when a monolithic data management environment is migrated to the cloud. Failure to establish a standard for defining data access policies can lead to unauthorized data exposure.

Migrating to the cloud in a single stage with an all-at-once data migration must be avoided because it is operationally risky. It is critical to develop a plan for incremental migration that facilitates development, testing, and deployment of a data protection framework that can be applied to ensure proper governance. Decoupling data protection and security policies from the underlying platform allows organizations to tolerate subsequent migrations.

There are different types of sanitization, such as redaction, masking, obfuscation, encryption, tokenization, and format-preserving encryption. Among these, static protection, in which clear-text values are sanitized and stored in their modified form, and dynamic protection, in which clear-text data is transformed into ciphertext, are the most used.

Finally, defining and implementing data protection policies brings several additional processes such as validation, monitoring, logging, reporting, and auditing. Having the right tools and processes in place when migrating sensitive data to the cloud will allay concerns about compliance and provide proof that can be submitted to oversight agencies.

Tuesday, March 28, 2023

 

Data pipelines are specific to organizational needs, so it is hard to come up with a tried and tested methodology that suits all, but standard practices continue to be applicable across domains. One such principle is to focus on virtualizing the different sources of data, say into a data lake, so that there is one or at most a few pipeline paths. Another principle is to be diligent about consistency and standardization to prevent unwieldy or numerous customizations. For example, if a patient risk score needs to be calculated, then general scoring logic that is not source-specific must be applied first, followed by an override for a specific source. Reuse can be boosted by managing configurations stored in a database. This avoids a pipeline-per-data-source antipattern.

Pipelines also need to support scalability. One approach to scale involves an event driven stack. Each step picks up its task from a messaging queue and also sends its results to a queue and the processing logic works on an event-by-event basis. Apache Kafka is a good option for this type of setup and works equally well for both stream processing and batch processing.
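As a sketch of this event-driven pattern, the snippet below uses the kafka-python package (one of several client libraries for Kafka): a step consumes records from an input topic, applies its processing logic per event, and publishes results to an output topic. The topic names, broker address, and scoring formula are hypothetical.

import json

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"              # hypothetical broker address

consumer = KafkaConsumer(
    "patients-raw",                     # input topic for this pipeline step
    bootstrap_servers=BROKERS,
    group_id="risk-scoring",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    record = message.value
    # Apply the general, source-agnostic scoring logic on an event-by-event basis.
    record["risk_score"] = min(100, record.get("age", 0) + record.get("conditions", 0) * 10)
    producer.send("patients-scored", record)   # results go onto the next queue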

Another approach to support scalability involves the use of a data warehouse. These help to formalize extract-transform-load operations from diverse data sources and support many types of read-only analytical stacks.

Finally, on-premises solutions can be migrated to the cloud for scalability because of elasticity and higher rate limits. And there is transparency and pay-as-you-go pricing that appeals to the return on investment. Some apprehension about data security precedes many design decisions about on-cloud solutions but security and compliance in the cloud is unparalleled and provides better opportunities for hardening.

Monitoring and alerting increases transparency and visibility into the application and are crucial to health checks and troubleshooting. A centralized dashboard for metrics and alerts tremendously improves the operations of the pipeline. It also helps with notifications.

There are so many technology stacks and services to use in the public cloud that there is always some missing expertise on the team. Development teams must focus on skills and internal cultural change. Some of the sub-optimal practices happen when leadership does not prioritize cloud cost optimization. For example, developers ignore small administrative tasks that could significantly improve operating costs; architects select suboptimal designs that are easier and faster to implement but more expensive to run; algorithms and code are not streamlined and tightened to leverage cloud best practices; deployment automation is neglected or even skipped altogether when it could have correctly adjusted the size of the resources deployed; and finance and procurement teams misplace their focus on the numbers in the cloud bill, creating tension between them and the IT/development teams. A non-committal mindset towards cloud technologies is a missed opportunity for business leaders because long-term engagements are more cost friendly.

 

Monday, March 27, 2023

Azure data platform continued

 

While the discussion on SQL and NoSQL stacks for data storage and querying has relied on the separation of transactional processing versus analytical processing, there is another angle to this dichotomy from a data science perspective. NoSQL stores, particularly key-value stores, are incredibly fast and efficient at ingesting data, but their queries are inefficient. SQL stores, on the other hand, are just the opposite: they are efficient at querying the data but ingest data slowly and inefficiently.

Organizations are required to have data stacks that deliver the results of heavy data processing in a limited time window. SQL databases are also overused for historical reasons. When organizations want the results of the processing within a fixed time window, the database's inefficiency at ingesting data delays the actual processing, which in turn creates an operational risk. The limitation comes from the data structure used in the relational database. Storing a 1 TB table requires the B+ tree to grow to six levels. If the memory used by the database server is of the order of 125 GB, less than 25% of the data will remain in cache. For every insertion, it must read an average of three data blocks from disk to reach the leaf node, which causes the same number of blocks to be evicted from the cache. This dramatic increase in I/O makes these databases inefficient at ingesting data.

When the data is stored in NoSQL stores, querying is inefficient because a single horizontal data fragment could be spread across many files, which increases the time spent reading the data. This improves if there is an index associated with the data, but the volume of data is still quite large. If a single B+ tree is assumed to have 1024 blocks, each search will need to access log2(1024) = 10 blocks, and there could be many files, each with its own index; with 16 B+ trees, a total of 160 blocks would be read instead of the 10 blocks for a single index in a database server. Cloud document stores are capable of providing high throughput and low latency for queries by providing a single container for unlimited-sized data and charging based on reads and writes. To improve the performance of the cloud database, we can set the connection policy to direct mode, set the protocol to TCP, avoid startup latency on the first request, collocate clients in the same Azure region, and increase the number of threads/tasks.

Ingestion is also dependent on the data sources. Different data sources often require different connectors, but fortunately these come out of the box from cloud services. Many analytical stacks can easily connect to the storage via existing and available connectors, reducing the need for integration. Services for analysis from the public cloud are rich, robust, and very flexible to work with.

When companies want to generate value and revenue from accumulated data assets, they are not looking to treat all data equally like some of the analytical systems do. Creating an efficient pipeline that can scale to different activities, and using appropriate storage and analytical systems for specific types of data, helps meet the business goals with rapid development. Data collected in these sources for insurance technology software could be as varied as user information, claim details, social media or government data, demographic information, the current state of the user, medical history, income category or credit score, agent-customer interactions, and call center or support tickets. ML templates and BI tools might be different for each of these data categories, but the practice of using data pipelines and data engineering best practices brings a cloud-first approach that delivers on the performance requirements expected for those use cases.

Sunday, March 26, 2023

Business process automations provide immense opportunities for improvement, more so when there is pressure both inside and outside an organization, as is the case in the insurance sector, where new ways of doing business continually challenge traditional ones. Some of these trends can be seen in examples: Lemonade has streamlined the claims experience by using AI and chatbots, Ethos issues life insurance policies in minutes, Hippo issues home quotes in under a minute, telematics-based auto insurance companies like Metromile and Root offer usage-based insurance, Fabric offers value-added services in its niche market, Figo Pet runs a completely digital pet insurance platform, and Next Insurance simplifies small business insurance with a 100% online experience. In this sector, incumbent insurers have insufficient customer-centric strategies, and while the establishment views the niche products from these new entrants as mere threats, it runs the risk of increasing its debt with outdated processes and practices, which results in a lower benefit-to-cost ratio. The gap also widens between the strategies offered by the new entrants and those provided by the established players. The new entrants channel technology best practices and are not bound by industry traditions. Zero-touch processing, self-service, and chatbots are deployed to improve customer-centric offerings. Industry trends are also pushing customers more and more to digital channels. The net result is that the business is increasingly driven towards a digital-enabled, omnichannel, customer-aligned customer experience, and established businesses have the advantage of being able to strive for scale. This is evident from the increased engagement with digital leaders from inside and outside the industry and from the new and improved portfolio of digital offerings.

A digital transformation leader sees the opportunities and challenges of these business initiatives from a different point of view. They can be broken down into legacy and new business capabilities, with different approaches to tackle each. While established companies in this sector have already embraced the undisputed use of cloud technologies and have even met a few milestones on their cloud adoption roadmap, digital leaders increasingly face the challenges of migrating and modernizing legacy applications, both for transactional processing and for analytical reporting. The journey towards the cloud has been anything but steady due to the diverse technology stacks involved in these applications. A single application can be upwards of a hundred thousand lines of code, and many can be considered boutique or esoteric in terms of their use of third-party statistical software and libraries. What used to be streamlined and tuned analytical queries against data warehouses has continually evolved towards Big Data and stream analytics software, often with home-grown applications or internal-facing automations.

Saturday, March 25, 2023

 

A previous post introduced some of the best practices using Azure Data Platform. It covered various options about structured and unstructured storage. This article covers some of the considerations regarding data in transit.

Azure Data Factory is frequently used to extract-transform-load data to the cloud. Even if there are on-premises SSIS data tasks to perform, Azure Data Factory can help migrate the data from on-premises to the cloud. There are different components within Azure Data Factory that help meet this goal. Linked Services provide connections to external resources that contain the datasets to work with. A pipeline has one or more activities that can be triggered to control and transform the data. The Integration Runtime provides the compute environment for data integration execution, which involves data flow, transformation, and movement. It can be Azure-based, self-hosted, or Azure-SSIS, which can lift and shift existing SSIS workloads. The pipeline and its activities define actions to perform on the data. For example, a pipeline could contain a set of activities that ingest and clean log data and then kick off a mapping data flow to analyze the log data. When the pipeline is deployed and scheduled, all the activities can be managed as a set instead of each one individually. Activities can be grouped into data movement activities, data transformation activities, and control activities. Each activity can take zero or more input and output datasets. Azure Data Factory enables us to author workflows that orchestrate complex ETL, ELT, and data integration tasks in a flexible way, with graphical and code-based data pipelines and Continuous Integration/Continuous Deployment support.
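As an illustration of programmatic orchestration, here is a hedged Python sketch assuming the azure-mgmt-datafactory and azure-identity packages; it triggers a run of an existing pipeline and polls its status. The subscription, resource group, factory, pipeline, and parameter names are hypothetical.

import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "rg-dataplatform-dev"      # hypothetical names
FACTORY = "adf-contoso-dev"
PIPELINE = "IngestAndCleanLogs"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Kick off a pipeline run, optionally passing parameters to its activities.
run = client.pipelines.create_run(
    RESOURCE_GROUP, FACTORY, PIPELINE, parameters={"window": "2023-03-25"})

# Poll the run until it reaches a terminal state.
while True:
    status = client.pipeline_runs.get(RESOURCE_GROUP, FACTORY, run.run_id).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
print(status)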

Let us take a scenario of a company that has a variety of data from a variety of sources and requires automations for data ingestion and analysis in the cloud. Defining an analytics solution requires expertise in business analysis, data engineering, and data science, and for this purpose the company leverages the Azure data analytics platform that has been discussed so far. The current practice in this company captures a variety of data about manufacturing and marketing and stores it in a centralized repository. The size of this repository limits the data capture to about one week's worth of data, and it supports data formats such as JSON, CSV, and text. Additionally, data also exists in another cloud in publicly available object storage. The company expects to meet the following objectives with regard to data storage, data movement, and data analytics and insights. The data storage must be such that months of data, to the tune of petabytes, can be stored, and it must support access control at the file level. Data must be regularly ingested from both on-premises and AWS. The existing connectivity and accessibility of the data cannot be changed. The analytics platform must support Spark and be able to reach the other cloud. Security demands that the workspace used for analytics be made available only to the head office.

A possible solution for the above scenario is one that stores the data in Azure Data Lake Storage Gen2 because it scales to petabytes of data and comes with a hierarchical namespace and POSIX-like access control lists. The data can be copied or moved with Azure Data Factory using a self-hosted integration runtime that runs on-premises and can access the on-premises storage privately. Even when certain data might be in the other cloud, Azure Data Factory can leverage the built-in Azure integration runtime to access it. There are many services to choose from for the analytics solution, but Databricks provides the analytics here because it can work in both public clouds. The premium plan for Databricks can restrict workspace use to the head office only. It also supports Azure AD credential passthrough when it is used for securing data lake storage.

 

Friday, March 24, 2023

Some best practices using Azure Data Platform: 

Structured and unstructured data require different storage and processing. Structured data is fixed-format data with schema, types, and relationships. It requires a lot of upfront planning and is equally difficult to modify afterwards. It is frequently used to store application data for online transaction processing. Semi-structured data is a very flexible format with various models such as key/value, document, etc. The emphasis is on long-term flexibility and modification, and it is best suited to dynamic applications such as social media. Media files, text files, and office documents are the most frequently used unstructured data.

Azure storage accounts can store blobs, files, queues, and tables. The kind of storage account determines the supported storage services, performance tier, and pricing. Data can be replicated from the primary region to a secondary region. The access tier influences the pricing and access latency. The hot, cool, or archive access tiers suit data ageing and support lifecycle management. Gen2 storage supports a hierarchical namespace. Object data can be streamed globally and at scale. Lower latency and higher throughput come with the higher performance tiers. Replication and accessibility must be part of the design decisions.
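The access tier can also be managed per blob; the short Python sketch below, assuming the azure-storage-blob package, demotes an ageing blob to the cool tier. The connection string, container, and blob names are hypothetical placeholders.

from azure.storage.blob import BlobServiceClient

# Hypothetical connection string, normally read from configuration or a secret store.
CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=contosodatalake;AccountKey=<key>;EndpointSuffix=core.windows.net"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob = service.get_blob_client(container="telemetry", blob="2022/01/events.json")

# Move an ageing blob from the hot tier to the cool tier to reduce storage cost.
blob.set_standard_blob_tier("Cool")
print(blob.get_blob_properties().blob_tier)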

Companies that require relational databases have a variety of products and offerings from Azure. They can choose to directly host a database on Azure using a fully managed offering that supports common SQL Server features. Otherwise, they could choose to deploy a managed SQL Server instance. If they want even more parity and control with a traditional on-premises database, they could create Azure SQL Server virtual machines, which provide full control and access. These three options are ordered in terms of trade-offs between cloud-native convenience and full control and access. Cloud-native resources come with built-in backup, patching, and recovery and provide a 99.995% availability guarantee. They also integrate with Azure Active Directory. Azure SQL managed instances are deployed on a managed virtual cluster that Microsoft operates. They provide a private IP address and support most migrations to the cloud. The SQL purchasing models can be DTU-based, for a predictable and linear relationship between compute and storage, or vCore-based, for scaling compute and storage independently. vCore also supports Azure Hybrid Benefit, which allows porting on-premises SQL Server licenses. The General Purpose service tier uses blob storage at about 5-10 ms latency; Business Critical uses SSD at about 1-2 ms latency and supports databases up to 16 TB; and Hyperscale supports databases up to 100 TB. Azure SQL virtual machines provide full control and access with relaxed limits.

Data Lake storage is ideal for storing huge amounts of varied or unstructured data and is built on top of block blobs. It enables common analytical features and access and is especially useful for storing large volumes of text, with support for hierarchical namespaces. It is accessible by Hadoop services and supports a superset of POSIX permissions for finer-grained access control. Synapse Analytics combines data warehousing and big data analytics. It has tools for data integration from diverse sources and powers analytics with massively parallel processing. The resource pools can be SQL pools or Spark pools, and Synapse supports pipelines for data movement and transformation using Azure Data Factory workflows as well as connectivity to Cosmos DB for near real-time analytics. The models served by Synapse can be rich. Databricks is favored for Apache Spark-based big data and machine learning. Environments with Databricks can run SQL queries on a data lake, provide a collaborative workspace for working on big data pipelines and analytics, and offer end-to-end integration for ML experiments, model training, and serving. The premium tier can help with management, security, and monitoring through audit logs; notebook, cluster, and job RBAC; Azure AD passthrough; and IP access lists.

The right choices for cloud engineering can bring tremendous value to data engineering professionals. 

Thursday, March 23, 2023

Linux kernel extensions continued

 

Application development frequently encounters the need for background tasks or scheduled jobs. Long running tasks are often delegated to background workers. In fact, some models require a state reconciliation control loop that is best met by background workers. This idea is frequently encountered with infrastructure providers. For example, Kubernetes has a language that articulates state so that its control loop can reconcile these resources. The resources can be generated and infused with specific configuration and secret using a configMap generator and a secret generator respectively. It can take an existing application.properties file and generate a configMap that can be applied to new resources.

FUSE can also be used with remote folders. Mounting a remote folder is a great way to access information on a remote server. Mounting the folder into the filesystem will allow us to drag and drop files into the required folder and the information will then be transferred to the remote location. For example, the following commands can be used to set this up:

yum install epel-release -y

yum install fuse sshfs -y

modprobe fuse

lsmod | grep fuse

echo "modprobe fuse" >> /etc/rc.local

sshfs root@198.162.2.9:/home/mount /home/remote

Wednesday, March 22, 2023

Linux Kernel Continued...

 

Linux also supports FUSE, which is a user-space file-system framework. It consists of a kernel module (fuse.ko), a userspace library (libfuse.*), and a mount utility (fusermount). One of the most important features of FUSE is that it allows secure, non-privileged mounts. One example of this is sshfs, which is a secure network filesystem using the SFTP protocol.

One of the common applications for FUSE filesystem is the use of a Watchdog to continuously monitor a folder to check for any new files or when an existing file is modified or deleted. As an example, if the size of the folder exceeds a limit, then it can be pruned. Watchdog is an open-source cross-platform python API library that can be used to monitor file systems. The Watchdog observer keeps monitoring the folder for any changes like file creation and when an event occurs, the event handler executes the event’s specified action.

Such usage is very common when there are a lot of files being uploaded to a file directory, let us say a hot folder and those files may never be used once they are processed. It helps to keep the file contents of the hot folder under a certain size limit for performance reasons. Therefore, another folder is created to roll over the contents from the hot folder. Let us call this folder the cold folder. It might so happen that processing might not have caught up with a file in the hot folder  and it is moved to the cold folder. The application then needs to check the hot folder first and then the cold folder and with the help of an attribute or a modification to the file name or the presence of an output file, detect if the file has been processed. The hot and cold folder are interchangeable for reading from and writing to the file. Since FUSE provides a bridge to the actual kernel interfaces, the library providing event handling interfaces can extend it to pass through the file operations without requiring the application to know whether the hot or cold folder is used. The only overrides to the operating system file system operations would be to resolve the appropriate folder.
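A minimal sketch of such a watcher, using the open-source watchdog package described above, might look like the following. The hot and cold folder paths and the size limit are hypothetical, and the rollover policy (move the oldest files first) is just one possible choice.

import os
import shutil
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

HOT, COLD = "/data/hot", "/data/cold"    # hypothetical folder paths
LIMIT_BYTES = 1 * 1024 * 1024 * 1024     # keep the hot folder under 1 GB

class HotFolderHandler(FileSystemEventHandler):
    def on_created(self, event):
        # A new file arrived in the hot folder; prune if the size limit is exceeded.
        if not event.is_directory:
            self._prune_if_needed()

    def _prune_if_needed(self):
        files = [os.path.join(HOT, f) for f in os.listdir(HOT)]
        files = [f for f in files if os.path.isfile(f)]
        total = sum(os.path.getsize(f) for f in files)
        # Roll the oldest files over to the cold folder until under the cap.
        for f in sorted(files, key=os.path.getmtime):
            if total <= LIMIT_BYTES:
                break
            total -= os.path.getsize(f)
            shutil.move(f, os.path.join(COLD, os.path.basename(f)))

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(HotFolderHandler(), HOT, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()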

Tuesday, March 21, 2023

Linux kernel continued

 

Linux supports several file systems. The Virtual File System Interface allows Linux to support many file systems via a common interface. It is designed to allow access to files as fast and efficiently as possible.

Ext2fs was the original Linux file system, and it became widely popular, allowing typical file operations such as creating, updating, and deleting files, directories, hard links, soft links, device special files, sockets, and pipes. It suffered from one limitation: if the system crashed, the entire file system had to be validated and corrected for inconsistencies before it could be remounted. This was improved with journaling, where every file system operation is logged before the operation is executed and the log is replayed to bring the file system to consistency.

Linux Volume Managers and Redundant Array of Inexpensive Disks (RAID) provide a logical abstraction of a computer’s physical storage devices and can combine several disks into a single logical unit to provide increased total storage space as well as data redundancy. Even on a single disk, they can divide the space into multiple logical units, each for a different purpose.

Linux provides four different RAID levels. RAID-Linear is a simple concatenation of the disks that comprise the volume. RAID-0 is simple striping, where the data written is interleaved in equal-sized "chunks" across all disks in the volume. RAID-1 is mirroring, where all data is replicated on all disks in the volume; a RAID-1 volume created from n disks can survive the failure of n-1 of those disks. RAID-5 is striping with parity, which is similar to RAID-0 but with one chunk in each stripe containing parity information instead of data; RAID-5 can survive the failure of any single disk in the volume.

A Volume-Group can be used to form a collection of disks, also called Physical-Volumes. The storage space provided by these disks is then used to create Logical-Volumes. A Volume-Group is also resizable: new volumes are easy to add as extents, the Logical-Volumes can be expanded or shrunk, and the data on the LVs can be moved around within the same Volume-Group.

Beyond the hard disk, keyboard, and console that a Linux system supports by default, a user-level application can create device special files to access other hardware devices. They are mounted as device nodes in the /dev directory. Usually, these are of two types: block devices and character devices. Block devices allow block-level access to the data residing on a device, and character devices allow character-level access to the devices. The ls -l command shows a ‘b’ for a block device and a ‘c’ for a character device in the permission string. The virtual file system devfs is an alternative to these special devices. It reduces the system administration task of creating a device node for each device. A system administrator can mount the devfs file system many times at different mount points, but changes to a device node are reflected on all the mount points. The devfs namespace exists in the kernel even before it is mounted, which makes the device nodes available independently of the root file system.
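The same block-versus-character distinction can be checked programmatically; below is a small Python sketch using the standard os and stat modules, with hypothetical device paths that exist on a typical Linux system.

import os
import stat

def device_kind(path: str) -> str:
    """Classify a path as a block device, a character device, or neither."""
    mode = os.stat(path).st_mode
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISCHR(mode):
        return "character device"
    return "not a device node"

# Hypothetical examples: a disk, a terminal, and an ordinary file.
for node in ("/dev/sda", "/dev/tty0", "/etc/hostname"):
    print(node, "->", device_kind(node))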

Monday, March 20, 2023

Linux Kernel continued...

 

Interprocess communication, aka IPC, occurs with the help of signals and pipes. Linux also supports System V IPC mechanisms. Signals notify events to one or more processes and can be used as a primitive means of communication and synchronization between user processes. Signals can also be used for job control. Processes can choose to ignore most signals except for the well-known SIGSTOP and SIGKILL. The first causes a process to halt its execution; the second causes a process to exit. Default actions are associated with signals, and the kernel completes them. Signals are not delivered to the process until it enters the running state from the ready state; when a process exits a system call, the pending signals are then delivered. Linux is POSIX compatible, so a process can specify which signals are blocked when a particular signal-handling routine is called.

A pipe is a unidirectional, ordered and unstructured stream of data. Writers add data at one end and readers get it from the other end. An example is the command “ls | less” which paginates the results of the directory listing.
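The same idea can be demonstrated with a short Python sketch on Linux, using only the standard os module: the parent creates a pipe, forks, and reads what the child writes into the other end.

import os

read_end, write_end = os.pipe()      # unidirectional, ordered byte stream
pid = os.fork()

if pid == 0:
    # Child: write into the pipe and exit.
    os.close(read_end)
    os.write(write_end, b"hello from the child\n")
    os.close(write_end)
    os._exit(0)
else:
    # Parent: read what the child wrote, then reap the child.
    os.close(write_end)
    data = os.read(read_end, 1024)
    os.close(read_end)
    os.waitpid(pid, 0)
    print(data.decode(), end="")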

UNIX System V introduced IPC mechanisms in 1983 which included message queues, semaphores, and shared memory. The mechanisms all share common authentication methods and Linux supports all three. Processes access these resources by passing a unique resource identifier to the kernel via system calls.

Message queues allow one or more processes to write messages, which will be read by one or more processes. They are more versatile than pipes because the unit is a message rather than an unformatted stream of bytes and messages can be prioritized based on a type association.

Semaphores are objects that support atomic operations such as set and test. They are counters for controlled access to shared resources by multiple processes. Semaphores are most often used as locking mechanisms but must be used carefully to avoid deadlocking such as when a thread holds on to a lock and never releases it.

Shared memory is a way to communicate when that memory appears in the virtual address spaces of the participating processes. Each process that wishes to share the memory must attach to virtual memory via a system call and similarly must detach from the memory when it no longer needs the memory.

Linux has a symmetric multiprocessing model. A multiprocessing system consists of a number of processors communicating via a bus or a network. There are two types of multiprocessing systems: loosely coupled and tightly coupled. Loosely coupled systems consist of processors that operate standalone; each processor has its own bus, memory, and I/O subsystem, and communicates with other processors through the network medium. Tightly coupled systems consist of processors that share memory, bus, devices, and sometimes cache. These can be symmetric or asymmetric. Asymmetric systems have a single master processor that controls the others. Symmetric systems are subdivided into further classes consisting of dedicated and shared cache systems.

Ideally, an SMP System with n processors would perform n times better than a uniprocessor system but in reality, no SMP is 100% scalable.

SMP systems use locks where multiple processors execute multiple threads at the same time. Locks must be held for the shortest time possible. Another common technique is to use finer-grained locking so that, instead of locking a whole table, only a few rows are locked at a time. Linux 2.6 removed most of the global locks, and the locking primitives are optimized for low overhead.

Multiprocessors demonstrate cache coherency problem because each processor has an individual cache, and multiple copies of certain data exist in the system which can get out of sync.

Processor affinity improves system performance because the data and the resources accessed by the code stay local to the processor's cache due to cache warmth. Affinity helps reuse what is already cached rather than fetching it repeatedly. The use of processor affinity is accentuated in Non-Uniform Memory Access architectures, where some resources can be closer to a processor than others.
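On Linux, processor affinity can be set from user space; the minimal Python sketch below uses the standard os module (the sched_* calls are Linux-only) to pin the current process to the first two CPUs.

import os

pid = 0  # 0 means the calling process

print("allowed CPUs before:", os.sched_getaffinity(pid))

# Pin the process to CPUs 0 and 1 so its data stays warm in their caches.
os.sched_setaffinity(pid, {0, 1})

print("allowed CPUs after:", os.sched_getaffinity(pid))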

Sunday, March 19, 2023

Linux Kernel

 

The kernel has two major responsibilities:

-          To interact with and control the system’s hardware components.

-          To provide an environment in which the application can run.

All the low-level hardware interactions are hidden from the user mode applications. The operating system evaluates each request and interacts with the hardware component on behalf of the application.

Contrary to the expectations around subsystems, the Linux kernel is monolithic. All of the subsystems are tightly integrated to form the whole kernel. This differs from a microkernel architecture, where the kernel provides bare minimal functionality and the operating system layers are implemented on top of the microkernel as processes. Microkernels are generally slower due to message passing between the various layers. But the Linux kernel supports modules, which allow it to be extended. A module is an object that can be linked to the kernel at runtime.

System calls are what an application uses to interact with kernel resources. They are designed to ensure security and stability. An API provides a wrapper over the system calls so that the two can vary independently; there need not be a one-to-one relation between the two, and APIs are provided as libraries to applications.

The /proc file system provides the user with a view of the internal kernel data structures. It is a virtual file system used to fine tune the kernel’s performance as well as the overall system.
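A couple of these virtual files can be read like ordinary text files; the short Python sketch below prints the configured pid_max and the first few lines of /proc/meminfo on a Linux system.

# Read tunables and statistics exposed by the /proc virtual file system (Linux only).
with open("/proc/sys/kernel/pid_max") as f:
    print("pid_max:", f.read().strip())

with open("/proc/meminfo") as f:
    for line in f.readlines()[:5]:
        print(line.rstrip())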

The various aspects of memory management in Linux includes address space, physical memory, memory mapping, paging and swapping.

One of the advantages of virtual memory is that each process thinks it has all the address space it needs. The isolation enables processes to run independently of one another. The virtual memory can be much larger than physical memory in the system. The application views the address space as a flat linear address space. It is divided into two parts: the user address space and the kernel address space. The range between the two depends on the system architecture. For 32 bit, the user space is 3GB and the kernel space is 1GB. The location of the split is determined by the PAGE_OFFSET kernel configuration variable.

The physical memory is architecture-independent and can be arranged into banks, with each bank being a particular distance from the processor. Linux VM represents this arrangement as a node. Each node is divided into blocks called zones that represent ranges within memory. There are three different zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. Each zone has its own use with the one named normal for kernel and the one named highmem for user data.

When memory mapping occurs, the kernel has one GB of address space. The DMA and NORMAL ranges are directly mapped to this address space. This leaves only 128 MB of virtual address space, which is used for vmalloc and kmap. With systems that allow Physical Address Extension, handling physical memory in the tens of gigabytes can be hard for Linux. The kernel handles high memory on a page-by-page basis: it maps the page into a small virtual address space (kmap) window, operates on that page, and unmaps it. 64-bit architectures do not have this problem because their address space is huge.

The virtual memory is implemented depending on the hardware. It is divided into fixed size chunks called pages. Virtual memory references are translated into addresses in physical memory using page tables. Different architectures and page sizes are accommodated using three-level paging mechanism involving Page Global Directory, Page Middle Directory, and Page Table. This address translation provides a way to separate the virtual address space of a process from the physical address space. If an address is not in virtual memory, it generates a page fault, which is handled by the kernel.  The kernel handles the fault and brings the page into main memory even if it involves replacement.

Swapping is the moving of an entire process to and from the secondary storage when the main memory is low but is generally not preferred because context switches are expensive. Instead, paging is preferred. Linux performs swapping at page level rather than at the process level and is used to expand the process address space and to circulate pages by discarding some of the less frequently used or unused pages and bringing in new pages. Since it writes to disk, the disk I/O is slow.

Saturday, March 18, 2023

 

Linux Kernel:

Linux Kernel is a small and special code within the core of the Linux Operating System and directly interacts with the hardware. It involves process management, process scheduling, system calls, interrupt handling, bottom halves, kernel synchronization and its techniques, memory management and process address space.

A process is a program being executed on the processor. Threads are the objects of activity within the process, and the kernel schedules individual threads. Linux does not differentiate between a thread and a process, so a multi-threaded program can have multiple processes. A process is created using the fork call. The fork call returns in both the child process and the parent process. At the time of the fork, all the resources are copied from the parent to the child. When the exec call is made, a new address space is loaded for the process.

The Linux kernel maintains a doubly linked list of task structures pertaining to the processes and refers to them with process descriptors, which keep information regarding the processes. The size of the process descriptor depends on the architecture of the machine; for 32-bit machines, it is about 1.7 KB. The task structure gets stored in memory using a kernel stack for each process. A process kernel stack has a low memory address and a high memory address. The stack grows from the high memory address to the low memory address, and its front can be found with the stack pointer. The thread_info struct and the task_struct are stored in this space towards the low memory address. The PID helps to identify one process among thousands. The thread_info struct is used to conserve memory, because storing 1.7 KB in a 4 KB kernel stack uses up a lot of it; it holds a pointer to the task_struct, and a pointer is a redirection to the actual data structure that uses up very little space. The maximum number of processes in Linux can be set in the configuration at /proc/sys/kernel/pid_max. A current macro points to the currently executing task_struct structure.

The processes can be in different process states. The first state entered when the process is forked is the ready state. When the scheduler dispatches the task to run, it enters the running state, and when the task exits, it is terminated. A task can switch between running and ready many times by going through an intermediary state called the waiting state or the interruptible state. In this state, the task sleeps on a wait queue for a specific event; when the event occurs, the task is woken up and placed back on the run queue. This state is visible in the task_struct structure of every process. To manipulate the current process state, there is an API called set_task_state.

The process context is the context in which the kernel executes on behalf of the process. It is triggered by a system call. The current macro is not valid in the interrupt context. The init process is the first process that gets created, and it then forks other user-space processes. The /etc/inittab entries keep track of the processes and daemons to create, and a process tree helps organize the processes.

Copy-on-write is a technique that makes a copy of the address space only when a child writes to it; until that time, all reading child processes can continue to use a single instance. The set of resources such as virtual memory, file system, and signals that can be shared is determined by the clone system call, which is invoked as part of the fork system call. If the page tables do not need to be copied, a vfork system call can be used instead of fork. Kernel threads only run within the kernel and do not have an associated process address space. Flush is an example of a kernel thread. The ps -ef command lists the kernel threads as well.
All the tasks that were undertaken at the time of fork are reversed at the time of the process exit. The process descriptor is removed when all the references to it are removed. A zombie process is one that is no longer in the running state, but its process descriptor still lingers. A parent process that exits before its child leaves the child parentless; the kernel then provides the child with new parents.
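The fork and exec sequence described above can be seen in a few lines of Python on Linux, using only the standard os module: the child replaces its address space with a new program via exec while the parent waits for it.

import os
import sys

pid = os.fork()                     # copy-on-write duplicate of the parent

if pid == 0:
    # Child: load a new address space with exec (does not return on success).
    os.execvp("ls", ["ls", "-l", "/"])
    sys.exit(1)                     # only reached if exec fails
else:
    # Parent: wait for the child and report its exit status.
    _, status = os.waitpid(pid, 0)
    print("child", pid, "exited with status", os.WEXITSTATUS(status))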

Friday, March 17, 2023

 

SQL Schema


Table: Books

+----------------+---------+

| Column Name    | Type    |

+----------------+---------+

| book_id        | int     |

| name           | varchar |

| available_from | date    |

+----------------+---------+

book_id is the primary key of this table.

 

Table: Orders

+----------------+---------+

| Column Name    | Type    |

+----------------+---------+

| order_id       | int     |

| book_id        | int     |

| quantity       | int     |

| dispatch_date  | date    |

+----------------+---------+

order_id is the primary key of this table.

book_id is a foreign key to the Books table.

 

Write an SQL query that reports the books that have sold less than 10 copies in the last year, excluding books that have been available for less than one month from today. Assume today is 2019-06-23.

Return the result table in any order.

The query result format is in the following example.

 

Example 1:

Input:

Books table:

+---------+--------------------+----------------+

| book_id | name               | available_from |

+---------+--------------------+----------------+

| 1       | "Kalila And Demna" | 2010-01-01     |

| 2       | "28 Letters"       | 2012-05-12     |

| 3       | "The Hobbit"       | 2019-06-10     |

| 4       | "13 Reasons Why"   | 2019-06-01     |

| 5       | "The Hunger Games" | 2008-09-21     |

+---------+--------------------+----------------+

Orders table:

+----------+---------+----------+---------------+

| order_id | book_id | quantity | dispatch_date |

+----------+---------+----------+---------------+

| 1        | 1       | 2        | 2018-07-26    |

| 2        | 1       | 1        | 2018-11-05    |

| 3        | 3       | 8        | 2019-06-11    |

| 4        | 4       | 6        | 2019-06-05    |

| 5        | 4       | 5        | 2019-06-20    |

| 6        | 5       | 9        | 2009-02-02    |

| 7        | 5       | 8        | 2010-04-13    |

+----------+---------+----------+---------------+

Output:

+-----------+--------------------+

| book_id   | name               |

+-----------+--------------------+

| 1         | "Kalila And Demna" |

| 2         | "28 Letters"       |

| 5         | "The Hunger Games" |

+-----------+--------------------+

 

 

SELECT b.book_id, b.name
FROM Books b
LEFT JOIN Orders o
  ON o.book_id = b.book_id
 AND o.dispatch_date >= DATEADD(year, -1, '2019-06-23')
WHERE b.available_from <= DATEADD(month, -1, '2019-06-23')
GROUP BY b.book_id, b.name
HAVING COALESCE(SUM(o.quantity), 0) < 10;

 


Case 1

Input

Books =

| book_id | name | available_from |
| ------- | ---------------- | -------------- |
| 1 | Kalila And Demna | 2010-01-01 |
| 2 | 28 Letters | 2012-05-12 |
| 3 | The Hobbit | 2019-06-10 |
| 4 | 13 Reasons Why | 2019-06-01 |
| 5 | The Hunger Games | 2008-09-21 |

Orders =

| order_id | book_id | quantity | dispatch_date |
| -------- | ------- | -------- | ------------- |
| 1 | 1 | 2 | 2018-07-26 |
| 2 | 1 | 1 | 2018-11-05 |
| 3 | 3 | 8 | 2019-06-11 |
| 4 | 4 | 6 | 2019-06-05 |
| 5 | 4 | 5 | 2019-06-20 |
| 6 | 5 | 9 | 2009-02-02 |
| 7 | 5 | 8 | 2010-04-13 |

Output

| book_id | name |
| ------- | ---------------- |
| 2 | 28 Letters |
| 1 | Kalila And Demna |
| 5 | The Hunger Games |

Expected

| book_id | name |
| ------- | ---------------- |
| 1 | Kalila And Demna |
| 2 | 28 Letters |
| 5 | The Hunger Games |