Monday, March 20, 2023

Linux Kernel continued...

 

Interprocess communication (IPC) occurs with the help of signals and pipes; Linux also supports the System V IPC mechanisms. Signals notify one or more processes of events and can be used as a primitive form of communication and synchronization between user processes; they can also be used for job control. Processes can choose to ignore most signals, with the well-known exceptions of SIGSTOP, which halts a process's execution, and SIGKILL, which causes it to exit. Default actions are associated with signals and are carried out by the kernel. Signals are not delivered to a process until it moves from the ready state to the running state; when a process exits a system call, pending signals are then delivered. Linux is POSIX compatible, so a process can specify which signals are blocked while a particular signal-handling routine runs.
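As a small user-space sketch of these semantics (shown from Python rather than kernel internals), a process can install a handler for a catchable signal such as SIGUSR1; SIGKILL and SIGSTOP can be neither caught nor ignored:

```python
import os
import signal

received = []

def on_usr1(signum, frame):
    # Runs when the kernel delivers SIGUSR1 to this process.
    received.append(signum)

# Install a handler; this would fail for SIGKILL or SIGSTOP.
signal.signal(signal.SIGUSR1, on_usr1)

# Send ourselves the signal; it is delivered when we next leave the kernel.
os.kill(os.getpid(), signal.SIGUSR1)
```

After `os.kill` returns, the pending signal has been delivered and the handler has recorded it.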

A pipe is a unidirectional, ordered, and unstructured stream of data. Writers add data at one end and readers take it from the other. An example is the command “ls | less”, which paginates the results of the directory listing.
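A minimal sketch of the same mechanism through the pipe system call, reached here via Python's os module:

```python
import os

r, w = os.pipe()           # unidirectional: bytes flow from w to r

os.write(w, b"hello ")     # the writer adds data at one end...
os.write(w, b"world")
os.close(w)                # closing the write end signals EOF to the reader

data = os.read(r, 1024)    # ...and the reader takes it, in order, from the other
os.close(r)
```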

UNIX System V introduced IPC mechanisms in 1983 which included message queues, semaphores, and shared memory. The mechanisms all share common authentication methods and Linux supports all three. Processes access these resources by passing a unique resource identifier to the kernel via system calls.

Message queues allow one or more processes to write messages, which will be read by one or more processes. They are more versatile than pipes because the unit is a message rather than an unformatted stream of bytes and messages can be prioritized based on a type association.
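System V message queues are not wrapped by the Python standard library, so the type-addressed behavior is sketched below with a toy in-memory model only; the class name is hypothetical, and the method names merely echo msgsnd/msgrcv:

```python
from collections import deque

class ToyMsgQueue:
    """Toy model of a System V message queue: every message carries a type."""

    def __init__(self):
        self._msgs = deque()

    def msgsnd(self, mtype, payload):
        self._msgs.append((mtype, payload))

    def msgrcv(self, mtype=0):
        # mtype 0 takes the oldest message; mtype > 0 takes the oldest
        # message of that type, which is how readers prioritize by type.
        for i, (t, payload) in enumerate(self._msgs):
            if mtype == 0 or t == mtype:
                del self._msgs[i]
                return payload
        raise LookupError("no matching message")

q = ToyMsgQueue()
q.msgsnd(1, "low")
q.msgsnd(2, "urgent")
first = q.msgrcv(mtype=2)   # picked ahead of the older type-1 message
```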

Semaphores are objects that support atomic operations such as test-and-set. They act as counters that control access to shared resources by multiple processes. Semaphores are most often used as locking mechanisms but must be used carefully to avoid deadlocks, such as when a thread holds a lock and never releases it.
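A small sketch with a counting semaphore from Python's threading module; capping concurrent holders at two, and releasing on every path (here via `with`), avoids the never-released-lock problem mentioned above:

```python
import threading

slots = threading.Semaphore(2)   # counter: at most two concurrent holders
done = []

def worker(i):
    with slots:                  # acquire decrements; exiting always releases
        done.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```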

Shared memory lets processes communicate through memory that appears in the virtual address spaces of all participating processes. Each process that wishes to share the memory must attach it to its virtual address space via a system call, and must similarly detach it when the memory is no longer needed.
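The attach/detach lifecycle can be sketched with Python's multiprocessing.shared_memory (POSIX shared memory rather than the System V shmget/shmat calls, but the same pattern):

```python
from multiprocessing import shared_memory

# Create a shared segment and write into it.
seg = shared_memory.SharedMemory(create=True, size=16)
seg.buf[:5] = b"hello"

# A second handle (normally in another process) attaches by name.
peer = shared_memory.SharedMemory(name=seg.name)
data = bytes(peer.buf[:5])

peer.close()   # detach when no longer needed
seg.close()    # detach
seg.unlink()   # remove the segment itself
```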

Linux has a symmetric multiprocessing model. A multiprocessing system consists of a number of processors communicating via a bus or a network. Multiprocessing systems are either loosely coupled or tightly coupled. A loosely coupled system consists of processors that operate standalone: each has its own bus, memory, and I/O subsystem and communicates with the other processors over the network. A tightly coupled system consists of processors that share memory, the bus, devices, and sometimes cache. Tightly coupled systems can be symmetric or asymmetric. An asymmetric system has a single master processor that controls the others. Symmetric systems are further subdivided into dedicated-cache and shared-cache systems.

Ideally, an SMP system with n processors would perform n times better than a uniprocessor system, but in reality no SMP system is 100% scalable.

Because multiple processors execute multiple threads at the same time, SMP systems use locks to protect shared data, and lock hold times must be kept as short as possible. Another common technique is finer-grained locking: instead of locking an entire table, only a few rows are locked at a time. Linux 2.6 removed most of the global locks, and its locking primitives are optimized for low overhead.
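The row-versus-table idea can be sketched in user space with one lock per row, so threads touching different rows never serialize against each other (the names here are illustrative):

```python
import threading

rows = {i: 0 for i in range(4)}
row_locks = {i: threading.Lock() for i in rows}   # finer grain: one lock per row

def bump(i, times):
    for _ in range(times):
        with row_locks[i]:     # only row i is held; other rows stay free
            rows[i] += 1

threads = [threading.Thread(target=bump, args=(i, 1000)) for i in rows]
for t in threads:
    t.start()
for t in threads:
    t.join()
```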

Multiprocessors exhibit the cache coherency problem: because each processor has its own cache, multiple copies of the same data exist in the system and can get out of sync.

Processor affinity improves system performance because the data and resources accessed by the code stay local to the processor’s cache; while the cache is warm, affinity lets the processor reuse those entries rather than fetch them repeatedly. The benefit of processor affinity is accentuated on Non-Uniform Memory Access (NUMA) architectures, where some resources are closer to a given processor than others.
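On Linux, affinity can be queried and set per process; a sketch (sched_setaffinity requires the new CPU set to be a subset of the currently allowed mask):

```python
import os

# Which CPUs may this process run on? (Linux-specific call.)
allowed = os.sched_getaffinity(0)

# Pin ourselves to one CPU so our data stays warm in that CPU's cache...
target = min(allowed)
os.sched_setaffinity(0, {target})
pinned = os.sched_getaffinity(0)

# ...then restore the original mask.
os.sched_setaffinity(0, allowed)
```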

Sunday, March 19, 2023

Linux Kernel

 

The kernel has two major responsibilities:

- To interact with and control the system’s hardware components.

- To provide an environment in which the application can run.

All the low-level hardware interactions are hidden from the user mode applications. The operating system evaluates each request and interacts with the hardware component on behalf of the application.

Contrary to what its subsystem structure might suggest, the Linux kernel is monolithic: all of the subsystems are tightly integrated to form the whole kernel. This differs from a microkernel architecture, where the kernel provides only minimal functionality and the operating system services run on top of the microkernel as processes. Microkernels are generally slower because of the message passing between the various layers. The Linux kernel does, however, support modules, which allow it to be extended: a module is an object that can be linked into the kernel at runtime.

System calls are what an application uses to interact with kernel resources, and they are designed to ensure security and stability. An API provides a wrapper over the system calls so that the two can vary independently; applications usually program against the APIs, which are provided as libraries, rather than invoking system calls directly.
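The wrapper relationship can be seen by reaching the same getpid system call two ways: through a high-level API (os.getpid) and through the C library (a sketch using ctypes, assuming a platform where loading the current process exposes libc symbols):

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)

pid_via_libc = libc.getpid()   # the C-library wrapper around the system call
pid_via_api = os.getpid()      # the higher-level API over the same call
```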

The /proc file system provides the user with a view of the internal kernel data structures. It is a virtual file system used to fine tune the kernel’s performance as well as the overall system.
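For example, a process can read its own kernel-maintained state from /proc (Linux-specific paths):

```python
import os

# /proc/self/status exposes the current task's kernel data as text fields.
fields = {}
with open("/proc/self/status") as f:
    for line in f:
        key, _, value = line.partition(":")
        fields[key] = value.strip()
```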

The various aspects of memory management in Linux include address space, physical memory, memory mapping, paging, and swapping.

One of the advantages of virtual memory is that each process thinks it has all the address space it needs, and this isolation enables processes to run independently of one another. Virtual memory can also be much larger than the physical memory in the system. An application views its address space as a flat linear range, divided into two parts: the user address space and the kernel address space. Where the split falls depends on the system architecture; on 32-bit systems the user space is 3 GB and the kernel space is 1 GB, and the location of the split is determined by the PAGE_OFFSET kernel configuration variable.

Physical memory, whose layout varies by architecture, is represented by the Linux VM in an architecture-independent way. Memory can be arranged into banks, with each bank a particular distance from the processor, and the VM represents this arrangement as a node. Each node is divided into blocks called zones that represent ranges within memory. There are three different zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. Each zone has its own use, with ZONE_NORMAL serving the kernel and ZONE_HIGHMEM serving user data.

For memory mapping, the kernel has 1 GB of address space on a 32-bit system. The DMA and NORMAL ranges are directly mapped into this address space, which leaves only 128 MB of virtual address space, used for vmalloc and kmap. On systems that allow Physical Address Extension (PAE), handling physical memory in the tens of gigabytes is hard for Linux: the kernel handles high memory on a page-by-page basis, mapping each page into a small virtual address window (kmap), operating on it, and unmapping it. 64-bit architectures do not have this problem because their address space is huge.

Virtual memory is implemented in a way that depends on the hardware. It is divided into fixed-size chunks called pages, and virtual memory references are translated into physical addresses using page tables. Different architectures and page sizes are accommodated with a three-level paging mechanism involving the Page Global Directory, the Page Middle Directory, and the Page Table. This address translation separates the virtual address space of a process from the physical address space. If an address is not present in physical memory, a page fault is generated and handled by the kernel, which brings the page into main memory, evicting another page if necessary.
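The translation path can be sketched by splitting a virtual address into the three directory indices plus a page offset (the bit widths below are illustrative, not any real architecture's):

```python
PAGE_SHIFT = 12   # 4 KB pages: the low 12 bits are the offset within the page
INDEX_BITS = 10   # toy width for each of the three table indices

def split_vaddr(vaddr):
    """Return (pgd_index, pmd_index, pte_index, page_offset)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    pte = (vaddr >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    pmd = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    pgd = vaddr >> (PAGE_SHIFT + 2 * INDEX_BITS)
    return pgd, pmd, pte, offset
```

Walking PGD, then PMD, then PTE with these indices yields the physical frame; adding the offset gives the physical address.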

Swapping is the movement of an entire process to and from secondary storage when main memory is low; it is generally not preferred because context switches are expensive. Instead, paging is preferred: Linux swaps at the page level rather than the process level. Paging is used to expand the process address space and to circulate pages, discarding some of the less frequently used or unused pages and bringing in new ones. Since it writes to disk, it is limited by slow disk I/O.

Saturday, March 18, 2023

 

Linux Kernel:

The Linux kernel is a small, special body of code at the core of the Linux operating system that interacts directly with the hardware. Its concerns include process management, process scheduling, system calls, interrupt handling, bottom halves, kernel synchronization and its techniques, memory management, and the process address space.

A process is a program in execution on the processor, and threads are the objects of activity within the process. The kernel schedules individual threads, and Linux does not differentiate between a thread and a process: a multi-threaded program is simply represented by multiple tasks. A process is created using the fork call, which returns twice: once in the child process and once in the parent. At the time of the fork, the child receives copies of the parent's resources; when the exec call is made, a new address space is loaded for the process.

The Linux kernel maintains a doubly linked list of task structures for the processes and refers to them with process descriptors, which hold information about each process. The size of the process structure depends on the architecture of the machine; on 32-bit machines it is about 1.7 KB. The task structure is reached through the kernel stack that exists for each process. A process kernel stack spans a low memory address and a high memory address: the stack grows from the high address toward the low address, and its top can be found with the stack pointer. The thread_info structure is stored at the low-address end of this stack; it exists to conserve memory, since storing the full 1.7 KB task_struct in a 4 KB stack would use up a lot of it. Instead, thread_info holds a pointer to the task_struct, a redirection that occupies very little space. The PID identifies a process among thousands; the maximum number of processes in Linux can be set in the configuration at pid_max under the nested proc, sys, and kernel directories. A current macro points to the task_struct of the currently executing task.

Processes can be in different states. The first state entered when a process is forked is the ready state; when the scheduler dispatches the task to run, it enters the running state, and when the task exits, it is terminated.
A task can switch between running and ready many times by going through an intermediate state, the waiting (or interruptible) state, in which it sleeps on a wait queue for a specific event. When the event occurs, the task is woken up and placed back on the run queue. This state is visible in the task_struct of every process, and the set_task_state API can be used to manipulate it. The process context is the context in which the kernel executes on behalf of the process; it is triggered by a system call. The current macro is not valid in interrupt context.

The init process is the first process created, and it forks the other user-space processes; the inittab entries under etc list the processes and daemons to create. A process tree helps organize the processes. Copy-on-write is a technique in which the address space is copied only when a child modifies it; until then, all reading child processes share a single instance. The set of resources, such as virtual memory, file system, and signals, that can be shared is determined by the clone system call, which is invoked as part of the fork system call. When even the page tables do not need to be copied, the vfork system call can be used instead of fork.

Kernel threads run only within the kernel and do not have an associated process address space; flush is an example of a kernel thread, and the ps -ef command lists kernel threads along with the other processes. All the work done at fork time is reversed at process exit, and the process descriptor is removed once all references to it are dropped. A zombie process is one that is no longer in the running state but whose process descriptor still lingers. When a parent exits before its child, the child becomes parentless, and the kernel provides it with a new parent.
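The fork-returns-twice behavior described above can be sketched as follows (requires a POSIX system; os.waitstatus_to_exitcode needs Python 3.9+):

```python
import os

pid = os.fork()
if pid == 0:
    # fork returned 0: we are the child; an exec or exit happens here.
    os._exit(7)
else:
    # fork returned the child's PID: we are the parent; reaping the child
    # keeps its process descriptor from lingering as a zombie.
    _, status = os.waitpid(pid, 0)
    child_code = os.waitstatus_to_exitcode(status)
```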

Friday, March 17, 2023

 

SQL Schema


Table: Books

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| book_id        | int     |
| name           | varchar |
| available_from | date    |
+----------------+---------+

book_id is the primary key of this table.

 

Table: Orders

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| order_id       | int     |
| book_id        | int     |
| quantity       | int     |
| dispatch_date  | date    |
+----------------+---------+

order_id is the primary key of this table.
book_id is a foreign key to the Books table.

 

Write an SQL query that reports the books that have sold fewer than 10 copies in the last year, excluding books that have been available for less than one month from today. Assume today is 2019-06-23.

Return the result table in any order.

The query result format is in the following example.

 

Example 1:

Input:

Books table:
+---------+--------------------+----------------+
| book_id | name               | available_from |
+---------+--------------------+----------------+
| 1       | "Kalila And Demna" | 2010-01-01     |
| 2       | "28 Letters"       | 2012-05-12     |
| 3       | "The Hobbit"       | 2019-06-10     |
| 4       | "13 Reasons Why"   | 2019-06-01     |
| 5       | "The Hunger Games" | 2008-09-21     |
+---------+--------------------+----------------+
Orders table:
+----------+---------+----------+---------------+
| order_id | book_id | quantity | dispatch_date |
+----------+---------+----------+---------------+
| 1        | 1       | 2        | 2018-07-26    |
| 2        | 1       | 1        | 2018-11-05    |
| 3        | 3       | 8        | 2019-06-11    |
| 4        | 4       | 6        | 2019-06-05    |
| 5        | 4       | 5        | 2019-06-20    |
| 6        | 5       | 9        | 2009-02-02    |
| 7        | 5       | 8        | 2010-04-13    |
+----------+---------+----------+---------------+
Output:
+-----------+--------------------+
| book_id   | name               |
+-----------+--------------------+
| 1         | "Kalila And Demna" |
| 2         | "28 Letters"       |
| 5         | "The Hunger Games" |
+-----------+--------------------+

 

 

SELECT b.book_id, b.name
FROM Books b
LEFT JOIN Orders o
  ON o.book_id = b.book_id
 AND o.dispatch_date > DATEADD(year, -1, '2019-06-23')
WHERE b.available_from < DATEADD(month, -1, '2019-06-23')
GROUP BY b.book_id, b.name
HAVING COALESCE(SUM(o.quantity), 0) < 10;

 



Thursday, March 16, 2023

Improvements to Azure for application modernization purposes:

As fears of a global slowdown grip the tech industry, organizations planning their digital transformation must do more with less. This essay suggests two improvements to the Azure public cloud. First, a tool that can extract the interfaces from legacy application source code and stage them for a microservice transformation. Second, a pre-assembled, pre-configured set of Azure resources that makes it easy to deploy various applications.

Azure already offers significant innovations as cost-effective differentiation from its nearest competitor, and these two improvements would help organizations with a charter for cloud adoption to make the leap.

Azure claims to provide savings of up to 54% over running applications on-premises and 35% over running them on AWS, according to its published reports. Streamlined operations, simplified administration, and proximity are additional benefits. Built-in tools in Visual Studio and MSSQL ease migrations of applications and databases respectively. The features that differentiate Azure from its competitor for such savings include the Hybrid Benefit and the TCO calculator. The Hybrid Benefit is a licensing offer that eases migration to Azure by applying existing licenses to Windows Azure, SQL Server, and Linux subscriptions. Additionally, services like Azure Arc allow the use of Azure Kubernetes Service, and Azure Stack provides a hyperconverged clustering solution for running virtualized workloads on-premises, which makes it easy to consolidate aging infrastructure and connect to Azure for cloud services. The TCO calculator helps identify the cost areas that affect current applications, such as server hardware, software licenses, electricity, and labor. It recommends a set of equivalent Azure services to support the applications and helps create a customized business case to justify the migration. All it takes is three steps: enter a few details about the current infrastructure, review the assumptions, and receive a summary with supporting analysis.

The first feature requested here is analysis of legacy applications that can explain how to convert them to a microservice architecture. An extractor that generates a KDM model from legacy application code can be automated by understanding the interfaces and whether they are candidates for segregation into microservices; dedicated parsers can help with this code-to-model transformation. The restructuring phase aims at deriving an enriched, technology-independent conceptual specification of the legacy system in a Knowledge Discovery Metamodel (KDM) from the information stored in the models generated in the previous phase. KDM is an OMG standard and can involve up to four layers: the Infrastructure layer, the Program Elements layer, the Resource layer, and the Abstractions layer, each dedicated to a particular application viewpoint. Forward engineering is the process of moving from these high-level abstractions, by means of transformational techniques, to automatically obtain a representation on a new platform such as microservices, or as constructs in a programming language such as interfaces and classes. Even the user interface can go through forward engineering into a rich single-page application with a new representation describing the organization and positioning of components. Segregating interfaces into microservices is easier with well-known patterns such as model-view-controller. Data access, profiling, and instrumentation or bookkeeping at the interface level can also add useful information for organizing interfaces into microservices. Software measurement metamodels have played a significant role in forward engineering.

The second request concerns the deployment of hosts and resources native to the cloud. Azure could provide dedicated blueprints that include policies and templates for hosting a specific type of modernized application. Microservices, for instance, require load balancers, caches, containers, ingress, and data connections; a modernized application can be deployed more easily when developers do not have to author the infrastructure and can rely on pre-assembled resources specific to their needs. Azure Service Fabric has been a veritable resource for this purpose, but the request here is for a blueprint that also establishes the size of the stamp.

Together, these capabilities would amount to full service for application migration and modernization teams.


Wednesday, March 15, 2023

 

Problem 1: Triangle Judgement

SQL Schema

Table: Triangle

+-------------+------+
| Column Name | Type |
+-------------+------+
| x           | int  |
| y           | int  |
| z           | int  |
+-------------+------+
(x, y, z) is the primary key column for this table.
Each row of this table contains the lengths of three line segments.

 

Write an SQL query to report, for each row of three line segments, whether they can form a triangle.

Return the result table in any order.

The query result format is in the following example.

 

Example 1:

Input:
Triangle table:
+----+----+----+
| x  | y  | z  |
+----+----+----+
| 13 | 15 | 30 |
| 10 | 20 | 15 |
+----+----+----+
Output:
+----+----+----+----------+
| x  | y  | z  | triangle |
+----+----+----+----------+
| 13 | 15 | 30 | No       |
| 10 | 20 | 15 | Yes      |
+----+----+----+----------+

 

SELECT x, y, z,
       CASE
           WHEN x + y > z AND y + z > x AND x + z > y THEN 'Yes'
           ELSE 'No'
       END AS triangle
FROM Triangle;



 

Tuesday, March 14, 2023

Shrinking budgets pose a tremendous challenge to organizations' digital transformation initiatives and cloud adoption roadmaps. Technology decision makers must decide what to do with the legacy applications that proliferated before the pandemic. There are three main choices: maintain the status quo and do nothing, migrate and modernize the applications to a modern cloud-based environment, or rewrite and replace them. The last might be tempting given the capabilities introduced by both AWS and Azure and a refreshed knowledge base about the application to be transformed, but both clouds have also brought down lift-and-shift costs.

As a specific example, significant cost savings can be achieved just by migrating legacy ASP.NET applications from on-premises to the cloud. Traditional .NET applications are well poised for migration by virtue of the .NET runtime on which they run. Azure claims to provide savings of up to 54% over running applications on-premises and 35% over running them on AWS, according to its published reports. Streamlined operations, simplified administration, and proximity are additional benefits. Built-in tools in Visual Studio and MSSQL ease migrations of applications and databases respectively.

One key difference between migrations to the two public clouds is Azure's Hybrid Benefit offering. The Hybrid Benefit is a licensing offer that eases migration to Azure by applying existing licenses to Windows Azure, SQL Server, and Linux subscriptions, which can realize substantial cost savings. Additionally, services like Azure Arc allow the use of Azure Kubernetes Service, and Azure Stack provides a hyperconverged clustering solution for running virtualized workloads on-premises, making it easy to consolidate aging infrastructure and connect to Azure for cloud services.

Another difference is Azure's calculator for Total Cost of Ownership (TCO). The TCO calculator helps identify the cost areas that affect current applications, such as server hardware, software licenses, electricity, and labor. It recommends a set of equivalent Azure services to support the applications. The analysis shows each cost area with an estimate of the on-premises spending versus the spending in Azure; several cost categories decrease or go away completely when workloads move to the cloud. Finally, it helps create a customized business case to justify the migration. All it takes is three steps: enter a few details about the current infrastructure, review the assumptions, and receive a summary with supporting analysis.

The only limitation an organization faces is self-imposed. Organizations and large company departments may be averse to employees growing their cloud budget beyond a thousand dollars a month. This is not the only gap. Business owners note that existing channels of supply and demand are becoming savvy competitors to the cloud, while architects do not truly enforce practices that keep overall cloud computing expenses under a limit. Employees and resource users are secured by role-based access control, but the privilege to manage subscriptions is granted to those users in a way that allows them to disproportionately escalate costs.

When this is overcome, the benefits outweigh the costs and the apprehension.