Sunday, October 15, 2023

 

Linux Kernel:

The Linux kernel is the privileged core of the Linux operating system and interacts directly with the hardware. Its scope includes process management, process scheduling, system calls, interrupt handling, bottom halves, kernel synchronization and its techniques, memory management, and the process address space.

A process is a program in execution on the processor. Threads are the objects of activity within the process, and the kernel schedules individual threads. Linux does not differentiate between a thread and a process: threads are implemented as processes that share resources, so a multi-threaded program is represented by multiple such tasks. A process is created with the fork call, which returns twice: once in the child process and once in the parent. At the time of the fork, the child logically receives a copy of the parent's resources. When the exec call is made, a new address space is loaded for the process.

The Linux kernel maintains a doubly linked list of task structures, one per process, and refers to them as process descriptors, which hold all the information about a process. The size of the process descriptor depends on the architecture of the machine; on 32-bit machines it is about 1.7 KB. Each process also has a kernel stack spanning a range from a low memory address to a high memory address; the stack grows from the high address toward the low address, and its front is tracked by the stack pointer. The thread_info structure sits at the low end of the kernel stack and exists to conserve memory: storing the full 1.7 KB task_struct in a small per-process kernel stack would use up a lot of it, so thread_info instead holds a pointer to the task_struct, and a pointer occupies only a tiny amount of space. The PID identifies a process among the thousands that may exist; the maximum number of PIDs can be configured via /proc/sys/kernel/pid_max. A current macro points to the task_struct of the currently executing task.

A process can be in different process states. When a process is forked, it first enters the ready (runnable) state. When the scheduler dispatches the task, it enters the running state, and when the task exits, it is terminated.
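The pid_max tunable mentioned above is an ordinary procfs file, so it can be read from user space. A minimal sketch, assuming a Linux system with procfs mounted at /proc (on other systems the function simply returns None):

```python
from pathlib import Path

def read_pid_max(path="/proc/sys/kernel/pid_max"):
    """Return the largest PID the kernel will assign, or None if procfs is absent."""
    p = Path(path)
    if not p.exists():          # e.g. on a non-Linux system
        return None
    return int(p.read_text().strip())

print(read_pid_max())
```

Writing a new value to the same file (as root) raises or lowers the limit at runtime.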
A task can switch between running and ready many times, possibly passing through an intermediate waiting state such as the interruptible sleep state. In this state the task sleeps on a wait queue for a specific event; when the event occurs, the task is woken up and placed back on the run queue. The state is visible in the task_struct of every process, and the kernel provides set_task_state() to manipulate it. The process context is the context in which the kernel executes on behalf of a process; it is entered through a system call. The current macro is not valid in interrupt context.

Init is the first process that gets created, and it then forks the other user-space processes. The entries in /etc/inittab tell init which processes and daemons to create, and a process tree organizes the processes. Copy-on-write is a technique that defers copying the address space until a child writes to it; until that time, all reading child processes share a single instance. The set of resources, such as virtual memory, file system, and signals, that can be shared is determined by the clone system call, which is invoked as part of the fork system call. When even the copying of page tables must be avoided, vfork is used instead of fork: the child borrows the parent's address space and the parent is suspended until the child calls exec or exits. Kernel threads run only within the kernel and have no associated process address space; flush is an example. The ps -ef command lists all processes, including the kernel threads.

All the setup performed at the time of fork is reversed when the process exits, and the process descriptor is removed once all references to it are dropped. A zombie process is one that is no longer in the running state but whose process descriptor still lingers because its parent has not yet reaped it. When a parent exits before its child, the child becomes parentless, and the kernel provides the child with a new parent.
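The fork/wait life cycle described above can be observed directly from user space. A minimal sketch, assuming a POSIX system where os.fork is available: fork returns twice, and the parent's wait is what reaps the child so it does not linger as a zombie.

```python
import os

def fork_and_wait():
    """fork() returns 0 in the child and the child's PID in the parent."""
    pid = os.fork()
    if pid == 0:
        os._exit(7)                    # child: exit immediately with a known status
    _, status = os.waitpid(pid, 0)     # parent: reap the child so no zombie lingers
    return os.WEXITSTATUS(status)      # recover the child's exit code
```

Between the child's exit and the parent's waitpid call, the child is exactly the zombie described above: dead, but with its process descriptor retained so the parent can read its status.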

The kernel has two major responsibilities:

- To interact with and control the system’s hardware components.

- To provide an environment in which applications can run.

All the low-level hardware interactions are hidden from the user mode applications. The operating system evaluates each request and interacts with the hardware component on behalf of the application.

Contrary to what its subsystem structure might suggest, the Linux kernel is monolithic: all of the subsystems are tightly integrated into a single kernel. This differs from the microkernel architecture, where the kernel provides only minimal functionality and the operating-system services run on top of the microkernel as separate processes. Microkernels are generally slower because of the message passing between the layers. The Linux kernel, however, supports modules, which allow it to be extended: a module is an object that can be linked into the kernel at runtime.

System calls are what an application uses to interact with kernel resources, and they are designed with security and stability in mind. An API provides a wrapper over the system calls so that the two can vary independently; applications link against libraries, such as the C library, that provide these wrappers rather than invoking system calls directly.
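The wrapper relationship can be made visible by calling the C library directly and comparing it with the higher-level API. A sketch, assuming a POSIX C library is loadable via ctypes; getpid is used because its wrapper maps almost one-to-one onto the system call:

```python
import ctypes
import os

# Load the process's own C library; its getpid() issues the getpid system call.
libc = ctypes.CDLL(None)

def getpid_via_libc():
    """Call the C-library wrapper around the getpid system call."""
    return libc.getpid()
```

os.getpid() in Python ultimately goes through the same wrapper, so both paths report the same PID.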

The /proc file system provides the user with a view of the internal kernel data structures. It is a virtual file system used to fine tune the kernel’s performance as well as the overall system.
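Because /proc exposes kernel data as plain text files, inspecting it needs nothing more than ordinary file reads. A minimal sketch, assuming the Linux /proc/meminfo format of "Key: value" lines (on systems without procfs the function returns an empty dict):

```python
def meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a {key: value-string} dict."""
    info = {}
    try:
        with open(path) as f:
            for line in f:
                key, _, rest = line.partition(":")
                info[key.strip()] = rest.strip()
    except FileNotFoundError:   # e.g. not running on Linux
        pass
    return info
```

The same pattern works for other virtual files such as /proc/cpuinfo or per-process entries like /proc/&lt;pid&gt;/status.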

The various aspects of memory management in Linux include the address space, physical memory, memory mapping, paging, and swapping.

One of the advantages of virtual memory is that each process thinks it has all the address space it needs. The isolation enables processes to run independently of one another, and the virtual memory can be much larger than the physical memory in the system. The application views the address space as a flat linear address space, divided into two parts: the user address space and the kernel address space. Where the split falls depends on the system architecture: on 32-bit systems, the user space is 3 GB and the kernel space is 1 GB. The location of the split is determined by the PAGE_OFFSET kernel configuration variable.

Physical memory can be arranged into banks, with each bank a particular distance from the processor. The Linux VM represents this arrangement in an architecture-independent way as nodes. Each node is divided into blocks called zones that represent ranges within memory. There are three different zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. Each zone has its own use: ZONE_NORMAL holds kernel data and ZONE_HIGHMEM holds user data.

When memory mapping occurs, the kernel has one GB of address space. The DMA and NORMAL ranges are directly mapped into this address space, which leaves only 128 MB of virtual address space, used for vmalloc and kmap. On systems with Physical Address Extension, handling physical memory in the tens of gigabytes can be hard for Linux: the kernel handles high memory on a page-by-page basis, mapping each page into a small virtual address window (kmap), operating on that page, and then unmapping it. 64-bit architectures do not have this problem because their address space is huge.

How virtual memory is implemented depends on the hardware. Virtual memory is divided into fixed-size chunks called pages, and virtual memory references are translated into physical addresses using page tables. Different architectures and page sizes are accommodated by a three-level paging mechanism involving the Page Global Directory, the Page Middle Directory, and the Page Table. This address translation separates the virtual address space of a process from the physical address space. If a referenced page is not present in physical memory, the access generates a page fault, which the kernel handles by bringing the page into main memory, evicting another page if necessary.
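The page arithmetic underlying this translation is simple: an address splits into a page number (which the page tables map) and an offset within the page (which passes through unchanged). A sketch using the page size the C library reports; the helper names are illustrative, not kernel APIs:

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE      # the fixed-size unit of virtual memory on this machine

def page_number(addr, page_size=PAGE_SIZE):
    """Which page an address falls in; this part is translated via page tables."""
    return addr // page_size

def page_offset(addr, page_size=PAGE_SIZE):
    """Offset within the page; this part is copied through untranslated."""
    return addr % page_size
```

On most x86 systems PAGE_SIZE is 4096, so an address is effectively split into a 12-bit offset and a page number handled by the three-level lookup.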

Swapping is the moving of an entire process to and from secondary storage when main memory is low, but it is generally avoided because the resulting context switches are expensive. Linux instead performs swapping at the page level rather than at the process level: paging is used to expand the process address space and to circulate pages by discarding some of the less frequently used or unused pages and bringing in new ones. Since it writes to disk, and disk I/O is slow, paging still has a cost.

Interprocess communication (IPC) occurs with the help of signals and pipes; Linux also supports the System V IPC mechanisms. Signals notify one or more processes of events and can be used as a primitive means of communication and synchronization between user processes; they can also be used for job control. Processes can choose to ignore most signals, with the well-known exceptions of SIGSTOP and SIGKILL: the first causes a process to halt its execution, and the second causes it to exit. Default actions, carried out by the kernel, are associated with each signal. Signals are not delivered to a process until it enters the running state; when a process exits a system call, pending signals are then delivered. Linux is POSIX compatible, so a process can specify which signals are blocked while a particular signal-handling routine runs.
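Replacing a signal's default action with a handler can be shown in a few lines. A sketch, assuming a POSIX system where SIGUSR1 exists; the process sends the signal to itself and records its delivery:

```python
import os
import signal

received = []

def handler(signum, frame):
    """Runs instead of the default action when the signal is delivered."""
    received.append(signum)

signal.signal(signal.SIGUSR1, handler)   # install a handler for SIGUSR1
os.kill(os.getpid(), signal.SIGUSR1)     # send the signal to ourselves
```

Had the handler not been installed, the default action for SIGUSR1 would have terminated the process; SIGKILL and SIGSTOP, by contrast, cannot have handlers installed at all.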

A pipe is a unidirectional, ordered and unstructured stream of data. Writers add data at one end and readers get it from the other end. An example is the command “ls | less” which paginates the results of the directory listing.
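The same unidirectional stream is available programmatically: one file descriptor for the write end, one for the read end. A minimal single-process sketch (the usual pattern passes one end to a forked child, as the shell does for "ls | less"):

```python
import os

r, w = os.pipe()          # unidirectional: bytes flow from w to r
os.write(w, b"hello")     # writer adds data at one end
os.close(w)               # closing the write end signals EOF to the reader
data = os.read(r, 1024)   # reader takes it from the other end, in order
os.close(r)
```

The stream is unstructured: the reader sees a byte sequence with no message boundaries, which is the contrast drawn with message queues below.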

UNIX System V introduced IPC mechanisms in 1983 which included message queues, semaphores, and shared memory. The mechanisms all share common authentication methods and Linux supports all three. Processes access these resources by passing a unique resource identifier to the kernel via system calls.

Message queues allow one or more processes to write messages, which will be read by one or more processes. They are more versatile than pipes because the unit is a message rather than an unformatted stream of bytes and messages can be prioritized based on a type association.

Semaphores are objects that support atomic operations such as test and set. They act as counters that control access to shared resources by multiple processes. Semaphores are most often used as locking mechanisms, but they must be used carefully to avoid deadlock, for example when two processes each hold one semaphore while waiting for the one held by the other.
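The counting behaviour is the same whether the semaphore is a System V IPC object or an in-process primitive. A sketch using Python threads in place of processes: a semaphore initialized to 2 admits at most two workers into the guarded section at once.

```python
import threading

slots = threading.Semaphore(2)   # counter: at most two holders at a time
active = []
peak = 0
lock = threading.Lock()          # protects the bookkeeping lists below

def worker():
    global peak
    with slots:                  # acquire decrements the counter; release restores it
        with lock:
            active.append(1)
            peak = max(peak, len(active))
        with lock:
            active.pop()

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many workers contend, peak never exceeds the semaphore's initial count.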

Shared memory lets processes communicate through memory that appears in the virtual address space of each participating process. Each process that wishes to share the memory must attach it to its virtual address space via a system call, and must similarly detach it when the memory is no longer needed.
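The effect of a mapping visible to several processes can be sketched with an anonymous shared mmap inherited across fork, rather than the System V shmget/shmat calls themselves. Assumes a POSIX system where os.fork is available:

```python
import mmap
import os

shared = mmap.mmap(-1, 16)       # anonymous shared mapping, inherited by children

pid = os.fork()
if pid == 0:
    shared[:5] = b"hello"        # child writes into the shared region
    os._exit(0)
os.waitpid(pid, 0)               # parent waits for the child to finish
message = shared[:5]             # then observes the child's write directly
```

Because both processes map the same physical pages, no copying or message passing is involved; the write is simply visible on the other side.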

Linux supports the symmetric multiprocessing model. A multiprocessing system consists of a number of processors communicating via a bus or a network. Multiprocessing systems are either loosely coupled or tightly coupled. A loosely coupled system consists of processors that operate standalone: each has its own bus, memory, and I/O subsystem, and communicates with the other processors over a network medium. A tightly coupled system consists of processors that share memory, a bus, devices, and sometimes a cache. Tightly coupled systems can be symmetric or asymmetric: asymmetric systems have a single master processor that controls the others, while symmetric systems are further subdivided into dedicated-cache and shared-cache designs.

Ideally, an SMP System with n processors would perform n times better than a uniprocessor system but in reality, no SMP is 100% scalable.

Because multiple processors execute multiple threads at the same time, SMP systems rely on locks, and a lock must be held for the shortest time possible. Another common technique is finer-grained locking, so that instead of locking an entire table, only a few rows are locked at a time. Linux 2.6 removed most of the global locks, and its locking primitives are optimized for low overhead.
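The finer-grained idea can be sketched as lock striping: instead of one lock over a whole table, keep one lock per bucket so unrelated updates do not contend. The class below is illustrative, not a kernel structure:

```python
import threading

class StripedCounter:
    """A counter table with one lock per stripe instead of one global lock."""

    def __init__(self, stripes=16):
        self.locks = [threading.Lock() for _ in range(stripes)]
        self.counts = [0] * stripes

    def increment(self, key):
        i = hash(key) % len(self.locks)
        with self.locks[i]:          # lock only this key's stripe, not the table
            self.counts[i] += 1

    def total(self):
        return sum(self.counts)

counter = StripedCounter()
for k in range(100):
    counter.increment(k)
```

Two threads updating keys in different stripes never block each other, which is exactly the contention reduction coarse global locks cannot provide.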

Multiprocessors exhibit the cache-coherency problem: each processor has an individual cache, so multiple copies of certain data exist in the system and can get out of sync.

Processor affinity improves system performance because the data and resources accessed by the code stay warm in the processor’s cache, so affinity lets the code reuse them rather than fetch them repeatedly. The benefit of processor affinity is accentuated on Non-Uniform Memory Access (NUMA) architectures, where some resources are closer to one processor than to others.
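On Linux, a process's affinity mask is queryable (and settable) from user space. A sketch using os.sched_getaffinity, which exists only on Linux, with a fallback for other platforms:

```python
import os

def current_affinity():
    """Return the set of CPU indices the calling process may run on."""
    if hasattr(os, "sched_getaffinity"):       # Linux-only API
        return os.sched_getaffinity(0)         # 0 means the calling process
    return set(range(os.cpu_count() or 1))     # fallback: assume all CPUs
```

Pinning a process with os.sched_setaffinity(0, {2}) restricts it to CPU 2, keeping its working set warm in that CPU's cache.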

Linux supports several file systems. The Virtual File System Interface allows Linux to support many file systems via a common interface. It is designed to allow access to files as fast and efficiently as possible.

Ext2fs was the original widely used Linux file system, supporting the typical file operations: creating, updating, and deleting files, directories, hard links, soft links, device special files, sockets, and pipes. It suffered from one limitation: if the system crashed, the entire file system had to be checked and corrected for inconsistencies before it could be remounted. This was improved with journaling, where every file system operation is logged before it is executed, and the log is replayed after a crash to bring the file system back to consistency.

Linux Volume Managers and Redundant Array of Inexpensive Disks (RAID) provide a logical abstraction of a computer’s physical storage devices and can combine several disks into a single logical unit to provide increased total storage space as well as data redundancy. Even on a single disk, they can divide the space into multiple logical units, each for a different purpose.

Linux provides four different RAID levels. RAID-Linear is a simple concatenation of the disks that comprise the volume. RAID-0 is simple striping: written data is interleaved in equal-sized “chunks” across all disks in the volume. RAID-1 is mirroring: all data is replicated on all disks in the volume, so a RAID-1 volume created from n disks can survive the failure of n-1 of those disks. RAID-5 is striping with parity: it is similar to RAID-0, but one chunk in each stripe contains parity information instead of data, so RAID-5 can survive the failure of any single disk in the volume.
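RAID-5's single-disk survival rests on XOR parity: the parity chunk is the XOR of the data chunks in its stripe, and any one lost chunk is the XOR of everything that survives. A sketch of the arithmetic (not the md driver itself):

```python
def parity(chunks):
    """XOR parity over equal-sized data chunks, as in one RAID-5 stripe."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def reconstruct(surviving_chunks, parity_chunk):
    """Recover the single lost chunk by XOR-ing the survivors with the parity."""
    return parity(list(surviving_chunks) + [parity_chunk])

# One stripe across three data disks plus a parity chunk; lose the middle disk.
chunks = [b"abcd", b"efgh", b"ijkl"]
p = parity(chunks)
recovered = reconstruct([chunks[0], chunks[2]], p)
```

The same identity explains why losing two disks is fatal: with two unknowns, one XOR equation no longer determines either of them.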

A Volume-Group is formed from a collection of disks, also called Physical-Volumes. The storage space provided by these disks is then used to create Logical-Volumes. A Volume-Group is resizable: new volumes are easy to add as extents, the Logical-Volumes can be expanded or shrunk, and the data on the LVs can be moved around within the same Volume-Group.

Beyond the hard disk, keyboard, and console that a Linux system supports by default, a user-level application can create device special files to access other hardware devices. These are mounted as device nodes in the /dev directory and usually come in two types: block devices and character devices. Block devices allow block-level access to the data residing on a device, while character devices allow character-level access. The ls -l command shows a ‘b’ for a block device and a ‘c’ for a character device in the permission string. The virtual file system devfs is an alternative to these special files: it reduces the system-administrative task of creating a device node for each device. A system administrator can mount the devfs file system many times at different mount points, and changes to a device node are reflected at all of them. The devfs namespace exists in the kernel even before it is mounted, which makes the device nodes available independently of the root file system.

Linux also supports FUSE, a user-space file-system framework. It consists of a kernel module (fuse.ko), a user-space library (libfuse.*), and a mount utility (fusermount). One of the most important features of FUSE is that it allows secure, non-privileged mounts. One example is sshfs, a secure network file system that uses the sftp protocol.
