Linux Kernel:
The Linux kernel is the relatively small body of code at the core of the Linux operating system that interacts directly with the hardware. Its responsibilities include process management, process scheduling, system calls, interrupt handling, bottom halves, kernel synchronization and its techniques, memory management, and the process address space.
A process is a program in execution on the processor. Threads are the objects of activity within the process, and the kernel schedules individual threads. Linux does not differentiate between threads and processes: to the kernel, a multi-threaded program is simply a set of processes that share resources. A process is created using the fork call, which returns twice: it returns 0 in the child process and the child's PID in the parent process. At the time of the fork, the child conceptually receives a copy of all the parent's resources. When the exec call is made, a new address space is loaded for the process. The Linux kernel maintains a doubly linked list of task structures pertaining to the processes and refers to them with process descriptors, which keep all the information regarding a process. The size of the process descriptor depends on the architecture of the machine; for 32-bit machines it is about 1.7 KB. Each process also has a kernel stack, through which its task structure is reached.
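The fork-then-exec pattern described above can be observed from user space. Here is a minimal sketch in Python using the os module; the helper name spawn is ours, not a kernel or library API:

```python
import os
import sys

def spawn(argv):
    """Fork a child, exec a new program in it, and wait for the result.
    fork() returns twice: 0 in the child, the child's PID in the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: exec replaces this address space with the new program.
        os.execvp(argv[0], argv)
        os._exit(127)                 # only reached if exec fails
    # Parent: wait for the child and return its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

For example, spawn([sys.executable, "-c", "pass"]) forks a child that execs a fresh Python interpreter and returns its exit code to the parent.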
A process's kernel stack spans a range from a low memory address to a high memory address. The stack grows from the high address down toward the low address, and its current top is found with the stack pointer. Toward the low end of this stack lives the small thread_info structure, which exists to conserve memory: storing the full 1.7 KB task_struct inside a 4 KB kernel stack would use up most of it, so thread_info instead holds a pointer to the task_struct, and a pointer is a tiny redirection to the actual data structure. The PID identifies a process among thousands of others. The maximum number of processes in Linux can be configured at /proc/sys/kernel/pid_max. The current macro
points to the task_struct of the currently executing process. A process can be in different states. When a process is forked it enters the ready state; when the scheduler dispatches the task to run, it enters the running state; and when the task exits, it is terminated. A task can switch between running and ready many times by going through an intermediate state called the waiting (or interruptible) state. In this state the task sleeps on a wait queue for a specific event; when the event occurs, the task is woken up and placed back on the run queue. The state is recorded in the task_struct of every process, and the kernel provides the set_task_state API to change it. The process context is the context in which the kernel executes on behalf of a process, typically entered via a system call; the current macro is not valid in interrupt context. Init is the first process created, and it in turn forks the other user-space processes; the entries in /etc/inittab tell it which processes and daemons to start. A process tree helps
organize the processes. Copy-on-write is a technique in which parent and child initially share the same pages, and a page is copied only when one of them writes to it; until then, all readers continue to use a single instance. The set of resources, such as virtual memory, file-system information, and signal handlers, that the child shares with the parent is determined by the flags to the clone system call, which underlies fork. The vfork variant avoids copying even the page tables; the parent is suspended until the child calls exec or exits. Kernel threads run only within the kernel and do not have an associated process address space; the flush thread is one example. The ps -ef command lists the kernel threads along with the other processes (their names appear in square brackets). All the work done at fork time is reversed at process exit, and the process descriptor is removed when all references to it are dropped. A zombie process is one that is no longer in the running state but whose process descriptor still lingers because the parent has not yet waited on it. When a parent exits before its child, the child becomes parentless, and the kernel provides the child with a new parent.
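The zombie state can be observed directly from user space. The sketch below (our own illustration, not a standard recipe) forks a child that exits immediately, reads the child's state letter from /proc/&lt;pid&gt;/stat while the parent has not yet called wait, and then reaps it:

```python
import os
import time

def make_and_reap_zombie():
    """Fork a child that exits at once; until the parent waits on it,
    the child's process descriptor lingers in the 'Z' (zombie) state."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)                      # child exits immediately
    time.sleep(0.1)                      # give the child time to exit
    with open(f"/proc/{pid}/stat") as f:
        # Format is "pid (comm) state ..."; take the field after comm.
        state = f.read().split(")")[-1].split()[0]
    os.waitpid(pid, 0)                   # reap: descriptor is released
    return state
```

Running make_and_reap_zombie() should return "Z", the letter ps shows for zombie processes.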
The kernel has two major responsibilities:
- To interact with and control the system’s hardware components.
- To provide an environment in which applications can run.
All the low-level hardware interactions are hidden from the
user mode applications. The operating system evaluates each request and
interacts with the hardware component on behalf of the application.
Although it is composed of subsystems, the Linux kernel is monolithic: all of the subsystems are tightly integrated to form the whole kernel. This differs from the microkernel architecture, in which the kernel provides only minimal functionality and the operating-system layers run on top of it as separate processes. Microkernels are generally slower because of the message passing between the various layers. The Linux kernel, however, supports modules that allow it to be extended: a module is an object that can be linked into the kernel at runtime.
System calls are what an application uses to interact with kernel resources, and they are designed to ensure security and stability. An API provides a wrapper over the system calls so that the two can vary independently; applications program against the API, which is provided to them as libraries.
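The wrapper relationship can be seen by calling the C library's getpid() directly through ctypes and comparing it with Python's own os.getpid(); both are library wrappers over the same getpid system call (a Linux/glibc-oriented sketch):

```python
import ctypes
import os

def getpid_via_libc():
    """Call the C library's getpid() wrapper directly. Python's
    os.getpid() is another wrapper over the same system call."""
    libc = ctypes.CDLL(None)   # handle to the already-loaded C library
    return libc.getpid()
```

Both paths end at the same kernel entry point, so the two values must agree.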
The /proc file system provides the user with a view of the
internal kernel data structures. It is a virtual file system used to fine tune
the kernel’s performance as well as the overall system.
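As a concrete sketch, the pid_max limit mentioned earlier is exposed at /proc/sys/kernel/pid_max and can be read like an ordinary file; the value comes straight from a kernel data structure, and a privileged write to the same path would tune it (Linux-only):

```python
def read_pid_max():
    """Read the kernel's maximum PID value from the /proc
    virtual file system."""
    with open("/proc/sys/kernel/pid_max") as f:
        return int(f.read())
```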
The various aspects of memory management in Linux include the address space, physical memory, memory mapping, paging, and swapping.
One of the advantages of virtual memory is that each process
thinks it has all the address space it needs. The isolation enables processes
to run independently of one another. The virtual memory can be much larger than
physical memory in the system. The application views the address space as a
flat linear address space. It is divided into two parts: the user address space
and the kernel address space. The split between the two depends on the system architecture: on 32-bit systems the user space is typically 3 GB and the kernel space 1 GB. The location of the split is determined by the PAGE_OFFSET kernel configuration variable.
The arrangement of physical memory is architecture-dependent: it can be organized into banks, with each bank a particular distance from the processor. The Linux VM represents each such bank as a node, and each node is divided into blocks called zones that represent ranges within memory. There are three different zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. Each zone has its own use; ZONE_NORMAL serves kernel allocations and ZONE_HIGHMEM holds user data.
For memory mapping, the kernel on a 32-bit system has 1 GB of address space. The DMA and NORMAL ranges are directly mapped into it, leaving only 128 MB of virtual address space, which is used for vmalloc and kmap. On systems with Physical Address Extension, physical memories in the tens of gigabytes can be hard for Linux to handle, so the kernel deals with high memory on a page-by-page basis: it maps a page into a small virtual address window (kmap), operates on that page, and unmaps it. 64-bit architectures do not have this problem because their address space is huge.
How virtual memory is implemented depends on the hardware. Memory is divided into fixed-size chunks called pages, and virtual memory references are translated into physical addresses using page tables. Different architectures and page sizes are accommodated by a three-level paging mechanism involving the Page Global Directory, the Page Middle Directory, and the Page Table. This address translation separates the virtual address space of a process from the physical address space. If a referenced page is not present in physical memory, the access generates a page fault, which the kernel handles by bringing the page into main memory, replacing another page if necessary.
Swapping is the moving of an entire process to and from secondary storage when main memory is low, but it is generally not preferred because such wholesale transfers are expensive. Instead, Linux performs swapping at the page level rather than the process level. Paging is used to expand the process address space and to circulate pages by discarding some of the less frequently used or unused pages and bringing in new ones. Since pages are written to disk, it is still limited by slow disk I/O.
Interprocess communication (IPC) occurs with the help of signals and pipes; Linux also supports the System V IPC mechanisms. Signals notify one or more processes of events and can be used as a primitive means of communication and synchronization between user processes. Signals can also be
used for job control. Processes can choose to ignore or handle most signals, but the well-known SIGSTOP and SIGKILL can be neither caught nor ignored: the first causes a process to halt its execution and the second causes it to exit. Each signal has a default action that the kernel carries out. Signals are not delivered to a process while it sits in the ready state; they are delivered when it next runs, typically as it returns from a system call. Linux is POSIX compatible, so a process can specify which signals are blocked while a particular signal-handling routine runs.
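Installing a handler and sending a signal to oneself makes the delivery visible. This is a small sketch with Python's signal module; SIGKILL or SIGSTOP could not be caught this way:

```python
import os
import signal

received = []

def handler(signum, frame):
    """Record which signal was delivered."""
    received.append(signum)

def demo_signal():
    """Install a SIGUSR1 handler, signal ourselves, and observe delivery."""
    old = signal.signal(signal.SIGUSR1, handler)
    os.kill(os.getpid(), signal.SIGUSR1)   # delivered on return to user mode
    signal.signal(signal.SIGUSR1, old)     # restore previous disposition
    return received
```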
A pipe is a unidirectional, ordered and unstructured stream
of data. Writers add data at one end and readers get it from the other end. An
example is the command “ls | less” which paginates the results of the directory
listing.
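The same unidirectional, ordered byte stream can be created directly with the pipe system call; this sketch sends bytes through a pipe within one process:

```python
import os

def pipe_roundtrip(data: bytes) -> bytes:
    """Write bytes into one end of a pipe and read them, in order,
    from the other end."""
    r, w = os.pipe()
    os.write(w, data)
    os.close(w)                  # writer closes its end; reader sees EOF
    out = os.read(r, len(data))
    os.close(r)
    return out
```

In the shell example above, the kernel creates exactly such a pipe and connects ls's standard output to less's standard input.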
UNIX System V introduced IPC mechanisms in 1983 which
included message queues, semaphores, and shared memory. The mechanisms all
share common authentication methods and Linux supports all three. Processes
access these resources by passing a unique resource identifier to the kernel
via system calls.
Message queues allow one or more processes to write messages,
which will be read by one or more processes. They are more versatile than pipes
because the unit is a message rather than an unformatted stream of bytes and
messages can be prioritized based on a type association.
Semaphores are objects that support atomic operations such as test and set. They act as counters that control access to shared resources by multiple processes. Semaphores are most often used as locking mechanisms, but they must be used carefully to avoid deadlocks and lockups, such as when a thread takes a lock and never releases it.
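A semaphore initialized to 1 behaves as a lock. The sketch below uses Python's threading.Semaphore (an in-process analogue, not the System V semaphore API) and shows the discipline of always releasing what was acquired:

```python
import threading

def counted_increments(workers: int = 4, per_worker: int = 1000) -> int:
    """Guard a shared counter with a semaphore so concurrent
    increments do not race."""
    sem = threading.Semaphore(1)
    total = [0]

    def work():
        for _ in range(per_worker):
            sem.acquire()
            try:
                total[0] += 1        # critical section
            finally:
                sem.release()        # always release, or others block forever

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```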
Shared memory lets processes communicate through memory that appears in the virtual address space of each participating process. Each process that wishes to share the memory attaches the segment to its virtual memory via a system call and similarly detaches it when it no longer needs the memory.
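The attach/detach cycle can be sketched with Python's multiprocessing.shared_memory (POSIX-style shared memory, Python 3.8+, used here as an analogue of the System V mechanism): one handle creates the named segment, a second attaches to it by name, and both must be detached before the segment is removed.

```python
from multiprocessing import shared_memory

def shared_memory_roundtrip() -> bytes:
    """Create a named shared segment, attach a second handle to it,
    and read through one what was written through the other."""
    seg = shared_memory.SharedMemory(create=True, size=16)
    try:
        view = shared_memory.SharedMemory(name=seg.name)  # "attach"
        seg.buf[:5] = b"hello"
        data = bytes(view.buf[:5])
        view.close()                                      # "detach"
        return data
    finally:
        seg.close()                                       # "detach"
        seg.unlink()                                      # remove the segment
```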
Linux has a symmetrical multiprocessing model. A multiprocessing system consists of a number of processors communicating via a bus or a network. Multiprocessing systems are either loosely coupled or tightly coupled. In a loosely coupled system the processors operate standalone: each has its own bus, memory, and I/O subsystem and communicates with the other processors over the network medium. In a tightly coupled system the processors share memory, the bus, devices, and sometimes cache. Tightly coupled systems can be symmetric or asymmetric. An asymmetric system has a single master processor that controls the others; symmetric systems are subdivided further into dedicated-cache and shared-cache classes.
Ideally, an SMP system with n processors would perform n times better than a uniprocessor system, but in reality no SMP is 100% scalable.
SMP systems use locks because multiple processors execute multiple threads at the same time, and the time spent holding a lock must be kept as short as possible. Another common technique is finer-grained locking, so that instead of locking a whole table, only a few rows are locked at a time. Linux 2.6 removed most of the global locks, and its locking primitives are optimized for low overhead.
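The finer-grained locking idea can be sketched with a striped counter: one lock per bucket ("row") rather than one lock over the whole structure, so updates to different buckets do not contend. The class name and layout here are illustrative, not a kernel data structure:

```python
import threading

class StripedCounter:
    """Per-bucket locking: each bucket has its own lock, so threads
    touching different buckets never wait on each other."""
    def __init__(self, buckets: int = 16):
        self.counts = [0] * buckets
        self.locks = [threading.Lock() for _ in range(buckets)]

    def add(self, key: int, n: int = 1):
        i = hash(key) % len(self.counts)
        with self.locks[i]:          # lock only the affected bucket
            self.counts[i] += n

    def total(self) -> int:
        return sum(self.counts)
```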
Multiprocessors exhibit the cache-coherency problem: each processor has its own cache, so multiple copies of the same data exist in the system and can get out of sync.
Processor affinity improves system performance because the data and resources accessed by the code stay warm in that processor’s cache, and affinity lets the code reuse them rather than fetch them repeatedly. The benefit of processor affinity is accentuated on Non-Uniform Memory Access (NUMA) architectures, where some resources are closer to a given processor than others.
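Affinity can be set from user space with the Linux-specific sched_setaffinity call, exposed in Python through the os module. This sketch pins the calling process to a single CPU and then restores the original mask:

```python
import os

def pin_to_one_cpu() -> set:
    """Pin the calling process to one CPU, return the pinned mask,
    then restore the original affinity (Linux-only API)."""
    original = os.sched_getaffinity(0)       # 0 means "this process"
    cpu = min(original)
    os.sched_setaffinity(0, {cpu})           # run only on this CPU
    pinned = os.sched_getaffinity(0)
    os.sched_setaffinity(0, original)        # restore
    return pinned
```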
Linux supports several file systems. The Virtual File System
Interface allows Linux to support many file systems via a common interface. It
is designed to allow access to files as fast and efficiently as possible.
Ext2fs, one of the earliest Linux file systems, became widely popular, supporting the typical file operations: creating, updating, and deleting files, directories, hard links, soft links, device special files, sockets, and pipes. It suffered from one limitation: if the system crashed, the entire file system had to be validated and corrected for inconsistencies before it could be remounted. This was improved with journaling, where every file-system operation is logged before it is executed, and the log is replayed after a crash to bring the file system back to consistency.
Linux Volume Managers and Redundant Array of Inexpensive Disks
(RAID) provide a logical abstraction of a computer’s physical storage devices
and can combine several disks into a single logical unit to provide increased
total storage space as well as data redundancy. Even on a single disk, they can
divide the space into multiple logical units, each for a different purpose.
Linux provides four different RAID levels. RAID-Linear is a simple concatenation of the disks that comprise the volume. RAID-0 is simple striping: data being written is interleaved in equal-sized “chunks” across all disks in the volume. RAID-1 is mirroring: all data is replicated on all disks in the volume, so a RAID-1 volume created from n disks can survive the failure of n-1 of them. RAID-5 is striping with parity: it is similar to RAID-0, but one chunk in each stripe contains parity information instead of data, so RAID-5 can survive the failure of any single disk in the volume.
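The parity trick behind RAID-5 is plain XOR: the parity chunk is the XOR of the data chunks in a stripe, so any single missing chunk can be rebuilt from the survivors. A small illustration (the function names are ours):

```python
def parity(chunks):
    """XOR all chunks together; this is the RAID-5 parity chunk."""
    out = bytes(len(chunks[0]))
    for c in chunks:
        out = bytes(a ^ b for a, b in zip(out, c))
    return out

def rebuild(surviving, parity_chunk):
    """Recover the one lost data chunk: XOR of survivors plus parity."""
    return parity(surviving + [parity_chunk])
```

For a stripe of chunks c0, c1, c2 with parity p = c0^c1^c2, losing c1 leaves c0^c2^p = c1, which is exactly what rebuild computes.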
A Volume-Group is a collection of disks, also called Physical-Volumes. The storage space provided by these disks is used to create Logical-Volumes, and the arrangement is resizable: new extents are easy to add, Logical Volumes can be expanded or shrunk, and the data on the LVs can be moved around within the same Volume-Group.
Beyond the hard disk, keyboard, and console that a Linux system supports by default, a user-level application can create device special files to access other hardware devices. They appear as device nodes, conventionally in the /dev directory, and are usually of two types: block devices and character devices. Block devices allow block-level access to the data residing on a device, and character devices allow character-level access. The ls -l command shows a ‘b’ for a block device and a ‘c’ for a character device in the permission string. The virtual file system devfs is an alternative to these special files: it reduces the administrative task of creating a device node for each device. A system administrator can mount the devfs file system many times at different mount points, and changes to a device node are reflected at all of them. The devfs namespace exists in the kernel even before it is mounted, which makes device nodes available independently of the root file system.
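The block-versus-character distinction that ls -l displays is recorded in the node's mode bits, which a program can inspect with stat. A small sketch:

```python
import os
import stat

def device_kind(path: str) -> str:
    """Classify a device node the way `ls -l` does:
    'c' for character devices, 'b' for block devices."""
    mode = os.stat(path).st_mode
    if stat.S_ISCHR(mode):
        return "c"
    if stat.S_ISBLK(mode):
        return "b"
    return "-"
```

For example, /dev/null is a character device, so device_kind("/dev/null") returns "c".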
Linux also supports FUSE, a user-space file-system framework. It consists of a kernel module (fuse.ko), a user-space library (libfuse.*), and a mount utility (fusermount). One of the most important features of FUSE is that it allows secure, non-privileged mounts. One example is sshfs, a secure network file system that uses the SFTP protocol.