Saturday, April 19, 2014

We continue our review of WinPCap today. We talked about the statistics mode; next we talk about the packet injection mode. Both BPF and NPF have write capabilities that allow the user to send raw packets to the network. However, libpcap did not support this, so BPF was never used for this purpose; Unix applications use raw sockets instead. Win32 has very limited support for raw sockets, and this is where WinPCap provides a standard and consistent set of functions for packet injection. The packet data is copied to the kernel buffer and then sent to the network through NDIS.
The well-known libnet packet assembly library has been used to add a layer for packet construction and injection on top of WinPCap.
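As a rough sketch of the injection path (the device name is a placeholder and error handling is trimmed), an application hands a raw frame to pcap_sendpacket(), which copies it to the kernel and passes it down to NDIS:

    #include <pcap.h>
    #include <string.h>

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        unsigned char frame[64];

        /* open the adapter; "eth0" is a placeholder for a real device name */
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
        if (p == NULL)
            return 1;

        memset(frame, 0, sizeof(frame));
        /* fill in destination MAC, source MAC, ethertype and payload here */

        /* the frame is copied to the kernel buffer and handed to NDIS */
        pcap_sendpacket(p, frame, sizeof(frame));
        pcap_close(p);
        return 0;
    }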
We now talk about packet.dll, one of the modules that comes with WinPCap. It provides a common, system-independent API that enables programs to capture packets on a variety of Windows flavors, architectures, editions, and versions. Packet.dll includes several additional functionalities such as low-level operations on the network adapters, dynamic loading of the drivers, and access to some hardware counters.
Note that packet.dll interfaces only with the NPF driver; it does not participate in statistics mode directly off NDIS.
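To give a flavor of this low-level interface, here is a hedged sketch using the documented Packet.dll entry points; the adapter name is a placeholder, the signatures are abbreviated, and return values are left unchecked:

    #include <Packet32.h>
    #include <ntddndis.h>

    void low_level_example(void)
    {
        /* placeholder adapter name of the form \Device\NPF_{GUID} */
        LPADAPTER a = PacketOpenAdapter("\\Device\\NPF_{...}");

        PacketSetHwFilter(a, NDIS_PACKET_TYPE_PROMISCUOUS);  /* adapter-level control */
        PacketSetBuff(a, 512 * 1024);                         /* driver buffer size    */
        /* PacketReceivePacket() / PacketSendPacket() would do the actual I/O */

        PacketCloseAdapter(a);
    }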
Besides NPF and the packet module, the third module, which is not OS dependent, is wpcap.dll. It provides high-level functions such as filter generation and user-level buffering, plus advanced features such as statistics and packet injection. This API is coarser grained than the low-level API exposed by the packet module, where there is almost a one-to-one mapping between the API calls and the kernel-mode calls. The higher-level functions of the wpcap module are more user-friendly, and a single call can translate into several packet-module calls.
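For comparison, a minimal sketch of the higher-level interface, where a single pcap_loop() call internally drives many lower-level operations (the device name and packet count are placeholders):

    #include <pcap.h>
    #include <stdio.h>

    static void on_packet(unsigned char *user, const struct pcap_pkthdr *h,
                          const unsigned char *bytes)
    {
        printf("captured %u bytes\n", h->caplen);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);  /* placeholder */
        if (p == NULL)
            return 1;
        pcap_loop(p, 10, on_packet, NULL);  /* buffering and kernel reads handled internally */
        pcap_close(p);
        return 0;
    }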
The Win32 network architecture relies on NDIS, the Network Driver Interface Specification, which handles the interaction between NIC drivers and protocol drivers. Internally, NDIS creates an artificial world of its own, abstracting the protocol drivers from the complexity of the hardware. This implies that the same protocol stack can work with a variety of technologies.
While BPF hooks directly into the NIC device drivers, NPF does not have the same luxury of a driver-level specification. Instead, NPF installs its network tap as a protocol driver on top of the NDIS structures. This pseudo device works the same way as virtual private networking protocol drivers. Due to its location, NPF has the restriction that it cannot tap lower-level packets that never reach the upper NDIS layer, for example point-to-point protocol traffic such as the link control and network control protocols, including their authentication and encryption exchanges. As a comparison, the packet socket in the Linux kernel suffers from a similar problem.
Note that NPF system calls are all blocking in nature. Besides, NDIS is efficient in that it doesn't copy the entire packet if no protocol driver requires it. Thus NPF provides a clean and isolated plug-in for efficient packet capture on commercial systems.
The other significant difference between WinPCap's NPF and the BSD BPF is that the former has a ring buffer for every user application from which data is copied from kernel mode to user mode. The copy to the user-mode buffer happens via a single read call, thereby reducing the transitions between user mode and kernel mode. This kind of buffer allows network bursts to be stored because it makes more memory available than BPF does.
The kernel buffer is also larger than in BPF, so the capture process is not penalized if the application is unable to read as fast as the driver captures for a limited time interval. The size of the user buffer is important because it determines the maximum amount of data that can be copied from kernel space in a single system call. A smaller buffer is generally suitable for real-time applications since it guarantees that the kernel will copy the data as soon as the application makes the buffer available. Thus NPF is more configurable in that it allows users to choose between efficiency and responsiveness.
Another configuration parameter is the timeout between reads. By default the timeout is 1 second and the minimum amount of data for a read to return is 16 KB. This is referred to as delayed write.
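A sketch of how these knobs are exposed; pcap_setbuff() and pcap_setmintocopy() are WinPcap-specific extensions, and the adapter name is a placeholder:

    #include <pcap.h>

    /* hypothetical helper: tune NPF for throughput rather than responsiveness */
    static pcap_t *open_tuned(const char *device)
    {
        char errbuf[PCAP_ERRBUF_SIZE];

        /* the 1000 ms read timeout matches the default 1 second mentioned above */
        pcap_t *p = pcap_open_live(device, 65535, 1, 1000, errbuf);
        if (p == NULL)
            return NULL;

        pcap_setbuff(p, 8 * 1024 * 1024);  /* larger kernel buffer absorbs bursts       */
        pcap_setmintocopy(p, 16 * 1024);   /* don't complete a read for less than 16 KB */
        return p;
    }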
One of the core issues of any network analysis and packet capture is that it is a very CPU-intensive task, and network packets can overwhelm the CPU. This situation is obviously even worse on faster networks. The typical approaches to improving speed include filtering engines and zero-copy architectures, which avoid copying packets between kernel space and user space by mapping the kernel buffer into the application's memory. The advantage of this shared buffer may be limited if the user still makes one system call for every packet, which results in a high number of context switches.
WinPCap introduces the idea that monitoring not only needs no copying but can also be pushed down to the kernel, avoiding both the data transfer and the processing in user mode.
Applications need not call the libpcap APIs to get the data; they can also use the statistics mode of the NPF. Statistics mode avoids packet copies and implements a zero-copy mechanism: the statistic is computed while the packet is still in the NIC driver's memory, and then the packet is discarded. Moreover, the number of context switches is kept to a minimum because the results are returned to the user with a single system call. The syntax for requesting these statistics is the same as in libpcap, but the data does not have to go through libpcap. These are some of the differences.
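A hedged sketch of statistics mode using the WinPcap-specific pcap_setmode() extension; in MODE_STAT the callback receives two 64-bit counters (packets and bytes seen in the last interval) instead of packet data, and the adapter name below is a placeholder:

    #include <pcap.h>
    #include <stdio.h>
    #include <string.h>

    static void stats_cb(unsigned char *user, const struct pcap_pkthdr *h,
                         const unsigned char *data)
    {
        /* in MODE_STAT the "packet" body holds two 64-bit counters */
        long long packets, bytes;
        memcpy(&packets, data, 8);
        memcpy(&bytes, data + 8, 8);
        printf("packets=%lld bytes=%lld\n", packets, bytes);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        struct bpf_program prog;
        pcap_t *p = pcap_open_live("\\Device\\NPF_{...}", 65535, 1, 1000, errbuf);
        if (p == NULL)
            return 1;

        pcap_compile(p, &prog, "tcp", 1, 0);  /* same filter syntax as libpcap */
        pcap_setfilter(p, &prog);

        pcap_setmode(p, MODE_STAT);           /* switch this NPF instance to stats mode */
        pcap_loop(p, 10, stats_cb, NULL);     /* one report per read timeout */
        pcap_close(p);
        return 0;
    }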

Friday, April 18, 2014

We will continue to review the WinPCap architecture today. We mentioned that WinPCap has three main components, similar to the BSD capturing components. These are:
 a libpcap-based programmability component of WinPCap,
 a Berkeley Packet Filter with its kernel-level buffer that keeps all the captured data, and
 a network tap that snoops all packets flowing through the network.

We will review these components in detail.
A packet satisfying the filter is copied to the kernel buffer. The kernel buffer is subdivided into two smaller buffers, store and hold, that are used to keep the captured packets; these are blocks of memory allocated at runtime. The store buffer is used to keep the data coming from the network adapter, and the hold buffer is used to copy the packets to the user buffer. If the store buffer is full and the hold buffer is empty, BPF swaps them. In this way, user-level applications do not interfere with the adapter's device driver.
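A minimal illustrative sketch of this double-buffering scheme (hypothetical names, not the actual BSD kernel code):

    /* "store" is filled at interrupt time by the tap, "hold" is drained
       by the user-level read() */
    struct bpf_double_buffer {
        char  *store, *hold;
        size_t store_len, hold_len;
    };

    static void on_store_full(struct bpf_double_buffer *b)
    {
        if (b->hold_len == 0) {           /* hold buffer already consumed by a read */
            char *tmp    = b->store;      /* swap the roles of the two buffers      */
            b->store     = b->hold;
            b->hold      = tmp;
            b->hold_len  = b->store_len;
            b->store_len = 0;
            /* wake up any process sleeping in read() on this descriptor */
        }
        /* otherwise new packets are dropped until the reader catches up */
    }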

Since BPF has proved to be a powerful and stable architecture, the basic structure of WinPCap has these three components as well. However, WinPCap has significant differences in the structure and in the behavior of the capture stack. It can be seen as the evolution of BPF.
The filtering process starts with the libpcap-compatible component, which accepts a user-defined filter and compiles it into a set of pseudo-instructions. The kernel module then executes these instructions against every incoming packet.
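In libpcap terms this is the familiar pcap_compile()/pcap_setfilter() pair; a minimal sketch (the device name is a placeholder and error handling is omitted):

    #include <pcap.h>

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        struct bpf_program prog;

        pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);  /* placeholder */
        if (p == NULL)
            return 1;

        /* compile the user-defined filter into BPF pseudo-instructions ...  */
        pcap_compile(p, &prog, "tcp port 80", 1, 0);
        /* ... and hand them to the kernel module, which runs them per packet */
        pcap_setfilter(p, &prog);

        pcap_freecode(&prog);
        pcap_close(p);
        return 0;
    }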

The WinPCap NPF, as opposed to BPF, uses a circular ring buffer as the kernel buffer. It behaves like a sliding window and is a bit more challenging to maintain than the BPF buffers, because data copies can have varying sizes and the count of bytes copied is updated while the transfer to user mode is happening, not after it. The kernel has higher priority and can pre-empt the user-mode copy. This implementation allows more memory to be used by the kernel, compared with only half of the memory being usable at a time with BPF's swapping buffers. The entire kernel buffer can be copied by means of a single read, reducing the number of system calls and therefore the number of context switches between user and kernel mode.
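An illustrative sketch of such a circular buffer with variable-length records (hypothetical code; the actual NPF bookkeeping is more involved):

    /* producer = the capture driver, consumer = the copy to user mode */
    struct npf_ring {
        char   *data;
        size_t  size;   /* total buffer size                            */
        size_t  head;   /* next write offset (wraps around)             */
        size_t  tail;   /* next offset still to be copied to user mode  */
        size_t  free;   /* bytes the producer may still write           */
    };

    /* the driver decrements 'free' as it stores a packet; the read path
       increments it record by record while the copy to user mode is still
       in progress, so the kernel can keep capturing during the transfer */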

We are reading from the paper "An Architecture for High Performance Network Analysis" by Risso and Degioanni.


Thursday, April 17, 2014

Next we look at the WinPCap architecture. WinPCap adds to Win32 functionality similar to what libpcap and tcpdump provide on flavors of Unix. There have been other capture modules on Windows, some with available APIs and each with its own kernel-mode driver, but they suffer from severe limitations. The NetMon API, for example, is not freely available, its extensibility is limited, and it does not support sending packets. In this architecture review we cover some of these functionalities as well.
WinPCap was the first open system for packet capture on Win32, and it fills an important gap between Unix and Windows. Furthermore, WinPCap puts performance first. WinPCap consists of a kernel-mode component to select packets and a user-mode library to deliver them to applications. The library also provides low-level network access and allows programmers to avoid kernel-level programming. WinPCap includes an optimized kernel-mode driver called the Netgroup Packet Filter (NPF) and a set of user-level libraries that are libpcap compatible. From the outset, libpcap compatibility was important to WinPCap so that Unix applications could be ported over.
We now look at the BSD capturing components. Getting and sending data over the low-level network interfaces was an important objective of BSD. There are three components. The first, the Berkeley Packet Filter, is the kernel-level component whose buffer is used to store packets coming up from the network. The network tap is designed to snoop all packets flowing through the network; it reads from the interface through the interface driver and is followed by the filter, which analyzes incoming packets. The libpcap library is the third component. A packet satisfying the filter of the network tap is copied to the kernel buffer in BSD. The user has access to each of these three layers: user code can use the network interface card driver with the other protocol stacks to send and receive data, it can directly access the BPF device, and, lastly, applications can call libpcap.
We now look at the optimization techniques for the Berkeley Packet Filter code generated by tcpdump. Optimizing the code generation was required because users could specify compound queries, and for such queries the generated code was redundant and highly inefficient. As a comparison, if we had taken a simple query such as tcp src port 100, the resulting code would have been a linear decision tree with steps evaluating down from IP, frag, TCP, sport, 100, to true, with progress downward only on true.
If we wanted the same check for both directions, we would have two such trees with an OR in between, evaluated one below the other. If the query had covered both directions for two ports, such as tcp port 100 and 200, we would have one such pair of trees for each port, with the combination again evaluated one below the other. In this case we would have significant code bloat; the result is highly redundant and inefficient. This is where optimization techniques, which have their roots in compiler design, are used. One such technique is the dominator-based elimination of common subexpressions: if the exit from one block is the same as the entry to another, we can replace that subexpression with a single shared one. While this traditional technique does optimize the code, it does not fully address the branching, because data could flow into the shared subexpression from both sides. If we look at edge relationships instead of node relationships, we can optimize even more: when we know a particular subexpression has already been evaluated by the expression above, we can bypass that subexpression and move directly to its outcome. This creates new opportunities, and when we repeat the cycle for all subexpressions we can eliminate redundancies at each stage. An interesting observation here is that this exercise in optimizing the code can also help us detect unreachable code and simplify the graph.
Now we take the example above, remove the redundant opcode nodes, and instead redirect the edges to go straight to the outcomes of the subexpressions above. In this case we had the linear decision tree of IP, frag, TCP in common; we remove three sets of those copies and add edges directly from the remaining nodes to the outcomes. We also add edges from src port 100 to dest port 100 and dest port 200, as well as an edge from dest port 100 to src port 200, completing the branches from the remaining nodes to the outcomes. There are only two outcomes, true and false, at the leaf level, and all the nodes are connected to them via edges. This covers the optimization in the Berkeley Packet Filter.
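One way to see this optimization at work is a sketch like the following: pcap_open_dead() gives a dummy handle, and bpf_dump() prints the generated BPF instructions, so the unoptimized and optimized programs for the same expression can be compared side by side:

    #include <pcap.h>
    #include <stdio.h>

    int main(void)
    {
        const char *expr = "tcp src port 100 or tcp dst port 100";
        struct bpf_program prog;
        pcap_t *p = pcap_open_dead(DLT_EN10MB, 65535);  /* Ethernet link type, no device */

        if (pcap_compile(p, &prog, expr, 0 /* optimizer off */, 0) == 0) {
            printf("-- unoptimized --\n");
            bpf_dump(&prog, 1);          /* the IP/frag/TCP checks are duplicated */
            pcap_freecode(&prog);
        }
        if (pcap_compile(p, &prog, expr, 1 /* optimizer on */, 0) == 0) {
            printf("-- optimized --\n");
            bpf_dump(&prog, 1);          /* the common checks appear only once */
            pcap_freecode(&prog);
        }
        pcap_close(p);
        return 0;
    }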


Wednesday, April 16, 2014

We now look at the compiler/optimizer in TcpDump. It initially introduced two layers of logic
- a lower layer that would handle predicates with multiple values
- an upper layer that would handle the combinations of the lower layer expressions.
The lower layer sees key-value pairs, i.e., atomic predicates. For example, it sees ip host x or y; it could also see the predicate tcp port 80 or 1024.
The upper layer sees the combination, e.g. ip host x or y and (tcp port 80 or 1024).
But this was not working well. Parentheses were introduced for grouping, but this was still hard on the user.
The solution instead was to have a single level of logic, i.e. predicates and values can both be part of the expression: an expression can be a predicate, a value, or a combination of both.
This made the grammar easy but it made code generation tricky.
The BPF parser maintained a stack of symbol, field, and code.
An expression was a predicate, or an expression combined with a predicate by an operator, or a unary expression.
A predicate was a field and a value.
Code generation takes the field and value and updates the stack as it descends through the expression; at each step it generates the corresponding code.
To evaluate an expression "ip src host x or y and tcp dst port z",
it would push ip one level down onto the stack, followed by src, followed by host.
When it comes to the value x, it would push the protocol selector as a field followed by a wrapper for the value; these two would then be popped and replaced by an entry holding a predicate symbol, the field, and the generated code.
Since we have an 'or', that would get pushed on top of this existing expression, followed by a wrapper for the value y.
These three levels would then be replaced by a predicate, field, and code corresponding to the same protocol selector with the different value.
Having parsed the expression for the ip address and its values, we now push 'and' onto the stack and parse 'tcp dst port z' similarly.
Finally we have pushed the following items on the stack: 1) expr as sym, ISH (ip src host) as fld, and C2 as code, followed by 2) AND as sym, followed by 3) field as sym and TDP (tcp dst port) as fld, and lastly 4) val(z) as sym.
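A hypothetical sketch of what such a stack entry might look like (illustrative only, not the actual gencode.c or grammar types):

    /* each stack entry carries a symbol, the field (protocol selector) it
       refers to, and the filter code generated so far for that subtree */
    enum sym { SYM_EXPR, SYM_FIELD, SYM_VAL, SYM_AND, SYM_OR };
    enum fld { FLD_NONE, FLD_IP_SRC_HOST, FLD_TCP_DST_PORT };

    struct stack_entry {
        enum sym  sym;    /* expr, field, wrapped value, or an operator   */
        enum fld  field;  /* e.g. ip-src-host (ISH) or tcp-dst-port (TDP) */
        void     *code;   /* generated code block, e.g. C2 in the example */
    };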

Tuesday, April 15, 2014

Today we look at how libpcap/tcpdump works. Libpcap is a system-independent, user-mode packet capture library. Libpcap is open source and can be used by other applications. Note that this library captures traffic delivered to an interface, as is the case with unicast TCP; all such traffic to and from the computer can be monitored. That is why, if the interface is plugged into a switch, it may not capture other hosts' traffic; even a machine connected to what looks like a hub may actually be on a switched network, and capture may not work. When a switch mirrors all the traffic on all ports to a single port, an analyzer on that port can capture packets and sniff all the traffic.
What distinguishes tcpdump is that it "filters" packets before they come up the stack.
Tcpdump compiles a high-level filter specification into low-level code that filters packets at the driver level. The kernel module used is called the Berkeley Packet Filter. The Berkeley Packet Filter sits right between the NIC (network interface card) driver and the TCP stack in the kernel. This packet filter copies matching packets to tcpdump; the filter also blocks traffic that would otherwise appear as noise to tcpdump. BPF can be considered a virtual machine. It has an architecture with an accumulator (A) and an index register (X), a packet-based memory model, and arithmetic and conditional logic. For a packet capture of a TCP flow, the filter works something like this:
Is the ethernet packet type IP ? (Load ether into A)
Is IP src address 10.0.0.20 ? (Load IP src address into A)
Is the IP dest address 10.0.0.20 ? (Load IP dest address into A)
Is the IP Protocol TCP ? (Load the protocol into A)
Is it first or only frag ? (Load the frag num into A)
Is TCP src port FTP ? (Load the port 20 into index register X)
(Load the port from packet to A)
Is TCP dest port FTP ? (Load the port 20 into index register X)
(Load the dest port into A)
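A hedged sketch of what such a program looks like when written by hand as classic BPF instructions (offsets assume Ethernet framing; a real tcpdump-generated filter also checks fragmentation and the ports as listed above):

    #include <pcap.h>

    /* accept IP packets from source 10.0.0.20 that carry TCP, reject the rest */
    struct bpf_insn flow_filter[] = {
        BPF_STMT(BPF_LD  + BPF_H + BPF_ABS, 12),               /* A <- ethertype        */
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 0x0800, 0, 5),     /* is it IP?             */
        BPF_STMT(BPF_LD  + BPF_W + BPF_ABS, 26),               /* A <- IP source addr   */
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 0x0a000014, 0, 3), /* is it 10.0.0.20?      */
        BPF_STMT(BPF_LD  + BPF_B + BPF_ABS, 23),               /* A <- IP protocol      */
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 6, 0, 1),          /* is it TCP?            */
        BPF_STMT(BPF_RET + BPF_K, 65535),                      /* accept (whole packet) */
        BPF_STMT(BPF_RET + BPF_K, 0),                          /* reject                */
    };

Such an array can be wrapped in a struct bpf_program and handed to pcap_setfilter().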
This virtual model is flexible but we don't want to write low-level filters. So a higher level filter language is available.
We specify rules in terms of fields such as src ip, src port, dest ip, and dest port and let the compiler and optimizer translate them into the code.
The BPF filter language starts from a basic predicate which is true if and only if the specified packet field equals the indicated value.
Courtesy: "libpcap: an architecture" by Steve McCanne