[技术问答] Linux kernel design patterns - part 2(zz) [复制链接]

DriverEntry
UID: 143979

元老版主

Rank: 11

声望: 3110
寄托币: 48275
注册时间: 2003-9-1
精华: 44
帖子: 1491

发表于 2009-6-29 00:04:18 |显示全部楼层

本帖最后由 DriverEntry 于 2009-6-29 00:07 编辑

Linux kernel design patterns - part 2June 12, 2009
This article was contributed by Neil Brown
original post: http://lwn.net/Articles/336255/

Last week wediscussed the value of enunciating kernel design patterns and looked at thedesign patterns surrounding reference counts. This week we will look at a verydifferent aspect of coding and see why the kernel has special needs, and howthose needs have been addressed by successful approaches. The topic under themicroscope today is complex data structures.
By "complex data structures" we mean objects that are composed ofa variable number of simpler objects, possibly a homogeneous collection,possibly a heterogeneous collection. In general it is a structure to whichobjects can be added, from which objects can be removed, and in which objectscan be found. The preferred way to work with such data structures when we studythem in an introductory programming course is to use Abstract Data Types.

Abstract Data Types
The idea behind Abstract Data Types is to encapsulate the entireimplementation of a data structure, and provide just a well defined interfacefor manipulating it. The benefit of this approach is that it provides a cleanseparation. The data type can be implemented with no knowledge of theapplication which might end up using it, and the application can be implementedwith no knowledge of the implementation details of the data type. Both sidessimply write code based on the interface which works like a contract toexplicitly state what is required and what can be expected.
On the other hand, one of the costs of this approach is tightly connectedwith the abstract nature of the interface. The point of abstracting aninterface is to remove unnecessary details. This is good for introductorycomputing students, but is bad for kernel programmers. In kernel programming.performance is very important, coming as a close third after correctness andmaintainability, and sometimes taking precedence over maintainability. Not allcode paths in the kernel are performance-critical, but many are, and thedevelopment process benefits from the same data structures being using in bothperformance critical and less critical paths. So it is essential that datatypes are not overly abstracted, but that all details of the implementation arevisible so that the programmer can make optimal choices when using them.
So the first principle of data structures in the kernel is not to hidedetail. To see how this applies, and to discover further principles from whichto extract patterns, we will explore a few of the more successful data typesused in the Linux kernel.

Linked Lists
Starting simply, the first data type we will explore are doubly linkedlists. These are implemented by a single include file, <linux/list.h>. There is no separate".c" file with any library of support routines. All of the code forhandling linked lists is simple enough to be implemented using inlinefunctions. Thus it is very easy to use this implementation in any other(GPLv2-licensed) project.
There are two aspects of the "list.h" lists which are worth notingas they point to possible patterns. The first is struct list_head, which serves not only as the head of alist, but also as the anchor in items that are on a list. Your author has seenother linked list implementations which required that the first and secondelement in any data structures meant to be stored in lists be the"next" and "previous" pointers, so that common list-walkingcode could be used on a variety of different data structures. Linux kernellists do not suffer from this restriction. The list_headstructure can be embedded anywhere in a data structure, and the list_heads from a number of instances ofthat structure can be linked together. The containing structure can be foundfrom a ->next or ->prev pointer using the list_entry() macro.
There are at least two benefits of this approach. One is that the programmerstill has full control of placement of fields in the structure in case theyneed to put important fields close together to improve cache utilization. Theother is that a structure can easily be on two or more lists quite independently,simply by having multiple struct list_headfields.
This practice of embedding one structure inside another and using container_of() (which is the general formof list_entry()) to get theparent from the child structure is quite common in the Linux kernel and issomewhat reminiscent of object oriented programming. The container is like asubtype of the embedded structure.
The other noteworthy aspect of list.h is the proliferation of"for_each" macros - the macros that make it easy to walk along a listlooking at each item in turn. There are 20 of them (and that isn't counting thefour more in rculist.h whichI'll choose to ignore in the hope of brevity).
There are a few different reasons for this. The simplest are that

We sometimes want to walk the list in the "reverse" direction (following the "prev" link). There are five macros that go backward, and 15 that go forward.
We sometimes want to start in the middle of a list and "continue" on from there, so we have four "continue" macros and three "from" macros which interpret that starting point slightly differently.
We sometimes want to work with the struct list_head embedded in the target structure, but often we really want to use the list_entry() macro to get the enclosing structure; we find it easiest if the "for_each" macro does that for us. This provides the "entry" versions of the "for_each" macro, of which there are 13 (more than half).

Getting to the more subtle reasons, we sometimes want to be able to deletethe "current" item without upsetting the walk through the list. Thisrequires that a copy of the "next" pointer be taken before providing"this" entry to be acted upon, thus yielding the eight"safe" macros. An "ADT" style implementation of linked listswould quite likely only provide "safe" versions so as to hide thesedetails. However kernel programmers don't want to waste the storage or effortfor that extra step in the common case were it isn't needed.
Then there is the fact that we actually have two subtly different types oflinked lists. Regular linked lists use structlist_head as the head of the list. This structure contains apointer to the start and to the end. In some use cases, finding the end of thelist is not needed, and being able to halve the size of the head of the list isvery valuable. One typical use case of that kind is a hash table where allthese heads need to go in an array. To meet this need, we have the hlist, which is very similar to theregular list, except that only one pointer is needed in struct hlist_head. This accounts for sixof the different "for_each" macros.
If we had every possibly combination of forward or reverse, continue or not,entry or not, safe or not, and regular or hlist, we would have 32 differentmacros. In fact, only 19 of these appear to have been needed and, thus, coded.We certainly could code the remaining eleven, but as having code that is neverused tends to be frowned upon, it hasn't happened.
The observant reader will have noticed a small discrepancy in some of theabove numbers. Of the 20 macros, there is one that doesn't fit the abovepatterns, and it drives home the point that was made earlier about kernelprogrammers valuing performance. This final "for_each" macro is __list_for_each(). All of the other macrosuse the "prefetch" function to suggest that the CPU starts fetchingthe ->next pointer at thestart of each iteration so that it will already be available in cache when thenext iteration starts (though the "safe" macros actually fetch itrather than prefetch it). While this will normally improve performance, thereare cases when it will slow things down. When the walk of the list will almostalways abort very early - usually only considering the first item - theprefetch will often be wasted effort. In these cases (currently all in thenetworking code) the __list_for_each()macro is available. It does not prefetch anything. Thus people having verystrict performance goals can have a better chance of getting the performancethey want.
So from this simple data structure we can see two valuable patterns that areworth following.

Embedded Anchor: A good way to include generic objects in a data structure is to embed an anchor in them and build the data structure around the anchors. The object can be found from the anchor using container_of().
Broad Interfaces: Don't fall for the trap of thinking that "one size fits all". While having 20 or more macros that all do much the same thing is uncommon, it can be a very appropriate way of dealing with the complexity of finding the optimal solution. Trying to squeeze all possibilities into one narrow interface can be inefficient and choosing not to provide for all possibilities is counter-productive. Having all the permutations available encourages developers to use the right tool for the job and not to compromise. In 2.6.30-rc4, there are nearly 3000 uses of list_for_each_entry(), about 1000 of list_for_each_entry_safe(), nearly 500 of list_for_each(), and less than 1000 of all the rest put together. The fact that some are used rarely in no way reduces their importance.

RB-trees
Our next data structure is the RB-Tree or red-black tree. This is asemi-balanced, binary search tree that generally provides order"log(n)" search, insert, and delete operations. It is implemented in <linux/rbtree.h> and lib/rbtree.c. It has strong similaritiesto the list.h lists in that itembeds an anchor (struct rb_node)in each data structure and builds the tree from those.
The interesting thing to note about rbtree is that there is no searchfunction. Searching an rbtree is really a very simple operation and can beimplemented in just a few lines as shown by the examples at the top of rbtree.h. A search function certainly couldbe written, but the developer chose not to. The main reason, which should comeas no surprise, is performance. To write a search function, you need to passthe "compare" function into that search function. To do that in C,you would need to pass a function pointer. As compare functions are often verysimple, the cost of following the function pointer and making the function callwould often swamp the cost of doing the comparison itself. It turns out thathaving the whole search operation compiled as one function makes for moreefficient code. The same performance could possibly be achieved using inlinefunctions or macros, but given that the search function itself is so short, ithardly seems worthwhile.
Note also that rbtree doesn't exactly provide an insert function either.Rather, the developer needs to code a search; if the search fails, the new nodemust be inserted at the point where it was found not to exist and the tree mustbe rebalanced. There are functions for this final insertion and rebalancing asthey are certainly complex enough to deserve separate functions.
By giving the developer the responsibility for search and for some ofinsert, the rbtree library actually is giving a lot of valuable freedom. Thepattern of "search for an entry but if it isn't there, insert one" isfairly common. However the details of what happens between the"search" and "add" phases is not always the same and so notsomething that can easily be encoded in a library. By providing the basic toolsand leaving the details up to the specific situation, users of rbtree findthemselves liberated, rather than finding themselves fighting with a librarythat almost-but-not-quite does what they want.
So this example of rbtrees re-enforces the "embedded anchors"pattern and suggests a pattern that providing tools is sometimes much moreuseful than providing complete solutions. In this case, the base datastructures and the tools required for insert, remove, and rebalance areprovided, but the complete solution can still be tailored to suit each case.
This pattern also describes the kernel's approach to hash tables. These area very common data structure, but there is nothing that looks even vaguely likea definitive implementation. Rather the basic building blocks of the hlist andthe array are available along with some generic functions for calculating ahash (<linux/hash.h>).Connecting these to fit a given purpose is up to the developer.
So we have another pattern:

Tool Box: Sometimes it is best not to provide a complete solution for a generic service, but rather to provide a suite of tools that can be used to build custom solutions.

Radix tree
Our last data structure is the Radixtree. The Linux kernel actually has two radix tree implementations. One isin <linux/idr.h> and lib/idr.c, the other in <linux/radix-tree.h> and lib/radix-tree.c. Both provide a mappingfrom a number (unsigned long) toan arbitrary pointer (void *),though radix-tree also allows up to two "tags" to be stored with eachentry. For the purposes of this article we will only be looking at one of theimplementations (the one your author is most familiar with) - the radix-treeimplementation.
Radix-tree follows the pattern we saw in list.hof having multiple interfaces rather than trying to pack lots of differentneeds into the one interface. list.hhas 20 "for_each" macros; radix-tree has six "lookup"functions, depending on whether we want just one item or a range (ganglookups), or whether we want to use the tags to restrict the search (taglookups) and whether we want to find the place where the pointer is stored,rather than the pointer that is stored there (this is needed for the subtlelocking strategies of the page cache).
However radix-tree does not follow the embedded anchor pattern of theearlier data structures, and that is why it is interesting. For lists andrbtree, the storage needed for managing the data structure is exactlyproportional to the number of items in the data structure on a one-to-onebasis, so keeping this storage in those item works perfectly. For a radix-tree,the storage needed is a number of little arrays, each of which refers to anumber of items. So embedding these arrays, one each, in the items cannot work.This means that, unlike list.h and rbtree, radix-tree will sometimes need toallocate some memory in order to be able to add items to the data structure.This has some interesting consequences.
In the previous data structures (lists and rbtrees), we made no mention oflocking. If locking is needed, then the user of the data structure is likely toknow the specific needs so all locking details are left up to the caller (wecall that "caller locking" as opposed to "callee locking".Caller locking is more common and generally preferred). This is fine for listsand rbtrees as nothing that they do internally is affected particularly bylocking.
This is not the case if memory allocation is needed, though. If a processneeds to allocate memory, it is possible that it will need to sleep while thememory management subsystem writes data out to storage to make memoryavailable. There are various locks (such as spinlocks) that may not be heldwhile a process sleeps. So there is the possibility for significant interactionbetween the need to allocate memory internally to the radix-tree code, and theneed to hold locks outside the radix-tree code.
The obvious solution to this problem (once the problem is understood) is topreallocate the maximum amount of memory needed before taking the lock. This isimplemented within radix-tree with the radix_tree_preload()function. It manages a per-CPU pool of available radix_tree nodes and makes sure the pool is full and willnot be used by any other radix-tree operation. Thus, bracketing radix_tree_insert() with calls to radix_tree_preload() and radix_tree_preload_end() ensures that the radix_tree_insert() call will not fail dueto lack of memory (though the radix_tree_preload()might) and that there will be no unpleasant interactions with locking.

Summary
So we can now summarize our list of design patterns that we have found thatwork well with data structures (and elsewhere) in the kernel. Those that havealready been detailed are briefly included in this list too for completeness.

Embedded Anchor. This is very useful for lists, and can be generalized as can be seen if you explore kobjects (an exercise left to the reader).
Broad Interfaces. This reminds us that trying to squeeze lots of use-cases in to one function call is not necessary - just provide lots of function calls (with helpful and (hopefully) consistent names).
Tool Box. Sometimes it is best not to provide a complete solution for a generic service, but rather to provide a suite of tools that can be used to build custom solutions.
Caller Locks. When there is any doubt, choose to have the caller take locks rather than the callee. This puts more control in that hands of the client of a function.
Preallocate Outside Locks. This is in some ways fairly obvious. But it is very widely used within the kernel, so stating it explicitly is a good idea.

Next week we will complete our investigation of kernel design patterns bytaking a higher level view and look at some patterns relating to the design ofwhole subsystems.

Exercises
For those who would like to explore these ideas further, here are somestarting points.

Make a list of all data structures that are embedded, by exploring all uses of the "container_of" macro. Of these, make a list of pairs that are both embedded in the same structure (modeling multiple inheritance). Comment on how this reflects on the general usefulness of multiple inheritance.
Write a implementation of skiplists that would be suitable for in-kernel use. Consider the applicability of each of the patterns discussed in this article. Extra credit for leveraging list.h lists.
Linux contains a mempool library which radix-tree chooses not to use, preferring it's own simple pool (in radix_tree_preload). Examine the consequences of changing radix-tree to use mempool, and of changing mempool to be usable by radix-tree. Provide patches to revert this design choice in radix-tree, or a pattern to explain this design choice.
Compare the radix-tree and idr implementations to see if one could be implemented using the other without sacrificing correctness, maintainability or performance. Provide either an explanation of why they should stay separate, or a patch to replace one with the other.