If you publish or distribute this book commercially, donations, royalties, and/or printed copies are greatly appreciated by the author and the \href{https://tldp.org/}{Linux Documentation Project} (LDP).
Contributing in this way shows your support for free software and the LDP. If you have questions or comments, please contact the address above.
-\subsection{Authorship}
+\section{Authorship}
\label{sec:authorship}
The Linux Kernel Module Programming Guide was initially authored by Ori Pomerantz for Linux v2.2.
Jim Huang then undertook the task of updating the guide for recent Linux versions (v5.0 and beyond),
along with revising the LaTeX document.
-\subsection{Acknowledgements}
+\section{Acknowledgements}
\label{sec:acknowledgements}
The following people have contributed corrections or good suggestions:
\input{contrib}
\end{flushleft}
-\subsection{What Is A Kernel Module?}
+\section{What Is A Kernel Module?}
\label{sec:kernelmod}
Developing Linux kernel modules requires a foundation in the C programming language and experience writing conventional programs that run as processes.
requiring direct integration of new functionalities into the kernel image.
This approach leads to larger kernels and necessitates kernel rebuilding and subsequent system rebooting when new functionalities are desired.
-\subsection{Kernel module package}
+\section{Kernel module package}
\label{sec:packages}
Linux distributions provide the commands \sh|modprobe|, \sh|insmod| and \sh|depmod| within a package.
sudo pacman -S gcc kmod
\end{codebash}
-\subsection{What Modules are in my Kernel?}
+\section{What Modules are in my Kernel?}
\label{sec:modutils}
To discover what modules are already loaded within your current kernel use the command \sh|lsmod|.
sudo lsmod | grep fat
\end{codebash}
-\subsection{Is there a need to download and compile the kernel?}
+\section{Is there a need to download and compile the kernel?}
\label{sec:buildkernel}
To effectively follow this guide, there is no obligatory requirement for performing such actions.
Nonetheless, a prudent approach involves executing the examples within a test distribution on a virtual machine,
thus mitigating any potential risk of disrupting the system.
-\subsection{Before We Begin}
+\section{Before We Begin}
\label{sec:preparation}
Before delving into code, certain matters require attention.
Variances exist among individuals' systems, and distinct personal approaches are evident.
more detailed steps for \href{https://wiki.debian.org/SecureBoot}{SecureBoot} can be explored and followed.
\end{enumerate}
-\section{Headers}
+\chapter{Headers}
\label{sec:headers}
Before you can build anything you'll need to install the header files for your kernel.
sudo dnf install kernel-devel kernel-headers
\end{codebash}
-\section{Examples}
+\chapter{Examples}
\label{sec:examples}
All the examples from this document are available within the \verb|examples| subdirectory.
If there are any compile errors then you might have a more recent kernel version or need to install the corresponding kernel header files.
-\section{Hello World}
+\chapter{Hello World}
\label{sec:helloworld}
-\subsection{The Simplest Module}
+\section{The Simplest Module}
\label{sec:org2d3e245}
Most people learning programming start out with some sort of "\emph{hello world}" example.
I don't know what happens to people who break with this tradition, but I think it is safer not to find out.
\end{quote}
\end{enumerate}
-\subsection{Hello and Goodbye}
+\section{Hello and Goodbye}
\label{hello_n_goodbye}
In early kernel versions you had to use the \cpp|init_module| and \cpp|cleanup_module| functions, as in the first hello world example, but these days you can name those anything you want by using the \cpp|module_init| and \cpp|module_exit| macros.
These macros are defined in \src{include/linux/module.h}.
For those who are not, the \verb|obj-$(CONFIG_FOO)| entries you see everywhere expand into \verb|obj-y| or \verb|obj-m|, depending on whether the \verb|CONFIG_FOO| variable has been set to \verb|y| or \verb|m|.
While we are at it, those were exactly the kind of variables that you have set in the \verb|.config| file in the top-level directory of Linux kernel source tree, the last time when you said \sh|make menuconfig| or something like that.
-\subsection{The \_\_init and \_\_exit Macros}
+\section{The \_\_init and \_\_exit Macros}
\label{init_n_exit}
The \cpp|__init| macro causes the init function to be discarded and its memory freed once the init function finishes for built-in drivers, but not loadable modules.
If you think about when the init function is invoked, this makes perfect sense.
\samplec{examples/hello-3.c}
-\subsection{Licensing and Module Documentation}
+\section{Licensing and Module Documentation}
\label{modlicense}
Honestly, who loads or even cares about proprietary modules?
If you do then you might have seen something like this:
\samplec{examples/hello-4.c}
-\subsection{Passing Command Line Arguments to a Module}
+\section{Passing Command Line Arguments to a Module}
\label{modparam}
Modules can take command line arguments, but not with the argc/argv you might be used to.
insmod: ERROR: could not insert module hello-5.ko: Invalid parameters
\end{verbatim}
-\subsection{Modules Spanning Multiple Files}
+\section{Modules Spanning Multiple Files}
\label{modfiles}
Sometimes it makes sense to divide a kernel module between several source files.
The first five lines are nothing special, but for the last example we will need two lines.
First we invent an object name for our combined module, second we tell \sh|make| what object files are part of that module.
-\subsection{Building modules for a precompiled kernel}
+\section{Building modules for a precompiled kernel}
\label{precompiled}
Obviously, we strongly suggest that you recompile your kernel so that you can enable a number of useful debugging features, such as forced module unloading (\cpp|MODULE_FORCE_UNLOAD|). When this option is enabled, you can force the kernel to unload a module even when it believes it is unsafe, via a \sh|sudo rmmod -f module| command.
This option can save you a lot of time and a number of reboots during the development of a module.
If you do not desire to actually compile the kernel, you can interrupt the build process (CTRL-C) just after the SPLIT line, because at that time, the files you need are ready.
Now you can turn back to the directory of your module and compile it: It will be built exactly according to your current kernel settings, and it will load into it without any errors.
-\section{Preliminaries}
-\subsection{How modules begin and end}
+\chapter{Preliminaries}
+\section{How modules begin and end}
\label{sec:module_init_exit}
A program usually begins with a \cpp|main()| function, executes a bunch of instructions and terminates upon completion of those instructions.
Kernel modules work a bit differently. A module always begins with either the \cpp|init_module| function or the function you specify with the \cpp|module_init| call.
Every module must have an entry function and an exit function.
Since there's more than one way to specify entry and exit functions, I will try my best to use the terms ``entry function'' and ``exit function'', but if I slip and simply refer to them as \cpp|init_module| and \cpp|cleanup_module|, I think you will know what I mean.
-\subsection{Functions available to modules}
+\section{Functions available to modules}
\label{sec:avail_func}
Programmers use functions they do not define all the time.
A prime example of this is \cpp|printf()|.
You can even write modules to replace the kernel's system calls, which we will do shortly.
Crackers often make use of this sort of thing for backdoors or trojans, but you can write your own modules to do more benign things, like have the kernel write ``Tee hee, that tickles!'' every time someone tries to delete a file on your system.
-\subsection{User Space vs Kernel Space}
+\section{User Space vs Kernel Space}
\label{sec:user_kernl_space}
A kernel is all about access to resources, whether the resource in question happens to be a video card, a hard drive or even memory.
Programs often compete for the same resource. As I just saved this document, updatedb started updating the locate database.
The library function calls one or more system calls, and these system calls execute on the library function's behalf, but do so in supervisor mode since they are part of the kernel itself.
Once the system call completes its task, it returns and execution gets transferred back to user mode.
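The relationship between a library function and the system call behind it can be observed from user space. The following is a minimal sketch, assuming a Linux system with glibc; the helper names \verb|pid_via_library| and \verb|pid_via_raw_syscall| are made up for this illustration. Both paths end up asking the kernel for the same value.

```c
#define _GNU_SOURCE
#include <sys/syscall.h> /* SYS_getpid */
#include <sys/types.h>   /* pid_t */
#include <unistd.h>      /* getpid(), syscall() */

/* Ask for the process ID through the C library wrapper. */
static pid_t pid_via_library(void)
{
    return getpid();
}

/* Ask for the same value by issuing the system call directly,
 * bypassing the dedicated wrapper. */
static pid_t pid_via_raw_syscall(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

In both cases the work is done in kernel mode, and the result is handed back to the calling process in user mode.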
-\subsection{Name Space}
+\section{Name Space}
\label{sec:namespace}
When you write a small C program, you use variables which are convenient and make sense to the reader.
If, on the other hand, you are writing routines which will be part of a bigger problem, any global variables you have are part of a community of other people's global variables; some of the variable names can clash.
The file \verb|/proc/kallsyms| holds all the symbols that the kernel knows about and which are therefore accessible to your modules since they share the kernel's codespace.
-\subsection{Code space}
+\section{Code space}
\label{sec:codespace}
Memory management is a very complicated subject and the majority of O'Reilly's \href{https://www.oreilly.com/library/view/understanding-the-linux/0596005652/}{Understanding The Linux Kernel} exclusively covers memory management!
We are not setting out to be experts on memory management, but we do need to know a couple of facts to even begin worrying about writing real modules.
There are things called microkernels which have modules which get their own codespace.
The \href{https://www.gnu.org/software/hurd/}{GNU Hurd} and the \href{https://fuchsia.dev/fuchsia-src/concepts/kernel}{Zircon kernel} of Google Fuchsia are two examples of a microkernel.
-\subsection{Device Drivers}
+\section{Device Drivers}
\label{sec:device_drivers}
One class of module is the device driver, which provides functionality for hardware like a serial port.
On Unix, each piece of hardware is represented by a file located in \verb|/dev| named a device file which provides the means to communicate with the hardware.
Sometimes two device files with the same major but different minor number can actually represent the same piece of physical hardware.
So just be aware that the word ``hardware'' in our discussion can mean something very abstract.
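To make the major/minor pairing concrete, here is a small user-space sketch of how two numbers can be packed into one device number. The \verb|sketch_| names are hypothetical; the 20-bit minor field is chosen to mirror the layout the kernel uses internally, but this is an illustration only and not the kernel's own macros.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the kernel's macros. */
#define SKETCH_MINOR_BITS 20u
#define SKETCH_MINOR_MASK ((1u << SKETCH_MINOR_BITS) - 1u)

/* Pack a major and a minor number into one 32-bit device number. */
static uint32_t sketch_mkdev(uint32_t major, uint32_t minor)
{
    return (major << SKETCH_MINOR_BITS) | (minor & SKETCH_MINOR_MASK);
}

/* Recover the major number (the upper bits). */
static uint32_t sketch_major(uint32_t dev)
{
    return dev >> SKETCH_MINOR_BITS;
}

/* Recover the minor number (the lower 20 bits). */
static uint32_t sketch_minor(uint32_t dev)
{
    return dev & SKETCH_MINOR_MASK;
}
```

Two device numbers built with the same major but different minors compare as different values, yet share the same upper bits, which is how one driver can serve several device files.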
-\section{Character Device drivers}
+\chapter{Character Device drivers}
\label{sec:chardev}
-\subsection{The file\_operations Structure}
+\section{The file\_operations Structure}
\label{sec:file_operations}
The \cpp|file_operations| structure is defined in \src{include/linux/fs.h}, and holds pointers to functions defined by the driver that perform various operations on the device.
Each field of the structure corresponds to the address of some function defined by the driver to handle a requested operation.
Additionally, since Linux v5.6, the \cpp|proc_ops| structure was introduced to replace the use of the \cpp|file_operations| structure when registering proc handlers.
See more information in the \ref{sec:proc_ops} section.
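The idea of a structure whose fields hold function addresses can be tried out in plain user-space C. The sketch below uses a hypothetical \verb|struct sketch_ops| with only two members; it is not the kernel's \cpp|file_operations|, and the helper names are invented for the example. A member left unset stays \cpp|NULL|, which the caller treats as ``operation not supported''.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of an operations table; not the kernel's struct. */
struct sketch_ops {
    long (*read)(char *buf, size_t len);
    long (*write)(const char *buf, size_t len);
};

/* Copy a fixed message into buf, truncated to len bytes. */
static long sketch_read(char *buf, size_t len)
{
    static const char msg[] = "hello";
    size_t n = sizeof(msg) - 1;

    if (n > len)
        n = len;
    memcpy(buf, msg, n);
    return (long)n;
}

/* Only .read is set; .write is implicitly NULL. */
static const struct sketch_ops ops = {
    .read = sketch_read,
};

/* Call the write handler if the table provides one, else report failure. */
static long sketch_dispatch_write(const struct sketch_ops *o,
                                  const char *buf, size_t len)
{
    if (!o->write)
        return -1;
    return o->write(buf, len);
}
```

The designated-initializer form keeps the table readable and leaves every unmentioned member zeroed.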
-\subsection{The file structure}
+\section{The file structure}
\label{sec:file_struct}
Each device is represented in the kernel by a file structure, which is defined in \src{include/linux/fs.h}.
Most of the entries you see, like \cpp|struct dentry|, are not used by device drivers, and you can ignore them.
This is because drivers do not fill file directly; they only use structures contained in file which are created elsewhere.
-\subsection{Registering A Device}
+\section{Registering A Device}
\label{sec:register_device}
As discussed earlier, char devices are accessed through device files, usually located in \verb|/dev|.
This is by convention. When writing a driver, it is OK to put the device file in your current directory.
To find an example using the interface, you can see \verb|ioctl.c| described in section \ref{sec:device_files}.
-\subsection{Unregistering A Device}
+\section{Unregistering A Device}
\label{sec:unregister_device}
We cannot allow the kernel module to be \sh|rmmod|'ed whenever root feels like it.
If the device file is opened by a process and then we remove the kernel module, using the file would cause a call to the memory location where the appropriate function (read/write) used to be.
It is important to keep the counter accurate; if you ever do lose track of the correct usage count, you will never be able to unload the module; it's now reboot time, boys and girls.
This is bound to happen to you sooner or later during a module's development.
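The bookkeeping behind a usage count can be sketched in user space. Everything below is hypothetical (the \verb|sketch_module| type and its helpers are invented for illustration) and the kernel's real accounting is more involved, but the rule is the same: unloading is refused while the counter is non-zero.

```c
#include <errno.h>

/* Hypothetical stand-in for a module with a usage counter. */
struct sketch_module {
    int users;
};

/* A process opened the device: one more user. */
static void sketch_get(struct sketch_module *m)
{
    m->users++;
}

/* A process closed the device: one user fewer, never below zero. */
static void sketch_put(struct sketch_module *m)
{
    if (m->users > 0)
        m->users--;
}

/* Returns 0 if unloading may proceed, -EBUSY while the module is in use. */
static int sketch_try_unload(const struct sketch_module *m)
{
    return m->users == 0 ? 0 : -EBUSY;
}
```

If a \verb|sketch_get| is ever left without its matching \verb|sketch_put|, the counter never returns to zero and the unload check fails forever, which is the situation described above.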
-\subsection{chardev.c}
+\section{chardev.c}
\label{sec:chardev_c}
The next code sample creates a char driver named \verb|chardev|.
You can dump its device file.
\samplec{examples/chardev.c}
-\subsection{Writing Modules for Multiple Kernel Versions}
+\section{Writing Modules for Multiple Kernel Versions}
\label{sec:modules_for_versions}
The system calls, which are the major interface the kernel shows to the processes, generally stay the same across versions.
A new system call may be added, but usually the old ones will behave exactly like they used to.
The way to do this is to compare the macro \cpp|LINUX_VERSION_CODE| to the macro \cpp|KERNEL_VERSION|.
In version \verb|a.b.c| of the kernel, the value of this macro would be \(2^{16}a+2^{8}b+c\).
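The arithmetic is easy to check by hand. The helper below is a hypothetical user-space function, not the kernel macro itself, that evaluates the formula above so that two versions can be compared as plain integers.

```c
/* Hypothetical helper mirroring the formula 2^16 * a + 2^8 * b + c. */
static unsigned int sketch_version_code(unsigned int a, unsigned int b,
                                        unsigned int c)
{
    return (a << 16) + (b << 8) + c;
}
```

For version 5.6.0 this gives \(5 \cdot 65536 + 6 \cdot 256 + 0 = 329216\), which is larger than the value for 5.5.19 (\(328979\)), so an ordinary integer comparison orders the versions correctly.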
-\section{The /proc File System}
+\chapter{The /proc File System}
\label{sec:procfs}
In Linux, there is an additional mechanism for the kernel and kernel modules to send information to processes --- the \verb|/proc| file system.
Originally designed to allow easy access to information about processes (hence the name), it is now used by every bit of the kernel which has something interesting to report, such as \verb|/proc/modules| which provides the list of modules and \verb|/proc/meminfo| which gathers memory usage statistics.
\samplec{examples/procfs1.c}
-\subsection{The proc\_ops Structure}
+\section{The proc\_ops Structure}
\label{sec:proc_ops}
The \cpp|proc_ops| structure is defined in \src{include/linux/proc\_fs.h} in Linux v5.6+.
In older kernels, \cpp|file_operations| was used for custom hooks in the \verb|/proc| file system, but it contains some members that are unnecessary in VFS, and every time VFS expands the \cpp|file_operations| set, the \verb|/proc| code becomes bloated.
On the other hand, not only the space, but also some operations were saved by this structure to improve its performance.
For example, a file which never disappears in \verb|/proc| can set \cpp|proc_flags| to \cpp|PROC_ENTRY_PERMANENT| to save 2 atomic ops, 1 allocation, and 1 free per open/read/close sequence.
-\subsection{Read and Write a /proc File}
+\section{Read and Write a /proc File}
\label{sec:read_write_procfs}
We have seen a very simple example for a \verb|/proc| file where we only read the file \verb|/proc/helloworld|.
It is also possible to write in a \verb|/proc| file.
\samplec{examples/procfs2.c}
-\subsection{Manage /proc file with standard filesystem}
+\section{Manage /proc file with standard filesystem}
\label{sec:manage_procfs}
We have seen how to read and write a \verb|/proc| file with the \verb|/proc| interface.
But it is also possible to manage \verb|/proc| file with inodes.
Well, first of all, keep in mind that there are rumors claiming that procfs is on its way out; consider using \verb|sysfs| instead.
Consider using this mechanism, in case you want to document something kernel related yourself.
-\subsection{Manage /proc file with seq\_file}
+\section{Manage /proc file with seq\_file}
\label{sec:manage_procfs_with_seq_file}
As we have seen, writing a \verb|/proc| file may be quite ``complex''.
So to help people write \verb|/proc| files, there is an API named \cpp|seq_file| that helps format a \verb|/proc| file for output.
You can also read the code of \src{fs/seq\_file.c} in the Linux kernel.
-\section{sysfs: Interacting with your module}
+\chapter{sysfs: Interacting with your module}
\label{sec:sysfs}
\emph{sysfs} allows you to interact with the running kernel from userspace by reading or setting variables inside of modules.
This can be useful for debugging purposes, or just as an interface for applications or scripts.
After a bit of mission creep, it is now the glue that holds much of the device model and its sysfs interface together.
For more information about kobject and sysfs, see \src{Documentation/driver-api/driver-model/driver.rst} and \url{https://lwn.net/Articles/51437/}.
-\section{Talking To Device Files}
+\chapter{Talking To Device Files}
\label{sec:device_files}
Device files are supposed to represent physical devices.
Most physical devices are used for output as well as input, so there has to be some mechanism for device drivers in the kernel to get the output to send to the device from processes.
\samplec{examples/other/userspace_ioctl.c}
-\section{System Calls}
+\chapter{System Calls}
\label{sec:syscall}
So far, the only thing we've done was to use well defined kernel mechanisms to register \verb|/proc| files and device handlers.
This is fine if you want to do something the kernel programmers thought you'd want, such as write a device driver.
\samplec{examples/syscall.c}
-\section{Blocking Processes and threads}
+\chapter{Blocking Processes and threads}
\label{sec:blocking_process_thread}
-\subsection{Sleep}
+\section{Sleep}
\label{sec:sleep}
What do you do when somebody asks you for something you can not do right away?
If you are a human being and you are bothered by a human being, the only thing you can say is: "\emph{Not right now, I'm busy. Go away!}".
\samplec{examples/other/cat_nonblock.c}
-\subsection{Completions}
+\section{Completions}
\label{sec:completion}
Sometimes one thing should happen before another within a module having multiple threads.
Rather than using \sh|/bin/sleep| commands, the kernel has another way to do this which allows timeouts or interrupts to also happen.
There are other variations upon the \cpp|wait_for_completion| function, which include timeouts or being interrupted, but this basic mechanism is enough for many common situations without adding a lot of complexity.
-\section{Avoiding Collisions and Deadlocks}
+\chapter{Avoiding Collisions and Deadlocks}
\label{sec:synchronization}
If processes running on different CPUs or in different threads try to access the same memory, then it is possible that strange things can happen or your system can lock up.
To avoid this, various types of mutual exclusion kernel functions are available.
These indicate if a section of code is "locked" or "unlocked" so that simultaneous attempts to run it can not happen.
-\subsection{Mutex}
+\section{Mutex}
\label{sec:mutex}
You can use kernel mutexes (mutual exclusions) in much the same manner that you might deploy them in userland.
This may be all that is needed to avoid collisions in most cases.
\samplec{examples/example_mutex.c}
-\subsection{Spinlocks}
+\section{Spinlocks}
\label{sec:spinlock}
As the name suggests, spinlocks lock up the CPU that the code is running on, taking 100\% of its resources.
Because of this you should only use the spinlock mechanism around code which is likely to take no more than a few milliseconds to run and so will not noticeably slow anything down from the user's point of view.
\samplec{examples/example_spinlock.c}
-\subsection{Read and write locks}
+\section{Read and write locks}
\label{sec:rwlock}
Read and write locks are specialised kinds of spinlocks so that you can exclusively read from something or write to something.
Like the earlier spinlocks example, the one below shows an "irq safe" situation in which if other functions were triggered from irqs which might also read and write to whatever you are concerned with then they would not disrupt the logic.
\samplec{examples/example_rwlock.c}
Of course, if you know for sure that there are no functions triggered by irqs which could possibly interfere with your logic then you can use the simpler \cpp|read_lock(&myrwlock)| and \cpp|read_unlock(&myrwlock)| or the corresponding write functions.
-\subsection{Atomic operations}
+\section{Atomic operations}
\label{sec:atomics}
If you are doing simple arithmetic: adding, subtracting or bitwise operations, then there is another way in the multi-CPU and multi-hyperthreaded world to stop other parts of the system from messing with your mojo.
By using atomic operations you can be confident that your addition, subtraction or bit flip did actually happen and was not overwritten by some other shenanigans.
\end{itemize}
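The same idea exists in user space through C11's \verb|<stdatomic.h>|. The following sketch is a user-space analogy only: it does not use the kernel's atomic API, and the \verb|sketch_| names are hypothetical. Each helper performs its read-modify-write as one indivisible step and returns the resulting value.

```c
#include <stdatomic.h>

static atomic_int sketch_counter = 0; /* hypothetical shared counter */
static atomic_uint sketch_flags = 0;  /* hypothetical shared bit field */

/* Atomically add n and return the new counter value. */
static int sketch_add(int n)
{
    return atomic_fetch_add(&sketch_counter, n) + n;
}

/* Atomically subtract n and return the new counter value. */
static int sketch_sub(int n)
{
    return atomic_fetch_sub(&sketch_counter, n) - n;
}

/* Atomically set the bits in mask and return the new flags. */
static unsigned int sketch_set_bits(unsigned int mask)
{
    return atomic_fetch_or(&sketch_flags, mask) | mask;
}

/* Atomically clear the bits in mask and return the new flags. */
static unsigned int sketch_clear_bits(unsigned int mask)
{
    return atomic_fetch_and(&sketch_flags, ~mask) & ~mask;
}
```

Because each update is a single atomic operation, a concurrent writer cannot slip in between the read and the write of the same variable.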
% FIXME: we should rewrite this section
-\section{Replacing Print Macros}
+\chapter{Replacing Print Macros}
\label{sec:print_macros}
-\subsection{Replacement}
+\section{Replacement}
% FIXME: cross-reference
In Section \ref{sec:preparation}, it was noted that the X Window System and kernel module programming are not conducive to integration.
This remains valid during the development of kernel modules.
\samplec{examples/print_string.c}
-\subsection{Flashing keyboard LEDs}
+\section{Flashing keyboard LEDs}
\label{sec:flash_kb_led}
In certain conditions, you may desire a simpler and more direct way to communicate to the external world.
Flashing keyboard LEDs can be such a solution: It is an immediate way to attract attention or to display a status condition.
Adding debug code can change the situation enough to make the bug seem to disappear.
Thus, you should keep debug code to a minimum and make sure it does not show up in production code.
-\section{Scheduling Tasks}
+\chapter{Scheduling Tasks}
\label{sec:scheduling_tasks}
There are two main ways of running tasks: tasklets and work queues.
Tasklets are a quick and easy way of scheduling a single function to be run, for example when triggered from an interrupt, whereas work queues are more complicated but also better suited to running multiple things in a sequence.
-\subsection{Tasklets}
+\section{Tasklets}
\label{sec:tasklet}
Here is an example tasklet module.
The \cpp|tasklet_fn| function runs for a few seconds.
Now developers are proceeding with the API changes and the macro \cpp|DECLARE_TASKLET_OLD| exists for compatibility.
For further information, see \url{https://lwn.net/Articles/830964/}.
-\subsection{Work queues}
+\section{Work queues}
\label{sec:workqueue}
To add a task to the scheduler we can use a workqueue.
The kernel then uses the Completely Fair Scheduler (CFS) to execute work within the queue.
\samplec{examples/sched.c}
-\section{Interrupt Handlers}
+\chapter{Interrupt Handlers}
\label{sec:interrupt_handler}
-\subsection{Interrupt Handlers}
+\section{Interrupt Handlers}
\label{sec:irq}
Except for the last chapter, everything we did in the kernel so far we have done as a response to a process asking for it, either by dealing with a special file, sending an \cpp|ioctl()|, or issuing a system call.
But the job of the kernel is not just to respond to process requests.
The flags can include \cpp|IRQF_SHARED| (named \cpp|SA_SHIRQ| in old kernels) to indicate you are willing to share the IRQ with other interrupt handlers (usually because a number of hardware devices sit on the same IRQ); old kernels also accepted \cpp|SA_INTERRUPT| to indicate a fast interrupt.
This function will only succeed if there is not already a handler on this IRQ, or if you are both willing to share.
-\subsection{Detecting button presses}
+\section{Detecting button presses}
\label{sec:detect_button}
Many popular single board computers, such as Raspberry Pi or Beagleboards, have a bunch of GPIO pins.
Attaching buttons to those and then having a button press do something is a classic case in which you might need to use interrupts,
\samplec{examples/intrpt.c}
-\subsection{Bottom Half}
+\section{Bottom Half}
\label{sec:bottom_half}
Suppose you want to do a bunch of stuff inside of an interrupt routine.
A common way to do that without rendering the interrupt unavailable for a significant duration is to combine it with a tasklet.
\samplec{examples/bottomhalf.c}
-\section{Crypto}
+\chapter{Crypto}
\label{sec:crypto}
At the dawn of the internet, everybody trusted everybody completely\ldots{}but that did not work out so well.
When this guide was originally written, it was a more innocent era in which almost nobody actually gave a damn about crypto - least of all kernel developers.
That is certainly no longer the case now.
To handle crypto stuff, the kernel has its own API enabling common methods of encryption, decryption and your favourite hash functions.
-\subsection{Hash functions}
+\section{Hash functions}
\label{sec:hashfunc}
Calculating and checking the hashes of things is a common operation.
sudo rmmod cryptosha256
\end{codebash}
-\subsection{Symmetric key encryption}
+\section{Symmetric key encryption}
\label{sec:org2fab20b}
Here is an example of symmetrically encrypting a string using the AES algorithm and a password.
\samplec{examples/cryptosk.c}
-\section{Virtual Input Device Driver}
+\chapter{Virtual Input Device Driver}
\label{sec:vinput}
The input device driver is a module that provides a way to communicate with an input device through events.
For example, the keyboard can send the press or release event to tell the kernel what we want to do.
% TODO: Add vts.c and vmouse.c example
-\section{Standardizing the interfaces: The Device Model}
+\chapter{Standardizing the interfaces: The Device Model}
\label{sec:device_model}
Up to this point we have seen all kinds of modules doing all kinds of things, but there was no consistency in their interfaces with the rest of the kernel.
To impose some consistency, such that there is at minimum a standardized way to start, suspend and resume, a device model was added.
\samplec{examples/devicemodel.c}
-\section{Optimizations}
+\chapter{Optimizations}
\label{sec:optimization}
-\subsection{Likely and Unlikely conditions}
+\section{Likely and Unlikely conditions}
\label{sec:likely_unlikely}
Sometimes you might want your code to run as quickly as possible, especially if it is handling an interrupt or doing something which might cause noticeable latency.
If your code contains boolean conditions and if you know that the conditions are almost always likely to evaluate as either \cpp|true| or \cpp|false|,
That avoids flushing the processor pipeline.
The opposite happens if you use the \cpp|likely| macro.
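The hints can be tried in user space, since they rest on a compiler builtin available in GCC and Clang. The macros below are hypothetical user-space equivalents named \verb|sketch_likely| and \verb|sketch_unlikely|, and \verb|sketch_sum| is an invented example function. The hint changes only how the compiler lays out the branches, never the result.

```c
/* Hypothetical user-space equivalents built on the GCC/Clang builtin. */
#define sketch_likely(x)   __builtin_expect(!!(x), 1)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Sum non-negative values; a negative entry is the rare error path
 * and makes the function return -1. */
static long sketch_sum(const int *v, int n)
{
    long total = 0;

    for (int i = 0; i < n; i++) {
        if (sketch_unlikely(v[i] < 0))
            return -1;
        total += v[i];
    }
    return total;
}
```

Marking the error check as unlikely tells the compiler to keep the common path, the addition, as the straight-line code.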
-\subsection{Static keys}
+\section{Static keys}
\label{sec:static_keys}
Static keys allow us to enable or disable kernel code paths based on the runtime state of a key. The API has been available since 2010 (most architectures are already supported) and uses self-modifying code to eliminate the overhead of cache and branch prediction.
The most typical use case of static keys is for performance-sensitive kernel code, such as tracepoints, context switching, networking, etc. These hot paths of the kernel often contain branches and can be optimized easily using this technique.
In some cases, the key is enabled or disabled at initialization and never changed. We can then declare the static key as read-only, which means that it can only be toggled in the module init function. To declare a read-only static key, we can use the \cpp|DEFINE_STATIC_KEY_FALSE_RO| or \cpp|DEFINE_STATIC_KEY_TRUE_RO| macro instead. Attempts to change the key at runtime will result in a page fault.
For more information, see \href{https://www.kernel.org/doc/Documentation/static-keys.txt}{Static keys}.
-\section{Common Pitfalls}
+\chapter{Common Pitfalls}
\label{sec:opitfall}
\subsection{Using standard libraries}
\label{sec:disabling_interrupts}
You might need to do this for a short time and that is OK, but if you do not enable them afterwards, your system will be stuck and you will have to power it off.
-\section{Where To Go From Here?}
+\chapter{Where To Go From Here?}
\label{sec:where_to_go}
For people seriously interested in kernel programming, I recommend \href{https://kernelnewbies.org}{kernelnewbies.org} and the \src{Documentation} subdirectory within the kernel source code which is not always easy to understand but can be a starting point for further investigation.
Also, as Linus Torvalds said, the best way to learn the kernel is to read the source code yourself.