Adding a System Call to the Linux Kernel
These are mostly notes for myself.
This post describes adding a new system call for debugging/experimental purposes.
For a thorough explanation and walkthrough of adding a new system call, read the official kernel documentation: Adding a New System Call. It covers making the new syscall optional, adding fallback stub implementations, recommendations for backwards compatibility, testing, and more.
Setup
See this previous post: Building and Running the Linux Kernel.
Syscalls
Several files need to be updated to add a new syscall:
- new/path/to/implementation: the syscall function implementation. If the implementation lives in a new file/directory, a few Makefile changes are needed too.
- include/linux/syscalls.h: the function prototype.
- Generic and architecture-specific system call tables:
  - include/uapi/asm-generic/unistd.h: the generic syscall table in the user-space API.
  - arch/x86/entry/syscalls/syscall_64.tbl: the architecture-specific syscall table.
Function Implementation
First, create a new directory with its own Makefile and a hello_world.c:
# from the root of the linux source tree
mkdir hello_world
touch hello_world/{Makefile,hello_world.c}
Add this to hello_world/Makefile:
obj-y := hello_world.o
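The new directory also has to be listed in the kernel's top-level build files so that kbuild descends into it. In the 5.18 tree used here, that means adding hello_world/ to the core-y list in the top-level Makefile. See Details on Linux Makefiles.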
In hello_world/hello_world.c, define the new system call implementation with the SYSCALL_DEFINE<N> macro, where N is the number of arguments the syscall accepts:
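#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>

SYSCALL_DEFINE2(hello_world, const char __user *, message, unsigned long, len)
{
    char *kmessage;
    long res = 0;

    /* Bound the allocation before trusting a user-supplied length */
    if (len > 0x1000) {
        pr_err("User message length was unreasonably long\n");
        return -EINVAL;
    }
    /* kzalloc zero-fills, so the extra byte keeps kmessage NUL-terminated */
    kmessage = kzalloc(len + 1, GFP_KERNEL);
    if (!kmessage) {
        pr_err("Failed to allocate %lu bytes\n", len + 1);
        return -ENOMEM;
    }
    /* copy_from_user returns the number of bytes it could NOT copy */
    if (copy_from_user(kmessage, message, len)) {
        res = -EFAULT;
        goto out;
    }
    pr_info("IN HELLO WORLD! Message: %s\n", kmessage);
out:
    /* Free the buffer on both the success and the error path */
    kfree(kmessage);
    return res;
}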
A few things to note about this code:
- You ABSOLUTELY SHOULD NOT directly access user memory! Use the copy_from_user/copy_to_user/get_user/put_user functions.
- Allocating memory in Linux can be done with a variety of calls. It’s worth reading through the official Memory Allocation Guide in the Linux kernel documentation.
- Return values are the negative value of error codes, or 0 for success.
- pr_<LOG_LEVEL>(...) functions are macros that wrap the printk(KERN_<LOG_LEVEL> ...) functions. See Message logging with printk for details.
Function Prototype
The function declaration itself lives in include/linux/syscalls.h:
asmlinkage long sys_hello_world(const char __user *message, unsigned long len);
This is the main entry point for the new system call. Syscall entry points are always prefixed by sys_.
System Call Table
Both a generic syscall table (include/uapi/asm-generic/unistd.h) and architecture-specific syscall tables (such as arch/x86/entry/syscalls/syscall_64.tbl) exist. The syscall should be added to the generic table and to the table of every architecture that should support it. Use the next unused number in your tree: 451 was free in the 5.18-rc5 tree used here, but newer kernels have since assigned it.
Changes to the generic syscall table (don’t forget to increment __NR_syscalls!):
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1c48b0ae3ba3..395bd563d6f2 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
#define __NR_set_mempolicy_home_node 450
__SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
+#define __NR_hello_world 451
+__SYSCALL(__NR_hello_world, sys_hello_world)
+
#undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
Changes to the x86 syscall table:
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..e44d6d33e7ce 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
448 common process_mrelease sys_process_mrelease
449 common futex_waitv sys_futex_waitv
450 common set_mempolicy_home_node sys_set_mempolicy_home_node
+451 common hello_world sys_hello_world
(Re)Building the Kernel
Rebuild the kernel once you’ve made your changes. If you want to clean everything first (including configuration files), run make mrproper, reconfigure, and then rebuild.
Testing the Syscall
Building on the previous post about building the kernel, we’ll use qemu to debug the syscall.
First we’ll need something in userland to actually make the syscall. We’ll use the syscall(2) function to call our new syscall directly:
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *message = "Hello, world!";
    long res; /* syscall(2) returns long */

    printf("Going to call hello_world, syscall 451\n");
    /* 451 must match the number added to the syscall tables */
    res = syscall(451, message, strlen(message));
    printf("Tried to call it! Result: %ld\n", res);
    return 0;
}
Compile with
gcc hello_world_test.c -o hello_world_test
Run the Linux kernel with qemu-system-x86_64 and run our hello_world_test binary:
qemu-system-x86_64 \
-kernel arch/x86_64/boot/bzImage \
-nographic \
-append "console=ttyS0" \
-m 1024 \
-initrd initfs \
--enable-kvm \
-cpu host \
-s -S \
-fsdev local,path=$(pwd),security_model=none,id=test_dev \
-device virtio-9p,fsdev=test_dev,mount_tag=test_mount
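Attach gdb via a remote debugging session (in another shell). The -s flag above opens a gdbserver on TCP port 1234 and -S pauses the CPU at startup, so the VM won’t boot until gdb continues it: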
$> gdb vmlinux
(gdb) target remote :1234
(gdb) c
Back in the qemu shell, mount the 9p shared directory to the /shared directory within our running VM. This makes the root of the Linux source tree (or wherever $(pwd) was when you ran qemu-system-x86_64) accessible within the VM:
(initramfs) mkdir /shared
(initramfs) mount -t 9p -o trans=virtio test_mount /shared/ -oversion=9p2000.L,posixacl,msize=512000,cache=loose
In the qemu shell, you should now be able to run the built hello_world_test binary from the /shared directory:
(initramfs) /shared/hello_world_test
Going to call hello_world, syscall 451
[ 38.305249] hello_world: IN HELLO WORLD! Message: Hello, world!
Tried to call it! Result: 0
[ 38.308184] hello_world_tes (186) used greatest stack depth: 27392 bytes left
We did it!
Experimenting with KASAN
You can see KASAN in action by modifying our hello world syscall implementation (this requires a kernel built with CONFIG_KASAN enabled). Change the pr_info("IN HELLO WORLD! Message: %s\n", kmessage); call to directly use the message pointer from userspace:
...
pr_info("IN HELLO WORLD! Message: %s\n", message);
...
Rebuild, and rerun the hello_world_test binary. You should see something like this:
(initramfs) /shared/hello_world_test
Going to call hello_world, syscall 451
[ 35.050963] ==================================================================
[ 35.052699] BUG: KASAN: user-memory-access in string+0xf1/0x1f0
[ 35.054292] Read of size 1 at addr 000055b5e1bed008 by task hello_world_tes/189
[ 35.055996]
[ 35.056351] CPU: 0 PID: 189 Comm: hello_world_tes Not tainted 5.18.0-rc5-g30c8e80f7932-dirty #3
[ 35.058492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[ 35.060354] Call Trace:
[ 35.060893] <TASK>
[ 35.061336] dump_stack_lvl+0x34/0x44
[ 35.062101] kasan_report+0xab/0x120
[ 35.062839] ? string+0xf1/0x1f0
[ 35.063526] string+0xf1/0x1f0
[ 35.064167] ? ip6_addr_string_sa+0x400/0x400
[ 35.065067] ? __rcu_read_unlock+0x43/0x60
[ 35.065929] ? __is_insn_slot_addr+0x56/0x80
[ 35.066817] vsnprintf+0x4c8/0x930
[ 35.067774] ? pointer+0x690/0x690
[ 35.068683] ? kvm_sched_clock_read+0x14/0x40
[ 35.069592] ? sched_clock_cpu+0x15/0x130
[ 35.070419] vprintk_store+0x330/0x610
[ 35.071238] ? printk_sprint+0xb0/0xb0
[ 35.072011] ? kasan_save_stack+0x2e/0x40
[ 35.072834] ? kasan_save_stack+0x1e/0x40
[ 35.073654] ? __kasan_kmalloc+0x81/0xa0
[ 35.074597] ? __do_sys_hello_world+0x29/0x60
[ 35.075590] ? do_syscall_64+0x3b/0x90
[ 35.076370] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 35.077442] ? new_sync_write+0x22a/0x310
[ 35.078266] ? new_sync_read+0x300/0x300
[ 35.079421] ? fsnotify+0x930/0x930
[ 35.080362] vprintk_emit+0xb1/0x220
[ 35.081138] _printk+0xad/0xde
[ 35.081859] ? swsusp_close.cold+0xc/0xc
[ 35.082666] ? kasan_unpoison+0x23/0x50
[ 35.083476] ? __kasan_slab_alloc+0x2c/0x80
[ 35.084326] __do_sys_hello_world.cold+0x27/0x49
[ 35.085271] do_syscall_64+0x3b/0x90
[ 35.086002] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 35.087057] RIP: 0033:0x7f6c313d067d
[ 35.087885] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 48
[ 35.092050] RSP: 002b:00007ffc637aa178 EFLAGS: 00000212 ORIG_RAX: 00000000000001c3
[ 35.093562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6c313d067d
[ 35.094994] RDX: 0000000000000008 RSI: 000000000000000d RDI: 000055b5e1bed008
[ 35.096453] RBP: 00007ffc637aa1a0 R08: 000055b5e2b052a0 R09: 00007ffc637aa298
[ 35.098002] R10: 0000000000000027 R11: 0000000000000212 R12: 000055b5e1bec0a0
[ 35.099429] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 35.100898] </TASK>
[ 35.101502] ==================================================================
[ 35.103068] Disabling lock debugging due to kernel taint
[ 35.050940] hello_world: IN HELLO WORLD! Message: Hello, world!
Tried to call it! Result: 0
The official Kernel Address Sanitizer (KASAN) documentation describes the different KASAN modes, how to configure them, and more. For example, adding kasan.fault=panic to the boot parameters will cause a panic instead of only printing the bug report:
$> qemu-system-x86_64 ... -append "console=ttyS0 kasan.fault=panic" ...
...
(initramfs) /shared/hello_world_test
Going to call hello_world, syscall 451
[ 34.539545] ==================================================================
[ 34.541429] BUG: KASAN: user-memory-access in string+0xf1/0x1f0
[ 34.542816] Read of size 1 at addr 000055828f622008 by task hello_world_tes/186
[ 34.544588]
[ 34.544974] CPU: 0 PID: 186 Comm: hello_world_tes Not tainted 5.18.0-rc5-g30c8e80f7932-dirty #5
[ 34.547159] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[ 34.549074] Call Trace:
[ 34.549651] <TASK>
[ 34.550126] dump_stack_lvl+0x34/0x44
[ 34.550908] kasan_report+0xab/0x120
[ 34.551663] ? string+0xf1/0x1f0
[ 34.552349] string+0xf1/0x1f0
[ 34.553023] ? ip6_addr_string_sa+0x400/0x400
[ 34.553939] ? __rcu_read_unlock+0x43/0x60
[ 34.554802] ? __is_insn_slot_addr+0x56/0x80
[ 34.555697] vsnprintf+0x4c8/0x930
[ 34.556416] ? pointer+0x690/0x690
[ 34.557157] ? kvm_sched_clock_read+0x14/0x40
[ 34.558081] ? sched_clock_cpu+0x15/0x130
[ 34.558962] vprintk_store+0x330/0x610
[ 34.559831] ? printk_sprint+0xb0/0xb0
[ 34.560656] ? kasan_save_stack+0x2e/0x40
[ 34.561591] ? kasan_save_stack+0x1e/0x40
[ 34.562530] ? __kasan_kmalloc+0x81/0xa0
[ 34.563451] ? __do_sys_hello_world+0x29/0x60
[ 34.564467] ? do_syscall_64+0x3b/0x90
[ 34.565362] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 34.566518] ? new_sync_write+0x22a/0x310
[ 34.567436] ? new_sync_read+0x300/0x300
[ 34.568306] ? fsnotify+0x930/0x930
[ 34.569138] vprintk_emit+0xb1/0x220
[ 34.569993] _printk+0xad/0xde
[ 34.570691] ? swsusp_close.cold+0xc/0xc
[ 34.571603] ? kasan_unpoison+0x23/0x50
[ 34.572448] ? __kasan_slab_alloc+0x2c/0x80
[ 34.573425] __do_sys_hello_world.cold+0x27/0x49
[ 34.574486] do_syscall_64+0x3b/0x90
[ 34.575283] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 34.576415] RIP: 0033:0x7fecf528a67d
[ 34.577281] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 48
[ 34.581628] RSP: 002b:00007ffca671b8d8 EFLAGS: 00000212 ORIG_RAX: 00000000000001c3
[ 34.583384] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fecf528a67d
[ 34.584889] RDX: 0000000000000008 RSI: 000000000000000d RDI: 000055828f622008
[ 34.586363] RBP: 00007ffca671b900 R08: 0000558290f4f2a0 R09: 00007ffca671b9f8
[ 34.587948] R10: 0000000000000027 R11: 0000000000000212 R12: 000055828f6210a0
[ 34.589584] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 34.591140] </TASK>
[ 34.591596] ==================================================================
[ 34.593183] Kernel panic - not syncing: kasan.fault=panic set ...
[ 34.594580] CPU: 0 PID: 186 Comm: hello_world_tes Not tainted 5.18.0-rc5-g30c8e80f7932-dirty #5
[ 34.596414] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[ 34.598310] Call Trace:
[ 34.598888] <TASK>
[ 34.599376] dump_stack_lvl+0x34/0x44
[ 34.600219] panic+0x19a/0x345
[ 34.600875] ? panic_print_sys_info.part.0+0x5b/0x5b
[ 34.601987] ? preempt_count_sub+0xf/0xb0
[ 34.602853] ? string+0xf1/0x1f0
[ 34.603592] end_report.part.0+0x54/0x69
[ 34.604411] kasan_report+0xba/0x120
[ 34.605245] ? string+0xf1/0x1f0
[ 34.606006] string+0xf1/0x1f0
[ 34.606704] ? ip6_addr_string_sa+0x400/0x400
[ 34.607671] ? __rcu_read_unlock+0x43/0x60
[ 34.608594] ? __is_insn_slot_addr+0x56/0x80
[ 34.609591] vsnprintf+0x4c8/0x930
[ 34.610360] ? pointer+0x690/0x690
[ 34.611117] ? kvm_sched_clock_read+0x14/0x40
[ 34.612044] ? sched_clock_cpu+0x15/0x130
[ 34.612989] vprintk_store+0x330/0x610
[ 34.613833] ? printk_sprint+0xb0/0xb0
[ 34.614692] ? kasan_save_stack+0x2e/0x40
[ 34.615579] ? kasan_save_stack+0x1e/0x40
[ 34.616455] ? __kasan_kmalloc+0x81/0xa0
[ 34.617359] ? __do_sys_hello_world+0x29/0x60
[ 34.618349] ? do_syscall_64+0x3b/0x90
[ 34.619177] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 34.620325] ? new_sync_write+0x22a/0x310
[ 34.621189] ? new_sync_read+0x300/0x300
[ 34.621984] ? fsnotify+0x930/0x930
[ 34.622688] vprintk_emit+0xb1/0x220
[ 34.623406] _printk+0xad/0xde
[ 34.624024] ? swsusp_close.cold+0xc/0xc
[ 34.624813] ? kasan_unpoison+0x23/0x50
[ 34.625594] ? __kasan_slab_alloc+0x2c/0x80
[ 34.626440] __do_sys_hello_world.cold+0x27/0x49
[ 34.627365] do_syscall_64+0x3b/0x90
[ 34.628088] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 34.629123] RIP: 0033:0x7fecf528a67d
[ 34.629845] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 48
[ 34.633512] RSP: 002b:00007ffca671b8d8 EFLAGS: 00000212 ORIG_RAX: 00000000000001c3
[ 34.634994] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fecf528a67d
[ 34.636396] RDX: 0000000000000008 RSI: 000000000000000d RDI: 000055828f622008
[ 34.637806] RBP: 00007ffca671b900 R08: 0000558290f4f2a0 R09: 00007ffca671b9f8
[ 34.639194] R10: 0000000000000027 R11: 0000000000000212 R12: 000055828f6210a0
[ 34.640600] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 34.642026] </TASK>
[ 34.642605] Kernel Offset: 0x21a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbffff)
[ 34.644693] ---[ end Kernel panic - not syncing: kasan.fault=panic set ... ]---
You can debug the kernel panic with gdb by setting a hardware breakpoint on the panic function. KASLR has to be disabled for this so that the addresses in vmlinux match the running kernel; add the nokaslr option to the boot parameters:
$> qemu-system-x86_64 ... -append "console=ttyS0 kasan.fault=panic nokaslr" ...
and in gdb:
$> gdb vmlinux
...
(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in gdt_page ()
(gdb) hb panic
Hardware assisted breakpoint 1 at 0xffffffff82420a19: file kernel/panic.c, line 187.
(gdb) c
Continuing.
Breakpoint 1, panic (fmt=fmt@entry=0xffffffff83113ed8 "kasan.fault=panic set ...\n") at kernel/panic.c:187
187 {
(gdb)