Adding a System Call to the Linux Kernel

These are mostly notes for myself.

This post describes adding a new system call for debugging or experimental purposes.

For a thorough explanation and walkthrough of adding a new system call, read the official kernel documentation here: Adding a New System Call. The official documentation covers making the new syscall optional, adding fallback stub implementations, recommendations for backwards compatibility, testing, and more.

Setup

See this previous post: Building and Running the Linux Kernel.

Syscalls

Several files need to be updated to add a new syscall:

  • new/path/to/implementation - Syscall function implementation
    • If the implementation location is a new file/directory, a few Makefile changes are needed too
  • include/linux/syscalls.h - Function prototype
  • Generic and architecture-specific system call tables:
    • include/uapi/asm-generic/unistd.h - Generic syscall table in the user-space API
    • arch/x86/entry/syscalls/syscall_64.tbl - Architecture-specific syscall table

Function Implementation

First, create a new directory with its own Makefile and a hello_world.c:

# from the root of the linux source tree
mkdir hello_world
touch hello_world/{Makefile,hello_world.c}

Add this to hello_world/Makefile:

obj-y := hello_world.o
Details on Linux Makefiles
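I found it very useful to read the kernel documentation on its Makefile structure: Linux Kernel Makefiles. Note that the build only descends into hello_world/ if an existing Makefile lists it. In the 5.18 tree used here, that most likely means appending hello_world/ to the core-y list in the top-level Makefile.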

In hello_world/hello_world.c, define the new system call implementation with the SYSCALL_DEFINE<N> macro, where N is the number of arguments the syscall accepts:

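#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>

SYSCALL_DEFINE2(hello_world, const char __user *, message, unsigned long, len) {
  char *kmessage;

  /* Bound the allocation size, since len is controlled by user space. */
  if (len > 0x1000) {
    pr_err("User message length was unreasonably long\n");
    return -EINVAL;
  }

  /* kzalloc zero-fills, so the extra byte guarantees NUL termination. */
  kmessage = kzalloc(len + 1, GFP_KERNEL);
  if (!kmessage) {
    pr_err("Failed to allocate %lu bytes\n", len + 1);
    return -ENOMEM;
  }

  /* copy_from_user returns the number of bytes that could NOT be copied. */
  if (copy_from_user(kmessage, message, len)) {
    kfree(kmessage);
    return -EFAULT;
  }

  pr_info("IN HELLO WORLD! Message: %s\n", kmessage);

  /* Free the buffer on the success path too, or every call leaks it. */
  kfree(kmessage);
  return 0;
}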

A few things to note about this code:

  1. You ABSOLUTELY SHOULD NOT directly access user memory! Use the copy_from_user/copy_to_user/get_user/put_user functions.
  2. Allocating memory in Linux can be done with a variety of calls. It’s worth reading through the official Memory Allocation Guide in the Linux kernel documentation.
  3. On failure, return a negated error code (e.g. -EINVAL); on success, return 0 or another non-negative value.
  4. The pr_<LOG_LEVEL>(...) helpers are macros that wrap printk(KERN_<LOG_LEVEL> ...). See Message logging with printk for details.
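Point 3 is easier to appreciate from the user-space side. As a rough sketch (not glibc's actual code), a C library wrapper treats raw kernel return values in the range -4095 to -1 as negated error codes, stores the positive code in errno, and returns -1. The helper name raw_to_libc is made up for illustration:

```c
#include <errno.h>

/* Hypothetical helper: mimics how a libc syscall wrapper interprets the
 * raw value the kernel leaves in the return register. */
static long raw_to_libc(long raw)
{
    /* On Linux, raw values in [-4095, -1] are negated errno codes. */
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw; /* success: passed through unchanged */
}
```

So a syscall that returns -EFAULT in the kernel shows up in user space as a return value of -1 with errno set to EFAULT.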

Function Prototype

The function declaration itself lives in include/linux/syscalls.h:

asmlinkage long sys_hello_world(const char __user *message, unsigned long len);

This declares the entry point for the new system call. Entry-point names are always prefixed with sys_; the SYSCALL_DEFINE<N> macro adds the prefix to the name you give it.
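To see where the sys_ prefix comes from, here is a deliberately simplified user-space imitation of what SYSCALL_DEFINE2 does with its arguments. The real macro in include/linux/syscalls.h does considerably more (the __do_sys_hello_world frame that shows up in kernel stack traces is one of the helpers it generates). TOY_SYSCALL_DEFINE2 and toy_add are invented names for illustration only:

```c
/* Toy stand-in for SYSCALL_DEFINE2: pastes "sys_" onto the name and turns
 * the (type, name) pairs into an ordinary parameter list. */
#define TOY_SYSCALL_DEFINE2(name, t1, a1, t2, a2) \
    long sys_##name(t1 a1, t2 a2)

/* Expands to: long sys_toy_add(long a, long b) { ... } */
TOY_SYSCALL_DEFINE2(toy_add, long, a, long, b)
{
    return a + b;
}
```

This is why the macro takes the type and the name of each argument as separate, comma-delimited parameters rather than a normal C parameter list.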

System Call Table

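There is a generic syscall table (include/uapi/asm-generic/unistd.h) as well as one table per architecture (for x86-64, arch/x86/entry/syscalls/syscall_64.tbl). Add the syscall to the generic table and to the table of every architecture that should support it. Use the next free number in your tree: 451 was free in the 5.18-rc5 tree used here, but later kernels may already use it.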

Changes to the generic syscall table (don’t forget to increment __NR_syscalls!):

diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1c48b0ae3ba3..395bd563d6f2 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_hello_world 451
+__SYSCALL(__NR_hello_world, sys_hello_world)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452

Changes to the x86 syscall table:

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..e44d6d33e7ce 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@
 448    common  process_mrelease        sys_process_mrelease
 449    common  futex_waitv             sys_futex_waitv
 450    common  set_mempolicy_home_node sys_set_mempolicy_home_node
+451    common  hello_world     sys_hello_world
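Conceptually, these table entries end up in an array of function pointers indexed by syscall number. The following user-space sketch imitates that dispatch. It is not the kernel's actual entry code, and toy_table, toy_dispatch, and the two toy handlers are hypothetical names:

```c
#include <errno.h>
#include <stddef.h>

typedef long (*toy_syscall_fn)(long, long);

static long toy_getanswer(long a, long b) { (void)a; (void)b; return 42; }
static long toy_hello_world(long msg, long len) { (void)msg; return len; }

/* Slots without an entry stay NULL here. The kernel instead fills
 * unassigned slots with a stub that returns -ENOSYS. */
static const toy_syscall_fn toy_table[] = {
    [0] = toy_getanswer,
    [3] = toy_hello_world,
};

#define TOY_NR_SYSCALLS (sizeof(toy_table) / sizeof(toy_table[0]))

/* Look up the handler by number; unknown numbers yield -ENOSYS. */
static long toy_dispatch(unsigned long nr, long a, long b)
{
    if (nr >= TOY_NR_SYSCALLS || toy_table[nr] == NULL)
        return -ENOSYS;
    return toy_table[nr](a, b);
}
```

A number at or beyond the table size is rejected before any lookup happens, which is the role the table-size constant plays in this sketch.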

(Re)Building the Kernel

Rebuild the kernel once you’ve made your changes. If you want to clean everything first (including configuration files), run make mrproper, reconfigure, and then rebuild.

Testing the Syscall

Building on the previous post about building the kernel, we’ll use qemu to debug the syscall.

First we’ll need something in userland to actually make the syscall. We’ll use the syscall(2) function to call our new syscall directly:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
  const char* message = "Hello, world!";
  long res;

  printf("Going to call hello_world, syscall 451\n");
  /* syscall(2) returns a long: -1 with errno set on failure */
  res = syscall(451, message, strlen(message));
  printf("Tried to call it! Result: %ld\n", res);
  return 0;
}

Compile with

gcc hello_world_test.c -o hello_world_test

Boot the Linux kernel with qemu-system-x86_64 so that we can run our hello_world_test binary inside it:

qemu-system-x86_64 \
    -kernel arch/x86_64/boot/bzImage \
    -nographic \
    -append "console=ttyS0" \
    -m 1024 \
    -initrd initfs \
    --enable-kvm \
    -cpu host \
    -s -S \
    -fsdev local,path=$(pwd),security_model=none,id=test_dev \
    -device virtio-9p,fsdev=test_dev,mount_tag=test_mount

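Because of -s -S, QEMU opens a gdb server on TCP port 1234 and keeps the CPU halted until a debugger tells it to continue. Attach gdb via a remote debugging session (in another shell):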

$> gdb vmlinux
(gdb) target remote :1234
(gdb) c

Back in the qemu shell, mount the 9p shared directory at /shared inside the running VM. This makes the root of the Linux source tree (or wherever $(pwd) pointed when you ran qemu-system-x86_64) accessible within the VM:

(initramfs) mkdir /shared
(initramfs) mount -t 9p -o trans=virtio test_mount /shared/ -oversion=9p2000.L,posixacl,msize=512000,cache=loose

In the qemu shell, you should now be able to run the built hello_world_test binary from the /shared directory:

(initramfs) /shared/hello_world_test
Going to call hello_world, syscall 451
[   38.305249] hello_world: IN HELLO WORLD! Message: Hello, world!
Tried to call it! Result: 0
[   38.308184] hello_world_tes (186) used greatest stack depth: 27392 bytes left

We did it!

Experimenting with KASAN

If your kernel is built with KASAN enabled (CONFIG_KASAN), you can see it in action by modifying our hello world syscall implementation. Change the pr_info("IN HELLO WORLD! Message: %s\n", kmessage); call to directly use the message pointer from userspace:

...
pr_info("IN HELLO WORLD! Message: %s\n", message);
...

Rebuild, and rerun the hello_world_test binary. You should see something like this:

(initramfs) /shared/hello_world_test                                                                                   
Going to call hello_world, syscall 451                                                                                 
[   35.050963] ==================================================================                                      
[   35.052699] BUG: KASAN: user-memory-access in string+0xf1/0x1f0                                                     
[   35.054292] Read of size 1 at addr 000055b5e1bed008 by task hello_world_tes/189                                     
[   35.055996]                                                                                                         
[   35.056351] CPU: 0 PID: 189 Comm: hello_world_tes Not tainted 5.18.0-rc5-g30c8e80f7932-dirty #3                     
[   35.058492] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014                         
[   35.060354] Call Trace:                                                                                             
[   35.060893]  <TASK>                                                                                                 
[   35.061336]  dump_stack_lvl+0x34/0x44                                                                               
[   35.062101]  kasan_report+0xab/0x120                                                                                
[   35.062839]  ? string+0xf1/0x1f0                                                                                    
[   35.063526]  string+0xf1/0x1f0                                                                                      
[   35.064167]  ? ip6_addr_string_sa+0x400/0x400                                                                       
[   35.065067]  ? __rcu_read_unlock+0x43/0x60                                                                          
[   35.065929]  ? __is_insn_slot_addr+0x56/0x80                                                                        
[   35.066817]  vsnprintf+0x4c8/0x930                                                                                  
[   35.067774]  ? pointer+0x690/0x690                                                                                  
[   35.068683]  ? kvm_sched_clock_read+0x14/0x40                                                                       
[   35.069592]  ? sched_clock_cpu+0x15/0x130                                                                           
[   35.070419]  vprintk_store+0x330/0x610                                                                              
[   35.071238]  ? printk_sprint+0xb0/0xb0                                                                              
[   35.072011]  ? kasan_save_stack+0x2e/0x40                                                                           
[   35.072834]  ? kasan_save_stack+0x1e/0x40                                                                           
[   35.073654]  ? __kasan_kmalloc+0x81/0xa0                                                                            
[   35.074597]  ? __do_sys_hello_world+0x29/0x60                                                                       
[   35.075590]  ? do_syscall_64+0x3b/0x90                                                                              
[   35.076370]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae                                                             
[   35.077442]  ? new_sync_write+0x22a/0x310                                                                           
[   35.078266]  ? new_sync_read+0x300/0x300                                                                            
[   35.079421]  ? fsnotify+0x930/0x930                                                                                 
[   35.080362]  vprintk_emit+0xb1/0x220                                                                                
[   35.081138]  _printk+0xad/0xde                                                                                      
[   35.081859]  ? swsusp_close.cold+0xc/0xc                                                                            
[   35.082666]  ? kasan_unpoison+0x23/0x50                                                                             
[   35.083476]  ? __kasan_slab_alloc+0x2c/0x80                                                                         
[   35.084326]  __do_sys_hello_world.cold+0x27/0x49                                                                    
[   35.085271]  do_syscall_64+0x3b/0x90                                                                                
[   35.086002]  entry_SYSCALL_64_after_hwframe+0x44/0xae                                                               
[   35.087057] RIP: 0033:0x7f6c313d067d                                                                                
[   35.087885] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 48
[   35.092050] RSP: 002b:00007ffc637aa178 EFLAGS: 00000212 ORIG_RAX: 00000000000001c3                                  
[   35.093562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6c313d067d                                       
[   35.094994] RDX: 0000000000000008 RSI: 000000000000000d RDI: 000055b5e1bed008                                       
[   35.096453] RBP: 00007ffc637aa1a0 R08: 000055b5e2b052a0 R09: 00007ffc637aa298                                       
[   35.098002] R10: 0000000000000027 R11: 0000000000000212 R12: 000055b5e1bec0a0                                       
[   35.099429] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000                                       
[   35.100898]  </TASK>                                                                                                
[   35.101502] ==================================================================                                      
[   35.103068] Disabling lock debugging due to kernel taint                                                            
[   35.050940] hello_world: IN HELLO WORLD! Message: Hello, world!                                                     
Tried to call it! Result: 0
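The "user-memory-access" label in the report comes from KASAN classifying the faulting address by range. A simplified user-space sketch of that classification follows, assuming 4 KiB pages and the user address limit of x86-64 with 4-level paging. The TOY_ constants and classify_addr are illustrative names, not kernel symbols, and the real logic in the kernel may differ in detail:

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL                 /* assumed page size */
#define TOY_TASK_SIZE 0x00007ffffffff000ULL   /* assumed user address limit */

/* Classify an address that has no KASAN shadow memory behind it. */
static const char *classify_addr(uint64_t addr)
{
    if (addr < TOY_PAGE_SIZE)
        return "null-ptr-deref";
    if (addr < TOY_TASK_SIZE)
        return "user-memory-access";
    return "wild-memory-access";
}
```

The address from the report, 0x000055b5e1bed008, falls into the second branch. It is also the value of RDI in the register dump, i.e. the first syscall argument: the user-space message pointer.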

See the official Kernel Address Sanitizer (KASAN) documentation for the different KASAN modes, how to configure them, and more. For example, adding kasan.fault=panic to the boot parameters causes a kernel panic instead of only printing the bug report:

$> qemu-system-x86_64 ... -append "console=ttyS0 kasan.fault=panic" ...

...

(initramfs) /shared/hello_world_test
Going to call hello_world, syscall 451 
[   34.539545] ==================================================================
[   34.541429] BUG: KASAN: user-memory-access in string+0xf1/0x1f0
[   34.542816] Read of size 1 at addr 000055828f622008 by task hello_world_tes/186
[   34.544588]                                             
[   34.544974] CPU: 0 PID: 186 Comm: hello_world_tes Not tainted 5.18.0-rc5-g30c8e80f7932-dirty #5
[   34.547159] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[   34.549074] Call Trace:                                 
[   34.549651]  <TASK>                                     
[   34.550126]  dump_stack_lvl+0x34/0x44                                                                               
[   34.550908]  kasan_report+0xab/0x120                                                                                
[   34.551663]  ? string+0xf1/0x1f0                                                                                    
[   34.552349]  string+0xf1/0x1f0                                                                                      
[   34.553023]  ? ip6_addr_string_sa+0x400/0x400                                                                       
[   34.553939]  ? __rcu_read_unlock+0x43/0x60                                                                          
[   34.554802]  ? __is_insn_slot_addr+0x56/0x80                                                                        
[   34.555697]  vsnprintf+0x4c8/0x930                      
[   34.556416]  ? pointer+0x690/0x690                                                                                  
[   34.557157]  ? kvm_sched_clock_read+0x14/0x40                                                                       
[   34.558081]  ? sched_clock_cpu+0x15/0x130                                                                           
[   34.558962]  vprintk_store+0x330/0x610                                                                              
[   34.559831]  ? printk_sprint+0xb0/0xb0                  
[   34.560656]  ? kasan_save_stack+0x2e/0x40               
[   34.561591]  ? kasan_save_stack+0x1e/0x40  
[   34.562530]  ? __kasan_kmalloc+0x81/0xa0        
[   34.563451]  ? __do_sys_hello_world+0x29/0x60       
[   34.564467]  ? do_syscall_64+0x3b/0x90               
[   34.565362]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   34.566518]  ? new_sync_write+0x22a/0x310                                                                           
[   34.567436]  ? new_sync_read+0x300/0x300                                                                            
[   34.568306]  ? fsnotify+0x930/0x930                                                                                 
[   34.569138]  vprintk_emit+0xb1/0x220                                                                                
[   34.569993]  _printk+0xad/0xde                                                                                      
[   34.570691]  ? swsusp_close.cold+0xc/0xc                                                                            
[   34.571603]  ? kasan_unpoison+0x23/0x50                                                                             
[   34.572448]  ? __kasan_slab_alloc+0x2c/0x80             
[   34.573425]  __do_sys_hello_world.cold+0x27/0x49                                                                    
[   34.574486]  do_syscall_64+0x3b/0x90                                                                                
[   34.575283]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   34.576415] RIP: 0033:0x7fecf528a67d
[   34.577281] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 48
[   34.581628] RSP: 002b:00007ffca671b8d8 EFLAGS: 00000212 ORIG_RAX: 00000000000001c3
[   34.583384] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fecf528a67d
[   34.584889] RDX: 0000000000000008 RSI: 000000000000000d RDI: 000055828f622008
[   34.586363] RBP: 00007ffca671b900 R08: 0000558290f4f2a0 R09: 00007ffca671b9f8
[   34.587948] R10: 0000000000000027 R11: 0000000000000212 R12: 000055828f6210a0
[   34.589584] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   34.591140]  </TASK>                                    
[   34.591596] ==================================================================
[   34.593183] Kernel panic - not syncing: kasan.fault=panic set ...
[   34.594580] CPU: 0 PID: 186 Comm: hello_world_tes Not tainted 5.18.0-rc5-g30c8e80f7932-dirty #5
[   34.596414] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[   34.598310] Call Trace:
[   34.598888]  <TASK>
[   34.599376]  dump_stack_lvl+0x34/0x44
[   34.600219]  panic+0x19a/0x345
[   34.600875]  ? panic_print_sys_info.part.0+0x5b/0x5b
[   34.601987]  ? preempt_count_sub+0xf/0xb0
[   34.602853]  ? string+0xf1/0x1f0
[   34.603592]  end_report.part.0+0x54/0x69
[   34.604411]  kasan_report+0xba/0x120
[   34.605245]  ? string+0xf1/0x1f0
[   34.606006]  string+0xf1/0x1f0
[   34.606704]  ? ip6_addr_string_sa+0x400/0x400
[   34.607671]  ? __rcu_read_unlock+0x43/0x60
[   34.608594]  ? __is_insn_slot_addr+0x56/0x80
[   34.609591]  vsnprintf+0x4c8/0x930
[   34.610360]  ? pointer+0x690/0x690
[   34.611117]  ? kvm_sched_clock_read+0x14/0x40
[   34.612044]  ? sched_clock_cpu+0x15/0x130
[   34.612989]  vprintk_store+0x330/0x610
[   34.613833]  ? printk_sprint+0xb0/0xb0
[   34.614692]  ? kasan_save_stack+0x2e/0x40
[   34.615579]  ? kasan_save_stack+0x1e/0x40
[   34.616455]  ? __kasan_kmalloc+0x81/0xa0
[   34.617359]  ? __do_sys_hello_world+0x29/0x60
[   34.618349]  ? do_syscall_64+0x3b/0x90
[   34.619177]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[   34.620325]  ? new_sync_write+0x22a/0x310
[   34.621189]  ? new_sync_read+0x300/0x300
[   34.621984]  ? fsnotify+0x930/0x930
[   34.622688]  vprintk_emit+0xb1/0x220
[   34.623406]  _printk+0xad/0xde
[   34.624024]  ? swsusp_close.cold+0xc/0xc
[   34.624813]  ? kasan_unpoison+0x23/0x50
[   34.625594]  ? __kasan_slab_alloc+0x2c/0x80
[   34.626440]  __do_sys_hello_world.cold+0x27/0x49
[   34.627365]  do_syscall_64+0x3b/0x90
[   34.628088]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   34.629123] RIP: 0033:0x7fecf528a67d
[   34.629845] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 48
[   34.633512] RSP: 002b:00007ffca671b8d8 EFLAGS: 00000212 ORIG_RAX: 00000000000001c3
[   34.634994] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fecf528a67d
[   34.636396] RDX: 0000000000000008 RSI: 000000000000000d RDI: 000055828f622008
[   34.637806] RBP: 00007ffca671b900 R08: 0000558290f4f2a0 R09: 00007ffca671b9f8
[   34.639194] R10: 0000000000000027 R11: 0000000000000212 R12: 000055828f6210a0
[   34.640600] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   34.642026]  </TASK>
[   34.642605] Kernel Offset: 0x21a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbffff)
[   34.644693] ---[ end Kernel panic - not syncing: kasan.fault=panic set ... ]---

You can debug the kernel panic with gdb by setting a hardware breakpoint on the panic function. For the symbol addresses in vmlinux to match the running kernel, KASLR has to be disabled, so add nokaslr to the boot parameters:

$> qemu-system-x86_64 ... -append "console=ttyS0 kasan.fault=panic nokaslr" ...

and in gdb:

$> gdb vmlinux
...
(gdb) target remote :1234
Remote debugging using :1234
0x000000000000fff0 in gdt_page ()
(gdb) hb panic
Hardware assisted breakpoint 1 at 0xffffffff82420a19: file kernel/panic.c, line 187.
(gdb) c
Continuing.

Breakpoint 1, panic (fmt=fmt@entry=0xffffffff83113ed8 "kasan.fault=panic set ...\n") at kernel/panic.c:187
187     {
(gdb)