Stack Traces and Mystery Addresses (blk-rq-qos Crash)

Author: Omar Sandoval
Date: February 12, 2025

This is a hands-on tutorial walking through a real Linux kernel bug that caused kernel crashes in production. We’ll read kernel code and use a few important drgn techniques for reading stack traces and interpreting memory in order to identify the root cause of the bug.

We saw this crash on storage workloads on multiple kernel versions, up to and including the latest at the time, Linux 6.11. The kernel logs all implicated something in the block layer.

A core dump and debugging symbols are provided for you to follow along with.

This tutorial is also available as a video.

Setup

Follow the Installation instructions to get drgn.

Download and extract the tutorial files:

$ curl -L https://github.com/osandov/drgn/releases/download/tutorial-assets/blk_rq_qos_crash_tutorial.tar.zst \
  | zstd -d | tar -x

This will create a directory named blk_rq_qos_crash_tutorial. Enter it:

$ cd blk_rq_qos_crash_tutorial

Then, run drgn as follows. It will print a version banner and automatically import the relevant Helpers:

$ drgn -c vmcore -s vmlinux --main-symbols
drgn 0.0.30 (using Python 3.13.1, elfutils 0.192, with libkdumpfile)
For help, type help(drgn).
>>> import drgn
>>> from drgn import FaultError, NULL, Object, alignof, cast, container_of, execscript, implicit_convert, offsetof, reinterpret, sizeof, stack_trace
>>> from drgn.helpers.common import *
>>> from drgn.helpers.linux import *

In another window, check out the source code for Linux 6.11. For example, run git checkout v6.11 in an existing Linux repo, or run:

$ git clone -b v6.11 --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
...
$ cd linux

Now we can dive into the core dump.

Starting With Dmesg

The kernel log buffer is usually the first place to look when debugging a crash. In drgn, call print_dmesg() and scroll up until you find the line starting with BUG:. You should see the following trace:

>>> print_dmesg()
...
[   18.051123] BUG: kernel NULL pointer dereference, address: 00000000000006fc
[   18.051597] #PF: supervisor write access in kernel mode
[   18.051936] #PF: error_code(0x0002) - not-present page
[   18.052241] PGD 0 P4D 0
[   18.052336] Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
[   18.052629] CPU: 0 UID: 0 PID: 906 Comm: fio Kdump: loaded Not tainted 6.11.0 #1
[   18.053123] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[   18.053739] RIP: 0010:_raw_spin_lock_irqsave+0x36/0x70
[   18.054059] Code: 04 25 28 00 00 00 48 89 44 24 08 48 c7 04 24 00 00 00 00 9c 8f 04 24 48 8b 1c 24 fa 65 ff 05 89 2a b7 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 1e 65 48 8b 04 25 28 00 00 00 48 3b 44 24 08 75 17
[   18.055467] RSP: 0000:ffffc900011abcd0 EFLAGS: 00010046
[   18.055788] RAX: 0000000000000000 RBX: 0000000000000082 RCX: 0000000000000001
[   18.056260] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00000000000006fc
[   18.056725] RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000015000e
[   18.057202] R10: ffff888002fa5900 R11: ffffffff81312090 R12: 0000000000000003
[   18.057669] R13: ffff888002d4b678 R14: 00000000000006fc R15: 0000000000000003
[   18.058138] FS:  00007f1ee66c06c0(0000) GS:ffff888005a00000(0000) knlGS:0000000000000000
[   18.058677] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.059039] CR2: 00000000000006fc CR3: 0000000002f4a005 CR4: 0000000000770ef0
[   18.059508] PKRU: 55555554
[   18.059614] Call Trace:
[   18.059700]  <TASK>
[   18.059782]  ? __die_body+0x16/0x60
[   18.059982]  ? page_fault_oops+0x31e/0x3a0
[   18.060205]  ? exc_page_fault+0x55/0xa0
[   18.060409]  ? asm_exc_page_fault+0x26/0x30
[   18.060640]  ? __pfx_wbt_inflight_cb+0x10/0x10
[   18.060892]  ? _raw_spin_lock_irqsave+0x36/0x70
[   18.061150]  try_to_wake_up+0x3e/0x400
[   18.061342]  rq_qos_wake_function+0x4d/0x60
[   18.061572]  __wake_up_common+0x42/0x80
[   18.061770]  __wake_up_common_lock+0x33/0x60
[   18.062007]  wbt_done+0x60/0x80
[   18.062152]  __rq_qos_done+0x22/0x40
[   18.062330]  blk_mq_free_request+0x62/0xb0
[   18.062551]  virtblk_done+0x99/0x120
[   18.062731]  vring_interrupt+0x71/0x80
[   18.062928]  vp_interrupt+0xa8/0xe0
[   18.063100]  __handle_irq_event_percpu+0x89/0x1b0
[   18.063373]  handle_irq_event_percpu+0xf/0x40
[   18.063614]  handle_irq_event+0x30/0x50
[   18.063831]  handle_fasteoi_irq+0xaa/0x1b0
[   18.064051]  __common_interrupt+0x3a/0xb0
[   18.064266]  common_interrupt+0x3d/0x90
[   18.064462]  asm_common_interrupt+0x26/0x40
[   18.064691] RIP: 0033:0x7f1ef33679b9
[   18.064886] Code: ff 48 85 c0 0f 84 32 35 00 00 48 8b bd b8 f9 ff ff 4c 89 b5 80 f9 ff ff 48 89 07 4c 01 f8 48 89 85 78 f9 ff ff e9 8d ca ff ff <48> 8b 85 60 fa ff ff 48 8d 50 08 48 89 95 60 fa ff ff e9 c7 d5 ff
[   18.066333] RSP: 002b:00007f1ee66baad0 EFLAGS: 00000212
[   18.066624] RAX: 00007f1ee66bad56 RBX: 00007f1ee66bb1d0 RCX: 00007f1ee66bad56
[   18.066999] RDX: 0000000000000030 RSI: 00000000000f12b3 RDI: 000000000000000a
[   18.067476] RBP: 00007f1ee66bb1a0 R08: 000000000000002c R09: 0000000000000000
[   18.068003] R10: 00007f1ef348dfe0 R11: 0000000000000020 R12: 0000000000000020
[   18.068482] R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000001
[   18.069005]  </TASK>
[   18.069097] CR2: 00000000000006fc

BUG: kernel NULL pointer dereference, address: 00000000000006fc tells us that the kernel crashed trying to access the address 0x6fc. The kernel reports faults on addresses this close to zero as NULL pointer dereferences, so 0x6fc is most likely a small offset added to a NULL pointer.

RIP: 0010:_raw_spin_lock_irqsave+0x36/0x70 tells us that the bad access happened in the function _raw_spin_lock_irqsave(). Below that, the stack trace tells us how we got there.

Tip

Ignore call trace lines starting with ?. These are stale function addresses on the stack that are not part of the actual call trace. They are printed as a hint/fail-safe, but they are misleading as often as not.

We’ll look at the trace in more detail with drgn soon, but we can see that we got an interrupt for a disk I/O completion, which then tried to wake up a task and acquire a spinlock.
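
Before digging in, here’s one quick cross-check: the oops header says the crash happened in PID 906 (the fio process), and drgn’s view of the crashed thread should agree:

>>> prog.crashed_thread().tid
906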

Stack Trace in drgn

Now let’s look at drgn’s view of the stack trace. Save the stack trace of the crashed thread:

>>> trace = prog.crashed_thread().stack_trace()

And print it:

>>> trace
#0  arch_atomic_try_cmpxchg (./arch/x86/include/asm/atomic.h:107:9)
#1  raw_atomic_try_cmpxchg_acquire (./include/linux/atomic/atomic-arch-fallback.h:2170:9)
#2  atomic_try_cmpxchg_acquire (./include/linux/atomic/atomic-instrumented.h:1302:9)
#3  queued_spin_lock (./include/asm-generic/qspinlock.h:111:6)
#4  do_raw_spin_lock (./include/linux/spinlock.h:187:2)
#5  __raw_spin_lock_irqsave (./include/linux/spinlock_api_smp.h:111:2)
#6  _raw_spin_lock_irqsave (kernel/locking/spinlock.c:162:9)
#7  class_raw_spinlock_irqsave_constructor (./include/linux/spinlock.h:551:1)
#8  try_to_wake_up (kernel/sched/core.c:4051:2)
#9  rq_qos_wake_function (block/blk-rq-qos.c:223:2)
#10 __wake_up_common (kernel/sched/wait.c:89:9)
#11 __wake_up_common_lock (kernel/sched/wait.c:106:14)
#12 wbt_done (block/blk-wbt.c:259:3)
#13 __rq_qos_done (block/blk-rq-qos.c:39:4)
#14 rq_qos_done (block/blk-rq-qos.h:122:3)
#15 blk_mq_free_request (block/blk-mq.c:737:2)
#16 virtblk_done (drivers/block/virtio_blk.c:367:5)
#17 vring_interrupt (drivers/virtio/virtio_ring.c:2595:3)
#18 vp_vring_interrupt (drivers/virtio/virtio_pci_common.c:82:7)
#19 vp_interrupt (drivers/virtio/virtio_pci_common.c:113:9)
#20 __handle_irq_event_percpu (kernel/irq/handle.c:158:9)
#21 handle_irq_event_percpu (kernel/irq/handle.c:193:11)
#22 handle_irq_event (kernel/irq/handle.c:210:8)
#23 handle_fasteoi_irq (kernel/irq/chip.c:720:2)
#24 generic_handle_irq_desc (./include/linux/irqdesc.h:173:2)
#25 handle_irq (arch/x86/kernel/irq.c:247:3)
#26 call_irq_handler (arch/x86/kernel/irq.c:259:3)
#27 __common_interrupt (arch/x86/kernel/irq.c:285:6)
#28 common_interrupt (arch/x86/kernel/irq.c:278:1)
#29 asm_common_interrupt+0x26/0x2b (./arch/x86/include/asm/idtentry.h:693)
#30 0x7f1ef33679b9

Notice that drgn’s stack trace includes information not in the kernel trace, namely:

  1. File names and line and column numbers. These are very useful for navigating the code that you’re debugging.

  2. Inlined function calls. For example, frames 0-5 are all inlined calls, and frame 6 was the last actual call. You can verify this by printing each frame individually:

    >>> trace[0]
    #0 at 0xffffffff814b6446 (_raw_spin_lock_irqsave+0x36/0x68) in arch_atomic_try_cmpxchg at ./arch/x86/include/asm/atomic.h:107:9 (inlined)
    >>> trace[1]
    #1 at 0xffffffff814b6446 (_raw_spin_lock_irqsave+0x36/0x68) in raw_atomic_try_cmpxchg_acquire at ./include/linux/atomic/atomic-arch-fallback.h:2170:9 (inlined)
    >>> trace[2]
    #2 at 0xffffffff814b6446 (_raw_spin_lock_irqsave+0x36/0x68) in atomic_try_cmpxchg_acquire at ./include/linux/atomic/atomic-instrumented.h:1302:9 (inlined)
    >>> trace[3]
    #3 at 0xffffffff814b6446 (_raw_spin_lock_irqsave+0x36/0x68) in queued_spin_lock at ./include/asm-generic/qspinlock.h:111:6 (inlined)
    >>> trace[4]
    #4 at 0xffffffff814b6446 (_raw_spin_lock_irqsave+0x36/0x68) in do_raw_spin_lock at ./include/linux/spinlock.h:187:2 (inlined)
    >>> trace[5]
    #5 at 0xffffffff814b6446 (_raw_spin_lock_irqsave+0x36/0x68) in __raw_spin_lock_irqsave at ./include/linux/spinlock_api_smp.h:111:2 (inlined)
    >>> trace[6]
    #6 at 0xffffffff814b6446 (_raw_spin_lock_irqsave+0x36/0x68) in _raw_spin_lock_irqsave at kernel/locking/spinlock.c:162:9
    

    Notice that frames 0-5 end with (inlined), and all of the frames have the same instruction pointer, 0xffffffff814b6446.
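
    You can also read the program counter directly from a frame and confirm that the inlined frames share it:

    >>> hex(trace[0].pc)
    '0xffffffff814b6446'
    >>> trace[0].pc == trace[6].pc
    True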

Tracing Local Variables

Next, let’s walk through the stack trace to figure out where the null pointer came from.

Frames 0-2 are low-level atomic operations:

#0  arch_atomic_try_cmpxchg (./arch/x86/include/asm/atomic.h:107:9)
#1  raw_atomic_try_cmpxchg_acquire (./include/linux/atomic/atomic-arch-fallback.h:2170:9)
#2  atomic_try_cmpxchg_acquire (./include/linux/atomic/atomic-instrumented.h:1302:9)

That’s essentially a fancy memory access, so let’s skip those frames. Frame 3 is in queued_spin_lock(), the kernel’s spinlock implementation:

#3  queued_spin_lock (./include/asm-generic/qspinlock.h:111:6)

In your window with the Linux source code, open include/asm-generic/qspinlock.h and jump to line 111:

include/asm-generic/qspinlock.h
107 static __always_inline void queued_spin_lock(struct qspinlock *lock)
108 {
109         int val = 0;
110
111         if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
112                 return;
113
114         queued_spin_lock_slowpath(lock, val);
115 }

Notice that it accesses the lock parameter. Print it in drgn:

>>> trace[3]["lock"]
(struct qspinlock *)0x6fc

This matches the address from the BUG message in dmesg!
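
If you’d rather compare the values programmatically, Object.value_() returns the raw pointer value:

>>> hex(trace[3]["lock"].value_())
'0x6fc'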

Now let’s find out where lock came from. Frames 4-7 wrap the low-level spinlock implementation:

#4  do_raw_spin_lock (./include/linux/spinlock.h:187:2)
#5  __raw_spin_lock_irqsave (./include/linux/spinlock_api_smp.h:111:2)
#6  _raw_spin_lock_irqsave (kernel/locking/spinlock.c:162:9)
#7  class_raw_spinlock_irqsave_constructor (./include/linux/spinlock.h:551:1)

Feel free to open the source code for these, but we can quickly check that the lock simply gets passed through:

>>> trace[4]["lock"]
(raw_spinlock_t *)0x6fc
>>> trace[5]["lock"]
(raw_spinlock_t *)0x6fc
>>> trace[6]["lock"]
(raw_spinlock_t *)0x6fc

class_raw_spinlock_irqsave_constructor() is slightly different. It is generated by a macro and doesn’t use the name lock:

>>> trace[7]["lock"]
Traceback (most recent call last):
  ...
KeyError: 'lock'

Let’s list all of its local variables and make a guess:

>>> trace[7].locals()
['l', '_t']
>>> trace[7]["l"]
(raw_spinlock_t *)0x6fc

Tip

Use drgn.StackFrame.locals() to get the list of parameters and local variables in a stack frame when finding the implementation of the function is inconvenient.
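
For example, when it’s not obvious which frame defines which variable, a quick loop over the first several frames maps them out (an illustrative sketch; output omitted):

>>> for i in range(8):
...     print(i, trace[i].name, trace[i].locals())
...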

The caller must have passed in 0x6fc, so let’s look at the caller. The next frame is in try_to_wake_up():

#8  try_to_wake_up (kernel/sched/core.c:4051:2)

Open kernel/sched/core.c at line 4051:

kernel/sched/core.c
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
...
        scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {

It is acquiring pi_lock in a task_struct (using a scoped guard). Print the task_struct:

>>> trace[8]["p"]
(struct task_struct *)0x0

There’s our null pointer! But where did 0x6fc come from? Look at the offset of pi_lock in struct task_struct:

>>> hex(offsetof(prog.type("struct task_struct"), "pi_lock"))
'0x6fc'

Or do the inverse and see what’s at offset 0x6fc in struct task_struct:

>>> member_at_offset(prog.type("struct task_struct"), 0x6fc)
'pi_lock.raw_lock.val.counter or pi_lock.raw_lock.locked or pi_lock.raw_lock.locked_pending'
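
container_of() goes the other direction: it computes a pointer to the containing structure from a pointer to a member. Applied to the lock pointer from frame 6, it lands exactly on NULL, consistent with a NULL task_struct pointer:

>>> container_of(trace[6]["lock"], "struct task_struct", "pi_lock")
(struct task_struct *)0x0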

Tip

Use offsetof() and member_at_offset() to decipher pointers to struct members.

So where did p come from? Let’s look at the caller, rq_qos_wake_function(), in frame 9:

#9  rq_qos_wake_function (block/blk-rq-qos.c:223:2)

Open block/blk-rq-qos.c at line 223:

block/blk-rq-qos.c
206 static int rq_qos_wake_function(struct wait_queue_entry *curr,
207                                 unsigned int mode, int wake_flags, void *key)
208 {
209         struct rq_qos_wait_data *data = container_of(curr,
210                                                      struct rq_qos_wait_data,
211                                                      wq);
212
213         /*
214          * If we fail to get a budget, return -1 to interrupt the wake up loop
215          * in __wake_up_common.
216          */
217         if (!data->cb(data->rqw, data->private_data))
218                 return -1;
219
220         data->got_token = true;
221         smp_wmb();
222         list_del_init(&curr->entry);
223         wake_up_process(data->task);
224         return 1;
225 }

(Note: wake_up_process() doesn’t show up in the stack trace because of tail call elimination. This may be fixed in a future release of drgn.)

p came from data->task. Print data:

>>> trace[9]["data"]
*(struct rq_qos_wait_data *)0xffffc900011b3558 = {
        .wq = (struct wait_queue_entry){
                .flags = (unsigned int)2168637095,
                .private = (void *)0xffff888002d6c000,
                .func = (wait_queue_func_t)0x0,
                .entry = (struct list_head){
                        .next = (struct list_head *)0xffff888002d6c000,
                        .prev = (struct list_head *)0xffff888002da2100,
                },
        },
        .task = (struct task_struct *)0xffff888000fd6001,
        .rqw = (struct rq_wait *)0xffffc900011b3a30,
        .cb = (acquire_inflight_cb_t *)0xffff888002763030,
        .private_data = (void *)0x1,
        .got_token = (bool)201,
}

Notice that data->task is NOT null. Print the comm member, which should be the thread name:

>>> trace[9]["data"].task.comm
(char [16])""

Instead, it’s empty. This doesn’t appear to be a valid task_struct.

Identifying Mystery Addresses

If data->task isn’t a valid task_struct, then what is it? Pass it to identify_address() to answer that:

>>> identify_address(trace[9]["data"].task)
'slab object: buffer_head+0x1'

It points into a slab object of a completely unrelated type, a buffer_head, so it is clearly not a valid task_struct pointer.

Since our problem seems to stem from data, pass it to identify_address() to see where it comes from:

>>> identify_address(trace[9]["data"])
'vmap stack: 909 (fio) +0x3558'

This means that data is on the stack of the task with PID 909.
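
We can double-check that with find_task(), which looks up the task_struct for PID 909. Its stack member is the base of its vmapped kernel stack, and data should fall within it (a quick sketch, assuming the usual 16 KiB x86-64 kernel stack, i.e., THREAD_SIZE of 0x4000):

>>> task = find_task(909)
>>> task.comm
(char [16])"fio"
>>> stack_base = task.stack.value_()
>>> stack_base <= trace[9]["data"].value_() < stack_base + 0x4000
True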

Tip

Use identify_address() to figure out what an unknown address refers to.

Other Stacks

Notice that we’ve now seen three inconsistent views of data->task:

  1. When it was passed to wake_up_process(), it was NULL.

  2. By the time of the crash, it was an unrelated pointer.

  3. It’s supposed to point to a task_struct.

This suggests that there’s a data race on data->task.

We know that data is on the stack of another task. Let’s find where it’s created. In block/blk-rq-qos.c, search for struct rq_qos_wait_data. You should find it being used in rq_qos_wait():

block/blk-rq-qos.c
243 void rq_qos_wait(struct rq_wait *rqw, void *private_data,
244                  acquire_inflight_cb_t *acquire_inflight_cb,
245                  cleanup_cb_t *cleanup_cb)
246 {
247         struct rq_qos_wait_data data = {
248                 .wq = {
249                         .func       = rq_qos_wake_function,
250                         .entry      = LIST_HEAD_INIT(data.wq.entry),
251                 },
252                 .task = current,
253                 .rqw = rqw,
254                 .cb = acquire_inflight_cb,
255                 .private_data = private_data,
256         };
257         bool has_sleeper;
258
259         has_sleeper = wq_has_sleeper(&rqw->wait);
260         if (!has_sleeper && acquire_inflight_cb(rqw, private_data))
261                 return;
262
263         has_sleeper = !prepare_to_wait_exclusive(&rqw->wait, &data.wq,
264                                                  TASK_UNINTERRUPTIBLE);
265         do {
266                 /* The memory barrier in set_task_state saves us here. */
267                 if (data.got_token)
268                         break;
269                 if (!has_sleeper && acquire_inflight_cb(rqw, private_data)) {
270                         finish_wait(&rqw->wait, &data.wq);
271
272                         /*
273                          * We raced with rq_qos_wake_function() getting a token,
274                          * which means we now have two. Put our local token
275                          * and wake anyone else potentially waiting for one.
276                          */
277                         smp_rmb();
278                         if (data.got_token)
279                                 cleanup_cb(rqw, private_data);
280                         break;
281                 }
282                 io_schedule();
283                 has_sleeper = true;
284                 set_current_state(TASK_UNINTERRUPTIBLE);
285         } while (1);
286         finish_wait(&rqw->wait, &data.wq);
287 }

This function creates data on the stack, with data->task set to the current task, and then tries to acquire an “inflight counter”. If one is not available, it puts itself on a wait queue and blocks until it can get one.

So, rq_qos_wait() waits for an inflight counter, and rq_qos_wake_function() wakes it up when one becomes available. We would expect that the PID we found earlier, 909, is currently blocked in rq_qos_wait(). Pass the PID to stack_trace() to check:

>>> stack_trace(909)
#0  rep_nop (./arch/x86/include/asm/vdso/processor.h:0:2)
#1  cpu_relax (./arch/x86/include/asm/vdso/processor.h:18:2)
#2  queued_spin_lock_slowpath (kernel/locking/qspinlock.c:380:3)
#3  queued_spin_lock (./include/asm-generic/qspinlock.h:114:2)
#4  do_raw_spin_lock (./include/linux/spinlock.h:187:2)
#5  __raw_spin_lock_irqsave (./include/linux/spinlock_api_smp.h:111:2)
#6  _raw_spin_lock_irqsave (kernel/locking/spinlock.c:162:9)
#7  virtblk_add_req_batch (drivers/block/virtio_blk.c:481:2)
#8  virtio_queue_rqs (drivers/block/virtio_blk.c:519:11)
#9  __blk_mq_flush_plug_list (block/blk-mq.c:2704:2)
#10 blk_mq_flush_plug_list (block/blk-mq.c:2781:4)
#11 blk_add_rq_to_plug (block/blk-mq.c:1292:3)
#12 blk_mq_submit_bio (block/blk-mq.c:3028:3)
#13 __submit_bio (block/blk-core.c:615:3)
#14 __submit_bio_noacct_mq (block/blk-core.c:696:3)
#15 submit_bio_noacct_nocheck (block/blk-core.c:725:3)
#16 ext4_io_submit (fs/ext4/page-io.c:377:3)
#17 io_submit_add_bh (fs/ext4/page-io.c:418:3)
#18 ext4_bio_write_folio (fs/ext4/page-io.c:560:3)
#19 mpage_submit_folio (fs/ext4/inode.c:1943:8)
#20 mpage_process_page_bufs (fs/ext4/inode.c:2056:9)
#21 mpage_prepare_extent_to_map (fs/ext4/inode.c:2564:11)
#22 ext4_do_writepages (fs/ext4/inode.c:2706:8)
#23 ext4_writepages (fs/ext4/inode.c:2842:8)
#24 do_writepages (mm/page-writeback.c:2683:10)
#25 __filemap_fdatawrite_range (mm/filemap.c:430:9)
#26 generic_fadvise (mm/fadvise.c:114:3)
#27 vfs_fadvise (mm/fadvise.c:185:9)
#28 ksys_fadvise64_64 (mm/fadvise.c:199:8)
#29 __do_sys_fadvise64 (mm/fadvise.c:214:9)
#30 __se_sys_fadvise64 (mm/fadvise.c:212:1)
#31 __x64_sys_fadvise64 (mm/fadvise.c:212:1)
#32 do_syscall_x64 (arch/x86/entry/common.c:52:14)
#33 do_syscall_64 (arch/x86/entry/common.c:83:7)
#34 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
#35 0x7f1ef340203a

It’s not in rq_qos_wait()! It seems to have moved on to something else.
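
It isn’t even asleep: the trace shows it actively spinning on an unrelated spinlock in the virtio-blk submission path, so its task state should be running ('R'):

>>> task_state_to_char(find_task(909))
'R'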

Analysis

At this point, we’ve gotten everything that we need from drgn. Now we need to interpret what we’ve gathered and analyze the kernel code.

Based on the stack trace for PID 909, we can conclude that the waiter got a counter, returned, and moved on to something else. It reused the stack for unrelated data, which explains the mystery pointer that we saw in data->task. The series of events is something like this:

  1. acquire_inflight_cb() on line 260 fails.

  2. prepare_to_wait_exclusive() puts data on the waitqueue.

  3. acquire_inflight_cb() on line 269 succeeds.

  4. finish_wait() removes data from the waitqueue.

  5. rq_qos_wait() returns and the task moves on to something else, reusing the stack memory.

This means that the waker found the waiter’s data in between steps 2 and 4, but by the time the waker called wake_up_process(data->task), the waiter was past step 5.

Wakers and waiters are supposed to be synchronized. Going back to the crashing stack trace, we see that rq_qos_wake_function() is called via __wake_up_common_lock():

#10 __wake_up_common (kernel/sched/wait.c:89:9)
#11 __wake_up_common_lock (kernel/sched/wait.c:106:14)

Open kernel/sched/wait.c at line 106 and see that it’s holding wq_head->lock:

kernel/sched/wait.c
 99 static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int mode,
100                         int nr_exclusive, int wake_flags, void *key)
101 {
102         unsigned long flags;
103         int remaining;
104
105         spin_lock_irqsave(&wq_head->lock, flags);
106         remaining = __wake_up_common(wq_head, mode, nr_exclusive, wake_flags,
107                         key);
108         spin_unlock_irqrestore(&wq_head->lock, flags);
109
110         return nr_exclusive - remaining;
111 }

On the waiter side, finish_wait() also grabs wq_head->lock:

kernel/sched/wait.c
446 void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry)
447 {
448         unsigned long flags;
449
450         __set_current_state(TASK_RUNNING);
451         /*
452          * We can check for list emptiness outside the lock
453          * IFF:
454          *  - we use the "careful" check that verifies both
455          *    the next and prev pointers, so that there cannot
456          *    be any half-pending updates in progress on other
457          *    CPU's that we haven't seen yet (and that might
458          *    still change the stack area.
459          * and
460          *  - all other users take the lock (ie we can only
461          *    have _one_ other CPU that looks at or modifies
462          *    the list).
463          */
464         if (!list_empty_careful(&wq_entry->entry)) {
465                 spin_lock_irqsave(&wq_head->lock, flags);
466                 list_del_init(&wq_entry->entry);
467                 spin_unlock_irqrestore(&wq_head->lock, flags);
468         }
469 }

But there’s an important detail here: finish_wait() doesn’t take the lock if the wait queue list entry is empty, i.e., if it has already been removed from the wait queue.

Go back to rq_qos_wake_function():

block/blk-rq-qos.c
206 static int rq_qos_wake_function(struct wait_queue_entry *curr,
207                                 unsigned int mode, int wake_flags, void *key)
208 {
209         struct rq_qos_wait_data *data = container_of(curr,
210                                                      struct rq_qos_wait_data,
211                                                      wq);
212
213         /*
214          * If we fail to get a budget, return -1 to interrupt the wake up loop
215          * in __wake_up_common.
216          */
217         if (!data->cb(data->rqw, data->private_data))
218                 return -1;
219
220         data->got_token = true;
221         smp_wmb();
222         list_del_init(&curr->entry);
223         wake_up_process(data->task);
224         return 1;
225 }

It removes the entry from the wait queue on line 222, then, on line 223, accesses data, which contains the entry and lives on the waiter’s stack.

That’s the race condition: as soon as the entry has been removed from the wait queue, finish_wait() in the waiter can return instantly, and the waiter is free to move on. Therefore, after the entry has been removed, the waker must not access it.

The Fix

The fix is trivial: don’t delete the wait queue entry until after using it.

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 2cfb297d9a62..058f92c4f9d5 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -219,8 +219,8 @@ static int rq_qos_wake_function(struct wait_queue_entry *curr,

        data->got_token = true;
        smp_wmb();
-       list_del_init(&curr->entry);
        wake_up_process(data->task);
+       list_del_init_careful(&curr->entry);
        return 1;
 }

The deletion also needs careful memory ordering to pair with the list_empty_careful() in finish_wait(), hence the replacement of list_del_init() with list_del_init_careful().

This fix was merged in Linux 6.12 in commit e972b08b91ef (“blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race”).

Conclusion

Debugging a core dump involves a lot of cross-referencing code and core dump state. drgn gives you some powerful capabilities for understanding kernel state, which you can use to discern subtle bugs like this one. In particular, identify_address(), member_at_offset(), and drgn.StackFrame.locals() are often crucial to an investigation.

Feel free to reference the Helpers and explore this core dump further.

Bonus Challenge: Reading File Pages

As a bonus, try dumping the contents of the file /init in the core dump (this is the script that I used to reproduce the bug).

First, find the inode for /init and its file size.

Hint

See path_lookup().

Answer
>>> inode = path_lookup("/init").dentry.d_inode
>>> inode
*(struct inode *)0xffff88800289c568 = {
        ...
}
>>> inode.i_size
(loff_t)578

The page cache for an inode is in an XArray, inode->i_mapping->i_pages. Get the cached page at offset 0.

Hint

See xa_load() and cast().

Answer
>>> entry = xa_load(inode.i_mapping.i_pages.address_of_(), 0)
>>> page = cast("struct page *", entry)
>>> page
*(struct page *)0xffffea000015f840 = {
        ...
}

Get the page’s virtual address.

Hint

See page_to_virt().

Answer
>>> addr = page_to_virt(page)
>>> addr
(void *)0xffff8880057e1000

Finally, read from the virtual address.

Hint

See drgn.Program.read().

Answer
>>> print(prog.read(addr, inode.i_size).decode())
#!/bin/sh -e

mount -t proc -o nosuid,nodev,noexec proc /proc
mount -t devtmpfs -o nosuid dev /dev
mkdir /dev/shm
mount -t tmpfs -o nosuid,nodev tmpfs /dev/shm
mount -t sysfs -o nosuid,nodev,noexec sys /sys
mount -t tmpfs -o nosuid,nodev tmpfs /tmp
kexec --load-panic --kexec-syscall-auto --command-line="root=/dev/vda rw console=ttyS0,115200 init=/kdump-init" vmlinuz
echo 1 > /sys/block/vda/queue/wbt_lat_usec
while true; do
        cat /init > /dev/null
done &
fio --name=writer --rw=randwrite --ioengine=sync --buffered=1 --bs=4K --time_based --runtime=3600 --size=16M
poweroff -f