[BUG] FEMU crashed on FDP mode in victim_ru_get_pri(a=0x0) #189

@xcws52

Describe the bug
Hello! I'm testing YCSB on RocksDB on an emulated FDP SSD. FEMU crashed without any error message during the KV pair loading phase. The workload is not too heavy: it only reached about half of the capacity after roughly 4 minutes. The FDP-Trace log repeatedly printed "GC_BACK_RESERT triggered but delay GC (ru 21 ipc 7168 threshold 8192 full 65536)", and then FEMU crashed.

Environment

  • Host OS: Debian 13
  • Kernel version: Linux 6.12.57+deb13-amd64
  • FEMU version/commit: eb01bb7
  • FEMU mode: FDP-mode
  • GuestOS Image: Ubuntu 25.10 qcow2 image
  • Guest Kernel: Linux 6.17.0-29-generic
  • FDP configuration:
    • fdp=on
    • fdp.nruh=8
    • fdp.nrg=1
    • fdp.nru=141
    • fdp.ru.size=256MB
    • device.size=32G

To Reproduce
Steps to reproduce the behavior:

  1. Compile FEMU from commit eb01bb7
  2. Add the following configuration to run-blackbox-fdp.sh:
secsz=512         # sector size in bytes
secs_per_pg=8     # number of sectors in a flash page
pgs_per_blk=1024   # number of pages per flash block
blks_per_pl=141   # number of blocks per plane
pls_per_lun=1     # keep it at one, no multiplanes support
luns_per_ch=8     # number of chips per channel
nchs=8            # number of channels
ssd_size=32768    # in megabytes, consider 25% overprovisioning

# Latency in nanoseconds
pg_rd_lat=40000   # page read latency
pg_wr_lat=200000  # page write latency
blk_er_lat=2000000 # block erase latency
ch_xfer_lat=0     # channel transfer time

# GC Threshold (1-100)
gc_thres_pcent=75
gc_thres_pcent_high=95

# FDP Configuration
fdp_nruh=8        # number of reclaim unit handles
fdp_nrg=1         # number of reclaim groups
fdp_nru=$blks_per_pl  # total number of reclaim units (= blks_per_pl)
  3. Compile YCSB-CPP from https://github.com/ls4154/YCSB-cpp and TorFS from https://github.com/SamsungDS/TorFS with RocksDB 9.3.1
  4. Create an XFS file system on /dev/nvme0n1 and mount it on /mnt/fdp
  5. Run YCSB workloada with 15M records and 30M operations, with the DB path pointing to the FDP SSD mount point. FEMU crashes during the KV loading phase.

Expected behavior
The benchmark should finish gracefully.
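
For reference when reading the trace output, the geometry implied by the configuration in the reproduction steps can be worked out by hand. The sketch below does that arithmetic in C. It assumes that one RU spans the same block index on every plane, LUN, and channel (so fdp_nru = blks_per_pl); the struct and function names are hypothetical and are not taken from the FEMU source.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the configuration fragment in this report. */
struct flash_geometry {
    uint64_t secsz;        /* sector size in bytes */
    uint64_t secs_per_pg;  /* sectors per flash page */
    uint64_t pgs_per_blk;  /* pages per flash block */
    uint64_t blks_per_pl;  /* blocks per plane (= fdp_nru here) */
    uint64_t pls_per_lun;  /* planes per LUN */
    uint64_t luns_per_ch;  /* LUNs per channel */
    uint64_t nchs;         /* number of channels */
};

static const struct flash_geometry REPORT_GEOMETRY = {512, 8, 1024, 141, 1, 8, 8};

/* Bytes in one flash page. */
static inline uint64_t page_bytes(const struct flash_geometry *g)
{
    return g->secsz * g->secs_per_pg;
}

/* ASSUMPTION: an RU is one block from every plane of every LUN. */
static inline uint64_t blocks_per_ru(const struct flash_geometry *g)
{
    return g->pls_per_lun * g->luns_per_ch * g->nchs;
}

/* Pages in one RU. */
static inline uint64_t pages_per_ru(const struct flash_geometry *g)
{
    return blocks_per_ru(g) * g->pgs_per_blk;
}

/* Bytes in one RU. */
static inline uint64_t ru_bytes(const struct flash_geometry *g)
{
    return pages_per_ru(g) * page_bytes(g);
}

/* Raw flash capacity across all RUs. */
static inline uint64_t raw_bytes(const struct flash_geometry *g)
{
    return ru_bytes(g) * g->blks_per_pl;
}

/* Whole RUs left over once the advertised capacity (in MiB) is subtracted. */
static inline uint64_t spare_rus(const struct flash_geometry *g, uint64_t ssd_size_mib)
{
    uint64_t logical = ssd_size_mib * 1024 * 1024;
    assert(raw_bytes(g) >= logical);
    return (raw_bytes(g) - logical) / ru_bytes(g);
}
```

Under that assumption the numbers line up with the trace: 64 blocks per RU (blocks_erased=64), 65536 pages per RU (full 65536), 268435456 bytes per RU (mbe_delta=268435456), and 4096 bytes per migrated page (mbmw_delta=12288 for 3 pages). With ssd_size=32768, the 141 RUs leave 13 RUs of spare space.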

Error logs
At first, FEMU allocates RUs for RUH 0 without any problem:

[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 136) old_ru=136 new_ru=137 reason=full_victim victim_ru_cnt 2
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 137) old_ru=137 new_ru=138 reason=full_victim victim_ru_cnt 3
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 138) old_ru=138 new_ru=139 reason=full_victim victim_ru_cnt 4
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 139) old_ru=139 new_ru=140 reason=full_victim victim_ru_cnt 5
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 140) old_ru=140 new_ru=0 reason=full_victim victim_ru_cnt 6
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 0) old_ru=0 new_ru=1 reason=full_victim victim_ru_cnt 7
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 1) old_ru=1 new_ru=2 reason=full_victim victim_ru_cnt 8
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 2) old_ru=2 new_ru=3 reason=full_victim victim_ru_cnt 9
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 3) old_ru=3 new_ru=4 reason=full_victim victim_ru_cnt 10
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 4) old_ru=4 new_ru=5 reason=full_victim victim_ru_cnt 11
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 5) old_ru=5 new_ru=6 reason=full_victim victim_ru_cnt 12
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 6) old_ru=6 new_ru=7 reason=full_victim victim_ru_cnt 13
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 7) old_ru=7 new_ru=8 reason=full_victim victim_ru_cnt 14
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 8) old_ru=8 new_ru=9 reason=full_victim victim_ru_cnt 15

Then FEMU starts GC like this:

[FEMU] FDP-Trace: GC_START rgid=0 ruhid=0 victim_ru=136 victim_vpc=3 isolation=PI gc_type=BACK
[FEMU] FDP-Trace: GC_DONE victim_ru=136 pages_migrated=3 blocks_erased=64 mbmw_delta=12288 mbe_delta=268435456
[FEMU] FDP-Trace: GC_START rgid=0 ruhid=0 victim_ru=12 victim_vpc=3 isolation=PI gc_type=BACK
[FEMU] FDP-Trace: GC_DONE victim_ru=12 pages_migrated=3 blocks_erased=64 mbmw_delta=12288 mbe_delta=268435456
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 92) old_ru=92 new_ru=94 reason=full_victim victim_ru_cnt 97
[FEMU] FDP-Trace: GC_START rgid=0 ruhid=0 victim_ru=7 victim_vpc=4 isolation=PI gc_type=BACK
[FEMU] FDP-Trace: GC_DONE victim_ru=7 pages_migrated=4 blocks_erased=64 mbmw_delta=16384 mbe_delta=268435456
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 94) old_ru=94 new_ru=95 reason=full_victim victim_ru_cnt 97
[FEMU] FDP-Trace: GC_START rgid=0 ruhid=0 victim_ru=24 victim_vpc=4 isolation=PI gc_type=BACK
[FEMU] FDP-Trace: GC_DONE victim_ru=24 pages_migrated=4 blocks_erased=64 mbmw_delta=16384 mbe_delta=268435456
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 95) old_ru=95 new_ru=96 reason=full_victim victim_ru_cnt 97
[FEMU] FDP-Trace: GC_START rgid=0 ruhid=0 victim_ru=6 victim_vpc=5 isolation=PI gc_type=BACK
[FEMU] FDP-Trace: GC_DONE victim_ru=6 pages_migrated=5 blocks_erased=64 mbmw_delta=20480 mbe_delta=268435456
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 96) old_ru=96 new_ru=97 reason=full_victim victim_ru_cnt 97
[FEMU] FDP-Trace: GC_START rgid=0 ruhid=0 victim_ru=22 victim_vpc=5 isolation=PI gc_type=BACK
[FEMU] FDP-Trace: GC_DONE victim_ru=22 pages_migrated=5 blocks_erased=64 mbmw_delta=20480 mbe_delta=268435456
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 97) old_ru=97 new_ru=98 reason=full_victim victim_ru_cnt 97
[FEMU] FDP-Trace: GC_START rgid=0 ruhid=0 victim_ru=9 victim_vpc=6 isolation=PI gc_type=BACK
[FEMU] FDP-Trace: GC_DONE victim_ru=9 pages_migrated=6 blocks_erased=64 mbmw_delta=24576 mbe_delta=268435456
[FEMU] FDP-Trace: RU_ROTATE ruhid=0(curr_ru 98) old_ru=98 new_ru=99 reason=full_victim victim_ru_cnt 97

After a few minutes, GC of some victim RUs is delayed because the invalid page count (ipc) of those RUs has not reached the threshold:

[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 87 ipc 6823 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 50 ipc 6616 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 47 ipc 6240 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 75 ipc 5955 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 47 ipc 6240 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 75 ipc 5955 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 47 ipc 6240 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 75 ipc 5955 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 47 ipc 6240 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 75 ipc 5955 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 47 ipc 6240 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 75 ipc 5955 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 47 ipc 6240 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 75 ipc 5955 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 47 ipc 6240 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 75 ipc 5955 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 47 ipc 6240 threshold 8192 full 65536)
[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC (ru 75 ipc 5955 threshold 8192 full 65536)
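
In the lines above, the threshold (8192) is one eighth of a full RU (65536 pages), and RUs 47 and 75 are retried with unchanged ipc values (6240 and 5955). A small triage helper such as the following can pull those fields out of a saved log. It is only a sketch: the struct and function names are hypothetical, and the strict "ipc < threshold" comparison is an assumption inferred from the log text, not taken from the FEMU source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Fields printed by one "delay GC" trace line. */
struct gc_delay_event {
    unsigned ru;         /* reclaim unit index */
    unsigned ipc;        /* invalid page count of that RU */
    unsigned threshold;  /* ipc required before GC proceeds, per the log */
    unsigned full;       /* pages in a full RU */
};

/* Parse one trace line; returns true only if all four fields were read. */
static inline bool parse_gc_delay(const char *line, struct gc_delay_event *ev)
{
    assert(line != NULL && ev != NULL);
    return sscanf(line,
                  "[FEMU] FDP-Trace: GC_BACK_RESERT triggered but delay GC "
                  "(ru %u ipc %u threshold %u full %u)",
                  &ev->ru, &ev->ipc, &ev->threshold, &ev->full) == 4;
}

/* ASSUMPTION: GC is delayed while ipc is strictly below the threshold. */
static inline bool below_threshold(const struct gc_delay_event *ev)
{
    return ev->ipc < ev->threshold;
}

/* Invalid pages still missing before the threshold is met (0 if already met). */
static inline unsigned pages_short(const struct gc_delay_event *ev)
{
    return below_threshold(ev) ? ev->threshold - ev->ipc : 0;
}
```

Feeding each log line through parse_gc_delay and grouping by ru makes it easy to see whether any delayed RU ever gains invalid pages between retries.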

Then FEMU crashed without any error message.

GDB points to victim_ru_get_pri and reports that qemu-system-x86 received SIGSEGV:

Thread 37 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.

0x0000555555ac3a58 in victim_ru_get_pri (a=0x0)
    at ../hw/femu/bbssd/ftl.c:152

152         return ((FemuReclaimUnit *)a)->vpc;

I also used GDB to capture a backtrace:

#0  0x0000555555ac3a58 in victim_ru_get_pri (a=0x0) at ../hw/femu/bbssd/ftl.c:152
#1  0x0000555555acdc2b in percolate_down (q=0x55557bf60a90, i=103) at ../hw/femu/lib/pqueue.c:109
        child_node = 0
        moving_node = 0x0
        moving_pri = 16109300872511081216
#2  0x0000555555acded3 in pqueue_change_priority (q=0x55557bf60a90, new_pri=65534, d=0x55557bf609a0) at ../hw/femu/lib/pqueue.c:157
        posn = 103
        old_pri = 65534
#3  0x0000555555ac6b33 in mark_page_invalid_fdp (ssd=0x555558aef920, ppa=0x7ff34b2fb6f8) at ../hw/femu/bbssd/ftl.c:1378
        spp = 0x555558aef928
        blk = 0x55556790e050
        pg = 0x5555681be9e0
        line = 0x55557bf5c740
        ru = 0x55557bf609a0
        rm = 0x555558450ac0
        was_full_ru = false
#4  0x0000555555ac8a99 in ssd_stream_write (n=0x555558a50840, ssd=0x555558aef920, req=0x7ffff4080ef0) at ../hw/femu/bbssd/ftl.c:2025
        ret = 0x55557bf60380
        swr = {
          type = 0,
          cmd = 1,
          stime = 12375027858264394
        }
        ns = 0x555558a59ab0
        spp = 0x555558aef928
        rg = 0x55557bf5cc80
        ruh = 0x55557bf64040
        ru = 0x55557bf60380
        lba = 42968992
        len = 8192
        start_lpn = 5371124
        end_lpn = 5372147
        ppa = {
          {
            g = {
              blk = 139,
              pg = 9,
              sec = 0,
              pl = 0,
              lun = 3,
              ch = 3,
              rsv = 0
            },
            ppa = 217017207044505739
          }
        }
        lpn = 5371921
        curlat = 43661807
        maxlat = 43661807
        r = 0
        pid = 0
        dtype = 0 '\000'
        ph = 0
        rgid = 0
        ruhid = 0
#5  0x0000555555ac8ebd in nvme_do_write_fdp (n=0x555558a50840, req=0x7ffff4080ef0, slba=42968992, nlb=8192) at ../hw/femu/bbssd/ftl.c:2097
        ns = 0x555558a59ab0
        ssd = 0x555558aef920
        spp = 0x555558aef928
        data_bytes = 4194304
        pid = 0
        dtype = 0 '\000'
        ph = 0
        rg = 0
        ruhid = 0
#6  0x0000555555aca924 in ftl_thread (arg=0x555558a50840) at ../hw/femu/bbssd/ftl.c:2467
        n = 0x555558a50840
        ssd = 0x555558aef920
        req = 0x7ffff4080ef0
        lat = 0
        rc = 1
        i = 1
#7  0x000055555616996e in qemu_thread_start (args=0x55557bf691f0) at ../util/qemu-thread-posix.c:393
        __cancel_buf = {
          __cancel_jmp_buf = {{
              __cancel_jmp_buf = {140682915204316, -5503248113342606125, 32, 0, 140737488343216, 140682906812416, -5503248113319537453, -1806873028232342317},
              __mask_was_saved = 0
            }},
          __pad = {0x7ff34b2fb970, 0x0, 0x0, 0x0}
        }
        __cancel_routine = 0x5555561697f2 <qemu_thread_atexit_notify>
        __cancel_arg = 0x0
        __not_first_call = 0
        qemu_thread_args = 0x55557bf691f0
        start_routine = 0x555555aca77e <ftl_thread>
        arg = 0x555558a50840
        r = 0x0
#8  0x00007ffff7702b7b in ??? () at /lib/x86_64-linux-gnu/libc.so.6
#9  0x00007ffff77807b8 in ??? () at /lib/x86_64-linux-gnu/libc.so.6
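
Reading frames #1 to #3: pqueue_change_priority looked up posn = 103 for ru = 0x55557bf609a0, and percolate_down then found moving_node = 0x0 in that slot, so the slot at the RU's recorded position appears to be empty. One possible explanation, which is an assumption and has not been checked against the FEMU source, is that the RU was no longer in the victim queue but still carried a stale position. The sketch below shows a membership guard on a minimal indexed min-heap keyed on vpc. It is not FEMU's pqueue, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_CAP 8

struct ru { unsigned vpc; size_t pos; };   /* pos == 0 means "not queued" */
struct ru_heap { struct ru *d[HEAP_CAP + 1]; size_t size; };  /* slots are 1-based */

/* True only if r's recorded slot really holds r. */
static bool heap_contains(const struct ru_heap *h, const struct ru *r)
{
    return r->pos >= 1 && r->pos <= h->size && h->d[r->pos] == r;
}

static void place(struct ru_heap *h, size_t i, struct ru *r) { h->d[i] = r; r->pos = i; }

static void sift_up(struct ru_heap *h, size_t i)
{
    assert(i >= 1 && i <= h->size);
    struct ru *r = h->d[i];
    while (i > 1 && h->d[i / 2]->vpc > r->vpc) {
        place(h, i, h->d[i / 2]);
        i /= 2;
    }
    place(h, i, r);
}

static void sift_down(struct ru_heap *h, size_t i)
{
    assert(i >= 1 && i <= h->size);
    struct ru *r = h->d[i];
    for (;;) {
        size_t c = 2 * i;
        if (c > h->size) break;
        if (c + 1 <= h->size && h->d[c + 1]->vpc < h->d[c]->vpc) c++;
        if (h->d[c]->vpc >= r->vpc) break;
        place(h, i, h->d[c]);
        i = c;
    }
    place(h, i, r);
}

static bool heap_push(struct ru_heap *h, struct ru *r)
{
    if (h->size == HEAP_CAP || heap_contains(h, r)) return false;
    h->d[++h->size] = r;
    r->pos = h->size;
    sift_up(h, h->size);
    return true;
}

/* Removes the RU with the lowest vpc and clears its position. */
static struct ru *heap_pop(struct ru_heap *h)
{
    if (h->size == 0) return NULL;
    struct ru *top = h->d[1];
    struct ru *last = h->d[h->size];
    h->d[h->size--] = NULL;
    top->pos = 0;                       /* no stale position after removal */
    if (h->size > 0) { place(h, 1, last); sift_down(h, 1); }
    return top;
}

/* Updates vpc; re-heapifies only if r is actually queued. */
static bool heap_change_vpc(struct ru_heap *h, struct ru *r, unsigned new_vpc)
{
    unsigned old = r->vpc;
    r->vpc = new_vpc;
    if (!heap_contains(h, r)) return false;   /* guard: skip empty or foreign slot */
    if (new_vpc < old) sift_up(h, r->pos); else sift_down(h, r->pos);
    return true;
}
```

The two relevant choices are that heap_pop resets pos to 0 and that heap_change_vpc verifies the slot before touching the heap, so a priority update on an RU that was already removed becomes a no-op instead of dereferencing an empty slot.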

Additional context
I can complete this experiment on the EA version of WARP with the same FEMU configuration and an identical image.
But even in the EA version of WARP, if the RU usage exceeds gc_thres_pcent_high, some GC attempts may report [FEMU] FTL-Err: unable to find victim RU, gc skip, and then WARP also crashes silently.

The full FEMU log and the GDB backtrace are in the attached files.

femu-segv-victim-ru.log

femu-fdp-2026-06-04-083835.log
