
Segmentation fault during amd-fftw-5.1 library load (dl_open) in Python #23

@rowanworth

Description


Hiya,

I ran into this one while exploring amd-fftw as an optimisation option for a pipeline running on Genoa (EPYC 9004) nodes. The software stack includes a lot of Python modules, but I've isolated the segfault to a very simple reproducer:

Python 3.9.18 (main, Oct  4 2024, 00:00:00) 
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux

$ python3 -c 'import ctypes; ctypes.cdll.LoadLibrary("libfftw3.so.3")'
Segmentation fault (core dumped)

Looking at the core dump:

Core was generated by `python3 -c import ctypes; ctypes.cdll.LoadLibrary("libfftw3.so.3")'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000041c06 in ?? ()
Missing separate debuginfos, use: dnf debuginfo-install python3-3.9.18-3.el9_4.6.x86_64

(gdb) p $_siginfo._sifields._sigfault
$1 = {si_addr = 0x41c06, _addr_lsb = 0, _addr_bnd = {_lower = 0x0, _upper = 0x0}}

(gdb) bt
#0  0x0000000000041c06 in ?? ()
#1  0x0000150d616482b9 in fmv_resolver_cpy2d_pair () from /tmp/gcc-5.1/lib/libfftw3.so.3
#2  0x0000150d6fc80360 in elf_machine_rela (skip_ifunc=0, 
    reloc_addr_arg=0x150d61d66330 <fftw_cpy2d_pair@got[plt]>, version=<optimized out>, 
    sym=0x150d6160b0a0, reloc=0x150d6163f1e8, scope=0x5636cfd4dfe0, map=0x5636cfd4dc70)
    at ../sysdeps/x86_64/dl-machine.h:331
#3  elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=<optimized out>, nrelative=<optimized out>, 
    relsize=<optimized out>, reladdr=<optimized out>, scope=<optimized out>, map=0x5636cfd4dc70)
    at /usr/src/debug/glibc-2.34-100.el9_4.4.x86_64/elf/do-rel.h:142
#4  _dl_relocate_object (l=0x5636cfd4dc70, scope=<optimized out>, reloc_mode=<optimized out>, 
    consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:288
#5  0x0000150d6fc7cc0b in _dl_open_relocate_one_object (args=args@entry=0x7fff176c4990, 
    r=r@entry=0x150d6fca8080 <_r_debug>, l=<optimized out>, reloc_mode=reloc_mode@entry=0, 
    relocation_in_progress=relocation_in_progress@entry=0x7fff176c468f) at dl-open.c:485
#6  0x0000150d6fc7d611 in _dl_open_relocate_one_object (relocation_in_progress=0x7fff176c468f, 
    reloc_mode=0, l=<optimized out>, r=0x150d6fca8080 <_r_debug>, args=0x7fff176c4990) at dl-open.c:452
#7  dl_open_worker_begin (a=0x7fff176c4990) at dl-open.c:695
#8  0x0000150d6f556148 in _dl_catch_exception () from /lib64/libc.so.6
#9  0x0000150d6fc7cafa in dl_open_worker (a=0x7fff176c4990) at dl-open.c:771
#10 0x0000150d6f556148 in _dl_catch_exception () from /lib64/libc.so.6
#11 0x0000150d6fc7cf5f in _dl_open (file=<optimized out>, mode=-2147483646, 
    caller_dlopen=0x150d6f673f2f <py_dl_open+143>, nsid=-2, argc=3, argv=0x7fff176c5888, 
    env=0x7fff176c58a8) at dl-open.c:873
#12 0x0000150d6f485cbc in dlopen_doit () from /lib64/libc.so.6
#13 0x0000150d6f556148 in _dl_catch_exception () from /lib64/libc.so.6
#14 0x0000150d6f556213 in _dl_catch_error () from /lib64/libc.so.6
#15 0x0000150d6f48578e in _dlerror_run () from /lib64/libc.so.6
#16 0x0000150d6f485d71 in dlopen@GLIBC_2.2.5 () from /lib64/libc.so.6
#17 0x0000150d6f673f2f in py_dl_open ()
   from /usr/lib64/python3.9/lib-dynload/_ctypes.cpython-39-x86_64-linux-gnu.so
#18 0x0000150d6f925c98 in cfunction_call () from /lib64/libpython3.9.so.1.0
#19 0x0000150d6f917ff4 in _PyObject_MakeTpCall () from /lib64/libpython3.9.so.1.0
#20 0x0000150d6f91489e in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#21 0x0000150d6f90e875 in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#22 0x0000150d6f91c095 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#23 0x0000150d6f917a57 in _PyObject_FastCallDictTstate () from /lib64/libpython3.9.so.1.0
#24 0x0000150d6f923097 in slot_tp_init () from /lib64/libpython3.9.so.1.0
#25 0x0000150d6f918273 in type_call () from /lib64/libpython3.9.so.1.0
#26 0x0000150d6f917ff4 in _PyObject_MakeTpCall () from /lib64/libpython3.9.so.1.0
#27 0x0000150d6f914f2e in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#28 0x0000150d6f91c323 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#29 0x0000150d6f9247d1 in method_vectorcall () from /lib64/libpython3.9.so.1.0
#30 0x0000150d6f91491e in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#31 0x0000150d6f90e875 in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#32 0x0000150d6f988a55 in _PyEval_EvalCodeWithName () from /lib64/libpython3.9.so.1.0
#33 0x0000150d6f9889ed in PyEval_EvalCodeEx () from /lib64/libpython3.9.so.1.0
#34 0x0000150d6f98899f in PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#35 0x0000150d6f9b8f84 in run_eval_code_obj () from /lib64/libpython3.9.so.1.0
#36 0x0000150d6f9b4de6 in run_mod () from /lib64/libpython3.9.so.1.0
#37 0x0000150d6f9abac8 in PyRun_StringFlags () from /lib64/libpython3.9.so.1.0
#38 0x0000150d6f9ab790 in PyRun_SimpleStringFlags () from /lib64/libpython3.9.so.1.0
#39 0x0000150d6f9ab278 in Py_RunMain () from /lib64/libpython3.9.so.1.0
#40 0x0000150d6f97b38d in Py_BytesMain () from /lib64/libpython3.9.so.1.0
#41 0x0000150d6f429590 in __libc_start_call_main () from /lib64/libc.so.6
#42 0x0000150d6f429640 in __libc_start_main_impl () from /lib64/libc.so.6
#43 0x00005636cf54d095 in _start ()
(gdb) 
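Frame #11 shows `_dl_open` being called with `mode=-2147483646`. Reading that value as an unsigned 32-bit flag word shows how the library was asked to be loaded. A minimal sketch, assuming glibc's internal `__RTLD_DLOPEN` flag is `0x80000000` (it is not exposed by Python's `os` module, so it is hardcoded here as an assumption). `RTLD_LOCAL` is 0 on glibc, so it leaves no trace in the value:

```python
import os

# Mode value copied from frame #11 of the backtrace.
MODE = -2147483646

# Assumption: glibc-internal flag that dlopen() ORs in before calling _dl_open.
RTLD_DLOPEN_INTERNAL = 0x80000000

flags = MODE & 0xFFFFFFFF  # reinterpret the signed int as an unsigned 32-bit word
print(hex(flags))
print("RTLD_NOW set:", bool(flags & os.RTLD_NOW))
print("RTLD_LAZY set:", bool(flags & os.RTLD_LAZY))
# Any bits not explained by RTLD_NOW or the internal dlopen flag.
leftover = flags & ~(os.RTLD_NOW | RTLD_DLOPEN_INTERNAL) & 0xFFFFFFFF
print("leftover bits:", hex(leftover))
```

If the assumption about the internal flag is right, this decodes to `RTLD_NOW` with no other public flags set, which is consistent with CPython's `ctypes` adding `RTLD_NOW` to every load it performs (to my knowledge it does this in `py_dl_open`). Under immediate binding, the IFUNC resolver for `fftw_cpy2d_pair` runs inside `_dl_relocate_object` rather than on first call, which matches frames #1 to #4.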

Dump of assembler code for function fmv_resolver_cpy2d_pair:
   0x0000150d616482b0 <+0>:	sub    $0x8,%rsp
   0x0000150d616482b4 <+4>:	call   0x150d61641c00 <fftw_have_simd_avx512@plt>
=> 0x0000150d616482b9 <+9>:	test   %eax,%eax
   0x0000150d616482bb <+11>:	je     0x150d616482d0 <fmv_resolver_cpy2d_pair+32>
   0x0000150d616482bd <+13>:	mov    0x71dcdc(%rip),%rax        # 0x150d61d65fa0
   0x0000150d616482c4 <+20>:	add    $0x8,%rsp
   0x0000150d616482c8 <+24>:	ret
   0x0000150d616482c9 <+25>:	nopl   0x0(%rax)
   0x0000150d616482d0 <+32>:	call   0x150d61641c20 <fftw_have_simd_avx@plt>
   0x0000150d616482d5 <+37>:	test   %eax,%eax
   0x0000150d616482d7 <+39>:	je     0x150d616482f0 <fmv_resolver_cpy2d_pair+64>
   0x0000150d616482d9 <+41>:	mov    0x71dc28(%rip),%rax        # 0x150d61d65f08
   0x0000150d616482e0 <+48>:	add    $0x8,%rsp
   0x0000150d616482e4 <+52>:	ret
   0x0000150d616482e5 <+53>:	data16 cs nopw 0x0(%rax,%rax,1)
   0x0000150d616482f0 <+64>:	mov    0x71dc91(%rip),%rax        # 0x150d61d65f88
   0x0000150d616482f7 <+71>:	add    $0x8,%rsp
   0x0000150d616482fb <+75>:	ret
End of assembler dump.
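The fault address `0x41c06` is far below anything mapped for the library, which is suspicious. As a back-of-the-envelope check, the sketch below combines two numbers from the dumps above: the resolver's call target `fftw_have_simd_avx512@plt` at `0x150d61641c00`, and the fault address from frame #0 / `si_addr`. It assumes (not verified; `objdump -d` on `libfftw3.so.3` would confirm or refute it) that this PLT stub sits at link-time offset `0x41c00` and uses the classic x86-64 lazy-binding layout, where a 6-byte `jmp *GOT` is followed by a `push` whose address is the GOT slot's initial, unrelocated contents.

```python
# Addresses taken from the gdb output above.
PLT_STUB_RUNTIME = 0x150d61641c00  # fftw_have_simd_avx512@plt, called at <+4>
FAULT_ADDR = 0x41c06               # frame #0 / si_addr

# Assumption: link-time offset of that PLT stub inside libfftw3.so.3.
PLT_STUB_LINKTIME = 0x41c00

load_base = PLT_STUB_RUNTIME - PLT_STUB_LINKTIME
offset_into_stub = FAULT_ADDR - PLT_STUB_LINKTIME

print(f"implied load base: {load_base:#x}")
print(f"page aligned: {load_base % 0x1000 == 0}")
print(f"fault is stub + {offset_into_stub}")
# What the GOT slot would have held once the load base was added to it.
print(f"relocated slot value: {load_base + FAULT_ADDR:#x}")
```

If the assumption holds, the numbers line up: the implied load base is page aligned and the crash is exactly `stub + 6`, i.e. the resolver appears to have jumped through a GOT slot that had not yet had the load base added. This is an inference from the addresses alone.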

The binary in question is the one from https://www.amd.com/en/developer/aocl/fftw/eula/fftw-libraries-5-1-eula.html?filename=aocl-fftw-linux-gcc-5.1.0.tar.gz

$ md5sum /tmp/gcc-5.1/lib/libfftw3.so.3 
21fbd71ec790cbdb322564b785e12528  /tmp/gcc-5.1/lib/libfftw3.so.3
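For anyone trying to reproduce this, a small helper can confirm that a local copy is the same binary. It is a sketch: the expected hash is the one above, and the script takes library paths on the command line (pass your own unpack location).

```python
import hashlib
import sys

EXPECTED_MD5 = "21fbd71ec790cbdb322564b785e12528"


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Usage: python3 check_md5.py /tmp/gcc-5.1/lib/libfftw3.so.3
    for path in sys.argv[1:]:
        actual = md5_of(path)
        print(path, actual, "matches" if actual == EXPECTED_MD5 else "differs")
```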

I also tried libfftw3.so.3 from the aocc-5.0 build, and that one did not crash. I couldn't find an aocc-5.1 or gcc-5.0 build to narrow down whether the problem is new in AOCL-FFTW v5.1 or was somehow introduced by gcc. But it looks like the only real change between 5.0 and 5.1 was in mpi/transpose-pairwise-omc.c, so that suggests gcc (or a change in the build environment) is the culprit?
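Because the failure is a hard crash rather than a Python exception, comparing builds is easiest when each load happens in a fresh interpreter. A minimal sketch (the script and its output labels are hypothetical, not part of AOCL): it runs the same `ctypes.cdll.LoadLibrary` call as the reproducer in a child process for each path given, and reports `ok`, `error` (the load raised, e.g. `OSError`), or the name of the fatal signal.

```python
import signal
import subprocess
import sys

# Same call as the one-line reproducer, with the library path taken from argv.
CHILD_CODE = "import ctypes, sys; ctypes.cdll.LoadLibrary(sys.argv[1])"


def describe(returncode: int) -> str:
    """Map a child exit status to 'ok', 'error', or a signal name."""
    if returncode == 0:
        return "ok"
    if returncode < 0:  # on POSIX, subprocess reports death-by-signal as -signum
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return "error"


def try_load(path: str) -> str:
    """Load `path` in a separate interpreter so a segfault can't kill the caller."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, path],
        capture_output=True,
    )
    return describe(proc.returncode)


if __name__ == "__main__":
    # e.g. python3 compare_builds.py /tmp/gcc-5.1/lib/libfftw3.so.3 <aocc-5.0 libfftw3.so.3>
    for lib in sys.argv[1:]:
        print(f"{try_load(lib):>8}  {lib}")
```

`SIGSEGV` for the gcc-5.1 library next to `ok` for the aocc-5.0 one would match the behaviour described above, and the same script can be pointed at any further builds that turn up.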

I'm on an AMD EPYC 9654 96-Core Processor running Rocky Linux 9.4 -- let me know if you need any other info.
