Skip to content
This repository was archived by the owner on Nov 12, 2025. It is now read-only.

resiliency can be improved when facing constrained memory situations #5515

@bfaccini

Description

@bfaccini

Environment Information

  • PMDK package version(s): 1.12.1
  • OS(es) version(s): CentOS 7.9
  • ndctl version(s): ndctl-65-5
  • kernel version(s): 3.10.0-1160.45.1.el7.x86_64
  • compiler, libraries, packaging and other related tools version(s): DAOS v2.3

Please provide a reproduction of the bug:

Running DAOS Servers as a simple user with memory limits.

How often bug is revealed: (always, often, rare): rare

Actual behavior:

Silent abort() or SEGV, leading to unnecessary/wasted time spent to debug.

Expected behavior:

graceful handling.

Details

1st case/abort() encountered with a stack trace like following :

#0  0x00002aeb80a98387 in raise () from /lib64/libc.so.6
#1  0x00002aeb80a99a78 in abort () from /lib64/libc.so.6
#2  0x00002aeb803d48bc in out_get_errormsg () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#3  0x00002aeb803d4978 in out_err () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#4  0x00002aeb803d4fcc in ravl_emplace () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#5  0x00002aeb803f720e in pmemobj_tx_add_common.constprop.21 () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#6  0x00002aeb803f814c in pmemobj_tx_add_range_direct () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#7  0x00002aeb7ec7fadf in pmem_tx_add_ptr (umm=<optimized out>, ptr=<optimized out>, size=<optimized out>) at src/common/mem.c:133
#8  0x00002aeb7f4a0823 in umem_tx_add_ptr (size=8, ptr=0x2af703cd2bd8, umm=0x564a24f6fb80) at src/include/daos/mem.h:470
#9  ilog_ptr_set_full (lctx=lctx@entry=0x564a24f6fb30, dest=dest@entry=0x2af703cd2bd8, src=src@entry=0x2aedb9aa3490, len=len@entry=8) at src/vos/ilog.c:345
#10 0x00002aeb7f4a0b95 in update_inplace (lctx=0x564a24f6fb30, id_out=0x2af703cd2bd8, id_in=0x2aedb9aa3660, opc=1, is_equal=<optimized out>) at src/vos/ilog.c:661
#11 0x00002aeb7f4a11af in ilog_modify (loh=..., id_in=id_in@entry=0x2aedb9aa3660, epr=epr@entry=0x2aedb9aa35e0, opc=opc@entry=1) at src/vos/ilog.c:926
#12 0x00002aeb7f4a411e in ilog_persist (loh=..., id=id@entry=0x2aedb9aa3660) at src/vos/ilog.c:1017
#13 0x00002aeb7f485ef4 in dtx_ilog_rec_release (abort=96, dae=0x2aedbd922838, rec=<optimized out>, cont=0x0, umm=0x2aedbd922838) at src/vos/vos_dtx.c:561
#14 do_dtx_rec_release (umm=umm@entry=0x2aedbd5ae2b8, cont=cont@entry=0x2aedbd6efe20, dae=dae@entry=0x2aedbd922838, rec=<optimized out>, abort=abort@entry=false) at src/vos/vos_dtx.c:583
#15 0x00002aeb7f4861ba in do_dtx_rec_release (abort=false, rec=<optimized out>, dae=0x2aedbd922838, cont=0x2aedbd6efe20, umm=0x2aedbd5ae2b8) at src/vos/vos_dtx.c:578
#16 dtx_rec_release (cont=cont@entry=0x2aedbd6efe20, dae=0x2aedbd922838, abort=abort@entry=false) at src/vos/vos_dtx.c:688
#17 0x00002aeb7f48a02c in vos_dtx_commit_one (fatal=<synthetic pointer>, rm_cos=0x2aedbfe5745d, dae_p=0x2aedbfe92a58, dce_p=<synthetic pointer>, cmt_time=946528699681341440, epoch=0, dti=0x2aedbff49588, cont=0x2aedbd6efe20)
    at src/vos/vos_dtx.c:856
#18 vos_dtx_commit_internal (cont=cont@entry=0x2aedbd6efe20, dtis=dtis@entry=0x2aedbff47050, count=count@entry=512, epoch=epoch@entry=0, rm_cos=rm_cos@entry=0x2aedbfe572d0, daes=daes@entry=0x2aedbfe91df0, dces=dces@entry=0x2aedbff4a060)
    at src/vos/vos_dtx.c:1949
#19 0x00002aeb7f48d058 in vos_dtx_commit (coh=..., dtis=dtis@entry=0x2aedbff47050, count=count@entry=512, rm_cos=rm_cos@entry=0x2aedbfe572d0) at src/vos/vos_dtx.c:2140
#20 0x00002aeb8b944e97 in dtx_commit (cont=cont@entry=0x2aedbd6efa50, dtes=0x2aedbff2eef0, dcks=0x2aedbff43040, count=count@entry=512, epoch=epoch@entry=0) at src/dtx/dtx_rpc.c:813
#21 0x00002aeb8b950a2c in dtx_batched_commit_one (arg=0x2aedbd6f0d80) at src/dtx/dtx_common.c:518
#22 0x00002aeb7ff7f4aa in ABTD_ythread_func_wrapper () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#23 0x00002aeb7ff7f651 in make_fcontext () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#24 0x0000000000000000 in ?? ()

Further corefile analysis along with associated pmdk/libpmemobj source code review indicates that the abort()/FATAL() comes from the error-return handling from the malloc() call in Last_errormsg_get() :

 static inline struct errormsg *
Last_errormsg_get(void)
{
        Last_errormsg_key_alloc();

        struct errormsg *errormsg = os_tls_get(Last_errormsg_key);
        if (errormsg == NULL) {
                errormsg = malloc(sizeof(struct errormsg));
                if (errormsg == NULL)
                        FATAL("!malloc");
                /* make sure it contains empty string initially */
                errormsg->msg[0] = '\0';
                int ret = os_tls_set(Last_errormsg_key, errormsg);
                if (ret)
                        FATAL("!os_tls_set");
        }
        return errormsg;
}
"src/core/out.c" [readonly] line 69 of 598 --11%-- col 3-10

and more source code analysis also indicates that this behaviour could be avoided if these allocation/initialisation stuff could be done upstream during out_init(). What do you think ??

2nd case/SEGV encountered with a stack trace like following :

#0  0x00002b6a9e82c42f in ucs_debug_backtrace_next (bckt=0x0, line=line@entry=0x2b950edea8c0) at debug/debug.c:495
#1  0x00002b6a9e82c6a7 in ucs_debug_print_backtrace (stream=0x2b6a98b3a1c0 <_IO_2_1_stderr_>, strip=strip@entry=2) at debug/debug.c:656
#2  0x00002b6a9e82e945 in ucs_handle_error (message=message@entry=0x2b6a9e8f0d27 "tkill(2) or tgkill(2)") at debug/debug.c:1081
#3  0x00002b6a9e82ed0c in ucs_debug_handle_error_signal (signo=signo@entry=11, cause=0x2b6a9e8f0d27 "tkill(2) or tgkill(2)", fmt=fmt@entry=0x2b6a9e8f0f8b " at address %p") at debug/debug.c:1033
#4  0x00002b6a9e82ef7b in ucs_error_signal_handler (signo=11, info=0x2b950edead30, context=<optimized out>) at debug/debug.c:1055
#5  <signal handler called>
#6  0x00002b6a9810266c in palloc_defer_free_create () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#7  0x00002b6a981090a1 in pmemobj_tx_xfree () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/pmdk/lib/libpmemobj.so.1
#8  0x00002b6a9698fbae in pmem_tx_free (umm=0x2b6d155ae2b8, umoff=72350928) at src/common/mem.c:92
#9  0x00002b6a971960e2 in dtx_rec_release (cont=cont@entry=0x2b6d156efe20, dae=0x2b6d174b4d60, abort=abort@entry=true) at src/vos/vos_dtx.c:665
#10 0x00002b6a9719d9ea in vos_dtx_abort (coh=..., dti=0x2b6d17f87cf0, epoch=epoch@entry=946528706794618887) at src/vos/vos_dtx.c:2204
#11 0x00002b6aa3654aea in dtx_abort (cont=cont@entry=0x2b6d156efa50, dte=dte@entry=0x2b6d17f87cf0, epoch=946528706794618887) at src/dtx/dtx_rpc.c:896
#12 0x00002b6aa3664382 in dtx_leader_end (dlh=0x2b6d17f87cf0, coh=<optimized out>, result=<optimized out>) at src/dtx/dtx_common.c:1290
#13 0x00002b6aa3d9d114 in ds_obj_rw_handler (rpc=0x2b6d17db1db0) at src/object/srv_obj.c:2732
#14 0x00002b6a96ed1f68 in crt_handle_rpc (arg=0x2b6d17db1db0) at src/cart/crt_rpc.c:1654
#15 0x00002b6a97c8f4aa in ABTD_ythread_func_wrapper () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#16 0x00002b6a97c8f651 in make_fcontext () from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#17 0x00002b950ecec000 in ?? ()
#18 0x0000000000100000 in ?? ()
#19 0x00002b6a96ed1f20 in ?? () at src/cart/crt_rpc.c:663 from /work2/08126/dbohninx/frontera/BUILDS/daos-11940_pr10837/20221116/daos/install/bin/../lib64/libcart.so.4
#20 0x00002b6d17db1db0 in ?? ()
#21 0x00002b950edebfe0 in ?? ()
#22 0x00002b950edebfe0 in ?? ()
#23 0x000055e1455968d0 in ?? ()
#24 0x000055e14503f5f0 in spi_key_cmp (htable=<optimized out>, rlink=<optimized out>, key=<optimized out>, len=<optimized out>) at src/engine/sched.c:289
#25 0x0000000000000000 in ?? ()

Further corefile analysis along with associated pmdk/libpmemobj source code review indicates that the SEGV has occurred due to a 0xffffffffffffff80 invalid action pointer being returned by tx_action_add() and to be passed as 3rd parameter to palloc_defer_free() in pmemobj_tx_xfree(), and looking more into the associated code (in core/out.c, libpmemobj/palloc.c, common/mem.c, libpmemobj/tx.c, common/vec.h), this should be caused by the fact tx->actions->buffer value is NULL due to a previous Realloc() error in vec_reserve().
For this issue, I believe this could be handled by testing the VEC_INC_BACK() negative return value in tx_action_add() and thus have this latter to return NULL in this case instead of an invalid pointer. What do you think ?

I would be happy to push a PR for both issues if you agree with my analysis, just let me know.

Additional information about Priority and Help Requested:

Are you willing to submit a pull request with a proposed change? Yes

Requested priority: Medium

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions