Skip to content

[th/l3cfg-acd-crash] fix assertion failure in l3cfg's ACD handling during "instance-reset"

Thomas Haller requested to merge th/l3cfg-acd-crash into main

The ACD state handling is unfortunately very complicated. That is, because we obviously need to track state about how ACD is going (acd_data, and in particular NML3AcdAddrState). Then there are various things that can happen, which are the AcdStateChangeMode enums. All these state-changes are then in one function: _l3_acd_data_state_change(), which is therefore complicated (I don't think that it would become simpler by spreading this code out to different functions, on the contrary). Anyway.

So, what happens when we need to reset the n-acd instance? For example, because the MAC address of the link changed or some error. I guess, we need to restart probing.

Previously, I think this was not handled properly. We already tried to fix this several times, the last was commit b3316063 ('l3cfg: on n-acd instance-reset clear also ready ACD state'). There is still an issue ([1]).

The bug [1] is, that we are in state NM_L3_ACD_ADDR_STATE_READY, during ACD_STATE_CHANGE_MODE_TIMEOUT event. That leads to an assertion failure.

  #5  0x00007f23be74698f in g_assertion_message_expr (domain=0x5629aca70359 "nm", file=0x5629aca62aab "src/core/nm-l3cfg.c", line=2395, func=0x5629acb26b30 <__func__.72.lto_priv.4> "_l3_acd_data_state_change", expr=<optimized out>) at ../glib/gtestutils.c:3091
  #6  0x00005629ac937e46 in _l3_acd_data_state_change (self=0x5629add69790, acd_data=0x5629add8d520, state_change_mode=ACD_STATE_CHANGE_MODE_TIMEOUT, sender_addr=0x0, p_now_msec=0x7ffded506460) at src/core/nm-l3cfg.c:2395
  #7  0x00005629ac939f4d in _l3_acd_data_timeout_cb (user_data=user_data@entry=0x5629add8d520) at src/core/nm-l3cfg.c:1933
  #8  0x00007f23be71c5a1 in g_timeout_dispatch (source=0x5629addd7a80, callback=0x5629ac939ee0 <_l3_acd_data_timeout_cb>, user_data=0x5629add8d520) at ../glib/gmain.c:4889
  #9  0x00007f23be71bd4f in g_main_dispatch (context=0x5629adc6da00) at ../glib/gmain.c:3337
  #10 g_main_context_dispatch (context=0x5629adc6da00) at ../glib/gmain.c:4055

That can only happen, (I think) when we scheduled the timeout during an earlier ACD_STATE_CHANGE_MODE_INSTANCE_RESET event. Meaning, we need to handle instance-reset better.

Instead, during instance-reset, switch always back to state PROBING, and let the timeout figure it out.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2047788

Merge request reports