Intermittent segfault when freeing MMManager object using g_object_unref()

Mon Jun 7 08:56:06 UTC 2021

Hey!

>
> > "modems" here is the list returned by the manager with
> > g_dbus_object_manager_get_objects(). The list contains full
> > references, so g_list_free_full() is fine.
> >
> > "object" is an element of that list. Are you really doing a
> > g_object_ref() of the object when you're iterating? Or are you
> > unref-ing the "object" without having acquired a full reference?
> > Remember that you're then doing a g_list_free_full() on the whole
> > object list. Maybe here's the problem?
>
> That is a good point -- it doesn't look like I need to free "object."
>
> > If you send the test application source code (doesn't need to be your
> > full application, just a reproducer) I can give it a look; we should
> > definitely confirm this is not a problem in libmm-glib itself.
>
> I put together a small standalone application to test this out, and was able to reproduce the segfault with the double-free you pointed out above.  When I explicitly free "object," I get the following warning message after freeing "manager":
>
>     (process:5480): GLib-GObject-CRITICAL **: 19:45:59.258: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
>
> When I don't explicitly free "object," I do not get that error message, and no longer seem to get the segfault!  So that issue seems to be resolved.  Thanks for catching that.
>

Yey!

> Unfortunately, it seems like there is effectively a memory leak in the code.  I have attached the standalone application to this email if you are interested in taking a look.  The two places it indicates leaks (multiple times each) are:
>
>     12,196 bytes in 239 blocks are still reachable in loss record 8 of 11
>         at 0x4DA4124: g_type_create_instance (in /usr/lib/libgobject-2.0.so.0.6400.4)
>
>     23,244 bytes in 126 blocks are still reachable in loss record 9 of 11
>         at 0x4842B4C: realloc (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
>

"still reachable" isn't always a leak, it just means some memory was
allocated and is kept used and tracked by the time the program exits.
A true memory leak ("definitely lost" and "indirectly lost") are the
ones to really take care of, as those are kept used but not tracked
during program exit. That said, if the "still reachable" amount
increases all the time, it may still be a problem to fix. I wrote a
document some years ago explaining all these different types of leaks,
in case you're interested
https://aleksander.es/data/valgrind-memcheck.pdf :)

How are you running the valgrind test? Are you running it with
G_SLICE=always-malloc?
What version of GLib are you using? I hope it's GLib >= 2.55.1.

> From running one iteration through valgrind, it looks like the "leaks" (valgrind reports some memory as "possibly lost" and most as "still reachable") are part of the library, so I do not believe I am missing any frees/unrefs in my code.

Most of the "possibly lost" leaks may be due to not using
G_SLICE=always-malloc (as GLib would use the internal slice allocator
instead) and also due to how GLib itself manages internal memory, e.g.
in the GObject type registration system.

>  For thoroughness, I ran mmcli through valgrind to print my modem information and that also reported a large amount of "possibly lost" and "still reachable" memory, so I believe this is expected.  However, this memory grows each iteration, and it eventually stops working due to a lack of available memory.  The use case for my application is to run continuously for longer periods of time (anywhere from hours to weeks on 128MB of RAM), so I don't think this is sustainable.

Ok, that is bad, and a thing to fix definitely. And I believe I know
the culprit, see below.

>
> Is there any way I can force the library to clean up memory earlier than it normally would?  It seems like it holds onto the memory even after objects are fully unreferenced, or perhaps the disposal functions for the objects are not being run yet for some reason.

See, you need to do 2 things. One, make sure you always run your
valgrind tests with G_SLICE=always-malloc and --leak-check=full
--show-reachable=yes, as that will give you very accurate results; and
two, get a full valgrind log of your application after that increased
memory to give a look at which specific still reachable leak is the
one increasing and taking up most memory. Then we can determine
whether it's in glib, libmm-glib, or your application. From your
tester, I can see already some issues.

> The other thing I have been playing around with is forking off a new process and using the library functions there to circumvent the leak, but I would like to avoid doing it this way if possible as it adds complexity.
>

But that would not circumvent any leak, you would be moving it to a
different place. Or do you mean you would fork a process every time
you need to run some libmm-glib operation?

I've analyzed your program and the valgrind results, and I believe I
now know what the issue is.

You're exclusively using libmm-glib with the "sync" APIs; i.e. you're
running a GLib mainloop to process the events asynchronously. This, in
theory, should not be a problem. But I'm afraid it is in this case
because of how GLib/GDBus works. I've modified your tester to run 2000
times with no delay in between retries, and there are a ton of still
reachable block reported all with the same signature:

==709343== 78,000 bytes in 2,000 blocks are still reachable in loss
record 1,213 of 1,218
==709343==    at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==709343==    by 0x4C0DE98: g_malloc (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C28153: g_strdup (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C06180: g_source_set_name (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A7EEE0: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A7F1C6: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4BD65B5: ??? (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A82BD2: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A86F5A: g_dbus_connection_signal_unsubscribe (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A91E64: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4B6FD0D: g_object_unref (in
/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.6400.6)
==709343==    by 0x4AA1D82: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==
==709343== 78,000 bytes in 2,000 blocks are still reachable in loss
record 1,214 of 1,218
==709343==    at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==709343==    by 0x4C0DE98: g_malloc (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C28153: g_strdup (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C06180: g_source_set_name (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A7EEE0: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A7F1C6: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4BD65B5: ??? (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A82BD2: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A86F5A: g_dbus_connection_signal_unsubscribe (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A91E94: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4B6FD0D: g_object_unref (in
/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.6400.6)
==709343==    by 0x4AA1D82: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==
==709343== 80,000 bytes in 2,000 blocks are still reachable in loss
record 1,215 of 1,218
==709343==    at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==709343==    by 0x4C0DE98: g_malloc (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C26485: g_slice_alloc (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C26AAD: g_slice_alloc0 (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C053EF: g_source_new (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C090E6: g_idle_source_new (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A7EEAB: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A7F1C6: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4BD65B5: ??? (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A82BD2: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A86F5A: g_dbus_connection_signal_unsubscribe (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A91E64: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==
==709343== 80,000 bytes in 2,000 blocks are still reachable in loss
record 1,216 of 1,218
==709343==    at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==709343==    by 0x4C0DE98: g_malloc (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C26485: g_slice_alloc (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C26AAD: g_slice_alloc0 (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C053EF: g_source_new (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C090E6: g_idle_source_new (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A7EEAB: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A7F1C6: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4BD65B5: ??? (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A82BD2: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A86F5A: g_dbus_connection_signal_unsubscribe (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A91E94: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==
==709343== 192,000 bytes in 2,000 blocks are still reachable in loss
record 1,217 of 1,218
==709343==    at 0x483DD99: calloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==709343==    by 0x4C0DEF0: g_malloc0 (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C053E2: g_source_new (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C090E6: g_idle_source_new (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A7EEAB: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A7F1C6: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4BD65B5: ??? (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A82BD2: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A86F5A: g_dbus_connection_signal_unsubscribe (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A91E64: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4B6FD0D: g_object_unref (in
/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.6400.6)
==709343==    by 0x4AA1D82: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==
==709343== 192,000 bytes in 2,000 blocks are still reachable in loss
record 1,218 of 1,218
==709343==    at 0x483DD99: calloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==709343==    by 0x4C0DEF0: g_malloc0 (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C053E2: g_source_new (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4C090E6: g_idle_source_new (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A7EEAB: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A7F1C6: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4BD65B5: ??? (in
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6)
==709343==    by 0x4A82BD2: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A86F5A: g_dbus_connection_signal_unsubscribe (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4A91E94: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)
==709343==    by 0x4B6FD0D: g_object_unref (in
/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.6400.6)
==709343==    by 0x4AA1D82: ??? (in
/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.6400.6)

What's happening here is that there is some g_object_unref() inside
GIO that *creates* an idle source and attaches it to the main context
in GLib. The purpose, I assume, is to run something once the object
has been fully disposed, but in your case, the problem is that as
there is no main loop, those idles keep being added for every unref
you run.

I discussed this already some time ago with GLib maintainers, see:
https://gitlab.gnome.org/GNOME/glib/-/issues/2066

You should be able to solve this issue without further help or change
in libmm-glib or glib by switching your logic to use a main loop and
use the async APIs instead or in addition to the sync ones. This is
relatively easy to do, even if your program doesn't use a GLib
mainloop, because you could have all libmm-glib related stuff running
in a separate thread with its own mainloop inside the thread, and then
syncing whatever info you need via mutexes or something like that.

If you cannot change the whole logic to use a main loop in a separate
thread, you could still have a dummy main loop running some iterations
for some time; e.g. in your tester, instead of the usleep() call you
could put:

   {
        GMainLoop *loop;

        loop = g_main_loop_new (NULL, FALSE);
        g_idle_add ((GSourceFunc)stop_loop, loop);
        g_main_loop_run (loop);
        g_main_loop_unref (loop);
    }

with a separate stop_loop method like this:

static gboolean
stop_loop (GMainLoop *loop)
{
    g_main_loop_quit (loop);
    return G_SOURCE_REMOVE;
}

What we're doing is creating a main loop that will run the events in
the main context (the GDBus setup would have scheduled events in that
main context). And we also add our own idle source in the main
context, and given that idles are all scheduled one after the other,
our idle source would be scheduled LAST. And then, in the idle source
we stop the main loop, as there were no more idles scheduled
afterwards. It's a bit hacky, but should work when you need to
synchronize the idles setup by GDBus. Note that the main loop will run
exclusively for the time needed to cleanup the event list until your
idle, so that should be extremely quick.

After that change, I cannot see any major still reachable leak
happening any more, see the attached source file with my changes.

I should definitely document this solution in modemmanager.org...

-- 
Aleksander
https://aleksander.es
-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.c
Type: text/x-csrc
Size: 4826 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/modemmanager-devel/attachments/20210607/a100a4bd/attachment.c>