Skip to content

epub: free libxml2 strings with xmlFree instead of g_free#720

Open
maxheise wants to merge 1 commit into
linuxmint:masterfrom
maxheise:epub-xmlfree-allocator
Open

epub: free libxml2 strings with xmlFree instead of g_free#720
maxheise wants to merge 1 commit into
linuxmint:masterfrom
maxheise:epub-xmlfree-allocator

Conversation

@maxheise

@maxheise maxheise commented Jun 7, 2026

Copy link
Copy Markdown

Hello,

backend/epub/epub-document.c reads many strings out of the parsed
container.xml and OPF document through the helper xml_get_data_from_node(),
which returns an xmlChar* allocated by libxml2 (internally xmlGetProp or
xmlNodeListGetString). A libxml2 string should be released with xmlFree,
since libxml2 allocates it through its own allocator. Most of these
results were released with GLib's g_free instead; a few were stored into
a struct that frees them with g_free, and two were leaked. Mixing the
allocators happens to work while libxml2 and GLib share the system
malloc, but it is not guaranteed (it breaks with a custom libxml2
allocator, or where the two libraries do not share one heap). The file
already xmlFree's some of these results in a couple of places, so the
handling is currently inconsistent.

The change makes every libxml2 string use the matching allocator, in one
of three shapes depending on who owns it:

  1. Where the function owns the free and the freed pointer is the libxml2
    string, g_free becomes xmlFree (relativepath in get_uri_to_content and
    setup_document_content_list, the newnode->key error paths and
    free_tree_nodes, linktext in free_link_nodes, the href in
    get_child_list, class in change_to_night_sheet, tocfilename in
    setup_document_index, the stylesheet name in
    epub_document_check_add_night_sheet, and mathml/href in
    epub_document_add_mathJax). The neighbouring g_free calls that release
    g_filename_to_uri / g_strdup results are left as-is.
  2. Two libxml2 strings that were copied and then leaked are now freed:
    the toc attribute in get_toc_file_name, and the version attribute in
    epub_document_get_info (copied by g_string_new).
  3. Where a libxml2 string is stored into a struct that frees it with
    g_free and cannot be changed, it is copied with g_strdup and the
    original xmlFree'd: the EvDocumentInfo title/author/subject/creator
    fields (freed by ev_document_info_free with g_free) and docTitle
    (freed with g_free at finalize). The NULL / "unknown" fallback
    branches already used g_strdup and are unchanged.

For valid documents the visible result is unchanged; only the allocator
used for each string is corrected, and the two previously leaked strings
are now released.

Thanks for taking a look.

Best regards
Max

backend/epub/epub-document.c reads many strings out of the parsed
container.xml and OPF package document through the helper
xml_get_data_from_node(), which returns an xmlChar* allocated by
libxml2 (it calls xmlGetProp or xmlNodeListGetString internally). A
string obtained from libxml2 must be released with xmlFree, because
libxml2 allocates it through its own allocator (xmlMalloc). The code
released most of these strings with GLib's g_free, and in a few places
stored a libxml2 string into a struct field that is later freed with
g_free, or never freed it at all. Mixing the two allocators is wrong:
it only happens to work while libxml2 and GLib share the system
malloc, and breaks if libxml2 is built with a custom allocator
(xmlMemSetup) or where the two libraries do not share one heap. A few
sites also leaked the libxml2 string outright.

This change makes every libxml2 string in the file use the matching
allocator, using one of three shapes depending on who owns the string:

1. Where the code owns the free site and the freed pointer is the
   libxml2 string itself, switch g_free to xmlFree. This covers the
   relativepath in get_uri_to_content() and setup_document_content_list()
   (the OPF full-path and item href), the newnode->key error paths and
   free_tree_nodes() (the spine idref), free_link_nodes() linktext, the
   href in get_child_list(), the class attribute in
   change_to_night_sheet(), the tocfilename in setup_document_index()
   (returned by get_toc_file_name()/epub_document_get_nav_file()), the
   stylesheet name in epub_document_check_add_night_sheet() (returned by
   epub_document_get_alternate_stylesheet()), and mathml plus href in
   epub_document_add_mathJax(). The companion g_free calls in those
   functions that release g_filename_to_uri / g_strdup / g_strdup_printf
   results (for example dataptr->value, dataptr->pagelink, the filepath
   and filename locals) are GLib strings and are left as g_free.

2. Where a libxml2 string was copied and then leaked, free it. The
   "toc" attribute ncx in get_toc_file_name() is used only to look up
   the manifest item and is now xmlFree'd after that lookup. The
   "version" attribute in epub_document_get_info() is copied by
   g_string_new() and is now captured in a local, passed to
   g_string_new(), then xmlFree'd.

3. Where a libxml2 string is stored into a struct whose destructor
   frees it with g_free and cannot be changed, store a GLib copy and
   free the libxml2 original. This applies to the EvDocumentInfo fields
   title, author, subject and creator (ev_document_info_free() frees
   them with g_free), and to EpubDocument.docTitle (freed with g_free
   in the finalize handler), which is assigned both in
   setup_document_index() and through epub_document_set_document_title().
   Each of these now does epubinfo->field = g_strdup((const char*)s)
   followed by xmlFree(s). The NULL / "unknown" fallback branches were
   already g_strdup'd and are unchanged. docTitle keeps its existing
   g_free at finalize, which is now correct because it always holds a
   g_strdup'd copy. g_strdup(NULL) returns NULL and xmlFree(NULL) is a
   no-op, so the empty-metadata cases behave exactly as before; the
   docTitle while-loop in setup_document_index() also keeps its
   original termination behaviour.

For valid documents the visible result is unchanged; only the allocator
used to free each string is corrected, and the previously leaked toc and
version strings are now released.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant