Conversation


@RasmusOrsoe commented Dec 9, 2025

This PR extends the list of supported backends to include lmdb, thereby addressing #834 and closing #820.

The main benefits of LMDB are threefold: It requires roughly half the space of SQLite, it has significantly faster random access for larger events than SQLite, and it provides a generic way of pre-computing data representations.

The downsides are subjective: there is no SQL syntax, and accessing large subsets of the dataset in one go is slow.

Major Changes

  • Adds LMDBWriter:
    The writer outputs .lmdb databases, where entries are key-value pairs. Keys are created from the index column (similar to the primary key in sqlite), and the value associated with each entry is all extracted data for the given event. The values are serialized, and several common serialization methods (json, pickle, etc.) are supported.
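
    For illustration, here is a minimal sketch of the storage scheme described above, written against the lmdb and pickle packages directly. This is not the LMDBWriter implementation; the event payload and file name are made up:

    import lmdb
    import pickle

    # Illustrative event payload: one dict per event, one sub-dict per table.
    event = {
        "truth": {"event_no": 1, "energy": 42.0},
        "pulsemap": {"dom_x": [0.1, 0.2], "dom_y": [0.3, 0.4]},
    }

    env = lmdb.open("example.lmdb", map_size=2**30)
    with env.begin(write=True) as txn:
        # The key is the index-column value (here event_no); the value is
        # the serialized blob with all extracted data for that event.
        txn.put(b"1", pickle.dumps(event))
    env.close()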

    To make the files self-contained, the databases contain a __meta__ entry with information on the serialization method used, and utility functions are added that will identify the correct method and use it for deserialization on queries. As a result, the user doesn't need to know the serialization method in order to read the files. Below is an example of a query:

    from graphnet.data.utilities.lmdb_utilities import query_database
    
    lmdb_path = "~/merged/merged.lmdb"
    event_no = 1
    
    # The query function automatically detects the serializer used and will deserialize the blob
    # Result is a dict with all table entries for a single event
    # Every entry in the dict is a `table`. E.g. result["truth"] 
    
    result = query_database(lmdb_path, event_no)

    I profiled query speeds versus event size in a typical data-loading scenario and found the following relationship:

    [Figure: query speed vs. event size for lmdb and sqlite]

    The query includes both the event-level truth and the (deserialized) pulsemap, and is repeated 100 times per event. Real-time computation of representations and the overhead of establishing connections are not included. The figure shows that for large events, lmdb offers a significant speed-up.

    Additionally, the LMDBWriter accepts a list of DataRepresentations; if provided, the representations are calculated and stored in the file alongside the other extracted data. Another meta field, containing the config files of the representations, is written to the files, allowing users to re-instantiate the data representation modules used to compute the representations. A utility function for retrieving these is added. As such, this PR also closes #781 (Graph construction before training). An example of retrieving a data representation from the metadata can be seen below:

    from graphnet.data.utilities.lmdb_utilities import query_database, get_data_representation_from_metadata
    
    event_no = 0
    lmdb_path = "~/merged/merged.lmdb"
    # List the available representations
    query_database(database=lmdb_path, index=event_no)["data_representations"].keys()
    
    # Returns e.g. dict_keys(['KNNGraph', 'GraphDefinition'])
    
    # Get the data representation for 'KNNGraph' from the metadata
    data_representation = get_data_representation_from_metadata(lmdb_path, "KNNGraph")

    It is assumed that the data representation used is part of the user's graphnet installation - i.e. exotic representations that are not yet part of the library will fail. There is no robust way around this.

  • Adds LMDBDataset
    The dataset is compatible with the .lmdb files and is largely identical to the existing SQLiteDataset. It supports str-selections and has a "pre-computed" mode, where the user may choose to query pre-computed data representations instead of calculating them in real time. A sketch of possible usage is shown below.
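
    This is a hypothetical usage sketch: the argument names mirror the existing SQLiteDataset, and the import path and the pre-computed-mode flag are assumptions rather than the actual API:

    from graphnet.data.dataset import LMDBDataset  # import path assumed

    dataset = LMDBDataset(
        path="~/merged/merged.lmdb",
        pulsemaps="pulsemap",                  # table holding the pulses
        features=["dom_x", "dom_y", "dom_z"],
        truth=["energy"],
        selection="event_no < 1000",           # str-selections are supported
        use_precomputed=True,                  # hypothetical flag: query the stored
                                               # representations instead of computing them
    )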

  • Adds SQLiteToLMDBConverter
    A pre-configured converter that converts existing sqlite databases to lmdb format, similar to our ParquetToSQLiteConverter. This converter also accepts a list of data representations, allowing you to export to lmdb alongside pre-computed representations.
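
    A hypothetical sketch of the conversion; the import path, arguments, and call signature are assumptions modeled on the existing pre-configured converters:

    from graphnet.data.pre_configured import SQLiteToLMDBConverter  # path assumed

    converter = SQLiteToLMDBConverter(
        # Assumed argument: pre-compute representations during conversion.
        # `my_knn_graph` stands in for a previously constructed DataRepresentation.
        data_representations=[my_knn_graph],
    )
    converter("~/databases/my_database.db")  # call signature assumed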

Minor Changes

  • Expanded the test suite to include the lmdb backend for unit testing
  • Replaced deprecated converters in the relevant test suite
  • Icetray conversion example adjusted to include the lmdb backend (which is now the default format in the example)
  • Added utility functions (graphnet.data.utilities.lmdb_utilities) for querying events, etc.
  • Minor updates to the documentation, kept small to avoid bloating the PR further.
  • Updated the GraphNeTDataModule to support the lmdb backend

Tagging @astrojarred @giogiopg @Aske-Rosted @sevmag and @pweigel as we've all discussed various aspects of this in the past.

@RasmusOrsoe RasmusOrsoe changed the title Lmdb pr Add lmdb as alternative file format Dec 10, 2025
@RasmusOrsoe RasmusOrsoe requested a review from giogiopg December 10, 2025 10:32

@Aske-Rosted left a comment


I found two small things, one of which might not be specific to this PR and could be filed as an issue/improvement for the future, since it also occurs for the parquet file format.

With a contribution this large it is hard for me to determine, just by reading the code, whether all the logic in e.g. the new writer is correct, but it seems to me that the tests cover the new functionality.

loss_weight_column: Optional[str] = None,
loss_weight_default_value: Optional[float] = None,
seed: Optional[int] = None,
labels: Optional[Dict[str, Any]] = None,

Can we not do something smarter with kwargs, so that we avoid duplicating code and also don't have to worry about forgetting to add new arguments to the LMDB class when new functionality is added to the dataset class?
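
A minimal sketch of the kind of kwargs forwarding being suggested; the class names are illustrative stand-ins, not the actual graphnet classes:

from typing import Any, Optional


class Dataset:
    """Stand-in for the existing base dataset class."""

    def __init__(
        self,
        seed: Optional[int] = None,
        labels: Optional[dict] = None,
    ) -> None:
        self._seed = seed
        self._labels = labels


class LMDBDataset(Dataset):
    """Declares only LMDB-specific arguments; everything else is forwarded."""

    def __init__(self, lmdb_path: str, **kwargs: Any) -> None:
        # New base-class arguments are picked up automatically via **kwargs,
        # so they never need to be re-declared here.
        super().__init__(**kwargs)
        self._lmdb_path = lmdb_path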


I think there might be a point to repeating some args and code, so that non-expert users don't have to dig into the native functions where all the args are coming from. Just my opinion.

try:
    label = fn(data)
except KeyError:
    raise KeyError(f"Key {key} not found in data.")

@Aske-Rosted commented Jan 23, 2026


I don't believe this KeyError is correct: what is going wrong is that the function does not find the expected values in the data from which to create the label, but the key returned in the error is the name of the label that the function was supposed to create.
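
A possible fix, sketched below: chain the underlying exception so the error names both the label and the field that was actually missing:

try:
    label = fn(data)
except KeyError as e:
    # `e` names the field missing from `data`; `key` is the name of
    # the label that was supposed to be created.
    raise KeyError(
        f"Could not compute label '{key}': missing field {e} in data."
    ) from e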


@giogiopg left a comment


This PR is impressive. I would only suggest adding, either to this PR or a new one, an example in the examples folder (or modifying examples/01_icetray/01_convert_i3_files.py) that includes a conversion to lmdb with a precomputed graph representation that can later be called. I believe the documentation examples in docs/source/datasets/datasets.rst should be enough to show how to load precomputed data representations, so there may be no need for an example of training from an lmdb file, but I would still go for the data-conversion example that includes a precomputed data representation.
