Multiple bags per HDF5 file

By default, every call to H5Bag.save(obj, filepath) creates (or overwrites) one file holding one bag. When you want to store a handful, or hundreds, of objects in the same file, opening, deleting, and rewriting the whole file on every save gets expensive fast.

This notebook demonstrates the interior-path form: any text after a recognised HDF5 file extension is treated as a group path inside the file. For example, analysis.h5/sim/run0 refers to the group /sim/run0 inside analysis.h5. Multiple bags can coexist in the same file, each with its own metadata, and overwriting one leaves its peers alone.

[13]:
import os
import tempfile

import h5py
import numpy as np

import bagofholding as boh

tmp = tempfile.TemporaryDirectory()
shared_file = os.path.join(tmp.name, "analysis.h5")

Saving several bags into one file

Each save call addresses a different interior group. The first one creates analysis.h5; the rest write into it.

[14]:
payloads = {
    "params": {"alpha": 0.1, "beta": 0.2, "steps": 1_000},
    "sim/run0": np.linspace(0.0, 1.0, 16),
    "sim/run1": np.linspace(0.0, 2.0, 16),
    "summary": {"mean": 0.42, "sd": 0.01},
}
for sub, obj in payloads.items():
    boh.H5Bag.save(obj, f"{shared_file}/{sub}")

for sub in payloads:
    reloaded = boh.H5Bag(f"{shared_file}/{sub}").load()
    print(f"{sub:>10}: {reloaded!r}")
    params: {'alpha': 0.1, 'beta': 0.2, 'steps': 1000}
  sim/run0: array([0.        , 0.06666667, 0.13333333, 0.2       , 0.26666667,
       0.33333333, 0.4       , 0.46666667, 0.53333333, 0.6       ,
       0.66666667, 0.73333333, 0.8       , 0.86666667, 0.93333333,
       1.        ])
  sim/run1: array([0.        , 0.13333333, 0.26666667, 0.4       , 0.53333333,
       0.66666667, 0.8       , 0.93333333, 1.06666667, 1.2       ,
       1.33333333, 1.46666667, 1.6       , 1.73333333, 1.86666667,
       2.        ])
   summary: {'mean': 0.42, 'sd': 0.01}
[15]:
boh.H5Bag(shared_file + "/summary").bag_info
[15]:
H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')

What does the file look like on disk?

Use plain h5py to peek under the hood. Each interior path corresponds to an HDF5 group; the bag’s BagInfo lives in that group’s attrs (so different bags in the same file don’t share metadata).

[16]:
with h5py.File(shared_file, "r") as f:
    print("Top-level groups:", list(f))
    print("sim groups:     ", list(f["sim"]))
    print("BagInfo on /summary:", dict(f["summary"].attrs))
    print("BagInfo on /sim/run0:", dict(f["sim/run0"].attrs))
Top-level groups: ['params', 'sim', 'summary']
sim groups:      ['run0', 'run1']
BagInfo on /summary: {'libver_str': 'latest', 'module': 'bagofholding.h5.bag', 'qualname': 'H5Bag', 'version': '0.1.11.dev0+g32d710783.d20260321'}
BagInfo on /sim/run0: {'libver_str': 'latest', 'module': 'bagofholding.h5.bag', 'qualname': 'H5Bag', 'version': '0.1.11.dev0+g32d710783.d20260321'}

Per-bag metadata is isolated

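Each interior bag carries its own version metadata, scraped from the object it stores. Below, the Polynomial bag records a package version, while the plain dict bag reports None.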

[17]:
poly = np.polynomial.Polynomial([1, 2, 3])
boh.H5Bag.save(poly, f"{shared_file}/extras/poly")
boh.H5Bag.save({"x": 1}, f"{shared_file}/extras/plain")

print("poly bag's object version :", boh.H5Bag(f"{shared_file}/extras/poly")["object"].version)
print("plain bag's object version:", boh.H5Bag(f"{shared_file}/extras/plain")["object"].version)
poly bag's object version : 2.4.3
plain bag's object version: None

The poly bag’s object version matches the installed numpy version:

[18]:
np.__version__
[18]:
'2.4.3'

Overwriting one bag leaves peers intact

Saving back into an existing interior path replaces just that group. Passing overwrite_existing=False raises a FileExistsError instead of clobbering the existing group.

[19]:
before_summary = boh.H5Bag(f"{shared_file}/summary").load()
boh.H5Bag.save(np.linspace(0.0, 3.0, 16), f"{shared_file}/sim/run0")
after_summary = boh.H5Bag(f"{shared_file}/summary").load()

print("sim/run0 replaced :", boh.H5Bag(f"{shared_file}/sim/run0").load())
print("summary unchanged :", before_summary == after_summary)

try:
    boh.H5Bag.save({"z": 0}, f"{shared_file}/sim/run0", overwrite_existing=False)
except FileExistsError as e:
    print("refused:", e)
sim/run0 replaced : [0.  0.2 0.4 0.6 0.8 1.  1.2 1.4 1.6 1.8 2.  2.2 2.4 2.6 2.8 3. ]
summary unchanged : True
refused: Group '/sim/run0' already exists in /tmp/tmpjuulq69k/analysis.h5.
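
To make the "replace one group, leave the rest" idea concrete, here is a minimal sketch using plain h5py on an in-memory file. It is an assumption about how such a replacement can be done at the HDF5 level, not a description of what H5Bag does internally; the dataset name "data" and the file name "sketch.h5" are made up for illustration.

```python
import h5py
import numpy as np

# In-memory HDF5 file: the "core" driver with backing_store=False
# means nothing is written to disk.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    # Two sibling groups, each holding one (hypothetically named) dataset.
    f.create_dataset("sim/run0/data", data=np.linspace(0.0, 1.0, 4))
    f.create_dataset("summary/data", data=np.array([0.42, 0.01]))

    # Replace only /sim/run0: unlink the old group, then write a new one.
    del f["sim/run0"]
    f.create_dataset("sim/run0/data", data=np.linspace(0.0, 3.0, 4))

    # Read both back while the file is still open.
    replaced = f["sim/run0/data"][()]
    untouched = f["summary/data"][()]

print("replaced :", replaced)
print("untouched:", untouched)
```

Only the link to /sim/run0 is removed and recreated; /summary is never touched, which is the behaviour the cell above relies on.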

Debug: how is the path parsed?

The filesystem path is split from the interior group path at the first ancestor whose extension is in H5Bag.file_extensions (or which already exists as a file). The mixin exposes the properties h5_file_path, h5_group_path, and is_subpath so you can inspect what got picked up.

[20]:
for path in [
    shared_file,                            # top-level (no interior path)
    f"{shared_file}/sim/run0",              # explicit interior path
    f"{shared_file}/missing/group",         # interior path that doesn't exist yet
    "no_extension_yet",                     # neither extension nor existing file
]:
    bag = boh.H5Bag(path)
    print(f"{path!r:>50} -> file={bag.h5_file_path.name!r:>18}",
          f"group={bag.h5_group_path!r:>14}",
          f"is_subpath={bag.is_subpath}")
                    '/tmp/tmpjuulq69k/analysis.h5' -> file=     'analysis.h5' group=           '/' is_subpath=False
           '/tmp/tmpjuulq69k/analysis.h5/sim/run0' -> file=     'analysis.h5' group=   '/sim/run0' is_subpath=True
      '/tmp/tmpjuulq69k/analysis.h5/missing/group' -> file=     'analysis.h5' group='/missing/group' is_subpath=True
                                'no_extension_yet' -> file='no_extension_yet' group=           '/' is_subpath=False
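
The extension-based part of that splitting rule can be sketched in pure Python. This is an illustrative approximation only: split_h5_path is a hypothetical helper, not part of bagofholding, and it ignores the "already exists as a file" case that the real parser also handles.

```python
from pathlib import PurePosixPath


def split_h5_path(path, file_extensions=(".h5", ".hdf5")):
    """Split 'dir/file.h5/a/b' into ('dir/file.h5', '/a/b').

    Hypothetical helper: the split happens at the first path component
    whose suffix is one of file_extensions.
    """
    parts = PurePosixPath(path).parts
    for i, part in enumerate(parts):
        if PurePosixPath(part).suffix in file_extensions:
            file_path = str(PurePosixPath(*parts[: i + 1]))
            # Everything after the file component is the interior group path.
            interior = "/" + "/".join(parts[i + 1:])
            return file_path, interior
    # No recognised extension: treat the whole path as the file, group is root.
    return str(PurePosixPath(path)), "/"


print(split_h5_path("/tmp/x/analysis.h5/sim/run0"))  # -> ('/tmp/x/analysis.h5', '/sim/run0')
print(split_h5_path("analysis.h5"))                  # -> ('analysis.h5', '/')
print(split_h5_path("no_extension_yet"))             # -> ('no_extension_yet', '/')
```

Passing a different tuple, for example split_h5_path("store.bag/users/alice", (".bag",)), moves the split point to that suffix instead.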

Reading from a missing interior group raises a KeyError so you don’t silently get an empty bag:

[21]:
try:
    boh.H5Bag(f"{shared_file}/sim/does_not_exist").load()
except KeyError as e:
    print("missing group:", e)
missing group: "Group '/sim/does_not_exist' not found in /tmp/tmpjuulq69k/analysis.h5"

Debug: custom file extensions

If you keep your HDF5 data under a different suffix, subclass and override file_extensions. Path parsing then splits on that extension instead of .h5/.hdf5.

[22]:
class BagFile(boh.H5Bag):
    file_extensions = (".bag",)

bagfile = os.path.join(tmp.name, "store.bag")
BagFile.save({"who": "alice"}, f"{bagfile}/users/alice")
BagFile.save({"who": "bob"},   f"{bagfile}/users/bob")
print("alice:", BagFile(f"{bagfile}/users/alice").load())
print("bob:  ", BagFile(f"{bagfile}/users/bob").load())
alice: {'who': 'alice'}
bob:   {'who': 'bob'}

TrieH5Bag works the same way

TrieH5Bag has the same interior-path semantics, so several of its bags can share one file too. This is useful if you want to mix bag implementations across groups for size-vs-speed trade-offs.

[23]:
trie_file = os.path.join(tmp.name, "trie.h5")
boh.TrieH5Bag.save({"k": "v1"}, f"{trie_file}/first")
boh.TrieH5Bag.save({"k": "v2"}, f"{trie_file}/second")
print("first :", boh.TrieH5Bag(f"{trie_file}/first").load())
print("second:", boh.TrieH5Bag(f"{trie_file}/second").load())
first : {'k': 'v1'}
second: {'k': 'v2'}

Cleanup

[24]:
tmp.cleanup()