Multiple bags per HDF5 file
By default, every call to H5Bag.save(obj, filepath) creates (or overwrites) one file holding one bag. If you want to keep anywhere from a handful to hundreds of objects in the same file, that cycle of opening, deleting, and rewriting the whole file quickly becomes expensive.
This notebook demos the interior-path form: any text after a recognised HDF5 file extension is treated as a group path inside the file. So analysis.h5/sim/run0 refers to the group /sim/run0 inside analysis.h5. Multiple bags can coexist in the same file, each with its own metadata, and overwriting one leaves its peers alone.
[13]:
import contextlib
import os
import tempfile
import h5py
import numpy as np
import bagofholding as boh
tmp = tempfile.TemporaryDirectory()
shared_file = os.path.join(tmp.name, "analysis.h5")
Saving several bags into one file
Each save call addresses a different interior group. The first one creates analysis.h5; the rest write into it.
[14]:
payloads = {
    "params": {"alpha": 0.1, "beta": 0.2, "steps": 1_000},
    "sim/run0": np.linspace(0.0, 1.0, 16),
    "sim/run1": np.linspace(0.0, 2.0, 16),
    "summary": {"mean": 0.42, "sd": 0.01},
}
for sub, obj in payloads.items():
    boh.H5Bag.save(obj, f"{shared_file}/{sub}")
for sub in payloads:
    reloaded = boh.H5Bag(f"{shared_file}/{sub}").load()
    print(f"{sub:>10}: {reloaded!r}")
    params: {'alpha': 0.1, 'beta': 0.2, 'steps': 1000}
  sim/run0: array([0.        , 0.06666667, 0.13333333, 0.2       , 0.26666667,
       0.33333333, 0.4       , 0.46666667, 0.53333333, 0.6       ,
       0.66666667, 0.73333333, 0.8       , 0.86666667, 0.93333333,
       1.        ])
  sim/run1: array([0.        , 0.13333333, 0.26666667, 0.4       , 0.53333333,
       0.66666667, 0.8       , 0.93333333, 1.06666667, 1.2       ,
       1.33333333, 1.46666667, 1.6       , 1.73333333, 1.86666667,
       2.        ])
   summary: {'mean': 0.42, 'sd': 0.01}
[15]:
boh.H5Bag(shared_file + "/summary").bag_info
[15]:
H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')
What does the file look like on disk?
Use plain h5py to peek under the hood. Each interior path corresponds to an HDF5 group; the bag’s BagInfo lives in that group’s attrs (so different bags in the same file don’t share metadata).
[16]:
with h5py.File(shared_file, "r") as f:
    print("Top-level groups:", list(f))
    print("sim groups: ", list(f["sim"]))
    print("BagInfo on /summary:", dict(f["summary"].attrs))
    print("BagInfo on /sim/run0:", dict(f["sim/run0"].attrs))
Top-level groups: ['params', 'sim', 'summary']
sim groups: ['run0', 'run1']
BagInfo on /summary: {'libver_str': 'latest', 'module': 'bagofholding.h5.bag', 'qualname': 'H5Bag', 'version': '0.1.11.dev0+g32d710783.d20260321'}
BagInfo on /sim/run0: {'libver_str': 'latest', 'module': 'bagofholding.h5.bag', 'qualname': 'H5Bag', 'version': '0.1.11.dev0+g32d710783.d20260321'}
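Because each bag's metadata sits in its own group's attrs, you can find the bags in a file by walking the groups and checking for one of those attributes. The sketch below is a minimal illustration using plain h5py, not a bagofholding API: find_bag_groups is a hypothetical helper, and the demo file is written by hand with a qualname attribute that only mimics the attrs printed above.

```python
import os
import tempfile

import h5py


def find_bag_groups(file_path, marker_attr="qualname"):
    """Hypothetical helper: map group path -> attrs for groups carrying marker_attr."""
    found = {}

    def visit(name, obj):
        # visititems passes names relative to the root, without a leading slash.
        if isinstance(obj, h5py.Group) and marker_attr in obj.attrs:
            found["/" + name] = dict(obj.attrs)

    with h5py.File(file_path, "r") as f:
        f.visititems(visit)
    return found


with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "demo.h5")
    # Hand-written stand-in for a multi-bag file (assumption: not produced by H5Bag).
    with h5py.File(demo_file, "w") as f:
        for group_path in ("params", "sim/run0", "sim/run1"):
            group = f.create_group(group_path)  # intermediate groups are created too
            group.attrs["qualname"] = "H5Bag"
    bag_groups = find_bag_groups(demo_file)

print(sorted(bag_groups))
```

The intermediate /sim group carries no marker attribute, so it is skipped. Note that visititems does not visit the root group itself, so a bag saved at the top level of a file would need a separate check on f.attrs.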
Per-bag metadata is isolated
Each interior bag carries its own scraped version metadata.
[17]:
poly = np.polynomial.Polynomial([1, 2, 3])
boh.H5Bag.save(poly, f"{shared_file}/extras/poly")
boh.H5Bag.save({"x": 1}, f"{shared_file}/extras/plain")
print("poly bag's object version :", boh.H5Bag(f"{shared_file}/extras/poly")["object"].version)
print("plain bag's object version:", boh.H5Bag(f"{shared_file}/extras/plain")["object"].version)
poly bag's object version : 2.4.3
plain bag's object version: None
The poly bag’s object version matches the installed numpy version:
[18]:
np.__version__
[18]:
'2.4.3'
Overwriting one bag leaves peers intact
Saving back into an existing interior path replaces just that group. Passing overwrite_existing=False refuses instead of clobbering.
[19]:
before_summary = boh.H5Bag(f"{shared_file}/summary").load()
boh.H5Bag.save(np.linspace(0.0, 3.0, 16), f"{shared_file}/sim/run0")
after_summary = boh.H5Bag(f"{shared_file}/summary").load()
print("sim/run0 replaced :", boh.H5Bag(f"{shared_file}/sim/run0").load())
print("summary unchanged :", before_summary == after_summary)
try:
    boh.H5Bag.save({"z": 0}, f"{shared_file}/sim/run0", overwrite_existing=False)
except FileExistsError as e:
    print("refused:", e)
sim/run0 replaced : [0. 0.2 0.4 0.6 0.8 1. 1.2 1.4 1.6 1.8 2. 2.2 2.4 2.6 2.8 3. ]
summary unchanged : True
refused: Group '/sim/run0' already exists in /tmp/tmpjuulq69k/analysis.h5.
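For intuition, here is how "replace just that group" can be done with plain h5py. This is a sketch under assumptions, not bagofholding's actual implementation: replace_group is a hypothetical helper, and the "values" dataset name is made up for the demo. The file is opened in append mode, only the target group is unlinked, and sibling groups are never touched.

```python
import os
import tempfile

import h5py


def replace_group(file_path, group_path, values, overwrite_existing=True):
    """Hypothetical helper: (re)write one group, leaving its siblings alone."""
    with h5py.File(file_path, "a") as f:  # "a" creates the file if it is missing
        if group_path in f:
            if not overwrite_existing:
                raise FileExistsError(
                    f"Group {group_path!r} already exists in {file_path}."
                )
            del f[group_path]  # unlinks only this group and its children
        group = f.create_group(group_path)
        group.create_dataset("values", data=values)


with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "demo.h5")
    replace_group(demo_file, "/sim/run0", [0, 1, 2])
    replace_group(demo_file, "/summary", [0.42, 0.01])
    replace_group(demo_file, "/sim/run0", [0, 1, 2, 3, 4])  # overwrite one group

    with h5py.File(demo_file, "r") as f:
        run0 = f["/sim/run0/values"][()].tolist()
        summary = f["/summary/values"][()].tolist()

    try:
        replace_group(demo_file, "/sim/run0", [9], overwrite_existing=False)
        refused = False
    except FileExistsError:
        refused = True

print(run0, summary, refused)
```

One caveat of this approach: deleting a group in HDF5 unlinks it, but the file does not necessarily shrink until it is repacked (for example with the h5repack tool), so heavy overwriting can leave unused space behind.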
Debug: how is the path parsed?
The split between the filesystem path and the interior group path happens at the first ancestor whose extension is in H5Bag.file_extensions (or which already exists as a file). The mixin exposes properties so you can inspect what got picked up.
[20]:
for path in [
    shared_file,                      # top-level (no interior path)
    f"{shared_file}/sim/run0",        # explicit interior path
    f"{shared_file}/missing/group",   # interior path that doesn't exist yet
    "no_extension_yet",               # neither extension nor existing file
]:
    bag = boh.H5Bag(path)
    print(f"{path!r:>50} -> file={bag.h5_file_path.name!r:>18}",
          f"group={bag.h5_group_path!r:>14}",
          f"is_subpath={bag.is_subpath}")
'/tmp/tmpjuulq69k/analysis.h5' -> file= 'analysis.h5' group= '/' is_subpath=False
'/tmp/tmpjuulq69k/analysis.h5/sim/run0' -> file= 'analysis.h5' group= '/sim/run0' is_subpath=True
'/tmp/tmpjuulq69k/analysis.h5/missing/group' -> file= 'analysis.h5' group='/missing/group' is_subpath=True
'no_extension_yet' -> file='no_extension_yet' group= '/' is_subpath=False
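The extension-based half of that splitting rule can be approximated in a few lines of standard-library Python. This is only a sketch of the idea, not the library's parser: split_interior_path is a hypothetical helper, it assumes POSIX-style separators, and it deliberately skips the "already exists as a file" check described above.

```python
from pathlib import PurePosixPath


def split_interior_path(path, extensions=(".h5", ".hdf5")):
    """Hypothetical helper: split at the first component with a recognised suffix."""
    parts = PurePosixPath(path).parts
    for i, part in enumerate(parts):
        if PurePosixPath(part).suffix in extensions:
            file_path = str(PurePosixPath(*parts[: i + 1]))
            group_path = "/" + "/".join(parts[i + 1:])
            return file_path, group_path
    # No recognised extension anywhere: treat the whole path as the file.
    return str(PurePosixPath(path)), "/"


interior = split_interior_path("/tmp/demo/analysis.h5/sim/run0")
top_level = split_interior_path("/tmp/demo/analysis.h5")
no_ext = split_interior_path("no_extension_yet")
custom = split_interior_path("/tmp/demo/store.bag/users/alice", extensions=(".bag",))

for result in (interior, top_level, no_ext, custom):
    print(result)
```

Splitting at the first matching component means a later component that happens to end in a recognised extension is treated as part of the group path.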
Reading from a missing interior group raises a KeyError so you don’t silently get an empty bag:
[21]:
try:
    boh.H5Bag(f"{shared_file}/sim/does_not_exist").load()
except KeyError as e:
    print("missing group:", e)
missing group: "Group '/sim/does_not_exist' not found in /tmp/tmpjuulq69k/analysis.h5"
Debug: custom file extensions
If you keep your HDF5 data under a different suffix, subclass and override file_extensions. Path parsing then splits on that extension instead of .h5/.hdf5.
[22]:
class BagFile(boh.H5Bag):
    file_extensions = (".bag",)

bagfile = os.path.join(tmp.name, "store.bag")
BagFile.save({"who": "alice"}, f"{bagfile}/users/alice")
BagFile.save({"who": "bob"}, f"{bagfile}/users/bob")
print("alice:", BagFile(f"{bagfile}/users/alice").load())
print("bob: ", BagFile(f"{bagfile}/users/bob").load())
alice: {'who': 'alice'}
bob: {'who': 'bob'}
TrieH5Bag works the same way
Same interior-path semantics, same file. Useful if you want to mix bag implementations across groups for size-vs-speed trade-offs.
[23]:
trie_file = os.path.join(tmp.name, "trie.h5")
boh.TrieH5Bag.save({"k": "v1"}, f"{trie_file}/first")
boh.TrieH5Bag.save({"k": "v2"}, f"{trie_file}/second")
print("first :", boh.TrieH5Bag(f"{trie_file}/first").load())
print("second:", boh.TrieH5Bag(f"{trie_file}/second").load())
first : {'k': 'v1'}
second: {'k': 'v2'}
Cleanup
[24]:
tmp.cleanup()