{ "cells": [ { "cell_type": "markdown", "id": "intro-md", "metadata": {}, "source": [ "# Multiple bags per HDF5 file\n", "\n", "By default, every call to `H5Bag.save(obj, filepath)` creates (or overwrites) one file holding one bag. When you want to put a handful or hundreds of objects into the same file, that round-trip of opening, deleting, and rewriting the whole file gets expensive fast.\n", "\n", "This notebook demos the *interior-path* form: any text after a recognised HDF5 file extension is treated as a group path *inside* the file. So `analysis.h5/sim/run0` refers to the group `/sim/run0` inside `analysis.h5`. Multiple bags can coexist in the same file, each with its own metadata, and overwriting one leaves its peers alone." ] }, { "cell_type": "code", "execution_count": 13, "id": "setup", "metadata": {}, "outputs": [], "source": [ "import contextlib\n", "import os\n", "import tempfile\n", "\n", "import h5py\n", "import numpy as np\n", "\n", "import bagofholding as boh\n", "\n", "tmp = tempfile.TemporaryDirectory()\n", "shared_file = os.path.join(tmp.name, \"analysis.h5\")" ] }, { "cell_type": "markdown", "id": "save-many-md", "metadata": {}, "source": [ "## Saving several bags into one file\n", "\n", "Each save call addresses a different interior group. The first one creates `analysis.h5`; the rest write into it." ] }, { "cell_type": "code", "execution_count": 14, "id": "save-many", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " params: {'alpha': 0.1, 'beta': 0.2, 'steps': 1000}\n", " sim/run0: array([0. , 0.06666667, 0.13333333, 0.2 , 0.26666667,\n", " 0.33333333, 0.4 , 0.46666667, 0.53333333, 0.6 ,\n", " 0.66666667, 0.73333333, 0.8 , 0.86666667, 0.93333333,\n", " 1. ])\n", " sim/run1: array([0. , 0.13333333, 0.26666667, 0.4 , 0.53333333,\n", " 0.66666667, 0.8 , 0.93333333, 1.06666667, 1.2 ,\n", " 1.33333333, 1.46666667, 1.6 , 1.73333333, 1.86666667,\n", " 2. 
])\n", " summary: {'mean': 0.42, 'sd': 0.01}\n" ] } ], "source": [ "payloads = {\n", " \"params\": {\"alpha\": 0.1, \"beta\": 0.2, \"steps\": 1_000},\n", " \"sim/run0\": np.linspace(0.0, 1.0, 16),\n", " \"sim/run1\": np.linspace(0.0, 2.0, 16),\n", " \"summary\": {\"mean\": 0.42, \"sd\": 0.01},\n", "}\n", "for sub, obj in payloads.items():\n", " boh.H5Bag.save(obj, f\"{shared_file}/{sub}\")\n", "\n", "for sub in payloads:\n", " reloaded = boh.H5Bag(f\"{shared_file}/{sub}\").load()\n", " print(f\"{sub:>10}: {reloaded!r}\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "16c5b74b-3936-41f0-a98a-e5b9a259ee32", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boh.H5Bag(shared_file + \"/summary\").bag_info" ] }, { "cell_type": "markdown", "id": "raw-h5-md", "metadata": {}, "source": [ "## What does the file look like on disk?\n", "\n", "Use plain `h5py` to peek under the hood. Each interior path corresponds to an HDF5 group; the bag's `BagInfo` lives in that group's attrs (so different bags in the same file don't share metadata)." 
] }, { "cell_type": "code", "execution_count": 16, "id": "raw-h5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top-level groups: ['params', 'sim', 'summary']\n", "sim groups: ['run0', 'run1']\n", "BagInfo on /summary: {'libver_str': 'latest', 'module': 'bagofholding.h5.bag', 'qualname': 'H5Bag', 'version': '0.1.11.dev0+g32d710783.d20260321'}\n", "BagInfo on /sim/run0: {'libver_str': 'latest', 'module': 'bagofholding.h5.bag', 'qualname': 'H5Bag', 'version': '0.1.11.dev0+g32d710783.d20260321'}\n" ] } ], "source": [ "with h5py.File(shared_file, \"r\") as f:\n", " print(\"Top-level groups:\", list(f))\n", " print(\"sim groups: \", list(f[\"sim\"]))\n", " print(\"BagInfo on /summary:\", dict(f[\"summary\"].attrs))\n", " print(\"BagInfo on /sim/run0:\", dict(f[\"sim/run0\"].attrs))" ] }, { "cell_type": "markdown", "id": "metadata-md", "metadata": {}, "source": [ "## Per-bag metadata is isolated\n", "\n", "Each interior bag carries its own version metadata, scraped from the object it stores. Below, the bag holding a NumPy polynomial records a version, while the bag holding a plain `dict` records `None`." 
] }, { "cell_type": "code", "execution_count": 17, "id": "metadata", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "poly bag's object version : 2.4.3\n", "plain bag's object version: None\n" ] } ], "source": [ "poly = np.polynomial.Polynomial([1, 2, 3])\n", "boh.H5Bag.save(poly, f\"{shared_file}/extras/poly\")\n", "boh.H5Bag.save({\"x\": 1}, f\"{shared_file}/extras/plain\")\n", "\n", "print(\"poly bag's object version :\", boh.H5Bag(f\"{shared_file}/extras/poly\")[\"object\"].version)\n", "print(\"plain bag's object version:\", boh.H5Bag(f\"{shared_file}/extras/plain\")[\"object\"].version)" ] }, { "cell_type": "markdown", "id": "40e81b43-53c7-4463-be36-e60be49fb50d", "metadata": {}, "source": [ "The poly bag's object version matches the installed NumPy version:" ] }, { "cell_type": "code", "execution_count": 18, "id": "15637818-1419-42f5-9297-eea289ed4fd9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2.4.3'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.__version__" ] }, { "cell_type": "markdown", "id": "overwrite-md", "metadata": {}, "source": [ "## Overwriting one bag leaves peers intact\n", "\n", "Saving back into an existing interior path replaces just that group. Passing `overwrite_existing=False` makes the save raise `FileExistsError` instead of clobbering the existing group." ] }, { "cell_type": "code", "execution_count": 19, "id": "overwrite", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sim/run0 replaced : [0. 0.2 0.4 0.6 0.8 1. 1.2 1.4 1.6 1.8 2. 2.2 2.4 2.6 2.8 3. 
]\n", "summary unchanged : True\n", "refused: Group '/sim/run0' already exists in /tmp/tmpjuulq69k/analysis.h5.\n" ] } ], "source": [ "before_summary = boh.H5Bag(f\"{shared_file}/summary\").load()\n", "boh.H5Bag.save(np.linspace(0.0, 3.0, 16), f\"{shared_file}/sim/run0\")\n", "after_summary = boh.H5Bag(f\"{shared_file}/summary\").load()\n", "\n", "print(\"sim/run0 replaced :\", boh.H5Bag(f\"{shared_file}/sim/run0\").load())\n", "print(\"summary unchanged :\", before_summary == after_summary)\n", "\n", "try:\n", " boh.H5Bag.save({\"z\": 0}, f\"{shared_file}/sim/run0\", overwrite_existing=False)\n", "except FileExistsError as e:\n", " print(\"refused:\", e)" ] }, { "cell_type": "markdown", "id": "debug-md", "metadata": {}, "source": [ "## Debug: how is the path parsed?\n", "\n", "The split between the filesystem path and the interior group path happens at the first ancestor whose extension is in `H5Bag.file_extensions` (or which already exists as a file). Each bag exposes `h5_file_path`, `h5_group_path`, and `is_subpath`, so you can inspect how a given path was split." 
] }, { "cell_type": "code", "execution_count": 20, "id": "debug-parse", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " '/tmp/tmpjuulq69k/analysis.h5' -> file= 'analysis.h5' group= '/' is_subpath=False\n", " '/tmp/tmpjuulq69k/analysis.h5/sim/run0' -> file= 'analysis.h5' group= '/sim/run0' is_subpath=True\n", " '/tmp/tmpjuulq69k/analysis.h5/missing/group' -> file= 'analysis.h5' group='/missing/group' is_subpath=True\n", " 'no_extension_yet' -> file='no_extension_yet' group= '/' is_subpath=False\n" ] } ], "source": [ "for path in [\n", " shared_file, # top-level (no interior path)\n", " f\"{shared_file}/sim/run0\", # explicit interior path\n", " f\"{shared_file}/missing/group\", # interior path that doesn't exist yet\n", " \"no_extension_yet\", # neither extension nor existing file\n", "]:\n", " bag = boh.H5Bag(path)\n", " print(f\"{path!r:>50} -> file={bag.h5_file_path.name!r:>18}\",\n", " f\"group={bag.h5_group_path!r:>14}\",\n", " f\"is_subpath={bag.is_subpath}\")" ] }, { "cell_type": "markdown", "id": "missing-md", "metadata": {}, "source": [ "Reading from a missing interior group raises a `KeyError` so you don't silently get an empty bag:" ] }, { "cell_type": "code", "execution_count": 21, "id": "debug-missing", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "missing group: \"Group '/sim/does_not_exist' not found in /tmp/tmpjuulq69k/analysis.h5\"\n" ] } ], "source": [ "try:\n", " boh.H5Bag(f\"{shared_file}/sim/does_not_exist\").load()\n", "except KeyError as e:\n", " print(\"missing group:\", e)" ] }, { "cell_type": "markdown", "id": "custom-ext-md", "metadata": {}, "source": [ "## Debug: custom file extensions\n", "\n", "If you keep your HDF5 data under a different suffix, subclass and override `file_extensions`. Path parsing then splits on that extension instead of `.h5`/`.hdf5`." 
] }, { "cell_type": "code", "execution_count": 22, "id": "debug-ext", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "alice: {'who': 'alice'}\n", "bob: {'who': 'bob'}\n" ] } ], "source": [ "class BagFile(boh.H5Bag):\n", " file_extensions = (\".bag\",)\n", "\n", "bagfile = os.path.join(tmp.name, \"store.bag\")\n", "BagFile.save({\"who\": \"alice\"}, f\"{bagfile}/users/alice\")\n", "BagFile.save({\"who\": \"bob\"}, f\"{bagfile}/users/bob\")\n", "print(\"alice:\", BagFile(f\"{bagfile}/users/alice\").load())\n", "print(\"bob: \", BagFile(f\"{bagfile}/users/bob\").load())" ] }, { "cell_type": "markdown", "id": "triebag-md", "metadata": {}, "source": [ "## `TrieH5Bag` works the same way\n", "\n", "Same interior-path semantics, same file. Useful if you want to mix bag implementations across groups for size-vs-speed trade-offs." ] }, { "cell_type": "code", "execution_count": 23, "id": "triebag", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "first : {'k': 'v1'}\n", "second: {'k': 'v2'}\n" ] } ], "source": [ "trie_file = os.path.join(tmp.name, \"trie.h5\")\n", "boh.TrieH5Bag.save({\"k\": \"v1\"}, f\"{trie_file}/first\")\n", "boh.TrieH5Bag.save({\"k\": \"v2\"}, f\"{trie_file}/second\")\n", "print(\"first :\", boh.TrieH5Bag(f\"{trie_file}/first\").load())\n", "print(\"second:\", boh.TrieH5Bag(f\"{trie_file}/second\").load())" ] }, { "cell_type": "markdown", "id": "cleanup-md", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "code", "execution_count": 24, "id": "cleanup", "metadata": {}, "outputs": [], "source": [ "tmp.cleanup()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 
4, "nbformat_minor": 5 }