{ "cells": [ { "cell_type": "markdown", "id": "6be429b8dfe9b11f", "metadata": {}, "source": [ "# `bagofholding` introduction\n", "\n", "This notebook provides a quick rundown of the key user-facing features for `bagofholding`" ] }, { "cell_type": "code", "execution_count": 1, "id": "7ba5f13eb7ad327c", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.691560Z", "start_time": "2025-10-05T02:41:23.608118Z" } }, "outputs": [], "source": [ "import numpy as np\n", "\n", "import bagofholding as boh" ] }, { "cell_type": "markdown", "id": "4cc7a496100aadac", "metadata": {}, "source": [ "`bagofholding` is intended to work with any `pickle`-able python object, so first let's whip up some custom class to work with" ] }, { "cell_type": "code", "execution_count": 2, "id": "aa94a143f81248d4", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.698357Z", "start_time": "2025-10-05T02:41:23.696375Z" } }, "outputs": [], "source": [ "class MyCustomClass:\n", "    def __init__(self, n: int):\n", "        self.n = n\n", "        self.name = f\"my_custom_class_{n}\"\n", "        self.data = np.arange(n)\n", "\n", "    def __eq__(self, other):\n", "        return all(\n", "            (\n", "                self.__class__ == other.__class__,\n", "                self.n == other.n,\n", "                self.name == other.name,\n", "                np.all(self.data == other.data),\n", "            )\n", "        )\n", "\n", "my_object = MyCustomClass(10)\n", "my_object.__metadata__ = \"Let's add some metadata reminding ourselves this was created for the example notebook\"" ] }, { "cell_type": "markdown", "id": "9c09b6aed45f4342", "metadata": {}, "source": [ "## Basics\n", "\n", "Storage with `bagofholding` differs from `pickle` under the hood, but is intended to be similarly easy to work with.\n", "\n", "The underlying analogy is that we have a \"bag\" that we're putting our python objects into. 
Presently, the only back-end we have implemented uses HDF5 with `h5py`, so let's save our object with a bag of that flavour" ] }, { "cell_type": "code", "execution_count": 3, "id": "316dab21fdbf074b", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.707754Z", "start_time": "2025-10-05T02:41:23.702963Z" } }, "outputs": [], "source": [ "filename = \"notebook_example.h5\"\n", "boh.H5Bag.save(my_object, filename)" ] }, { "cell_type": "markdown", "id": "78aeb7cab60da1b5", "metadata": {}, "source": [ "Saving is a class-level method, so we never actually need to instantiate a \"bag\" to save. For loading, we do:" ] }, { "cell_type": "code", "execution_count": 4, "id": "b6f35d6fe5a9ba7", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.717494Z", "start_time": "2025-10-05T02:41:23.711394Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The reloaded object is the same as the saved object: True\n" ] } ], "source": [ "bag = boh.H5Bag(filename)\n", "reloaded = bag.load()\n", "\n", "print(\"The reloaded object is the same as the saved object:\", reloaded == my_object)" ] }, { "cell_type": "markdown", "id": "768a3a54c3a556b2", "metadata": {}, "source": [ "So the basic save-load cycle is extremely straightforward. From here on, we go beyond the power of `pickle`.\n", "\n", "We make a \"bag\" object before loading because, without re-instantiating anything we've saved, we can look at its internal structure! Under the hood, we leverage the same [`__reduce__` workflow that `pickle` uses](https://docs.python.org/3/library/pickle.html#object.__reduce__) in order to decompose arbitrary objects. 
But instead of lumping everything together in a binary blob, `bagofholding` lets us peek at these different components:" ] }, { "cell_type": "code", "execution_count": 5, "id": "87864f699da8f3cf", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.729427Z", "start_time": "2025-10-05T02:41:23.726040Z" } }, "outputs": [ { "data": { "text/plain": [ "['object',\n", " 'object/args',\n", " 'object/args/i0',\n", " 'object/constructor',\n", " 'object/item_iterator',\n", " 'object/kv_iterator',\n", " 'object/state',\n", " 'object/state/__metadata__',\n", " 'object/state/data',\n", " 'object/state/n',\n", " 'object/state/name']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bag.list_paths()" ] }, { "cell_type": "markdown", "id": "2a7276339024b83e", "metadata": {}, "source": [ "## Metadata\n", "\n", "We have additionally scraped metadata from our object at save-time, which can be found using item-access on the bag with the appropriate path:" ] }, { "cell_type": "code", "execution_count": 6, "id": "7ee01d6910aa9619", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.738340Z", "start_time": "2025-10-05T02:41:23.735384Z" } }, "outputs": [ { "data": { "text/plain": [ "Metadata(content_type='bagofholding.content.Reducible', qualname='MyCustomClass', module='__main__', version=None, meta=\"Let's add some metadata reminding ourselves this was created for the example notebook\")" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bag[\"object\"]" ] }, { "cell_type": "markdown", "id": "689147946404c8bc", "metadata": {}, "source": [ "Complex objects like numpy arrays get non-trivial metadata" ] }, { "cell_type": "code", "execution_count": 7, "id": "ccdc0d40e4304324", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.749252Z", "start_time": "2025-10-05T02:41:23.746146Z" } }, "outputs": [ { "data": { "text/plain": [ 
"Metadata(content_type='bagofholding.h5.content.Array', qualname='ndarray', module='numpy', version='2.4.3', meta=None)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bag[\"object/state/data\"]" ] }, { "cell_type": "markdown", "id": "32e5d68c45f6ed4f", "metadata": {}, "source": [ "While for simple python primitives we don't bother storing anything" ] }, { "cell_type": "code", "execution_count": 8, "id": "e4f4a97e55e4a07f", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.756823Z", "start_time": "2025-10-05T02:41:23.754293Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Metadata(content_type='bagofholding.content.Long', qualname=None, module=None, version=None, meta=None)\n" ] } ], "source": [ "print(bag[\"object/state/n\"])" ] }, { "cell_type": "markdown", "id": "1f604daacc4a6334", "metadata": {}, "source": [ "And, of course, we store metadata about the bag itself!" ] }, { "cell_type": "code", "execution_count": 9, "id": "c54f0040426c0b5c", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.763926Z", "start_time": "2025-10-05T02:41:23.761747Z" } }, "outputs": [ { "data": { "text/plain": [ "\"H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='...', libver_str='latest')\"" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "re.sub(r\"(?<=version=')[^']*\", \"...\", str(bag.get_bag_info()))\n", "# Don't worry about the regex, we're just replacing the version number so the automated test doesn't fail each new commit" ] }, { "cell_type": "markdown", "id": "e5b90e3045e1f76e", "metadata": {}, "source": [ "For Jupyter users, we can browse the structure and metadata of the stored object conveniently in a GUI" ] }, { "cell_type": "code", "execution_count": 10, "id": "d45c249ad0bb6c17", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.782073Z", "start_time": "2025-10-05T02:41:23.774724Z" } }, 
"outputs": [ { "data": { "text/plain": [ "['object',\n", " 'object/args',\n", " 'object/args/i0',\n", " 'object/constructor',\n", " 'object/item_iterator',\n", " 'object/kv_iterator',\n", " 'object/state',\n", " 'object/state/__metadata__',\n", " 'object/state/data',\n", " 'object/state/n',\n", " 'object/state/name']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "widget = bag.browse()\n", "widget" ] }, { "cell_type": "markdown", "id": "b35bfcea9e295b2", "metadata": {}, "source": [ "## Partial loading\n", "\n", "A powerful advantage of `bagofholding` is that we allow objects to be only _partially_ reloaded! Since we track the internal object structure, we can pass a particular internal path within the object to reload just that piece" ] }, { "cell_type": "code", "execution_count": 11, "id": "90e59d1c30992064", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.792516Z", "start_time": "2025-10-05T02:41:23.788292Z" } }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bag.load(\"object/state/n\")" ] }, { "cell_type": "markdown", "id": "9bf0991c32ec22ab", "metadata": {}, "source": [ "Of course it may be convenient to leverage this feature, but its real power begins to shine when we consider long-term storage.\n", "\n", "Suppose your colleague worked with their custom python code to generate important data... and then left. Now you want to access that data, but don't have a python environment that includes all of their bespoke code! Let's simulate this by resetting our kernel's knowledge, and losing access to `__main__.MyCustomClass`." 
] }, { "cell_type": "code", "execution_count": 12, "id": "70443bcb75fc029e", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.826799Z", "start_time": "2025-10-05T02:41:23.797565Z" } }, "outputs": [], "source": [ "%reset -f" ] }, { "cell_type": "code", "execution_count": 13, "id": "8c841101b8fd0efd", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.831660Z", "start_time": "2025-10-05T02:41:23.830049Z" } }, "outputs": [], "source": [ "import bagofholding as boh\n", "\n", "filename = \"notebook_example.h5\"" ] }, { "cell_type": "markdown", "id": "b454f29e2e564a18", "metadata": {}, "source": [ "We can still browse the saved object" ] }, { "cell_type": "code", "execution_count": 14, "id": "c9c36fbf06f0dae3", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.841862Z", "start_time": "2025-10-05T02:41:23.834603Z" } }, "outputs": [ { "data": { "text/plain": [ "['object',\n", " 'object/args',\n", " 'object/args/i0',\n", " 'object/constructor',\n", " 'object/item_iterator',\n", " 'object/kv_iterator',\n", " 'object/state',\n", " 'object/state/__metadata__',\n", " 'object/state/data',\n", " 'object/state/n',\n", " 'object/state/name']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bag = boh.H5Bag(filename)\n", "bag.browse()" ] }, { "cell_type": "markdown", "id": "30bf2c3269e1cfe0", "metadata": {}, "source": [ "But of course, we are no longer able to simply reload it" ] }, { "cell_type": "code", "execution_count": 15, "id": "c026c2877c84a79e", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.850967Z", "start_time": "2025-10-05T02:41:23.846714Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No module named '__main__.MyCustomClass'; '__main__' is not a package\n" ] } ], "source": [ "try:\n", "    bag.load()\n", "except ModuleNotFoundError as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "id": "d85ec77f7bb82502", "metadata": {}, "source": [ 
"However, if we know where the data we want is stored -- either because we're familiar with the object's library even though we don't have it available right now, or simply by inspecting the object's browsable structure using `bagofholding` -- then we can still reload just that data!" ] }, { "cell_type": "code", "execution_count": 16, "id": "b66832b492baea32", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.860618Z", "start_time": "2025-10-05T02:41:23.857471Z" } }, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bag.load(\"object/state/n\")" ] }, { "cell_type": "markdown", "id": "216c2311b876d9f0", "metadata": {}, "source": [ "In some cases we might want to have _part_ of the original environment available, i.e. that part needed to load the terminal data we're interested in. We can see what that is, right down to the version number" ] }, { "cell_type": "code", "execution_count": 17, "id": "d3d3892e5f65551b", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.871007Z", "start_time": "2025-10-05T02:41:23.867417Z" } }, "outputs": [ { "data": { "text/plain": [ "Metadata(content_type='bagofholding.h5.content.Array', qualname='ndarray', module='numpy', version='2.4.3', meta=None)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bag[\"object/state/data\"]" ] }, { "cell_type": "markdown", "id": "8deeb1bb3d79d786", "metadata": {}, "source": [ "And load once we've made it available to our current python interpreter" ] }, { "cell_type": "code", "execution_count": 18, "id": "f9f5e19a4f083853", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.879733Z", "start_time": "2025-10-05T02:41:23.876368Z" } }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", 
"\n", "bag.load(\"object/state/data\")" ] }, { "cell_type": "markdown", "id": "6e11779cc770ce7b", "metadata": {}, "source": [ "In this way, data stored with `bagofholding` is extremely transparent and robust." ] }, { "cell_type": "markdown", "id": "d7cd130f02833334", "metadata": {}, "source": [ "## Version control\n", "\n", "Another advantage to storing metadata is that we can check against stored versions at load-time to ensure that our current environment will be able to safely recreate the desired objects from the serialized data.\n", "\n", "By default, versions are found by looking at the `__version__` attribute of a given object's base module, but since not all modules store their versioning info this way, this can be overridden on a per-module basis using the `version_scraping` argument.\n", "\n", "By default, `bagofholding` will complain if two versions do not match exactly, but we can relax this with the `version_validator` argument.\n", "For semantically versioned packages, we have granular control over how strictly the versions must match.\n", "\n", "For the sake of this notebook, let's use the `version_scraping` dictionary to provide a custom version for `numpy` at read-time, and then explore the possibilities for `version_validator`:" ] }, { "cell_type": "code", "execution_count": 19, "id": "7b758775498ad711", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.887054Z", "start_time": "2025-10-05T02:41:23.884702Z" } }, "outputs": [], "source": [ "import importlib\n", "\n", "def change_numpy_major(module_name: str) -> str:\n", "    if module_name != \"numpy\":\n", "        raise ValueError(\"Hey, this is supposed to be a numpy-based example!\")\n", "    numpy = importlib.import_module(module_name)\n", "    numpy_actual_version = numpy.__version__\n", "    semantic_breakdown = numpy_actual_version.split(\".\")\n", "    semantic_breakdown[1] = \"9999\"  # Change the semantic minor version (despite the function's name)\n", "    return \".\".join(semantic_breakdown)" ] }, { "cell_type": "code", 
"execution_count": 20, "id": "799f96d6b0ede22b", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.896197Z", "start_time": "2025-10-05T02:41:23.894135Z" } }, "outputs": [], "source": [ "def print_error_without_addresses(e):\n", " \"\"\"\n", " Don't worry about this, it's just so automated tests don't get hung up\n", " on memory addresses changing in error messages\n", " \"\"\"\n", " import re\n", "\n", " msg = str(e)\n", " pattern = re.compile(r\"\")\n", " clean_message = pattern.sub(r\"\", msg)\n", " pattern_lambda = re.compile(r\" at 0x[0-9a-fA-F]+>\")\n", " clean_message = pattern_lambda.sub(r\" ...>\", clean_message)\n", " print(clean_message)" ] }, { "cell_type": "markdown", "id": "88eb603a3d5872f9", "metadata": {}, "source": [ "When our \"current version of numpy\" is X.9999.Z, default load behaviour will complain:" ] }, { "cell_type": "code", "execution_count": 21, "id": "e02fd9f649c84f87", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.901821Z", "start_time": "2025-10-05T02:41:23.899008Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "numpy is stored with version 2.4.3, but the current environment has 2.9999.3. 
This does not pass validation criterion: \n" ] } ], "source": [ "try:\n", "    bag.load(\"object/state/data\", version_scraping={\"numpy\": change_numpy_major})\n", "except boh.EnvironmentMismatchError as e:\n", "    print_error_without_addresses(e)" ] }, { "cell_type": "markdown", "id": "683daa082ecac817", "metadata": {}, "source": [ "In fact, either of the choices below will complain that these versions are not compatible for loading:" ] }, { "cell_type": "code", "execution_count": 22, "id": "d6f85080b75854f3", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.911767Z", "start_time": "2025-10-05T02:41:23.907944Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Can't load with exact: numpy is stored with version 2.4.3, but the current environment has 2.9999.3. This does not pass validation criterion: \n", "Can't load with semantic-minor: numpy is stored with version 2.4.3, but the current environment has 2.9999.3. This does not pass validation criterion: \n" ] } ], "source": [ "for validation in [\"exact\", \"semantic-minor\"]:\n", "    try:\n", "        bag.load(\"object/state/data\", version_validator=validation, version_scraping={\"numpy\": change_numpy_major})\n", "    except boh.EnvironmentMismatchError as e:\n", "        print_error_without_addresses(f\"Can't load with {validation}: {e}\")" ] }, { "cell_type": "markdown", "id": "665c6bbd0b79b97d", "metadata": {}, "source": [ "But either of these more relaxed flags will let us proceed:" ] }, { "cell_type": "code", "execution_count": 23, "id": "d751b9d9de41493e", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.921159Z", "start_time": "2025-10-05T02:41:23.917083Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loaded without complaint with semantic-major\n", "Loaded without complaint with none\n" ] } ], "source": [ "for validation in [\"semantic-major\", \"none\"]:\n", " bag.load(\"object/state/data\", version_validator=validation, 
version_scraping={\"numpy\": change_numpy_major})\n", " print(f\"Loaded without complaint with {validation}\")" ] }, { "cell_type": "markdown", "id": "62ace1e6-6cc4-4117-80ab-682f2ca7af18", "metadata": {}, "source": [ "## Multiple Bags per HDF5 File\n", "\n", "The HDF5 backends also support saving multiple, independent bags into the same file.\n", "\n", "To use this, append additional 'path' components after the actual file path.\n", "These components are used as the names of the internal HDF5 groups." ] }, { "cell_type": "code", "execution_count": 24, "id": "6c0ece6e-ba6a-4ca6-9110-8089fa2629b4", "metadata": {}, "outputs": [], "source": [ "boh.H5Bag.save(range(10), f\"{filename}/range\")\n", "boh.H5Bag.save(\"multi\", f\"{filename}/string\")\n", "boh.H5Bag.save({\"level1\": {\"level2\": \"value\"}}, f\"{filename}/nested\")" ] }, { "cell_type": "code", "execution_count": 25, "id": "2b0e5c8c-47fe-4eb1-941b-793e34708d60", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')\n", "range(0, 10)\n" ] } ], "source": [ "with boh.H5Bag(f\"{filename}/range\") as bag:\n", "    print(bag.bag_info)\n", "    print(bag.load())" ] }, { "cell_type": "code", "execution_count": 26, "id": "3b542d4a-7c67-4206-bc8a-de078feb1bc4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')\n", "multi\n" ] } ], "source": [ "with boh.H5Bag(f\"{filename}/string\") as bag:\n", "    print(bag.bag_info)\n", "    print(bag.load())" ] }, { "cell_type": "code", "execution_count": 27, "id": "b1e03c88-00d4-4fa9-94f0-0fe0739a9c3f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')\n", "{'level1': 
{'level2': 'value'}}\n" ] } ], "source": [ "with boh.H5Bag(f\"{filename}/nested\") as bag:\n", "    print(bag.bag_info)\n", "    print(bag.load())" ] }, { "cell_type": "markdown", "id": "346ef4f9-dbbc-4a3b-910c-1c83080e35ce", "metadata": {}, "source": [ "Otherwise, access to the data remains the same as in the single-bag case.\n", "Bag metadata and version info are tracked entirely separately for each bag.\n", "As we can see below, the layout inside the HDF5 file is otherwise the same as in the single-bag case, just prefixed by the additional path components given in save/load." ] }, { "cell_type": "code", "execution_count": 28, "id": "b6abb470-de04-425e-b02d-1d219381bde3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/ Group\n", "/nested Group\n", "/nested/object Group\n", "/nested/object/level1 Group\n", "/nested/object/level1/level2 Dataset {SCALAR}\n", "/object Group\n", "/object/args Group\n", "/object/args/i0 Dataset {SCALAR}\n", "/object/constructor Dataset {SCALAR}\n", "/object/item_iterator Dataset {NULL}\n", "/object/kv_iterator Dataset {NULL}\n", "/object/state Group\n", "/object/state/__metadata__ Dataset {SCALAR}\n", "/object/state/data Dataset {10}\n", "/object/state/n Dataset {SCALAR}\n", "/object/state/name Dataset {SCALAR}\n", "/range Group\n", "/range/object Group\n", "/range/object/args Group\n", "/range/object/args/i0 Dataset {SCALAR}\n", "/range/object/args/i1 Dataset {SCALAR}\n", "/range/object/args/i2 Dataset {SCALAR}\n", "/range/object/constructor Dataset {SCALAR}\n", "/string Group\n", "/string/object Dataset {SCALAR}\n" ] } ], "source": [ "!h5ls -r {filename}" ] }, { "cell_type": "markdown", "id": "38906781-1b10-4a1b-94db-1d71b3493f42", "metadata": {}, "source": [ "Compared to the single-bag use case, it becomes **important** to close bags after reading!\n", "If read-only bag handles are still open, we won't be able to save new bags into the same file." 
] }, { "cell_type": "markdown", "id": "f70c76c96684d067", "metadata": {}, "source": [ "## Save-time safety\n", "\n", "Exploiting the above version control to the fullest requires that your stored objects come from an importable module with some sort of versioning.\n", "To this end, we provide two save-time flags to ensure better behaviour from saved objects.\n", "\n", "First, you can require at save-time that non-standard objects all have a version:" ] }, { "cell_type": "code", "execution_count": 29, "id": "ffd8790c7d9f3f7", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.925362Z", "start_time": "2025-10-05T02:41:23.923936Z" } }, "outputs": [], "source": [ "class SomethingLocalAndUnversioned:\n", "    pass" ] }, { "cell_type": "code", "execution_count": 30, "id": "3f9e43248e4ae5f1", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.934323Z", "start_time": "2025-10-05T02:41:23.931459Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Could not find a version for __main__. Either disable `require_versions`, use `version_scraping` to find an existing version for this package, or add versioning to the unversioned package.\n" ] } ], "source": [ "try:\n", "    boh.H5Bag.save(SomethingLocalAndUnversioned, filename, require_versions=True)\n", "except boh.NoVersionError as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "id": "a88cd9496478186d", "metadata": {}, "source": [ "And second, you can forbid particular modules, e.g. some local library or, more commonly, `__main__`:" ] }, { "cell_type": "code", "execution_count": 31, "id": "385dc69cf00cd98", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.941825Z", "start_time": "2025-10-05T02:41:23.939098Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Module '__main__' is forbidden as a source of stored objects. 
Change the `forbidden_modules` or move this object to an allowed module.\n" ] } ], "source": [ "try:\n", "    boh.H5Bag.save(SomethingLocalAndUnversioned, filename, forbidden_modules=(\"__main__\",))\n", "except boh.ModuleForbiddenError as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "id": "7880ba4eba4778fc", "metadata": {}, "source": [ "## (Advanced topic) Customization\n", "\n", "Because it is modeled on the `pickle` API, power users can customize the `bagofholding` storage behavior using familiar tools like custom `__reduce__` or `__getstate__` methods on their classes.\n", "For example, customizing `__getstate__` below changes the path that is displayed on browsing:" ] }, { "cell_type": "code", "execution_count": 32, "id": "780d657a9de7fd48", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.954696Z", "start_time": "2025-10-05T02:41:23.948001Z" } }, "outputs": [ { "data": { "text/plain": [ "'object/state/by_default_this_would_just_be_x'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "class Customized:\n", "    def __init__(self, x):\n", "        self.x = x\n", "\n", "    def __getstate__(self):\n", "        return {\"by_default_this_would_just_be_x\": self.x}\n", "\n", "    def __setstate__(self, state):\n", "        self.x = state[\"by_default_this_would_just_be_x\"]\n", "\n", "boh.H5Bag.save(Customized(42), filename)\n", "boh.H5Bag(filename).list_paths()[-1]" ] }, { "cell_type": "markdown", "id": "1f7d28e23c1b4a3d", "metadata": {}, "source": [ "## Limitations\n", "\n", "`bagofholding` uses many of the same patterns as `pickle`, and thus is only expected to work for objects which could otherwise be pickled.\n", "Bag objects offer a convenience method to quickly test this:" ] }, { "cell_type": "code", "execution_count": 33, "id": "f6c592f5e5eb2625", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.959605Z", "start_time": "2025-10-05T02:41:23.957910Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "Can't pickle ...>: attribute lookup on __main__ failed\n" ] } ], "source": [ "message = boh.H5Bag.pickle_check(lambda x: x, raise_exceptions=False)\n", "print_error_without_addresses(message)" ] }, { "cell_type": "markdown", "id": "9d446fffaf2ffc39", "metadata": {}, "source": [ "And although the same patterns as `pickle` are exploited, `bagofholding` does not actually _execute_ `pickle`.\n", "As a consequence, pickle protocol 5, which introduces out-of-band data, is not supported:" ] }, { "cell_type": "code", "execution_count": 34, "id": "c7bfbf7e9861dfb9", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.970172Z", "start_time": "2025-10-05T02:41:23.967182Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pickle protocol must be <= 4, got 5\n" ] } ], "source": [ "try:\n", "    boh.H5Bag.save(42, filename, _pickle_protocol=5)\n", "except boh.PickleProtocolError as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "id": "94bab79ee3fa4f8a", "metadata": {}, "source": [ "## Notebook cleanup\n", "\n", "At the end of the day, let's clean up the file we created." ] }, { "cell_type": "code", "execution_count": 35, "id": "7066b6eec08f8608", "metadata": { "ExecuteTime": { "end_time": "2025-10-05T02:41:23.976774Z", "start_time": "2025-10-05T02:41:23.975257Z" } }, "outputs": [], "source": [ "import contextlib\n", "import os\n", "\n", "with contextlib.suppress(FileNotFoundError):\n", "    os.remove(filename)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 }