bagofholding introduction

This notebook provides a quick rundown of the key user-facing features of bagofholding.

[1]:
import numpy as np

import bagofholding as boh

bagofholding is intended to work with any pickle-able Python object, so first let’s whip up a custom class to work with.

[2]:
class MyCustomClass:
    def __init__(self, n: int):
        self.n = n
        self.name = f"my_custom_class_{n}"
        self.data = np.arange(n)

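    def __eq__(self, other):
        # Short-circuit on the class check so that comparing against an
        # unrelated object never touches attributes it does not have
        return (
            self.__class__ == other.__class__
            and self.n == other.n
            and self.name == other.name
            and np.array_equal(self.data, other.data)
        )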

my_object = MyCustomClass(10)
my_object.__metadata__ = "Let's add some metadata reminding ourselves this was created for the example notebook"

Basics

Storage with bagofholding differs from pickle under the hood, but is intended to be similarly easy to work with.

The underlying analogy is that we have a “bag” that we’re putting our python objects into. Presently, the only back-end we have implemented uses HDF5 with h5py, so let’s save our object with a bag of that flavour

[3]:
filename = "notebook_example.h5"
boh.H5Bag.save(my_object, filename)

Saving is a class-level method, so we never actually need to instantiate a “bag” to save. For loading, we do:

[4]:
bag = boh.H5Bag(filename)
reloaded = bag.load()

print("The reloaded object is the same as the saved object:", reloaded == my_object)
The reloaded object is the same as the saved object: True

So the basic save-load cycle is extremely straightforward. From here on, we go beyond the power of pickle.

We make a “bag” object before loading because, without re-instantiating anything we’ve saved, we can look at its internal structure! Under the hood, we leverage the same __reduce__ workflow that pickle uses (https://docs.python.org/3/library/pickle.html#object.__reduce__) in order to decompose arbitrary objects. But instead of lumping everything together in a binary blob, bagofholding lets us peek at these different components:

[5]:
bag.list_paths()
[5]:
['object',
 'object/args',
 'object/args/i0',
 'object/constructor',
 'object/item_iterator',
 'object/kv_iterator',
 'object/state',
 'object/state/__metadata__',
 'object/state/data',
 'object/state/n',
 'object/state/name']
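The path names above appear to line up with the five-element tuple that Python's own `__reduce_ex__` returns for an ordinary object. The sketch below uses only the standard library to show that tuple for a small stand-in class. `Point` is a hypothetical example class, and the mapping of tuple slots onto the `object/...` paths is an inference from the listing, not a statement about bagofholding internals.

```python
import copyreg


class Point:
    """Hypothetical stand-in for any plain Python class."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Protocol 2 and above yield a five-element tuple for ordinary objects
constructor, args, state, item_iterator, kv_iterator = Point(1, 2).__reduce_ex__(4)

print(constructor is copyreg.__newobj__)  # the callable that creates an empty instance
print(args)  # arguments for that callable: just the class itself
print(state)  # the instance attributes
print(item_iterator, kv_iterator)  # only populated for list-like and dict-like objects

# Rebuilding by hand follows the same recipe that pickle applies
rebuilt = constructor(*args)
rebuilt.__dict__.update(state)
```

Read this way, `object/constructor`, `object/args`, `object/state`, `object/item_iterator` and `object/kv_iterator` would each hold one slot of the tuple.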

Metadata

We have additionally scraped metadata from our object at save-time, which can be found using item-access on the bag with the appropriate path:

[6]:
bag["object"]
[6]:
Metadata(content_type='bagofholding.content.Reducible', qualname='MyCustomClass', module='__main__', version=None, meta="Let's add some metadata reminding ourselves this was created for the example notebook")

Complex objects like numpy arrays get non-trivial metadata

[7]:
bag["object/state/data"]
[7]:
Metadata(content_type='bagofholding.h5.content.Array', qualname='ndarray', module='numpy', version='2.4.3', meta=None)

Simple Python primitives, by contrast, only record their content type; we don’t bother storing anything else

[8]:
print(bag["object/state/n"])
Metadata(content_type='bagofholding.content.Long', qualname=None, module=None, version=None, meta=None)
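To make the idea of scraping metadata at save-time concrete, here is a minimal sketch of how such a record could be collected from an arbitrary object. `SketchMetadata` and `scrape_metadata` are hypothetical names for illustration, not bagofholding's API, and version lookup is left out. Unlike the output above, this sketch also records a qualname and module for primitives.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SketchMetadata:
    """Hypothetical record; not bagofholding's actual Metadata class."""

    qualname: str
    module: str
    meta: Optional[str]


def scrape_metadata(obj: Any) -> SketchMetadata:
    cls = type(obj)
    # An optional user-supplied note, mirroring the __metadata__ attribute set earlier
    meta = getattr(obj, "__metadata__", None)
    return SketchMetadata(
        qualname=cls.__qualname__,
        module=cls.__module__,
        meta=None if meta is None else str(meta),
    )


class Sample:
    pass


sample = Sample()
sample.__metadata__ = "created for the example"

sample_info = scrape_metadata(sample)
int_info = scrape_metadata(10)
print(sample_info)
print(int_info)
```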

And, of course, we store metadata about the bag itself!

[9]:
import re
re.sub(r"(?<=version=')[^']*", "...", str(bag.get_bag_info()))
# Don't worry about the regex, we're just replacing the version number so the automated test doesn't fail each new commit
[9]:
"H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='...', libver_str='latest')"

For Jupyter users, we can browse the structure and metadata of the stored object conveniently in a GUI

[10]:
widget = bag.browse()
widget
[10]:
['object',
 'object/args',
 'object/args/i0',
 'object/constructor',
 'object/item_iterator',
 'object/kv_iterator',
 'object/state',
 'object/state/__metadata__',
 'object/state/data',
 'object/state/n',
 'object/state/name']

Partial loading

A powerful advantage of bagofholding is that we allow objects to be only partially reloaded! Since we track the internal object structure, we can pass a particular internal path within the object to reload just that piece

[11]:
bag.load("object/state/n")
[11]:
10
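Conceptually, loading by path is a walk down a hierarchy that stops at the requested node. The sketch below imitates that with a nested dictionary standing in for the stored file. The `stored` layout and the `resolve` helper are hypothetical, and the real back-end reads HDF5 groups rather than dictionaries.

```python
from typing import Any, Mapping

# Hypothetical in-memory stand-in for the stored hierarchy
stored = {
    "object": {
        "state": {
            "n": 10,
            "name": "my_custom_class_10",
        },
    },
}


def resolve(tree: Mapping[str, Any], path: str) -> Any:
    """Walk a '/'-separated path one component at a time."""
    node: Any = tree
    for part in path.split("/"):
        node = node[part]  # raises KeyError if the path does not exist
    return node


n = resolve(stored, "object/state/n")
print(n)
```

Because only the requested leaf is touched, nothing above it in the hierarchy has to be reconstructed.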

Of course it may be convenient to leverage this feature, but its real power begins to shine when we consider long-term storage.

Suppose your colleague worked with their custom python code to generate important data… and then left. Now you want to access that data, but don’t have a python environment that includes all of their bespoke code! Let’s simulate this by resetting our kernel’s knowledge, and losing access to __main__.MyCustomClass.

[12]:
%reset -f
[13]:
import bagofholding as boh

filename = "notebook_example.h5"

We can still browse the saved object

[14]:
bag = boh.H5Bag(filename)
bag.browse()
[14]:
['object',
 'object/args',
 'object/args/i0',
 'object/constructor',
 'object/item_iterator',
 'object/kv_iterator',
 'object/state',
 'object/state/__metadata__',
 'object/state/data',
 'object/state/n',
 'object/state/name']

But of course, we are no longer able to simply reload it

[15]:
try:
    bag.load()
except ModuleNotFoundError as e:
    print(e)
No module named '__main__.MyCustomClass'; '__main__' is not a package

However, if we know where the data we want is stored – either because we’re familiar with the object’s library even though we don’t have it available right now, or simply by inspecting the object’s browsable structure using bagofholding – then we can still reload just that data!

[16]:
bag.load("object/state/n")
[16]:
10

In some cases we might want to have part of the original environment available, i.e. the part needed to load the leaf data we’re interested in. We can see what that is, right down to the version number

[17]:
bag["object/state/data"]
[17]:
Metadata(content_type='bagofholding.h5.content.Array', qualname='ndarray', module='numpy', version='2.4.3', meta=None)

And load once we’ve made it available to our current python interpreter

[18]:
import numpy as np

bag.load("object/state/data")
[18]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In this way, data stored with bagofholding is extremely transparent and robust.

Version control

Another advantage to storing metadata is that we can check against stored versions at load-time to ensure that our current environment will be able to safely recreate the desired objects from the serialized data.

By default, versions are found by looking at the __version__ attribute of a given object’s base module, but since not all modules store their versioning info this way, this can be overridden on a per-module basis using the version_scraping argument.
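As a rough illustration of this lookup, the sketch below reads `__version__` from a module's top-level package and lets a per-module dictionary of callables override that default. The helper names (`default_scraper`, `scrape_version`) and the throwaway module `fake_versioned_pkg` are assumptions made for the example; only the idea of a dictionary mapping module names to scraping functions comes from this notebook.

```python
import importlib
import sys
import types
from typing import Callable, Mapping, Optional

VersionScraper = Callable[[str], Optional[str]]


def default_scraper(module_name: str) -> Optional[str]:
    """Read __version__ from the top-level package of module_name."""
    base = importlib.import_module(module_name.split(".")[0])
    version = getattr(base, "__version__", None)
    return None if version is None else str(version)


def scrape_version(
    module_name: str,
    overrides: Optional[Mapping[str, VersionScraper]] = None,
) -> Optional[str]:
    """Use a per-module override if one is given, else the default lookup."""
    base_name = module_name.split(".")[0]
    scraper = (overrides or {}).get(base_name, default_scraper)
    return scraper(module_name)


# Register a throwaway module so the example runs without third-party packages
fake = types.ModuleType("fake_versioned_pkg")
fake.__version__ = "1.2.3"
sys.modules["fake_versioned_pkg"] = fake

default_result = scrape_version("fake_versioned_pkg.submodule")
overridden = scrape_version(
    "fake_versioned_pkg",
    overrides={"fake_versioned_pkg": lambda name: "9.9.9"},
)
print(default_result, overridden)
```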

By default, bagofholding will complain if two versions do not match exactly, but we can relax this with the version_validator argument. For semantically versioned packages, we have granular control over how strictly the versions must match.

For the sake of this notebook, let’s use the version_scraping dictionary to provide a custom version for numpy at read-time, and then explore the possibilities for version_validator:

[19]:
import importlib

def change_numpy_major(module_name: str) -> str:
    """Report numpy's version with its minor component replaced by 9999.

    Note: despite the function name, it is the semantic *minor* version that changes.
    """
    if module_name != "numpy":
        raise ValueError("Hey, this is supposed to be a numpy-based example!")
    numpy = importlib.import_module(module_name)
    numpy_actual_version = numpy.__version__
    semantic_breakdown = numpy_actual_version.split(".")
    semantic_breakdown[1] = "9999"  # Change the semantic minor version
    return ".".join(semantic_breakdown)
[20]:
def print_error_without_addresses(e):
    """
    Don't worry about this, it's just so automated tests don't get hung up
    on memory addresses changing in error messages
    """
    import re

    msg = str(e)
    pattern = re.compile(r"<function (\S+) at 0x[0-9a-fA-F]+>")
    clean_message = pattern.sub(r"<function \1 ...>", msg)
    pattern_lambda = re.compile(r"<function <lambda> at 0x[0-9a-fA-F]+>")
    clean_message = pattern_lambda.sub(r"<function <lambda> ...>", clean_message)
    print(clean_message)

When our “current version of numpy” is X.9999.Z, default load behaviour will complain:

[21]:
try:
    bag.load("object/state/data", version_scraping={"numpy": change_numpy_major})
except boh.EnvironmentMismatchError as e:
    print_error_without_addresses(e)
numpy is stored with version 2.4.3, but the current environment has 2.9999.3. This does not pass validation criterion: <function _versions_are_equal ...>

In fact, both of the choices below complain that these versions are not compatible for loading:

[22]:
for validation in ["exact", "semantic-minor"]:
    try:
        bag.load("object/state/data", version_validator=validation, version_scraping={"numpy": change_numpy_major})
    except boh.EnvironmentMismatchError as e:
        print_error_without_addresses(f"Can't load with {validation}: {e}")
Can't load with exact: numpy is stored with version 2.4.3, but the current environment has 2.9999.3. This does not pass validation criterion: <function _versions_are_equal ...>
Can't load with semantic-minor: numpy is stored with version 2.4.3, but the current environment has 2.9999.3. This does not pass validation criterion: <function _versions_match_semantic_minor ...>

But either of these more relaxed flags will let us proceed:

[23]:
for validation in ["semantic-major", "none"]:
    bag.load("object/state/data", version_validator=validation, version_scraping={"numpy": change_numpy_major})
    print(f"Loaded without complaint with {validation}")
Loaded without complaint with semantic-major
Loaded without complaint with none
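The four validation levels used above can be pictured as increasingly permissive comparisons of the dotted version string. The functions below are a hypothetical re-implementation for illustration, assuming that "semantic-minor" requires the major and minor components to agree and "semantic-major" only the major component. They are not bagofholding's own validators.

```python
from typing import Callable, Dict, Tuple


def _leading_parts(version: str, count: int) -> Tuple[str, ...]:
    """Return the first `count` dot-separated components of a version string."""
    return tuple(version.split(".")[:count])


def versions_are_equal(stored: str, current: str) -> bool:
    return stored == current


def versions_match_minor(stored: str, current: str) -> bool:
    return _leading_parts(stored, 2) == _leading_parts(current, 2)


def versions_match_major(stored: str, current: str) -> bool:
    return _leading_parts(stored, 1) == _leading_parts(current, 1)


def no_validation(stored: str, current: str) -> bool:
    return True


# Hypothetical lookup from flag name to comparison function
VALIDATORS: Dict[str, Callable[[str, str], bool]] = {
    "exact": versions_are_equal,
    "semantic-minor": versions_match_minor,
    "semantic-major": versions_match_major,
    "none": no_validation,
}

results = {name: check("2.4.3", "2.9999.3") for name, check in VALIDATORS.items()}
print(results)
# {'exact': False, 'semantic-minor': False, 'semantic-major': True, 'none': True}
```

With the stored version 2.4.3 and the pretend environment version 2.9999.3, this reproduces the pattern seen above: the two strict levels reject the load, the two relaxed levels accept it.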

Multiple Bags per HDF5 File

The HDF5 backends also support saving multiple, independent bags into the same file.

To use this, append additional ‘path’ components after the actual file path. These are used as the internal HDF5 groups.

[24]:
boh.H5Bag.save(range(10), f"{filename}/range")
boh.H5Bag.save("multi", f"{filename}/string")
boh.H5Bag.save({"level1": {"level2": "value"}}, f"{filename}/nested")
[25]:
with boh.H5Bag(f"{filename}/range") as bag:
    print(bag.bag_info)
    print(bag.load())
H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')
range(0, 10)
[26]:
with boh.H5Bag(f"{filename}/string") as bag:
    print(bag.bag_info)
    print(bag.load())
H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')
multi
[27]:
with boh.H5Bag(f"{filename}/nested") as bag:
    print(bag.bag_info)
    print(bag.load())
H5Info(qualname='H5Bag', module='bagofholding.h5.bag', version='0.1.11.dev0+g32d710783.d20260321', libver_str='latest')
{'level1': {'level2': 'value'}}

Otherwise, access to the data remains the same as in the single-bag case. Bag metadata and version info are tracked entirely separately. As we can see below, the layout inside the HDF5 file is otherwise the same as in the single-bag case, just prefixed by the additional path components given in save/load.

[28]:
!h5ls -r {filename}
/                        Group
/nested                  Group
/nested/object           Group
/nested/object/level1    Group
/nested/object/level1/level2 Dataset {SCALAR}
/object                  Group
/object/args             Group
/object/args/i0          Dataset {SCALAR}
/object/constructor      Dataset {SCALAR}
/object/item_iterator    Dataset {NULL}
/object/kv_iterator      Dataset {NULL}
/object/state            Group
/object/state/__metadata__ Dataset {SCALAR}
/object/state/data       Dataset {10}
/object/state/n          Dataset {SCALAR}
/object/state/name       Dataset {SCALAR}
/range                   Group
/range/object            Group
/range/object/args       Group
/range/object/args/i0    Dataset {SCALAR}
/range/object/args/i1    Dataset {SCALAR}
/range/object/args/i2    Dataset {SCALAR}
/range/object/constructor Dataset {SCALAR}
/string                  Group
/string/object           Dataset {SCALAR}

Compared to the single bag use case, it becomes important to close bags after reading! If read-only bag handles are still open, we won’t be able to save new bags into the same file.
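One way to picture the "file path plus internal group" convention from this section is a helper that splits the combined string at the file name. This is only a sketch under the assumption that the file can be recognised by its `.h5` extension. `split_bag_path` is a hypothetical helper, and how bagofholding actually finds the boundary is not stated here.

```python
from typing import Tuple


def split_bag_path(path: str, extension: str = ".h5") -> Tuple[str, str]:
    """Split 'dir/file.h5/group/sub' into ('dir/file.h5', 'group/sub')."""
    parts = path.split("/")
    for index, part in enumerate(parts):
        if part.endswith(extension):
            file_path = "/".join(parts[: index + 1])
            group = "/".join(parts[index + 1 :])
            return file_path, group
    raise ValueError(f"No component of {path!r} ends with {extension!r}")


print(split_bag_path("notebook_example.h5/range"))
print(split_bag_path("notebook_example.h5"))
```

An empty group string corresponds to the single-bag case, where the bag sits at the root of the file.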

Save-time safety

Being able to exploit the above version control to the fullest means your stored objects need to come from an importable module with some sort of versioning. To this end, we provide two save-time flags to ensure better behaviour from saved objects.

First, you can require at save-time that non-standard objects all have a version:

[29]:
class SomethingLocalAndUnversioned:
    pass
[30]:
try:
    boh.H5Bag.save(SomethingLocalAndUnversioned, filename, require_versions=True)
except boh.NoVersionError as e:
    print(e)
Could not find a version for __main__. Either disable `require_versions`, use `version_scraping` to find an existing version for this package, or add versioning to the unversioned package.

And second, you can forbid particular modules, e.g. some local library or, more commonly, __main__:

[31]:
try:
    boh.H5Bag.save(SomethingLocalAndUnversioned, filename, forbidden_modules=("__main__",))
except boh.ModuleForbiddenError as e:
    print(e)
Module '__main__' is forbidden as a source of stored objects. Change the `forbidden_modules` or move this object to an allowed module.
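The forbidden-module check can be sketched as a lookup of where an object's class lives, followed by a comparison against a deny-list. `check_module_allowed` and `SketchModuleForbiddenError` are hypothetical names for this example. The sketch only inspects the top-level object, whereas a real save would presumably need to check every nested component as well.

```python
from typing import Any, Iterable


class SketchModuleForbiddenError(ValueError):
    """Hypothetical error type used only in this sketch."""


def check_module_allowed(obj: Any, forbidden_modules: Iterable[str] = ()) -> None:
    # A class carries its own __module__; an instance defers to its class
    owner = obj if isinstance(obj, type) else type(obj)
    module = owner.__module__
    root = module.split(".")[0]
    forbidden = set(forbidden_modules)
    if module in forbidden or root in forbidden:
        raise SketchModuleForbiddenError(
            f"Module {module!r} is forbidden as a source of stored objects."
        )


check_module_allowed(3, forbidden_modules=("some_local_library",))  # passes silently

try:
    check_module_allowed(3, forbidden_modules=("builtins",))
except SketchModuleForbiddenError as error:
    print(error)
```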

(Advanced topic) Customization

Because it is modeled on the pickle API, power users can customize the bagofholding storage behavior using familiar tools like custom __reduce__ or __getstate__ methods on their classes. E.g., below we see that modifying the state manipulation impacts what is displayed on browsing:

[32]:
class Customized:
    def __init__(self, x):
        self.x = x

    def __getstate__(self):
        return {"by_default_this_would_just_be_x": self.x}

    def __setstate__(self, state):
        self.x = state["by_default_this_would_just_be_x"]


boh.H5Bag.save(Customized(42), filename)
boh.H5Bag(filename).list_paths()[-1]
[32]:
'object/state/by_default_this_would_just_be_x'
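The same effect is visible with the standard library alone: a custom `__getstate__` changes the state element of the tuple returned by `__reduce_ex__`, which is the raw material any pickle-style serializer works from. The classes below are illustrative stand-ins defined just for this sketch.

```python
import copy


class Plain:
    def __init__(self, x):
        self.x = x


class Customized(Plain):
    def __getstate__(self):
        return {"by_default_this_would_just_be_x": self.x}

    def __setstate__(self, state):
        self.x = state["by_default_this_would_just_be_x"]


# Index 2 of the reduce tuple is the object's state
plain_state = Plain(42).__reduce_ex__(4)[2]
custom_state = Customized(42).__reduce_ex__(4)[2]
print(plain_state)  # {'x': 42}
print(custom_state)  # {'by_default_this_would_just_be_x': 42}

# copy.copy drives the same reduce/setstate machinery, so it round-trips
clone = copy.copy(Customized(42))
print(clone.x)  # 42
```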

Limitations

bagofholding uses many of the same patterns as pickle, and thus is only expected to work for objects which could otherwise be pickled. Bag objects offer a convenience method to quickly test this:

[33]:
message = boh.H5Bag.pickle_check(lambda x: x, raise_exceptions=False)
print_error_without_addresses(message)
Can't pickle <function <lambda> ...>: attribute lookup <lambda> on __main__ failed
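For comparison, a check of this kind can be approximated with nothing but the standard library by attempting a pickle round trip and reporting any failure. `sketch_pickle_check` is a hypothetical helper written for this example and is not the implementation behind `pickle_check`.

```python
import pickle
from typing import Any, Optional


def sketch_pickle_check(obj: Any) -> Optional[str]:
    """Return None if obj survives a pickle round trip, else the error message."""
    try:
        pickle.loads(pickle.dumps(obj))
    except Exception as error:  # pickling can fail with several exception types
        return str(error)
    return None


print(sketch_pickle_check({"a": [1, 2, 3]}))  # None
print(sketch_pickle_check(lambda x: x) is not None)  # True
```

The exact wording of the lambda error varies between Python versions, so the sketch only checks that a message is returned.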

And although the same patterns as pickle are exploited, bagofholding does not actually execute pickle. As a consequence, protocol 5, the highest pickle protocol and the one that introduces out-of-band data, is not supported:

[34]:
try:
    boh.H5Bag.save(42, filename, _pickle_protocol=5)
except boh.PickleProtocolError as e:
    print(e)
pickle protocol must be <= 4, got 5
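For context on what is being excluded, the standard-library sketch below shows what "out-of-band data" means in plain pickle: with protocol 5, a buffer can be handed to a callback instead of being copied into the pickle stream, and it must be supplied again at load time. This illustrates pickle itself, not anything bagofholding does.

```python
import pickle

payload_data = bytearray(b"large-binary-payload")
out_of_band = []

# list.append returns None, which tells pickle to keep the buffer out of the stream
stream = pickle.dumps(
    pickle.PickleBuffer(payload_data),
    protocol=5,
    buffer_callback=out_of_band.append,
)

# The raw bytes travel separately and are handed back in when loading
restored = pickle.loads(stream, buffers=out_of_band)

print(len(out_of_band))
print(payload_data in stream)
print(bytes(restored))
```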

Notebook cleanup

At the end of the day, let’s clean up the files we created.

[35]:
import contextlib
import os

with contextlib.suppress(FileNotFoundError):
    os.remove(filename)