swmr_tools

Getting Started

swmr_tools is a Python package that makes live processing of HDF5 files easy.

swmr_tools can be installed from conda-forge using:

conda install -c conda-forge swmr-tools

It can also be installed from PyPI:

pip install swmr_tools

Alternatively, you can clone the git repository containing swmr_tools using:

git clone https://github.com/DiamondLightSource/python-swmrtools.git

HDF5 File Requirements

To process HDF5 data live using the swmr_tools package, the file must meet a few structural requirements.

  • The file must be created in SWMR mode (see https://docs.h5py.org/en/stable/swmr.html)

  • The file must have one (or more) key datasets (see below)

  • (Optional) The file can have a finished dataset (see below)

Key Datasets

Although SWMR allows an HDF5 file to be read while it is being written, it can be difficult to tell whether a slice of the data has actually been written or still holds the fill value HDF5 uses when a dataset is expanded. To make this distinction, swmr_tools needs a key dataset. The key dataset is usually an integer dataset with a fill value of zero; the writer flushes a non-zero integer to it after the corresponding frame of the main dataset has been flushed. By monitoring the key datasets, swmr_tools can determine when each data frame is readable.
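The write order described above can be sketched with plain h5py. This is a minimal illustration only, not code from swmr_tools: the file name, the dataset names "data" and "keys", and the frame contents are all made up for the example.

```python
import h5py
import numpy as np

n_frames, frame_shape = 4, (5, 10)  # hypothetical scan: 4 frames of shape (5, 10)

with h5py.File("swmr_writer_sketch.h5", "w", libver="latest") as f:
    # Resizable datasets: the scan dimension starts empty and grows per frame.
    data = f.create_dataset("data", shape=(0,) + frame_shape,
                            maxshape=(None,) + frame_shape,
                            chunks=(1,) + frame_shape, dtype="i4")
    keys = f.create_dataset("keys", shape=(0,), maxshape=(None,),
                            chunks=(1,), dtype="i4", fillvalue=0)
    f.swmr_mode = True  # from here on, readers may open the file with swmr=True

    for i in range(n_frames):
        # 1. Write and flush the frame itself.
        data.resize(i + 1, axis=0)
        data[i] = np.full(frame_shape, i, dtype="i4")
        data.flush()
        # 2. Only then mark the frame as readable with a non-zero key.
        keys.resize(i + 1, axis=0)
        keys[i] = i + 1
        keys.flush()
```

Flushing the key only after the frame means a reader that sees a non-zero key can rely on the matching frame already being on disk.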

Finished Dataset

Because HDF5 datasets can be expanded, it can be difficult to tell whether a file is complete or whether more data is still to be written. The swmr_tools library uses a timeout to decide when to stop, and this can be paired with a finished dataset. The finished dataset is a single integer dataset whose value is zero while the file is still being written and non-zero once the file is complete. This allows a long timeout to be used without wasting time waiting once the file is complete.
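The interplay between the timeout and the finished dataset can be summarised as a small decision function. This is a sketch of the idea only; should_stop and its parameters are hypothetical names, not part of swmr_tools.

```python
def should_stop(seconds_since_last_frame, timeout, finished_value=0):
    """Return True when a reader should stop waiting for further frames."""
    if finished_value != 0:
        # The writer has marked the file complete: no need to wait out the timeout.
        return True
    # Without a finished flag, the timeout is the only stopping criterion.
    return seconds_since_last_frame > timeout


# With a 60 second timeout the reader keeps waiting after 2 seconds of silence...
still_waiting = not should_stop(2.0, timeout=60)
# ...unless the finished dataset already holds a non-zero value.
stops_early = should_stop(2.0, timeout=60, finished_value=1)
```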

Basic Use

The DataSource class is the simplest way to interact with a live SWMR file. The DataSource is an iterator that provides a map of data for each frame.

The DataSource class requires two arguments:

  • A list of key datasets.

  • The datasets containing the data you wish to process (the example below passes a dictionary mapping each dataset path to its h5py dataset).

The DataSource also accepts two optional arguments: timeout, which defaults to 10 seconds, and finished_dataset, which is the finished dataset described above.

The DataSource works out the dimensions of the frame (whether scalar, vector or image) by looking at the difference between the rank of the key and data datasets. It assumes that the data is written row-major and the data frames are in the fastest dimensions.
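To make the row-major assumption concrete, the sketch below converts a flat frame index into one length-1 slice per scan dimension using NumPy. The helper name index_to_slices is hypothetical and this is not the library's own implementation; it only illustrates the ordering for a 2 x 2 grid scan.

```python
import numpy as np


def index_to_slices(index, scan_shape):
    """Convert a flat, row-major frame index into one length-1 slice per scan dimension."""
    position = np.unravel_index(index, scan_shape)  # C (row-major) order by default
    return tuple(slice(int(p), int(p) + 1) for p in position)


scan_shape = (2, 2)                        # a 2 x 2 grid scan
data = np.zeros(scan_shape + (5, 10))      # frames of shape (5, 10) in the fastest dimensions

slices = index_to_slices(1, scan_shape)    # second frame: first row, second column
frame = data[slices]                       # scan dimensions are kept with length 1
```

Because the frame occupies the fastest dimensions, indexing with only the scan slices returns the whole frame, here with shape (1, 1, 5, 10).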

Reading Data

As an example, we will create two small datasets (of the same size but containing different values) and a corresponding key dataset. This represents a 2 x 2 grid scan of a detector with shape [5,10]. The keys are all non-zero, so we should expect to receive every frame of the dataset.

from swmr_tools import DataSource
import h5py
import numpy as np

# Create a small dataset to extract frames from
data_1 = np.random.randint(low=-10000, high=10000, size=(2, 2, 5, 10))
data_2 = np.random.randint(low=-10000, high=10000, size=(2, 2, 5, 10))
keys_1 = np.arange(1, 5).reshape(2, 2, 1, 1)

# Save the data to an HDF5 file
with h5py.File("example.h5", "w", libver="latest") as f:
    f.create_group("keys")
    f.create_group("data")
    f["keys"].create_dataset("keys_1", data=keys_1)
    f["data"].create_dataset("data_1", data=data_1)
    f["data"].create_dataset("data_2", data=data_2)

Then we simply set up a DataSource pointing at the keys and datasets and let it run:

with h5py.File("example.h5", "r") as f:
    keys = [f["/keys/keys_1"]]
    datasets = {"/data/data_1" : f["/data/data_1"],
                "/data/data_2" : f["/data/data_2"]}
    ds = DataSource(keys,datasets)

    for data_map in ds:
        frame = data_map["/data/data_1"]
        print(data_map.slice_metadata)
        print(str(frame))

(slice(0, 1, None), slice(0, 1, None))
[[[[ 3980 -3645 -5966  8665   360  1863  7697  -769 -5559 -2142]
   [ 4588 -9254  8550 -1948  1172  -886  5600 -4307 -3488  2684]
   [ 6961 -6236 -4299 -7908  4577  4358 -6297 -8586 -4147 -3344]
   [ 7149 -2261  1190 -6692  -828  4310  5177 -1239  8868 -4319]
   [ 2442  5367 -1959  6815  5524 -2185 -2171 -8405 -2000 -6897]]]]
(slice(0, 1, None), slice(1, 2, None))
[[[[-4746  9432  4913 -7990 -7969   508 -4400 -4904   749 -1777]
   [-5639 -6433   214 -9282   951 -9444  3568   147 -3306  3393]
   [-9036 -9871 -9149  3938 -4487  9919  -170  5348  3916   289]
   [-3024   237  6456  8663  3531  8984 -3129  9678  3566  1306]
   [ 1891 -6206  9541 -4270 -7572 -6388 -1389  7990 -9341  8785]]]]
(slice(1, 2, None), slice(0, 1, None))
[[[[ 5964  6778 -1285 -4820  1111  5613 -3506 -2496 -6278  2581]
   [ 5037 -1065 -5667  1903  -311 -3747  1912  8773  1429   459]
   [ 4058  6380 -8450 -6520  7715  2446  8190 -6177 -9543  5414]
   [-6701  -870 -7936 -1994  9943  7053  9467 -5751 -7643  1843]
   [ 5033  4083  4520 -3509  9507  1576  9728 -1245  3678 -9098]]]]
    ...

The data (as NumPy arrays) can be accessed from the data_map for each point using the dataset path as a key in the map. The slice_metadata attribute on the data_map shows the slice the data was taken from.

The slice_metadata can be used to write processed data into a new HDF5 dataset, and the DataSource class has some convenience methods to help with this.
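As an illustration of how slice_metadata locates a result within the scan, the sketch below writes a per-frame result into an output array. A NumPy array stands in for the output HDF5 dataset, and the processing step (a sum along each detector row) is an arbitrary choice made for the example.

```python
import numpy as np

scan_shape = (2, 2)
frame = np.arange(50).reshape(1, 1, 5, 10)    # one frame, shaped as in the output above
slice_metadata = (slice(0, 1), slice(1, 2))   # the scan position that frame came from

# Hypothetical processing: sum along each detector row, giving shape (5,).
result = frame.sum(axis=-1).squeeze()

# Output array: scan dimensions first, then the shape of a single result.
output = np.zeros(scan_shape + result.shape, dtype=result.dtype)
output[slice_metadata] = result               # broadcasts into the (1, 1, 5) target region
```

The same indexing pattern applies to an h5py dataset of matching shape, since h5py datasets accept tuples of slices in the same way.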

Writing Data

The DataSource class has two methods to assist with writing processed data back into an HDF5 file:

ds.create_dataset(result_data,file_handle,hdf5_path)

which creates a new HDF5 dataset, with the correct type and shape for the result_data NumPy array, and:

ds.append_data(result_data,slice_metadata,output_dataset)

which writes each subsequent result into that HDF5 dataset at the position given by slice_metadata.

Advanced Use

The DataSource class is designed to be simple, and as a result it may not suit every method of processing (for example, if for performance reasons you don't want to read every frame, or only want to read a region of each frame).

For more complicated use cases the KeyFollower and FrameReader classes can be used.

KeyFollower

The KeyFollower is the most fundamental class in swmr_tools; it follows the key datasets and reports the highest index for which all the key datasets are non-zero.

As an example, we will create a dataset of non-zero integers, representing a complete set of scans all flushed to disk.
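The idea of "the highest index for which all the key datasets are non-zero" can be sketched with NumPy arrays standing in for the key datasets. This is not the KeyFollower implementation: highest_readable_index is a hypothetical helper, and it assumes frames are written in order, so it stops at the first index where any key is still zero.

```python
import numpy as np


def highest_readable_index(key_arrays):
    """Return the last flat index before the first zero key, or -1 if no frame is ready."""
    flat = [np.asarray(k).ravel() for k in key_arrays]   # row-major flattening
    n = min(len(k) for k in flat)                         # key datasets may differ in length mid-scan
    ready = np.all([k[:n] != 0 for k in flat], axis=0)    # ready only if every key is set
    not_ready = np.flatnonzero(~ready)
    return int(not_ready[0]) - 1 if not_ready.size else n - 1


keys_a = np.array([1, 2, 3, 0])
keys_b = np.array([1, 2, 0, 0])
partial = highest_readable_index([keys_a, keys_b])              # frame 2 is not ready in keys_b
complete = highest_readable_index([np.arange(8).reshape(2, 4, 1, 1) + 1])
```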

import h5py
from swmr_tools import KeyFollower
import numpy as np

# Create a sequential array of the numbers 1-8 and reshape it into an array
# of shape (2,4,1,1)
complete_key_array = np.arange(8).reshape(2,4,1,1) + 1

We then create an empty HDF5 file, create a group called “keys”, and create a dataset in that group called “key_1” to hold our array of non-zero keys.

with h5py.File("test_file.h5", "w", libver = "latest") as f:
    f.create_group("keys")
    f["/keys"].create_dataset("key_1", data = complete_key_array)

Next, we shall create an instance of the KeyFollower class and demonstrate a simple example of its use. At a minimum we must pass the key datasets we wish to read from.

Shown below is an example of using an instance of KeyFollower within a for loop, as you would with any standard iterable object. For this basic example of a dataset containing only non-zero values, the loop runs 8 times and stops as expected.

# Using an instance of KeyFollower in a for loop
with h5py.File("test_file.h5", "r", swmr=True) as f:
    keys = [f["/keys/key_1"]]
    kf = KeyFollower(keys)
    for key in kf:
        print(key)

0
1
2
3
4
5
6
7

As with the DataSource, the timeout and finished_dataset can be set on construction of the KeyFollower.

Running the KeyFollower should not be computationally expensive, because all of the key datasets should be relatively small, allowing the KeyFollower to follow a very rapid scan.

The DataSource class is just a KeyFollower that uses a FrameReader to read a frame from each requested dataset. The FrameReader class can also be used outside the DataSource.

FrameReader

The FrameReader class is constructed using the dataset to read and the rank of the scan (1 for a stack of images, 2 for a grid scan, and so on).

The read_frame(index) method then reads the frame corresponding to the given index, which can be provided by the KeyFollower.
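For the case mentioned earlier of reading only a region of each frame, the same index-to-position mapping can be applied by hand. The sketch below is not the FrameReader implementation: read_region is a hypothetical helper, and a NumPy array stands in for the h5py dataset (h5py datasets accept the same integer and slice indexing).

```python
import numpy as np


def read_region(dataset, index, scan_rank, region):
    """Read part of the frame at a flat, row-major scan index.

    `region` is a tuple of slices over the frame dimensions.
    """
    scan_shape = dataset.shape[:scan_rank]
    position = tuple(int(p) for p in np.unravel_index(index, scan_shape))
    return dataset[position + tuple(region)]


data = np.arange(2 * 2 * 5 * 10).reshape(2, 2, 5, 10)   # 2 x 2 grid scan of (5, 10) frames
patch = read_region(data, index=3, scan_rank=2, region=(slice(0, 2), slice(0, 3)))
```

Here the index would come from the KeyFollower loop, so only frames already known to be readable are touched.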
