The purpose of this technical note is to help HDF5 users with troubleshooting problems with HDF5 Filters, especially with compression filters. The document assumes that the reader knows HDF5 basics and is aware of the compression feature in HDF5.
One of the most powerful features of HDF5 is the ability to modify, or “filter,” data during I/O. Filters provided by the HDF5 Library, known as “predefined filters”, include several types of data compression, data shuffling, and checksum creation. Users can implement their own “user-defined filters” and use them with the HDF5 Library.
By far the most common user-defined filters are ones that perform data compression. While the programming model and usage of the compression filters are straightforward, it is easy, especially for novice users, to overlook important details when implementing compression filters and to end up with data that is not modified as they would expect.
The purpose of this document is to describe how to diagnose situations where the data in a file is not compressed as expected.
Sometimes users may find that HDF5 data was not compressed in a file or that the compression ratio is very small. By themselves, these results do not mean that compression did not work or did not work well. These results suggest that something might have gone wrong when a compression filter was applied. How can users determine the true cause of the problem?
There are two major reasons why a filter did not produce the desired result: it was not applied, or it was not effective.
If a filter was not applied at all, then it was not included at compile time when the library was built or was not found at run time for dynamically loaded filters.
The absence or presence of HDF5 predefined filters can be confirmed by examining the installed HDF5 files or by using HDF5 API calls. The absence or presence of all filter types can be confirmed by running HDF5 command-line tools on the produced HDF5 files. See If a Filter Was Not Applied for more information.
The effectiveness of compression filters is a complex matter and is only briefly covered in this document. See If a Compression Filter Was Not Effective for more information. That section gives a short overview of the problem and provides an example in which the advantages of different compression filters and their combinations are shown.
This section discusses how it may happen that a compression filter is not available to an application and describes the behavior of the HDF5 Library in the absence of the filter. Then we walk through how to troubleshoot the problem by checking the HDF5 installation, by examining what an application can do at run time to see if a filter is available, and by using some HDF5 command line tools to see if a filter was applied.
Note that there are internal predefined filters:
Filter | Description
---|---
H5Z_FILTER_DEFLATE | The gzip compression, or deflation, filter
H5Z_FILTER_SZIP | The SZIP compression filter
H5Z_FILTER_NBIT | The N-bit compression filter
H5Z_FILTER_SCALEOFFSET | The scale-offset compression filter
H5Z_FILTER_SHUFFLE | The shuffle algorithm filter
H5Z_FILTER_FLETCHER32 | The Fletcher32 checksum, or error checking, filter
These are enabled by default by both configure and CMake builds. While these filters can be disabled intentionally with the configure flag --disable-filters, disabling them is not recommended. The discussion and the examples in this document focus on compression filters, but everything said can be applied to other missing internal filters as well.
The HDF5 Library uses external libraries for data compression. The two predefined compression methods are gzip and szip (or libaec), and these can be requested at HDF5 Library configuration time (compile time). User-defined compression filters and the corresponding libraries are usually linked with an application or provided as a dynamically loaded library.
Note that the libaec library is a replacement for the original szip library. The libaec library is a freely available, open-source library that provides compression and decompression functionality and is compatible with the szip filter. The libaec library can be used as a drop-in replacement for szip, but it requires two libraries to be present on the system: libaec.a(so,dylib,lib) and libsz.a(so,dylib,lib). Everywhere in this document, the term szip refers to both the szip filter and the libaec library.
gzip and szip require the libz.a(so,dylib,lib) and libsz.a(so,dylib,lib)/libaec.a(so,dylib,lib) libraries, respectively, to be present on the system and to be enabled during HDF5 configuration with this autotools configure command:
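A representative invocation (the installation paths are illustrative):

```
$ ./configure --with-zlib=/path/to/zlib --with-szlib=/path/to/szip --prefix=/usr/local/hdf5
```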
There is one important difference in the behavior of GNU Autotools configure between gzip and szip. On Unix systems, gzip compression is enabled automatically if the zlib library is present on the system in default locations, without explicitly specifying --with-zlib=/path. For example, if libz.so is installed under /usr/lib with the header under /usr/include, or under /usr/local/lib with the header under /usr/local/include, the following HDF5 configure command will find the gzip library and will configure the compression filter in:
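A minimal sketch (the prefix is illustrative); note that no --with-zlib option is needed:

```
$ ./configure --prefix=/usr/local/hdf5
```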
With GNU Autotools, configure will not fail if the libraries supporting the requested compression method are not found, for example, because a specified path was not correct or the library is missing.
Or with the corresponding CMake configure command:
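A representative invocation; the source and build paths are illustrative, and the option names are those used later in this document:

```
$ cmake -C ../hdf5/config/cmake/cacheinit.cmake \
      -DHDF5_ENABLE_ZLIB_SUPPORT:BOOL=ON \
      -DHDF5_ENABLE_SZIP_SUPPORT:BOOL=ON ../hdf5
```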
With CMake, both libraries have to be explicitly enabled. The source code distribution’s config/cmake/cacheinit.cmake file will enable both filters along with setting other options. Users can override the defaults by passing -DHDF5_ENABLE_SZIP_SUPPORT:BOOL=OFF -DHDF5_ENABLE_ZLIB_SUPPORT:BOOL=OFF to the “cmake -C” command. See the INSTALL_CMake.txt file under the release_docs directory in the HDF5 source distribution.
If compression is not requested or found at configuration time, the compression method is not registered with the library and cannot be applied when data is written or read. For example, the h5repack tool will not be able to remove an szip compression filter from a dataset if the szip library was not configured into the library against which the tool was built. The next section discusses the behavior of the HDF5 Library in the absence of filters.
By design, the HDF5 Library allows applications to create and write datasets using filters that are not available at creation/write time. This feature makes it possible to create HDF5 files on one system and to write data on another system where the HDF5 Library is configured with or without the requested filter.
Let’s recall the HDF5 programming model for enabling filters. An HDF5 application uses one or more H5Pset_<filter> calls to configure a dataset’s filter pipeline at its creation time. The excerpt below shows how a gzip filter is added to a pipeline with H5Pset_deflate.
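A minimal sketch based on the h5ex_d_gzip.c example cited later in this document; the file and space identifiers are assumed to have been created earlier, and error checking is omitted:

```c
hsize_t chunk[2] = {5, 9};

/* Filters operate on chunked data, so a chunked layout must be set first. */
hid_t  dcpl   = H5Pcreate(H5P_DATASET_CREATE);
herr_t status = H5Pset_chunk(dcpl, 2, chunk);

/* Add the gzip (deflate) filter, compression level 9, to the pipeline. */
status = H5Pset_deflate(dcpl, 9);

hid_t dset = H5Dcreate(file, "DS1", H5T_STD_I32LE, space,
                       H5P_DEFAULT, dcpl, H5P_DEFAULT);
```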
For all internal filters (shuffle, fletcher32, scaleoffset, and nbit) and the external gzip filter, the HDF5 Library does not check to see if the filter is registered when the corresponding H5Pset_<filter> function is called. The only exception to this rule is H5Pset_szip, which will fail if szip was not configured in or was configured with a decoder only. Hence, in the example above, H5Pset_deflate will succeed. The specified filter will be added to the dataset’s filter pipeline and will be applied to any data written to this dataset.
When H5Pset_<filter> is called, a record for the filter is added to the dataset’s object header in the file, and information about the filter can be queried with the HDF5 APIs and displayed by HDF5 tools such as h5dump. The presence of filter information in a dataset’s header does not mean that the filter was actually applied to the dataset’s data, as will be explained later in this document. See How to Use HDF5 Tools to Investigate Missing Compression Filters for more information on how to use h5ls and h5debug to determine if the filter was actually applied.
The success of further write operations to a dataset when filters are missing depends on the filter type.
By design, an HDF5 filter can be optional or required. This filter mode defines the behavior of the HDF5 Library during write operations. In the absence of an optional filter, H5Dwrite calls will succeed and data will be written to the file, bypassing the filter. A missing required filter will cause H5Dwrite calls to fail. Clearly, H5Dread calls will fail when filters that are needed to decode the data are missing.
The HDF5 Library has only one required internal filter, Fletcher32 (checksum creation), and one required external filter, szip. As mentioned earlier, only the szip compression (H5Pset_szip) will flag the absence of the filter. If, despite the missing filter, an application goes on to create a dataset via H5Dcreate, the call will succeed, but the szip filter will not be added to the filter pipeline. This behavior is different from all other filters that may not be present, but will be added to the filter pipeline and applied during I/O. See the Using HDF5 APIs section for more information on how to determine if a filter is available and to avoid writing data while the filter is missing.
Developers who create their own filters should use the flags parameter of H5Pset_filter to declare whether the filter is optional or required, as sketched below. The filter mode can be determined later by calling H5Pget_filter and checking the value of the flags parameter.
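A hedged sketch; the filter ID and client data below are placeholders for a user-defined filter, and dcpl is a dataset creation property list created earlier:

```c
/* Placeholder ID for a user-defined filter; production filter IDs are
 * assigned by The HDF Group. */
#define MY_FILTER_ID 256

unsigned int cd_values[1] = {4}; /* example client data passed to the filter */

/* Required mode: H5Dwrite fails if the filter is missing. */
H5Pset_filter(dcpl, MY_FILTER_ID, H5Z_FLAG_MANDATORY, 1, cd_values);

/* Optional mode: data is written uncompressed if the filter is missing. */
/* H5Pset_filter(dcpl, MY_FILTER_ID, H5Z_FLAG_OPTIONAL, 1, cd_values); */
```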
For more information on filter behavior in HDF5, see HDF5 Filters.
The previous section described how the HDF5 Library could be configured without certain compression filters and the resulting expected library behavior.
The following subsections explain how to determine if a compression method is configured in the HDF5 Library and how to avoid accessing data if the filter is missing.
To see how the library was configured and built, users should examine the hdf5lib.settings text file found in the lib directory of the HDF5 installation point and search for the lines that contain the “I/O filters” string. The hdf5lib.settings file is automatically generated at configuration time when the HDF5 Library is built with configure on Unix or with CMake on Unix and Windows, and it should contain the following lines:
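An illustrative excerpt; the exact formatting varies by release:

```
         I/O filters (external): deflate(zlib),szip(encoder)
         I/O filters (internal): shuffle,fletcher32,nbit,scaleoffset
```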
The same lines in the file generated by CMake look slightly different:
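A sketch of how the CMake-generated lines might look; the exact wording depends on the release:

```
         I/O filters (external):  DEFLATE DECODE ENCODE
```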
“ENCODE DECODE” indicates that both the szip compression encoder and decoder are present. This inconsistency between the configure- and CMake-generated files will be removed in a future release. These lines show the compression libraries configured with HDF5. Here is an example of the same output when external compression filters are absent:
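An illustrative sketch; the external filters line simply lists no filters:

```
         I/O filters (external):
         I/O filters (internal): shuffle,fletcher32,nbit,scaleoffset
```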
Depending on the values listed on the I/O filters (external) line, users will be able to tell if their HDF5 files are compressed appropriately. If szip is not included in the build, data files will not be compressed with szip. If gzip is not included in the build and is not installed on the system, then data files will not be compressed with gzip.
If the hdf5lib.settings file is not present on the system, then users can examine a public header file or the library binary file to find out if a filter is present, as is discussed in the next two sections.
To see if a filter is present, users can also inspect the HDF5 public header file installed under the include directory of the HDF5 installation point. If the compression and internal filters are present, the corresponding symbols will be defined as follows:
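For example, in H5pubconf.h (some releases also define H5_HAVE_FILTER_* symbols for the internal filters):

```c
#define H5_HAVE_FILTER_DEFLATE 1
#define H5_HAVE_FILTER_SZIP 1
```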
If a compression or internal filter was not configured, the corresponding lines will be commented out as follows:
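An illustrative sketch of the commented-out form:

```c
/* #undef H5_HAVE_FILTER_DEFLATE */
/* #undef H5_HAVE_FILTER_SZIP */
```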
The HDF5 Library’s binary contains summary output similar to what is stored in the hdf5lib.settings file. Users can run the Unix “strings” command to get information about the configured filters:
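For example (the library path is illustrative, and the output is abbreviated):

```
$ strings /usr/local/hdf5/lib/libhdf5.a | grep "I/O filters"
         I/O filters (external): deflate(zlib),szip(encoder)
         I/O filters (internal): shuffle,fletcher32,nbit,scaleoffset
```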
When compression filters are not configured, the output of the command above will be:
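Again as an illustrative sketch:

```
$ strings /usr/local/hdf5/lib/libhdf5.a | grep "I/O filters"
         I/O filters (external):
         I/O filters (internal): shuffle,fletcher32,nbit,scaleoffset
```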
On Windows, one can use the dumpbin /all command, and then view and search the output for strings like DEFLATE, FLETCHER32, DECODE, and ENCODE.
Developers can also use the compiler scripts such as h5cc to verify that a compression library is present and configured in. Use the -show option with any of the compiler scripts found in the bin subdirectory of the HDF5 installation directory. The presence of the -lsz and -lz options among the linker flags will confirm that szip or gzip was compiled with the HDF5 Library. See the sample below.
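An abbreviated, illustrative sample; the exact compile line depends on the compiler and installation:

```
$ h5cc -show
gcc ... -L/usr/local/hdf5/lib ... -lhdf5_hl -lhdf5 -lsz -lz -ldl -lm ...
```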
CMake users can check the hdf5-config.cmake file in the CMake installation directory. The file will indicate what options were used to configure the HDF5 Library. The variables in the "User Options" section can be used by developers programmatically to determine if a filter was configured in.
After calling find_package(HDF5), CMake code can test the setting of these variables as shown below:
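A minimal sketch, assuming the HDF5_ENABLE_ZLIB_SUPPORT and HDF5_ENABLE_SZIP_SUPPORT option names mentioned earlier are the ones exported by hdf5-config.cmake:

```cmake
find_package(HDF5 REQUIRED)

# These variables come from the "User Options" section of hdf5-config.cmake.
if (HDF5_ENABLE_ZLIB_SUPPORT)
  message(STATUS "HDF5 was configured with gzip (deflate) support")
endif ()
if (HDF5_ENABLE_SZIP_SUPPORT)
  message(STATUS "HDF5 was configured with szip support")
endif ()
```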
Applications can check filter availability at run time. In order to check the filter’s availability with the HDF5 Library, users should know the filter identifier (for example, H5Z_FILTER_DEFLATE) and call the H5Zfilter_avail function as shown in the example below. Use H5Zget_filter_info to determine if the filter is configured to decode data, to encode data, neither, or both.
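A minimal sketch checking the gzip filter; the same pattern applies to any filter identifier:

```c
#include <stdio.h>
#include "hdf5.h"

int
main(void)
{
    /* Check that the gzip (deflate) filter is registered with the library. */
    htri_t avail = H5Zfilter_avail(H5Z_FILTER_DEFLATE);
    if (avail <= 0) {
        printf("gzip filter is not available\n");
        return 1;
    }

    /* Check that the filter can encode (compress), not just decode. */
    unsigned int filter_info = 0;
    H5Zget_filter_info(H5Z_FILTER_DEFLATE, &filter_info);
    if (!(filter_info & H5Z_FILTER_CONFIG_ENCODE_ENABLED)) {
        printf("gzip filter is available but cannot encode data\n");
        return 1;
    }

    printf("gzip filter is available for encoding\n");
    return 0;
}
```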
H5Zfilter_avail can be used to find filters that are registered with the library or are available via dynamically loaded libraries. For more information, see Using Dynamically-Loadable Filters.
Currently there is no HDF5 API call to retrieve a list of all of the registered or dynamically loaded filters. The default installation directories for HDF5 dynamically loaded filters are /usr/local/hdf5/lib/plugin on Unix and %ALLUSERSPROFILE%\hdf5\lib\plugin on Windows. Users can also check to see if the environment variable HDF5_PLUGIN_PATH is set on the system and refers to a directory with available plugins.
In this section, we will use the h5dump, h5ls, and h5debug command-line utilities to see if a file was created with an HDF5 Library that did or did not have a compression filter configured in. For more information on these tools, see the Command Line Tools for HDF5 Files page in the HDF5 User Guide.
The h5dump command-line tool can be used to see if a file uses a compression filter. The tool has two flags that limit the output: the -p flag causes dataset properties, including compression filters, to be displayed, and the -H flag suppresses the output of data. The program provided in the Command Line Tools for HDF5 Files section creates a file called h5ex_d_gzip.h5. The output of h5dump shows that the gzip compression filter set to level 9 was added to the DS1 dataset filter pipeline at creation time.
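An illustrative sketch of the relevant portion of the output, consistent with the sizes discussed below but not captured verbatim:

```
$ h5dump -p -H h5ex_d_gzip.h5
HDF5 "h5ex_d_gzip.h5" {
GROUP "/" {
   DATASET "DS1" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 32, 64 ) / ( 32, 64 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 5, 9 )
         SIZE 5018 (1.633:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 9 }
      }
   }
}
}
```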
The output also shows the compression ratio, defined as (original size)/(storage size). The size of the stored data is 5018 bytes vs. 8192 bytes of uncompressed data, a ratio of 1.633. This shows that the filter was successfully applied.
Now let’s look at what happens when the same program is linked against an HDF5 Library that was not configured with the gzip library.
Notice that some chunks are only partially filled: 56 chunks (7 along the first dimension and 8 along the second dimension) are required to store the data. Since no compression was applied, each chunk has size 5 x 9 x 4 = 180 bytes, resulting in a total storage size of 10,080 bytes. With an original size of 8192 bytes, the compression ratio is 0.813 (in other words, less than 1), as is visible in the output below:
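An illustrative excerpt of the h5dump output for this case:

```
      STORAGE_LAYOUT {
         CHUNKED ( 5, 9 )
         SIZE 10080 (0.813:1 COMPRESSION)
      }
      FILTERS {
         COMPRESSION DEFLATE { LEVEL 9 }
      }
```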
As discussed in the How Does the HDF5 Library Behave in the Absence of a Filter section, the presence of a filter in an object’s filter pipeline does not imply that it will be applied unconditionally when data is written.
If the compression ratio is less than 1, compression was not applied. If it is 1, and a compression filter is shown by h5dump, more investigation is needed; this will be discussed in the next section.
Filters operate on chunked datasets. A filter may be ineffective for one chunk (for example, the compressed data is bigger than the original data), and succeed on another. How can users discern if a filter is missing or just ineffective (and as a result non-compressed data was written)? The h5ls and h5debug command-line tools can be used to investigate the issue.
First, let’s take a look at what kind of information h5ls displays about the dataset DS1 in our example file, which was written with an HDF5 library that has the deflate filter configured in:
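An illustrative sketch of the h5ls output; the address and utilization figure match the numbers used in this document:

```
$ h5ls -v h5ex_d_gzip.h5
Opened "h5ex_d_gzip.h5" with sec2 driver.
DS1                      Dataset {32/32, 64/64}
    Location:  1:800
    Links:     1
    Chunks:    {5, 9} 180 bytes
    Storage:   8192 logical bytes, 5018 allocated bytes, 163.25% utilization
    Filter-0:  deflate-1 OPT {9}
    Type:      native int
```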
We see output similar to the h5dump output, with the compression ratio at 163%.
Now let’s compare this output with another dataset DS1, but this time the dataset was written with a program linked against an HDF5 library without the gzip filter present.
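The corresponding output, again as an illustrative sketch consistent with the sizes discussed above:

```
$ h5ls -v h5ex_d_gzip.h5
Opened "h5ex_d_gzip.h5" with sec2 driver.
DS1                      Dataset {32/32, 64/64}
    Location:  1:800
    Links:     1
    Chunks:    {5, 9} 180 bytes
    Storage:   8192 logical bytes, 10080 allocated bytes, 81.27% utilization
    Filter-0:  deflate-1 OPT {9}
    Type:      native int
```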
The h5ls output above shows that the gzip filter was added to the filter pipeline of the dataset DS1. It also shows that the compression ratio is less than 1. We can confirm by using h5debug that the filter was not applied at all, and, as a result of the missing filter, the individual chunks were not compressed.
From the h5ls output we know that the dataset object header is located at address 800. We retrieve the dataset object header at address 800 and search the layout message for the address of the chunk index B-tree as shown in the excerpt of the h5debug output below:
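A sketch of this step; the B-tree address 1400 is illustrative, and the exact message labels in the output vary between library versions:

```
$ h5debug h5ex_d_gzip.h5 800
...
   Message 4: layout
      ...
      B-tree address: 1400
...
```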
Now we can retrieve the B-tree information:
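Another sketch; for a chunk B-tree, h5debug needs the dataset dimensionality as an extra argument (check the h5debug usage message), and the field labels shown are illustrative. The chunk size and filter mask values are the ones discussed below:

```
$ h5debug h5ex_d_gzip.h5 1400 2
...
      Size of raw data chunk: 180
      Filter mask:            0x00000001
...
```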
We see that the size of each chunk is 180 bytes; in other words, compression was not successful. The filter mask value 0x00000001 indicates that the filter was not applied. For more information on the filter mask, see the III.A.1. Disk Format: Level 1A1 - Version 1 B-trees section in the HDF5 File Format Specification.
The example program used to create the file discussed in this document is a modified version of the program available at h5ex_d_gzip.c. It was modified so that the chunk dimensions are not factors of the dataset dimensions. The chunk dimensions were chosen for demonstration purposes only and are not recommended for real applications.
There is no “one size fits all” compression filter solution. Before committing to a compression filter, users have to consider a number of characteristics: the type of data, the desired compression ratio, the encoding/decoding speed, the general availability of the filter, and licensing, among other issues. This is especially true for data producers. The way data is written will affect, to name a few issues, how much bandwidth consumers will need to download data products, how much system memory and time will be required to read the data, and how many data products can be stored on the users’ systems. Users should plan on experimenting with various compression filters and settings to find the best compression filter for their data; the h5repack tool can be used for such experiments.
Users should also look beyond compression. An HDF5 file may contain a substantial amount of unused space. The h5stat tool can be used to determine if space is used efficiently in an HDF5 file, and the h5repack tool can be used to reduce the amount of unused space in an HDF5 file. See An Alternative to Compression for more information.
An extensive comparison of different compression filters is outside the scope of this document. However, it is easy to show that, unless a suitable compression method or an advantageous filter combination is chosen, applying the same compression filter to different types of data may not reduce HDF5 file size as much as possible.
For example, we looked at a NASA weather data product file packaged with its geolocation information (the file name is GCRIOREDRO_npp_d20030125_t0702533_e0711257_b00993_c20140501163427060570_XXXX_XXX.h5) and used h5repack to apply three different compressions to the original file:
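The three variants can be produced with h5repack commands along these lines; original.h5 stands for the long GCRIOREDRO file name above, and the output file names are illustrative:

```
$ h5repack -f GZIP=7 original.h5 gzip7.h5
$ h5repack -f SZIP=32,NN original.h5 szip_nn32.h5
$ h5repack -f SHUF -f GZIP=7 original.h5 shuf_gzip7.h5
```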
Then we compared the sizes of the 32-bit floating-point dataset /All_Data/CrIMSS-EDR-GEOTC_All/Height under the different types of compression, and did the same for the 32-bit integer dataset /All_Data/CrIMSS-EDR_All/FORnum. The results, expressed as compression ratios relative to the original storage size, are shown in the table below.
Data | Original | gzip Level 7 | szip Using NN Mode and Blocksize 32 | Shuffle and gzip Level 7
---|---|---|---|---
32-bit Floats | 1 | 2.087 | 1.628 | 2.56
32-bit Integers | 1 | 3.642 | 10.832 | 38.20
The combination of the shuffle filter and gzip compression level 7 worked well on both floating point and integer datasets, as shown in the fifth column of the table above. gzip compression worked better than szip on the floating point dataset, but not on the integer dataset as shown by the results in columns three and four. Clearly, if the objective is to minimize the size of the file, datasets with different types of data have to be compressed with different compression methods.
For more information on the shuffle filter, see the Data Pipeline Filters section in the HDF5 Datasets chapter of the HDF5 User Guide. See also the Property Lists (H5P) in the HDF5 Reference Manual for the H5Pset_shuffle function call entry.
Sometimes HDF5 files contain unused space. The h5repack command-line tool can be used to reduce the amount of unused space in a file without changing any storage parameters of the data. For example, running h5stat on the file GCRIOREDRO_npp_d20030125_t0702533_e0711257_b00993_c20140501163524579819_XXXX_XXX.h5 shows:
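An illustrative sketch of the relevant part of the output; the original byte counts are not reproduced here, so placeholders are shown. The line to watch is "Unaccounted space":

```
$ h5stat GCRIOREDRO_npp_d20030125_t0702533_e0711257_b00993_c20140501163524579819_XXXX_XXX.h5
...
Summary of file space information:
  File metadata: ... bytes
  Raw data: ... bytes
  Unaccounted space: ... bytes
Total space: ... bytes
```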
After running h5repack, the file shows a 10-fold reduction in unaccounted space:
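Again as a sketch, with an arbitrary output file name and placeholders for the byte counts:

```
$ h5repack GCRIOREDRO_npp_d20030125_t0702533_e0711257_b00993_c20140501163524579819_XXXX_XXX.h5 repacked.h5
$ h5stat repacked.h5
...
Summary of file space information:
  ...
  Unaccounted space: ... bytes
...
```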
There is also a small reduction in file metadata space. For more information on h5repack and h5stat, see the Command Line Tools for HDF5 Files page in the HDF5 User Guide.
See the following documents published by The HDF Group for more information.