HDF5.jl variable length string

Hello everyone!

I am currently trying to translate Python code to Julia. Broken down, what I am trying to do is create a dataset where dtype=“String”. I know that to do that in Python you have to declare a special datatype from the “h5py” package, vlen.

In Julia, I first tried simply using “String”, like so:
create_dataset(name,"my_dataset", String, (1,) ) #dt

which gave me the following error:
Type Symbol does not have a definite size.

Here is the Stacktrace:


ERROR: Type Symbol does not have a definite size.
Stacktrace:
  [1] sizeof(x::Type)
    @ Base .\essentials.jl:473
  [2] hdf5_type_id(#unused#::Type{Symbol}, isstruct::Val{true})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\typeconversions.jl:71
  [3] hdf5_type_id(#unused#::Type{Symbol})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\typeconversions.jl:69
  [4] hdf5_type_id(#unused#::Type{Core.TypeName}, isstruct::Val{true})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\typeconversions.jl:74
  [5] hdf5_type_id(#unused#::Type{Core.TypeName})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\typeconversions.jl:69
  [6] hdf5_type_id(#unused#::Type{DataType}, isstruct::Val{true})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\typeconversions.jl:74
  [7] hdf5_type_id(#unused#::Type{DataType})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\typeconversions.jl:69
  [8] datatype(#unused#::Type{String})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\typeconversions.jl:66
  [9] create_dataset(parent::HDF5.Group, path::String, dtype::Type, dspace_dims::Tuple{Int64}; pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\datasets.jl:103
 [10] create_dataset(parent::HDF5.Group, path::String, dtype::Type, dspace_dims::Tuple{Int64})
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\datasets.jl:103
 [11] (::var"#182#183"{result, String})(file::HDF5.File)
    @ Main c:\Users\Win10\Desktop\Arbeit\TU\work_env\read_vtk\export.jl:231
 [12] (::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#182#183"{result, String}, HDF5.File})()
    @ HDF5 C:\Users\Win10\.julia\packages\HDF5\HtnQZ\src\file.jl:98
 [13] task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#182#183"{result, String}, HDF5.File}, key::Symbol, val::HDF5.HDF5Context)
    @ Base .\task.jl:292
 [14] h5open(::var"#182#183"{result, String}, ::String, ::Vararg{String}; context::HDF5.HDF5Context, pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})

I tried again with the HDF5.VLen datatype that comes with the package, but had the same error.

Unfortunately, the HDF5.jl package is not very well documented when it comes to datatypes, or at least I didn’t find anything.

Does somebody maybe have an idea what’s wrong here?

Thank you very much,
Thomas

Where is the Symbol? Can you translate your symbol to a string or fixed-length string?

HDF5.jl is for HDF5 things. You might want to look into JLD.jl (uses HDF5.jl) or JLD2.jl (independent HDF5 implementation) to see if that might translate your structure directly.

Hmm…, I see. Something does seem to have gone wrong with converting String to an HDF5 type when we added support for some arbitrary structs.

You can describe a String as follows in the meantime.

julia> const H5String = HDF5.Datatype(HDF5.API.h5t_create(HDF5.API.H5T_STRING, HDF5.API.H5T_VARIABLE))
HDF5.Datatype: H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }

I’m still investigating your issue, but the following is the tested path to get the above to work.

julia> using HDF5

julia> h5open("jltest.h5", "w") do f
           ds = write(f, "string", "Hello world!")
       end

julia> h5open("jltest.h5", "r") do f
           read(f["string"])
       end
"Hello world!"

julia> h5open("jltest.h5", "w") do f
           ds = write(f, "strings", ["Hello", "World!"])
       end

julia> h5open("jltest.h5", "r") do f
           read(f["strings"])
       end
2-element Vector{String}:
 "Hello"
 "World!"

julia> h5f = h5open("jltest.h5")
🗂️ HDF5.File: (read-only) jltest.h5
└─ 🔢 strings

julia> h5f["strings"]
🔢 HDF5.Dataset: /strings (file: jltest.h5 xfer_mode: 0)

julia> h5f["strings"][1]
"Hello"

julia> h5f["strings"][2]
"World!"

julia> close(h5f)

That said, I do think your example should work, so I’m working on it. There are some intricacies here, since it would be good to avoid variable-length or null-terminated strings if possible.

Update, draft pull request here:

I’m hesitating on this because this makes an inefficient path easy to use.

Thank you very much, this works nicely! Is there a list of the supported datatypes and their names? Thank you!

I would take a look at the HDF5 documentation itself:
https://docs.hdfgroup.org/hdf5/v1_12/group___h5_t.html#gaa9afc38e1a7d35e4d0bec24c569b3c65

Hello again!

Unfortunately, I found out that the solution does not work after all. The error went away, but when I try to write to a dataset with write(dset, data), where data is a simple string, I get the following:

H5T__path_find_real: Datatype/Unable to initialize object
     no appropriate function for conversion path

Could you show me a minimum example that produces this error?

Adapting from above, I’ve already demonstrated how to write a simple string to a dataset. This should work with the released HDF5.jl v0.16.14. However, this creates and writes fixed-length strings.

julia> using HDF5

julia> h5open("jltest.h5", "w") do f
           # This will both create the dataset, and write text to it
           ds = write_dataset(f, "dataset_name", "Hi")
       end

We can see that this was successful via h5dump:

$ h5dump jltest.h5
HDF5 "jltest.h5" {
GROUP "/" {
   DATASET "dataset_name" {
      DATATYPE  H5T_STRING {
         STRSIZE 2;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "Hi"
      }
   }
}
}

Note that in this case we specialized the data type to be of size 2. If you are trying to write another value that is larger than size 2, then we will have a problem.
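If you want to check this from Julia rather than with h5dump, here is a minimal sketch. It assumes that HDF5.API.h5t_get_size and HDF5.API.h5t_is_variable_str are exposed by your HDF5.jl version; the scratch file path is made up for illustration.

```julia
using HDF5

path = tempname() * ".h5"  # hypothetical scratch file, not from the thread

# Create the dataset and write the text in one step, as above
h5open(path, "w") do f
    write_dataset(f, "dataset_name", "Hi")
end

# Inspect the datatype that was stored with the dataset
size_in_bytes, is_variable = h5open(path, "r") do f
    dtype = datatype(f["dataset_name"])
    # Assumption: both low-level wrappers exist in HDF5.API
    (Int(HDF5.API.h5t_get_size(dtype)), HDF5.API.h5t_is_variable_str(dtype))
end

println("size = ", size_in_bytes, ", variable length = ", is_variable)
```

A fixed-length type reports the byte size it was created with, so a longer string will not fit into the same dataset without creating it with a different type.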

I’m supposing that you did not do it that way, but instead created a variable length string dataset. It would be helpful if you could show me an example of how you got the error you just got, but I think I can guess.

julia> using HDF5

julia> import HDF5.API: H5S_SCALAR, H5P_DEFAULT

julia> const H5String = HDF5.Datatype(HDF5.API.h5t_create(HDF5.API.H5T_STRING, HDF5.API.H5T_VARIABLE))
HDF5.Datatype: H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }

julia> h5open("jltest.h5", "w") do f
           ds = create_dataset(f, "dataset_name", H5String)
       end
(HDF5.Dataset: (invalid), HDF5.Datatype: H5T_STRING {
      STRSIZE H5T_VARIABLE;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   })

julia> h5open("jltest.h5", "r+") do f
           ds = f["dataset_name"]
           write(ds, "Hello World")
       end
ERROR: HDF5.API.H5Error: Error writing dataset
libhdf5 Stacktrace:
 [1] H5T__path_find_real: Datatype/Unable to initialize object
     no appropriate function for conversion path
  ⋮

Above, HDF5.jl is still trying to write a fixed-length string, but HDF5 gets confused because it needs to convert it to a variable-length string. To force it to write a variable-length string directly, we’ll use a low-level C call to H5Dwrite, which we have wrapped as HDF5.API.h5d_write.

julia> using HDF5

julia> h5open("jltest.h5", "r+") do f
           dset = f["dataset_name"]
           dtype = datatype(dset) # H5String as above
           dspace = dataspace(dset) # HDF5.API.H5S_ALL == 0
           xfer = 0 # HDF5.API.H5P_DEFAULT == 0
           HDF5.API.h5d_write(dset, dtype, dspace, dspace, xfer, Ref{Cstring}(["good bye"]))
       end

julia> h5open("jltest.h5", "r") do f
           f["dataset_name"][]
       end
"good bye"

I can’t get this to work for me. I need to make an attribute and it must, absolutely must have this type:

      GROUP "PointData" {
         ATTRIBUTE "Scalars" {
            DATATYPE  H5T_STRING {
               STRSIZE H5T_VARIABLE;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_UTF8;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
            DATA {
            (0): "PNGImage"
            }
         }

But in HDF5.jl I can only get as far as create_attribute, which gives me this:

      GROUP "PointData" {
         ATTRIBUTE "Scalars" {
            DATATYPE  H5T_STRING {
               STRSIZE H5T_VARIABLE;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
            DATA {
            (0): NULL
            }
         }

And then I am stuck: I cannot overwrite DATA from NULL in any way, as far as I can see. Is there a solution for this? I really want to use Julia for this instead of h5py.

Kind regards

What have you tried so far?


I think I found the solution. The problem was that the help text mentions write_attribute but not HDF5.API.h5a_write, which does work, like this:

    # Create the PointData group and a variable-length string attribute on it
    field_data_group = create_group(vtkhdf_group, "PointData")
    attr = create_attribute(field_data_group, "Scalars", datatype(String), HDF5.dataspace(1))
    HDF5.API.h5a_write(attr, datatype(String), Ref{Cstring}(["PNGImage"]))
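
For anyone who wants to try this end to end, here is a self-contained sketch of the same idea. It builds the variable-length UTF-8 type by hand instead of relying on datatype(String), since that call raised the error at the top of this thread on some versions. It assumes that HDF5.API.h5t_set_cset, HDF5.API.H5T_CSET_UTF8, HDF5.API.h5s_create and read_attribute are available in your HDF5.jl version. The names utf8_vlen and path are made up for illustration.

```julia
using HDF5

# Variable-length string type, marked as UTF-8 (assumption: h5t_set_cset is wrapped)
type_id = HDF5.API.h5t_create(HDF5.API.H5T_STRING, HDF5.API.H5T_VARIABLE)
HDF5.API.h5t_set_cset(type_id, HDF5.API.H5T_CSET_UTF8)
utf8_vlen = HDF5.Datatype(type_id)

path = tempname() * ".h5"  # hypothetical scratch file

h5open(path, "w") do f
    group = create_group(f, "PointData")
    # Scalar dataspace, to match "DATASPACE  SCALAR" in the target dump
    scalar_space = HDF5.Dataspace(HDF5.API.h5s_create(HDF5.API.H5S_SCALAR))
    attr = create_attribute(group, "Scalars", utf8_vlen, scalar_space)
    # Low-level write of a single C string pointer, as in the post above
    HDF5.API.h5a_write(attr, utf8_vlen, Ref{Cstring}(["PNGImage"]))
    close(attr)
end

# Read the attribute back to confirm that DATA is no longer NULL
value = h5open(path, "r") do f
    read_attribute(f["PointData"], "Scalars")
end

println(value)
```

Running h5dump on the resulting file is still the most direct way to confirm that CSET is reported as H5T_CSET_UTF8.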