Trying to read from a HDF5 file and convert to another format in parallel?

Hello!

Using HDF5.jl I have constructed a .h5 file with multiple groups, with the keys, “0”, “1”, “2” up to “401”. I wish to read in data in parallel and convert to another format. I have tried to used such as loop:

        Base.Threads.@threads for (ichunk,inds) ∈ chunks(all_keys; n=Base.Threads.nthreads())
            for iter in inds
                key = all_keys[iter]
                DictVariable = read(fid[key])
                ConvertHDFtoVTP(SimMetaData.SaveLocation * "/" * SimMetaData.SimulationName * "_" * key * ".vtp", DictVariable)
            end
        end

But it segfaults, when I try to run it in parallel. Does anyone have a clear example of how to perform the pattern I want?

Kind regards

HDF5.jl doesn’t use a thread-safe libhdf5 build by default (because it’s incompatible with distributed-memory parallel HDF5, I guess). See Crash when using from multiple threads · Issue #835 · JuliaIO/HDF5.jl · GitHub — that’s required even if you are trying to read/write separate HDF5 files in parallel, see anaconda - Does HDF5 support concurrent reads, or writes to different files? - Stack Overflow

1 Like

Thank you for sharing and oh wauw, that is a real bummer…

I was able to 10x write speeds of my program by going to .h5, but if I now have to spend the remaining time gain converting them, then it matters a lot less…

I tried modding my function to use Reentrantlock but it still does not work unfortunately and would probably be meaningless too?

Do other who use .h5 just never work with the data in other programs afterwards? They have to convert it to some format, so that it can be used elsewhere…

Ideally, you use other programs that can also read HDF5. Or you convert only on demand — if you store everything as .h5 files and try to mostly work in that format, you don’t batch-convert them all at once to .vtk or whatever, but only convert them one at at time when you need to process them in some other tool. That way you don’t pay the compute and storage cost of converting all of your files at once.

You can also convert them in parallel using separate processes of course, rather than separate threads.

It seems that for some reason hdf is not possible to load in to the program I want Paraview, atleast not when it is a custom file written by me. For me it is important to save data to hdf, since it is just so much faster, than using some other solutions. Your point of multi-processing was really great and might be exactly what I need, I got this to work:

using Distributed
addprocs(4) # Adjust based on your requirement

# This block ensures that each worker is using the same environment as the master process
@everywhere begin
    using Pkg
    Pkg.activate(".") # Activate the current project environment. Adjust the path as necessary.
    Pkg.instantiate() # Ensure all dependencies are installed

    using SPHExample
end

@everywhere using HDF5

# Path to your HDF5 file
hdf5_file_path = raw"E:\SecondApproach\TESTING_CPU\Test7.h5"

fid = h5open(hdf5_file_path,"r")
        
all_keys = keys(fid)

# Distribute processing of keys across workers
results_future = @distributed for key in all_keys
    # Each worker reads and processes a subset of keys
    local DictVariable = nothing
    # It's generally safer to open the file separately in each worker to avoid concurrency issues
    h5open(hdf5_file_path, "r") do file
        DictVariable = read(file[key])
    end
    filepath = "E:/SecondApproach/TESTING_CPU/" * "Test_" * key * ".vtp"  # Adjust output path as needed
    ConvertHDFtoVTP(filepath, DictVariable)
    # Return value from the loop if needed, e.g., a status message or result
    "Processed $(key)"
end

And yes, it is me and ChatGPT having a go at it, but this ran in parallel atleast!

This is my first time using Distributed in the past I have been able to get away with Threads so this is pretty cool :slight_smile:

Kind regards

You can read HDF5 files directly into Paraview if you write a small XDMF metadata file describing the contents you want to load. This should be a lot quicker than converting the data.

(Paraview can read a lot of file formats. It shouldn’t be too hard to find one you can write efficiently — a binary format, like HDF5 or NetCDF or “legacy” binary VTK — that it can read.)

1 Like

Thank you, that might become relevant at a later point for me. For now I need to ensure I can output the data extremely efficiently, to not bottleneck simulation run time and also have my data in a common data format such as .vtp.

Being able to read in the h5 file directly with a custom format would be a huge blessing, so hopefully I can do that one day.

Thanks!

HDF5.jl uses a ReentrantLock on the C API. The thread-safe version of the C library does the same on the C side.

The underlying C library does not support thread based parallelism. Rather it requires the use of MPI, which can be enabled by using MPI.jl:

https://juliaio.github.io/HDF5.jl/dev/mpi/

3 Likes

You might also be interested in the relatively recent VTKHDF format.

VTKHDF files are just HDF5 files, with the data organised in a specific way such that it can be understood and visualised by ParaView.

1 Like

Just to be clear, the following thread-parallel writing and reading (of different HDF5 files) works just fine (because of the lock mentioned by @mkitti):

@sync for t in 1:nthreads()
    @spawn h5open("file$t.h5", "w") do f
        for i in 1:100
            f[string(i)] = rand(1000, 1000)
        end
    end
end

@sync for t in 1:nthreads()
    @spawn h5open("file$t.h5", "r") do f
        s = 0.0
        for i in 1:100
            s += sum(read(f[string(i)]))
        end
        @show t, s
    end
end

It’s hence not clear - at least not to me - where the segfault in the OP is coming from.

Thanks! I have converted your loop to my code:

    @sync for t in 1:nthreads()
        @spawn h5open(SaveLocation_, "r") do f
            all_keys = keys(f)
            for key in all_keys
                DictVariable = read(f[key])
                filepath = SimMetaData.SaveLocation * "/" * SimMetaData.SimulationName * "_" * key * ".vtp"  # Adjust output path as needed
                ConvertHDFtoVTP(filepath, DictVariable)
            end
        end
    end

ERROR: TaskFailedException

    nested task error: HDF5.API.H5Error: Error iterating through links in group /116
    libhdf5 Stacktrace:
     [1] H5VL_reset_vol_wrapper: Virtual Object Layer/Bad value
         no VOL object wrap context?

I think I am doing it a bad wrong right now, but it seems to have the potential of working since it was able to output a few files before breaking!

That is amazing. Would be the perfect option as @stevengj mentioned, not having to do any conversion.

Do you know where I could find some code to mimic to write the vtkhdf format? Or is it perhaps already in WriteVTK.jl

Kind regards

It’s not currently included in WriteVTK.jl, even though it would be very nice to have a VTKHDF backend there.

I can’t send you an example right now, but you can look at some examples showing the structure of VTKHDF files in the VTK docs. In your case, if I understood correctly you’re trying to write particle locations, in which case your files should look like their PolyData example. For example, you would need to write the particle locations to the HDF5 dataset /VTKHDF/Points, the connectivity vector (which in your case is trivially 0:(num_particles - 1)) to /VTKHDF/Vertices/Connectivity, and so on.

1 Like

I may see the problem. You are attempting to iterate through keys in parallel. This involves a callback mechanism. We release the lock on callback so that you can do other things, but you should not be starting a new iteration on another thread.

Collect the keys before spawning. I will consider adding an iteration lock to prevent more than one iteration from occurring at once. Effectively though this will elimate any advantage of thread parallelism as demonstrated by your code.

Cartsen’s code works because he is not iterating via keys.

That said since your tasks are divided between separate files, using Distributed might be better.

xref: Callbacks, coroutines, and API calls from callbacks - #2 by gheber - HDF5 Library - HDF Forum

2 Likes

Thanks!

I am trying to give it a go. My points have no connectivity, so I am trying to make the simplest file possible:

using HDF5

# Open or create a .vtkhdf file
h5file = h5open("mydata.vtkhdf", "w")

vtkhdf  = create_group(h5file,"VTKHDF")
version = write_attribute(vtkhdf,"Version","2.1")
type    = write_attribute(vtkhdf,"Type",8) #length of "PolyData"?

data = rand(Float32, 412, 3)

# Write the 'Points' dataset to the file
write(h5file, "VTKHDF/Points", data)

close(h5file)

It generates a file but it does not recognize the format / open in Paraview 5.12. If anyone has some pointers then it would be highly appreciated.

EDIT: In Python this produces something which works:

import h5py #Uses UTF-8 by default
import numpy as np

with h5py.File("wtf.hdf",'w') as hdffile:

        # write support data
        vtkhdf_group = hdffile.create_group("VTKHDF")
        vtkhdf_group.attrs.create("Version", [1, 0])

        vtkhdf_group.attrs.create("Type",np.string_("ImageData")) # ascii fixed length string
        whole_extent = (0, 3, 0, 1, 0, 0)
        vtkhdf_group.attrs.create("WholeExtent", whole_extent)

        vtkhdf_group.attrs.create("Origin", (0.0, 0.0, 0.0))
        vtkhdf_group.attrs.create("Spacing", (1.0, 1.0, 1.0))
        vtkhdf_group.attrs.create("Direction", (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0))

        # create the pointdata group and the dataset inside it
        field_data_group = vtkhdf_group.create_group("PointData")
        field_data_group.attrs.create('Scalars', "PNGImage") ## UTF-8 seems to work here
        dset = field_data_group.create_dataset('PNGImage',dtype=np.uint8,shape=(2,4))

        dset[:, :] = np.reshape([1.0,2.3,3.1,4.5,\
                                    1.0,2.3,3.1,4.5],(2, 4))

Taken from: Paraview VTKHDF Reader can not read utf8 strings (#22135) · Issues · ParaView / ParaView · GitLab

I would love if someone could get it to work in Julia. I have the following Julia code:

using HDF5

# Open an HDF5 file
filename = "wtf.hdf"
h5open(filename, "w") do file
    # Create the top-level group 'VTKHDF' and set its attributes
    vtkhdf_group = create_group(file, "VTKHDF")
    write_attribute(vtkhdf_group, "Version", [1, 0]) # Julia HDF5 handles arrays as attributes natively
    
    # For ASCII fixed-length strings, Julia's HDF5 doesn't require special handling like np.string_
    # Direct assignment is typically sufficient, as HDF5.jl manages string encoding
    attrs(vtkhdf_group)["Type"] = codeunits("ImageData") # This is inherently UTF-8, and HDF5.jl handles it correctly
    
    whole_extent = [0, 3, 0, 1, 0, 0]
    write_attribute(vtkhdf_group,"WholeExtent", whole_extent)
    
    write_attribute(vtkhdf_group, "Origin", [0.0, 0.0, 0.0])
    write_attribute(vtkhdf_group, "Spacing", [1.0, 1.0, 1.0])
    write_attribute(vtkhdf_group, "Direction", [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
    
    # Create the PointData group and the dataset inside it
    field_data_group = create_group(vtkhdf_group, "PointData")
    write_attribute(field_data_group, "Scalars", "PNGImage")
    dset = create_dataset(field_data_group, "PNGImage", datatype(UInt8), dataspace(2, 4))
    write(dset, UInt8[1 2 3 4; 5 6 7 8]) # Adjusted for UInt8
end

# Path to the file for reference
filename

Which runs but produces a file which is non openable in Paraview.

@jipolanco now I got the toy example from above to work:

using HDF5

# Open an HDF5 file
filename = "it_only_took_a_few_hours_sigh.hdf"
file = h5open(filename, "w") 

# Create the top-level group 'VTKHDF' and set its attributes
vtkhdf_group = create_group(file, "VTKHDF")
write_attribute(vtkhdf_group, "Version", [1, 0])

# write_attribute(vtkhdf_group,"Type", "ImageData")
# Create the ASCII string datatype with fixed length and null padding
strtype = HDF5.h5t_copy(HDF5.H5T_C_S1)
HDF5.h5t_set_size(strtype, 9)  # Length of the string "ImageData"
HDF5.h5t_set_strpad(strtype, HDF5.H5T_STR_NULLPAD)  # Null padding
HDF5.h5t_set_cset(strtype, HDF5.H5T_CSET_ASCII)  # Set character set to ASCII
# Define a scalar dataspace for the attribute
dataspace_id = HDF5.dataspace(1)
# Create the 'Type' attribute with the specific string type and scalar dataspace
attr_id = HDF5.h5a_create(vtkhdf_group.id, "Type", strtype, dataspace_id, HDF5.H5P_DEFAULT, HDF5.H5P_DEFAULT)
# Write the string "ImageData" to the 'Type' attribute
HDF5.h5a_write(attr_id, strtype, codeunits("ImageData"))

whole_extent = [0, 3, 0, 1, 0, 0]
write_attribute(vtkhdf_group,"WholeExtent", whole_extent)

write_attribute(vtkhdf_group, "Origin",    [0.0, 0.0, 0.0])
write_attribute(vtkhdf_group, "Spacing", [1.0, 1.0, 1.0])
write_attribute(vtkhdf_group, "Direction", [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])

# Create the PointData group and the dataset inside it
field_data_group = create_group(vtkhdf_group, "PointData")
attr = create_attribute(field_data_group, "Scalars", datatype(String), HDF5.dataspace(1))
HDF5.API.h5a_write(attr, datatype(String), Ref{Cstring}(["PNGImage"]))
field_data_group["PNGImage"] = Array(UInt8[1 2 3 4; 5 6 7 8]')
close(file)

# Path to the file for reference
filename

This neatly produces:

And now no one hopefully has to be frustrated over finding the right functions haha. Now it is of course “just” about getting PolyData to work.

Kind regards

1 Like

I actually have a code that generates polydata VTKHDF files, but it’s for describing lines rather than points, and it’s quite complex.

But I managed to create a simple example with particles (and some attached data) which works on my side. I hope it’s helpful!

Note that the connectivities seem to be needed for things to work.

using HDF5: HDF5, h5open
using StaticArrays: SVector
using Random

# Random points to be written (as a vector of SVector).
points = rand(SVector{3, Float64}, 1000)
velocities = randn!(similar(points))

h5open("particles.vtkhdf", "w") do io
    # Create toplevel group /VTKHDF
    gtop = HDF5.create_group(io, "VTKHDF")

    # Write version of VTKHDF format as an attribute
    HDF5.attrs(gtop)["Version"] = [2, 1]

    # Write type of dataset ("PolyData") as an ASCII string to a "Type" attribute.
    # This is a bit tricky because VTK/ParaView don't support UTF-8 strings here, which is
    # the default in HDF5.jl.
    let s = "PolyData"
        dtype = HDF5.datatype(s)
        HDF5.API.h5t_set_cset(dtype.id, HDF5.API.H5T_CSET_ASCII)
        dspace = HDF5.dataspace(s)
        attr = HDF5.create_attribute(gtop, "Type", dtype, dspace)
        HDF5.write_attribute(attr, dtype, s)
    end

    # Write points + number of points.
    # Note that we need to reinterpret the vector of SVector onto a 3×Np matrix.
    Np = length(points)
    gtop["NumberOfPoints"] = [Np]
    gtop["Points"] = reinterpret(reshape, eltype(eltype(points)), points)

    # Write velocities as point data.
    let g = HDF5.create_group(gtop, "PointData")
        g["Velocity"] = reinterpret(reshape, eltype(eltype(velocities)), velocities)
    end

    # Create and fill Vertices group.
    let g = HDF5.create_group(gtop, "Vertices")
        # In our case 1 point == 1 cell.
        g["NumberOfCells"] = [Np]
        g["NumberOfConnectivityIds"] = [Np]
        g["Connectivity"] = collect(0:(Np - 1))
        g["Offsets"] = collect(0:Np)
        close(g)
    end

    # Add unused PolyData types. ParaView expects this, even if they're empty.
    for type ∈ ("Lines", "Polygons", "Strips")
        gempty = HDF5.create_group(gtop, type)
        gempty["NumberOfCells"] = [0]
        gempty["NumberOfConnectivityIds"] = [0]
        gempty["Connectivity"] = Int[]
        gempty["Offsets"] = [0]
        close(gempty)
    end
end
2 Likes

@jipolanco thank you so much!

I’ve marked your solution as the answer after all, since I believe it is the right answer, to my “wrong question”. As you pointed out this new format is exactly what I need, it has the compression of hdf and no issue in regards to opening in Paraview, which is a must for my physical simulations.

A huge thanks to all the guys in the thread helping me out, I’ve read and benefitted from all your comments.

After receiving help I am trying to make it a habit when possible to share the impact. Thanks to the script provided by jlpolanco, I am now able to save data at extreme speeds:

With 400 files taking 1.64 s, compared to roughly 10s earlier using my custom vtp output function (similar performance to the one in WriteVTK.jl). I do not close files after reading, rather at the end of the simulation since as you can see it is actually a bottle neck and not something I need to wait on.

The impact of this file save optimization is huge; when running a multi-threaded code, the file save is still often serial and when we start to save data, the whole simulation stops. This means that the time saved with faster writes is in a sense “double”, since we get to move on with our simulation faster.

The single minimal downside of the vtkhdf for now is that it takes up about 10% more space.

I’ve merged a functional version into my code here: SPHExample/example/StillWedge.jl at e3bd565618c68f3e4fa6abc3bcc4ebf28a3cd715 · AhmedSalih3d/SPHExample · GitHub

Oh and as we like to compare with other software, we beat DualSPHysics in save time by a factor of 10 for a ~similar amount of data:

And there it hasnt even been processed yet for viewing in Paraview, what a way to start the weekend!

1 Like