Error writing or reading HDF5 file

I am trying to write an HDF5 file in Julia 1.0 on Windows 10. The code is given below.

The following code works

using HDF5

A = collect(reshape(1:120, 15, 8));

h5write("test3.h5", "mygroup2", A)
data = h5read("test3.h5", "mygroup2")

15×8 Array{Int64,2}:
  1  16  31  46  61  76   91  106
  2  17  32  47  62  77   92  107
  3  18  33  48  63  78   93  108
  4  19  34  49  64  79   94  109

However, using the iris file throws an error.

using HDF5,DataFrames
using CSV
df=CSV.read("iris.csv")
df1=convert(Array,df)
h5write("iris1.h5","abc",df1);

MethodError: no method matching write(::HDF5File, ::String, ::Array{Any,2})
Closest candidates are:
  write(::Union{HDF5File, HDF5Group}, ::String, ::Any, !Matched::String, !Matched::Any, !Matched::Any...) at C:\Users\chatura\.julia\packages\HDF5\Ft2KJ\src\HDF5.jl:1627
  write(!Matched::AbstractString, ::Any, ::Any...) at io.jl:283
  write(!Matched::IO, ::Any, ::Any...) at io.jl:500
  ...

Stacktrace:
 [1] h5write(::String, ::String, ::Array{Any,2}) at C:\Users\chatura\.julia\packages\HDF5\Ft2KJ\src\HDF5.jl:720
 [2] top-level scope at In[10]:6245:

When I try to read an HDF5 file created in Python, I get the output below. I was expecting a data frame, and I don’t know how to use this output.

df=h5read(file1,"dpli_data_mar18_df2")

Dict{String,Any} with 8 entries:
  "axis1"         => [57559.0, 57560.0, 57561.0, 57562.0, 57563.0, 57564.0, 575…
  "axis0"         => ["Age", "Sex", "SA", "APE", "Plan", "Mode", "LOB", "AGT", …
  "block0_values" => [47.0 36.0 … 3.0 49.0; 50000.0 77530.0 … 342239.0 2.926e6;…
  "block1_items"  => ["Sex", "Plan", "Mode", "LOB", "AGT", "Type", "SubStatus",…
  "block1_values" => [0 1 … 1 0; 5 6 … 28 50; … ; 2 2 … 2 2; 2 2 … 2 2]
  "block0_items"  => ["Age", "SA", "APE", "EXPR", "EMR", "PT", "PPT", "TPCL", "…
  "block2_values" => Int32[196307 197404 … 201410 196808; 60 12 … 12 12]
  "block2_items"  => ["DOBLA", "DUR"]

The problem is that HDF5 can’t write arrays of mixed types, and that is exactly what your df1 variable is:

julia> df1
150×5 Array{Any,2}:
 5.1  3.5  1.4  0.2  "setosa"
 4.9  3.0  1.4  0.2  "setosa"
 4.7  3.2  1.3  0.2  "setosa"
 ⋮
 6.9  3.1  5.1  2.3  "virginica"
 5.8  2.7  5.1  1.9  "virginica"
 6.8  3.2  5.9  2.3  "virginica"

A simple workaround is to write the float and string parts of the array separately:

julia> h5write("iris1.h5","abc",float.(df1[:,1:4]))

julia> h5write("iris1.h5","def",string.(df1[:,5]))

Then after reading you can reassemble them again if you absolutely need to.
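A minimal sketch of that round trip, in case it helps. The two-row df1 and the temporary file path are stand-ins I made up for illustration. The dataset names "abc" and "def" are the ones used above.

```julia
using HDF5

# Hypothetical two-row stand-in for df1, in the same layout as the iris array.
df1 = Any[5.1 3.5 1.4 0.2 "setosa"; 6.8 3.2 5.9 2.3 "virginica"]

# Write the numeric and string parts as two separate datasets in one file.
path = tempname() * ".h5"
h5write(path, "abc", float.(df1[:, 1:4]))
h5write(path, "def", string.(df1[:, 5]))

# Read both parts back.
nums   = h5read(path, "abc")   # matrix of Float64
labels = h5read(path, "def")   # vector of String

# hcat of Float64 and String columns falls back to element type Any,
# which matches the original mixed array.
restored = hcat(nums, labels)
```

The reassembled array has element type Any again, so it is only useful in memory. It cannot be passed back to h5write as is.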


I would argue that the Dict output you got is the appropriate format for this data. Note that the content of the keys called "*_values" are all matrices, so your data can’t fit into a single DataFrame. Just access the individual fields directly and do what you want. For example, df["block1_values"] would return a matrix of integers. (And you should probably rename df to something else since you no longer have a DataFrame.)
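If you do want one vector per named column, here is a rough sketch using only Base Julia. It assumes, based on the output printed above, that row i of each "*_values" matrix belongs to entry i of the matching "*_items" list. The Dict d is a small made-up stand-in with the same key layout, and block_columns is a hypothetical helper name, not part of HDF5.jl.

```julia
# Hypothetical stand-in for the Dict returned by h5read (same key layout, made-up values).
d = Dict{String,Any}(
    "axis0"         => ["Age", "Sex", "SA", "DUR"],
    "block0_items"  => ["Age", "SA"],
    "block0_values" => [47.0 36.0 3.0; 50000.0 77530.0 342239.0],
    "block1_items"  => ["Sex"],
    "block1_values" => [0 1 1],
    "block2_items"  => ["DUR"],
    "block2_values" => Int32[60 12 12],
)

# Collect one vector per column name.
# Assumption: row i of a "*_values" matrix holds the data for item i of "*_items".
function block_columns(d)
    cols = Dict{String,Vector}()
    for key in keys(d)
        endswith(key, "_items") || continue
        vals = d[replace(key, "_items" => "_values")]
        for (i, name) in enumerate(d[key])
            cols[name] = vals[i, :]
        end
    end
    return cols
end

cols = block_columns(d)

# "axis0" lists the column names, so use it to put the columns in order.
ordered = [name => cols[name] for name in d["axis0"]]
```

With DataFrames loaded, these name and vector pairs can probably be passed to the DataFrame constructor, but whether it wants String or Symbol names depends on the DataFrames version, so check that against what you have installed.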

@NiclasMattsson Thanks. The HDF5 file was created in Python from a pandas data frame, and I can read it back into a data frame directly in Python. The iris data is also being written and read correctly following your suggestion. The differences in how Python and Julia handle the HDF5 format are probably why the column values come out as matrices.