Reading and writing Arrow column metadata

I am trying to read and write column metadata for the Arrow format. I looked through the runtests.jl file in the Arrow package to try to figure it out, but I am still having trouble. If someone can get the MWE below working, it will help me along.

Specifically

  1. I would like to write the metadata
  2. I would like to read the metadata
  3. I would like to change the column names

Thanks in advance

using Arrow
using Tables
# go to runtest.jl in Arrow package for hints on how to use metadata and colmetadata

filename = "test.arrow"
meta = ["file" => "test.arrow", "when" => "now", "why" => "gain experience"]
metacol1 = ["id" => "col1", "comment" => "Col 1 comment"]
metacol2 = ["id" => "col2", "comment" => "Col 2 comment"]
writer = open(Arrow.Writer, filename, metadata=meta, colmetadata = Dict(:Column1 => metacol1, :Column2 => metacol2))

for i in 1:10
    result = rand(4, 2) .+ i
    Arrow.write(writer, Tables.table(result))
end
close(writer)

tbl = Arrow.Table(filename)
rm(filename)

tblmeta = Arrow.getmetadata(tbl)
tblmeta["why"]

tblmetachan1 = Arrow.getmetadata(tbl.Column1)
tblmetachan1

This is how you can do it now:

Arrow.write("test2.arrow", Tables.table(rand(4, 2), header=[:a, :b]), metadata=meta, colmetadata=Dict(:a => metacol1, :b => metacol2))
tbl = Arrow.Table("test2.arrow")
Arrow.getmetadata(tbl)
Arrow.getmetadata(tbl.a)

But indeed we should finish "Support DataAPI.jl metadata API" (issue #337 in apache/arrow-julia) for complete support.

Thanks

This works for writing a single block of data, but my application requires writing multiple blocks, as shown in the for loop in my example. I have not figured out how I can write the column metadata, nor change the column names.
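
For the column names, the `header` keyword belongs to `Tables.table` rather than to `Arrow.write`, as in the single-block call above. A minimal sketch of the multi-block case might look like the following. The file name `renamed.arrow` and the block sizes are made up for illustration, and this only covers the renaming, not the column metadata.

```julia
using Arrow, Tables

filename = "renamed.arrow"   # hypothetical file name, for illustration only
writer = open(Arrow.Writer, filename)

for i in 1:3
    block = rand(4, 2) .+ i
    # header= is an argument of Tables.table, so it goes inside that call
    Arrow.write(writer, Tables.table(block; header=[:a, :b]))
end
close(writer)

tbl = Arrow.Table(filename)
names_read = collect(Tables.columnnames(tbl))
```

Every block has to use the same header, since all blocks written through one writer share a single schema.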

Using the default column names is not a big deal, and I have figured out how I can put all the column specific metadata into the metadata, but it would be cleaner to have the column metadata attached to the column.
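
The workaround of folding the column-specific metadata into the table-level metadata could be sketched roughly as below. The helper names `flatten_colmetadata` and `column_metadata`, and the `"column.key"` naming scheme, are my own assumptions and not part of Arrow.jl; the code itself only uses Base Julia.

```julia
# Hypothetical helpers (not part of Arrow.jl): store per-column metadata
# in the table-level metadata under "column.key" names.
function flatten_colmetadata(meta, colmeta; sep=".")
    # start from the table-level entries, forcing String keys and values
    flat = Pair{String,String}[string(k) => string(v) for (k, v) in meta]
    for (col, pairs) in colmeta
        for (k, v) in pairs
            push!(flat, string(col, sep, k) => string(v))
        end
    end
    return flat
end

# Collect the entries that belong to one column, stripping the prefix.
function column_metadata(flat, col; sep=".")
    prefix = string(col, sep)
    return Dict(String(k[ncodeunits(prefix)+1:end]) => String(v)
                for (k, v) in flat if startswith(k, prefix))
end

meta = ["file" => "test.arrow", "why" => "gain experience"]
colmeta = Dict(:Column1 => ["id" => "col1", "comment" => "Col 1 comment"],
               :Column2 => ["id" => "col2", "comment" => "Col 2 comment"])

flat = flatten_colmetadata(meta, colmeta)
col1 = column_metadata(flat, :Column1)
```

The idea is that `flat` is passed as `metadata=` when opening the writer, and that `column_metadata` is applied to the result of `Arrow.getmetadata(tbl)` after reading, which iterates as string pairs in the output shown in this thread. A column name that itself contains the separator would make the prefixes ambiguous, so a different `sep` may be needed in that case.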

Perhaps this is not possible?

The code from BKadmin indeed works, but as mentioned is not applicable to my use case.

Looking a bit closer at what is happening, I get the following output from the open command:

julia> writer = open(Arrow.Writer, filename, metadata=meta, colmetadata = Dict(:Column1 => metacol1, :Column2 => metacol2))
Arrow.Writer{IOStream}(IOStream(<file test.arrow>), true, nothing, true, false, true, false, false, true, 8, 6, Base.ImmutableDict("why" => "gain experience", "when" => "now", "file" => "test.arrow"), Base.ImmutableDict(:Column1 => Base.ImmutableDict("comment" => "Col 1 comment", "id" => "col1"), :Column2 => Base.ImmutableDict("comment" => "Col 2 comment", "id" => "col2")), ConcurrentUtilities.OrderedSynchronizer(Task (runnable) @0x0000025285e45370, Base.GenericCondition{ReentrantLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{Base.Threads.SpinLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), (8, 2554407032536, 2553475134096))), 2, false), Channel{Arrow.Message}(2147483647), Base.RefValue{Tables.Schema}(#undef), Base.RefValue{Any}(#undef), Dict{Int64, Any}(), (Arrow.Block[], Arrow.Block[]), Task (runnable) @0x0000025285b91560, Base.Threads.Atomic{Bool}(false), Base.RefValue{Any}(#undef), 1, false)

If you review the output, the metadata and colmetadata dictionaries are both present.

Then I do a couple of writes and find that the metadata and colmetadata are still present:

for i in 1:2
    result = rand(2, 2) .+ i
    Arrow.write(writer, Tables.table(result)) #, header=[:a,:b])
end
julia> writer
Arrow.Writer{IOStream}(IOStream(<file test.arrow>), true, nothing, true, false, true, false, false, true, 8, 6, Base.ImmutableDict("why" => "gain experience", "when" => "now", "file" => "test.arrow"), Base.ImmutableDict(:Column1 => Base.ImmutableDict("comment" => "Col 1 comment", "id" => "col1"), :Column2 => Base.ImmutableDict("comment" => "Col 2 comment", "id" => "col2")), ConcurrentUtilities.OrderedSynchronizer(Task (runnable) @0x0000025285e45370, Base.GenericCondition{ReentrantLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), ReentrantLock(nothing, 0x00000000, 0x00, Base.GenericCondition{Base.Threads.SpinLock}(Base.IntrusiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), (8, 2554407032536, 2553475134096))), 3, false), Channel{Arrow.Message}(2147483647), Base.RefValue{Tables.Schema}(Tables.Schema:
 :Column1  Float64
 :Column2  Float64), Base.RefValue{Any}(Arrow.ToArrowTable(Tables.Schema:
 :Column1  Float64
 :Column2  Float64, Any[[1.6538333373705436, 1.1112311527966914], [1.5588581037092983, 1.230890307877189]], Base.ImmutableDict("file" => "test.arrow", "when" => "now", "why" => "gain experience"), Arrow.DictEncoding[])), Dict{Int64, Any}(), (Arrow.Block[Arrow.Block(520, 192, 32), Arrow.Block(744, 192, 32)], Arrow.Block[]), Task (runnable) @0x0000025285b91560, Base.Threads.Atomic{Bool}(false), Base.RefValue{Any}(#undef), 3, false)

Then I close the writer and read the table, and find that the metadata is read but the colmetadata shows as “nothing”.

julia> close(writer)

julia> tbl = Arrow.Table(filename)
Arrow.Table with 4 rows, 2 columns, and schema:
 :Column1  Float64
 :Column2  Float64

with metadata given by a Base.ImmutableDict{String, String} with 3 entries:
  "why"  => "gain experience"
  "when" => "now"
  "file" => "test.arrow"

julia> Arrow.getmetadata(tbl)
Base.ImmutableDict{String, String} with 3 entries:
  "why"  => "gain experience"
  "when" => "now"
  "file" => "test.arrow"
julia> Arrow.getmetadata(tbl.Column1) === nothing
true

julia> Arrow.getmetadata(tbl.Column2) === nothing
true

So, in summary, it seems like a reading problem rather than a writing problem. Perhaps it is solved in PR 481?

Is there a timeline to merge the suggested PR?

We are waiting for @quinnj as he is currently offline for a few weeks.

If you could remind me how to download this PR so I can test it, that would be appreciated.

In a local project, do: add https://github.com/apache/arrow-julia/tree/bk/metadata

This PR does not fix my particular problem! Should I file this as an issue or can the PR be modified to rectify this problem?

My guess is that the problem is in the code around src/table.jl line 454 which then calls line 462 (I think), but I am not sure.

Yes - please file an issue as I am not maintaining Arrow.jl so I cannot really help with details.

For reference, if I change the for loop to run only once, as in:

for i in 1:1
    result = rand(2, 2) .+ i
    Arrow.write(writer, Tables.table(result)) #, header=[:a,:b])
end

Then the colmetadata works.