Transfer/ Unbox Julia's DataFrame objects and use as C++ object

NOTE: Cross posting from Transfer/ Unbox Julia's DataFrame objects and use as C++ object - Stack Overflow

I would like to (un)box complex objects, like DataFrames created in Julia, from C++, so that I can use them as input data for my C++ code.

For example, by loading the MixedModels sample dataset dyestuff and describing it as a DataFrame, I get:

using DataFrames, MixedModels

dyestuff = MixedModels.dataset(:dyestuff)
describe(DataFrame(dyestuff))

[image: output of describe(DataFrame(dyestuff))]

Note that the MixedModels dataset is of type Arrow.Table. According to fieldnames(typeof(dyestuff)), the available fields are (:names, :types, :columns, :lookup, :schema, :metadata).

How can I effectively and efficiently transform/unbox complex objects (like data frames) from Julia to C++?

Because it is not possible (AFAIK) to share such a complex data frame object directly, I thought it would be convenient to extract the separate fields as primitive objects and build my own DataFrame class in C++. So in C++ I am trying:

// to make sure I can execute julia code from C++
jl_eval_string("println(describe(DataFrame(dyestuff)))"); // 2×7 DataFrame
jl_eval_string("println(typeof(dyestuff))");              // Arrow.Table

// Get the length of the column names attribute in Julia's dyestuff data frame.
jl_value_t *n_p = jl_eval_string("length(getfield(dyestuff, :names))");
int n = jl_unbox_int64(n_p); // length() returns an Int (Int64 on 64-bit Julia); output is 2 because we have 2 column names.

// Trying to load the actual column names (as an array of strings) by following the Julia manual:
jl_array_t *names_list = (jl_array_t *)jl_eval_string("String.(getfield(dyestuff, :names))");
string *names = (string *)jl_array_data(names_list);
cout << jl_array_len(names_list) << endl;
for (size_t i = 0; i < jl_array_len(names_list); i++)
{
    cout << " " << names[i] << endl;
}

But when printing out the names I get strange characters, which seem to be the raw bytes of the object rather than the column names.

 batch��
��Rmath��
��yield�F���1ath��
��NaNMath�F���min th�F��� min  ��
��QuadGK�F���1adGKG���1adGK��
��GLMs(G���max eXG��� max  ��
��JSON3epG���1ON3�G���1ON3��
��NLopt�G���  pt�G���1optL��
��AdaptH���   tH���   0H���     HH���│9`H���  dfxH���90df_�H���90df_�H���Any �H��� Any  �H���90ets�H���90ndI���Any 4 I��� Any  8I���90dlPI���   inghI���90pe�I���90it2�I���┼ets�I���  code�I���  1tf��
��   1 ��
��│J��� 

Note that the output of the Julia code is:

julia> String.(getfield(dyestuff, :names))
2-element Vector{String}:
 "batch"
 "yield"

Getting the names attribute is just the first step. Other challenges will come when working with the other data frame attributes like types, columns, metadata, lookup, etc. So any suggestion on that front will be helpful. Thanks in advance!

If this is an Arrow.Table, why don’t you just use Arrow?

https://arrow.apache.org/docs/cpp/

Language interop is half the purpose of Arrow.


Thanks @mkitti for the idea. Actually that was my first approach, but for some reason I had issues compiling and using Arrow from C++. So I took this other approach as an exercise to understand how to share complex objects or structures between the two languages.

You shouldn’t. DataFrames.jl doesn’t have a defined memory layout; it’s not meant to be used for IPC. You should use an Arrow IPC stream.

Yeah, I see that I should give Arrow another try. However, imagine that we are not talking about data frames but about simpler, still structured data, like an array of strings (as described in the problem) or a Julia dictionary, which I want to access from C++ directly. Let me be more precise with my questions:

  1. I’ve read that jl_string_ptr is used to access strings (instead of a hypothetical jl_unbox_string), but how do I access such strings when they are inside an array, as in the problem I described?

  2. How can I access/“unbox” a dictionary (whose values are arrays) or JSON objects defined in Julia, from C++?

This is very useful in scenarios where I generate data in Julia but still want to use my C++ algorithms as they are. I’d like to access such structures from my C++ code. The legacy code does not use Arrow, but it can handle objects like dictionaries and arrays of strings.

Thanks!

Julia String is UTF-8 and not null-terminated etc. So again you probably shouldn’t rely on it in the long term.

The only thing you can rely on is isbits types, and the way to do it is outlined here: Embedding Julia · The Julia Language
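On the point above about Julia strings being UTF-8 with their own length: Julia strings may contain embedded NUL bytes, so a C++ side that only receives a char pointer can silently truncate them. A minimal sketch, assuming the byte count is passed alongside the pointer (for example sizeof(s) on the Julia side). The helper name from_bytes is hypothetical and not part of any Julia or C++ API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: build an owned std::string from a pointer plus an
// explicit byte count. Unlike std::string(const char*), this does not stop
// at the first '\0', so embedded NUL bytes and multi-byte UTF-8 survive.
std::string from_bytes(const char* data, std::size_t nbytes) {
    assert(data != nullptr || nbytes == 0);
    return nbytes == 0 ? std::string() : std::string(data, nbytes);
}
```

The result is a copy owned by C++, so it stays valid no matter what the Julia garbage collector later does with the original string.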

Thanks @jling for pointing this information. Very helpful as reference!
However, from the same page, I understand that Julia’s string type is a bits type. Also, you are right that strings/chars in C++ are null-terminated, and I believe that’s why Julia has the Cstring type, which returns a memory address that can be shared with C/C++.

I verified that Julia strings are a bits type:

julia> isbits(string)
true

Also, under Bits Types, they point out that Array{T,N} is also a bits type and say:

" When an array is passed to C as a Ptr{T} argument, it is not reinterpret-cast: Julia requires that the element type of the array matches T , and the address of the first element is passed.

If an array of eltype Ptr{T} is passed as a Ptr{Ptr{T}} argument, Base.cconvert will attempt to first make a null-terminated copy of the array with each element replaced by its Base.cconvert version. This allows, for example, passing an argv pointer array of type Vector{String} to an argument of type Ptr{Ptr{Cchar}}
"

julia> s = String.(getfield(dyestuff, :names))
2-element Vector{String}:
 "batch"
 "yield"

julia> s[1]
"batch"

julia> Base.unsafe_convert(Cstring, s)
Cstring(0x00007f162afd7c30)

Hence, I still feel there could be a way to share a Julia array of strings with C.

Cstring indeed works, and maybe the interface auto-converts. I was saying you can’t pluck a pointer to a Vector of any good old Julia strings and expect C++ to just be able to use the same memory. That only works for a small set of primitive-ish types.
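As a rough illustration of the Cstring route, here is a sketch of what the C++ receiving side could look like when Julia calls into C++ with a Vector{String} passed as Ptr{Ptr{Cchar}}, as in the manual excerpt quoted earlier. Note this is the opposite direction from the jl_eval_string attempt in the first post. The function names and the library name libnames are hypothetical, and the ccall in the comment is an assumption about how the Julia side would look.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copy `count` null-terminated C strings into owned
// std::string objects, so C++ no longer depends on memory managed by Julia.
std::vector<std::string> copy_names(const char* const* names, std::size_t count) {
    assert(names != nullptr || count == 0);
    std::vector<std::string> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        out.emplace_back(names[i]);  // copies up to the terminating '\0'
    }
    return out;
}

// Hypothetical C-ABI entry point. Assumed Julia-side call:
//   ccall((:count_name_bytes, "libnames"), Csize_t,
//         (Ptr{Ptr{Cchar}}, Csize_t), names, length(names))
extern "C" std::size_t count_name_bytes(const char* const* names, std::size_t count) {
    std::size_t total = 0;
    for (const std::string& s : copy_names(names, count)) {
        total += s.size();
    }
    return total;
}
```

Copying inside the call matters here: the converted pointer array should only be assumed valid for the duration of the call, so anything C++ wants to keep has to be copied before returning.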

Sure. Thanks for pointing and confirming this.

Do you know how that would be done? Is there any reference page you know of? It would be helpful.
Thanks.

C Interface · The Julia Language probably

If you want to keep things simple I recommend

  • Decompose your data into sufficiently primitive types on the Julia side.
  • If you want to call Julia functions from C++, go through @cfunction pointers.
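The two suggestions above could be sketched roughly as follows on the C++ side. The first function assumes a dictionary whose values are arrays has been decomposed on the Julia side into three flat arrays (keys, offsets, concatenated values); this layout is just one possible choice, not something prescribed by Julia. The second shows the callback style, where C++ only ever sees a plain function pointer such as one produced by @cfunction(f, Cdouble, (Cdouble,)). All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Rebuild a string -> vector<double> map from three flat arrays:
//   keys    : n_keys null-terminated strings
//   offsets : n_keys + 1 entries; values of key i are values[offsets[i]]
//             up to (but not including) values[offsets[i + 1]]
//   values  : all value arrays concatenated
std::map<std::string, std::vector<double>> rebuild_dict(
    const char* const* keys, const std::int64_t* offsets,
    const double* values, std::size_t n_keys) {
    std::map<std::string, std::vector<double>> dict;
    for (std::size_t i = 0; i < n_keys; ++i) {
        assert(offsets[i] <= offsets[i + 1]);
        dict[keys[i]] = std::vector<double>(values + offsets[i],
                                            values + offsets[i + 1]);
    }
    return dict;
}

// Callback style: apply a C function pointer to every element and sum the
// results. C++ needs no julia.h declarations for this.
extern "C" double sum_mapped(double (*f)(double), const double* data, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += f(data[i]);
    }
    return total;
}
```

With this kind of decomposition only pointers to plain numeric buffers and C strings cross the language boundary, which keeps the interaction with julia.h small.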

In general I try to keep the interaction with the julia.h functions as small as possible. The code in GunnarFarneback/DynamicallyLoadedEmbedding.jl on GitHub ("Embed Julia with dynamical loading of libjulia at runtime") is fairly minimalistic in that respect.

Only C strings are null terminated. C++ std::string explicitly tracks the string length.

It looks like at least in newish C++, one can very easily go from/to C string and seems like the underlying data is also null terminated?
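For what it's worth, since C++11 the buffer returned by std::string::c_str() (and data()) is guaranteed to be null-terminated, while size() is still tracked separately. A small sketch of the difference, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as a C API would see it: counts bytes up to the first '\0'.
std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// True when the buffer is terminated right after the last tracked character.
bool is_null_terminated(const std::string& s) {
    return s.c_str()[s.size()] == '\0';
}

// Hand a std::string to C code. The assert documents the precondition for a
// lossless hand-off: no embedded '\0' bytes. The pointer is only valid while
// `s` is alive and unmodified.
const char* as_c_string(const std::string& s) {
    assert(std::strlen(s.c_str()) == s.size());
    return s.c_str();
}
```

So going from std::string to a C string is cheap, but the two notions of length only agree when the string has no embedded NUL bytes.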