Initializing a dataframe

Hi,

I would like to initialize a dataframe with only the column names and no data and then add rows of mixed type. For example:

using DataFrames.jl
df = DF.DataFrame([:a, :b])
push!(df, [[1,2,3], .3])
push!(df, [[3,4,5], .5])

Needless to say, this does not work. Is there a way to initialize the dataframe with column names in such a way that each column can be an Any type to be determined the first time a row is appended? I am not concerned with efficiency at this point.

Thanks,

Gordon

Does this do the trick? You can initialize empty containers of the type Any or get more specific with the type of containers you’d like (Int, String etc.).

#Any
julia> df = DataFrame(a = Any[], b = Any[])
0×2 DataFrame


julia>  push!(df, [3  "cat"])
1×2 DataFrame
│ Row │ a   │ b   │
│     │ Any │ Any │
├─────┼─────┼─────┤
│ 1   │ 3   │ cat │

#More specific types
julia> df = DataFrame(a = Int[], b = String[])
0×2 DataFrame


julia>  push!(df, [3  "cat"])
1×2 DataFrame
│ Row │ a     │ b      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 3     │ cat    │
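
Applied to the rows from the original question, the same idea might look like the sketch below. It assumes a recent DataFrames.jl release and pushes each row as a Tuple instead of a 1×2 matrix; the values are the ones from the question.

```julia
using DataFrames

# Columns with element type Any accept whatever each row contains.
df = DataFrame(a = Any[], b = Any[])

# Each row is a Tuple: one vector-valued cell and one scalar cell.
push!(df, ([1, 2, 3], 0.3))
push!(df, ([3, 4, 5], 0.5))

println(df)
```

Because both columns are `Any`, nothing is checked at push time, which matches the "not concerned with efficiency" requirement.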

Check ?DataFrame for various constructors. You may be looking for

DataFrame([Vector{Int},Float64], [:a, :b])

That is what I was looking for. Now, what if I have 15 column vectors? In other words, my initialization would be something like:

DataFrame([Vector{Int}, Vector{Int}, …, Vector{Int}, Float64], [:a, :b, :c, …:m])

Is there a shortcut notation so that I do not have to type Vector{Int} 15 times? More generally, suppose I do not know ahead of time how many Vectors are needed, so I would like a constructor such as


function InitializeDF(nb_vec, [:sym1, :sym2, :sym3, ..., :sym_nb_vec])

df = DataFrame( What do I put in here?)

end

Does this require macros or is there a way to do this without it? Thanks!

You can use [fill(Vector{Int}, 15); Float64] if you don’t want to type Vector{Int} 15 times
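
For the more general "unknown number of vectors" case, no macro should be needed. Here is a minimal sketch, assuming a recent DataFrames.jl release (newer versions may not accept a vector of types directly, so it builds one empty typed column per type instead). `initialize_df` is a hypothetical helper name, not part of the package.

```julia
using DataFrames

# Hypothetical helper: an empty DataFrame with nb_vec vector-valued
# columns followed by a single Float64 column.
function initialize_df(nb_vec::Integer, names::Vector{Symbol})
    types = [fill(Vector{Int}, nb_vec); Float64]
    columns = [T[] for T in types]   # one empty, typed column per entry
    return DataFrame(columns, names)
end

df = initialize_df(3, [:a, :b, :c, :d])
push!(df, ([1, 2], [3], [4, 5, 6], 0.5))
println(df)
```

The number of names has to match `nb_vec + 1`, otherwise the constructor throws.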


The easiest thing to do here would be to, instead of pushing row arrays, push a NamedTuple to your DataFrame. Something like

df = DataFrame()
push!(df, (a = 1, b = 2))

works.

This might change your existing workflow a bit, but it’s useful.


Actually, I think what I would do is

  1. Use Tuples as your object type that you are pushing
  2. Make an Array of Tuples by using map. Something like
vec_of_tuples = map(1:100) do _
    (rand(), rand(), rand())
end
  3. Call `df = DataFrame(vec_of_tuples)`
  4. Call rename!(df, vec_of_your_new_names)

In general, think of a row of a DataFrame as a NamedTuple or Tuple instead of a row vector.

Novel, for me. I do not fully understand. So you have 100 columns and each column is a Vector of 3 elements.
Yes, this is interesting. I did not know about the rename! command. I will look into this as well. Thanks.

Other way around: 100 rows, and each row has three elements.

The distinction between Tuple and Vector can be hard to understand, but right now Tables.jl, which DataFrames uses in its constructor, treats a Vector of Tuples as the most “basic” form of table in some way. So when in doubt about the DataFrames constructor, just make either a Vector of Tuples or NamedTuples, or a Tuple or NamedTuple of Vectors.
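
A small sketch of the two NamedTuple forms, assuming a recent DataFrames.jl release; the column names `a` and `b` are made up for illustration. The first builds the table row by row, the second column by column, and both should give the same result.

```julia
using DataFrames

# Row-oriented: a Vector of NamedTuples, one per row.
rows = map(1:3) do i
    (a = i, b = 2.0 * i)
end
df_rows = DataFrame(rows)

# Column-oriented: a NamedTuple of Vectors, one per column.
df_cols = DataFrame((a = [1, 2, 3], b = [2.0, 4.0, 6.0]))

println(df_rows == df_cols)
```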

One thing that might be new coming to Julia is that often the behavior you want comes from changing the input you have into a function rather than the way you call the function itself.


Let us continue:
Consider this section of code from a larger code.

function callback(u, t, integrator)
    C  = @view(u[1:1:2])  # Calcium
    Ce = @view(u[3:1:4])  # Calcium in ER
    I  = @view(u[5:1:6])  # IP3
    h  = @view(u[7:1:8])
    # How do I get my parameter array as a local variable?
    #J5P, J3K, J_β, J_δ, Jleak, JIPR, JSERCA, hinf, OmegaH, minf, Q2 = getCurrents(C, Ce, I, h, t, p)
    currents = getCurrents(C, Ce, I, h, t, pars)
    println("currents= ", currents)
    try
        println("***** inside try ****")
        push!(df, currents)
    catch
        V = Vector{Any}
        S = Float64
        println("**** inside catch *****")
        df = DF.DataFrame([V,V,V,V,V,V,V,V,V,V,V],
                     [:J_5P, :J_3K, :J_β, :J_δ, :Jleak, :JIPR, :JSERCA, :hinf, :OmegaH, :minf, :Q2])
        push!(df, currents)
    end
    return df
end

“getCurrents” returns a Vector of Array{Any}. I initialize a DataFrame and then fill it. Since I am not concerned with efficiency, I use try/catch to initialize the DataFrame. Unfortunately, I get the error:

(it would be nice if I could select the red part, but I cannot, so I use a png image.)
I have no idea what this is about. What is an object of type โ€œModuleโ€?
Thanks.

# How do I get my parameter array as a local variable?
   #J5P, J3K, J_β, J_δ, Jleak, JIPR, JSERCA, hinf, OmegaH, minf, Q2 = getCurrents(C, Ce, I, h, t, p)

Make getCurrents return a Tuple or NamedTuple

You haven’t defined df in your code, so I don’t know if it exists. Your error message indicates that you have df aliased as DataFrames somewhere. A Module is something like DataFrames, the package.

Like I said before, your solution is to have getCurrents return a NamedTuple. Then do push!(df, getCurrents(...)). You can do this on an empty DataFrame defined as df = DataFrame().

I also advised above for you to make an array of NamedTuples and call DataFrame on that.

Please read the documentation if you haven’t already and re-read my answers above.

Will do. And then I’ll get back to you with my findings. I am just a beginner in this language.

I am now getting back to you :slight_smile:
I was able to construct the named_tuple, but I cannot initialize the dataframe in the way you suggest.
Why do I need DF? Because I imported both Pandas and DataFrames by mistake, and Julia does not provide the means to remove Pandas. As a result, my choices are to use DF, or to reimport all my packages (with using), which I do not wish to do because I do not feel like waiting. If there is an alternative approach, I am interested in finding out more. But I searched.

Here is what I have:

DF = DataFrames
V = Vector{Any}
cols = (:minf, :Jleak, :JIPR, :JSERCA, :J_β, :J_δ, :J5P, :J3K, :Q2, :OmegaH, :hinf)
tuple = Tuple{V,V,V,V,V,V,V,V,V,V,V}
named_tuple = NamedTuple{cols, Tuple{V,V,V,V,V,V,V,V,V,V,V}}
push!(DF.DataFrame(), named_tuple)

Here is the error:

julia> 
┌ Warning: In the future `push!` will not allow passing collections of type DataType to be pushed into a DataFrame. Only `Tuple`, `AbstractArray`, `AbstractDict`, `DataFrameRow` and `NamedTuple` will be allowed.
│   caller = tst() at dataframes_experiments.jl:213
└ @ Main ~/Documents/src/2019/AstrocytesWithBrian2/Code/julia/dataframes_experiments.jl:213

julia> 

along with this image:

I don’t understand what you are saying here.

It would be easier to help if you posted an MWE.

Thanks. Here is a specific minimal example with various approaches I have tried.
My objective is to simply initialize a DataFrame directly using types, and then
append to it. The fly in the ointment is that my type is Array{Any,1}, equivalent I believe to Vector{Any}.

I have commented out various approaches. The demonstrations here consider creating a dataframe with two integer columns, in which case my type is Tuple{Int64, Int64}. The actual problem I am interested in is when the type is Tuple{V, V} where V = Vector{Any}.

using DataFrames
#---------------------------------------------------------
function tst1()
	# I added Pandas to the packages, and removed it, and yet, the functions are still available.
	# Since both Pandas and DataFrames have a DataFrame method, I am forced to prepend DataFrames to
	# some or most of the DataFrames methods.
	Jleak = [0.1, 0.3]
	JIPR  = [0.2, 0.4]
	println(typeof(Jleak)) # Array{Float64,1}

	V = Vector{Any}
	types = ([V for i ∈ 1:2])

	nb_el = 2
	tuple1 = Tuple([[0.,0.] for i ∈ 1:nb_el])
	tuple2 = Tuple([V for i ∈ 1:nb_el])
	named_tuple1 = NamedTuple{(:a,:b),Tuple{Int64,Int64}}
	named_tuple2 = NamedTuple{(:a,:b),Tuple{V,V}}
	DF = DataFrames

        # APPROACHES I HAVE TRIED
	# The next five lines all give errors.
	#append!(DF.DataFrame(), named_tuple1)
	#push!(DF.DataFrame(), named_tuple1)  # does not work
	#append!(DF.DataFrame(), named_tuple2)
	#df = DF.DataFrame(named_tuple1)  # does not work
	#df = DF.DataFrame(named_tuple2)  # does not work

	# However, I can initialize the DataFrame with values
	named_tuple3 = NamedTuple{(:a,:b),Tuple{1., 2.}}
	named_tuple4 = NamedTuple{(:a, :b)}([1., 2.])
	append!(DF.DataFrame(), named_tuple3)  # does not work
	append!(DF.DataFrame(), named_tuple4)
	#named_tuple4 = NamedTuple{(:a,:b),Tuple{V,V}}
end
tst1()

Here is what I did get to work:

function tst2()
	V = Vector{Any}
	cols = (:minf, :Jleak, :JIPR, :JSERCA, :J_β, :J_δ, :J5P, :J3K, :Q2, :OmegaH, :hinf)
	vals = ([V for i ∈ 1:7])
	tuple1 = Tuple([V for i ∈ 1:11])
	tuple2 = Tuple([[0.,0.] for i ∈ 1:11])
	println(tuple2)
	df = DF.DataFrame(tuple2)
	append!(df, tuple2)
	println(df)
	#named_tuple = NamedTuple{cols, Tuple{V,V,V,V,V,V,V,V,V,V,V}}
	#push!(DF.DataFrame(), named_tuple)
end
tst2()

Notice that I initialized the DataFrame with a Tuple and not a NamedTuple. I was told I could initialize a DataFrame directly with a NamedTuple. The DataFrame produced is not what I intended. The output of the second function tst2() gives:

([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
4×11 DataFrames.DataFrame
│ Row │ x1      │ x2      │ x3      │ x4      │ x5      │ x6      │ x7      │ x8      │ x9      │ x10     │ x11     │
│     │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 2   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 3   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 4   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │

This is incorrect. I was expecting only two rows, where each column was a vector of two elements, as defined by tuple2.

I hope this is clearer.

You can initialize an empty dataframe and then push a named tuple to it.

julia> function getCurrents()
           r  = rand()
           s = [5, 6]
           return (a = r, b = s)
       end
getCurrents (generic function with 1 method)

julia> df = DataFrame()
0×0 DataFrame


julia> for i in 1:100
           output_named_tuple = getCurrents()
           push!(df, output_named_tuple)
       end

julia> df
100×2 DataFrame
│ Row │ a         │ b      │
│     │ Float64   │ Array… │
├─────┼───────────┼────────┤
│ 1   │ 0.34909   │ [5, 6] │
│ 2   │ 0.677999  │ [5, 6] │
│ 3   │ 0.672286  │ [5, 6] │
│ 4   │ 0.102374  │ [5, 6] │
│ 5   │ 0.24611   │ [5, 6] │
│ 6   │ 0.440008  │ [5, 6] │
│ 7   │ 0.814904  │ [5, 6] │
│ 8   │ 0.303771  │ [5, 6] │
⋮
│ 92  │ 0.30015   │ [5, 6] │
│ 93  │ 0.200876  │ [5, 6] │
│ 94  │ 0.0793795 │ [5, 6] │
│ 95  │ 0.831372  │ [5, 6] │
│ 96  │ 0.440376  │ [5, 6] │
│ 97  │ 0.454384  │ [5, 6] │
│ 98  │ 0.973474  │ [5, 6] │
│ 99  │ 0.46432   │ [5, 6] │
│ 100 │ 0.953668  │ [5, 6] │

This will specify the types for you and all the column names will work.

You shouldn’t push to an empty dataframe each time, which I think is what you are doing. I am saying that if you push a named tuple to an empty dataframe, the types and the column names will be set up automatically for you.


This is great, thanks! This will solve my problem, albeit not in what I would consider an elegant way. But without metaprogramming, it might be the only way.

So in my notation, I would have something like:

using DataFrames

function getCurrents()
           random  = rand()
           J_δ = [8, 6]
           return (J_δ = J_δ, c = random)    # <<<<<<<<<<<<<<<<<
end

DF = DataFrames
df = DF.DataFrame()

for i in 1:5
     output_named_tuple = getCurrents()
     push!(df, output_named_tuple)
end

println(df)

In an ideal world, I would prefer to first initialize an empty DataFrame with column names, and then append without having to write J_\delta = J_\delta (which seems redundant). It seems a macro is necessary to accomplish what I am after, although I read that most things a macro can do can be done without macros.

If I add a variable definition in getCurrents(), I would have to add its name in three locations. For example:

function getCurrents()
    new_var = [3,4,5]
    ...
    return(....., new_var=new_var)

The syntax seems over the top. My preference in terms of DRY programming, would be to
add only the line

   new_var = [3,4,5]

and use metaprogramming or other techniques to achieve the same result as above. This is to say:
it would be nice to be able to write the getCurrents() function without the return statement and yet achieve the same effect. Or have a way to bypass NamedTuples.

My real situation is that I am invoking a callback mechanism available in the DifferentialEquations.jl module. If I was programming in C++ or Python, I would be working from within a class and would initialize the DataFrame in the constructor. How does one approach something similar in Julia?

I have seen constructs such as:

let df = DF.DataFrame()
    global function myCallback(.....)
        ....
        J_\delta, J_\beta = getCurrents(...)
        push!(df, (J_\beta = J_\beta, J_\delta = J_\delta))
    end
end

I don’t know how to handle this; hopefully someone else can chime in. But it is likely that you would be best served by rewriting your code in a less class-based way. You might also want to consider not using a DataFrame in this context but rather a vector of structs. It’s not clear what you want to do, and this seems like a case of the X-Y problem.
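
One less class-based way to keep state across callback invocations is a closure that captures its own storage, which plays roughly the role of a constructor. The following is only a sketch using Base Julia: `make_recorder` and `record!` are hypothetical names, the recorded values are made up, and the DifferentialEquations.jl callback wiring is deliberately left out.

```julia
# Hypothetical helper: returns a recording function plus the storage it fills.
function make_recorder()
    rows = NamedTuple[]                  # storage captured by the closure
    record!(nt::NamedTuple) = (push!(rows, nt); nothing)
    return record!, rows
end

record!, rows = make_recorder()

# Stand-in for repeated callback invocations during a simulation.
for t in 1:3
    record!((t = t, J1 = tanh(t), J2 = sin(t)))
end

println(length(rows), " rows recorded; first J1 = ", rows[1].J1)
```

The storage is created once, when `make_recorder` runs, so nothing inside the callback has to check whether it has been initialized.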

The functionality you describe to not type as much is available in NamedTupleTools.jl.
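
Independently of that package, Julia 1.5 and later have a built-in shorthand that removes the `J_δ = J_δ` repetition: `(; x, y)` builds the NamedTuple `(x = x, y = y)` from local variables of the same name. A minimal sketch, with a fixed value in place of `rand()` so the result is predictable:

```julia
function getCurrents()
    J_δ = [8, 6]
    c = 0.5
    new_var = [3, 4, 5]
    # Field names are taken from the variable names.
    return (; J_δ, c, new_var)
end

nt = getCurrents()
println(keys(nt))
```

Adding a variable then means touching two places (the definition and the return list) instead of three.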

I don’t know why you are so interested in declaring all the types in your data frame at the outset, rather than pushing one at a time. If you are interested in saving typing, then a named-tuple method seems like an easier solution. Remember my comment above that you can make a vector of named tuples rather than a DataFrame and then call DataFrame on that instead.

But if you are committed to declaring your data frame beforehand, then yes, this should work

julia> df = DataFrame([Vector{Int}[], Int[]], [:a, :b])
0×2 DataFrame


julia> push!(df, [[1, 2], 5])
1×2 DataFrame
│ Row │ a      │ b     │
│     │ Array… │ Int64 │
├─────┼────────┼───────┤
│ 1   │ [1, 2] │ 5     │

Thank you; you gave me much to think about. My history is as follows: I started with Forth on the Mac, then Pascal, then Fortran, followed by C, C++, Java, Python, some Ruby, and now Julia. So I have to retrain myself to think in the host language each time, and that is difficult. I like the DRY principle (non-repeating). Brian2 (Brian 2 documentation, version 2.5.1) is a Python-based framework to easily define complex networks of biological systems. It translates to C (or C++). It has an ODE system solver (not as sophisticated as Julia’s DifferentialEquations), but it has a Monitor structure that makes it very easy to add variables to monitor during the simulation, which is also very useful for debugging.

So for example:

Eqs = """
J1 = tanh(u)
J2 = sin(v)
du/dt = -1 + J1 + J2
dv/dt = -v + log(u)
"""

mon = Monitor(["J1", "J2"])
run_simulation()

I can then have access to my variables at the end of the simulation using

mon.J1 and mon.J2

So adding a new variable to track is only a matter of adding a variable to the list in the Monitor object.
I find this approach very appealing, and would like to work towards it in Julia to the extent possible.

I have found that many things are elegant in Julia and can be efficiently optimized, but there are some constructs and capabilities that are clunky or missing, even though they are likely implementable. Of course, I am talking as a non-Julia expert, but first impressions count. Some things I am extremely impressed by, some other things very much less so.

No disrespect intended. Just a few thoughts. I really appreciate all the help I am getting on this Forum.

Gordon

This is still quite easy to do in Julia

struct MyOutput
   J1
   J2
end

function MyOutput(v::Vector)
    MyOutput(v...)
end

vec = ComplicatedDiffEqFunction()
mon = MyOutput(vec)
println(mon.J1)

So you can get results in vector form from a Simulation and then put them in your output struct.


Or even, with your DRY example

function vec_to_NT(vec)
    (J1 = vec[1], J2 = vec[2])
end

Then you can call this function on your vector of results and push to a data frame like before. No need to even define a struct.
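
To get closer to the Monitor-style "list the names once" workflow, the field names can be passed in as a tuple of Symbols instead of being hard-coded. This is a sketch using only Base Julia; `vec_to_nt` and `monitored` are hypothetical names, and the values are made up.

```julia
# Hypothetical generalization of vec_to_NT: names come from a tuple of Symbols.
vec_to_nt(names::Tuple{Vararg{Symbol}}, vec::AbstractVector) =
    NamedTuple{names}(Tuple(vec))

monitored = (:J1, :J2)           # the only place the names are listed
nt = vec_to_nt(monitored, [0.1, 0.2])

println(nt.J1, " ", nt.J2)
```

Tracking one more quantity then means adding one Symbol to `monitored`, provided the results vector has a matching length; a length mismatch raises an error.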