Initializing a dataframe

You can use [fill(Vector{Int}, 15); Float64] if you don’t want to type Vector{Int} 15 times


The easiest thing to do here would be to, instead of pushing row arrays, push a NamedTuple to your DataFrame. Something like

df = DataFrame()
push!(df, (a = 1, b = 2))

works.

This might change your existing workflow a bit, but it’s useful.


Actually, I think what I would do is

  1. Use Tuples as your object type that you are pushing
  2. Make an Array of Tuples by using map. Something like
vec_of_tuples = map(1:100) do _
    (rand(), rand(), rand())
end
  3. Call df = DataFrame(vec_of_tuples)
  4. Call rename!(df, vec_of_your_new_names)

In general, think of a row of a DataFrame as a NamedTuple or Tuple instead of a row vector.

This is novel for me, and I do not fully understand. So you have 100 columns, and each column is a Vector of 3 elements?
Yes, this is interesting. I did not know about the rename! command. I will look into this as well. Thanks.

Other way around: 100 rows, and each row has three elements.

The distinction between Tuple and Vector can be hard to understand, but right now Tables.jl, which DataFrames uses in its constructor, treats a Vector of Tuples as the most “basic” form of table in some way. So when in doubt about the DataFrames constructor, just make either a Vector of Tuples or NamedTuples, or a Tuple or NamedTuple of Vectors.
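To make the two layouts concrete, here is a small sketch in plain Julia (no DataFrames needed). The names rows and cols are made up for illustration; per the advice above, either object is the kind of input one would hand to the DataFrame constructor.

```julia
# Row-oriented: a Vector of NamedTuples, one element per row.
rows = [(a = i, b = 2.0 * i) for i in 1:3]

# Column-oriented: a NamedTuple of Vectors, one field per column.
cols = (a = [r.a for r in rows], b = [r.b for r in rows])

# The same cell can be reached either way.
println(rows[2].b == cols.b[2])
```

Both objects hold the same data; only the nesting differs.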

One thing that might be new when coming to Julia is that the behavior you want often comes from changing the input you pass to a function rather than the way you call the function itself.


Let us continue:
Consider this section of code from a larger program.

function callback(u, t, integrator)
	C  = @view(u[1:1:2])  # Calcium
    Ce = @view(u[3:1:4])  # Calcium in ER
    I  = @view(u[5:1:6])  # IP3
    h  = @view(u[7:1:8])
	# How do I get my parameter array as a local variable?
	#J5P, J3K, J_β, J_δ, Jleak, JIPR, JSERCA, hinf, OmegaH, minf, Q2 = getCurrents(C, Ce, I, h, t, p)
	currents = getCurrents(C, Ce, I, h, t, pars)
	println("currents= ", currents)
	try
		println("***** inside try ****")
	    push!(df, currents)
    catch
		V = Vector{Any}
		S = Float64
		println("**** inside catch *****")
        df = DF.DataFrame([V,V,V,V,V,V,V,V,V,V,V],
		             [:J_5P, :J_3K, :J_β, :J_δ, :Jleak, :JIPR, :JSERCA, :hinf, :OmegaH, :minf, :Q2])
	    push!(df, currents)
	end
	return df
end

“getCurrents” returns a Vector of Array{Any}. I initialize a DataFrame and then fill it. Since I am not concerned with efficiency, I use try/catch to initialize the DataFrame. Unfortunately, I get the error:

(it would be nice if I could select the red part, but I cannot, so I use a png image.)
I have no idea what this is about. What is an object of type “Module”?
Thanks.

# How do I get my parameter array as a local variable?
   #J5P, J3K, J_β, J_δ, Jleak, JIPR, JSERCA, hinf, OmegaH, minf, Q2 = getCurrents(C, Ce, I, h, t, p)

Make getCurrents return a Tuple or NamedTuple.

You haven’t defined df in your code, so I don’t know if it exists. Your error message indicates that you have df aliased as DataFrames somewhere. A Module is something like DataFrames, the package.

Like I said before, your solution is to have getCurrents return a NamedTuple. Then do push!(df, getCurrents(...)). You can do this on an empty DataFrame defined as df = DataFrame().
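A minimal sketch of what "return a NamedTuple" looks like, in plain Julia. This getCurrents is a hypothetical stand-in: its two-argument signature and its formulas are placeholders, not the real model.

```julia
# Hypothetical stand-in for getCurrents; the formulas are placeholders.
function getCurrents(C, h)
    Jleak = 0.1 .* C
    JIPR  = h .* C
    return (Jleak = Jleak, JIPR = JIPR)   # the names travel with the values
end

currents = getCurrents([1.0, 2.0], [0.5, 0.5])

# Access by name...
println(currents.Jleak)

# ...or destructure positionally, as with an ordinary Tuple.
Jleak, JIPR = currents
```

Because the field names are part of the returned value, something like push!(df, currents) has everything it needs to name the columns.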

I also advised above for you to make an array of NamedTuples and call DataFrame on that.

Please read the documentation if you haven’t already and re-read my answers above.

Will do. And then I’ll get back to you with my findings. I am just a beginner in this language.

I am now getting back to you :slight_smile:
I was able to construct the named_tuple, but I cannot initialize the dataframe in the way you suggest.
Why do I need DF? Because I imported both Pandas and DataFrames by mistake, and Julia does not provide a means to remove Pandas. As a result, my choices are to use DF or to reimport all my packages (with using), which I do not wish to do, since I do not feel like waiting. If there is an alternative approach, I am interested in finding out more, but I have searched.

Here is what I have:

DF = DataFrames
V = Vector{Any}
cols = (:minf, :Jleak, :JIPR, :JSERCA, :J_β, :J_δ, :J5P, :J3K, :Q2, :OmegaH, :hinf)
tuple = Tuple{V,V,V,V,V,V,V,V,V,V,V}
named_tuple = NamedTuple{cols, Tuple{V,V,V,V,V,V,V,V,V,V,V}}
push!(DF.DataFrame(), named_tuple)

Here is the error:

julia> 
┌ Warning: In the future `push!` will not allow passing collections of type DataType to be pushed into a DataFrame. Only `Tuple`, `AbstractArray`, `AbstractDict`, `DataFrameRow` and `NamedTuple` will be allowed.
│   caller = tst() at dataframes_experiments.jl:213
└ @ Main ~/Documents/src/2019/AstrocytesWithBrian2/Code/julia/dataframes_experiments.jl:213

julia> 

along with this image:

I don’t understand what you are saying here.

It would be easier to help if you posted an MWE.

Thanks. Here is a specific minimal example with various approaches I have tried.
My objective is simply to initialize a DataFrame directly using types, and then
append to it. The fly in the ointment is that my type is Array{Any,1}, which I believe is equivalent to Vector{Any}.

I have commented out various approaches. The demonstrations here consider creating a DataFrame with two integer columns, in which case my type is Tuple{Int64, Int64}. The actual problem I am interested in is when the type is Tuple{V, V} where V = Vector{Any}.

using DataFrames
#---------------------------------------------------------
function tst1()
	# I added Pandas to the packages, and removed it, and yet, the functions are still available.
	# Since both Pandas and DataFrames have a DataFrame method, I am forced to prepend DataFrames to
	# some or most of the DataFrames methods.
	Jleak = [0.1, 0.3]
	JIPR  = [0.2, 0.4]
	println(typeof(Jleak)) # Array{Float64,1}

	V = Vector{Any}
	types = ([V for i ∈ 1:2])

	nb_el = 2
	tuple1 = Tuple([[0.,0.] for i ∈ 1:nb_el])
	tuple2 = Tuple([V for i ∈ 1:nb_el])
	named_tuple1 = NamedTuple{(:a,:b),Tuple{Int64,Int64}}
	named_tuple2 = NamedTuple{(:a,:b),Tuple{V,V}}
	DF = DataFrames

        # APPROACHES I HAVE TRIED
	# The next five lines all give errors.
	#append!(DF.DataFrame(), named_tuple1)
	#push!(DF.DataFrame(), named_tuple1)  # does not work
	#append!(DF.DataFrame(), named_tuple2)
	#df = DF.DataFrame(named_tuple1)  # does not work
	#df = DF.DataFrame(named_tuple2)  # does not work

	# However, I can initialize the DataFrame with values
	named_tuple3 = NamedTuple{(:a,:b),Tuple{1., 2.}}
	named_tuple4 = NamedTuple{(:a, :b)}([1., 2.])
	append!(DF.DataFrame(), named_tuple3)  # does not work
	append!(DF.DataFrame(), named_tuple4)
	#named_tuple4 = NamedTuple{(:a,:b),Tuple{V,V}}
end
tst1()

Here is what I did get to work:

function tst2()
	V = Vector{Any}
	cols = (:minf, :Jleak, :JIPR, :JSERCA, :J_β, :J_δ, :J5P, :J3K, :Q2, :OmegaH, :hinf)
	vals = ([V for i ∈ 1:7])
	tuple1 = Tuple([V for i ∈ 1:11])
	tuple2 = Tuple([[0.,0.] for i ∈ 1:11])
	println(tuple2)
	df = DF.DataFrame(tuple2)
	append!(df, tuple2)
	println(df)
	println(df)
	#named_tuple = NamedTuple{cols, Tuple{V,V,V,V,V,V,V,V,V,V,V}}
	#push!(DF.DataFrame(), named_tuple)
end
tst2()

Notice that I initialized the DataFrame with a Tuple and not a NamedTuple. I was told I could initialize a DataFrame directly with a NamedTuple. The DataFrame produced is not what I intended. The output of the second function tst2() gives:

([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
4×11 DataFrames.DataFrame
│ Row │ x1      │ x2      │ x3      │ x4      │ x5      │ x6      │ x7      │ x8      │ x9      │ x10     │ x11     │
│     │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 2   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 3   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │
│ 4   │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │ 0.0     │

This is incorrect. I was expecting only two rows, where each column was a vector of two elements, as defined by tuple2.

I hope this is clearer.

You can initialize an empty dataframe and then push a named tuple to it.

julia> function getCurrents()
           r  = rand()
           s = [5, 6]
           return (a = r, b = s)
       end
getCurrents (generic function with 1 method)

julia> df = DataFrame()
0×0 DataFrame


julia> for i in 1:100
           output_named_tuple = getCurrents()
           push!(df, output_named_tuple)
       end

julia> df
100×2 DataFrame
│ Row │ a         │ b      │
│     │ Float64   │ Array… │
├─────┼───────────┼────────┤
│ 1   │ 0.34909   │ [5, 6] │
│ 2   │ 0.677999  │ [5, 6] │
│ 3   │ 0.672286  │ [5, 6] │
│ 4   │ 0.102374  │ [5, 6] │
│ 5   │ 0.24611   │ [5, 6] │
│ 6   │ 0.440008  │ [5, 6] │
│ 7   │ 0.814904  │ [5, 6] │
│ 8   │ 0.303771  │ [5, 6] │
⋮
│ 92  │ 0.30015   │ [5, 6] │
│ 93  │ 0.200876  │ [5, 6] │
│ 94  │ 0.0793795 │ [5, 6] │
│ 95  │ 0.831372  │ [5, 6] │
│ 96  │ 0.440376  │ [5, 6] │
│ 97  │ 0.454384  │ [5, 6] │
│ 98  │ 0.973474  │ [5, 6] │
│ 99  │ 0.46432   │ [5, 6] │
│ 100 │ 0.953668  │ [5, 6] │

This will specify the types for you and all the column names will work.

You shouldn’t push to an empty dataframe each time, which I think is what you are doing. I am saying that if you push a named tuple to an empty dataframe, the types and the column names will be set up automatically for you.


This is great, thanks! This will solve my problem, albeit not in what I would consider an elegant way. But without metaprogramming, it might be the only way.

So in my notation, I would have something like:

using DataFrames

function getCurrents()
           random  = rand()
           J_δ = [8, 6]      
           return (J_δ = J_δ, c = random)    # <<<<<<<<<<<<<<<<<
end

DF = DataFrames
df = DF.DataFrame()

for i in 1:5
     output_named_tuple = getCurrents()
     push!(df, output_named_tuple)
end

println(df)

In an ideal world, I would prefer to first initialize an empty DataFrame with column names, and then append without having to write J_\delta = J_\delta (which seems redundant). It seems that a macro is necessary to accomplish what I am after, although I read that most things a macro can do can be done without macros.

If I add a variable definition in getCurrents(), I would have to add its name in three locations. For example:

function getCurrents()
    new_var = [3,4,5]
    ...
    return(....., new_var=new_var)

The syntax seems over the top. My preference in terms of DRY programming, would be to
add only the line

   new_var = [3,4,5]

and use metaprogramming or other techniques to achieve the same result as above. This is to say:
it would be nice to be able to write the getCurrents() function without the return statement and yet achieve the same effect. Or have a way to bypass NamedTuples.
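For what it is worth, Julia 1.5 and later accept a shorthand that removes the name = name repetition in the return statement: (; a, b) builds a NamedTuple whose field names are taken from the variable names. A minimal sketch, reusing the names from the example above with placeholder values:

```julia
function getCurrents()
    J_δ = [8, 6]
    new_var = [3, 4, 5]
    c = 0.5                       # placeholder value instead of rand()
    # Shorthand for (J_δ = J_δ, new_var = new_var, c = c); needs Julia ≥ 1.5.
    return (; J_δ, new_var, c)
end

nt = getCurrents()
println(keys(nt))
```

With this form, adding a tracked variable means touching two places (the definition and the return list) rather than three.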

My real situation is that I am invoking a callback mechanism available in the DifferentialEquations.jl module. If I was programming in C++ or Python, I would be working from within a class and would initialize the DataFrame in the constructor. How does one approach something similar in Julia?

I have seen constructs such as:

let df = DF.DataFrame()
  global function myCallback(.....)
     ....
     J_δ, J_β = getCurrents(...)
     push!(df, (J_β = J_β, J_δ = J_δ))
  end
end
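The idea behind such constructs is a closure: a function that captures its own storage, which plays the role of the member variable a class constructor would set up. A minimal sketch in plain Julia follows. make_recorder, record! and the (u, t) signature are hypothetical placeholders, not the DifferentialEquations.jl callback API.

```julia
# Hypothetical helper: returns a recording function plus the storage it fills.
function make_recorder()
    rows = NamedTuple[]                          # captured by the closure below
    record!(u, t) = push!(rows, (t = t, total = sum(u)))
    return record!, rows
end

record!, rows = make_recorder()

# Stand-in for the solver calling back once per time step.
for t in 0.0:0.5:1.0
    record!([t, 2t], t)
end

println(length(rows))
```

The collected rows form a Vector of NamedTuples, which is the shape suggested earlier in the thread for building a DataFrame in one call at the end.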

I don’t know how to handle this, hopefully someone else can chime in. But it is likely that you would be best served by re-writing your code in a less class-based way. You might want to also consider not using a DataFrame in this context but rather a vector of Structs. It’s not clear what you want to do and this seems like a case of the X-Y problem.

The functionality you describe to not type as much is available in NamedTupleTools.jl.

I don’t know why you are so interested in declaring all the types in your data frame at the outset, rather than pushing one at a time. If you are interested in saving typing, then a named-tuple method seems like an easier solution. Remember my comment above that you can make a vector of named tuples rather than a DataFrame and then call DataFrame on that instead.

But if you are committed to declaring your data frame beforehand, then yes, this should work:

julia> df = DataFrame([Vector{Int}[], Int[]], [:a, :b])
0×2 DataFrame


julia> push!(df, [[1, 2], 5])
1×2 DataFrame
│ Row │ a      │ b     │
│     │ Array… │ Int64 │
├─────┼────────┼───────┤
│ 1   │ [1, 2] │ 5     │

Thank you; you gave me much to think about. My history is as follows: I started with Forth on the Mac, then Pascal, then Fortran, followed by C, C++, Java, Python, some Ruby, and now Julia. So I have to retrain myself to think in the host language each time, and that is difficult. I like the DRY principle (non-repeating). Brian2 (see the Brian 2 2.5.1 documentation) is a Python-based framework to easily define complex networks of biological systems. It translates to C (or C++). It has an ODE system solver (not as sophisticated as Julia’s DifferentialEquations), but it has a Monitor structure that makes it very easy to add variables to monitor during the simulation, which is also very useful for debugging.

So for example:

Eqs = """
J1 = tanh(u)
J2 = sin(v)
du/dt = -1 + J1 + J2
dv/dt = -v + log(u)
"""

mon = Monitor(["J1", "J2"])
run_simulation()

I can then have access to my variables at the end of the simulation using

mon.J1 and mon.J2

So adding a new variable to track is only a matter of adding a variable to the list in the Monitor object.
I find this approach very appealing, and would like to work towards it in Julia to the extent possible.

I have found that many things are elegant in Julia and can be efficiently optimized, but there are some constructs and capabilities that are clunky or missing, even though they are likely implementable. Of course, I am talking as a non-Julia expert, but first impressions count. Some things I am extremely impressed by, some other things very much less so.

No disrespect intended. Just a few thoughts. I really appreciate all the help I am getting on this Forum.

Gordon

This is still quite easy to do in Julia

struct MyOutput
   J1
   J2
end

function MyOutput(v::Vector)
    MyOutput(v...)
end

vec = ComplicatedDiffEqFunction()
mon = MyOutput(vec)
println(mon.J1)

So you can get results in vector form from a Simulation and then put them in your output struct.


Or even, with your DRY example

function vec_to_NT(vec)
    (J1 = vec[1], J2 = vec[2])
end

Then you can call this function on your vector of results and push to a data frame like before. There is no need to even define a struct.

Here are some further experiments, getting closer to what I really want. At this point, it is no longer about my original problem, but about me being obsessed with getting a specific result.

struct A end
struct B end

function getCurrents(::A)
           random  = rand()
           println("A")
           J_δ = [8, 6]
           J_β = [3,23]
           return (J_δ = J_δ, J_β = J_β, c = random)
end

function getCurrents(::B)
           dict = Dict()
           random  = rand()
           dict[:J_δ] = [8, 6]
           dict[:J_β] = [3,25]
           return dict
end

a = A();  output_named_tuple = getCurrents(a)
for (k,v) in zip(keys(output_named_tuple), output_named_tuple)
  # eval evaluates in the global context
  eval(:($k = $v))
end

b = B(); dict = getCurrents(b)

for (k,v) in dict  # iterating a Dict yields key => value pairs
  # eval evaluates in the global context
  eval(:($k = $v))  # Defines J_β, J_δ
end

My function getCurrents returns either a NamedTuple or a Dictionary. When it returns a dictionary, J_\delta and J_\beta each appear only once. In the original solution, J_\beta and J_\delta each appear three times, which violates the DRY principle. Lower down, I have two loops, one over the NamedTuple and one over the dictionary. In both cases, the variables J_\beta and J_\delta are instantiated in the global space (perhaps the wrong term in Julia), in the sense that in any function I could type something like

  z = J_\delta + J_\beta

and I would get a result.

My real objective is to specify the variables I wish to track only ONCE, and have my callback collect the value of these variables once per time step of an ODE solver, and store the results in a DataFrame. Of course it is possible. Anything is possible: after all, that is what packages and modules are all about, namely creating easy-to-use functionality that is not simple with the current structure of Julia and packaging it for the user.
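One way to list the tracked names in a single place, without eval or macros, is to compute every candidate quantity as a NamedTuple and then select a subset by name with NamedTuple{names}(nt). A minimal sketch in plain Julia; tracked, all_quantities and inputs are hypothetical, and the formulas are borrowed from the Brian2-style example only as placeholders.

```julia
# The one place where the tracked names are listed.
tracked = (:J1, :J2)

# Hypothetical model quantities; J3 is computed but not tracked.
all_quantities(u, v) = (J1 = tanh(u), J2 = sin(v), J3 = u * v)

# Stand-in for the states seen at successive time steps.
inputs = [(0.0, 0.0), (1.0, 2.0)]

# NamedTuple{names}(nt) keeps only the listed fields, in the listed order.
history = [NamedTuple{tracked}(all_quantities(u, v)) for (u, v) in inputs]

println(keys(history[1]))
```

Tracking another quantity then means editing only the tracked tuple, and history is again a Vector of NamedTuples.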

I am open to any suggestions you might have. This little excursion has taught me about NamedTuples, Structures, zip, loops of various kinds and the most basic form of metaprogramming.

One question: I wonder how efficient or non-efficient my approach is. Note that efficiency is not the point here, but feasibility.

When using DifferentialEquations.jl, if I must unpack a dictionary every time the right-hand side is invoked, there might be a penalty. Those are experiments I might drive myself to run.

Another issue: in the right-hand side routine, without the callback, J_\beta and J_\delta are variables local to the method. In the current approach, J_\delta and J_\beta are defined in the global space, which is never a good idea. So one question to answer is whether it is possible to apply a macro to create a variable in a local context of some kind.
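As a point of comparison, Julia 1.7 and later can create local variables from a NamedTuple without any macro, using property destructuring: (; a, b) = nt. A minimal sketch, where rhs_total is a hypothetical function and the values mirror the example above:

```julia
# Hypothetical function: unpacks two fields into locals and combines them.
function rhs_total(currents)
    (; J_δ, J_β) = currents      # locals J_δ and J_β; requires Julia ≥ 1.7
    return sum(J_δ) + sum(J_β)
end

total = rhs_total((J_δ = [8, 6], J_β = [3, 23], c = 0.1))
println(total)
```

The unpacked names exist only inside rhs_total, so nothing is defined in the global scope.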

I have read the following three links:

Thank you,

Gordon

I suspect that there is a simple and idiomatic solution, but I have to admit that with this meandering topic I no longer have a clear idea what the problem is.

If you want to collect the arguments a function was called with, you could define a container and make it callable. E.g.,

struct CollectingVector{T}
    vector::Vector{T}
end

CollectingVector{T}() where T = CollectingVector(Vector{T}())

function (cv::CollectingVector)(x)
    push!(cv.vector, x)
    nothing
end

julia> cv = CollectingVector{typeof((J_δ = 1.0, c =  1.0))}(); # template for type

julia> cv((J_δ = 1, c = 2))

julia> cv.vector
1-element Array{NamedTuple{(:J_δ, :c),Tuple{Float64,Float64}},1}:
 (J_δ = 1.0, c = 2.0)

You almost certainly don’t need metaprogramming, but instead of learning about very basic things like loops and composite types (struct) in the course of solving the problem, you may benefit from just working through the manual first.

Julia is a powerful language, but you won’t be able to harness that power without making an initial investment in some structured form.

There is:

colnames= Symbol.('A':'Z')
df = DataFrame(fill(Int, length(colnames)), colnames)

#now you can push your data line by line onto your df
for line in eachrow(rand(1:100, 100, length(colnames)))
    push!(df, Tuple(line))
end

The trick is the Array (created by fill), which contains the type of each column.

Thank you, everybody, I have learned a lot from you, beyond my initial studies of the language. I appreciate it.

Just for reference, I have done lots of reading and experimentation, but have formed first impressions of the language. Another possible approach I had not considered is using the possibly right tool for the job, if efficiency is not an issue: calling Python code. I will certainly keep reading all the great information out there and close the issue, which has certainly meandered.
