DataFrames & IJulia?


#1

I have used pandas in Python to typeset a table in a Jupyter notebook (standard Anaconda installation, nothing fancy). The data consisted of a simple list of dictionaries; with pandas, this produced a table with the dictionary keys as column headers and the dictionary values in the rows below.

Question: what is the command for doing the same with data frames in IJulia?

Example of data:
sr_org.getQuantities()[0:2]

leading to:

[{'Changeable': 'false', 'Description': 'Initializing temperature in reactor, K', 'Name': 'T', 'Value': None, 'Variability': 'continuous', 'alias': 'noAlias', 'aliasvariable': None}, {'Changeable': 'false', 'Description': 'Initializing concentration of A in reactor, mol/L', 'Name': 'cA', 'Value': None, 'Variability': 'continuous', 'alias': 'noAlias', 'aliasvariable': None}]

This looks as follows in Python/Jupyter (import pandas as pd):


#2

I use

using DataFrames

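# Build a DataFrame from a vector of Dicts with Symbol keys.
# Keys absent from a given Dict are filled with `missing_value`.
function df_from_dicts(arr::AbstractArray; missing_value=missing)
    cols = Set{Symbol}()
    for di in arr
        union!(cols, keys(di))
    end
    df = DataFrame()
    for col in cols
        # Recent DataFrames versions need df[!, col]; older ones used df[col].
        df[!, col] = [get(di, col, missing_value) for di in arr]
    end
    return df
end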

df_from_dicts(your_vector_of_dicts)

#3

Thanks, cstjean – I finally figured out your code…

First, I tried with the following example:

dict1 = Dict("name"=>"William", "surname"=>"Shakespeare", "bornyear"=>1564);
dict2 = Dict("name"=>"Miguel", "surname"=>"de Cervantes", "bornyear"=>1547);
dict3 = Dict("name"=>"William", "surname"=>"Tyndale", "bornyear"=>1484); 
dlist = [dict1,dict2,dict3]; 
df_from_dicts(dlist)

Didn’t work, and gave an error message. Then I tried:

dict1 = Dict(:name=>"William", :surname=>"Shakespeare", :bornyear=>1564);
dict2 = Dict(:name=>"Miguel", :surname=>"de Cervantes", :bornyear=>1547);
dict3 = Dict(:name=>"William", :surname=>"Tyndale", :bornyear=>1484); 
dlist = [dict1,dict2,dict3]; 
df_from_dicts(dlist)

and now it works.

Questions:

  1. Is it becoming a standard that Dict keys should be symbols? Or is that just a choice?
  2. Are there any advantages in using symbols for keys?
  3. If no to #1 and #2, how should one generalize the code to handle any case?

-B


#4

Sorry that my example was a bit terse; I was short on time.

DataFrames in Julia use symbols for column names (Pandas uses strings), so it’s logical to use symbols in the input to df_from_dicts. Technically, Dicts can map from anything to anything.

On the advantages of symbols as keys: you mean, in general? AFAIK, symbols in Julia are “interned”, which means that comparing/hashing them is really fast (O(1), like comparing two numbers), whereas string comparison is, I believe, O(N) in the length of the string.
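As a small illustration of the interning point, the sketch below (variable names are made up for this example) checks that a Symbol built at run time is identical to the literal one, and that Symbols and Strings both work as Dict keys:

```julia
# Two Symbols with the same name are the same interned object,
# so `===` (identity) holds however they were constructed.
s1 = :name
s2 = Symbol("na" * "me")   # built at run time from a concatenated String
println(s1 === s2)          # true

# Symbols work as Dict keys just like Strings do.
by_symbol = Dict(:name => "William", :bornyear => 1564)
by_string = Dict("name" => "William", "bornyear" => 1564)
println(by_symbol[:name] == by_string["name"])   # true
```

This only shows the identity property; it does not measure the speed difference.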

To handle string keys, you could index with Symbol(col) instead of col on the left-hand side of the assignment in the loop above, keeping get(di, col, missing_value) on the right, and use a Set{String}() for cols. As much as possible, I would favour manipulating symbols over strings.
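Another way to cover both cases, sketched here under the assumption that every key can be converted with Symbol(...), is to normalize the keys before building the table. The helper name symbolize_keys is hypothetical and not part of DataFrames:

```julia
# Hypothetical helper (not from the thread): convert every key of a Dict
# to a Symbol so that String-keyed input can be handled uniformly.
symbolize_keys(d::AbstractDict) = Dict(Symbol(k) => v for (k, v) in d)

dict1 = Dict("name" => "William", "surname" => "Shakespeare", "bornyear" => 1564)
dict2 = Dict(:name => "Miguel", :surname => "de Cervantes", :bornyear => 1547)

# Mixed input: one Dict with String keys, one with Symbol keys.
dlist = [symbolize_keys(d) for d in (dict1, dict2)]

println(all(k isa Symbol for d in dlist for k in keys(d)))   # true
```

The resulting dlist has only Symbol keys, so it can be passed to a function such as df_from_dicts that expects them.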


#5

Thanks a lot.


#6

OK… a couple of questions about Strings and Symbols… I can convert between strings and symbols as follows:

julia> sy1 = Symbol("derT")
:derT
julia> String(sy1)
"derT"

I can create the same symbol with short-hand colon notation:

julia> sy2 = :derT
:derT
julia> sy3 = :(derT)
:derT

However, the following appear to be different:

julia> sy4 = Symbol("der(T)")
Symbol("der(T)")
julia> sy5 = :(der(T))
:(der(T))

Attempting to convert back to string leads to:

julia> String(sy4)
"der(T)"
julia> String(sy5)
MethodError: Cannot `convert` an object of type Expr to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.

Stacktrace:
 [1] String(::Expr) at .\sysimg.jl:77
 [2] include_string(::String, ::String) at .\loading.jl:522

Does anyone know the system/rationale here?


#7

The quoting syntax :( ... ) returns objects of various types.

Both :a and :(a) return the symbol :a.

:(1) returns an Int with value 1.

:(der(T)) returns an expression, an object of type Expr.

julia> dump(:(der(T)))
Expr
  head: Symbol call
  args: Array{Any}((2,))
    1: Symbol der
    2: Symbol T
  typ: Any

These are exactly the objects that Julia returns when parsing code into a syntax tree.
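The different return types can be checked directly. This is a minimal sketch using only Base functions; the integer type shown assumes a 64-bit system:

```julia
println(typeof(:a))          # Symbol
println(typeof(:(a)))        # Symbol
println(typeof(:(1)))        # Int64 on a 64-bit system
println(typeof(:(der(T))))   # Expr

# An Expr stores the kind of expression in `head` and its parts in `args`.
ex = :(der(T))
println(ex.head)   # call
println(ex.args)   # Any[:der, :T]
```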

This

julia> sy4 = Symbol("der(T)")
Symbol("der(T)")

creates a Symbol. Julia tries to display objects as text that reconstructs them when parsed and evaluated. Since :(der(T)) parses to an expression rather than a symbol, Julia displays this Symbol as Symbol("der(T)") instead.

Finally, you can convert an expression to a string:

julia> string(:(der(T)))
"der(T)"

Note this example uses string rather than String.
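Putting the pieces together, an expression can be turned into a Symbol by going through string first. A minimal sketch (the variable names are chosen for this example only):

```julia
ex = :(der(T))          # an Expr, not a Symbol
name = string(ex)       # "der(T)"
sym = Symbol(name)      # Symbol("der(T)")

println(sym isa Symbol)            # true
println(sym == Symbol("der(T)"))   # true
println(String(sym) == name)       # true: String(...) works on a Symbol
```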


#8

Thanks… the distinction between Symbols and :(...) remains somewhat obscure to me… The following two statements lead to different objects:

julia> Symbol("der(T)")
Symbol("der(T)")
julia> :(der(T))
:(der(T))

Both of them are valid Dict keys, but neither of them is typeset as a table heading in IJulia using the df_from_dicts() function. However, simpler :(...) objects are typeset as table headings…


#9

Yes, it takes a little bit of study. Maybe the best place to start is the “Metaprogramming” section of the Julia manual. In particular, it says

The : character has two syntactic purposes in Julia. The first form creates a Symbol, an interned string used as one building-block of expressions.

This does not seem quite correct to me. The two purposes are really almost the same. For example :a and :(a) both return the Symbol a. And :1 and :(1) both return the integer 1. It’s just that, if an expression is complicated enough, it must be enclosed in parens in order to be parsed correctly.

Any object can be a Dict key in Julia. But only Symbols are allowed as the names of columns in DataFrames.

A couple of years ago, IIUC, column names were required to be valid identifiers. Apparently it’s possible to use any symbol now.

If you want to be sure you are constructing a Symbol and not an expression, use Symbol("..."). For example, Symbol("a") and Symbol("1") return Symbols, in contrast to the examples using : above.


#10

Thanks for the tips. OK – so column names in DataFrames are symbols; thus Symbol("der(T)") should work, while :(der(T)) may not.
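A quick way to check this, assuming a recent DataFrames release (1.x) where columns can be passed to the constructor as pairs and accessed with df[!, col]; the numbers are made up for illustration:

```julia
using DataFrames

# Symbol("der(T)") is a Symbol, so it can serve as a column name.
df = DataFrame(Symbol("der(T)") => [0.1, 0.2], :T => [300.0, 310.0])

println(propertynames(df))        # column names as Symbols
println(df[!, Symbol("der(T)")])  # the column's values

# An Expr can be turned into a usable name by going through `string`.
col = Symbol(string(:(der(T))))
println(df[!, col] == [0.1, 0.2])
```

Older DataFrames versions used a different indexing syntax, so the calls above may need adjusting there.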