Issues querying a DataFrame

Suppose I have a DataFrame df which was generated by a Dictionary which may include whitespace in the keys

using DataFrames

myDict = Dict()
myDict["aKey"] = 1:10
myDict["anotherKey"] = zeros(10)
myDict["a_nice_key"] = rand(Bool, 10)
myDict["a nasty key"] = fill("oof", 10)
df = DataFrame(myDict)

If I wanted to select the colomns a_nice_key and anotherKey where aKey is less than say 4, this is easy to do with Query.jl

using Query

x = @from i in df begin
    @where i.aKey < 4
    @select {i.a_nice_key, i.anotherKey}
    @collect DataFrame
end

What I cannot figure out is how to select elements from the a nasty key column since its symbol is not simple. For example, this does not work

x = @from i in df begin
    @where i.aKey < 4
    @select i[!, Symbol("a nasty key")]
    @collect DataFrame
end

Is there some way I can work around this? I have tried various permutations but cannot find any way to access these columns in my DataFrame.

I get into the same trouble with the DataFramesMeta package (resetting the REPL to clear the name conflicts with Query.jl)

using DataFramesMeta, Lazy

x = @> begin
    df
    @where(:aKey .< 4)
    @select(:a_nice_key, Symbol("a nasty key"))
end

(Interestingly enough, select works with one string turned into a Symbol, but no more than one).

Finally, in neither the Query not DataFramesMeta packages can I figure out how to splat a predefined list of column names to select. For example, with DataFramesMeta:

colsOfInterest = [:anotherKey, :a_nice_key]

x = @> begin
    df
    @where(:aKey .< 4)
    @select(colsOfInterest...)
end

fails. Is there a way to resolve both these two issues in either one of these (or another querying) package?

1 Like

Thanks for this, OP.

DataFramesMeta definitely needs some work on this. You should file an issue there.

Sorry for the frustration on this, I understand the benefit of a tidyverse-style string of commands. Hopefully progress will be made on this soon.

1 Like

Thanks!

With regards to my first problem (symbols with whitespace, or hyphens, or other nasty characters that cannot be represented with a preceding colon), I see that there is already a related unresolved issue in DataFramesMeta, so I didn’t raise a new issue there. (I will note that the suggested β€œhack” using a cols function seems to be no longer available.)

With regards to my second problem, I found a loosely related closed issue which inspired me to try the following and fortuitously works for solving both of my problems! I hadn’t considered passing the vector of symbols without splatting.

julia> using DataFrames, DataFramesMeta, Lazy

julia> df = DataFrame(Dict("aKey" => 1:5,
                           "anotherKey" => zeros(5),
                           "yetAnotherKey" => ones(5),
                           "a nasty key" => fill(2, 5),
                           "another nasty key" => 2:6,))
5Γ—5 DataFrame
β”‚ Row β”‚ a nasty key β”‚ aKey  β”‚ another nasty key β”‚ anotherKey β”‚ yetAnotherKey β”‚
β”‚     β”‚ Int64       β”‚ Int64 β”‚ Int64             β”‚ Float64    β”‚ Float64       β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 2           β”‚ 1     β”‚ 2                 β”‚ 0.0        β”‚ 1.0           β”‚
β”‚ 2   β”‚ 2           β”‚ 2     β”‚ 3                 β”‚ 0.0        β”‚ 1.0           β”‚
β”‚ 3   β”‚ 2           β”‚ 3     β”‚ 4                 β”‚ 0.0        β”‚ 1.0           β”‚
β”‚ 4   β”‚ 2           β”‚ 4     β”‚ 5                 β”‚ 0.0        β”‚ 1.0           β”‚
β”‚ 5   β”‚ 2           β”‚ 5     β”‚ 6                 β”‚ 0.0        β”‚ 1.0           β”‚

julia> colsOfInterest = Symbol.(["a nasty key", "another nasty key", "anotherKey"])
3-element Array{Symbol,1}:
 Symbol("a nasty key")      
 Symbol("another nasty key")
 :anotherKey                

julia> x = @> begin
           df
           @where(:aKey .< 4)
           @select(colsOfInterest)
       end
3Γ—3 DataFrame
β”‚ Row β”‚ a nasty key β”‚ another nasty key β”‚ anotherKey β”‚
β”‚     β”‚ Int64       β”‚ Int64             β”‚ Float64    β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 2           β”‚ 2                 β”‚ 0.0        β”‚
β”‚ 2   β”‚ 2           β”‚ 3                 β”‚ 0.0        β”‚
β”‚ 3   β”‚ 2           β”‚ 4                 β”‚ 0.0        β”‚

Since I have a solution that for now seems to work, albeit with different syntax than I naΓ―vely expected, do you still suggest I raise an issue in DataFramesMeta?

You should still file an issue for sure.

OP do the right thing and abandon symbols with whitespace!

While that sounds like a sensible solution, it isn’t worthwhile in my application to create all the boilerplate necessary to implement non-whitespace-symbols in practice.

DataFrames are a convenient internal data structure for me to reason about my input data (which I have no control over and may include whitespace/hyphens) and to manipulate it without having to build my own bespoke objects and query functions. (Any performance penalties suffered for this abstraction are negligible for this use case.)

Since I was able to get it to work in the end with whitespaces, etc., I am content. I will raise an issue though at the request of @pdeffebach.