Issues querying a DataFrame

Suppose I have a DataFrame df which was generated by a Dictionary which may include whitespace in the keys

using DataFrames

myDict = Dict()
myDict["aKey"] = 1:10
myDict["anotherKey"] = zeros(10)
myDict["a_nice_key"] = rand(Bool, 10)
myDict["a nasty key"] = fill("oof", 10)
df = DataFrame(myDict)

If I wanted to select the colomns a_nice_key and anotherKey where aKey is less than say 4, this is easy to do with Query.jl

using Query

x = @from i in df begin
    @where i.aKey < 4
    @select {i.a_nice_key, i.anotherKey}
    @collect DataFrame
end

What I cannot figure out is how to select elements from the a nasty key column since its symbol is not simple. For example, this does not work

x = @from i in df begin
    @where i.aKey < 4
    @select i[!, Symbol("a nasty key")]
    @collect DataFrame
end

Is there some way I can work around this? I have tried various permutations but cannot find any way to access these columns in my DataFrame.

I get into the same trouble with the DataFramesMeta package (resetting the REPL to clear the name conflicts with Query.jl)

using DataFramesMeta, Lazy

x = @> begin
    df
    @where(:aKey .< 4)
    @select(:a_nice_key, Symbol("a nasty key"))
end

(Interestingly enough, select works with one string turned into a Symbol, but no more than one).

Finally, in neither the Query not DataFramesMeta packages can I figure out how to splat a predefined list of column names to select. For example, with DataFramesMeta:

colsOfInterest = [:anotherKey, :a_nice_key]

x = @> begin
    df
    @where(:aKey .< 4)
    @select(colsOfInterest...)
end

fails. Is there a way to resolve both these two issues in either one of these (or another querying) package?

Thanks for this, OP.

DataFramesMeta definitely needs some work on this. You should file an issue there.

Sorry for the frustration on this, I understand the benefit of a tidyverse-style string of commands. Hopefully progress will be made on this soon.

Thanks!

With regards to my first problem (symbols with whitespace, or hyphens, or other nasty characters that cannot be represented with a preceding colon), I see that there is already a related unresolved issue in DataFramesMeta, so I didn’t raise a new issue there. (I will note that the suggested β€œhack” using a cols function seems to be no longer available.)

With regards to my second problem, I found a loosely related closed issue which inspired me to try the following and fortuitously works for solving both of my problems! I hadn’t considered passing the vector of symbols without splatting.

julia> using DataFrames, DataFramesMeta, Lazy

julia> df = DataFrame(Dict("aKey" => 1:5,
                           "anotherKey" => zeros(5),
                           "yetAnotherKey" => ones(5),
                           "a nasty key" => fill(2, 5),
                           "another nasty key" => 2:6,))
5Γ—5 DataFrame
β”‚ Row β”‚ a nasty key β”‚ aKey  β”‚ another nasty key β”‚ anotherKey β”‚ yetAnotherKey β”‚
β”‚     β”‚ Int64       β”‚ Int64 β”‚ Int64             β”‚ Float64    β”‚ Float64       β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 2           β”‚ 1     β”‚ 2                 β”‚ 0.0        β”‚ 1.0           β”‚
β”‚ 2   β”‚ 2           β”‚ 2     β”‚ 3                 β”‚ 0.0        β”‚ 1.0           β”‚
β”‚ 3   β”‚ 2           β”‚ 3     β”‚ 4                 β”‚ 0.0        β”‚ 1.0           β”‚
β”‚ 4   β”‚ 2           β”‚ 4     β”‚ 5                 β”‚ 0.0        β”‚ 1.0           β”‚
β”‚ 5   β”‚ 2           β”‚ 5     β”‚ 6                 β”‚ 0.0        β”‚ 1.0           β”‚

julia> colsOfInterest = Symbol.(["a nasty key", "another nasty key", "anotherKey"])
3-element Array{Symbol,1}:
 Symbol("a nasty key")      
 Symbol("another nasty key")
 :anotherKey                

julia> x = @> begin
           df
           @where(:aKey .< 4)
           @select(colsOfInterest)
       end
3Γ—3 DataFrame
β”‚ Row β”‚ a nasty key β”‚ another nasty key β”‚ anotherKey β”‚
β”‚     β”‚ Int64       β”‚ Int64             β”‚ Float64    β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 2           β”‚ 2                 β”‚ 0.0        β”‚
β”‚ 2   β”‚ 2           β”‚ 3                 β”‚ 0.0        β”‚
β”‚ 3   β”‚ 2           β”‚ 4                 β”‚ 0.0        β”‚

Since I have a solution that for now seems to work, albeit with different syntax than I naΓ―vely expected, do you still suggest I raise an issue in DataFramesMeta?

You should still file an issue for sure.

OP do the right thing and abandon symbols with whitespace!

While that sounds like a sensible solution, it isn’t worthwhile in my application to create all the boilerplate necessary to implement non-whitespace-symbols in practice.

DataFrames are a convenient internal data structure for me to reason about my input data (which I have no control over and may include whitespace/hyphens) and to manipulate it without having to build my own bespoke objects and query functions. (Any performance penalties suffered for this abstraction are negligible for this use case.)

Since I was able to get it to work in the end with whitespaces, etc., I am content. I will raise an issue though at the request of @pdeffebach.