Issues querying a DataFrame

bensetterholm · February 20, 2020, 11:05pm

Suppose I have a DataFrame df that was generated from a Dict whose keys may include whitespace

using DataFrames

myDict = Dict()
myDict["aKey"] = 1:10
myDict["anotherKey"] = zeros(10)
myDict["a_nice_key"] = rand(Bool, 10)
myDict["a nasty key"] = fill("oof", 10)
df = DataFrame(myDict)

If I wanted to select the columns a_nice_key and anotherKey where aKey is less than, say, 4, this is easy to do with Query.jl

using Query

x = @from i in df begin
    @where i.aKey < 4
    @select {i.a_nice_key, i.anotherKey}
    @collect DataFrame
end

What I cannot figure out is how to select elements from the "a nasty key" column, since its symbol cannot be written with a simple leading colon. For example, this does not work:

x = @from i in df begin
    @where i.aKey < 4
    @select i[!, Symbol("a nasty key")]
    @collect DataFrame
end

Is there some way I can work around this? I have tried various permutations but cannot find any way to access these columns in my DataFrame.

I get into the same trouble with the DataFramesMeta package (resetting the REPL to clear the name conflicts with Query.jl)

using DataFramesMeta, Lazy

x = @> begin
    df
    @where(:aKey .< 4)
    @select(:a_nice_key, Symbol("a nasty key"))
end

(Interestingly enough, select works with one string turned into a Symbol, but no more than one).

Finally, in neither the Query nor the DataFramesMeta package can I figure out how to splat a predefined list of column names to select. For example, with DataFramesMeta:

colsOfInterest = [:anotherKey, :a_nice_key]

x = @> begin
    df
    @where(:aKey .< 4)
    @select(colsOfInterest...)
end

fails. Is there a way to resolve both of these issues in either one of these packages (or in another querying package)?

pdeffebach · February 21, 2020, 12:34am

Thanks for this, OP.

DataFramesMeta definitely needs some work on this. You should file an issue there.

Sorry for the frustration on this, I understand the benefit of a tidyverse-style string of commands. Hopefully progress will be made on this soon.

bensetterholm · February 21, 2020, 2:37pm

Thanks!

With regard to my first problem (symbols with whitespace, hyphens, or other nasty characters that cannot be written with a preceding colon), I see that there is already a related unresolved issue in DataFramesMeta, so I didn’t raise a new issue there. (I will note that the suggested “hack” using a cols function seems to be no longer available.)

With regard to my second problem, I found a loosely related closed issue which inspired me to try the following, which fortuitously solves both of my problems! I hadn’t considered passing the vector of symbols without splatting.

julia> using DataFrames, DataFramesMeta, Lazy

julia> df = DataFrame(Dict("aKey" => 1:5,
                           "anotherKey" => zeros(5),
                           "yetAnotherKey" => ones(5),
                           "a nasty key" => fill(2, 5),
                           "another nasty key" => 2:6,))
5×5 DataFrame
│ Row │ a nasty key │ aKey  │ another nasty key │ anotherKey │ yetAnotherKey │
│     │ Int64       │ Int64 │ Int64             │ Float64    │ Float64       │
├─────┼─────────────┼───────┼───────────────────┼────────────┼───────────────┤
│ 1   │ 2           │ 1     │ 2                 │ 0.0        │ 1.0           │
│ 2   │ 2           │ 2     │ 3                 │ 0.0        │ 1.0           │
│ 3   │ 2           │ 3     │ 4                 │ 0.0        │ 1.0           │
│ 4   │ 2           │ 4     │ 5                 │ 0.0        │ 1.0           │
│ 5   │ 2           │ 5     │ 6                 │ 0.0        │ 1.0           │

julia> colsOfInterest = Symbol.(["a nasty key", "another nasty key", "anotherKey"])
3-element Array{Symbol,1}:
 Symbol("a nasty key")      
 Symbol("another nasty key")
 :anotherKey                

julia> x = @> begin
           df
           @where(:aKey .< 4)
           @select(colsOfInterest)
       end
3×3 DataFrame
│ Row │ a nasty key │ another nasty key │ anotherKey │
│     │ Int64       │ Int64             │ Float64    │
├─────┼─────────────┼───────────────────┼────────────┤
│ 1   │ 2           │ 2                 │ 0.0        │
│ 2   │ 2           │ 3                 │ 0.0        │
│ 3   │ 2           │ 4                 │ 0.0        │
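
For comparison, the same filter-and-select can probably be done without any macro package, using plain DataFrames indexing with a Boolean row mask and a vector of Symbols. This is only a minimal sketch on the toy data above; it assumes a DataFrames release that accepts df[rows, cols] with a Vector{Symbol}, and the variable names are just the ones from this thread.

```julia
using DataFrames

# Same toy data as in the session above.
df = DataFrame(Dict("aKey" => 1:5,
                    "anotherKey" => zeros(5),
                    "yetAnotherKey" => ones(5),
                    "a nasty key" => fill(2, 5),
                    "another nasty key" => 2:6))

# Build the Symbols from strings, so whitespace in a name is not a problem.
colsOfInterest = Symbol.(["a nasty key", "another nasty key", "anotherKey"])

# Row mask first, column vector second; no splatting involved.
x = df[df.aKey .< 4, colsOfInterest]
```

This does not give the chained, tidyverse-style syntax, but it avoids the macros' handling of column names altogether.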

Since I have a solution that for now seems to work, albeit with different syntax than I naïvely expected, do you still suggest I raise an issue in DataFramesMeta?

pdeffebach · February 21, 2020, 6:19pm

You should still file an issue for sure.

tbeason · February 21, 2020, 7:47pm

OP, do the right thing and abandon symbols with whitespace!
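
A rough sketch of what that could look like: rename every column once, right after loading, so that the usual :name syntax works afterwards. The helper sanitize is hypothetical (it is not part of DataFrames), and the toy column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(Dict("aKey" => 1:3,
                    "a nasty key" => fill(2, 3),
                    "another-nasty key" => 4:6))

# Hypothetical helper: replace each run of non-word characters with "_".
# String(name) accepts both Symbols and Strings, so this should not depend
# on which of the two names(df) returns.
sanitize(name) = Symbol(replace(String(name), r"\W+" => "_"))

# Rename all columns in place, in their existing order.
rename!(df, sanitize.(names(df)))
```

Note that two different original names could map to the same sanitized name, in which case the rename would need extra handling.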

bensetterholm · February 21, 2020, 8:21pm

While that sounds like a sensible solution, it isn’t worthwhile in my application to create all the boilerplate necessary to implement non-whitespace-symbols in practice.

DataFrames are a convenient internal data structure for me to reason about my input data (which I have no control over and may include whitespace/hyphens) and to manipulate it without having to build my own bespoke objects and query functions. (Any performance penalties suffered for this abstraction are negligible for this use case.)

Since I was able to get it to work in the end with whitespaces, etc., I am content. I will raise an issue though at the request of @pdeffebach.
