Spaces in query.jl

@davidanthoff How will this work for a situation like this:

@from i in df begin
	@where i.Parking_Tax == true
	@select i
	@collect DataFrame
end

where the name of the column is “Parking Tax”. Previous versions of DataFrames converted the name to “Parking_Tax”, but the latest version does not. So now I’m getting: type NamedTuple has no field Parking_Tax

This does work, but it’s part of a tutorial for beginners, and it really makes things ugly and complicated:

@from i in df begin
	@where getproperty(i, Symbol("Parking Tax")) == true
	@select i
	@collect DataFrame
end

Update 1

This is acceptable, but if there’s a better way, I’d be grateful to learn about it:

@from i in df begin
	@where i[Symbol("Parking Tax")] == true
	@select i
	@collect DataFrame
end

Thanks

Not sure it applies here, but foo."bar" is valid Julia syntax and calls getproperty with a string.

Thanks, yes, I tried that but it errors out because there is no getproperty defined which accepts a string as its second argument.
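For anyone following along, the failure can be reproduced without DataFrames or Query. The sketch below builds a plain NamedTuple by hand (the name `row` and its values are made up for illustration) and tries both spellings. It assumes a Julia version in which Base has no getproperty method that accepts a string for a NamedTuple; the exact exception type may differ between versions, so the code only captures it.

```julia
# Hypothetical stand-in for one row of the table; not taken from real data.
row = NamedTuple{(Symbol("Parking Tax"), :col2)}((true, 2))

# The Symbol form is the one NamedTuple supports.
by_symbol = getproperty(row, Symbol("Parking Tax"))

# row."Parking Tax" is lowered to getproperty(row, "Parking Tax").
# Capture whatever that call produces instead of letting it abort the script.
by_string = try
    row."Parking Tax"
catch err
    err
end

println("Symbol lookup: ", by_symbol)
println("String lookup: ", typeof(by_string))
```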

I ended up renaming the columns

rename!(df, [n => replace(string(n), " "=>"_") |> Symbol for n in names(df)])

If you create a macro S_str, then you could use @where i[S"Parking Tax"]==true. Is that unacceptable for you too?

I played with an MWE; maybe it will be useful to somebody:

julia> using DataFrames, Query, CSV

julia> macro S_str(a) :(Symbol($a)) end

julia> io = IOBuffer("""Parking Tax,col2
       true,2
       false,6""");

julia> df = CSV.File(io) |> DataFrame
2×2 DataFrame
│ Row │ Parking Tax │ col2   │
│     │ Bool⍰       │ Int64⍰ │
├─────┼─────────────┼────────┤
│ 1   │ true        │ 2      │
│ 2   │ false       │ 6      │

julia> @from i in df begin
               @where i[S"Parking Tax"]==true
               @select i
               @collect DataFrame
       end
1×2 DataFrame
│ Row │ Parking Tax │ col2   │
│     │ Bool⍰       │ Int64⍰ │
├─────┼─────────────┼────────┤
│ 1   │ true        │ 2      │


I think that this looks quite nice - but I’d rather stay away from it in a beginners tutorial.

Oh sorry! I missed that it is for beginners. But do you plan to put this line in there?

Comprehensions have been introduced previously, but maybe you’re right and it should be done with an iteration. :thinking:

Edit 1:
Yes, definitely, good point! At least I can show the iteration and add that it can be done with a comprehension. Having the two versions side by side should clarify the comprehension syntax too.
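A rough sketch of how the two versions could sit side by side, using only Base so it runs without a DataFrame. The vector `old_names` is a made-up stand-in for names(df), assuming names are Symbols as in the rename! call above.

```julia
# Hypothetical column names, as Symbols, standing in for names(df).
old_names = [Symbol("Parking Tax"), :col2]

# Version 1: explicit iteration, building the old => new pairs one by one.
pairs_loop = Pair{Symbol,Symbol}[]
for n in old_names
    new_name = Symbol(replace(string(n), " " => "_"))
    push!(pairs_loop, n => new_name)
end

# Version 2: the same transformation written as a comprehension.
pairs_comp = [n => Symbol(replace(string(n), " " => "_")) for n in old_names]

# Both approaches build the same vector of pairs.
println(pairs_loop == pairs_comp)
```

Either vector has the same shape as the argument passed to rename! in the one-liner.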

rename! can also take a function if you want to apply the same transformation to all names:

julia> rename!(df) do n
           s = replace(string(n), ' ' => '_')
           Symbol(s)
       end

which is not a one liner but is probably a bit easier to parse visually.

Overall it feels like renaming columns is the best solution as it also offers this nice extra exercise.

I don’t have a better idea than what was posted here already… At the end of the day this boils down to how one can interact with named tuples in julia.

I do think a macro a la s"foo bar" that is equivalent to Symbol("foo bar") would be nice, but that should probably be done (if at all) in base…
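For reference, a minimal sketch of what such a macro could look like in user code. It uses an uppercase S because Base already defines the lowercase s"..." string macro for substitution strings (SubstitutionString, as used with replace and regexes). The macro name and the sample `row` are illustrative assumptions, not an existing API.

```julia
# Hypothetical string macro: S"foo bar" expands to the Symbol Symbol("foo bar").
# QuoteNode makes the Symbol a literal in the expanded code, so nothing is
# constructed at run time.
macro S_str(name)
    QuoteNode(Symbol(name))
end

# Stand-in for one row; the values are made up.
row = NamedTuple{(Symbol("Parking Tax"), :col2)}((true, 2))

println(row[S"Parking Tax"])
println(S"Parking Tax" === Symbol("Parking Tax"))
```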

FWIW, you can also do CSV.File(..., normalizenames=true) on import to avoid such names.
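A sketch of that route, reusing the small CSV from the MWE earlier in the thread. It assumes CSV, DataFrames and Query are installed, and that normalizenames replaces the space with an underscore so that the column ends up as Parking_Tax; check names(df) on your own data before relying on it.

```julia
using CSV, DataFrames, Query

# Same toy data as the MWE above, with a space in the first header.
io = IOBuffer("Parking Tax,col2\ntrue,2\nfalse,6\n")

# normalizenames=true asks CSV.jl to turn headers into valid identifiers.
df = CSV.File(io; normalizenames=true) |> DataFrame
println(names(df))

# With an identifier-like name, the original query can use plain dot syntax.
result = @from i in df begin
    @where i.Parking_Tax == true
    @select i
    @collect DataFrame
end
println(result)
```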


What about adding a getproperty method for strings, so that foo."bar baz" Just Works?


I like that idea, but it would have to be implemented in base for NamedTuple, otherwise it would be a bad case of type piracy.

I just wanted to give a shout-out to my fellow S_str macro implementors. Working with data in the wild that has lots of spaces in the column names, I independently stumbled upon this as well, as a way to save lots of typing without having to worry about normalization. If a critical mass of practitioners ends up doing this, maybe it should get a home of its own… somewhere.

I just tried this out now with my data, which has spaces in the column names, and it worked: Symbol("hello world")

With the latest DataFrames version, this works:

julia> using DataFrames

julia> df = DataFrame("My column name" => rand(5))
5×1 DataFrame
│ Row │ My column name │
│     │ Float64        │
├─────┼────────────────┤
│ 1   │ 0.532892       │
│ 2   │ 0.572881       │
│ 3   │ 0.0794647      │
│ 4   │ 0.189205       │
│ 5   │ 0.19475        │

julia> df."My column name"
5-element Array{Float64,1}:
 0.5328924031274789
 0.5728810432735243
 0.07946465108113787
 0.18920465409371334
 0.19475031867591586

The Query.jl story just depends on what is supported for standard named tuples. If base added support for x."field name" for named tuples, then one could use that syntax in Query.jl as well.

But I think base might think that x.var"field name" is good enough? That should work now already.
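A small sketch of the var"..." form on a plain named tuple, assuming Julia 1.3 or later (where var"..." is available). The `row` value is made up for illustration; whether the same spelling works inside a Query.jl @where clause is not checked here.

```julia
# Stand-in for one row; built by hand so the field name can contain a space.
row = NamedTuple{(Symbol("Parking Tax"), :col2)}((true, 2))

# var"Parking Tax" is parsed as an identifier, so this is ordinary dot access
# and ends up calling getproperty with a Symbol.
println(row.var"Parking Tax")

# It returns the same value as indexing with the Symbol directly.
println(row.var"Parking Tax" == row[Symbol("Parking Tax")])
```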