Manage spaces in DataFrame/Table column names

Hi,

I discovered Julia recently while looking to improve the performance of some processes I was running in Python, most notably ones working with dataframe-like structures. I am currently trying to load data from CSV files into DataFrames, but these CSV files have column names with spaces in them.

Suppose I want to get all the data in a particular column, say “ID Object”. In Python, I would simply do:

df = pd.read_csv("myobjects.csv")
object_ids = df["ID Object"]

In Julia, from what I understood, brackets are not the correct syntax and you should use a dot instead. For instance, to select a column “a” I’d do:

df = CSV.read("myobjects.csv", DataFrame)
a = df.a

However, how would I get this column if its name were “ID Object”? I tried:

object_ids = df["ID Object"]
object_ids = df."ID Object"
object_ids = df.Symbol("ID Object")

But none of those worked. I’ve tried looking this up online, and all I’ve found is people saying it would be nice to add this to the language, so is it impossible? If it is indeed not possible yet, how would I rename a column that has a space in its name if I cannot select it? I only found renaming examples for columns that did not have this issue.

If anyone has a suggestion on the matter it would be really appreciated! Thanks :slight_smile:

df["ID Object"] deliberately throws an error because it’s ambiguous whether it should return a row or a column. Different users have different expectations for what df[1] should return, so we require the explicitness of df[:, "ID Object"].
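As a rough sketch of the explicit indexing forms (the toy data and the extra column "a" are made up for illustration), the row selector decides whether you get a copy or the stored vector:

```julia
using DataFrames

# Toy data; only the column name "ID Object" comes from the question.
df = DataFrame("ID Object" => [1, 2, 3], "a" => ["x", "y", "z"])

ids_copy = df[:, "ID Object"]   # `:` returns a fresh copy of the column
ids_view = df[!, "ID Object"]   # `!` returns the stored vector itself, no copy
first_row = df[1, :]            # a single row (a DataFrameRow), for comparison

ids_copy[1] = 100               # leaves df untouched
ids_view[2] = 200               # mutates df in place
```

Use `:` when you want to modify the result safely, and `!` when you want to avoid the allocation.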

What versions of Julia and DataFrames are you using? Using getproperty works for me on Julia 1.6.0 and DataFrames 1.1:

julia> df = DataFrame("ID Object" => [1, 2, 3])
3×1 DataFrame
 Row │ ID Object 
     │ Int64     
─────┼───────────
   1 │         1
   2 │         2
   3 │         3

julia> df."ID Object"
3-element Vector{Int64}:
 1
 2
 3
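On the renaming part of the question, here is a minimal sketch, assuming DataFrames 1.x. The new names and the second column "Object Name" are invented for the example:

```julia
using DataFrames

df = DataFrame("ID Object" => [1, 2, 3])

# Rename a single column in place, addressing it by its string name.
rename!(df, "ID Object" => "id_object")
ids = df.id_object

# Or rewrite every name at once; the function receives each old name as a String.
df2 = DataFrame("ID Object" => [1, 2, 3], "Object Name" => ["a", "b", "c"])
rename!(name -> replace(name, " " => "_"), df2)
```

If the data comes from CSV.jl, the `normalizenames=true` keyword of `CSV.read` is another option that should turn such names into valid identifiers at load time.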

It’s a fresh install so Julia 1.6.1 and DataFrames 1.1.1.

I just reran it and now it works fine for some reason; before, it was telling me that it could not find the field “ID”.
However, it still does not work for TypedTables. How should I go about it with these structures? I still get:

ERROR: LoadError: MethodError: no method matching getproperty(::Table{NamedTuple{(Symbol("ID Object"), ...}}, ::String)

Do TypedTables have other ways of getting the elements? If I want to write more idiomatic Julia code, should I prefer the first solution with brackets or the getproperty?

DataFrames.jl supports strings as column names, but TypedTables.jl doesn’t, I think. You can try getproperty(t, Symbol("ID Object")) in that case.
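A small sketch of that workaround, assuming a Table can be built directly from a NamedTuple of vectors (the data and the extra column `a` are placeholders):

```julia
using TypedTables

# Column names in a Table are Symbols, so a name with a space has to be
# constructed with Symbol("...") rather than written as a literal.
cols = NamedTuple{(Symbol("ID Object"), :a)}(([1, 2, 3], ["x", "y", "z"]))
t = Table(cols)

ids = getproperty(t, Symbol("ID Object"))          # the whole column
first_id = getproperty(t[1], Symbol("ID Object"))  # one field of the first row
```

Rows of a Table are plain NamedTuples, so the same getproperty call works on a single row as well.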

Not to plug DataFrames too much, but it is possible to do a lot of type-stable operations with DataFrames using transform and similar functions. It’s not always necessary to use both TypedTables and DataFrames in the same session.
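For instance, something along these lines (a sketch assuming DataFrames 1.x; the "value" column and the thresholds are invented) passes the columns to the function you supply, so the inner work runs on concretely typed vectors:

```julia
using DataFrames

df = DataFrame("ID Object" => [1, 2, 3, 4], "value" => [0.5, 1.5, 2.5, 3.5])

# Keep rows whose "ID Object" is greater than 2.
kept = filter("ID Object" => >(2), df)

# Same idea with subset, which expects a Boolean vector, hence ByRow.
odd_ids = subset(df, "ID Object" => ByRow(isodd))

# Derive a new column; the target name may also contain a space.
scaled = transform(df, "value" => ByRow(v -> 2v) => "value doubled")
```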

Ah indeed, getproperty(t, Symbol("ID Object")) works! Thanks a lot, you saved me ahah

I’m still not too sure about the differences and pros/cons of DataFrames vs TypedTables. From what I understood, TypedTables are more performant if the structure of the tabular data is not going to change, but I may be completely mistaken about this. Since I’m trying to see whether Julia will be a viable alternative to my current Python usage for performance purposes, I want to make sure that I learn how to properly use the language and its packages to write efficient code.

The computation-intensive task I am currently tackling mainly involves filtering/selecting operations. Do you think TypedTables would present any advantage in these cases? (Though I think I’m going a bit off topic for this thread.)

It’s up to you to benchmark the solutions and figure out which is best.

I just wish to point out that even though DataFrames is not type-stable in the sense that Julia can’t infer the vector type of df.x, lots of work has been done to provide convenient syntax for avoiding those kinds of problems.
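One common way around this is a function barrier: pull the column out once and hand it to a separate function, which Julia then compiles for the concrete vector type. A minimal sketch (the name `count_above` is made up for illustration):

```julia
using DataFrames

# Kernel: compiled separately for each concrete vector type it receives.
function count_above(ids::AbstractVector, threshold)
    n = 0
    for id in ids
        n += id > threshold   # adding a Bool increments n by 0 or 1
    end
    return n
end

# Barrier: the column type is unknown here, but the call below dispatches
# on the concrete type at run time, so the loop itself is type-stable.
count_above(df::AbstractDataFrame, threshold) = count_above(df."ID Object", threshold)

df = DataFrame("ID Object" => [1, 2, 3, 4])
n_above = count_above(df, 2)
```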

You should try writing the same code in both, benchmarking it (inside a function and with @btime from BenchmarkTools.jl used appropriately), and comparing the results. Then post here again to see if there are any improvements to be made with the use of either library.
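A possible starting point for such a comparison, assuming both packages are installed (the function names, data size, and threshold are arbitrary choices for the sketch):

```julia
using DataFrames, BenchmarkTools
import TypedTables   # imported qualified to avoid any exported-name clashes

const ID = Symbol("ID Object")

# The work being timed lives inside functions, not at global scope.
filter_df(df, threshold) = filter("ID Object" => >(threshold), df)
filter_tt(t, threshold) = filter(row -> getproperty(row, ID) > threshold, t)

ids = collect(1:1_000_000)
df = DataFrame("ID Object" => ids)
t = TypedTables.Table(NamedTuple{(ID,)}((ids,)))

# `$` interpolates the globals so the benchmark measures the call itself.
@btime filter_df($df, 500_000);
@btime filter_tt($t, 500_000);
```

Timings will depend on your data and machine, so treat any single run as indicative only.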


Alright, I’ll try this out then. Thanks again for your help :slight_smile:

And maybe do post here if you find a case where DataFrames performs particularly badly - people will be more than happy to help speed things up, or you might happen on an actual performance issue which can be fixed in DataFrames. My (entirely gut-based) guess is the userbase of DataFrames is orders of magnitude larger than that of TypedTables, so it’ll likely be much easier to get help with DataFrames issues than with TypedTables ones.