How to call a column in DataFrame that has a space in the name

Silly question here, but how do you call a column in a DataFrame that has a space in its name?
For example:

dfs =  DataFrame(Symbol("first") => [6,7,8,9], Symbol("first last") => [1,2,3,4])
4×2 DataFrame
 Row │ first  first last 
     │ Int64  Int64      
─────┼───────────────────
   1 │     6           1
   2 │     7           2
   3 │     8           3
   4 │     9           4

I can call first via

dfs.first
4-element Vector{Int64}:
 6
 7
 8
 9

but how would I call the first last column given the whitespace in between?

First idea:

julia> getproperty(dfs, :"first last")
4-element Vector{Int64}:
 1
 2
 3
 4

Second:

julia> dfs[!,"first last"]

Lastly, perhaps rename the columns so that you do not have to do this at all?


Thanks! Yes, I realized a little late that it was a poor design choice.

It is normal for columns to have spaces in their names. DataFrames.jl was designed to handle them fully - you just need to use strings instead of symbols:

julia> dfs =  DataFrame("first" => [6,7,8,9], "first last" => [1,2,3,4])
4×2 DataFrame
 Row │ first  first last
     │ Int64  Int64
─────┼───────────────────
   1 │     6           1
   2 │     7           2
   3 │     8           3
   4 │     9           4

julia> dfs."first last"
4-element Vector{Int64}:
 1
 2
 3
 4

Thanks for pointing that out! I’m linking your post about Strings vs Symbols in DataFrames here in case anyone else needs help understanding the distinction.

Yes - in general, on my blog I try to cover answers to all the typical questions users might have.

If there is something important that I have not covered yet, please let me know and I will write a post about it.


Awesome thank you so much! I really appreciate it and will let you know if I come across anything not covered.

Sorry I’m a little confused about the different rules under different versions of julia. I assume this is why you mention that “any project should be accompanied by a complete specification of environment.”

So under Julia 1.7.2 and DataFrames.jl 1.3.4, “the convenience syntax for getproperty using . accessor does not work for symbols containing spaces and we need to do an explicit getproperty call.”

Is it the case that under Julia v1.8 and DataFrames.jl v1.4.4 the . accessor now works with both Symbols and strings for names containing spaces, without the getproperty call?

dfsymbl = DataFrame(Symbol("first") => [6,7,8,9], Symbol("first last") => [1,2,3,4])
4×2 DataFrame
 Row │ first  first last 
     │ Int64  Int64      
─────┼───────────────────
   1 │     6           1
   2 │     7           2
   3 │     8           3
   4 │     9           4

dfsymbl."first last"
4-element Vector{Int64}:
 1
 2
 3
 4

dfstrng =  DataFrame("first" => [6,7,8,9], "first last" => [1,2,3,4])
4×2 DataFrame
 Row │ first  first last 
     │ Int64  Int64      
─────┼───────────────────
   1 │     6           1
   2 │     7           2
   3 │     8           3
   4 │     9           4

dfstrng."first last"
4-element Vector{Int64}:
 1
 2
 3
 4

I’m not sure if this is a version issue or if I was accidentally generating strings in both cases?

Also, under Julia v1.7.2 and DataFrames.jl v1.3.4:

“The second important aspect is that all functions that manipulate column names in DataFrames.jl work with strings. This is natural, as symbol manipulation is not supported by Julia.”

Is it correct to assume that even with future versions of Julia, some DataFrame column name manipulations may continue to work exclusively for strings and not work for Symbols?

The example provided seems to work with Symbols but I wasn’t sure whether this should be regarded as an exception or seen as a rule going forward?

julia> df = DataFrame(:col1 => 1, Symbol("col 2") => 2)
1×2 DataFrame
 Row │ col1   col 2 
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2

select(df, Cols(startswith("c")) .=> identity .=> uppercase)
1×2 DataFrame
 Row │ COL1   COL 2 
     │ Int64  Int64 
─────┼──────────────
   1 │     1      2
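A minimal sketch of the same idea with a different function, assuming DataFrames.jl 1.x is installed (the variable name `renamed` is only illustrative). It passes `uppercase` to `rename`; as far as I understand, the function is handed each column name as a string, which is why a string function can be used even though the names were written as Symbols.

```julia
using DataFrames

# Column names are written as Symbols here, one of them containing a space.
df = DataFrame(:col1 => 1, Symbol("col 2") => 2)

# rename(f, df) applies f to every column name; uppercase is a String function.
renamed = rename(uppercase, df)

# names returns the column names as strings.
new_names = names(renamed)
```

The original `df` is left unchanged, because `rename` returns a new data frame.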

Can you please refer to the source of your quotations? In general - accessing columns with strings like df."some column" worked under Julia 1.7, works under Julia 1.8 and will continue to work. Similarly it worked under DataFrames.jl 1.3, works under DataFrames.jl 1.4 and is planned to continue to work.

Therefore, I am not fully clear on what exactly you are referring to. Can you please clarify? (The source is likely referring to some other issue, in particular passing strings via variables, but I need to see exactly what you are referring to in order to assess.) Thank you!


It seems to be a quote from Strings vs symbols in DataFrames.jl column indexing | Blog by Bogumił Kamiński.

@phantom I think you misunderstood what it means here to work with symbols or strings. In your examples you check what works with the following data frames:

dfsymbl = DataFrame(Symbol("first") => [6,7,8,9], Symbol("first last") => [1,2,3,4])
dfstrng =  DataFrame("first" => [6,7,8,9], "first last" => [1,2,3,4])

but these are exactly the same! It doesn’t matter if you create the columns with a symbol or a string, they are always stored in the same way.

The text that you quote is only talking about column access. When you write df.first you are actually using the symbol :first to access a column. When you write df."first" you are using the string "first". The quoted text is simply saying that the symbol version doesn’t work e.g. when there is a space in the name. You can access the column with a symbol value like Symbol("first last") by calling getproperty, but you cannot use this symbol literally as in the df.first syntax.
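A minimal sketch of the point above, assuming DataFrames.jl 1.x is installed; the variable names on the left-hand sides are only illustrative. It builds the two data frames from the question and checks that every access form reaches the same column.

```julia
using DataFrames

# Create the same table twice: once with Symbol names, once with String names.
dfsymbl = DataFrame(Symbol("first") => [6, 7, 8, 9], Symbol("first last") => [1, 2, 3, 4])
dfstrng = DataFrame("first" => [6, 7, 8, 9], "first last" => [1, 2, 3, 4])

# The two data frames compare equal: how the names were written is not remembered.
same_tables = dfsymbl == dfstrng

# Four ways to reach the column whose name contains a space.
by_string_dot  = dfsymbl."first last"                        # string with the . accessor
by_getproperty = getproperty(dfsymbl, Symbol("first last"))  # symbol via an explicit call
by_string_idx  = dfsymbl[!, "first last"]                    # string indexing
by_symbol_idx  = dfsymbl[!, Symbol("first last")]            # symbol indexing

# All four return the very same vector object (no copy is made).
all_same = by_string_dot === by_getproperty === by_string_idx === by_symbol_idx
```

The only form that is not available is a literal `dfsymbl.first last`, since a name with a space cannot be written as a bare identifier.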


Got it, thank you so much! This clears up a lot.

I apologize for not making the reference clear. The quotes come from Strings vs symbols in DataFrames.jl column indexing. Given my inadequate background, I completely misunderstood the well-written post.

I took the following examples to mean that a DataFrame column label was being stored as either a symbol or a string and must then be indexed in accordance with the type with which it was created.

(This sent me down a rabbit hole as to why the columns in @sijo's example here could be accessed with a string, as the transform operation output NamedTuples as new columns. That led to the above post.)

Now my understanding is that no matter how I create the DataFrame column, it is stored in one way (a symbol?), and can be indexed freely with both symbols and strings.

Sorry, I don’t mean to create so much confusion on the discussion board. @bkamins's book came out on Amazon today, so I should have a better understanding of things going forward.

But now I understand that all columns are stored the same way

You are welcome to ask - this is what this forum is for.

Your understanding is correct. Quoting the crucial passage from your post above:

Column names in a DataFrame are labels. For this reason both symbols and strings are allowed to be used when referencing them without introducing an ambiguity.

So from a user’s perspective you do not need to think about how column names are technically stored. Think of them as “labels”. You can use either a Symbol or a string to reference the column.


Now, given that you asked (though this should not matter to you as a user): internally, column names are stored as Symbols, and if you use a string to look up a column it is internally converted to a Symbol (but as a user you do not have to think about it). The reason why we internally use Symbol is that symbol lookup is faster. However, we also allow strings as they are easier to manipulate. This is explained in Section 6.6 of the “Julia for Data Analysis” book (in general, i.e. without specific reference to the DataFrame object) and then in Section 8.3 (specifically in the DataFrame context).
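A small sketch of how both views of the labels can be inspected, assuming DataFrames.jl 1.x is installed (the variable names are only illustrative). `names` returns the column names as strings and `propertynames` returns them as Symbols; this shows the two public views, not the internal storage itself.

```julia
using DataFrames

df = DataFrame("first" => [6, 7, 8, 9], "first last" => [1, 2, 3, 4])

string_names = names(df)          # column names as a Vector{String}
symbol_names = propertynames(df)  # column names as a Vector{Symbol}

# Both describe the same labels; converting one form gives the other.
round_trip = Symbol.(string_names) == symbol_names
```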
