Spaces in Query.jl

Using DataFrames and CSV, it is possible to create a table that has spaces in the column names. When this is done, is there a way to reference these columns using Query.jl? Or do I need to rename the columns to remove the spaces?

I think you’ll have to rename them…
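Renaming can be done mechanically. Below is a minimal sketch using only Base Julia; `sanitize` is a hypothetical helper name for illustration (it is not part of DataFrames, CSV or Query), and it assumes that replacing each run of whitespace with an underscore is acceptable:

```julia
# Hypothetical helper: turn a column name containing spaces into one
# that can be written as a plain symbol literal such as :FILE_DATE.
sanitize(name) = Symbol(replace(String(name), r"\s+" => "_"))

old_names = [Symbol("FILE DATE"), :amount, Symbol("hello world")]
new_names = sanitize.(old_names)
```

The resulting names could then be handed to whatever renaming facility the table type provides.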

Yeah, spaces in column names should only be produced when you really ask for them (e.g. because you need to preserve them exactly). I’ve filed https://github.com/JuliaData/CSV.jl/issues/158.

What do other similar packages in Python or R do when there are spaces in column names?

R’s data frames allow column attributes, so a package like Haven will allow variable labels.

I think DataFrames could do this with a simple Dict for column meta-data, once I get Julia working on my work computer again I’m going to give it a try.

R’s read.csv replaces spaces with . by default (dots are the equivalent of underscores in R since they can appear in identifiers). haven doesn’t support CSV files, but the accompanying readr package provides read_csv, which keeps spaces in column names (without an option to remove them). data.table’s fread doesn’t apply any transformation by default either, but supports the check.names argument for that.

So the existing programs adopt a variety of solutions. We could at least support an argument, but whether it should be the default isn’t clear.

R also allows backticks like:

d$`hello world` = 3

For DataFramesMeta, I’ve thought about trying to parse backticks as symbols with spaces. It’d be trickier to do that in Query because you need to parse:

x.`hello world` = 3
# or maybe
x."hello world" = 3     # this parses a little better in Julia

What about pandas? (which everybody tells me is the cat’s pyjamas :wink: )

Pandas and R can both use strings for column names.

d["hello world"]

In that case, maybe the default for Julia’s DataFrames should also be to allow them without renaming?
Could aliases of the names, with _ replacing the spaces, optionally be added as well (for ease of typing)?

i.e. d[Symbol("hello world")] or d[:hello_world] would both work

R was the reason I asked the initial question. While I don’t think spaces are used frequently in base R, dplyr is designed to support spaces in column names by surrounding all column names with backticks (``).

For example, this is basic R code that takes a file date field that was read in as part of CSV data and creates a new field that is the year of the file as an integer. Notice that the field read from the file was labeled “FILE DATE”.

    x %>% mutate(`year` = format(`FILE DATE`, "%Y"))

Personally I do not think supporting spaces is necessary. But I do think that white-space / special character handling should be consistent across the more popular packages to reduce confusion and the need to write code to translate column names between packages.

I tend to think we should make it possible to work with variables containing spaces (if only via Symbol("...")), but avoid creating them by default because they are inconvenient and not very useful most of the time (AFAICT). We could also allow indexing with strings, converting them automatically to symbols, but that wouldn’t help for macros.

I’m not a fan of the idea of automatically having aliases matching underscores to spaces. That really sounds like the worst features of R, which make programming unpredictable. This kind of thing should rather be handled by auto-completion.

I agree with @nalimilan, that sounds like the best path forward.

For Query.jl, at the end of the day this boils down to what support named tuples have for field names with spaces in them. I think something like row[Symbol("foo bar")] actually works, and maybe with the new constant folding this might not even result in a type instability?
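As a quick check of the named tuple part, here is a small sketch in Base Julia. The field name and value are made up for illustration, and it says nothing about type stability:

```julia
# Build a named tuple whose only field name contains a space.
row = NamedTuple{(Symbol("foo bar"),)}((42,))

# Both forms look the field up by symbol.
a = row[Symbol("foo bar")]
b = getproperty(row, Symbol("foo bar"))
```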

In theory one could add support to named tuples to index with a string, so that row["foo bar"] would work, but that strikes me as too weird.

One thing that would really help here would be a string macro that creates a symbol, so that something like row[s"foo bar"] would work. That seems to me like a good compromise for those cases?

Indeed, a very simple string macro works for that:

macro s_str(s)
   quote
       Symbol($s)
   end
end

s can’t be used, because it’s part of the Regex syntax, although maybe S would be fine (and maybe indicates “Symbol” even better).
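A sketch of the uppercase variant suggested above. The name `S_str` is only the proposal from this thread, not an existing Base or package macro, and the named tuple is a stand-in for a query row:

```julia
# Hypothetical string macro: S"foo bar" expands to the symbol itself.
# QuoteNode keeps the symbol from being treated as a variable name.
macro S_str(s)
    QuoteNode(Symbol(s))
end

row = NamedTuple{(Symbol("foo bar"),)}((1,))
value = row[S"foo bar"]
```

Because the symbol is produced at macro expansion time, the indexing expression sees a literal symbol rather than a call to Symbol.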

I like the idea! :+1:

It’s been proposed before, but one potential issue is that it doesn’t play nicely with ., i.e. foo.bar"zzz" parses as foo.@bar_str("zzz").
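The parsing behaviour can be inspected without defining any macro, since Meta.parse only builds the expression and does not expand it. A small sketch, with `foo` and `bar` as placeholder names:

```julia
# Parse the text foo.bar"zzz" without evaluating or expanding it.
ex = Meta.parse("foo.bar\"zzz\"")

# The whole expression is a macro call; the string is its last argument.
head = ex.head
arg = ex.args[end]
```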

Once upon a time when I was super into literate programming I’d have variables in R like acceleration of gravity in meters per second squared. I eventually went back to underscores, which make clear where a variable starts and where it ends. But it would be nice in user-facing output (tables, graphs, etc.) to have underscores automatically replaced with spaces.

I’d rather use variable labels for this. See Metadata for columns and/or DataFrames · Issue #35 · JuliaData/DataFrames.jl · GitHub.

Variable labels would definitely be a nice feature. But one of the whole points of literate programming is that you shouldn’t need variable labels: your variables should be descriptive enough as is. Maybe variable labels that default to the variable name with spaces instead of underscores?
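A minimal sketch of that default, combined with the plain Dict idea mentioned earlier in the thread. The names `labels` and `label` are hypothetical and nothing here is an existing DataFrames feature:

```julia
# Hypothetical label store: explicit labels keyed by column name.
labels = Dict{Symbol,String}(:g => "acceleration of gravity (m/s^2)")

# Use the explicit label if there is one, otherwise fall back to the
# column name with underscores turned into spaces.
label(name::Symbol) = get(labels, name, replace(String(name), '_' => ' '))

explicit = label(:g)
fallback = label(:file_date)
```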

The benefit of variable labels is that they ease the transition from code to published document. If you are able to tie each variable to a nice label, then you can generate tables, graphs, etc. that look very good very easily. It saves hours of work in reformatting.