Spaces in query.jl

RandomString123 · January 22, 2018, 5:18pm

Using DataFrames and CSV it is possible to create a table that has spaces in the column names. When this happens, is there a way to reference these columns using Query.jl? Or do I need to rename the columns to remove the spaces?

davidanthoff · January 22, 2018, 11:25pm

I think you’ll have to rename them…
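
As a rough sketch of what such a renaming step could look like in plain Julia: the helper name `normalize_name` and the sample column names below are made up for illustration and are not part of DataFrames, CSV or Query.

```julia
# Hypothetical helper: turn a column name containing spaces into a Symbol
# that can be typed as an ordinary identifier.
normalize_name(name) = Symbol(replace(strip(String(name)), r"\s+" => "_"))

# Example column names (assumed for illustration).
old_names = [Symbol("FILE DATE"), :id, Symbol("hello world")]
new_names = normalize_name.(old_names)

# Old => new pairs, the shape that renaming functions usually expect.
renames = old_names .=> new_names
```

The resulting pairs could then be handed to whatever renaming function the table package provides (for DataFrames, probably `rename!`).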

nalimilan · January 23, 2018, 12:39pm

Yeah, spaces in column names should only be produced when you really ask for them (e.g. because you need to preserve them exactly). I’ve filed https://github.com/JuliaData/CSV.jl/issues/158.

ScottPJones · January 23, 2018, 2:29pm

What do other similar packages in Python or R do when there are spaces in column names?

pdeffebach · January 23, 2018, 2:44pm

R’s data frames allow column attributes, so a package like Haven will allow variable labels.

I think DataFrames could do this with a simple Dict for column metadata. Once I get Julia working on my work computer again, I’m going to give it a try.

nalimilan · January 23, 2018, 2:54pm

R’s read.csv replaces spaces with . by default (dots are the equivalent of underscores in R since they can appear in identifiers). haven doesn’t support CSV files, but the accompanying readr package provides read_csv, which keeps spaces in column names (without an option to remove them). data.table’s fread doesn’t apply any transformation by default either, but supports the check.names argument for that.

So the existing programs adopt a variety of solutions. We could at least support an argument, but whether it should be the default isn’t clear.

tshort · January 23, 2018, 3:20pm

R also allows backticks like:

    d$`hello world` = 3

For DataFramesMeta, I’ve thought about trying to parse backticks as symbols with spaces. It’d be trickier to do that in Query because you need to parse:

    x.`hello world` = 3
    # or maybe
    x."hello world" = 3     # this parses a little better in Julia

ScottPJones · January 23, 2018, 3:22pm

What about pandas? (which everybody tells me is the cat’s pyjamas)

tshort · January 23, 2018, 3:26pm

Pandas and R can both use strings for column names.

    d["hello world"]

ScottPJones · January 23, 2018, 3:28pm

In that case, maybe the default for Julia’s DataFrames should also be to allow them without renaming?
Could aliases of the names, with _ replacing the spaces, optionally be added as well (for ease of typing)?

i.e. d[Symbol("hello world")] or d[:hello_world] would both work

RandomString123 · January 23, 2018, 3:34pm

R was the reason I asked the initial question. While I don’t think spaces are used frequently in base R, dplyr is designed to support spaces in column names by surrounding the column names with backticks (``).

For example, this is basic R code that takes a file date field that was read in as part of CSV data and creates a new field that is the year of the file as an integer. Notice that the field read from the file was labeled “FILE DATE”.

    x %>% mutate(`year` = format(`FILE DATE`, "%Y"))

Personally I do not think supporting spaces is necessary. But I do think that white-space / special character handling should be consistent across the more popular packages to reduce confusion and the need to write code to translate column names between packages.

nalimilan · January 23, 2018, 4:18pm

I tend to think we should make it possible to work with variables containing spaces (if only via Symbol("...")), but avoid creating them by default because they are inconvenient and not very useful most of the time (AFAICT). We could also allow indexing with strings, converting them automatically to symbols, but that wouldn’t help for macros.

I’m not a fan of the idea of automatically having aliases matching underscores to spaces. That really sounds like the worst features of R, which make programming unpredictable. This kind of thing should rather be handled by auto-completion.

davidanthoff · January 23, 2018, 5:26pm

I agree with @nalimilan, that sounds like the best path forward.

For Query.jl, at the end of the day this boils down to what support named tuples have for fields with spaces in them. I think something like row[Symbol("foo bar")] actually works, and maybe with this whole new constant folding this might not even result in a type instability?

In theory one could add support to named tuples to index with a string, so that row["foo bar"] would work, but that strikes me as too weird.

One thing that would really help here would be a string macro that creates a symbol, so that something like row[s"foo bar"] would work. That seems to me like a good compromise for those cases?
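
The row[Symbol("foo bar")] indexing mentioned above can be tried out with a small sketch like the following. It assumes the built-in named tuples of a current Julia 1.x release (not the NamedTuples.jl package), and the field names and values are invented for illustration.

```julia
# Build a named tuple whose first field name contains a space.
row = NamedTuple{(Symbol("foo bar"), :baz)}((1, 2.5))

# Index by Symbol; plain dot syntax cannot spell a name with a space in it.
value = row[Symbol("foo bar")]

# getproperty is the function form of dot access and accepts any Symbol.
same_value = getproperty(row, Symbol("foo bar"))
```

Whether this stays type-stable inside a query depends on the compiler being able to propagate the constant symbol, which this sketch does not check.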

nalimilan · January 23, 2018, 9:48pm

Indeed, a very simple string macro works for that:

    macro s_str(s)
        quote
            Symbol($s)
        end
    end
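
Here is a minimal usage sketch of the same idea. The macro name `sym` is a hypothetical choice, used so that the sketch does not shadow the s"..." literal that Base already defines (substitution strings); the named tuple is invented for illustration.

```julia
# Hypothetical string macro: sym"..." expands to a Symbol literal.
macro sym_str(s)
    # The Symbol is built at macro-expansion time; QuoteNode makes the macro
    # return the Symbol itself rather than a variable reference.
    QuoteNode(Symbol(s))
end

# A row with a field name containing a space (assumed example data).
row = NamedTuple{(Symbol("foo bar"),)}((42,))

# Index the row using the string macro instead of Symbol("foo bar").
value = row[sym"foo bar"]
```

Returning a `QuoteNode` instead of a quoted call to `Symbol` means no work is left for run time, although the quote-based version gives the same result.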

ScottPJones · January 23, 2018, 9:54pm

s can’t be used, because s"..." is already part of the Regex syntax (substitution strings), although maybe S would be fine (and maybe indicates “Symbol” even better).

I like the idea!

simonbyrne · January 23, 2018, 11:12pm

It’s been proposed before, but one potential issue is that it doesn’t play nicely with ., i.e. foo.bar"zzz" parses as foo.@bar_str("zzz").
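
One way to see this is to ask the parser directly, as in the sketch below. `foo` and `bar` are just the placeholder names from the sentence above, and nothing is evaluated, so neither has to exist.

```julia
# Parse the expression without evaluating it.
ex = Meta.parse("foo.bar\"zzz\"")

# The head shows that the parser produced a macro call, not a field access.
println(ex.head)

# Print the full expression tree for inspection.
dump(ex)
```

So a dot-access form like row.s"foo bar" would be read as a macro looked up inside `row`, not as a field lookup on it.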

bramtayl · January 24, 2018, 4:53am

Once upon a time, when I was super into literate programming, I’d have variables in R like acceleration of gravity in meters per second squared. I eventually went back to underscores, which make clear where a variable starts and where it ends. But it would be nice in user-facing output to have underscores automatically replaced with spaces (tables, graphs, etc.)

nalimilan · January 24, 2018, 8:52am

I’d rather use variable labels for this. See Metadata for columns and/or DataFrames · Issue #35 · JuliaData/DataFrames.jl · GitHub.

bramtayl · January 24, 2018, 2:51pm

Variable labels would definitely be a nice feature. But one of the whole points of literate programming is that you shouldn’t need variable labels: your variables should be descriptive enough as is. Maybe variable labels that default to the variable name with spaces instead of underscores?

pdeffebach · January 24, 2018, 6:17pm

The benefit of variable labels is that they ease the transition from code to published document. If you are able to tie each variable to a nice label, then you can generate tables, graphs, etc. that look very good very easily. It saves hours of work in reformatting.
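
A minimal sketch of the label idea discussed in this thread, using a plain Dict and falling back to the column name with underscores turned into spaces. The names `labels` and `label` and the example columns are hypothetical; this is not an existing DataFrames feature.

```julia
# Hypothetical label store: column name => human-readable label.
labels = Dict{Symbol,String}(:file_date => "File date")

# Look up a label; if none was set, derive one from the column name
# by replacing underscores with spaces.
label(name::Symbol) = get(labels, name, replace(String(name), '_' => ' '))

explicit = label(:file_date)                # a label that was set by hand
derived = label(:acceleration_of_gravity)   # no label set, so the fallback applies
```

Table or plot code could then call `label(col)` for headers and axis titles while the code itself keeps the underscore names.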
