Avoiding global variables while using DataFrames

How does one follow the manual’s guidance for performance and avoid global variables while working with DataFrames? For small tables, one could take an approach like this:

using DataFrames

construct1() = DataFrame(A=1:3, B=5:7, fixed=1)

typeof(construct1().A)
names(construct1())
propertynames(construct1())

function foo()
    bar = [1, 2, 3]
    baz = ["a", "b", "c"]
    DataFrame(; bar, baz)
end
foo()

When working with large DFs that have to be read from disk each time via CSV.read, this approach seems impractical, even if a fast reader like Arrow is used. Much of working with a new DF involves exploring, understanding, and cleaning the data. During this process, I see no way to avoid globals. Once that exploration is complete, one could wrap the stable processing code and DF inside a function to be reused with new data.

Do others take a different approach? Or is this a situation to just use global variables?

A DataFrame can be global, as long as you use functions to work with it.

The transform, select, and combine functions are written so that operations stay fast even when the data frame is a global, because the heavy work runs inside the function call. DataFramesMeta.jl works the same way and provides more utilities for making work with data frames fast in global scope.
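As a rough illustration of that point, here is a minimal sketch with a small made-up table (the column names `group` and `value` are only for the example). The data frame is a plain global, but the row-level work runs inside `transform` and `combine` rather than in top-level code:

```julia
using DataFrames
using Statistics

# A global, non-const data frame, as you would have during interactive exploration.
df = DataFrame(group = ["a", "b", "a", "b"], value = [1.0, 2.0, 3.0, 4.0])

# The anonymous function receives the whole :value column as a typed vector,
# so the element-wise work is compiled for Vector{Float64}.
df2 = transform(df, :value => (v -> v .* 2) => :doubled)

# Grouped aggregation: `mean` is applied to each group's :value column.
means = combine(groupby(df, :group), :value => mean => :mean_value)
```

Only the outer call is dispatched dynamically on the global `df`; the per-row work runs in compiled code that knows the column types.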

Side note:

construct1() = DataFrame(A=1:3, B=5:7, fixed=1)

typeof(construct1().A)
names(construct1())
propertynames(construct1())

is not doing what you think it’s doing. It constructs a new data frame every time you call construct1. Any modification you make to one result is therefore lost on the next call, and rebuilding the table on every access will be very slow for large data.

Thanks. Yeah, I understood that a new DF is created with each function call. One could still compose it with other functions, bar(construct1()), etc. to get transformed results returned from that particular call. However, for large DFs this didn’t seem practical, as you state above. I’d like to understand better how DFs can be global and yet avoid the performance penalty.

This section in the performance tips will help.
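One pattern the performance tips describe is the function barrier, which may be what applies here. The sketch below is only an illustration: `sum_products` is a hypothetical helper, not part of DataFrames.jl. The idea is to pull the columns out of the global data frame once and hand them to a function, which Julia then compiles for the concrete column types.

```julia
using DataFrames

# Global data frame; the compiler cannot know the column types at top level.
df = DataFrame(A = 1:3, B = 5:7)

# Hypothetical kernel function: inside it, `a` and `b` have concrete types,
# so the loop is compiled to fast, type-stable code.
function sum_products(a::AbstractVector, b::AbstractVector)
    total = zero(promote_type(eltype(a), eltype(b)))
    for i in eachindex(a, b)
        total += a[i] * b[i]
    end
    return total
end

# The dynamic lookup on the global happens once, at this call boundary.
result = sum_products(df.A, df.B)
```

The cost of the untyped global is paid once per call instead of once per loop iteration, which is why the data frame itself can stay global during exploration.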
