Avoiding global variables while using DataFrames

George9000 · October 25, 2021, 2:43am

How does one follow the manual’s guidance for performance and avoid global variables while working with DataFrames? For small tables, one could take an approach like this:

construct1() = DataFrame(A=1:3, B=5:7, fixed=1)

typeof(construct1().A)
names(construct1())
propertynames(construct1())

function foo()
    bar = [1, 2, 3]
    baz = ["a", "b", "c"]
    DataFrame(; bar, baz)
end
foo()

When working with large DFs that must be read from disk each time via CSV.read, this approach seems impractical, even if a fast reader like Arrow is used. Much of working with a new DF involves exploring, understanding, and cleaning the data. During this process, I see no way to avoid globals. Once that exploration is complete, one could wrap the stable processing code and DF inside a function to be reused with new data.

Do others take a different approach? Or is this a situation to just use global variables?

pdeffebach · October 25, 2021, 1:33pm

A DataFrame can be global, as long as you use functions to work with it.

The transform, select, and combine functions are written so that operations stay fast even when the data frame is a global. DataFramesMeta.jl works the same way, and provides more utilities for working with data frames quickly in global scope.
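As a minimal sketch of that point, assuming DataFrames.jl is installed: the data frame below is an ordinary non-constant global, but the work happens inside the functions passed to combine and transform, which receive the columns as concretely typed vectors. The output column names (A_sum, AB_dot, A_plus_B) are made up for illustration.

```julia
using DataFrames

# A data frame bound to a non-constant global, as in interactive exploration.
df = DataFrame(A = 1:3, B = 5:7)

# The functions on the right-hand side of each pair are called with the
# column vectors as arguments, so the computation inside them does not
# touch the global binding `df`.
totals = combine(df,
    :A => sum => :A_sum,
    [:A, :B] => ((a, b) -> sum(a .* b)) => :AB_dot)

# transform keeps the existing columns and appends a new one.
wider = transform(df, [:A, :B] => ByRow(+) => :A_plus_B)
```

Here totals is a one-row data frame and wider has the same rows as df plus the extra column.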

Side note:

construct1() = DataFrame(A=1:3, B=5:7, fixed=1)

typeof(construct1().A)
names(construct1())
propertynames(construct1())

is not doing what you think it’s doing. It constructs a new data frame every time you call construct1. That makes it impossible to modify a data frame, because each change is lost on the next call, and rebuilding the table every time will be very slow.
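A short sketch of the difference, assuming DataFrames.jl is installed; the column name C and the variable first_value are chosen only for illustration:

```julia
using DataFrames

construct1() = DataFrame(A = 1:3, B = 5:7, fixed = 1)

# Each call builds a fresh object, so a change made to one result
# is not visible in the result of the next call.
construct1().A[1] = 100
first_value = construct1().A[1]   # read from a brand-new data frame

# Build the table once, keep the binding, and modify that one object.
df = construct1()
df.A[1] = 100
df.C = df.A .+ df.B               # adds a column to the existing data frame
```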

George9000 · October 25, 2021, 2:59pm

Thanks. Yeah, I understood that a new DF is created with each function call. One could still compose it with other functions, bar(construct1()), etc. to get transformed results returned from that particular call. However, for large DFs this didn’t seem practical, as you state above. I’d like to understand better how DFs can be global and yet avoid the performance penalty.

pdeffebach · October 25, 2021, 3:15pm

This section in the performance tips will help.
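One technique the Julia manual's performance tips describe for this kind of situation is a function barrier: pull the columns out of the global data frame and pass them to a function, which Julia then compiles for the concrete vector types. Whether that is the section linked above is an assumption, and weighted_sum is a hypothetical kernel written for illustration. A minimal sketch, assuming DataFrames.jl is installed:

```julia
using DataFrames

df = DataFrame(A = 1:3, B = 5:7)   # non-constant global

# Hypothetical kernel: it receives plain vectors, so Julia compiles a
# method specialized on their concrete element types.
function weighted_sum(a::AbstractVector, b::AbstractVector)
    total = zero(promote_type(eltype(a), eltype(b)))
    for i in eachindex(a, b)
        total += a[i] * b[i]
    end
    return total
end

# The column lookups on the global happen once, outside the loop.
result = weighted_sum(df.A, df.B)
```

The loop itself never refers to df, so the untyped global only costs a single dynamic dispatch at the call site.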
