Argument passing with dataframes


I have a question about how dataframes behave when passed to functions:

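using DataFrames

# Adds a column mycol computed from the ID column.
function addcol1(x)
	x.mycol = x.ID .+ 1
end

# Joins x with a small jobs table on the ID column.
function join1(x)
	jobs = DataFrame(ID = [20, 40, 45], Job = ["Lawyer", "Doctor", "teacher"])
	x = innerjoin(x, jobs, on = :ID)
	return x
end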

Run this with

people = DataFrame(ID = [20, 40, 45], name = ["Jane", "Kalle", "Petri"], h = [10, 16, 25])
addcol1(people)
join1(people)

The first call changes the variable people in the calling scope, the second does not. Why?

In the join1 function, the variable x initially refers to the object passed as the argument. Later you do this:

x = innerjoin(x, jobs, on = :ID)

This means that x now represents a different object: the return value of innerjoin. It does not mean that the previous object will be changed.

This is a common source of confusion for people used to some other programming languages. See this FAQ entry.

Also, x.mycol = x.ID .+ 1 in addcol1 is a mutating operation: it adds a column to the existing object. There are no mutating operations in join1.
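
To make the difference concrete, here is a minimal sketch using plain vectors instead of dataframes, so it needs no packages. The function names rebind and mutate! are made up for this illustration; the same rules apply to any mutable object, including a DataFrame.

```julia
# Hypothetical helper: assigns a new array to the local name v.
# The array the caller passed in is not touched.
function rebind(v)
    v = [0, 0, 0]   # v now refers to a different object
    return v
end

# Hypothetical helper: changes the contents of the array that v refers to.
# The caller sees the change because both names refer to the same object.
function mutate!(v)
    v[1] = 0
    return v
end

a = [1, 2, 3]

b = rebind(a)
println(a == [1, 2, 3])   # true: a is unchanged
println(b === a)          # false: b is a different object

c = mutate!(a)
println(a == [0, 2, 3])   # true: a was modified in place
println(c === a)          # true: c is the very same object as a
```

The === operator checks object identity, which is a quick way to see whether a function gave you back the object you passed in or a new one.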

So whenever there is an assignment

x = something

this means that the connection to the original object is lost?

My next question is: should I return the resulting dataframe as return value; or is there a way to copy the result of innerjoin to x so that x still represents the original object?

Yes, exactly!

You should return it as a return value. If you were just adding a column for example, you could modify the existing object. But with a “join” operation you create a new object, so you should return it.
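
As a sketch of that pattern, assuming the same people and jobs tables as in the question: the function returns the joined dataframe, and the caller decides whether to rebind its own variable to the result. The name original is only there to show that the first object is left alone.

```julia
using DataFrames

# Returns a new DataFrame; the argument itself is not modified.
function join1(x)
    jobs = DataFrame(ID = [20, 40, 45], Job = ["Lawyer", "Doctor", "teacher"])
    return innerjoin(x, jobs, on = :ID)
end

people = DataFrame(ID = [20, 40, 45], name = ["Jane", "Kalle", "Petri"], h = [10, 16, 25])

original = people        # a second name for the same object
people = join1(people)   # rebind the caller's variable to the joined result

println(names(original)) # the first object still has only ID, name, h
println(names(people))   # the joined result also has the Job column
```

Writing people = join1(people) at the call site keeps it visible that people refers to a new object afterwards.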

You could get around these rules by defining a macro instead of a function, but that’s generally a bad idea: macros should only be used with good reason, as they make the code less readable and less predictable, and they are harder to write without bugs.