General help with looping through dataframe and types declarations

Total Julia noob here.

I’m used to Python and R (and I really dislike the Python and R syntax and generally think the Julia syntax makes more sense for me) but cannot get my head around the types declarations in Julia. There is just something about it that I just can’t grasp. I’ve read through that documentation several times.

I created this dummy example to debug a little issue I’m having and thought it would make a good self training example. I think I’m doing at least 5 things wrong here. Can someone help me with the better way to do it (although this does work as is but probably very inefficiently)?


using ODBC
using DataFrames
using Statistics

dsn = ODBC.DSN("TEST","TEST","TEST")

createtablecmd = """
    CREATE TABLE TEST_JULIA
    (
    X INTEGER NOT NULL
    , Y INTEGER NOT NULL
    , Z INTEGER NOT NULL
    )
"""

ODBC.execute!(dsn, createtablecmd)

summarydf = DataFrame(ODBCSummary = Float64[], DFSummary = Float64[])

for i in 1:100
    for x in 1:10
        for y in 1:10
            z = convert(Int64, round(rand()))
            ODBC.execute!(dsn, "INSERT INTO test_julia (X, Y, Z) VALUES ($x,$y,$z)")
        end
    end

    data = ODBC.query(dsn, "SELECT * FROM test_julia")

    summary = ODBC.query(dsn, "SELECT avg(z) as z FROM test_julia")[:Z][1]
    dfsummary = mean(data[:Z])

    push!(summarydf, [summary dfsummary])
    ODBC.execute!(dsn, "TRUNCATE TABLE test_julia")
end 

summarydf[:ratio] = Float64(0)
summarydf[:delta] = Float64(0)

for i in 1:length(summarydf[:ratio])
    summarydf[i,:ratio] = summarydf[i,:ODBCSummary]/summarydf[i,:DFSummary]
    summarydf[i,:delta] = summarydf[i,:ODBCSummary]-summarydf[i,:DFSummary]
end


ODBC.execute!(dsn, "DROP TABLE TEST_JULIA")

The questions I’m wondering about

  • Is the way I’m generating the empty dataframe correct? I’m guessing at how the type declarations are defined, although they seem to work.
  • Is the way that I’m adding a row to the bottom of the dataframe the best way?
  • is the way I’m retrieving a single result from ODBC the right way? It seems very inefficient (although it’s similar in Python Pandas.read_sql)
  • It seems like there should be an easier way to create a calculated column for the ratio and delta, but the dot notation isn’t working here. I’m getting this error

julia> summarydf[:ratio] .= summarydf[:ODBCSummary] / summarydf[:DFSummary]
ERROR: KeyError: key :ratio not found

so I create a column first (which you don’t need to do in python or R)
julia> summarydf[:ratio] = Float64(0)
julia> summarydf[:ratio] .= summarydf[:ODBCSummary] / summarydf[:DFSummary]
ERROR: DimensionMismatch(“cannot broadcast array to have fewer dimensions”)

which is why I’m looping one row at a time which doesn’t feel right to me although it does work

Anyways, I’m hoping someone can give me some pointers. I think overall, the type declarations are the one thing holding me back from being able to fully use Julia. I seem to be having some sort of mental block there.

Thanks in advance.

Float64(0) will simply return the floating point value 0.00, so you’re trying to broadcast the new column onto a single value which doesn’t work.

Similarly, you replace the columns in your data frames with a single zero after your first loop.

You don’t have to pre-allocate a column, see the getting started section of the DataFrames docs, which includes this example:

julia> df = DataFrame()
0×0 DataFrames.DataFrame


julia> df[:A] = 1:8
1:8

julia> df[:B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
8-element Array{String,1}:
 "M"
 "F"
 "F"
 "M"
 "F"
 "M"
 "M"
 "F"

Can you test that? because using df[:X] = Float64(0) does correctly allocate 0 to every row in that column of the dataframe for me.

The reason I need to pre-allocate is that if I don’t, then the row by row for loop errors out.

Apologies, that’s what I get for making things up without actually running the MWE (I don’t have a working version of ODBC installed hence couldn’t run your example).

Testing it, you’re of course right. The following works for me:


using DataFrames

df = DataFrame(rand(100, 2), [:A, :B])

df[:divide] = Float64(0)

for i ∈ 1:size(df)[1]
    df[i, :divide] = df[i, :A]/df[i, :B]
end

Is there a specific reason why you need the loop rather than a broadcasted version like:

df[:divide] = df[:A]./df[:B]

The only reason is ignorance. I couldn’t get that working.

Now I see, I had the dot notation incorrect. I had it on the = not on the /

Yes, now it works, thanks.

Any suggestions about the type declarations? That’s one area where I’m totally lost.

Anyways, thanks for your help.

What you’re doing is fine, an alternative would be the below, but I don’t think there’s an upside or downside to either method:

DataFrame([Int, Float64, String], [:A, :B, :C], 0)