DataFrame Row Indexing and Transaction Data Aggregation

Hello,

Still learning the language, but I am excited to do some processing of financial data that I have not been happy doing in Excel.

I have a DataFrame of transaction data in the form of:

38838×7 DataFrame
│ Row   │ Name                               │ SKU            │ L2           │ L3            │ Qty   │ Rev     │ Date       │
│       │ String                             │ String         │ String       │ String        │ Int64 │ Float64 │ Date       │
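
For concreteness, here is a hypothetical three-row table in the same shape (column names from the schema above, values made up):

using DataFrames, Dates

data = DataFrame(Name = ["Widget A", "Widget A", "Widget B"],
                 SKU  = ["W-001", "W-001", "W-002"],
                 L2   = ["Hardware", "Hardware", "Hardware"],
                 L3   = ["Widgets", "Widgets", "Widgets"],
                 Qty  = [2, 3, 5],
                 Rev  = [19.98, 29.97, 74.95],
                 Date = [Date(2019, 1, 3), Date(2019, 2, 14), Date(2019, 1, 20)])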

I want to convert the transaction data into a time series, using the product name as a unique key. I have done that, but I was not pleased with my implementation, as I am certain there are better ways of doing it. Probably a bit nitpicky, but I am trying to learn the language better.

using DataFrames, Dates

# Creates a new one-row DataFrame for the timeseries, with a zeroed column per month
function newRow(row)::DataFrame
    new_row = DataFrame(Name=row.Name, SKU=row.SKU, L2=row.L2, L3=row.L3)
    for date in Date(2019, 1):Dates.Month(1):Date(2020, 12)
        new_row[!, string(date)] .= 0
    end
    return new_row
end

# Converts the transaction data into timeseries data
function makeTimeSeries(data)::DataFrame
    # Init ts with the first transaction's product
    ts = DataFrame()
    append!(ts, newRow(data[1, :]))

    # Build row by row
    foreach(eachrow(data)) do row
        # Check to see if the product is already in ts
        did_find = false
        for (i, name) in enumerate(ts.Name)
            if row.Name == name
                did_find = true
                ts[i, string(row.Date)] += row.Qty
                break
            end
        end

        # Add the product to the dataframe if it is new
        if !did_find
            append!(ts, newRow(row))
            ts[end, string(row.Date)] += row.Qty
        end
    end
    return ts
end

I feel like there is probably a more Julian way of transforming this data, but I am unsure of how I would go about it. I'm not entirely sure why my allocations are so high, and I feel like the transformation should be quicker, given what I have seen from Julia so far. Nothing obvious jumped out at me in the performance documentation, though I am still fairly new.

@time ts = makeTimeSeries(data)
  4.693930 seconds (94.55 M allocations: 2.512 GiB, 12.47% gc time)
856×28 DataFrame. Omitted printing of 21 columns
│ Row │ Name                                  │ SKU            │ L2           │ L3                  │ 2019-01-01 │ 2019-02-01 │ 2019-03-01 │  
│     │ String                                │ String         │ String       │ String              │ Int64      │ Int64      │ Int64      │

Anyway, any advice the community can give would be greatly appreciated.

groupby(data, :Name)? It also looks like you want: Reshaping · DataFrames.jl

Sorry, I don't quite understand all your code. But here is a way to get a time period for a panel data frame:

julia> df = DataFrame(group = repeat(["group$i" for i in 1:10], 10), val = rand(100));

julia> transform!(groupby(df, "group"), "val" => (t -> 1:length(t)) => "timeperiod");

What language are you coming from? That may help us understand what you want better.


I am coming from a C++ background mainly, though I have spent a fair bit of time in many languages like Python, Rust, JavaScript, Dart, etc. I like learning languages and seeing how different groups think about and solve different problems. This effort is mainly because I think Excel is a boring way to solve this problem. I have shifted from an engineering to a product management role, which is all well and good, but my hobby is learning languages and trying them out.

The main issue is that I can throw a few pivot tables down and have the data moved around exactly how I want, but that is not a fair comparison, because I know Excel, and serious data analysis is hardly Excel's domain. I want to do much more in-depth visualization and filtering than I am currently able to.

Side note: I did re-implement and got allocations down, but it's not faster :frowning:

function betterTS(data)::DataFrame
    # One pass to find the unique products instead of four
    products = unique(data, :Name)

    # Copy the key columns, then add a zeroed Int column per month
    ts = products[:, [:Name, :SKU, :L2, :L3]]
    for date in Date(2019, 1):Dates.Month(1):Date(2020, 12)
        ts[!, string(date)] .= 0
    end

    # For each product row, scan all transactions and accumulate quantities
    for row in eachrow(ts)
        for data_row in eachrow(data)
            if data_row.Name == row.Name
                row[string(data_row.Date)] += data_row.Qty
            end
        end
    end
    return ts
end

@time ts = betterTS(data)
  5.339110 seconds (113.60 M allocations: 1.714 GiB, 5.49% gc time)
856×28 DataFrame. Omitted printing of 18 columns
│ Row │ Name                               │ SKU            │ L2           │ L3            │ 2019-01-01 │ 2019-02-01 │ 2019-03-01 │ 2019-04-01 │ 2019-05-01 │ 2019-06-01 │
│     │ String                             │ String         │ String       │ String        │ Int64      │ Int64      │ Int64      │ Int64      │ Int64      │ Int64      │

Wait, so you want to go from long to wide data? If you were to give a full MWE, something we can copy and paste, that would be very helpful.

That said, please read the docs, particularly with regard to combine. You definitely want to use combine. Everything you are doing here is, I think, easily done using combine rather than a for loop.
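
A sketch of that idea, assuming a small table shaped like yours (tx and Month are just illustration names; firstdayofmonth from Dates buckets each transaction into its month):

using DataFrames, Dates

tx = DataFrame(Name = ["A", "A", "B"],
               Qty  = [2, 3, 5],
               Date = [Date(2019, 1, 5), Date(2019, 1, 20), Date(2019, 2, 1)])

# Bucket each transaction into its month, then sum Qty per product per month
monthly = combine(groupby(transform(tx, :Date => ByRow(firstdayofmonth) => :Month),
                          [:Name, :Month]),
                  :Qty => sum => :Qty)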


Pretty much what I am looking to do is sum the quantity of an order in each month per product. What that will do is give me a time series of monthly totals of product sales, which can be useful for projecting forecasts. I think I am seeing how the grouping could get me where I need to be. I can group on product, then group by date. From there I think I could more easily sum the values, which makes sense. I hadn't thought about nested grouping. I will try that. Thank you!

Hmmm, I don't see a double grouping in your code… but grouping is the way to go.

Another piece of "code smell" in your code is that you are making a very wide data frame, with dates as columns. This is fine in Julia; you won't incur performance costs, and you can still work with things normally. However, combine and the grouping infrastructure are designed to work with "long" data rather than "wide" data.
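
If you do want the wide layout at the end, unstack does the long-to-wide pivot in one call. A sketch, assuming a monthly table like the one from the combine example above:

# One row per Name, one column per month, Qty as the cell value;
# missing product/month combinations become `missing`
wide = unstack(monthly, :Name, :Month, :Qty)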

No grouping yet, working on seeing if that would work. Still figuring out how to iterate the groups and make that work for what I am looking for.

You can either:

combine(groupeddf) do df
    # do something to each group's sub-DataFrame
end

or, loop over the grouped dataframes directly
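
A minimal sketch of that second option, assuming data as in the original post:

for sdf in groupby(data, :Name)
    # each `sdf` is a SubDataFrame holding one product's transactions
    println(sdf.Name[1], ": ", sum(sdf.Qty))
end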

I am able to mostly get what I am looking for using combine and grouping, though I am sure I am still missing some key concepts.

arr = []
for (k1, g1) in pairs(groupby(sap, :Name))
    for (k2, g2) in pairs(groupby(g1, :Date))
        push!(arr, [values(k1)[1], values(k2)[1], sum(g2.Qty)])
    end
end
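
(I think the same aggregation also collapses into a single grouped combine call; a sketch with the same sap table:)

# One grouped pass: sum Qty per (Name, Date) pair
totals = combine(groupby(sap, [:Name, :Date]), :Qty => sum => :Qty)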

A heck of a lot faster, at least.

If you post a full MWE with a sample data set, we will be able to help you write totally performant code with idiomatic Julia usage.


Finally got back to this and made a much more efficient method. As expected, it was more of an issue with my logic than with the effectiveness of Julia. Starting to figure out some of the fun of DataFrames.

function evenBetterTS(data::DataFrame)::DataFrame
    df = unique(data, :Name)[:, [:Name, :SKU, :L2, :L3]]
    for date in Date(2019, 1):Dates.Month(1):Date(2020, 12)
        df[!, Symbol("Rev ", date)] .= 0.0
        df[!, Symbol("Qty ", date)] .= 0
    end
    # Single pass over the transactions; assumes `data` is grouped by Name,
    # in the same row order that unique(data, :Name) produces
    name = data.Name[1]
    index = 1
    for row in eachrow(data)
        if row.Name != name
            name = row.Name
            index += 1
        end
        df[index, Symbol("Qty ", row.Date)] += row.Qty
        df[index, Symbol("Rev ", row.Date)] += row.Rev
    end
    return df
end

@time evenBetterTS(data)
  0.110750 seconds (1.73 M allocations: 55.902 MiB, 9.85% gc time)