Updating a value using a column

I want to select a row from a dataframe and update a value on that row using a message received via zmq. So the message would provide

Bucket = “bucket_1” # this would add a new row to the empty dataframe
field_in = “Latency”
value_in = 34.90

so the intent is to add the bucket_1 row to the empty dataframe and use the “field_in” variable to locate df_test.latency and update the new row.

from docs

I can see that I can

julia> df[!, r"x"]
1×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

so I set up an empty dataframe and gave it some column names.

julia> test_col_names = ["Bucket","Tolling","Latency"]
julia> df_test = DataFrame(test_col_names .=> rand.([String,Float64,Float64],0))
julia> describe(df_test)

3×7 DataFrame
 Row │ variable  mean    min      median   max      nmissing eltype
     │ Symbol    Union…  Nothing  Nothing  Nothing  Int64 DataType
─────┼─────────────────────────────────────────────────────────────────
   1 │ Bucket                                              0 String
   2 │ Tolling   NaN                                       0 Float64
   3 │ Latency   NaN                                       0 Float64

typed in a test value for field_in and tried out what I “thought” the docs said would work.


julia> field_in = "Tolling"
"Tolling"

julia> df_test[!,rfield_in]

and got

ERROR: UndefVarError: rfield_in not defined
Stacktrace:
 [1] top-level scope
   @ REPL[23]:1

can someone point out what I am missing please.

I want to end up with df_test looking like this

bucket    |  tolling  | latency

bucket_1  | 0    |  34.90

and later on if I get a message

Bucket = “bucket_1” # this would add a new row to the empty dataframe
field_in = “Tolling”
value_in = 1.9902

bucket    |  tolling  | latency

bucket_1  | 1.9902  |  34.90

but first I have to get the column selection working. Any help appreciated.

Your problem

You get error message rfield_in not defined which means that rfield_in is not defined.
And indeed it is not defined as you defined field_in instead.

General comment

Do you want:

  1. add a new row to the data frame, or
  2. update an existing row with new data, or
  3. do 1 or 2 depending on whether the row is present in the data frame

?

as it is not clear from your description what you want.

Hi there,
thank you for looking into this. My intent is 3

 do 1 or 2 depending on whether the row is present in the data frame

My intent is to have a message come in via zmq of the form

<row index><column><value>

for example
<“bucket_1”><“Latency”><34.90>

this would be used to:

locate the row OR add it if not present.

select the column using what I “thought” was the construct to search for a column name by putting the message value into a variable field_in. My intent was to use “r” and field_in instead of r"column name" as per the documentation.

df_test[!,rfield_in]

then

update the value located at

so in my use case the message

<“bucket_1”><“Latency”><34.90>

would add to the empty dataframe resulting in

bucket    |  tolling  | latency

bucket_1  | 0    |  34.90

and the subsequent message

< “bucket_1” ><“Tolling”><199.02>

would result in

bucket    |  tolling  | latency

bucket_1  | 199.02    |  34.90

Could you point me to the place where you got this recommendation in the documentation? It is incorrect and I need to fix/clarify the documentation. r"..." is a regular expression allowing you to select multiple columns matching some pattern, so this is a different use case.

Now if your use case is:

do 1 or 2 depending on whether the row is present in the data frame

I recommend you do not use data frame, but use Dict instead. In such dictionary “bucket” should be dictionary key, and “Tooling” and “Latency” should be stored as value. Then, when you get new data you just update the dictionary.

Hi there
in the original post I cited the the julia docs where I saw the example

It is also possible to use a regular expression as a selector of columns matching it:

julia> df = DataFrame(x1=1, x2=2, y=3)
1×3 DataFrame
 Row │ x1     x2     y
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      3

julia> df[!, r"x"]
1×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

I am SURE it’s a misread of the documentation on my part. My thinking was that I could use the regex to locate the column and it would only return one value as the column headers are unique.

I “thought” the language construct could have handled my substitution of

df[!, r"x"]

with

field_in = "Tolling"

df[!,rfield_in]

I think I tried using a Symbol. PLEASE remember I am a noob!

df[!,r:field_in]

I was trying to avoid using Dict as I am not too keen on them, no real reason I just prefer dataframes or arrays. I will certainly look into your suggestion but, to me, it’s an uphill battle.

I don’t really see how “bucket” would be the key and “Tolling”/“Latency” values BUT thanks for the suggestion, something for my to concentrate on when I look in Dict.

thank you for taking the time to look at this and I am REALLY looking forward to you book.

theakson

  1. To select column whose name is stored in field_in string just write df[!, field_in].
  2. How to create a dictionary I am talking about:
julia> d = Dict("bucket_1" => (tooling=0, latency=34.90)) # some initial data as an example
Dict{String, NamedTuple{(:tooling, :latency), Tuple{Int64, Float64}}} with 1 entry:
  "bucket_1" => (tooling = 0, latency = 34.9)

julia> new_data = (bucket="bucket_2", tooling=1, latency=35.12)
(bucket = "bucket_2", tooling = 1, latency = 35.12)

julia> d[new_data.bucket] = new_data[(:tooling, :latency)]
(tooling = 1, latency = 35.12)

julia> d
Dict{String, NamedTuple{(:tooling, :latency), Tuple{Int64, Float64}}} with 2 entries:
  "bucket_2" => (tooling = 1, latency = 35.12)
  "bucket_1" => (tooling = 0, latency = 34.9)
  1. You can use DataFrame also, but it will be slower and more complicated:
julia> using DataFrames

julia> df = DataFrame(bucket="bucket_1", tooling=0, latency=34.90) # some initial data as an example
1×3 DataFrame
 Row │ bucket    tooling  latency
     │ String    Int64    Float64
─────┼────────────────────────────
   1 │ bucket_1        0     34.9

julia> new_data = (bucket="bucket_2", tooling=1, latency=35.12)
(bucket = "bucket_2", tooling = 1, latency = 35.12)

julia> loc = findfirst(==(new_data.bucket), df.bucket) # check if new bucket is in the data

julia> if isnothing(loc) # new value
           push!(df, new_data)
       else # value already present
           df[loc, :tooling] = new_data.tooling
           df[loc, :latency] = new_data.latency
       end
2×3 DataFrame
 Row │ bucket    tooling  latency
     │ String    Int64    Float64
─────┼────────────────────────────
   1 │ bucket_1        0    34.9
   2 │ bucket_2        1    35.12
1 Like

thank you so much for taking the time, Really helps me set a foundation. I have one question. In my usecase the situation is that messages only update Tolling or Latency not BOTH which, I think, your examples seem to show. So the situation is more like this

<“bucket_1”><“Latency”><34.90>

julia> new_data = (bucket="bucket_1",  latency=34.90)

then, at some later stage

<“bucket_1”><“Tolling”><3200.00>

julia> new_data = (bucket="bucket_1",  Tolling=3200.00)

so the updates are single column only, Latency or Tolling after the initial row insert.
Does this still work?

Mostly yes - you would just need two separate codes. One for updating latency, and other for updating tolling.

Hi there
thanks AGAIN @bkamins for taking the time. I took your advice and concentrated on Dict. I can load up the Key fields with bucket_names as it’s a known set. I can build the FIELDs as I go just in case something crops up in the future. I think the code below should be ok for a NOOB.
thanks again.
theakson

bucket_names = ["bucket_1", "bucket_2", "bucket_3"]
d_buckets = Dict(buck => Dict{String,Any}() for buck in bucket_names)

function update_d_buckets(bucket,field,value)
                       d_buckets[bucket][field] = value
              end

here’s a test run I just did.

 
julia> bucket_names = ["bucket_1", "bucket_2", "bucket_3"]
3-element Vector{String}:
 "bucket_1"
 "bucket_2"
 "bucket_3"

julia> d_buckets = Dict(buck => Dict{String,Any}() for buck in bucket_names)
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict()
  "bucket_1" => Dict()

julia> function update_d_buckets(bucket,field,value)
                       d_buckets[bucket][field] = value
              end
update_d_buckets (generic function with 1 method)


julia> update_d_buckets("bucket_1","Latency",1.0)
1.0

julia> d_buckets
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict()
  "bucket_1" => Dict("Latency"=>1.0)

julia> update_d_buckets("bucket_2","Tolling",1.0)
1.0

julia> d_buckets
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict("Tolling"=>1.0)
  "bucket_1" => Dict("Latency"=>1.0)

julia> update_d_buckets("bucket_1","Tolling",1.0)
1.0

julia> d_buckets
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict("Tolling"=>1.0)
  "bucket_1" => Dict("Latency"=>1.0, "Tolling"=>1.0)

julia> update_d_buckets("bucket_2","Latency",1.0)
1.0

julia> d_buckets
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict("Latency"=>1.0, "Tolling"=>1.0)
  "bucket_1" => Dict("Latency"=>1.0, "Tolling"=>1.0)

julia> update_d_buckets("bucket_2","Tolling",2.0)
2.0

julia> d_buckets
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict("Latency"=>1.0, "Tolling"=>2.0)
  "bucket_1" => Dict("Latency"=>1.0, "Tolling"=>1.0)

julia> update_d_buckets("bucket_1","Tolling",2.0)
2.0

julia> d_buckets
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict("Latency"=>1.0, "Tolling"=>2.0)
  "bucket_1" => Dict("Latency"=>1.0, "Tolling"=>2.0)

julia> update_d_buckets("bucket_1","Latency",2.0)
2.0

julia> d_buckets
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict("Latency"=>1.0, "Tolling"=>2.0)
  "bucket_1" => Dict("Latency"=>2.0, "Tolling"=>2.0)

julia> update_d_buckets("bucket_2","Latency",2.0)
2.0

julia> d_buckets
Dict{String, Dict{String, Any}} with 3 entries:
  "bucket_3" => Dict()
  "bucket_2" => Dict("Latency"=>2.0, "Tolling"=>2.0)
  "bucket_1" => Dict("Latency"=>2.0, "Tolling"=>2.0)

Yes - this approach allows you to flexibly store key-value mappings for each bucket. You even could avoid pre-allocating buckets by writing:

function update_d_buckets(bucket,field,value)
                       bucket_dict = get!(d_buckets, bucket) do
                           Dict{String, Any}()
                       end
                       bucket_dict[field] = value
              end

to dynamically add new bucket only when needed.

1 Like

AHA! another tool added to my bogumil toolkit. Thanks again for spending time on this.
theakson