Leading Zeros got truncated from df column

I have dataframe with integer field and it’s usually consists of time for ex. 0001, 1020, 2359 etc.

When there are leading zeros in the column they are being truncated and I’ve tried applying lpad which seems to be doesn’t work on dataframes?

Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │    21
   3 │   333

df.a .= lpad(string(df.a),4,"0")
3-element Vector{String}:
 "[1, 21, 333]"
 "[1, 21, 333]"
 "[1, 21, 333]"

The expected results are like below:
"0001"
"0021"
"0333"

Try broadcasting with dot syntax:

df.a .= lpad.(string.(df.a), 4, '0')

or with macro:

@. df.a = lpad(string(df.a), 4, '0')

or
transform!(df, :a => ByRow(x -> lpad(string(x), 4, "0")) => :a)

or with DataFramesMeta.jl

@rtransform!(df, :a = lpad(string(:a), 4, "0"))
1 Like

@rafael.guerra Thank you for the reply, It seems like I am using double quotes instead of single quote.

@adienes Thank you, this solution also works but seems like broadcasting appears to be fastest one considering the time taken.

if running time is important, consider using @rafael.guerra 's proposal in the following form

julia> using DataFrames, BenchmarkTools

julia> @btime begin
           lp=maximum(length ∘ string, df.a)+1
           [lpad(r,$lp,'0') for r in $df.a ]
           end  # 800ns
  302.381 ns (3 allocations: 128 bytes)
3-element Vector{String}:
 "0001"
 "0021"
 "0333"
julia> @btime [lpad(r,4,'0') for r in df.a ]
  184.818 ns (2 allocations: 96 bytes)
3-element Vector{String}:
 "0001"
 "0021"
 "0333"


julia> @btime df.a .= lpad.(string.(df.a), 4, '0')
  825.581 ns (5 allocations: 304 bytes)
3-element Vector{String}:
 "0001"
 "0021"
 "0333"

I came here to recommend you only convert your numbers to string when you go to print results using Printf or Formatting (not when you load your data), so you can analyze your data using integers. However, the strings actually worked better than I expected with transformation functions.

julia> df
4×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  0001
   2 │    10  0010
   3 │     2  0002
   4 │    20  0020

julia> sort(df.a)
4-element Vector{Int64}:
  1
  2
 10
 20

julia> sort(df.b)
4-element Vector{String}:
 "0001"
 "0002"
 "0010"
 "0020"

julia> subset(df, :a => ByRow(<(10)))
2×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  0001
   2 │     2  0002

julia> subset(df, :b => ByRow(<("0010")))
2×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  0001
   2 │     2  0002

I think the double quotes were not the issue, rather the missing period after your lpad. [But with single quotes is likely slightly faster. I would actually like to get rid of the distinction, and eliminate Char from Julia, same as in Swift, I have an idea, that I haven’t implemented yet, for improving String handling that would do that and more.]

That may be good advise. Possibly DataFrames should have an (optional) way to format numbers, e.g. integers (like in COBOL, its PICTURE clause)?

What’s the default when you import 0 prefixed numbers? Since it may be meaningful (e.g. Excel drops them), they should be imported as strings. [Something like lpad(100, 2, '0') wouldn’t sort correctly, when the padding isn’t sufficient. Natural sorting for strings would fix that, what I want as defaults for me new String data type. There’s already a package for that, just opt-in; also just using integers fixes that problem.]