Leading Zeros got truncated from df column

Sandy45 · April 12, 2023, 10:57pm

I have dataframe with integer field and it’s usually consists of time for ex. 0001, 1020, 2359 etc.

When there are leading zeros in the column they are being truncated and I’ve tried applying lpad which seems to be doesn’t work on dataframes?

Row │ a     
     │ Int64 
─────┼───────
   1 │     1
   2 │    21
   3 │   333

df.a .= lpad(string(df.a),4,"0")
3-element Vector{String}:
 "[1, 21, 333]"
 "[1, 21, 333]"
 "[1, 21, 333]"

The expected results are like below:
"0001"
"0021"
"0333"

rafael.guerra · April 12, 2023, 11:09pm

Try broadcasting with dot syntax:

df.a .= lpad.(string.(df.a), 4, '0')

or with macro:

@. df.a = lpad(string(df.a), 4, '0')

adienes · April 12, 2023, 11:09pm

or
transform!(df, :a => ByRow(x -> lpad(string(x), 4, "0")) => :a)

or with DataFramesMeta.jl

@rtransform!(df, :a = lpad(string(:a), 4, "0"))

Sandy45 · April 12, 2023, 11:20pm

@rafael.guerra Thank you for the reply, It seems like I am using double quotes instead of single quote.

Sandy45 · April 12, 2023, 11:22pm

@adienes Thank you, this solution also works but seems like broadcasting appears to be fastest one considering the time taken.

rocco_sprmnt21 · April 13, 2023, 5:49pm

if running time is important, consider using @rafael.guerra 's proposal in the following form

julia> using DataFrames, BenchmarkTools

julia> @btime begin
           lp=maximum(length ∘ string, df.a)+1
           [lpad(r,$lp,'0') for r in $df.a ]
           end  # 800ns
  302.381 ns (3 allocations: 128 bytes)
3-element Vector{String}:
 "0001"
 "0021"
 "0333"
julia> @btime [lpad(r,4,'0') for r in df.a ]
  184.818 ns (2 allocations: 96 bytes)
3-element Vector{String}:
 "0001"
 "0021"
 "0333"


julia> @btime df.a .= lpad.(string.(df.a), 4, '0')
  825.581 ns (5 allocations: 304 bytes)
3-element Vector{String}:
 "0001"
 "0021"
 "0333"

Nathan_Boyer · April 13, 2023, 7:18pm

I came here to recommend you only convert your numbers to string when you go to print results using Printf or Formatting (not when you load your data), so you can analyze your data using integers. However, the strings actually worked better than I expected with transformation functions.

julia> df
4×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  0001
   2 │    10  0010
   3 │     2  0002
   4 │    20  0020

julia> sort(df.a)
4-element Vector{Int64}:
  1
  2
 10
 20

julia> sort(df.b)
4-element Vector{String}:
 "0001"
 "0002"
 "0010"
 "0020"

julia> subset(df, :a => ByRow(<(10)))
2×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  0001
   2 │     2  0002

julia> subset(df, :b => ByRow(<("0010")))
2×2 DataFrame
 Row │ a      b
     │ Int64  String
─────┼───────────────
   1 │     1  0001
   2 │     2  0002

Palli · April 13, 2023, 9:03pm

I think the double quotes were not the issue, rather the missing period after your lpad. [But with single quotes is likely slightly faster. I would actually like to get rid of the distinction, and eliminate Char from Julia, same as in Swift, I have an idea, that I haven’t implemented yet, for improving String handling that would do that and more.]

That may be good advise. Possibly DataFrames should have an (optional) way to format numbers, e.g. integers (like in COBOL, its PICTURE clause)?

What’s the default when you import 0 prefixed numbers? Since it may be meaningful (e.g. Excel drops them), they should be imported as strings. [Something like lpad(100, 2, '0') wouldn’t sort correctly, when the padding isn’t sufficient. Natural sorting for strings would fix that, what I want as defaults for me new String data type. There’s already a package for that, just opt-in; also just using integers fixes that problem.]

Topic		Replies	Views
String column dataframe: replace for another string built from a substring from each row Data	1	1291	January 20, 2021
Leading Zeros General Usage	10	3690	October 30, 2019
Cropping dataframe table data left or center General Usage question , dataframes , prettytables	13	427	February 19, 2024
Cannot place this error in dataframes New to Julia dataframes	5	824	November 26, 2021
Extending a string by adding 0's to the end New to Julia	7	1771	July 28, 2020

Leading Zeros got truncated from df column

Related topics