Create an indexed line from a vector

Hi,

I would like to take a vector like:

julia> vector
5-element Vector{Float64}:
 90.51
 51.79
 84.81
 66.11
 56.68

and output

 "0"
 "1:90.51"
 "2:51.79"
 "3:84.81"
 "4:66.11"
 "5:56.68"

that will be written into a file (using writedlm) as:

 0 1:90.51 2:51.79 3:84.81 4:66.11 5:56.68

So I wrote and tested this:

using BenchmarkTools

vector=rand(.1:.01:100.99, 50000)

function to_svm(vec)
    SVM_line=["0"]
    index=0
    for value in vec
        index+=1
        push!(SVM_line, string(index) * ":" * string(value))
    end
    return SVM_line
end

function to_svm_v2(vec)
    SVM_line=["0"]
    for iter in eachindex(vec)
        push!(SVM_line, string(iter) * ":" * string(vec[iter]))
    end
    return SVM_line
end

@benchmark line=to_svm(vector)
@benchmark line_v2=to_svm_v2(vector)

The two methods take the same time, but I’d like to know if there’s a way to go faster. I have to process a lot of vector → line conversions before writing to the file, so this is really the part of the code that I’d like to optimize.

This is shorter but not faster

function to_svm_c(v)
    vcat("0", [string(i) * ":" * string(v[i]) for i in eachindex(v)])
end

It would be nice to have a version of string() where you can reuse the buffer. I don’t know if such a version exists. string(n::Int) is faster than string(f::Float64), so if you only need fixed precision, formatting the values through integers might be of interest.
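
To illustrate the fixed-precision idea, here is a minimal sketch that prints a value with exactly two decimals using only integer printing. The helper name print_fixed2 is hypothetical, and the output differs from string(): 90.5 comes out as "90.50" rather than "90.5". Whether it is actually faster would need to be benchmarked.

```julia
# Hypothetical helper (not part of Base): print `x` with exactly two decimals,
# using integer printing only. Assumes 100*x fits in an Int.
function print_fixed2(io::IO, x::Real)
    n = round(Int, 100 * x)       # e.g. 90.51 -> 9051
    if n < 0
        print(io, '-')
        n = -n
    end
    q, r = divrem(n, 100)         # integer part and two-digit fraction
    print(io, q, '.')
    r < 10 && print(io, '0')      # zero-pad the fraction
    print(io, r)
    return nothing
end

sprint(print_fixed2, 90.51)       # "90.51"
```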

If you are dealing with a restricted range in your input vector, you could try something like this.

d_float = Dict([f => string(f) for f in .1:.01:100.99])

function to_svm_d(ds, v)
    # Look up the precomputed string in the dictionary passed as `ds`.
    vcat("0", [string(i) * ":" * ds[v[i]] for i in eachindex(v)])
end

On my machine this takes half the time compared to to_svm.

Do you produce the vector just to write it to a file or do you need it later in your code?

If it’s the former, you can just skip the creation of the vector and directly write the content to a file, that might save you a little bit of extra time and memory (but perhaps not a huge amount).

Something like this (based on @zweiglimmergneis’s suggestion):

using DelimitedFiles  # provides writedlm

function to_file(v, filename)
    open(filename, "w") do file
        writedlm(file, ["0"])  # one-element vector; ("0") is just the string "0"
        writedlm(file, (string(i) * ":" * string(v[i]) for i in eachindex(v)))
    end
end

Alternatively, you could use writedlm without even creating the lines manually (although that seems to be slightly slower on my system):

function to_file_v2(v, filename)
    open(filename, "w") do file
        writedlm(file, ["0"])  # one-element vector; ("0") is just the string "0"
        writedlm(file, pairs(v), ':')
    end
end

EDIT:

So far, the fastest I could get is ditching writedlm altogether and just writing everything manually after joining the lines together…

function to_file(v, filename)
    open(filename, "w") do file
        write(file, "0\n")
        write(file, join((string(i) * ":" * string(val) for (i,val) in pairs(v)), '\n'))
    end
end

Side note: to get accurate benchmark results, global variables passed as arguments should be interpolated with $, e.g. @benchmark to_svm($vector):
https://juliaci.github.io/BenchmarkTools.jl/stable/

To reuse a buffer, print to an IOBuffer. (This is what string does internally anyway.) For even more re-use, you can make a non-destructive String-like view of an IOBuffer with StringViews.jl.
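
As a rough sketch of the IOBuffer idea, the following builds the single-line format from the original question ("0 1:90.51 2:51.79 …") for several vectors with one IOBuffer object. The helper name svm_line! is hypothetical. Note that take! hands the bytes over to the new String, so this reuses the buffer object rather than its memory.

```julia
# Hypothetical helper: append "0 1:v1 2:v2 ..." for the values in `v` to `buf`.
function svm_line!(buf::IOBuffer, v)
    print(buf, 0)
    for (i, val) in enumerate(v)
        print(buf, ' ', i, ':', val)   # no intermediate strings
    end
    return buf
end

buf = IOBuffer()                       # created once
lines = String[]
for v in ([90.51, 51.79], [84.81])
    svm_line!(buf, v)
    push!(lines, String(take!(buf)))   # take! returns the bytes and leaves buf empty
end
```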

Should be better still to ditch the string and join calls entirely and just print directly to the file, e.g. something like:

function to_file(v, filename)
    open(filename, "w") do file
        println(file, 0)
        for (i, val) in pairs(v)
            println(file, i, ':', val)
        end
    end
end

(which is also more readable than the join version in my opinion). Note that I use print instead of write to output the text representations of i and val directly. You could also use enumerate instead of pairs to allow v to be an iterator rather than a vector (and to guarantee that the output indices start at 1).

A version of this idea is also in the Julia performance tips (“Avoid string interpolation for I/O”): don’t construct an intermediate string just to write it to a file.
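
Applied to the format from the original question (one space-separated line per vector), the same approach might look like the sketch below. The function name write_svm_lines and the assumption that the input is a collection of vectors are mine, not from the thread.

```julia
# Sketch: write one line "0 1:v1 2:v2 ..." per vector, printing directly to the file.
# `vectors` is assumed to be any iterable of numeric collections.
function write_svm_lines(filename, vectors)
    open(filename, "w") do io
        for v in vectors
            print(io, 0)
            for (i, val) in enumerate(v)   # enumerate always counts from 1
                print(io, ' ', i, ':', val)
            end
            println(io)                    # end of this vector's line
        end
    end
    return nothing
end
```

With vectors = [[90.51, 51.79], [1.0]] this writes the two lines "0 1:90.51 2:51.79" and "0 1:1.0".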

That’s what I thought as well (I tried it with write which still requires a string conversion), but neither the write version, nor the print version beat the join version on my system.

These are the three variants and their timings on a 2021 MacBook:

Code
using BenchmarkTools

vector=rand(.1:.01:100.99, 50000)

function to_file_print(v, filename)
    open(filename, "w") do file
        print(file, "0\n")
        for (i, val) in pairs(v)
            println(file, i, ':', val)
        end
    end
end

function to_file_write(v, filename)
    open(filename, "w") do file
        write(file, "0\n")
        for (i, val) in pairs(v)
            write(file, string(i), ':', string(val), '\n')
        end
    end
end

function to_file_join(v, filename)
    open(filename, "w") do file
        write(file, "0\n")
        write(file, join((string(i) * ":" * string(val) for (i,val) in pairs(v)), '\n'))
        write(file, '\n')
    end
end

@benchmark to_file_print($vector, $"saved_vector_print.txt")
@benchmark to_file_write($vector, $"saved_vector_write.txt")
@benchmark to_file_join($vector, $"saved_vector_join.txt")

# outputs are identical
read("saved_vector_print.txt", String) == read("saved_vector_write.txt", String) == read("saved_vector_join.txt", String) # true

Timings
julia> @benchmark to_file_print($vector, $"saved_vector_print.txt")
BenchmarkTools.Trial: 217 samples with 1 evaluation.
 Range (min … max):  20.115 ms … 106.253 ms  ┊ GC (min … max): 0.00% … 1.06%
 Time  (median):     22.413 ms               ┊ GC (median):    3.57%
 Time  (mean ± σ):   23.078 ms ±   5.968 ms  ┊ GC (mean ± σ):  2.68% ± 2.38%

  20.1 ms         Histogram: frequency by time         32.6 ms <

 Memory estimate: 26.31 MiB, allocs estimate: 299501.

julia> @benchmark to_file_write($vector, $"saved_vector_write.txt")
BenchmarkTools.Trial: 244 samples with 1 evaluation.
 Range (min … max):  17.986 ms … 41.734 ms  ┊ GC (min … max): 0.00% … 3.22%
 Time  (median):     19.924 ms              ┊ GC (median):    4.89%
 Time  (mean ± σ):   20.518 ms ±  2.952 ms  ┊ GC (mean ± σ):  3.26% ± 2.87%

  18 ms           Histogram: frequency by time        37.5 ms <

 Memory estimate: 24.80 MiB, allocs estimate: 200012.

julia> @benchmark to_file_join($vector, $"saved_vector_join.txt")
BenchmarkTools.Trial: 369 samples with 1 evaluation.
 Range (min … max):  10.134 ms … 70.317 ms  ┊ GC (min … max):  0.00% … 11.10%
 Time  (median):     12.996 ms              ┊ GC (median):    10.07%
 Time  (mean ± σ):   13.548 ms ±  4.263 ms  ┊ GC (mean ± σ):   7.31% ±  5.83%

  10.1 ms         Histogram: frequency by time        26.6 ms <

 Memory estimate: 26.88 MiB, allocs estimate: 250030.

Versions

BenchmarkTools v1.5.0

julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, icelake-client)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

Maybe it’s just the lack of buffering for small writes? That should be fixable by wrapping the file object in a buffered stream from BufferedStreams.jl. Just set

bfile = BufferedOutputStream(file)

and pass bfile instead of file to the print or write functions. (You might also need close(bfile) at the end of the do block?)