Performance of creating a demo dataset

I have the following code:

using InMemoryDatasets

function demo_data()
    ds = Dataset(time=0.0, d1=10, d2=20, d3=30)
    time = 0.1
    for i in 1:100000
        if i == 5
            d1 = missing
        else
            d1 = 10+rand(1:30)
        end
        ds2 = Dataset(time=time, d1=d1, d2=20+i, d3=30+rand(1:30))
        append!(ds, ds2)
        time += 0.1
    end
    ds
end

demo_data()
@time ds = demo_data()

The second call takes 4.8 s on my machine. For realistic tests I need 1e6 rows in the Dataset. How can I improve the speed and reduce the memory usage?

Typically the first step is to preallocate the needed space instead of incrementally append!-ing. Like:

n=100000
ds = Dataset(time=zeros(Float64,n), d1=zeros(Int,n), d2=zeros(Int,n), d3=zeros(Int,n))

Thanks! And how can I create a vector of random numbers in the range of 1…30?

Do you mean like

rand(1:30,n)

?
Yes, this is faster than calling rand once per iteration in the loop, I guess.

ds = Dataset(
    time= collect(range(start=0.0,step=0.1,length=n)), 
    d1=rand(11:40,n), 
    d2=collect(21:n+20), 
    d3=rand(31:60,n)
)

Now you need only inject the missings.
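For a plain Julia vector, injecting a missing value first requires an element type that allows it, since a `Vector{Int}` cannot hold `missing`. The following is only a minimal sketch in base Julia; the helper name `inject_missing` is made up for illustration and is not part of InMemoryDatasets.

```julia
# Hypothetical helper: return a copy of `v` whose element type also
# allows `missing`, with the entry at position `idx` set to missing.
function inject_missing(v::AbstractVector{T}, idx::Integer) where {T}
    # Widen the element type; a Vector{Int} would reject `missing`.
    out = convert(Vector{Union{Missing, T}}, v)
    out[idx] = missing
    return out
end

n = 100_000
d1 = inject_missing(rand(11:40, n), 5)
```

The widening step copies the data once, which is cheap compared with growing a container row by row.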

using InMemoryDatasets

function demo_data()
    n = 100000
    time = 0.1:0.1:n*0.1
    d1 = rand(Int8(11):Int8(40), n)
    d2 = 21:(n + 20)
    d3 = rand(Int8(31):Int8(60), n)
    ds = Dataset(time=time, d1=d1, d2=d2, d3=d3)
    ds.d1[5] = missing
    ds
end

demo_data()
@time ds=demo_data()

This looks quite nice… Only 2 to 8 ms instead of 5 s!

So a for loop is not always a good idea…

The problem here is not the loop, but the multiple calls to append! instead of preallocating everything.

I do not think so. I think creating a new Dataset on each iteration was the bottleneck.
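Either way, a loop can be kept as long as every column is allocated once up front, which avoids both the per-iteration Dataset and the repeated append! calls. Below is a rough sketch in plain Julia of what that might look like; the function name `demo_columns` and the returned named tuple are assumptions for illustration, not something from the posts above.

```julia
# Sketch: keep the for loop, but allocate each column exactly once.
function demo_columns(n)
    time = Vector{Float64}(undef, n)
    d1 = Vector{Union{Missing, Int}}(undef, n)  # allows the missing entry
    d2 = Vector{Int}(undef, n)
    d3 = Vector{Int}(undef, n)
    for i in 1:n
        time[i] = 0.1 * i
        d1[i] = i == 5 ? missing : 10 + rand(1:30)
        d2[i] = 20 + i
        d3[i] = 30 + rand(1:30)
    end
    # Return the columns as a named tuple (an illustrative choice).
    return (time=time, d1=d1, d2=d2, d3=d3)
end

cols = demo_columns(100_000)
```

The columns could then be handed to a single Dataset(time=cols.time, d1=cols.d1, d2=cols.d2, d3=cols.d3) call, using the keyword constructor already shown above. Timing this against the original function with @time should indicate how much of the cost comes from the loop itself.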