Type-stable reading from a file

I was analysing some code today and I am unable to read information from a simple file while keeping type stability. I would like to pass the results to functions that have to be very optimized, so I believe this is an important issue…

At the beginning, I had just a CSV file like this

a,b
test1,1.
test2,2.
test3,3.
test4,4.

that I was reading naively using CSV.jl and DataFrames.jl

function f()
    fracfile = DataFrame(CSV.File("test.csv"))
    x = fracfile[1,:b]
    return x
end

but running @code_warntype f() says that x is of type Any. A quick search led me to discover that DataFrames.jl is not type stable. However, the blog post there suggests that one can use Tables.columntable to generate type-stable data. I tried with

function f()
    fracfile = DataFrame(CSV.File("test.csv"))
    stable = Tables.columntable(fracfile)[:b]
    x = stable[1]
    return x
end

but again no luck: both stable and x are Any. I further tried to cast variables to Vector{Float64}, but it does not change anything. I also tried specifying column types in the CSV.File call with the argument types = Dict(1 => String, 2 => Float64). Finally, I tried TypedTables.jl instead of DataFrames.jl, as apparently it should offer type stability. But again, to no effect.

The only way I found to get type-stable code is hacky and probably not very correct…

function f()
    fracfile = DataFrame(CSV.File("test.csv"))
    #x::Float64 = parse(Float64, fracfile[1,2]) #all blue in codewarn! 
    x::Vector{Float64} = Vector{Float64}(fracfile[:,2]) #still one temp variable is red in codewarn, but x is of the intended type.
    return x
end

I understand that the compiler cannot know the contents of the file beforehand, so type inference is complicated. But I cannot really get why if I use TypedTables.jl or if I explicitly try to cast variables the compiler can still do nothing. Is there any simple solution for this? Thank you very much in advance!

There’s fundamentally no way to make such a reading function type-stable. After all, its type depends on the CSV file content.

Type-stable tables can be beneficial for further manipulation, even if the reading part has to be unstable. If you end up going this route, you don’t need dataframes at all: columntable(CSV.File("test.csv")) or CSV.read("test.csv", columntable) should work. The same goes for e.g. StructArray.

Then you end up with a table/container that other functions can operate with in a type-stable way.
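For example, a minimal sketch (the test.csv from the original post is written out here so the snippet is self-contained):

```julia
using CSV, Tables

# Write the example file so the snippet is self-contained
write("test.csv", "a,b\ntest1,1.\ntest2,2.\n")

# Read directly into a NamedTuple of concretely typed column vectors
tbl = CSV.read("test.csv", Tables.columntable)

# The container has a concrete type, so downstream functions specialize on it
total(col::Vector{Float64}) = sum(col)
total(tbl.b)  # type stable once `tbl` crosses the function call
```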

2 Likes

Do you have a benchmark that shows that this type instability really costs a lot of performance?

I am asking because the usual use case I’d imagine is to get some values from the file (which may be type unstable) and then feed them to your function once, which then runs for a while. Here you pay the cost of runtime dispatch only once at the beginning, and it should be negligible.
If your workload really involves reading from a file many times in performance-critical loops, then it makes sense to remove the type instability. I personally see no real issue with your “hacky” solution: essentially you are asserting that your data is of a certain format (throwing an error if it’s not) and then passing it on without any instability. Any other solution I can think of would do essentially the same. (It should be possible to get rid of the warnings in @code_warntype by assuming beforehand that your file fulfills all criteria, but I am not sure it is worth sacrificing the convenience of an external package that can handle more types.)

1 Like

There’s a related discussion here, in which I had to implement a fast parser for some file: Performance: read data from ascii file, replace `split`

1 Like

@aplavin Thanks! I think the suggestion is very similar to the second snippet of code I posted. If I do

function f()
    fracfile = Tables.columntable(CSV.File("test.csv"))
    x = fracfile[:b]
    return x
end

which I think is what you suggested, nothing changes: x is not type stable. Or were you suggesting something else?

@Salmon I do not have a benchmark, mainly because I was not able to get rid of the instabilities to test how problematic they are. I understand the point though.

The code I have is basically a function that reads the data and then calls several other functions that do the heavy computations. Between the calls to the heavy computations there are some operations on the data. What I was afraid of is that, with many objects of type ::Any, I would have problems (a) in those operations between the calls, which are all done with Any types, and (b) maybe in generating properly optimized code inside the functions.

Thanks!!

Perhaps I’m misunderstanding what you are trying to achieve here, but most of the time, it does not matter that the result of reading a file is type unstable. It’s a one-time cost, and as long as the thing you get out of it (in your case, x) has a concrete type, subsequent operations on this object will be type stable:

using DataFrames
using CSV
function f()
    fracfile = DataFrame(CSV.File("test.csv"))
    x = fracfile[:,:b]
    return x
end

x = f()

@code_warntype sum(x)

outputs:

MethodInstance for sum(::Vector{Float64})
  from sum(a::AbstractArray; dims, kw...) @ Base reducedim.jl:982
Arguments
  #self#::Core.Const(sum)
  a::Vector{Float64}
Body::Float64
1 ─      nothing
│   %2 = Base.:(var"#sum#933")::Core.Const(Base.var"#sum#933")
│   %3 = Base.:(:)::Core.Const(Colon())
│   %4 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %5 = Base.pairs(%4)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %6 = (%2)(%3, %5, #self#, a)::Float64
└──      return %6

Edit: This is essentially the concept of the “function barrier”: Performance Tips · The Julia Language: kernel-functions

6 Likes

Hi,

If you really want the reading to be type stable, you could convert the file to Arrow once and then read from that in place of the CSV. You’ll also benefit from reduced memory usage and out-of-core processing if your real data is bigger than your RAM.
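A sketch of that workflow with Arrow.jl (the example test.csv is written out so the snippet is self-contained; Arrow.write and Arrow.Table are the standard Arrow.jl entry points):

```julia
using CSV, Arrow

# Example data, matching the file from the question
write("test.csv", "a,b\ntest1,1.\ntest2,2.\n")

# One-time conversion: the column types are stored inside the Arrow file
Arrow.write("test.arrow", CSV.File("test.csv"))

# Subsequent reads are memory-mapped, and each column comes back with a
# concrete element type (e.g. Float64 for column :b)
tbl = Arrow.Table("test.arrow")
eltype(tbl.b)  # Float64
```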

2 Likes

As people have said, you can do things afterwards that rescue type inference. However, any initial step along the lines of fracfile = some_processing_here("test.csv") is inherently type unstable, because the input is just a String and the output type can only be discovered by some_processing_here at runtime. Type stability at that point is only possible if we fix the possible output type in advance, as Base.parse does.
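To illustrate: parse is type stable precisely because the caller fixes the output type up front.

```julia
# The output type is fixed by the first argument, so inference knows it:
x = parse(Float64, "3.14")   # always a Float64, or an error is thrown

# A reader that has to *discover* the type from the file content cannot make
# this promise, so its result is `Any` as far as the compiler is concerned.
```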

1 Like

Thank you all for your answers! After @JonasWickman’s answer and carefully re-reading about the concept of a function barrier, I finally understand the full “problem”.

In my case the most hardcore computations are inside functions, something like

function f()
    fracfile = DataFrame(CSV.File("test.csv"))
    x = fracfile[:, :b]
    result = simulation(x)
    return result
end

However, there are some “small operations” here and there between calls to the simulation function, and the compiler will not know the types for those. My best bet is to move all these bits inside functions as well.
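That refactoring could look roughly like this (a sketch: run_all is a hypothetical helper name, and simulation is stubbed with a placeholder so the snippet runs):

```julia
using CSV, DataFrames

write("test.csv", "a,b\ntest1,1.\ntest2,2.\n")  # example data
simulation(y) = sum(y)                          # placeholder heavy computation

# Move the "small operations" behind a typed function call so they are
# compiled for the concrete element type, not for Any
function run_all(x::Vector{Float64})
    y = x .* 2               # "small operation", now type stable
    return simulation(y)     # specialized on Vector{Float64}
end

function f()
    fracfile = DataFrame(CSV.File("test.csv"))
    x = fracfile[:, :b]      # inference may give Any here
    return run_all(x)        # one dynamic dispatch; everything inside is stable
end

f()
```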

My confusion came because I was getting Any all the time, so I thought Julia would not know the types even when calling the functions. The Performance Tips page says very explicitly that the compiler can actually specialize the function, and that this is the recommended way. It even mentions the case of reading from a file…

So it was a basic misunderstanding after all. I have already read the Performance Tips page several times, and this keeps happening! :sweat_smile:
Thank everybody for proposing solutions, very much appreciated :slight_smile:

The key here is guaranteeing that x has a stable type when passed to simulation. If x is a Vector{Any} inside f(), that will still be problematic. But if it is, say, a Vector{Float64}, the function simulation will be fast, even if inside f() you had to deal with an instability before passing x to it. (simulation will be a “function barrier”.)

Yes @lmiq, that was the main motivation behind my question. Notice that in the example I wrote before, x is not even Vector{Any} but just Any, as checked with @code_warntype.

However, as far as I understand from the previous answers and the Performance Tips guide, my last snippet should be fine, and the function will be able to infer the type even if x is Any.

If that’s not the case, then my question remains: how do I get x to be type stable? And what is the difference between my last snippet and what is stated in the Performance Tips?

If x in your snippet turns out to be Any that will propagate into simulation. For example, here f1 returns Any (see Body::Any), because the input of g is of type Any:

julia> x::Any = 1.0  # ::Any to be explicit that it can assume any value, it is the default
1.0

julia> g(x) = 2 * x
g (generic function with 1 method)

julia> function f1()
           y = x
           return g(y)
       end
f1 (generic function with 1 method)

julia> @code_warntype f1()
MethodInstance for f1()
  from f1() @ Main REPL[17]:1
Arguments
  #self#::Core.Const(Main.f1)
Locals
  y::Any
Body::Any
1 ─      (y = Main.x)
│   %2 = y::Any
│   %3 = Main.g(%2)::Any
└──      return %3

We can make the call to g type stable by asserting that x can be converted to a Float64, with an error being thrown otherwise (now we have Body::Float64):

julia> function f2()
           y::Float64 = x
           return g(y)
       end
f2 (generic function with 1 method)

julia> @code_warntype f2()
MethodInstance for f2()
  from f2() @ Main REPL[8]:1
Arguments
  #self#::Core.Const(Main.f2)
Locals
  y::Float64
  @_3::Any
Body::Float64
1 ─       Core.NewvarNode(:(y))
│   %2  = Main.x::Any
│         (@_3 = %2)
│   %4  = @_3::Any
│   %5  = (%4 isa Main.Float64)::Bool
└──       goto #3 if not %5
2 ─       goto #4
3 ─ %8  = @_3::Any
│   %9  = Base.convert(Main.Float64, %8)::Any
│   %10 = Main.Float64::Core.Const(Float64)
└──       (@_3 = Core.typeassert(%9, %10))
4 ┄ %12 = @_3::Float64
│         (y = %12)
│   %14 = y::Float64
│   %15 = Main.g(%14)::Float64
└──       return %15

julia> x = "a"
"a"

julia> f2()
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Float64

Thus, in your case, assuming that fracfile[:,:b] is being returned as a Vector{Any} but it can and should be converted to a Vector{Float64}, you could do:

julia> x = Any[1.0, 2.0]
2-element Vector{Any}:
 1.0
 2.0

julia> function f2()
           y::Vector{Float64} = x
           return g(y)
       end
f2 (generic function with 2 methods)

julia> @code_warntype f2()
MethodInstance for f2()
  from f2() @ Main REPL[26]:1
Arguments
  #self#::Core.Const(Main.f2)
Locals
  y::Vector{Float64}
  @_3::Any
Body::Vector{Float64}
1 ─       Core.NewvarNode(:(y))
│   %2  = Main.x::Any
│   %3  = Core.apply_type(Main.Vector, Main.Float64)::Core.Const(Vector{Float64})
│         (@_3 = %2)
│   %5  = @_3::Any
│   %6  = (%5 isa %3)::Bool
└──       goto #3 if not %6
2 ─       goto #4
3 ─ %9  = @_3::Any
│   %10 = Base.convert(%3, %9)::Vector{Float64}
└──       (@_3 = Core.typeassert(%10, %3))
4 ┄ %12 = @_3::Vector{Float64}
│         (y = %12)
│   %14 = y::Vector{Float64}
│   %15 = Main.g(%14)::Vector{Float64}
└──       return %15

such that the instability of x does not propagate into the call to g.

That said, it is likely that reading your data already returns columns with proper types and, then, your code is fine:

julia> fracfile = DataFrame(CSV.File("test.csv"))
2×2 DataFrame
 Row │ a       b      
     │ Int64  Float64 
─────┼────────────────
   1 │     1      2.0
   2 │     2      2.0

julia> typeof(fracfile[:, "b"])
Vector{Float64} (alias for Array{Float64, 1})

(which is what @JonasWickman explained in his answer)

Took some days to have time to check this properly. Thanks for the explanation @lmiq . This seems to be more consistent with the results I was having.

I think that for the time being I will just enforce the types by adding ::Vector{Float64}. Reading data from the file does not seem to give me concrete types immediately; @code_warntype always warns me about the data being Any after a read, which makes sense: the compiler cannot know what’s in the file.

1 Like

CSV.jl has a keyword argument that lets you specify the column types when reading:
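A sketch of that keyword (both forms of types are part of the CSV.jl API; the example file is written out so the snippet is self-contained):

```julia
using CSV, DataFrames

write("test.csv", "a,b\ntest1,1.\ntest2,2.\n")  # example data

# Specify column types by position...
df = CSV.read("test.csv", DataFrame; types = [String, Float64])

# ...or by column name:
df = CSV.read("test.csv", DataFrame; types = Dict(:b => Float64))

eltype(df.b)  # Float64
```

This guarantees the column types at runtime; the read call itself is still not inferable, so a function barrier remains the way to get specialized code downstream.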