Many of us have been grappling with the type stability of IO for years. As far as I know, the best attempt at handling this in a generic way is DataStreams.jl, though it is implemented specifically for tabular data. In general, I would say the best practice is to make sure the result of your IO gets passed into another function quickly, rather than wallowing in a function that doesn’t know what type it is. Here’s a rather silly but illustrative example of the pattern that’s needed:
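using Serialization

# Metadata stream: the element types, in the order the values are stored
metadata_buff = IOBuffer()
serialize(metadata_buff, Float32)
serialize(metadata_buff, Float64)
seekstart(metadata_buff)

# Data stream: the raw values themselves
data_buff = IOBuffer()
write(data_buff, Float32(2.0))
write(data_buff, 3.0)
seekstart(data_buff)

g(x::Float32) = x^2
g(x::Float64) = x^3

function f(mdata::IO, data::IO)
    dtype = deserialize(mdata)  # the type is only known at run time
    x = read(data, dtype)
    println(x^2)                # compiled without knowing the type of x
    g(x)                        # dispatches to a method compiled for a concrete type
end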
julia> f(metadata_buff, data_buff)
4.0
4.0f0
julia> f(metadata_buff, data_buff)
9.0
27.0
See the following:
julia> @code_warntype f(metadata_buff, data_buff)
Body::Union{Float32, Float64}
17 1 ── %1 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Any,1}, svec(Any, Int64), :(:ccall), 2, Array{Any,1}, 32, 32))::Array{Any,1} │╻╷╷╷╷╷ deserialize
│ %2 = %new(IdDict{Any,Any}, %1, 0, 0)::IdDict{Any,Any} ││┃│││ Type
│ %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Int64,1}, svec(Any, Int64), :(:ccall), 2, Array{Int64,1}, 0, 0))::Array{Int64,1} │││╻╷ Type
│ %4 = invoke Dict{UInt64,Any}()::Dict{UInt64,Any} ││││
│ %5 = %new(Serializer{Base.GenericIOBuffer{Array{UInt8,1}}}, mdata, 0, %2, %3, %4)::Serializer{Base.GenericIOBuffer{Array{UInt8,1}}} ││││
│ %6 = (Base.getfield)(%5, :io)::Base.GenericIOBuffer{Array{UInt8,1}} │││╻ getproperty
│ %7 = (Base.getfield)(%6, :readable)::Bool ││││╻ getproperty
└─── goto #5 if not %7 ││││
2 ── %9 = (Base.getfield)(%6, :ptr)::Int64 ││││╻ getproperty
│ %10 = (Base.getfield)(%6, :size)::Int64 ││││╻ getproperty
│ %11 = (Base.slt_int)(%10, %9)::Bool ││││╻╷ >
└─── goto #4 if not %11 ││││
3 ── (Base.throw)($(QuoteNode(EOFError()))) ││││
└─── $(Expr(:unreachable)) ││││
4 ── %15 = (Base.getfield)(%6, :data)::Array{UInt8,1} ││││╻ getproperty
│ %16 = (Base.arrayref)(false, %15, %9)::UInt8 ││││╻ getindex
│ %17 = (Base.add_int)(%9, 1)::Int64 ││││╻ +
│ (Base.setfield!)(%6, :ptr, %17) ││││╻ setproperty!
└─── goto #6 │││╻ read
5 ── %20 = %new(Core.ArgumentError, "read failed, IOBuffer is not readable")::ArgumentError ││││╻ Type
│ (Base.throw)(%20) ││││
└─── $(Expr(:unreachable)) ││││
6 ┄─ %23 = (Core.zext_int)(Core.Int32, %16)::Int32 ││││╻ toInt32
│ %24 = invoke Serialization.handle_deserialize(%5::Serializer{Base.GenericIOBuffer{Array{UInt8,1}}}, %23::Int32)::Any │││
└─── goto #7 │││
7 ── goto #8 ││
18 8 ── %27 = (Main.read)(data, %24)::Any │
19 │ %28 = Base.literal_pow::Core.Compiler.Const(Base.literal_pow, false) │
│ %29 = Main.:^::Core.Compiler.Const(^, false) │
│ %30 = (isa)(%27, Irrational{:ℯ})::Bool │
└─── goto #10 if not %30 │
9 ── %32 = π (%27, Irrational{:ℯ}) │
│ %33 = invoke %28(%29::typeof(^), %32::Irrational{:ℯ}, $(QuoteNode(Val{2}()))::Val{2})::Any │
└─── goto #11 │
10 ─ %35 = (Base.literal_pow)(Main.:^, %27, $(QuoteNode(Val{2}())))::Any │
└─── goto #11 │
11 ┄ %37 = φ (#9 => %33, #10 => %35)::Any │
│ (Main.println)(%37) │
20 │ %39 = (isa)(%27, Float64)::Bool │
└─── goto #13 if not %39 │
12 ─ %41 = π (%27, Float64) │
│ %42 = (Base.mul_float)(%41, %41)::Float64 │╻╷╷ g
│ %43 = (Base.mul_float)(%42, %41)::Float64 ││┃││ literal_pow
└─── goto #16 │
13 ─ %45 = (isa)(%27, Float32)::Bool │
└─── goto #15 if not %45 │
14 ─ %47 = π (%27, Float32) │
│ %48 = (Base.mul_float)(%47, %47)::Float32 ││╻╷ literal_pow
└─── goto #16 │
15 ─ %50 = (Main.g)(%27)::Union{Float32, Float64} │
└─── goto #16 │
16 ┄ %52 = φ (#12 => %43, #14 => %48, #15 => %50)::Union{Float32, Float64} │
└─── return %52
This is a little harder to parse without the nice highlighting of the macro, so I invite you to run it in your own terminal. The gist is that the compiler has absolutely no idea what type x is when you call println(x^2) (hence the ::Any annotations on the results of read and literal_pow), but it does know that the output of g is either a Float32 or a Float64. (There is a real difference here! In the former case the compiler has no idea which ^ it will be calling when it compiles f, while in the latter case it knows ^ will be called on a float when it compiles g!) The rule of thumb is that you should always have dedicated parsing of metadata and then pass the result to functions where the types can be known. In this example you would not want lots of stuff happening within f; you want it all in g.
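To make the rule of thumb concrete, here is a minimal sketch of the same pattern with the work pushed behind a function barrier. The names kernel and load_and_process are hypothetical (they are not from any package), and the arithmetic in kernel is only a placeholder for real processing.

```julia
using Serialization
using Test  # for @inferred

# Hypothetical kernel (an assumption for illustration): everything that needs
# concrete types lives here, so it is compiled once per concrete T.
function kernel(x::T) where {T<:AbstractFloat}
    y = x^2              # ^ is resolved at compile time because T is known
    return y + one(T)
end

# Thin reader: the only type-unstable step is discovering dtype at run time.
function load_and_process(mdata::IO, data::IO)
    dtype = deserialize(mdata)   # inferred as Any
    x = read(data, dtype)        # inferred as Any
    return kernel(x)             # one dynamic dispatch, then specialized code
end

mbuf = IOBuffer()
serialize(mbuf, Float32)
serialize(mbuf, Float64)
seekstart(mbuf)

dbuf = IOBuffer()
write(dbuf, 2.0f0)
write(dbuf, 3.0)
seekstart(dbuf)

r1 = load_and_process(mbuf, dbuf)   # takes the Float32 path
r2 = load_and_process(mbuf, dbuf)   # takes the Float64 path

# @inferred throws an error if the return type cannot be inferred as concrete.
@inferred kernel(2.0f0)
@inferred kernel(3.0)
```

The reader itself is still type-unstable, but it does almost nothing, so the cost is a single dynamic dispatch per value rather than one for every operation on x.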
So, getting back to your original question (which may not have been primarily about type stability?): essentially yes. All Julia code uses multiple dispatch, so you are of course dispatching on the contents of your files once you start doing anything with them, but the more you resolve ambiguities with metadata, the happier you’ll be.
In my usual workflow (which has only recently started solidifying to the point where I really feel like I know what I’m doing) I dispatch on types which I use to “tag” different pieces of data such as tables, for example
abstract type TableTag end
struct TableA <: TableTag end
struct TableB <: TableTag end
(in practice these structs usually hold some sort of metadata as well). My functions which accept dataframes actually have signatures like
f(tag::TableA, df::AbstractDataFrame) = # do stuff to table A
f(tag::TableB, df::AbstractDataFrame) = # do stuff to table B
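As a rough, runnable sketch of the tag-dispatch idea, the following uses a NamedTuple of columns as a stand-in for a dataframe so that it needs no packages. The function summarize and the column names are hypothetical and chosen only for illustration.

```julia
abstract type TableTag end
struct TableA <: TableTag end
struct TableB <: TableTag end

# Hypothetical per-table logic; each method can assume its own table's schema.
summarize(::TableA, tbl) = sum(tbl.x)           # table A has a numeric column x
summarize(::TableB, tbl) = length(tbl.name)     # table B has a column name

# Stand-ins for dataframes (an assumption to keep the example dependency-free)
tbl_a = (x = [1.0, 2.0, 3.0],)
tbl_b = (name = ["u", "v"],)

total_a = summarize(TableA(), tbl_a)
count_b = summarize(TableB(), tbl_b)
```

The tag carries the information the container type cannot: both tables have the same Julia type family, but the method chosen depends on which table it is.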
At some point I will have a standard way of serializing the tags so that I can store the metadata of my complete dataset. For the time being, the full path and file name of each table depends on its tag, so I have something like loadraw(tag::TableA). At some point I also hope to simply have the dataframes wrapped in the structs rather than having separate tags, but for the time being I lack an appropriate AbstractTable type to inherit from.
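One way the path-per-tag convention could look is sketched below. The rawpath helper, the directory layout, and the two-argument form of loadraw (taking a root directory) are all assumptions for illustration, and readlines stands in for a real table parser.

```julia
abstract type TableTag end
struct TableA <: TableTag end
struct TableB <: TableTag end

# Hypothetical file layout: each tag determines where its raw file lives.
rawpath(::TableA, root::AbstractString) = joinpath(root, "a", "table_a.csv")
rawpath(::TableB, root::AbstractString) = joinpath(root, "b", "table_b.csv")

# Generic loader: works for any tag that defines rawpath.
# Returning raw lines is a placeholder for actual parsing.
loadraw(tag::TableTag, root::AbstractString) = readlines(rawpath(tag, root))

# Demonstration with a temporary directory
root = mktempdir()
mkpath(dirname(rawpath(TableA(), root)))
write(rawpath(TableA(), root), "id,value\n1,2.5\n")

lines = loadraw(TableA(), root)
```

Adding a new table then means adding a tag and a rawpath method; the loader and anything else written against TableTag is unchanged.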
Anyway, I feel like this turned into a long ramble that may not have much to do with your original question. Having IO appropriate for my workflow is something I’ve been working towards for a long time, and has touched many projects such as my re-write of Feather.jl. I’m planning on finally creating a generic package that can serve as a template for my workflow soon (the goal is to cleanly separate all of the data cleaning nonsense from the underlying abstract mathematical problem) so if you’re interested stay tuned.