Designing interfaces for objects constructable from a file

Hi,

I’m trying to understand how to design interfaces for types in Julia well.

More specifically, I am writing a type which is effectively a wrapper around a DataFrame.

Its purpose is to manage some configuration for an application.

Initially, I thought it should be constructable from a String, that String being the name of a file from which to read data. (Not the data itself.)

I then realized this was probably designing at the wrong level of abstraction.

A DataFrame is not constructable from a filename. (A String containing the name of a file.)

Instead, at least in the case of reading from CSV, it is constructable from a CSV File object.

julia> typeof(CSV.File("example.csv"))
CSV.File

You then use that to build a DataFrame like so.

julia> DataFrame(CSV.File("example.csv"))
DataFrame...

It would probably make a lot of sense to do something similar to what DataFrames does. So I tried to inspect what types are used in the constructor functions.

julia> methods(DataFrame)
... some stuff ... nothing obviously relevant to CSVFile ...

From this I was hoping to figure out what type is used to build a constructor function which can create a DataFrame object from a CSVFile object.

Of course, DataFrames can be constructed from other types of file encoding, not just CSV. So it should be something generic.

julia> supertype(CSV.File)
AbstractVector{Row}

The above provides a hint. A DataFrame is constructable from an AbstractVector, and a CSVFile is an AbstractVector.

But what is Row here, or how can I inspect it to find out more information?

There’s a new package called About.jl that your post reminded me of. Here’s how I used it:

using About
using CSV
using DataFrames

julia> about(CSV.File("C:\\Users\\mthel\\Julia\\src_data\\test.csv"))
155-element CSV.File (<: AbstractVector{CSV.Row} <: Any), occupies 56B directly (referencing 31kB in total)
    name::String                   8B @ 0x000002841f89b498                                   "C:\\Users\\ … \\test.csv"
   names::Vector{Symbol}           8B @ 0x00000284086634a0                                   [Symbol("Cas … ntractors")]
   types::Vector{Type}             8B @ 0x0000028408169c40                                   Type[Int64,  … g1, String1]
    rows::Int64                    8B 00000000000000000000000000 … 0000000000000000010011011 155
    cols::Int64                    8B 00000000000000000000000000 … 0000000000000000000010000 16
 columns::Vector{CSV.Column}       8B @ 0x0000028408663540                                   Column[Colum … , nothing))]
  lookup::Dict{Symbol, CSV.Column} 8B Ptr?                                                   Dict{Symbol, … , nothing)))

 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
     *         *         *         8B        8B        *         *

 * = Pointer (8B)

julia> about(CSV.Row)
Concrete DataType defined in CSV, 32B
  CSV.Row <: Tables.AbstractRow <: Any

Struct with 4 fields:
• names   *Vector{Symbol}
• columns *Vector{CSV.Column}
• lookup  *Dict{Symbol, CSV.Column}
• row      Int64

 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
         *                *                *               8B

 * = Pointer (8B)

Is that helpful?


The macros @which and @less are probably helpful. In addition, you can check out my TraceFuns.jl which will show (all) methods called when executing an expression:

julia> using DataFrames, CSV, TraceFuns

julia> @which DataFrame(CSV.File("/tmp/test.csv"))  # call on small example file
DataFrame(x; copycols)
     @ DataFrames ~/.julia/packages/DataFrames/kcA9R/src/other/tables.jl:48
# @less opens source file at that position

# See all methods from CSV and DataFrame used during that call
julia> @trace DataFrame(CSV.File("/tmp/test.csv")) DataFrames CSV
# -- very long output clipped --
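
Alongside @which, plain Base reflection can also answer the "what is Row?" kind of question. Here is a minimal sketch on two made-up types (AbstractRecord and Record are hypothetical stand-ins, not part of CSV.jl or Tables.jl); the same calls should work on CSV.Row once CSV is loaded.

```julia
# Hypothetical stand-ins for an abstract row type and a concrete row type.
abstract type AbstractRecord end

struct Record <: AbstractRecord
    names::Vector{Symbol}
    row::Int
end

# Base reflection: each call works on any type, including package types.
parent_type  = supertype(Record)       # the direct supertype
field_names  = fieldnames(Record)      # tuple of field names as Symbols
field_types  = fieldtypes(Record)      # tuple of declared field types
is_subtype   = Record <: AbstractRecord
is_concrete  = isconcretetype(Record)

println(parent_type)
println(field_names)
println(field_types)
```

For a package type, qualify the name with its module, for example supertype(CSV.Row) or fieldnames(CSV.Row).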

Could Preferences.jl be useful here?


That’s really useful, thanks for sharing.

Thanks for the tip

I can see how this would be useful for some applications. The data I have here doesn’t fit well with TOML format however. It is inherently 2 dimensional, or at least tabular.

But thanks anyway, I will keep it in mind for the future.

Just wanted to pick up this thread again as I have made some progress, but not much.

I want to write a type something like this

struct ConfigStruct
    df_config::DataFrame 
    
    ConfigStruct(config::AbstractVector{Tables.AbstractRow}) = new(config)
end

and a test something like this.

# some data for test
test_data_vector::Vector{Vector{Any}} = (
    [
        ["A", "B", "C"],
        ["AA", "BA", "CA"],
        ["AB", "BB", "CB"],
        ["AC", "BC", "CC"],
    ]
)

config = (
    ConfigStruct(test_data_vector)
)

# rest of test (incomplete)

however this does not compile, producing the following error

MethodError: no method matching ConfigStruct(::Vector{Vector{Any}})
  The type `ConfigStruct` exists, but no method is defined for this combination of argument types when trying to construct it.

Closest candidates are:
  ConfigStruct(::AbstractVector{Tables.AbstractRow})

To explain why I have written this code:

  • I looked at the methods for the constructor of a DataFrame. Since this object ConfigStruct is a wrapper around a DataFrame, and I intend to initialize it using a CSV.File, I concluded the method signature had to be AbstractVector{Tables.AbstractRow}. I believe a CSV.File conforms to this specification, meaning that it is a subtype of AbstractVector{Row}, where Row is CSV.Row, which itself is a subtype of Tables.AbstractRow.
  • For the purposes of a test, I do not want to read data from a file. Instead, it is easier to write a test which has the data written into the test. I chose to store data as a Vector{Vector{Any}}, which I believe should conform to the required type.

Maybe it is the presence of the Any which throws things off here? I’m not sure how to fix it.

using CSV, DataFrames

struct ConfigStruct
    df_config::DataFrame
end

# Accept a matrix or a vector of vectors; :auto generates the column names.
ConfigStruct(config::Union{AbstractMatrix, AbstractVector{T}}) where T <: AbstractVector =
    ConfigStruct(DataFrame(config, :auto))

# case for multiple dispatch:
ConfigStruct(path::AbstractString) = ConfigStruct(DataFrame(CSV.File(path)))

test_data_vector::Vector{Vector{Any}} = (
    [
        ["A", "B", "C"],
        ["AA", "BA", "CA"],
        ["AB", "BB", "CB"],
        ["AC", "BC", "CC"],
    ]
)

test_matrix = [1 2; 3 4; 5 6]

sfv = ConfigStruct(test_data_vector)
sfm = ConfigStruct(test_matrix)

julia> Vector{Vector{Any}} <: AbstractVector{Tables.AbstractRow}
false

I don’t know enough Julia to understand this. Is the where necessary? Why did you choose a design which uses a template parameter T?

I can read this and understand what it does, but I want to know more - why this design?

Thanks for your help so far btw…

Well, actually this works too:

ConfigStruct(config::Union{AbstractMatrix, AbstractVector{<: AbstractVector}}) = ConfigStruct(DataFrame(config, :auto))

To phrase my question another way - what thought process did you go through to come to this conclusion?

I believe you that it will work, and I will try and test on my system in a minute, but for example, why Union of an AbstractMatrix and the double vector?

Then, why AbstractVector{T} instead of Tables.AbstractRow, which as far as I can tell is closer to the type used by DataFrame? (Maybe I’m mistaken here?)

For the explanation of why the where T <: AbstractVector part is needed, see this FAQ entry and the link therein:

Frequently Asked Questions · The Julia Language
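
To make the point about the where clause concrete, here is a small sketch that uses only Base types. Julia's parametric types are invariant, so an argument annotated AbstractVector{AbstractVector} does not accept a Vector{Vector{Any}}. The function names count_rows_strict and count_rows are made up for illustration.

```julia
# Parametric types are invariant: the type parameter must match exactly,
# unless the signature explicitly allows "any subtype of".
exact_real    = Vector{Int} <: AbstractVector{Real}      # false
subtype_real  = Vector{Int} <: AbstractVector{<:Real}    # true

exact_nested   = Vector{Vector{Any}} <: AbstractVector{AbstractVector}    # false
subtype_nested = Vector{Vector{Any}} <: AbstractVector{<:AbstractVector}  # true

# Hypothetical functions showing the effect on dispatch.
count_rows_strict(rows::AbstractVector{AbstractVector}) = length(rows)
count_rows(rows::AbstractVector{T}) where {T<:AbstractVector} = length(rows)

data = Vector{Any}[["A", "B"], ["AA", "BA"]]    # a Vector{Vector{Any}}

strict_applies = applicable(count_rows_strict, data)  # no matching method
n_rows = count_rows(data)                             # dispatches fine
```

AbstractVector{<:AbstractVector} is shorthand for AbstractVector{T} where T<:AbstractVector, which is why both spellings in this thread behave the same.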


These kinds of responses really aren’t helpful. I understand - in the literal sense - what the code written in the above examples does.

What I do not understand is the reasoning behind the design decisions. This is what I am asking about, and linking me to the FAQ is not going to provide an answer to this.

Hope my point is clear now? Apologies if I didn’t explain myself clearly enough.

This signature is not going to work for any input (as far as I understand it; I have my own difficulties with Julia’s type system). See the FAQ cited above.

Now, if you want to keep your data in a CSV file (specified by a String), then you specify a method building the desired output (ConfigStruct) from the desired input (String, or, for generality, AbstractString) - see my code above. Whatever happens in the background - you don’t need to think too much about it as long as it does the job. I’d only add the keywords for the headers and separator instead of leaving it to CSV.jl guessing.
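
As a rough sketch of the keyword idea: header and delim are existing CSV.File keywords, but the default values below (first row is the header, comma as separator) are only assumptions for illustration. The temporary file is there just so the example runs without real data.

```julia
using CSV, DataFrames

struct ConfigStruct
    df_config::DataFrame
end

# Build the wrapper from a file name. The keyword defaults are assumptions;
# they are forwarded to CSV.File so that nothing is left to guessing.
ConfigStruct(path::AbstractString; header = 1, delim = ',') =
    ConfigStruct(DataFrame(CSV.File(path; header = header, delim = delim)))

# Usage with a throwaway file.
path, io = mktemp()
write(io, "A,B,C\nAA,BA,CA\nAB,BB,CB\n")
close(io)

config = ConfigStruct(path)
println(size(config.df_config))
println(names(config.df_config))
```

The String method only decides how the table is loaded; everything downstream works on the wrapped DataFrame.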

If you also want to create your ConfigStruct from types of data other than DataFrame (for which a constructor is already created implicitly), you have to provide a corresponding constructor. In your example it is Vector{Vector}, and I added the most general signature for this case. Note that it would not cover AbstractVector{<:Tables.AbstractRow}, because a row is not an AbstractVector; the catch-all method below handles that case. However, the more natural way to hold 2D data is a 2D Array, or Matrix (which is not Vector{Vector}), so I added the corresponding signature.

There are many other ways to initialize a DataFrame, e.g. Vector{NamedTuple}, thus you could also add

ConfigStruct(config) = ConfigStruct(DataFrame(config))

The FAQ deals (among other things) explicitly with the case in question.


If this question is an FAQ then I don’t understand why it is. I don’t understand much of what you posted above either.

You posted multiple replies with solutions, but didn’t explain those solutions. You simply said “the answer is do X”. Well, ok, but why is do X the right thing to do?

I don’t believe this is an FAQ. People are not FAQ-ing how to build an interface which accepts the same kind of abstract concept as is accepted by a DataFrame constructor. That’s a highly specific question. It’s not as if I’m asking how to install a package or something, which would be an FAQ.

That’s plainly not true, and you know that.

Perhaps the missing link here is that CSV.File implements the Tables.jl interface, and so does DataFrame. The Tables.jl package defines a generic table interface, and it has the concept of sources and sinks. I don’t have time to give a detailed explanation right now, but take a look at the Tables.jl documentation to get a better idea of how you work with a generic table interface.

(Unfortunately, AbstractDataFrame does not have a well-defined interface that I am aware of, so there’s no easy way to turn ConfigStruct into an AbstractDataFrame, if that’s what you are interested in.)
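
A minimal sketch of what "accept any Tables.jl source" could look like. TableConfig is a made-up name, and the istable check is just one possible way to reject non-table input early; Tables.istable and the DataFrame(source) constructor are the only package calls used.

```julia
using DataFrames, Tables

# Hypothetical wrapper that accepts any Tables.jl-compatible source.
struct TableConfig
    df::DataFrame

    function TableConfig(source)
        # Reject inputs that do not implement the Tables.jl interface.
        Tables.istable(source) ||
            throw(ArgumentError("expected a Tables.jl-compatible source"))
        new(DataFrame(source))
    end
end

# A vector of NamedTuples is a valid Tables.jl source, handy for tests
# that should not read from a file.
rows = [(key = "A", value = "AA"), (key = "B", value = "BA")]
cfg = TableConfig(rows)

println(size(cfg.df))
println(names(cfg.df))
```

With this shape, TableConfig(CSV.File(path)) should work the same way, since a CSV.File is also a Tables.jl source.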

  • why? what does this mean?

None of the constructors of a DataFrame take a String so this doesn’t make any sense.

Just a few sentences earlier, you said this will “never work”.

This is true, but only one of these constructor signatures is used for initializing from a CSV.File (or any other file format) so while there are other constructors, this is irrelevant.

If I wrote the ConfigStruct type, I would make a constructor that just passes all its arguments to DataFrame, like this:

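struct ConfigStruct
    df::DataFrame

    # Inner constructor: forward everything to DataFrame and wrap the result.
    # Defining it here replaces the default constructors, which would call
    # convert(DataFrame, x) on a single argument instead of DataFrame(x).
    ConfigStruct(args...; kwargs...) = new(DataFrame(args...; kwargs...))
end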

Then you should be able to do ConfigStruct(CSV.File(path)) (among other things).

The DataFrame constructor that gets dispatched to when you call DataFrame(CSV.File(path)) has signature DataFrame(x::Any), which is required because objects of any type, whatever their supertype, can implement the Tables.jl interface. (You might have missed my post above about Tables.jl, because we posted at about the same time.)

Here’s the link to that DataFrame(x::Any) constructor.