The “Parse, don’t validate” blog post came by on Hacker News again. As far as I understand, it advocates encoding checks in types instead of re-checking an object each time it is used. Are there valuable lessons in this post for the Julia context? For example, would it be a good idea to use a NonEmpty type?
Parse, don’t validate is a great idea, and Julia code will benefit from doing it. In a typed dynamic language, there is a question of how to implement the idea. E.g.:
- Double down on types, use things like SumTypes.jl,
- Double down on dynamism, use things like Clojure’s Spec,
- Use traits.
I think there’s already a lot that you can do in the spirit of “parse, don’t validate” using Julia’s type system. Of course, when there are bugs, what you end up with is MethodErrors at run time instead of static compilation errors.
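As a minimal sketch of that trade-off (the Email type and the parse_email and domain functions are hypothetical names, not from any package): the check happens once in the parser, and code that skips the parser fails with a MethodError when the call runs, not at compile time.

```julia
# Hypothetical wrapper: holding an `Email` is meant to signal that the check passed.
struct Email
    address::String
end

# Parse once at the boundary; returns `nothing` when the input is not acceptable.
# (The check is deliberately simplistic; real address validation is more involved.)
function parse_email(s::AbstractString)
    occursin('@', s) ? Email(String(s)) : nothing
end

# Downstream code only accepts the parsed type and never re-checks.
domain(e::Email) = last(split(e.address, '@'))

e = parse_email("ada@example.org")
domain(e)                      # "example.org"

# Skipping the parser is caught, but only when the call actually runs:
err = try
    domain("ada@example.org")  # a raw String, not an Email
catch ex
    ex
end
err isa MethodError            # true
```

Note that the default constructor Email("anything") still bypasses the check in this sketch; an inner constructor would be one way to close that gap.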
One thing that I sometimes forget is that if I’m writing code that processes data that I’ve created myself inside my program, then I don’t need to validate it, since I’m the one who created it and I know what shape it’s in.
So I think the advice in “parse, don’t validate” is focused mostly on processing input data. But some of the advice applies generally, like this one:
Use a data structure that makes illegal states unrepresentable.
Sometimes you see a function like this:
function foo(; flag1, flag2)
    if flag1 && flag2
        1
    elseif flag1 && !flag2
        2
    elseif !flag1 && flag2
        3
    else
        error("flag1 and flag2 cannot both be false")
    end
end
But you could just make the 4th state illegal by using the type system:
# Hopefully there are more natural names that you can use
# in your real application.
struct Flag1Flag2 end
struct Flag1NotFlag2 end
struct NotFlag1Flag2 end
foo(::Flag1Flag2) = 1
foo(::Flag1NotFlag2) = 2
foo(::NotFlag1Flag2) = 3
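If the flags still arrive as raw Booleans (say, from a config file), they can be converted once at the boundary. Here is a sketch of that, where parse_flags is a hypothetical helper and the three structs are repeated so that the block runs on its own:

```julia
struct Flag1Flag2 end
struct Flag1NotFlag2 end
struct NotFlag1Flag2 end

foo(::Flag1Flag2) = 1
foo(::Flag1NotFlag2) = 2
foo(::NotFlag1Flag2) = 3

# The only place where the illegal combination is checked.
function parse_flags(flag1::Bool, flag2::Bool)
    flag1 && flag2 && return Flag1Flag2()
    flag1 && !flag2 && return Flag1NotFlag2()
    !flag1 && flag2 && return NotFlag1Flag2()
    throw(ArgumentError("flag1 and flag2 cannot both be false"))
end

foo(parse_flags(true, false))   # 2
```

Everything downstream of parse_flags can then take one of the three types and has no error branch of its own.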
Here is another example that I think is related. It’s nice to avoid struct fields with Union{Nothing, T} types, if possible. So instead of this,
abstract type AbstractPerson end
struct Person <: AbstractPerson
    name::Union{Nothing, String}
    age::Int
end
you could do this:
abstract type AbstractPerson end
struct Person <: AbstractPerson
    name::String
    age::Int
end
struct Anonymous <: AbstractPerson
    age::Int
end
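Downstream code can then dispatch on the two cases instead of branching on nothing. A sketch, where parse_person and displayname are hypothetical names and the type definitions are repeated so that the block is self-contained:

```julia
abstract type AbstractPerson end

struct Person <: AbstractPerson
    name::String
    age::Int
end

struct Anonymous <: AbstractPerson
    age::Int
end

# The `nothing` case is resolved once, when raw input is turned into a value.
parse_person(name::Nothing, age::Integer) = Anonymous(age)
parse_person(name::AbstractString, age::Integer) = Person(name, age)

# No `isnothing` checks anywhere after that point.
displayname(p::Person) = p.name
displayname(::Anonymous) = "anonymous"

people = [parse_person("Ada", 36), parse_person(nothing, 41)]
map(displayname, people)   # ["Ada", "anonymous"]
```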
Parse, don’t validate… with type checking by JET.jl:
File
# parse_dont_validate.jl

struct NonEmpty{T}
    head::T
    tail::Vector{T}
end

head(x::NonEmpty) = x.head

function foo()
    x = rand(3)
    head(x)
end
JET
Scroll to the bottom to see the type error:
julia> report_file("parse_dont_validate.jl"; analyze_from_definitions=true)
[toplevel-info] virtualized the context of Main (took 0.001 sec)
[toplevel-info] entered into parse_dont_validate.jl
[toplevel-info] exited from parse_dont_validate.jl (took 0.004 sec)
[toplevel-info] analyzing from top-level definitions ... 4/4
═════ 3 possible errors found ═════
┌ @ parse_dont_validate.jl:11 rand(3)
│┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:277 Random.rand(Random.Float64, Random.Dims(dims))
││┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:289 Random.default_rng()
│││┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/RNGs.jl:370 Random.default_rng(Base.getproperty(Random.Threads, :threadid)())
││││┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/RNGs.jl:376 Random.MersenneTwister()
│││││┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/RNGs.jl:147 #self#(Random.nothing)
││││││┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/RNGs.jl:147 Random.seed!(Random.MersenneTwister(Core.apply_type(Random.Vector, Random.UInt32)(), Random.DSFMT_state()), seed)
│││││││┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/Random.jl:426 Random.seed!(rng)
││││││││┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/RNGs.jl:362 Random.make_seed()
│││││││││┌ @ /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Random/src/RNGs.jl:326 Random.read(Random.pipeline(Base.cmd_gen(Core.tuple(Core.tuple("ifconfig"))), Base.cmd_gen(Core.tuple(Core.tuple("sha1sum")))), Random.String)
││││││││││┌ @ process.jl:421 Base.read(cmd)
│││││││││││┌ @ process.jl:410 Base.open(cmd, "r", Base.devnull)
││││││││││││┌ @ process.jl:339 Core.kwfunc(Base.open)(Core.apply_type(Core.NamedTuple, (:read, :write))(Core.tuple(true, true)), Base.open, cmds, stdio)
│││││││││││││┌ @ process.jl:361 Base.#open#646(write, read, _3, cmds, stdio)
││││││││││││││┌ @ process.jl:365 Base._spawn(cmds, Base.getindex(Base.Any, in, out, Base.stderr))
│││││││││││││││┌ @ process.jl:119 Base.setup_stdios(#639, stdios)
││││││││││││││││┌ @ process.jl:196 f(open_io)
│││││││││││││││││┌ @ process.jl:120 Base._spawn(Core.getfield(#self#, :cmds), stdios, Base.ProcessChain())
││││││││││││││││││┌ @ process.jl:151 Base._spawn(Base.getproperty(cmds, :b), stdios_right, chain)
│││││││││││││││││││┌ @ process.jl:181 Base._spawn_primitive(Base.getindex(Base.getproperty(cmd, :exec), 1), cmd, stdios)
││││││││││││││││││││┌ @ process.jl:99 Base.repr(cmd)
│││││││││││││││││││││┌ @ strings/io.jl:219 Base.#repr#386(Base.nothing, #self#, x)
││││││││││││││││││││││┌ @ strings/io.jl:219 Core.kwfunc(Base.sprint)(Core.apply_type(Core.NamedTuple, (:context,))(Core.tuple(context)), Base.sprint, Base.show, x)
│││││││││││││││││││││││┌ @ strings/io.jl:101 Base.#sprint#385(Core.tuple(context, sizehint, _3, f), args...)
││││││││││││││││││││││││┌ @ strings/io.jl:105 f(Core.tuple(s), args...)
│││││││││││││││││││││││││┌ @ cmd.jl:116 Base.map(#620, Base.getproperty(cmd, :exec))
││││││││││││││││││││││││││┌ @ abstractarray.jl:2294 Base.collect_similar(A, Base.Generator(f, A))
│││││││││││││││││││││││││││┌ @ array.jl:606 Base._collect(cont, itr, Base.IteratorEltype(itr), Base.IteratorSize(itr))
││││││││││││││││││││││││││││┌ @ array.jl:691 Base.iterate(itr)
│││││││││││││││││││││││││││││┌ @ generator.jl:47 Base.getproperty(g, :f)(Base.getindex(y, 1))
││││││││││││││││││││││││││││││┌ @ cmd.jl:117 Core.kwfunc(Base.sprint)(Core.apply_type(Core.NamedTuple, (:context,))(Core.tuple(Core.getfield(#self#, :io))), Base.sprint, #621)
│││││││││││││││││││││││││││││││┌ @ strings/io.jl:101 Base.#sprint#385(Core.tuple(context, sizehint, _3, f), args...)
││││││││││││││││││││││││││││││││┌ @ strings/io.jl:103 f(Core.tuple(Base.IOContext(s, context)), args...)
│││││││││││││││││││││││││││││││││┌ @ cmd.jl:118 Base.with_output_color(#622, :underline, io)
││││││││││││││││││││││││││││││││││┌ @ util.jl:71 Base.#with_output_color#814(Core.tuple(false, #self#, f, color, io), args...)
│││││││││││││││││││││││││││││││││││┌ @ util.jl:85 Base.split(str, '\n')
││││││││││││││││││││││││││││││││││││┌ @ strings/util.jl:411 Base.#split#375(0, true, #self#, str, splitter)
│││││││││││││││││││││││││││││││││││││┌ @ strings/util.jl:411 Base._split(str, Base.isequal(splitter), limit, keepempty, _7)
││││││││││││││││││││││││││││││││││││││┌ @ strings/util.jl:421 Base.first(r)
│││││││││││││││││││││││││││││││││││││││┌ @ abstractarray.jl:386 Base.iterate(itr)
││││││││││││││││││││││││││││││││││││││││ no matching method found for call signature (Tuple{typeof(iterate), Nothing}): Base.iterate(itr::Nothing)
│││││││││││││││││││││││││││││││││││││││└────────────────────────
││││││││││││││││││││││││││││││││││││││┌ @ strings/util.jl:421 Base.last(r)
│││││││││││││││││││││││││││││││││││││││┌ @ abstractarray.jl:437 Base.lastindex(a)
││││││││││││││││││││││││││││││││││││││││ no matching method found for call signature (Tuple{typeof(lastindex), Nothing}): Base.lastindex(a::Nothing)
│││││││││││││││││││││││││││││││││││││││└────────────────────────
┌ @ parse_dont_validate.jl:12 head(x)
│ no matching method found for call signature (Tuple{typeof(head), Vector{Float64}}): head(x::Vector{Float64})
└─────────────────────────────
(included_files = Set(["parse_dont_validate.jl"]), any_reported = true)
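One way to address the last report is to parse the vector into a NonEmpty before calling head, so that head is only ever called on a NonEmpty. The following is only a sketch: the nonempty function is a hypothetical name, and returning nothing for empty input is one design choice among several (throwing would be another).

```julia
struct NonEmpty{T}
    head::T
    tail::Vector{T}
end

head(x::NonEmpty) = x.head

# Parse: the emptiness check happens exactly once, here.
function nonempty(v::Vector{T}) where {T}
    isempty(v) && return nothing
    NonEmpty{T}(v[1], v[2:end])
end

function foo()
    x = nonempty(rand(3))
    x === nothing && error("expected at least one element")
    head(x)   # past the check above, x is a NonEmpty{Float64}
end
```

The struct itself cannot represent an empty collection, since it always carries a head, so nothing after the parse step has to ask whether the data is empty.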
Thank you both for your thoughts and Cameron for your great examples!
I’ve been thinking about it some more and don’t see much benefit (but I am open to being convinced otherwise). In essence, the point of parse, don’t validate, as I understand it, is to get feedback more quickly. In a perfect situation, the editor would highlight an error like the one in your last example, which is much quicker than running, say, a Python script and inspecting the output. However, given that Julia already has quick evaluation with Revise, I doubt that the effort of using types properly is worth it. But I might, of course, be completely wrong.
That is only part of the benefit.
- dispatch on very specific properties of objects, good for performance: “Algorithm efficiency comes from problem information”
- no redundant checks, because problem information is embedded in the type: if I parse an input into a NonEmptyVector{Int64}, then I don’t have to keep checking the empty case whenever I do something with it. I can just dispatch on the type. Good for performance.
- no redundant checks (less code), good for readability
- very specific types in method signatures, good for readability
- if I accidentally change or delete a necessary prerequisite check, the program will error loudly instead of silently assuming that I already checked a property and returning incorrect results
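The first point can be sketched with a hypothetical Sorted wrapper (not from any package; hasvalue is also a made-up name): once a vector has been parsed into Sorted, a membership test can dispatch to binary search, while plain vectors fall back to a linear scan.

```julia
# Hypothetical wrapper: the inner constructor sorts a copy of the input,
# so every Sorted value is sorted when it is created.
struct Sorted{T}
    data::Vector{T}
    Sorted(v::Vector{T}) where {T} = new{T}(sort(v))
end

# Generic fallback: linear scan, no assumptions about order.
hasvalue(v::Vector, x) = any(==(x), v)

# Specialised method: binary search, justified by the type alone.
hasvalue(s::Sorted, x) = !isempty(searchsorted(s.data, x))

s = Sorted([5, 1, 4, 2])
hasvalue(s, 4)              # true
hasvalue([5, 1, 4, 2], 3)   # false
```

The guarantee only holds as long as nobody mutates s.data afterwards; this sketch does not try to prevent that.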
Another example: strings. The StrBase.jl package contains validated string types. That is handy because you don’t need to handle invalid byte sequences manually, which also makes it faster in most cases.
OK, I’m getting more convinced. So, to do this for DataFrames, would one do the following? Define types such as NonMissing{DataFrame} and Sorted{DataFrame}, and overload all kinds of methods to handle these new types, such as
vcat(a::NonMissing{DataFrame}, b::DataFrame) = vcat(DataFrame(a), b)
vcat(a::NonMissing{DataFrame}, b::NonMissing{DataFrame}) = NonMissing{DataFrame}(vcat(DataFrame(a), DataFrame(b)))
filter(f, df::NonMissing{DataFrame}) = NonMissing{DataFrame}(filter(f, DataFrame(df)))
[...]
That seems like a feasible strategy. That said, it looks like AbstractDataFrame has a pretty large API that’s specific to that type. It might be easier to implement this on the smaller Tables.jl API with SplitApplyCombine.jl, and maybe using one of the other table types.
Problem-specific types are often useful in this regard: e.g., not a Table, but a MeasurementList, or Users. Convert from a plain Table immediately after reading from a file, potentially throwing errors in the process, and only use the “parsed” value afterwards.
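A sketch of that suggestion, using a plain vector of named tuples as a stand-in for whatever table type the file reader returns. The Measurement fields and the rules enforced here (no missing values, no negative readings) are assumptions made up for illustration, as are the function names.

```julia
struct Measurement
    sensor::String
    value::Float64
end

struct MeasurementList
    items::Vector{Measurement}
end

# Parse raw rows right after reading; throw on anything malformed.
function parse_measurements(rows)
    items = Measurement[]
    for (i, row) in enumerate(rows)
        ismissing(row.value) && throw(ArgumentError("row $i: missing value"))
        row.value >= 0 || throw(ArgumentError("row $i: negative value"))
        push!(items, Measurement(row.sensor, row.value))
    end
    MeasurementList(items)
end

# Code after the boundary takes the parsed type and needs no checks.
total(m::MeasurementList) = sum(x -> x.value, m.items; init = 0.0)

raw = [(sensor = "a", value = 1.5), (sensor = "b", value = 2.0)]
total(parse_measurements(raw))   # 3.5
```

The init keyword of sum needs Julia 1.6 or later, which matches the version in the JET output above.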