Exact numerical logging

I have a long-running calculation, for which I want to emit a log that allows recovery of exact numerical values, so that I can restart or debug the calculation. A typical logged item can be thought of as a (possibly nested) NamedTuple (or Dict) of vectors and scalars, e.g.

(a = rand(SVector{3,Float64}, 4), μ = 9.7, κ = rand(9))

It is not a problem if logging does not preserve types (eg the SVector above). The output would ideally be human-readable, reasonably future-proof, and support UTF8.

Suggestions on how to implement this would be helpful, even if very general. I am thinking of writing to a JSON file, but if someone already implemented something like this, I would love to hear about it.

2 Likes

What’s wrong with the print output (as used in JSON or whatever)? For Float64 values, print outputs a decimal representation that, when parsed as Float64, yields exactly the same value.

Using your example:

julia> x = (a = rand(SVector{3,Float64}, 4), μ = 9.7, κ = rand(9));

julia> eval(Meta.parse(repr(x))) == x
true
9 Likes

The professional developers on my team, using C#, convert the floats (or Ints or Bools) to unsigned integers, then use Base64 encoding. I’m told this guarantees that what is written out will be read back in, assuming a Windows or Mac computer; we don’t use Linux. We store everything as a vector, and whatever format (JSON, XML, TOML) we are using also stores the dimensions and the datatype of the original data. Below is the code I use to convert a vector to that encoded format and read it back into a vector.

We’ve had a lot of success using TOML to store thousands of regression and unit tests across our systems this way. It made it very easy for me to write Julia versions of our C# code and guarantee that I get the same results.

In the TOML, we store the binary representation of the data, but also the values written out. We do that so that from inspecting the file you can see the values, but if you want the exact information the program would use the binary rep. TOML makes it very easy to convert to Dict and back. In fact, we round trip TOML to Dict to Julia Structs.

If the data is a vector, the size is a single number; if it is a scalar, size is not included at all.

We prefer TOML since it understands nan and infinity, maps easily to Dicts, is human-readable, and has a lot of libraries available.

The TOML file for one of the inputs looks like this:

[input.processrisk]
size = [2,3]
dtype = "Float64"
data = [0.09618765866371004, 0.44433488489147255, 0.9005311838970897, 0.6818201970996056, 0.8526313332776663, 0.41132839089506557]
bdata = "UEw9IMGfuD/QK8WV+2/cP5q3+8Um0ew/THEJl3jR5T9WHn+BwUjrPwzhs1A0U9o/"
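As a sanity check (a sketch, assuming a little-endian machine, which covers both the Windows and Mac computers mentioned above), the bdata field in the TOML example decodes back to exactly the values listed in data:

```julia
using Base64

# Decode the bdata string from the TOML example above; on little-endian
# hardware the decoded bytes reinterpret directly as the stored Float64s.
bdata = "UEw9IMGfuD/QK8WV+2/cP5q3+8Um0ew/THEJl3jR5T9WHn+BwUjrPwzhs1A0U9o/"
vals = collect(reinterpret(Float64, base64decode(bdata)))

data = [0.09618765866371004, 0.44433488489147255, 0.9005311838970897,
        0.6818201970996056, 0.8526313332776663, 0.41132839089506557]
vals == data  # six values, matching size = [2, 3]
```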

Here is the Julia code with an example for working with a vector to create what we call the binary representation:

x = rand(10)

str = writebinarydata(x)

x2 = readbinarydata(str, Float64)

x .== x2

using Base64

function writebinarydata(x::AbstractVector)
    # Map each supported element type to an unsigned integer of the same width.
    if eltype(x) == Float64
        type = UInt64
    elseif eltype(x) == Bool
        type = UInt8
    elseif eltype(x) == Int64
        type = UInt64
    else
        error("Unsupported eltype: $(eltype(x))")
    end
    reinterpret.(type, x) |> base64encode
end

readbinarydata(x::String, dtype::Type) = dtype.(reinterpret(dtype, x |> base64decode))

4 Likes

Is this guaranteed? I remember having this problem with C++, where the original numbers and the ones printed and parsed back differed, and I had to start using the "%a" format of printf to print the numbers in hexadecimal representation (with that, C/C++ had no loss of precision).

1 Like

Hi Tamas. I understand why it’s nice to have a human-readable file. If you can relax that requirement, the HDF file format is amazing for both preservation of exact values and for archival reliability. It underlies the JLD data format, too. HDF is exact about the binary representation of data values. It has built-in tools that will dump values to ASCII on demand, so it’s sort of as good as ASCII? And it’s supported by a non-profit since 1988, or so. It’s kind of great, when its complexity is tamed by an API layer like JLD.

5 Likes

Not sure if this counts as human-readable, but Julia is capable of printing and parsing floats in base 16 (hexadecimal), which removes any room for interpretation in how they get parsed.
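A short sketch of what this looks like in practice, using the Printf standard library: "%a" prints the exact hexadecimal representation of a Float64, and parse accepts hex-float literals back.

```julia
using Printf

# "%a" emits the exact base-16 representation of the float; parsing the
# resulting hex-float literal recovers the identical value, with no
# decimal rounding involved at any point.
x = 9.7
s = @sprintf("%a", x)   # e.g. "0x1.3666666666666p+3"
parse(Float64, s) == x  # exact round trip
```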

2 Likes

NB: @Oscar_Smith, to understand your post better, this helped.

1 Like

Yes, it’s guaranteed by modern float-to-string algorithms like Ryu (which is employed by Julia) and Grisu (which Julia used to employ).

This “information preservation” property was codified by Steele and White (1990) and I think it’s become widely accepted.

(And the good news is that it’s implemented in pure Julia, so you don’t have to worry about what your operating system’s libc does.)
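To illustrate the guarantee empirically (a quick check, not a proof), one can round-trip a batch of values, including a few awkward ones:

```julia
# The shortest decimal string Julia prints for a Float64 parses back to
# the bit-identical value; === compares the actual bits (so -0.0 stays -0.0).
xs = [rand(10_000); 0.1; nextfloat(0.0); floatmax(Float64); -0.0]
all(x -> parse(Float64, string(x)) === x, xs)  # true
```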

9 Likes

Very good to know. A shame this is not adopted by more languages. I had this problem about 4 years ago, so it seems C/C++ is in no hurry to adopt it (there were C/C++ standard updates between 1990 and 2016, and the problem still existed in 2016).

The problem here is that C/C++ farm a lot out to the OS, and some OSes are pretty bad with finicky stuff like this.

2 Likes

Nothing, that’s fine. I just need to embed it in JSON etc so that it is easier to process the file with a script if necessary.

Does TOML (and its current Julia implementation) scale to large files?

Yes, HDF is fine and I am using it elsewhere, but this data is really unstructured. The intention is to replace @info etc logging with something more structured and searchable.

1 Like

The largest file we’ve created contains about 1 million floats in total. That takes about 0.4 seconds to write and about 0.3 seconds to read on my 2019 MacBook Pro.

2 Likes

Not sure how close this is to your question, but I think there is no need to remove @info; you just need to change your sink. For example, LoggingFacilities.jl provides conversion of the usual logging output to JSON format. With the help of LoggingExtras.jl you can set up your logging to write to HDF/BSON/whatever without changing a line of your code (though the initial logger setup takes some time, of course).

3 Likes

Tamas, what a nice idea! It’s a logstash-style approach to program serialization. I remain a cheerleader for HDF, but they do refer to strings as “Other Non-numeric Datatypes” in their user manual, so maybe there’s a little neglect in that area.

One option is using TensorBoardLogger.jl. It logs to a ProtoBuf file that can be displayed by the TensorBoard program, but can also be read by a number of libraries (including TensorBoardLogger.jl itself).

In general, though, I think we should make a bunch of what LoggingExtras.jl calls logging sinks.
A JSON one would be good (lots of tools out there speak JSON for log files, e.g. AWS’s CloudWatch).
I am not sure if one exists, other than LoggingFacilities.JSONTransformerLogger, which is an unusual take on it in that it isn’t a sink but a transformer; I see some advantages to that, though.
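To make the idea concrete, here is a hypothetical minimal JSON-lines sink using only the Logging standard library (JSONLineLogger and jstr are made-up names, and the hand-rolled escaping is simplified; a real sink would use a JSON package):

```julia
using Logging

# Minimal sketch of a logging sink that writes one JSON object per record.
struct JSONLineLogger <: AbstractLogger
    io::IO
end

Logging.min_enabled_level(::JSONLineLogger) = Logging.Info
Logging.shouldlog(::JSONLineLogger, args...) = true
Logging.catch_exceptions(::JSONLineLogger) = true

# Quote a value as a JSON string (simplified: only escapes backslash and quote).
jstr(s) = '"' * replace(replace(string(s), "\\" => "\\\\"), "\"" => "\\\"") * '"'

function Logging.handle_message(logger::JSONLineLogger, level, message,
                                _module, group, id, file, line; kwargs...)
    fields = ["\"level\":" * jstr(level),
              "\"message\":" * jstr(message),
              "\"line\":" * string(line)]
    for (k, v) in kwargs
        push!(fields, jstr(k) * ":" * (v isa Real ? repr(v) : jstr(v)))
    end
    println(logger.io, '{' * join(fields, ",") * '}')
end

io = IOBuffer()
with_logger(JSONLineLogger(io)) do
    @info "step done" μ = 9.7
end
String(take!(io))  # one JSON object per log record
```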

Perhaps more generic still would be a Table logging sink, set up to write generically to any Tables.jl sink (e.g. CSV, LibPQ, Arrow).

1 Like

There is also the option of (Google’s) protocol buffers, which have a Julia implementation (ProtoBuf.jl). I don’t have any experience with them myself, but they are a binary-only JSON competitor and might be worth looking into for scaling to large sizes.

1 Like

After experimenting with many options, I settled on HDF5 after all, because it is the most robust and hassle-free. In particular, JSON and similar formats flatten all arrays to vectors, and ProtoBuf and TensorBoard are a bit heavyweight for my purposes.

I wrapped up everything in a mini-package

which is currently being registered. Particular attention is paid to thread safety, not keeping files open when not needed, and reading back logs. Comments, PRs, etc are of course welcome.

Example

julia> using HDF5Logging, Logging

julia> logger = hdf5_logger(tempname())
Logging into HDF5 file /tmp/jl_IbbUvj, 0 messages in “log”

julia> # write log

julia> with_logger(logger) do
       @info "very informative" a = 1
       end

julia> # read log

julia> logger[1]
(level = Info, message = "very informative", _module = "Main", group = "REPL[46]", id = "Main_7a40b9cc", file = "REPL[46]", line = 2, data = ["a" => 1])
11 Likes

Tamas, this code uses metadata keys to store information attached to groups in the HDF5 file. This kind of key isn’t kept in block storage, so I’m curious whether it behaves well even for small numbers of logging messages. It’s like that Reese’s commercial, “You put your data in my metadata!” How large did you test?

HDF5 has two main ways to store logging-type information: either a Packet Table or a Dataset with a Compound Datatype. You could also use an Opaque datatype (h5ex_t_opaque) to kind of stuff whatever you want in there.

The files also aren’t set up for an append operation. HDF5 stores data in blocks that are indexed by B+ trees and kept in an LRU cache. Packets and logging tend to write in chunks of messages, as a result, in order to reduce churn of blocks and to avoid corruption. Some of the other file formats mentioned are stars at appending and can recover from failed or partial writes.

Yes, I am aware of the fact that this is a compromise. It works fine for me though as I don’t have a lot of log messages, just a few with large objects.

I would love to use JSON and TOML, but my current obstacle is serializing arrays without losing the dimension information. This basically means that I would have to invent my own format within either to keep track of this. Cf

HDF5, even if slightly abused, seems like a reasonable workaround that allows me to proceed with the actual computation without getting lost in a rabbit hole.

In the long run, it would be great to have a protocol that just serializes everything to

  1. integers and floats,
  2. vectors of the above,
  3. dictionaries of dictionaries and/or the above.

This could then be used with JSON, TOML, BSON, etc.
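A hypothetical sketch of such a protocol (lower, raise, and the "size"/"data" keys are made-up names): arrays are lowered to a flat data vector plus an explicit size, so dimension information survives formats like JSON and TOML that only know flat vectors.

```julia
# Lower any nested NamedTuple/Dict of real scalars and arrays to
# dictionaries containing only scalars, vectors, and dictionaries.
lower(x::Real) = x
lower(x::AbstractArray{<:Real}) =
    Dict("size" => collect(size(x)), "data" => vec(collect(x)))
lower(x::NamedTuple) = Dict(String(k) => lower(v) for (k, v) in pairs(x))
lower(x::AbstractDict) = Dict(string(k) => lower(v) for (k, v) in x)

# Raise the lowered form back, restoring array dimensions via reshape.
raise(x::Real) = x
function raise(d::Dict)
    if haskey(d, "size") && haskey(d, "data")
        reshape(Vector(d["data"]), Tuple(d["size"])...)
    else
        Dict(k => raise(v) for (k, v) in d)
    end
end

x = (μ = 9.7, κ = rand(2, 3))
rt = raise(lower(x))
rt["κ"] == x.κ  # exact round trip of values and dimensions
```

In this sketch the type information is only partly preserved (an SVector comes back as a plain Array), which matches the relaxed requirement at the top of the thread.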

3 Likes

If this is the use case, I would just use the Serialization standard library.

2 Likes