Exact numerical logging

I have a long-running calculation, for which I want to emit a log that allows recovery of exact numerical values, so that I can restart or debug the calculation. A typical logged item can be thought of as a (possibly nested) NamedTuple (or Dict) of vectors and scalars, e.g.

(a = rand(SVector{3,Float64}, 4), μ = 9.7, κ = rand(9))

It is not a problem if logging does not preserve types (eg the SVector above). The output would ideally be human-readable, reasonably future-proof, and support UTF8.

Suggestions on how to implement this would be helpful, even if very general. I am thinking of writing to a JSON file, but if someone already implemented something like this, I would love to hear about it.

2 Likes

What’s wrong with the print output (as used in JSON or whatever)? For Float64 values, print outputs a decimal representation that, when parsed as Float64, yields exactly the same value.

Using your example:

julia> x = (a = rand(SVector{3,Float64}, 4), μ = 9.7, κ = rand(9));

julia> eval(Meta.parse(repr(x))) == x
true
9 Likes

The professional developers on my team, using C#, convert the floats (or Ints or Bools) to unsigned integers, then use Base64 encoding. I’m told this guarantees that what is written out will be read back in, assuming a Windows or Mac computer; we don’t use Linux. We store everything as a vector, and whatever format (JSON, XML, TOML) we are using also stores the dimensions and the datatype of the original data. Below is the code I use to convert a vector to that encoded format and read it back into a vector.

We’ve had a lot of success using TOML to store thousands of regression and unit tests across our systems this way. It made it very easy for me to write Julia versions of our C# code and guarantee that I get the same results.

In the TOML, we store the binary representation of the data, but also the values written out. We do that so that from inspecting the file you can see the values, but if you want the exact information the program would use the binary rep. TOML makes it very easy to convert to Dict and back. In fact, we round trip TOML to Dict to Julia Structs.

If the data is a vector, the size is a single number; if it is a scalar, size is not included at all.

We prefer TOML since it understands nan and infinity, maps easily to Dicts, is human-readable, and has a lot of libraries available.

The TOML file for one of the inputs looks like this:

[input.processrisk]
size = [2,3]
dtype = "Float64"
data = [0.09618765866371004, 0.44433488489147255, 0.9005311838970897, 0.6818201970996056, 0.8526313332776663, 0.41132839089506557]
bdata = "UEw9IMGfuD/QK8WV+2/cP5q3+8Um0ew/THEJl3jR5T9WHn+BwUjrPwzhs1A0U9o/"
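As a sanity check (a sketch, assuming a little-endian machine, which covers both the Windows and Mac computers mentioned above), the bdata field in the TOML example decodes back to exactly the values listed in data:

```julia
using Base64

# Decode the bdata string from the TOML example above; on little-endian
# hardware the decoded bytes reinterpret directly as the stored Float64s.
bdata = "UEw9IMGfuD/QK8WV+2/cP5q3+8Um0ew/THEJl3jR5T9WHn+BwUjrPwzhs1A0U9o/"
vals = collect(reinterpret(Float64, base64decode(bdata)))

data = [0.09618765866371004, 0.44433488489147255, 0.9005311838970897,
        0.6818201970996056, 0.8526313332776663, 0.41132839089506557]
vals == data  # six values, matching size = [2, 3]
```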

Here is the Julia code with an example for working with a vector to create what we call the binary representation:

x = rand(10)

str = writebinarydata(x)

x2 = readbinarydata(str, Float64)

x .== x2

using Base64

function writebinarydata(x::AbstractVector)
    # Map each supported element type to an unsigned integer of the same width.
    if eltype(x) == Float64
        type = UInt64
    elseif eltype(x) == Bool
        type = UInt8
    elseif eltype(x) == Int64
        type = UInt64
    else
        error("Unsupported eltype: $(eltype(x))")
    end
    reinterpret.(type, x) |> base64encode
end

readbinarydata(x::String, dtype::Type) = dtype.(reinterpret(dtype, x |> base64decode))

4 Likes

Is this guaranteed? I remember having this problem with C++, where the original numbers and the ones printed and parsed back differed, and I had to start using the "%a" format of printf to print the numbers in hexadecimal representation (with that, C/C++ had no loss of precision).

1 Like

Hi Tamas. I understand why it’s nice to have a human-readable file. If you can relax that requirement, the HDF file format is amazing for both preservation of exact values and for archival reliability. It underlies the JLD data format, too. HDF is exact about the binary representation of data values. It has built-in tools that will dump values to ASCII on demand, so it’s sort of as good as ASCII? And it’s supported by a non-profit since 1988, or so. It’s kind of great, when its complexity is tamed by an API layer like JLD.

5 Likes

Not sure if this counts as human-readable, but Julia is capable of printing and parsing floats in base 16 (hexadecimal), which removes any room for interpretation in how they get parsed.
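A short sketch of what this looks like in practice, using the Printf standard library: "%a" prints the exact hexadecimal representation of a Float64, and parse accepts hex-float literals back.

```julia
using Printf

# "%a" emits the exact base-16 representation of the float; parsing the
# resulting hex-float literal recovers the identical value, with no
# decimal rounding involved at any point.
x = 9.7
s = @sprintf("%a", x)   # e.g. "0x1.3666666666666p+3"
parse(Float64, s) == x  # exact round trip
```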

2 Likes

NB: @Oscar_Smith, to understand your post better, this helped.

1 Like

Yes, it’s guaranteed by modern float-to-string algorithms like Ryu (which is employed by Julia) and Grisu (which Julia used to employ).

This “information preservation” property was codified by Steele and White (1990) and I think it’s become widely accepted.

(And the good news is that it’s implemented in pure Julia, so you don’t have to worry about what your operating system’s libc does.)
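To illustrate the guarantee empirically (a quick check, not a proof), one can round-trip a batch of values, including a few awkward ones:

```julia
# The shortest decimal string Julia prints for a Float64 parses back to
# the bit-identical value; === compares the actual bits (so -0.0 stays -0.0).
xs = [rand(10_000); 0.1; nextfloat(0.0); floatmax(Float64); -0.0]
all(x -> parse(Float64, string(x)) === x, xs)  # true
```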

9 Likes

Very good to know. A shame this is not adopted by more languages. I had this problem about 4 years ago, so it seems C/C++ is in no hurry to adopt it (there were C/C++ standard updates between 1990 and 2016, and the problem still existed in 2016).

The problem here is that C/C++ farm a lot out to the OS, and some OSes are pretty bad with finicky stuff like this.

2 Likes

Nothing, that’s fine. I just need to embed it in JSON etc so that it is easier to process the file with a script if necessary.

Does TOML (and its current Julia implementation) scale to large files?

Yes, HDF is fine and I am using it elsewhere, but this data is really unstructured. The intention is to replace @info etc logging with something more structured and searchable.

1 Like

The largest file we’ve created contains about 1 million floats in total. That takes about 0.4 seconds to write and about 0.3 seconds to read on my 2019 MacBook Pro.

2 Likes

Not sure how close this is to your question, but I think there is no need to remove @info; you just need to change your sink. For example, LoggingFacilities.jl provides conversion of the usual logging output to JSON format. With the help of LoggingExtras.jl you can set up your logging to write to HDF/BSON/whatever without changing a line of your code (though the initial logger setup takes some time, of course).

3 Likes

Tamas, what a nice idea! It’s a logstash-style approach to program serialization. I remain a cheerleader for HDF, but they do refer to strings as “Other Non-numeric Datatypes” in their user manual, so maybe there’s a little neglect in that area.

One option is using TensorBoardLogger.jl. It logs to a ProtoBuf file that can be displayed by the TensorBoard program, but can also be read by a number of libraries (including TensorBoardLogger.jl itself).

In general, though, I think we should make a bunch of what LoggingExtras.jl calls logging sinks.
A JSON one would be good (lots of tools out there speak JSON for log files, e.g. AWS’s CloudWatch).
I am not sure if one exists, other than LoggingFacilities.JSONTransformerLogger, which is an unusual take on it in that it isn’t a sink but a transformer; I see some advantages to that, though.
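To make the idea concrete, here is a hypothetical minimal JSON-lines sink using only the Logging standard library (JSONLineLogger and jstr are made-up names, and the hand-rolled escaping is simplified; a real sink would use a JSON package):

```julia
using Logging

# Minimal sketch of a logging sink that writes one JSON object per record.
struct JSONLineLogger <: AbstractLogger
    io::IO
end

Logging.min_enabled_level(::JSONLineLogger) = Logging.Info
Logging.shouldlog(::JSONLineLogger, args...) = true
Logging.catch_exceptions(::JSONLineLogger) = true

# Quote a value as a JSON string (simplified: only escapes backslash and quote).
jstr(s) = '"' * replace(replace(string(s), "\\" => "\\\\"), "\"" => "\\\"") * '"'

function Logging.handle_message(logger::JSONLineLogger, level, message,
                                _module, group, id, file, line; kwargs...)
    fields = ["\"level\":" * jstr(level),
              "\"message\":" * jstr(message),
              "\"line\":" * string(line)]
    for (k, v) in kwargs
        push!(fields, jstr(k) * ":" * (v isa Real ? repr(v) : jstr(v)))
    end
    println(logger.io, '{' * join(fields, ",") * '}')
end

io = IOBuffer()
with_logger(JSONLineLogger(io)) do
    @info "step done" μ = 9.7
end
String(take!(io))  # one JSON object per log record
```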

Perhaps more generic still would be a Table logging sink, set up to write generically to any Tables.jl sink (e.g. CSV, LibPQ, Arrow).

1 Like

There is also the option of (Google’s) protocol buffers, which have a Julia implementation (ProtoBuf.jl). I don’t have any experience with them myself, but they are a binary-only JSON competitor and might be worth looking into for scaling to large sizes.

1 Like

After experimenting with many options, I settled on HDF5 after all, because it is the most robust and hassle-free. In particular, JSON and similar formats flatten all arrays to vectors, and ProtoBuf and TensorBoard are a bit heavyweight for my purposes.

I wrapped up everything in a mini-package

which is currently being registered. Particular attention is paid to thread safety, not keeping files open when not needed, and reading back logs. Comments, PRs, etc are of course welcome.

Example

julia> using HDF5Logging, Logging

julia> logger = hdf5_logger(tempname())
Logging into HDF5 file /tmp/jl_IbbUvj, 0 messages in “log”

julia> # write log

julia> with_logger(logger) do
       @info "very informative" a = 1
       end

julia> # read log

julia> logger[1]
(level = Info, message = "very informative", _module = "Main", group = "REPL[46]", id = "Main_7a40b9cc", file = "REPL[46]", line = 2, data = ["a" => 1])
11 Likes

Tamas, this code uses metadata keys to store information attached to groups in the HDF5 file. This kind of key isn’t kept in block storage, so I’m curious whether it behaves well even for small numbers of logging messages. It’s like that Reese’s commercial, “You put your data in my metadata!” How large did you test?

HDF5 has two main ways to store logging-type information: either a Packet Table or a Dataset with a Compound Datatype. You could also use an Opaque datatype (h5ex_t_opaque) to kind of stuff whatever you want in there.

The files also aren’t set up for an append operation. HDF5 stores data in blocks that are indexed by B+ trees and kept in an LRU cache. Packets and logging tend to write in chunks of messages, as a result, in order to reduce churn of blocks and to avoid corruption. Some of the other file formats mentioned are stars at appending and can recover from failed or partial writes.

Yes, I am aware of the fact that this is a compromise. It works fine for me though as I don’t have a lot of log messages, just a few with large objects.

I would love to use JSON and TOML, but my current obstacle is serializing arrays without losing the dimension information. This basically means that I would have to invent my own format within either to keep track of this. Cf

HDF5, even if slightly abused, seems like a reasonable workaround that allows me to proceed with the actual computation without getting lost in a rabbit hole.

In the long run, it would be great to have a protocol that just serializes everything to

  1. integers and floats,
  2. vectors of the above,
  3. dictionaries of dictionaries and/or the above.

This could then be used with JSON, TOML, BSON, etc.
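A hypothetical sketch of such a protocol (lower, raise, and the "size"/"data" keys are made-up names): arrays are lowered to a flat data vector plus an explicit size, so dimension information survives formats like JSON and TOML that only know flat vectors.

```julia
# Lower any nested NamedTuple/Dict of real scalars and arrays to
# dictionaries containing only scalars, vectors, and dictionaries.
lower(x::Real) = x
lower(x::AbstractArray{<:Real}) =
    Dict("size" => collect(size(x)), "data" => vec(collect(x)))
lower(x::NamedTuple) = Dict(String(k) => lower(v) for (k, v) in pairs(x))
lower(x::AbstractDict) = Dict(string(k) => lower(v) for (k, v) in x)

# Raise the lowered form back, restoring array dimensions via reshape.
raise(x::Real) = x
function raise(d::Dict)
    if haskey(d, "size") && haskey(d, "data")
        reshape(Vector(d["data"]), Tuple(d["size"])...)
    else
        Dict(k => raise(v) for (k, v) in d)
    end
end

x = (μ = 9.7, κ = rand(2, 3))
rt = raise(lower(x))
rt["κ"] == x.κ  # exact round trip of values and dimensions
```

In this sketch the type information is only partly preserved (an SVector comes back as a plain Array), which matches the relaxed requirement at the top of the thread.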

3 Likes

If this is the use case, I would just use the Serialization standard library.

2 Likes