Parallel bzip2 and gzip: pbzip2 and pigz

Do JuMP’s write_to_file and read_from_file use pbzip2 and pigz, parallel implementations of bzip2 and gzip, when writing or reading models stored in .bz2 and .gz compressed files?

https://jump.dev/JuMP.jl/stable/reference/models/#JuMP.write_to_file

https://zlib.net/pigz/

No, JuMP uses CodecZlib.jl (https://github.com/JuliaIO/CodecZlib.jl) and CodecBzip2.jl (https://github.com/JuliaIO/CodecBzip2.jl), codecs for TranscodingStreams.jl, which in turn use the standard zlib and libbzip2 libraries.
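
For reference, here is a minimal sketch of how those codecs are used; the path and payload are illustrative, and it assumes CodecZlib.jl is installed:

using CodecZlib  # provides GzipCompressorStream via TranscodingStreams.jl

# Wrap a plain file handle in a gzip compressor and write through it.
io = open("/data/example.gz", "w")   # hypothetical path
gz = GzipCompressorStream(io)
write(gz, "some text to compress")
close(gz)  # flushes the codec and closes the underlying file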

pbzip2 and pigz ought to be much faster than bzip2 and gzip since they utilize multiple cores through multithreading. Are there plans for TranscodingStreams.jl to use pbzip2 and pigz instead?

No plans. Is the speed of writing out your file a bottleneck?

For what it’s worth, gzip decompression is not very amenable to parallelization. (Compression is a different story.)
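
As a workaround, nothing stops you from shelling out to pigz yourself after writing the uncompressed file. A minimal sketch, assuming pigz is installed; the path and thread count are illustrative:

using JuMP

write_to_file(m, "/data/my.mps")  # write the uncompressed MPS file from JuMP
run(`pigz -p 24 /data/my.mps`)    # compress in parallel; replaces it with /data/my.mps.gz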

My workflow is:

  1. Construct the LP model in JuMP: 37 minutes.
  2. Write the LP model to a MPS file using JuMP.write_to_file: 100 minutes.
  3. Using 24 threads, presolve with PaPILO, solve with PDLP (1e-4 relative tolerance), postsolve with PaPILO: 145 minutes (PDLP takes 140 minutes to solve the presolved model).

Writing the MPS file takes 35% of the overall time. Writing a compressed MPS.GZ file takes about the same amount of time, but the compressed MPS.GZ file (1.2 GB) is 11% the size of the uncompressed MPS file (11 GB).

For another similarly-sized LP model, step 3 takes 50 minutes, so that writing the MPS file takes 53% of the overall time.

Have you profiled where the time is spent in step 2?

The only way to spend 100 minutes writing 11 GB (that’s 1.8 MB per second) is to have a very slow network disk, but in that case writing the compressed file should only require about 11 minutes. And spending 89 minutes compressing 11 GB of data sounds like entirely the wrong order of magnitude, even if it’s single-threaded.
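
The arithmetic, spelled out in Julia:

bytes   = 11e9         # 11 GB uncompressed MPS file
seconds = 100 * 60     # 100 minutes spent in write_to_file
bytes / seconds / 1e6  # ≈ 1.83 MB per second of effective throughput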

My gut feeling is that writing, or compressing and writing, takes up a few minutes and the rest is spent on something else, but I have no insight into the code so I can’t even guess what that might be. Profiling is the only way to find out where the time is truly spent.
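
A minimal profiling sketch using Julia’s built-in Profile standard library (the model m and the path are assumed from above):

using Profile

Profile.clear()
@profile write_to_file(m, "/data/my.mps")
Profile.print(format = :flat, sortedby = :count)  # show where the samples land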

An easy experiment is to write the uncompressed file to disk and then compress it with command line gzip. How much time does the latter step require? It should be in the same ballpark as writing the compressed file from Julia.
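
The experiment could look like this; gzip’s -k flag keeps the uncompressed original so both files remain for comparison (paths are illustrative):

t_write = @elapsed write_to_file(m, "/data/my.mps")  # uncompressed write from JuMP
t_gzip  = @elapsed run(`gzip -k /data/my.mps`)       # compress with command line gzip
@show t_write t_gzip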


I am using a fairly new Lambda workstation. This is how I measure the time of write_to_file:
MPS_fn = "/data/my.mps"  # MPS filename; or, for gzip: MPS_fn = "/data/my.mps.gz"
MPS_time = @elapsed begin
    write_to_file(m, MPS_fn)
end

  1. Construct the LP model in JuMP: 37 minutes.
  2. Write the LP model to a MPS file using JuMP.write_to_file: 100 minutes.

I think we’ve had this conversation a couple of times, but JuMP might not be the best tool for the job. We don’t optimize for writing to a file. Part of the “write” is actually a “copy the entire model in memory at least once”, which is probably part of the issue. I’ll have a think to see if there’s a way we could improve things.

Yeah, that’s because the “write” isn’t timing only the write to file. It also has a bunch of overhead on the JuMP side to turn the problem into something that can be written to an MPS file (which involves a copy of the entire model), to make sure every variable and constraint has a unique name, to order the columns, etc. The issue isn’t the compression.
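
One way to confirm this is to time the model copy separately from the disk write by calling MOI.FileFormats directly. A hedged sketch, assuming m is the JuMP model from above and the path is illustrative:

using JuMP
import MathOptInterface as MOI

dest    = MOI.FileFormats.Model(format = MOI.FileFormats.FORMAT_MPS)
t_copy  = @elapsed MOI.copy_to(dest, backend(m))           # copy the model into the MPS writer
t_write = @elapsed MOI.write_to_file(dest, "/data/my.mps") # serialize it to disk
@show t_copy t_write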


Where is the source code for JuMP.write_to_file?

Why doesn’t JuMP.write_to_file support xz compression, which is available in the TranscodingStreams.jl ecosystem via CodecXz.jl?
https://jump.dev/JuMP.jl/stable/reference/models/#I/O

JuMP.write_to_file is a thin wrapper around MOI.write_to_file. Other compression extensions are possible, but need implementing.

PRs to improve things are welcome.
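
In the meantime, a hedged sketch of producing an xz-compressed MPS file yourself, assuming CodecXz.jl is installed and that the MPS writer can print to an arbitrary IO (which is how MOI implements write_to_file internally); the path is illustrative:

using JuMP, CodecXz
import MathOptInterface as MOI

dest = MOI.FileFormats.Model(format = MOI.FileFormats.FORMAT_MPS)
MOI.copy_to(dest, backend(m))
io = open("/data/my.mps.xz", "w")
xz = XzCompressorStream(io)
write(xz, dest)  # serialize the MPS model through the xz compressor
close(xz)        # flushes the codec and closes the file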

For your case, the better long-term outcome is probably to write a C interface to PDLP. Then you could go straight to the C library without having to read and write files.