String compression on Windows

Impressium · October 11, 2021, 8:05pm

Hi all

I’m forced to use Windows. Is there any string compression package that can be used on Windows? (Something like https://github.com/ararslan/Shoco.jl)

Thank you!

giordano · October 11, 2021, 8:10pm

@ararslan what does

The Shoco C library does not work on Windows due to lack of C99 support, which means that this package has the same restriction.

mean?

gbaraldi · October 11, 2021, 8:57pm

Looks like both the library and the package aren’t maintained. https://github.com/siara-cc/Unishox looks like a decent substitute: it’s pure C and it works with Unicode. I’m gonna try a local BinaryBuilder build to write a wrapper; the API seems to be just two functions.

Impressium · October 11, 2021, 9:03pm

That sounds awesome.

Impressium · October 11, 2021, 9:04pm

Maybe I have to describe my use case. I have a really large JSON string that I want to compress.

Oscar_Smith · October 11, 2021, 9:09pm

Rather than compressing JSON, you might want to consider using a more efficient format in the first place. Binary data formats are likely going to be smaller and faster than compressed JSON.

gbaraldi · October 11, 2021, 9:09pm

In that case https://github.com/JuliaIO/TranscodingStreams.jl is probably your best bet; Shoco and, apparently, Unishox are for short strings. Also, what Oscar said.
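
For a large JSON string, a general-purpose codec built on TranscodingStreams.jl might look like the sketch below. It assumes the CodecZlib.jl package (one of the codec packages built on TranscodingStreams.jl) is installed; the `json` value is a made-up stand-in, and gzip is just one possible choice of codec.

```julia
using CodecZlib  # assumption: CodecZlib.jl is installed; it builds on TranscodingStreams.jl

# Hypothetical stand-in for a large JSON string.
json = "[" * join(fill("{\"id\": 1, \"name\": \"example\"}", 1000), ", ") * "]"

# Compress the raw bytes of the string with gzip.
compressed = transcode(GzipCompressor, Vector{UInt8}(json))

# Decompress and turn the bytes back into a String.
restored = String(transcode(GzipDecompressor, compressed))

println(sizeof(json), " bytes -> ", length(compressed), " bytes")
```

The compressed result is a `Vector{UInt8}`, not a `String`, so it should be stored or sent as bytes.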

Impressium · October 11, 2021, 10:21pm

I know but I can’t decide that…

ararslan · October 19, 2021, 4:58pm

I… don’t entirely remember. Best I recall, Shoco doesn’t support Windows because it relies on C99 features that aren’t available on Windows, though maybe that’s just a limitation of MSVC (I don’t know) and maybe support for a 22-year-old standard has since been added. I’ve not thought about Shoco in a long time.

Palli · October 19, 2021, 5:41pm

I don’t know of anything better than Shoco, which I knew of, and Unishox, which I just learned of, except for my own idea. I intend to implement it in Julia. Basically it’s coding bigrams (or trigrams), not single letters: compressing two letters into a byte (or down to a third, 3 letters into 1 byte, or more likely 6 into 2), giving direct indexing (most often) and faster sorting.

If you see such a library please let me know, then I don’t have to do it. Or if you want to implement yourself or collaborate.

I see Portuguese compresses the most, down to 40% (except for Chinese and Japanese), a bit more than my bigram idea (while losing direct indexing). Do you need to support mostly one specific language, or a few? If so, which? My bigram idea is almost trivial to code, trigram more involved. Do you really need to get down to 33%?

I have lots of memory compression ideas that have higher priority, e.g. for numbers. I choose the priority based on what excites me most at the time. If you want direct indexing into a string then UTF-8 is already a problem, TranscodingStreams.jl would also be a problem, and it seems also the two short-string compression methods already mentioned.

Binary formats are more compact (for e.g. numbers), but I believe they (most of them at least) use UTF-8 still uncompressed, for the text parts in JSON.
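
To make the "two letters into a byte, with direct indexing" idea above concrete, here is a minimal sketch. It is plain 4-bit packing over a hypothetical 15-symbol alphabet, a deliberate simplification and not a general bigram table or the design described in the post; all names (`ALPHABET`, `pack_pairs`, `char_at`, `unpack_pairs`) are made up for illustration.

```julia
# Hypothetical alphabet of 15 symbols: each symbol gets a 4-bit code (0..14),
# so two symbols fit in one byte. Code 15 is reserved as padding.
const ALPHABET = collect("0123456789-[], ")
const CODE = Dict(c => UInt8(i - 1) for (i, c) in enumerate(ALPHABET))
const PAD = 0x0f

# Pack two characters per byte; throws a KeyError for characters outside ALPHABET.
function pack_pairs(s::AbstractString)
    codes = UInt8[CODE[c] for c in s]
    isodd(length(codes)) && push!(codes, PAD)
    return UInt8[(codes[i] << 4) | codes[i + 1] for i in 1:2:length(codes)]
end

# Direct indexing: character i lives in byte (i + 1) ÷ 2,
# in the high nibble if i is odd and the low nibble if i is even.
# Indexing the padding position of an odd-length input throws a BoundsError.
function char_at(packed::Vector{UInt8}, i::Integer)
    byte = packed[(i + 1) ÷ 2]
    code = isodd(i) ? byte >> 4 : byte & 0x0f
    return ALPHABET[code + 1]
end

# Unpack back to a String, skipping the padding code.
function unpack_pairs(packed::Vector{UInt8})
    io = IOBuffer()
    for byte in packed
        for code in (byte >> 4, byte & 0x0f)
            code == PAD || print(io, ALPHABET[code + 1])
        end
    end
    return String(take!(io))
end

packed = pack_pairs("[[-3], [4]]")
println(length(packed), " bytes for 11 characters")
```

This only reaches 50% and only works for tiny alphabets; a real bigram coder for natural language would need a table of frequent pairs and an escape mechanism for everything else.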

ararslan · October 21, 2021, 11:01pm

FYI @Impressium, I’ve updated Shoco to work with Julia 1.6 and it now supports Windows. If you end up using it, please open issues if you run into any problems!

gbaraldi · November 3, 2021, 9:34pm

I’ve just released https://github.com/gbaraldi/Unishox.jl which is the exact same thing except it looks like Unishox is getting more development than Shoco. I have no idea about differences in performance.

SortofDamocles · February 10, 2022, 6:38pm

There appears to be a bug in Unishox. I have a rather nasty string S (at the end of the post). The first time I tried to use Unishox to compress it, I got:

julia> Unishox.decompress(Unishox.compress(S))
ERROR: BoundsError: attempt to access 1218-element Vector{Int8} at index [1:2519]
Stacktrace:
 [1] throw_boundserror(A::Vector{Int8}, I::Tuple{UnitRange{Int64}})
   @ Base .\abstractarray.jl:651
 [2] checkbounds
   @ .\abstractarray.jl:616 [inlined]
 [3] getindex
   @ .\array.jl:807 [inlined]
 [4] decompress(s::String)
   @ Unishox C:\Users\rbuckalew\.julia\packages\Unishox\dahtN\src\Unishox.jl:40
 [5] top-level scope
   @ REPL[35]:1

After a restart, I tried the same commands again, but this time my Julia session simply crashed.

Here is the string that led to this behavior:

julia> S
"[[[[[[[[[[[-3]]]]], [[[[[-2]]]]]], [[[[[[-6]]]]], [[[[[-5]]]]]]], [[[[[[[-3]]]]], [[[[[-2]]]]]], [[[[[[-3]]]]], [[[[[-2]]]]]]], [[[[[[[4]]]]], [[[[[4]]]]]], [[[[[[7]]]]], [[[[[4]]]]]]]], [[[[[[[[-3]]]]], [[[[[0]]]]]], [[[[[[-3]]]]], [[[[[-1]]]]]]], [[[[[[[-6]]]]], [[[[[0]]]]]], [[[[[[-6]]]]], [[[[[-4]]]]]]], [[[[[[[-7]]]]], [[[[[-4]]]]]], [[[[[[1]]]]], [[[[[-2]]]]]]]], [[[[[[[[2]]]]], [[[[[5]]]]]], [[[[[[2]]]]], [[[[[1]]]]]]], [[[[[[[-1]]]]], [[[[[2]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[-2]]]]], [[[[[-2]]]]]], [[[[[[0]]]]], [[[[[0]]]]]]]]], [[[[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]]], [[[[[[-1]]]]]]]], [[[[[[[[-3]]]]], [[[[[4]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[1]]]]]], [[[[[[-3]]]]]]]], [[[[[[[[-2]]]]], [[[[[5]]]]]], [[[[[[0]]]]], [[[[[-1]]]]]]], [[[[[[[-2]]]]]], [[[[[[-6]]]]]]]]], [[[[[[[[[-1]]]]], [[[[[0]]]]]], [[[[[[2]]]]], [[[[[0]]]]]]], [[[[[[[-3]]]]]], [[[[[[-6]]]]]]]], [[[[[[[[-6]]]]], [[[[[4]]]]]], [[[[[[2]]]]], [[[[[-2]]]]]]], [[[[[[[1]]]]]], [[[[[[0]]]]]]]], [[[[[[[[-3]]]]], [[[[[4]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[-2]]]]]], [[[[[[-3]]]]]]]]]], [[[[[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[-2]]]]], [[[[[3]]]]]]], [[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[2]]]]], [[[[[2]]]]]], [[[[[[5]]]]], [[[[[2]]]]]]]], [[[[[[[[3]]]]], [[[[[4]]]]]], [[[[[[3]]]]], [[[[[3]]]]]]], [[[[[[[-2]]]]], [[[[[2]]]]]], [[[[[[-2]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]], [[[[[2]]]]]], [[[[[[-1]]]]], [[[[[0]]]]]]]], [[[[[[[[0]]]]], [[[[[1]]]]]], [[[[[[4]]]]], [[[[[5]]]]]]], [[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[2]]]]], [[[[[2]]]]]], [[[[[[0]]]]], [[[[[0]]]]]]]]], [[[[[[[[[-1]]]]], [[[[[0]]]]]], [[[[[[-1]]]]], [[[[[0]]]]]]], [[[[[[[-1]]]]]], [[[[[[-1]]]]]]]], [[[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]]], [[[[[[-1]]]]]]]], [[[[[[[[-4]]]]], [[[[[-3]]]]]], [[[[[[-4]]]]], [[[[[-3]]]]]]], [[[[[[[-4]]]]]], [[[[[[-4]]]]]]]]], [[[[[[[[[-3]]]]], [[[[[-2]]]]]], 
[[[[[[0]]]]], [[[[[-2]]]]]]], [[[[[[[-3]]]]]], [[[[[[-2]]]]]]]], [[[[[[[[-4]]]]], [[[[[0]]]]]], [[[[[[-2]]]]], [[[[[0]]]]]]], [[[[[[[-1]]]]]], [[[[[[-2]]]]]]]], [[[[[[[[-1]]]]], [[[[[0]]]]]], [[[[[[-1]]]]], [[[[[0]]]]]]], [[[[[[[0]]]]]], [[[[[[-1]]]]]]]]]], [[[[[[[[[[-3]]]]], 
[[[[[0]]]]]], [[[[[[-6]]]]], [[[[[-3]]]]]]], [[[[[[[-6]]]]], [[[[[0]]]]]], [[[[[[-6]]]]], [[[[[-4]]]]]]], [[[[[[[-7]]]]], [[[[[-4]]]]]], [[[[[[1]]]]], [[[[[-2]]]]]]]], [[[[[[[[-1]]]]], [[[[[0]]]]]], [[[[[[-4]]]]], [[[[[-3]]]]]]], [[[[[[[-6]]]]], [[[[[-2]]]]]], [[[[[[-6]]]]], [[[[[-2]]]]]]], [[[[[[[1]]]]], [[[[[4]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]]], [[[[[[[[-2]]]]], [[[[[-3]]]]]], [[[[[[2]]]]], [[[[[5]]]]]]], [[[[[[[-1]]]]], [[[[[-2]]]]]], [[[[[[-1]]]]], [[[[[2]]]]]]], [[[[[[[0]]]]], [[[[[0]]]]]], [[[[[[-4]]]]], [[[[[-4]]]]]]]]], [[[[[[[[[-3]]]]], [[[[[4]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[1]]]]]], [[[[[[-3]]]]]]]], [[[[[[[[3]]]]], [[[[[4]]]]]], [[[[[[3]]]]], [[[[[4]]]]]]], [[[[[[[-1]]]]]], [[[[[[-1]]]]]]]], [[[[[[[[-4]]]]], [[[[[-5]]]]]], [[[[[[-2]]]]], [[[[[5]]]]]]], [[[[[[[-5]]]]]], [[[[[[-1]]]]]]]]], [[[[[[[[[-6]]]]], [[[[[4]]]]]], [[[[[[2]]]]], [[[[[-2]]]]]]], [[[[[[[1]]]]]], [[[[[[-2]]]]]]]], [[[[[[[[-2]]]]], [[[[[2]]]]]], [[[[[[0]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]]], [[[[[[-4]]]]]]]], [[[[[[[[-1]]]]], [[[[[-2]]]]]], [[[[[[-3]]]]], [[[[[4]]]]]]], [[[[[[[-2]]]]]], [[[[[[-2]]]]]]]]]], [[[[[[[[[[-1]]]]], [[[[[2]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[-1]]]]], [[[[[2]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[-2]]]]], [[[[[-2]]]]]], [[[[[[0]]]]], [[[[[0]]]]]]]], [[[[[[[[-3]]]]], [[[[[-2]]]]]], [[[[[[-3]]]]], [[[[[-2]]]]]]], [[[[[[[-3]]]]], [[[[[-2]]]]]], [[[[[[-3]]]]], [[[[[-2]]]]]]], [[[[[[[4]]]]], [[[[[4]]]]]], [[[[[[2]]]]], [[[[[2]]]]]]]], [[[[[[[[-1]]]]], [[[[[-2]]]]]], [[[[[[-1]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]], [[[[[-2]]]]]], [[[[[[-1]]]]], [[[[[2]]]]]]], [[[[[[[0]]]]], [[[[[0]]]]]], [[[[[[-4]]]]], [[[[[-4]]]]]]]]], [[[[[[[[[-2]]]]], [[[[[5]]]]]], [[[[[[0]]]]], [[[[[-1]]]]]]], [[[[[[[-2]]]]]], [[[[[[-6]]]]]]]], [[[[[[[[-2]]]]], [[[[[-1]]]]]], [[[[[[-2]]]]], [[[[[-1]]]]]]], [[[[[[[-4]]]]]], [[[[[[-4]]]]]]]], [[[[[[[[-3]]]]], [[[[[-4]]]]]], [[[[[[-5]]]]], [[[[[2]]]]]]], [[[[[[[-5]]]]]], [[[[[[-1]]]]]]]]], [[[[[[[[[-3]]]]], [[[[[4]]]]]], 
[[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[1]]]]]], [[[[[[-3]]]]]]]], [[[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]]], [[[[[[-1]]]]]]]], [[[[[[[[-1]]]]], [[[[[-2]]]]]], [[[[[[-3]]]]], [[[[[4]]]]]]], [[[[[[[-3]]]]]], [[[[[[1]]]]]]]]]]]"

gbaraldi · February 11, 2022, 1:28am

That’s a string among strings. I will take a look when I’m able, to see where the problem might be.

Palli · February 11, 2022, 10:00am

I see Unishox (i.e. Unishox2) has been updated to 1.0.2, and Julia’s wrapper seems to use 1.0.0. In the meantime some bugs were fixed on Sep 25, 2021.

The GitHub link to Unishox now redirects to Unishox2, but I see the former in Yggdrasil. Does that mean it was built for Unishox, not Unishox2? The two may or may not be incompatible (though I find it likely that they are).

Palli · February 11, 2022, 10:18am

Self-admitted in the docs (section 9.5):

Unishox was found to be the slowest of all since employs several [methods?] to achieve the
best compression. However this should not be too much of an issue in most
cases when a single string or few strings are handled at a time.

but it compresses best compared to the (non-Unicode) Shoco and Smaz (the latter in all cases except one, where Smaz gets down to 58 bytes vs 60). It’s also much better than the (Unicode-handling) SCSU and BOCU (one or both of which are patented).

My concern (and I’ve sent a question about it to the author) is that it seems to require validated UTF-8, i.e. I’m not sure it handles invalid UTF-8 or arbitrary byte strings.

gbaraldi · February 11, 2022, 10:26am

I haven’t looked at it for a little while. The package is very simple, so if there was no API breakage it should just be a case of updating the Yggdrasil build script.

gbaraldi · February 11, 2022, 11:03am

Some quick tests show that the library was returning the wrong number of bytes for the decompressed string, even though the part it does return is correctly decompressed. I am updating the JLL to see if that fixes it; otherwise I will open an issue upstream.

edit:
I found the issue: the string returned by Unishox after compression had a NUL character in the middle, which made Julia cut it off at that point. I’m not sure if that was expected or not, but it’s easy to fix on my part. Will tag a release with a fix.
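
The kind of truncation described above is what one would see whenever binary data is read back as a NUL-terminated C string. The sketch below is not the Unishox.jl source; `buf` is a made-up buffer standing in for compressed output, and the two helper functions are hypothetical names used only to contrast the two ways of reading it.

```julia
# Read bytes as a NUL-terminated C string: stops at the first 0x00 byte.
function read_until_nul(buf::Vector{UInt8})
    GC.@preserve buf unsafe_string(pointer(buf))
end

# Read exactly length(buf) bytes, NUL bytes included.
function read_with_length(buf::Vector{UInt8})
    GC.@preserve buf unsafe_string(pointer(buf), length(buf))
end

# Hypothetical "compressed" output with a NUL byte in the middle.
buf = UInt8[0x41, 0x42, 0x00, 0x43, 0x44]  # "AB", NUL, "CD"

println(ncodeunits(read_until_nul(buf)))    # 2
println(ncodeunits(read_with_length(buf)))  # 5
```

Passing the known output length explicitly, or keeping compressed data as a `Vector{UInt8}` rather than a `String`, avoids depending on a terminator that compressed bytes cannot guarantee.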
