String compression on Windows

Hi all :slight_smile:

I’m forced to use Windows. Is there any string compression package that works on Windows? (Something like GitHub - ararslan/Shoco.jl: Julia wrapper for the shoco string compression C library.)

Thank you!

@ararslan what does

The Shoco C library does not work on Windows due to lack of C99 support, which means that this package has the same restriction.

mean? :thinking:

Looks like neither the library nor the package is maintained. GitHub - siara-cc/Unishox: Guaranteed compression for Unicode short strings looks like a decent substitute: it’s pure C and it works with Unicode. I’m going to try a local BinaryBuilder build and write a wrapper; the API seems to be just two functions.

That sounds awesome.

Maybe I have to describe my use case. I have a really large JSON string that I want to compress.

Rather than compressing JSON, you might want to consider using a more efficient format in the first place. Binary data formats are likely to be smaller and faster than compressed JSON.

In that case GitHub - JuliaIO/TranscodingStreams.jl: Simple, consistent interfaces for any codec. is probably your best bet. Shoco, and apparently Unishox too, are meant for short strings. Also, what Oscar said.
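
For a large JSON string, a general-purpose codec round trip could look roughly like the sketch below. It assumes CodecZlib.jl (one of the codec packages built on TranscodingStreams.jl) is installed, and the `json` value is made-up stand-in data, not anything from this thread.

```julia
using CodecZlib  # gzip/zlib/deflate codecs built on TranscodingStreams.jl

# Stand-in for the real payload: a repetitive JSON-like string (made-up data).
json = "[" * join(("{\"id\": $i, \"value\": [1, 2, 3]}" for i in 1:1000), ", ") * "]"

# Compress the raw bytes of the string in one call.
compressed = transcode(GzipCompressor, Vector{UInt8}(json))

# Decompress and turn the bytes back into a String.
restored = String(transcode(GzipDecompressor, compressed))

println(sizeof(json), " bytes -> ", length(compressed), " bytes")
```

Other codec packages in the same ecosystem expose the same `transcode` pattern, so swapping the compressor type should be the only change needed to compare them.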

I know, but that’s not my decision to make…

I… don’t entirely remember. As best I recall, Shoco doesn’t support Windows because it relies on C99 features that aren’t available on Windows, though maybe that’s just a limitation of MSVC (I don’t know) and maybe support for a 22-year-old standard has since been added. I’ve not thought about Shoco in a long time. :sweat_smile:

I don’t know of anything better than Shoco, which I already knew of, and Unishox, which I just learned of, except for my own idea. I intend to implement it in Julia. Basically it codes bigrams (or trigrams) rather than single letters, compressing two letters into one byte (or down to a third: 3 letters into 1 byte, or more likely 6 into 2), which gives direct indexing (most often) and faster sorting.

If you see such a library, please let me know; then I don’t have to write it. Also let me know if you want to implement it yourself or collaborate.

I see Portuguese compresses the most, down to 40% (except for Chinese and Japanese), a bit better than my bigram idea (while losing direct indexing). Do you need to support mostly one specific language, or a few? If so, which? My bigram idea is almost trivial to code, the trigram one more involved. Do you really need to get down to 33%?
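
To make the "two letters into one byte with direct indexing" idea concrete, here is a toy sketch. It is not the design described above: it simply assumes a hypothetical 16-symbol alphabet so that each symbol fits in 4 bits and every pair fits exactly in a byte. The names `ALPHABET`, `pack_pairs`, `unpack_pairs` and `char_at` are made up for illustration.

```julia
# Hypothetical 16-symbol alphabet (an assumption for this toy example).
const ALPHABET = collect(" etaoinshrdlucmw")
const CODE = Dict(c => UInt8(i - 1) for (i, c) in enumerate(ALPHABET))

# Pack each pair of characters into one byte (high nibble, low nibble).
function pack_pairs(s::AbstractString)
    chars = collect(s)
    isodd(length(chars)) && push!(chars, ' ')  # pad to an even length
    return UInt8[(CODE[chars[i]] << 4) | CODE[chars[i + 1]] for i in 1:2:length(chars)]
end

# Reverse the packing; an odd-length input comes back with one trailing space.
function unpack_pairs(bytes::Vector{UInt8})
    io = IOBuffer()
    for b in bytes
        print(io, ALPHABET[(b >> 4) + 1], ALPHABET[(b & 0x0f) + 1])
    end
    return String(take!(io))
end

# Direct indexing: character k lives in byte (k + 1) ÷ 2, no scan needed.
function char_at(bytes::Vector{UInt8}, k::Integer)
    b = bytes[(k + 1) ÷ 2]
    return isodd(k) ? ALPHABET[(b >> 4) + 1] : ALPHABET[(b & 0x0f) + 1]
end

packed = pack_pairs("the moon")
```

A real bigram or trigram coder would need a table chosen per language and an escape mechanism for symbols outside the table, which is where the "more involved" part comes in.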

I have lots of memory compression ideas that have higher priority, e.g. for numbers. I choose the priority based on what excites me most at the time. If you want direct indexing into a string, then UTF-8 is already a problem, TranscodingStreams.jl would also be a problem, and so, it seems, would the two short-string compression methods already mentioned.

Binary formats are more compact (e.g. for numbers), but I believe they (most of them at least) still store the text parts of the JSON as uncompressed UTF-8.

FYI @Impressium, I’ve updated Shoco to work with Julia 1.6 and it now supports Windows. If you end up using it, please open issues if you run into any problems!

I’ve just released GitHub - gbaraldi/Unishox.jl: Julia package for the unishox string compression library, which is the same kind of wrapper, except that Unishox appears to be getting more development than Shoco. I have no idea about differences in performance.

There appears to be a bug in Unishox. I have a rather nasty string S (at the end of the post). The first time I tried to use Unishox to compress it, I got:

julia> Unishox.decompress(Unishox.compress(S))
ERROR: BoundsError: attempt to access 1218-element Vector{Int8} at index [1:2519]
Stacktrace:
 [1] throw_boundserror(A::Vector{Int8}, I::Tuple{UnitRange{Int64}})
   @ Base .\abstractarray.jl:651
 [2] checkbounds
   @ .\abstractarray.jl:616 [inlined]
 [3] getindex
   @ .\array.jl:807 [inlined]
 [4] decompress(s::String)
   @ Unishox C:\Users\rbuckalew\.julia\packages\Unishox\dahtN\src\Unishox.jl:40
 [5] top-level scope
   @ REPL[35]:1

After a restart, I tried the same commands again, but this time my Julia session simply crashed.

Here is the string that led to this behavior:

julia> S
"[[[[[[[[[[[-3]]]]], [[[[[-2]]]]]], [[[[[[-6]]]]], [[[[[-5]]]]]]], [[[[[[[-3]]]]], [[[[[-2]]]]]], [[[[[[-3]]]]], [[[[[-2]]]]]]], [[[[[[[4]]]]], [[[[[4]]]]]], [[[[[[7]]]]], [[[[[4]]]]]]]], [[[[[[[[-3]]]]], [[[[[0]]]]]], [[[[[[-3]]]]], [[[[[-1]]]]]]], [[[[[[[-6]]]]], [[[[[0]]]]]], [[[[[[-6]]]]], [[[[[-4]]]]]]], [[[[[[[-7]]]]], [[[[[-4]]]]]], [[[[[[1]]]]], [[[[[-2]]]]]]]], [[[[[[[[2]]]]], [[[[[5]]]]]], [[[[[[2]]]]], [[[[[1]]]]]]], [[[[[[[-1]]]]], [[[[[2]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[-2]]]]], [[[[[-2]]]]]], [[[[[[0]]]]], [[[[[0]]]]]]]]], [[[[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]]], [[[[[[-1]]]]]]]], [[[[[[[[-3]]]]], [[[[[4]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[1]]]]]], [[[[[[-3]]]]]]]], [[[[[[[[-2]]]]], [[[[[5]]]]]], [[[[[[0]]]]], [[[[[-1]]]]]]], [[[[[[[-2]]]]]], [[[[[[-6]]]]]]]]], [[[[[[[[[-1]]]]], [[[[[0]]]]]], [[[[[[2]]]]], [[[[[0]]]]]]], [[[[[[[-3]]]]]], [[[[[[-6]]]]]]]], [[[[[[[[-6]]]]], [[[[[4]]]]]], [[[[[[2]]]]], [[[[[-2]]]]]]], [[[[[[[1]]]]]], [[[[[[0]]]]]]]], [[[[[[[[-3]]]]], [[[[[4]]]]]], [[[[[[-1]]]]], [[[[[-2]]]]]]], [[[[[[[-2]]]]]], [[[[[[-3]]]]]]]]]], [[[[[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[-2]]]]], [[[[[3]]]]]]], [[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[2]]]]], [[[[[2]]]]]], [[[[[[5]]]]], [[[[[2]]]]]]]], [[[[[[[[3]]]]], [[[[[4]]]]]], [[[[[[3]]]]], [[[[[3]]]]]]], [[[[[[[-2]]]]], [[[[[2]]]]]], [[[[[[-2]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]], [[[[[2]]]]]], [[[[[[-1]]]]], [[[[[0]]]]]]]], [[[[[[[[0]]]]], [[[[[1]]]]]], [[[[[[4]]]]], [[[[[5]]]]]]], [[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[2]]]]], [[[[[2]]]]]], [[[[[[0]]]]], [[[[[0]]]]]]]]], [[[[[[[[[-1]]]]], [[[[[0]]]]]], [[[[[[-1]]]]], [[[[[0]]]]]]], [[[[[[[-1]]]]]], [[[[[[-1]]]]]]]], [[[[[[[[1]]]]], [[[[[2]]]]]], [[[[[[1]]]]], [[[[[2]]]]]]], [[[[[[[-1]]]]]], [[[[[[-1]]]]]]]], [[[[[[[[-4]]]]], [[[[[-3]]]]]], [[[[[[-4]]]]], [[[[[-3]]]]]]], [[[[[[[-4]]]]]], [[[[[[-4]]]]]]]]], [[[[[[[[[-3]]]]], [[[[[-2]]]]]], 
[[[[[[0]]]]], [[[[[-2]]]]]]], [[[[[[[-3]]]]]], [[[[[[-2]]]]]]]], [[[[[[[[-4]]]]], [[[[[0]]]]]], [[[[[[-2]]]]], [[[[[0]]]]]]], [[[[[[[-1]]]]]], [[[[[[-2]]]]]]]], [[[[[[[[-1]]]]], [[[[[0]]]]]], [[[[[[-1]]]]], [[[[[0]]]]]]], [[[[[[[0]]]]]], [[[[[[-1]]]]]]]]]], [[[[[[[[[[-3]]]]], 


That’s a string among strings. I will take a look when I’m able, to see where the problem might be.

I see Unishox (i.e. Unishox2) has been updated to 1.0.2, and Julia’s wrapper seems to use 1.0.0. In the meantime some bugs were fixed on Sep 25, 2021.

The GitHub link to Unishox now redirects to Unishox2, but I see the former in Yggdrasil. Does that mean it was built for Unishox rather than Unishox2, which may or may not be incompatible (though I find it likely)?

Self-admitted in the docs (section 9.5):

Unishox was found to be the slowest of all since employs several [methods?] to achieve the
best compression. However this should not be too much of an issue in most
cases when a single string or few strings are handled at a time.

But it compresses best compared to the (non-Unicode) Shoco and Smaz (beating Smaz in all cases except one, where Smaz gets down to 58 bytes vs. 60). It’s also much better than the (Unicode-handling) SCSU and BOCU (one or both of which are patented).

My concern (and I’ve sent a question about it to the author) is that it seems to require validated UTF-8, i.e. I’m not sure it handles invalid UTF-8 or arbitrary byte strings.

I haven’t looked at it for a little while. The package is very simple, so if there was no API breakage it should just be a matter of updating the Yggdrasil build script.

Some quick tests show that the library itself was returning the wrong number of bytes for the uncompressed string, even though the part it does return is correctly decompressed. I am updating the JLL to see if that fixes it; otherwise I will open an issue upstream :slight_smile:.

Edit:
I found the issue. The string returned by Unishox after compression had a NUL character in the middle, which made Julia cut it off at that point. I’m not sure whether that was expected or not, but it’s easy to fix on my part. I will tag a release with a fix.
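
The kind of truncation described here can be reproduced in plain Julia. The sketch below is not the actual Unishox.jl code; it only assumes that a buffer filled by a C library gets read back as a NUL-terminated C string, and the byte values are made up.

```julia
# Made-up output buffer: the bytes "ab", an embedded NUL, "cd", then a terminator.
buf = UInt8[0x61, 0x62, 0x00, 0x63, 0x64, 0x00]

# Reading it as a C string stops at the first NUL byte.
truncated = GC.@preserve buf unsafe_string(pointer(buf))

# Passing the length explicitly keeps every byte, including the NUL.
full = GC.@preserve buf unsafe_string(pointer(buf), 5)

println(sizeof(truncated), " vs ", sizeof(full))  # prints "2 vs 5"
```

Compressed output is binary data, so a wrapper generally has to carry the byte count returned by the library rather than rely on a terminator.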
