String compression on Windows

Hi all :slight_smile:

I’m forced to use Windows. Is there a string compression package that works on Windows? (Something like GitHub - ararslan/Shoco.jl: Julia wrapper for the shoco string compression C library.)

Thank you!

@ararslan what does

The Shoco C library does not work on Windows due to lack of C99 support, which means that this package has the same restriction.

mean? :thinking:

Looks like neither the library nor the package is maintained. GitHub - siara-cc/Unishox: Guaranteed compression for Unicode short strings looks like a decent substitute: it’s pure C and it works with Unicode. I’m going to try a local BinaryBuilder build and write a wrapper; the API seems to be just two functions.

That sounds awesome.

Maybe I should describe my use case: I have a really large JSON string that I want to compress.

Rather than compressing JSON, you might want to consider using a more efficient format in the first place. Binary data formats are likely to be smaller and faster than compressed JSON.

In that case GitHub - JuliaIO/TranscodingStreams.jl: Simple, consistent interfaces for any codec is probably your best bet; Shoco and, apparently, Unishox are meant for short strings. Also, what Oscar said.
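
For a large JSON string, a minimal sketch of that route could look like the following. It assumes the CodecZlib.jl package (one of the codec packages built on TranscodingStreams.jl) is installed, and the JSON string is a made-up stand-in, so treat it as an illustration rather than a recommendation of gzip over other codecs:

```julia
using CodecZlib  # assumption: installed with `] add CodecZlib`

# Made-up stand-in for a large JSON string.
json = "[" * join(["{\"id\": $i, \"name\": \"item $i\"}" for i in 1:10_000], ",") * "]"

# Compress the raw bytes of the string in one call.
compressed = transcode(GzipCompressor, Vector{UInt8}(json))

# Decompress and turn the bytes back into a String.
restored = String(transcode(GzipDecompressor, compressed))

println(sizeof(json), " bytes -> ", length(compressed), " bytes")
```

The other codec packages follow the same `transcode(Codec, bytes)` pattern, so swapping the codec should only require changing the two codec names.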

I know, but that’s not my decision to make…

I… don’t entirely remember. As best I recall, Shoco doesn’t support Windows because it relies on C99 features that aren’t available on Windows, though maybe that’s just a limitation of MSVC (I don’t know), and maybe support for a 22-year-old standard has since been added. I’ve not thought about Shoco in a long time. :sweat_smile:

I don’t know of anything better than Shoco, which I already knew of, and Unishox, which I just learned of, except for my own idea, which I intend to implement in Julia. Basically, it encodes bigrams (or trigrams) rather than single letters: two letters are compressed into one byte (or, to get down to a third, three letters into one byte, or more likely six into two bytes). That gives direct indexing (in most cases) and faster sorting.

If you see such a library, please let me know, so that I don’t have to write it. Also let me know if you want to implement it yourself or collaborate.

I see that Portuguese compresses the most, down to 40% (except for Chinese and Japanese), which is a bit more compression than my bigram idea gives (while losing direct indexing). Do you mostly need to support one specific language, or a few? If so, which? My bigram idea is almost trivial to code; the trigram version is more involved. Do you really need to get down to 33%?
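
To make the two-letters-per-byte idea concrete, here is a rough sketch under a strong simplifying assumption: a hypothetical 16-symbol alphabet, so that every pair of letters fits into one byte (4 bits per letter). A real bigram scheme would need a table of frequent pairs plus an escape for everything else. The names `ALPHABET`, `pack_pairs`, `letter_at` and `unpack_pairs` are made up for this illustration:

```julia
# Hypothetical 16-symbol alphabet (an assumption of this sketch).
const ALPHABET = collect(" etaoinshrdlucmw")
const CODE = Dict(c => UInt8(i - 1) for (i, c) in enumerate(ALPHABET))

# Pack two letters per byte; odd-length input is padded with code 0 (a space).
function pack_pairs(s::AbstractString)
    codes = UInt8[CODE[c] for c in s]   # KeyError for letters outside ALPHABET
    isodd(length(codes)) && push!(codes, 0x00)
    return UInt8[(codes[i] << 4) | codes[i + 1] for i in 1:2:length(codes)]
end

# Direct indexing: letter i lives in byte cld(i, 2), in the high nibble if i is odd.
function letter_at(packed::Vector{UInt8}, i::Integer)
    b = packed[cld(i, 2)]
    code = isodd(i) ? b >> 4 : b & 0x0f
    return ALPHABET[code + 1]
end

# Unpack the first n letters (n is needed because of the possible padding).
unpack_pairs(packed::Vector{UInt8}, n::Integer) =
    join(letter_at(packed, i) for i in 1:n)
```

With this layout, `pack_pairs("the cat sat on the mat")` needs 11 bytes for 22 letters, and `letter_at` reads any single letter without unpacking the rest.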

I have lots of memory compression ideas that have higher priority, e.g. for numbers. I choose the priority based on what excites me most at the time. If you want direct indexing into a string, then UTF-8 is already a problem; TranscodingStreams.jl would be a problem too, and so, it seems, would the two short-string compression methods already mentioned.

Binary formats are more compact (e.g. for numbers), but I believe most of them still store the text parts of the JSON as uncompressed UTF-8.