Problem processing non utf8 string


#1

I am brand new to Julia, and this is my first project, writing software for an archiving/backup system, from a multitude of different sources. I am trying to read in and processes files (often up to a gigabyte big) of what is essentially records of text data (one record per line) but the meaning and use of the upper 128 possible byte values is very varied, often even within one record. So, I need to inspect, slice, create substrings, compare, hash (to look for repetition etc.) these records and substrings of these records, always referenced by byte count from the beginning of the record or substring, and regard the number of possible different characters as 256. (From my point of view, for example a record or a bit of a record may actually be utf8 Unicode, but handling a 2 byte uft8 character as two separate characters for me is a non issue. For most processing I simply do not need to discover whether if is Unicode, symbols from a ‘code-page’, peculiar to the owner of that record, a corruption, parity or whatever).

I seem to be having having big problems with julia handling my records. I seem to have two choices. I can either first convert each byte of my record into utf8 string format, (where all bytes over 127 would become two bytes, the first of which would be 0xc2 or 0xc3) and then, after processing, I need to convert back again. Alternatively I can convert into an array of characters each of which is 32bits (4 bytes) long, and then convert that back into to a byte string again. The actual processing in this latter form seems much harder. (To set about my problem regarding the whole file as an integer array is even more messy.) I just seem to clash with the language somewhere, either in the conversions, and/or in the bit of i/o handling still required. I really have not yet contemplated creating my own string abstraction.

The documentation emphasises early on how easy the language is to process ‘c’ like strings, where, quite frankly, the typing difference between a string, a character array, and a byte array seems to be practically be non existent. (I would never claim to be good at ‘C’). However, this approach, if made to work in julia would be the end of my woes, but I cannot suss what this documentation means in practical terms, or how it it easy. Locale values seem to make no difference.

I have to admit, I can still see many avenues that I could spend spend many a (happy?) hour following up. E.g. seeing how to implement a package such as “@nalimilan’s StringEncoding”. What this would involve I have yet to learn (‘making’?? -> dependencies?? -> breaking seemingly irrelevant things that at least now work??? - only to find out it was a blind avenue for what I want anyway.) I need to get on with my project, and I need to find out out if tool ‘julia’ is a Godsend or something I that will never suite me. I feel a kind word from people experienced with the tool would be such a help.

Please could someone tell me which approach to use for my problem, and explain in simple terms the functions and syntax to use them in solving my problem.

Thanks.


#2

At the risk of a certain person falsely claiming that I’m “using every thread as a chance to plug your own work”, I think that you might be interested in the packages in the JuliaString org, in particular Strs.jl.
Unfortunately, it is still in the process of being registered, so you might want to just take a look at it for now, and hopefully very soon you’ll be able to simply do Pkg.add("Strs") to make it available via using Strs in your Julia session.

In particular, I think you should probably use the Text1Str type from Strs, which represents simple text strings with 1-byte codeunits, without specifying what the particular encoding is.
Unlike storing them simply as Vector{UInt8} (which would be your best option using Base Julia, IMO), you can use string operations on a Text1Str (there is also a BinaryStr, for non-text byte strings).

If you already have experience in other languages with string processing, such as dealing with character sets and encodings, Unicode and different Unicode representations, I would encourage you to think about contributing to JuliaString.org (I’ve been planning to start on a pure Julia StrEncodings package as soon as the base Str* packages are fully released, and any help there would be appreciated).

Using Strs instead of base String and Char also has the advantage of much better performance for most all string functions, something I’ll be trying to plot the results of some rather extensive comparative benchmarking soon.


#3

If you neither know nor care what the encoding is, why do you need a string type at all? Why not just process arrays of bytes?


#4

He did say “For most processing”, not for all processing, and he still may wish to search the strings for ASCII values, etc, without worrying about whether or not bytes with the high bit set represent part of a UTF-8 sequence, a S-JIS or EUC sequence, an ISO character or some Windows CP character.
That was my problem with Julia from the start with the lack of any “BinaryString” type, and the issue that if you used Vector{UInt8}, you couldn’t even use the string operations for concatenation (’*’).


#5

Are you using 0.6? In the latest Julia versions (master / soon-to-be-released 0.7-alpha), the standard String type can hold arbitrary data, but gives it a UTF-8 like interpretation and display. On 0.6 and earlier some operations insisted on valid UTF-8 data in Strings, which indeed did lead to annoyances along the lines of what you’re describing. Those annoyances are gone on Julia master.


#6

2 posts were split to a new topic: Binary values in strings


#8

Yet another alternative to my statement 'Alternatively I can convert into an array of characters each of which is 32bits (4 bytes) long, and then convert that back into to a byte string again. The actual processing in this latter form seems much harder. (To set about my problem regarding the whole file as an integer array is even more messy. ’ Your alternative, with several others, was considered , but thought more messy than the string array alternative, but my question was quite long enough.

My objective is to rationalise files extracted from many systems dating back to well pre the invention of the microprocessor. The collection has been formed from a mentality of excessive but badly organised and never documented backing up of files, drives, and partitions onto another computer, and then that computer itself has been backed onto a third, etc. etc. Many systems housed emulators. All the files belonging to these are in the collection, but indexed in different ways to the original system or other emulators because of the host. The collection is not in one place, it exists in several currently used systems, abandoned but kept drives, disused but kept systems (now of sentimental/antique value). Most of this multi-terabyte collection is clearly duplication, but because of disk failure any one of the apparent duplicates could contain wanted information that is not anywhere else. I need to rationalise this, hopefully identifying all unique information and storing it once, properly indexed and usable, and of a total size smaller than a common USB drive, which can then itself be duplicated many times and stored in different places making the whole collection immune to loss from device failure, a building fire, cyber attack, or whatever.

First I used SHA1 to make a ‘text’ file of every file in one part of the collection, and this was the initial creator of the one of the files that are the subject of this thread. Each ‘0x0A’ terminated line line refers to a file, consisting of fixed length metadata: a hash of the contents, and its date.
Then follows e.g.
\home\myname\DirectoyOfCollection\Mess1\Mess2\Mess3/Mess3/Mess4/Mess5\Mess6\mess7\mess8.

Mess6 thru Mess8 containing bytes>127. mess8 has to be a filename. Quite evidently (delete last two words, I am pushing my luck) the signature of some system backed up into probably a Microsoft system, backed up into the Linux system, backed into the ext4 system used to produce the file. (I adore that ext4 allows almost anything but ‘/’ and ‘0x0a’ in its filenames, it will back a whole Windows file structure directly. It does not correctly recreate the actual structure in itself, but a simple intersystem copy will restore the whole structure back into the Windows system, no problem.) Note: I just don’t care what mess6 thru 8 looks like to a user of the target system. I don’t even care what the system is. The first thing I could do is strip off \home\myname\DirectoryOfCollection\ off the front of each entry leaving alone the metadata and all after: a string operation. This front end is not just a waste of space it fouls things up if a different computer is used to create a file from a different chunk of of my collection. However, before this I thought I would asses the scale of duplication issue. I will never practically do this unless I bring the duplicated files together, because the file is bigger than my available RAM. (Hint: ever wanted to ensure a operation that would take a few hours now takes a few weeks (thus provoke failure because the power gets disrupted)? Solution: provoke a Linux kernel into using a swap space.) No, I need to bring the duplication candidates together or close. Easy, it is why I introduced the hash metadata. In bash: export LC_ALL=C;sort filename. (another string operation, but not yet handed to Julia.) Visual inspection: far from all files with the same hash had the same filename. They seemed mainly library files. I didn’t check, but I fully believe they did have the same content, it is most unlikely the hash uniqueness is failing that often that systematically. Some system(s) are evidently offering the same library file to be included under different names. In my system I definitely want each name stored separately, so the hash is not a file ID in all but a very few (manually handleable) cases as hoped. Should be easy, extract the filename (Mess8 above) and insert into metadata between the hash and the date. However, I decided 5 character would be enough, I wanted to keep the metadata standard in length. Now I am introducing >127 characters into my metadata!, but no matter, the sort will sill bring the ‘identical’ files together. Julia woes. findlast doesn’t even work with UTF8 strings owing to finding a continuation bytes. An issue already noted and possibly fixed by the authors, but not in in my release if released yet at all. So I write my own: (I will concentrate on the one directory separator here). (In my autism I find the ‘Python’ layout quite confusing, I squash things together tightly using ; widely. I will try expanding here, but excuse if I get spaces wrong.)

function lslsh(a)
  i=length(a)
  while i>1;
    i-=1
    if (try;a[i]=='/';catch;false;end)
    return i+1
  end
 end
 return 1
end

h2=open("ModiedVersion.txt","a")
 open("ExistingVersion") do h1
  while !eof(h1)
   lS=readline(h1)
   if length(lS)<43      #dummy line got in
      write(h2,lS,"\n")  #just echo it
   else
      mS=lS[1:41]*(lS[lslsh(lS):end]*" "^6)[1:5]*lS[41:end]
      write(h2,mS,"\n")
   end
  end;
end
close(h2)

(now deep into string functions!). Worked over a quarter of a million times, then failed. The first byte of the ‘filename’ was >0x7f and <0xc0. StringToBeSliced[x:end] works fine with bytes over 127 providing the xth byte (and possibly the end byte) are not between 0x80 and 0xbf, ‘a continuation byte’, then it fails. Now try to write a ‘catch for that’. First it has to be passed through verbatim. Its value is given in the error message but just how is it extracted with ANY julia function without type converting, and doing that seems to be a minefield in itself. Now pass though the rest of the slice, -but what if the next byte of that is a ‘continuation byte?’ This is clearly a bodge too far.

As the processing gets deeper so the more ‘you name a string function, and it will quite likely be useful’.

Yes, I could re-write all string functions to work with byte arrays, but it is a bodge making the code less clear and a lot of work. I am fighting not working with the language. Logically I am truly handling strings – but I still don’t know, or care, what any of the bytes over 127 actually represent.

I hope my experiences so with the language is of interest to at least a few.


#9

Why is it that that whenever I ask a question the simple solution will always be available – err – sometime in the near future? Yes if Pkg.add(“Strs”) is as simple as you indicate, without experiences mentioned in my question, that is great when it comes. I look forward to this. I would love to get experienced and enthusiastically involved and contribute. However, I have been trying to retire for years, and find the time for this type of thing, but my commitments still overwhelming, and I can make no promises. Currently I am looking for ready made tools to use. I thank you for this message. I will watch out.


#10
Pkg.clone("https://github.com/JuliaString/Strs.jl")

#11

| | || | | | (| | | Version 0.6.2 (2017-12-13 18:08 UTC)
/ |_|||_’_| | Official http://julialang.org/ release
|__/ | x86_64-pc-linux-gnu

julia> Pkg.clone(“https://github.com/JuliaString/Strs.jl”)
INFO: Initializing package repository /home/richard/.julia/v0.6
INFO: Cloning METADATA from https://github.com/JuliaLang/METADATA.jl
INFO: Cloning Strs from https://github.com/JuliaString/Strs.jl
INFO: Computing changes…
ERROR: unknown package StrRegex required by Strs
Stacktrace:
[1] requirements(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Base.Pkg.Types.Fixed}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}) at ./pkg/query.jl:20
[2] resolve(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{String,Tuple{VersionNumber,Bool}}, ::Dict{String,Base.Pkg.Types.Fixed}, ::Dict{String,VersionNumber}, ::Set{String}) at ./pkg/entry.jl:480
[3] resolve(::Dict{String,Base.Pkg.Types.VersionSet}, ::Dict{String,Dict{VersionNumber,Base.Pkg.Types.Available}}, ::Dict{String,Tuple{VersionNumber,Bool}}, ::Dict{String,Base.Pkg.Types.Fixed}) at ./pkg/entry.jl:479
[4] clone(::String, ::SubString{String}) at ./pkg/entry.jl:205
[5] (::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#clone,Tuple{String}})() at ./pkg/dir.jl:36
[6] cd(::Base.Pkg.Dir.##4#7{Array{Any,1},Base.Pkg.Entry.#clone,Tuple{String}}, ::String) at ./file.jl:70
[7] #cd#1(::Array{Any,1}, ::Function, ::Function, ::String, ::Vararg{String,N} where N) at ./pkg/dir.jl:36
[8] clone(::String) at ./pkg/pkg.jl:169

julia> Text1Str(“hello”)
ERROR: UndefVarError: Text1Str not defined

julia>

I’m out of my depth


#12

I personally found it extremely hard to parse that large block of text, but from what I understand, you have a file consisting of lines of the form

HDP\n

where the H is the fixed-length sha1 hash, D is a fixed-length date, and P is an arbitrary length path, and \n is the newline (\x0a) character. The wrinkle is that P may have invalid UTF8 bytes (but is still an 8-bit format)?

Now you want to rewrite the format so that it is of the form:

HNDP\n

where N is the first 5 characters of the filename extracted from P (based on the code, it looks like you want shorter names padded with spaces)?

If I have that correct, then I have a couple of questions:

  1. is the final path separator always / or can it be \ (since as you point out, \ is a valid filename on ext filesystems)?
  2. why not hash the filename (or create a single hash of the filename+contents)?

#13

Assuming that my interpretation is correct (and the last separator is always /), then something like the following should work:

function append_5chars(newfile, oldfile)
    open(newfile,"a") do f_out
        open(oldfile) do f_in
            while !eof(f_in)
                write(f_out, read(f_in, 40)) # copy first 40 chars
                arr = readuntil(f_in, 0x0a) # 0x0a == UInt8('\n')
                k = findlast(arr, 0x2f)     # 0x2f == UInt8('/')
                n = write(f_out, arr[k+1:min(end,k+5)])
                while n < 5 # append extra spaces
                    n += write(f_out, ' ')
                end
                write(f_out, arr)
            end
        end
    end
end

If performance is critical, this could probably be done with pre-allocated buffers.


#14

Perhaps you already considered this, but you could use Buzhash, used by deduplicating backup systems such as borg backup and attic.


#15

Wow! thanks for that. Been playing a bit, and it seems I was showing my
sheer inexperience with the language when I decided (unsigned) byte array
would be even more messy than character arrays or strings. This seems much
more free of the typing nightmare I was in.

However some functions that I would have thought would work don’t. e.g.

ans<an2
ERROR: MethodError: no method matching isless(::Array{UInt8,1}, 
::Array{UInt8,1})
Stacktrace:
  [1] <(::Array{UInt8,1}, ::Array{UInt8,1}) at ./operators.jl:194

From definition of Base.isless

‘Test whether x is less than y , according to a canonical total order.
Values that are normally unordered, such as NaN ,
are ordered in an arbitrary but consistent fashion. This is the default
comparison used by sort .’

Well anss=sort(ans) works fine. Yet it seems I have to write my own
implementation of isless. Surely I am being stupid, somewhere. using
collect does not seem to help.


#16

In Julia 0.6 you can use lexless. In Julia 0.7 isless for arrays is now lexicographical comparison and lexless is no longer needed (https://github.com/JuliaLang/julia/pull/25380).


#17

I have managed to rationalise a hopeless mess of data from various sources
into consistent, as you put it, ‘files of large blocks of text’. Each block
has towards the left useless historic data as to how is got here. Towards
the right it has data intimately involved with what is needed for a
satisfactory backup/archive store. If only I could work out, as you put it,
‘parse’ it (and thus separate out and throw away the unwanted data), I
would be at least 95% of the way through cracking the entire huge messy
project.

Exactly.

This is certainly the ‘type’ of manipulation ‘parsing it’ involves.
However, this thread is much more about my woes with julia in getting this
far than actually progressing with the project.

Currently I am studying what might work by assuming, quite probably
naively, that all system either uses ‘/’ or ‘’ as its path separator, and
no target system actually uses either of these as ‘any old character for
use in a filename’. In as much as this might, and probably will, fail, my
approach would be to study the circumstances of the failures, and work out
in practice whether it really matters. I.e. as long as the target receives
a satisfactory string, it does not really matter if the archive system
‘thought’ there were more (or less) pathname splits than there really are
in the target system. This incorrect parsing on the archive system is going
to lead to inefficiency, and duplication of files stored. If the occasions
of this is rare, and the files stored are tiny it might not be worth
worrying about it at all. If it turns out to be a significant problem, one
solution would be to not use the pathname as a genuine pathname at all in
the backup system. All files would be stored named by its hash, and the
system extracts the hash from the, what now become a database of path
strings rather than genuine path strings. Obviously, all hashes would now
have to be verified unique, probably by appending a count to the hash. The
more I think about this, the more this idea appeals to me.

However, first I must work out how to parse the file so that, whatever the
source, the really relevant data is the rationalised, duplication disposed
of, and indexed in a systematic way.

You may notice the code I put out pads it out with spaces, but seriously,
my brain was far more on julia’s constant ‘errors caused by type
mishandling, and the persistent error caused by a byte read as a uft8
continuation character’, than ever it was on furthering the project. The
idea (your idea if this is what you meant and want the credit) of using the
entire filename and hashing it down to two or one bytes on the way through
is far superior, but I am not going to worry about this when I cannot
reliably even simply slice it and duplicate it.

I stuck a few bytes after the sha1 hash solely so filesort would group
entries that use same filename together, separated from entries that use a
different filename, to point to the same file actually on the medium. You
may think my addition is a really crude hash, even if more humanly obvious,
but functionally it is nothing but a crude hash, extending the sha1 hash.
Please see answer to last question.


#18

Thanks. There help here is fantastic! Why does ‘<’ not use lexless then,
rather then end up issuing a detailed ‘No Method’ message?


#19

I had the same question which is why it now does.