Yet another alternative to my statement 'Alternatively I can convert into an array of characters each of which is 32bits (4 bytes) long, and then convert that back into to a byte string again. The actual processing in this latter form seems much harder. (To set about my problem regarding the whole file as an integer array is even more messy. ’ Your alternative, with several others, was considered , but thought more messy than the string array alternative, but my question was quite long enough.
My objective is to rationalise files extracted from many systems dating back to well pre the invention of the microprocessor. The collection has been formed from a mentality of excessive but badly organised and never documented backing up of files, drives, and partitions onto another computer, and then that computer itself has been backed onto a third, etc. etc. Many systems housed emulators. All the files belonging to these are in the collection, but indexed in different ways to the original system or other emulators because of the host. The collection is not in one place, it exists in several currently used systems, abandoned but kept drives, disused but kept systems (now of sentimental/antique value). Most of this multi-terabyte collection is clearly duplication, but because of disk failure any one of the apparent duplicates could contain wanted information that is not anywhere else. I need to rationalise this, hopefully identifying all unique information and storing it once, properly indexed and usable, and of a total size smaller than a common USB drive, which can then itself be duplicated many times and stored in different places making the whole collection immune to loss from device failure, a building fire, cyber attack, or whatever.
First I used SHA1 to make a ‘text’ file of every file in one part of the collection, and this was the initial creator of the one of the files that are the subject of this thread. Each ‘0x0A’ terminated line line refers to a file, consisting of fixed length metadata: a hash of the contents, and its date.
Then follows e.g.
\home\myname\DirectoyOfCollection\Mess1\Mess2\Mess3/Mess3/Mess4/Mess5\Mess6\mess7\mess8.
Mess6 thru Mess8 containing bytes>127. mess8 has to be a filename. Quite evidently (delete last two words, I am pushing my luck) the signature of some system backed up into probably a Microsoft system, backed up into the Linux system, backed into the ext4 system used to produce the file. (I adore that ext4 allows almost anything but ‘/’ and ‘0x0a’ in its filenames, it will back a whole Windows file structure directly. It does not correctly recreate the actual structure in itself, but a simple intersystem copy will restore the whole structure back into the Windows system, no problem.) Note: I just don’t care what mess6 thru 8 looks like to a user of the target system. I don’t even care what the system is. The first thing I could do is strip off \home\myname\DirectoryOfCollection\ off the front of each entry leaving alone the metadata and all after: a string operation. This front end is not just a waste of space it fouls things up if a different computer is used to create a file from a different chunk of of my collection. However, before this I thought I would asses the scale of duplication issue. I will never practically do this unless I bring the duplicated files together, because the file is bigger than my available RAM. (Hint: ever wanted to ensure a operation that would take a few hours now takes a few weeks (thus provoke failure because the power gets disrupted)? Solution: provoke a Linux kernel into using a swap space.) No, I need to bring the duplication candidates together or close. Easy, it is why I introduced the hash metadata. In bash: export LC_ALL=C;sort filename. (another string operation, but not yet handed to Julia.) Visual inspection: far from all files with the same hash had the same filename. They seemed mainly library files. I didn’t check, but I fully believe they did have the same content, it is most unlikely the hash uniqueness is failing that often that systematically. Some system(s) are evidently offering the same library file to be included under different names. In my system I definitely want each name stored separately, so the hash is not a file ID in all but a very few (manually handleable) cases as hoped. Should be easy, extract the filename (Mess8 above) and insert into metadata between the hash and the date. However, I decided 5 character would be enough, I wanted to keep the metadata standard in length. Now I am introducing >127 characters into my metadata!, but no matter, the sort will sill bring the ‘identical’ files together. Julia woes. findlast doesn’t even work with UTF8 strings owing to finding a continuation bytes. An issue already noted and possibly fixed by the authors, but not in in my release if released yet at all. So I write my own: (I will concentrate on the one directory separator here). (In my autism I find the ‘Python’ layout quite confusing, I squash things together tightly using ; widely. I will try expanding here, but excuse if I get spaces wrong.)
function lslsh(a)
i=length(a)
while i>1;
i-=1
if (try;a[i]=='/';catch;false;end)
return i+1
end
end
return 1
end
h2=open("ModiedVersion.txt","a")
open("ExistingVersion") do h1
while !eof(h1)
lS=readline(h1)
if length(lS)<43 #dummy line got in
write(h2,lS,"\n") #just echo it
else
mS=lS[1:41]*(lS[lslsh(lS):end]*" "^6)[1:5]*lS[41:end]
write(h2,mS,"\n")
end
end;
end
close(h2)
(now deep into string functions!). Worked over a quarter of a million times, then failed. The first byte of the ‘filename’ was >0x7f and <0xc0. StringToBeSliced[x:end] works fine with bytes over 127 providing the xth byte (and possibly the end byte) are not between 0x80 and 0xbf, ‘a continuation byte’, then it fails. Now try to write a ‘catch for that’. First it has to be passed through verbatim. Its value is given in the error message but just how is it extracted with ANY julia function without type converting, and doing that seems to be a minefield in itself. Now pass though the rest of the slice, -but what if the next byte of that is a ‘continuation byte?’ This is clearly a bodge too far.
As the processing gets deeper so the more ‘you name a string function, and it will quite likely be useful’.
Yes, I could re-write all string functions to work with byte arrays, but it is a bodge making the code less clear and a lot of work. I am fighting not working with the language. Logically I am truly handling strings – but I still don’t know, or care, what any of the bytes over 127 actually represent.
I hope my experiences so with the language is of interest to at least a few.