How to obtain the result of a diff between 2 files in a loop?

Hello,

I would like to compare files in a loop and get the result as an integer (1=OK, 0=FALSE).

I use the following code:

for n in 1:10
   run(`diff file_a file_b`)
end

If the two files file_a and file_b are identical, the code works. If the two files differ, the code throws an error.

Is there a “clean” way to get an Int or a Bool from the result of the shell diff command in the loop above?

Or is there another way to write such a test?

Thanks.

You’ll want to use success to test command success, rather than run:

success(`diff file_a file_b`)

Note that if all you want to know is whether they differ or not, you may want to use the UNIX cmp command instead:

if success(`cmp --quiet file_a file_b`)
    # they are the same
else
    # they are different
end
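
To tie this back to the loop in the question, here is a minimal sketch that collects one integer per file pair. The names same_content and compare_all are hypothetical, and it assumes a diff executable is on the PATH. Note that diff exits with 1 when the files differ and with a larger status on trouble (for example a missing file); success maps both cases to false.

```julia
# Hypothetical helper: true when `diff` exits with status 0 (identical files).
# Output is sent to devnull so the differing lines are not printed.
same_content(a, b) =
    success(pipeline(`diff $a $b`, stdout=devnull, stderr=devnull))

# Hypothetical helper: one Int per pair, 1 = identical, 0 = different or error.
compare_all(file_pairs) = [Int(same_content(a, b)) for (a, b) in file_pairs]
```

Called as compare_all([("file_a", "file_b"), ("file_c", "file_d")]), this returns a Vector{Int} with one entry per pair.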

Thanks ! :slight_smile:

--quiet is a GNU-ism; if you want it to work with any POSIX cmp you should use -s.

Of course, this won’t work on Windows. It would be pretty easy to write an equivalent Julia function, though, like:

function filecmp(path1::AbstractString, path2::AbstractString)
    stat1, stat2 = stat(path1), stat(path2)
    if !(isfile(stat1) && isfile(stat2)) || filesize(stat1) != filesize(stat2)
        return false # or should it throw if a file doesn't exist?
    end
    stat1 == stat2 && return true # same file
    open(path1, "r") do file1
        open(path2, "r") do file2
            buf1 = Vector{UInt8}(undef, 32768)
            buf2 = similar(buf1)
            while !eof(file1) && !eof(file2)
                n1 = readbytes!(file1, buf1)
                n2 = readbytes!(file2, buf2)
                n1 != n2 && return false
                0 != Base._memcmp(buf1, buf2, n1) && return false
            end
            return eof(file1) == eof(file2)
        end
    end
end

Not only is this more portable, it is also much faster than executing an external program like cmp. On my Mac laptop it is about 1000× faster in the common case where the file sizes differ, and about 60× faster for a 20kB file when the files match (so that the whole files need to be read).

(Python provides filecmp.cmp in its standard library; I wonder if Julia should too?)

Probably a good idea. Worth opening an issue for.

Are you sure this should be in stdlib? Just a package would be fine IMO.

True, it’s on the border between basic enough to include in stdlib and not-needed-often-enough to belong in stdlib. On the other hand, what I like about it is that the API is absolutely crystal clear and won’t change—comparing two files to see if they’re bit-for-bit identical is always going to mean the same thing.

Until feature requests are opened for perfectly reasonable extensions :wink: E.g., even the dead simple shell command cmp can ignore initial regions, compare up to a given number of bytes, etc.

I don’t intend to nitpick here; it is just that, given how labor-intensive extending anything in Base and the standard libraries is (ramifications, reviews), I am very much against making something a stdlib unless there is a compelling reason.

In any case, the effort can certainly start as a package.

All fair points!

There are a couple of variations of what it can mean for files to be “identical”:

  • a and b are different files, that happen to contain identical data
  • a and b are hard-linked to the same inode
  • a is a symlink to b, or vice versa
  • a and b are actually the same file (but the path might be different, e.g. due to symlinked directories).

All of these scenarios would count as “bit-for-bit identical” but a filecmp function should offer the possibility to distinguish between them.
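
As a rough sketch of how those cases could be told apart with documented Base functions (samefile, islink, realpath), something like the following might work. The function name classify_pair and the returned symbols are hypothetical, both paths are assumed to exist, and the hard-link detection relies on device and inode numbers, which may not be reliable on every platform.

```julia
# Hypothetical classifier for the scenarios listed above.
function classify_pair(a::AbstractString, b::AbstractString)
    if samefile(a, b)                       # same device and inode after following links
        (islink(a) || islink(b)) && return :symlink
        realpath(a) == realpath(b) && return :same_path   # e.g. via symlinked directories
        return :hard_link
    end
    # Distinct files: fall back to comparing contents (reads both files fully).
    return read(a) == read(b) ? :equal_content : :different
end
```

Reading both files into memory keeps the sketch short; for large files a chunked comparison like the filecmp function earlier in the thread would be the better fit.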

I wrote some code for this kind of thing once upon a time. MIT license, if anyone wants to copy from it: GitHub - perrutquist/DeduplicateFiles.jl: Julia functions for finding and removing duplicate files.

How can I capture the output of diff (the lines that are different) in a Julia variable?

I tried using read(command::Cmd, String) but it fails since diff exits with an error whenever the two files are different.

Edit: I found a way:

diff_output = read(ignorestatus(`diff $file1 $file2`), String)
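
If the exit status is also needed (diff uses 0 for identical files, 1 for differences, and a larger value for trouble such as a missing file), one possible sketch is to open the command as a process and read from it. The name diff_with_status is hypothetical, and a diff executable on the PATH is assumed.

```julia
# Hypothetical helper: returns both the exit status and the captured output.
function diff_with_status(file1, file2)
    # ignorestatus prevents an exception when diff exits with a nonzero status
    proc = open(ignorestatus(`diff $file1 $file2`), "r")
    output = read(proc, String)   # everything diff wrote to stdout
    wait(proc)                    # make sure the process has exited
    return (status = proc.exitcode, output = output)
end
```

This makes it possible to tell "files differ" (status 1) apart from "diff itself failed" (status greater than 1), which read with ignorestatus alone does not show.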

In general, people looking at this question are likely interested in DeepDiffs.jl, which is much older than this question, so I am surprised it hasn’t been mentioned already.

Applying it to files can be done by comparing eachline(file1) with eachline(file2).
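
For readers who only need to know where two files start to differ, here is a dependency-free sketch of the same line-by-line idea. It is a stand-in written with Base functions only, not DeepDiffs.jl's API; the name first_differing_line is hypothetical, and it reads both files fully with readlines.

```julia
# Hypothetical helper: index of the first differing line, or `nothing`
# when the files have the same lines.
function first_differing_line(file1::AbstractString, file2::AbstractString)
    lines1, lines2 = readlines(file1), readlines(file2)
    shared = min(length(lines1), length(lines2))
    for i in 1:shared
        lines1[i] == lines2[i] || return i
    end
    # All shared lines match; the files differ only if one has extra lines.
    return length(lines1) == length(lines2) ? nothing : shared + 1
end
```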

Or also

I used this in Julia 1.10, but it seems that _memcmp now takes pointers (or strings) rather than a Vector, so the call throws an error in Julia 1.10. Up to Julia 1.9.4, this function worked fine. The error is as follows:

MethodError: no method matching _memcmp(::Vector{UInt8}, ::Vector{UInt8}, ::Int64)

Closest candidates are:
  _memcmp(::Union{Ptr{UInt8}, AbstractString}, ::Union{Ptr{UInt8}, AbstractString}, ::Int64)
   @ Base strings/string.jl:129

Yeah, in the longer run, if you are using this sort of code you should really write it using documented APIs. Fortunately, calling memcmp on a Vector{UInt8} is just a 1-line function:

_memcmp(a::Vector{UInt8}, b::Vector{UInt8}) = length(a) == length(b) &&
    Bool(iszero(ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), a, b, length(a) % Csize_t)))
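
Putting this together, here is a sketch of the earlier file comparison written only with documented Base functions, so it should not depend on internals such as Base._memcmp. The name filecmp_portable and the bufsize keyword are hypothetical, and comparing views with == is a stand-in for the memcmp call that may be somewhat slower.

```julia
# Hypothetical variant of the filecmp function above, avoiding Base internals.
function filecmp_portable(path1::AbstractString, path2::AbstractString;
                          bufsize::Integer=32768)
    (isfile(path1) && isfile(path2)) || return false
    filesize(path1) == filesize(path2) || return false   # cheap early exit
    open(path1, "r") do file1
        open(path2, "r") do file2
            buf1 = Vector{UInt8}(undef, bufsize)
            buf2 = similar(buf1)
            while !eof(file1) && !eof(file2)
                n1 = readbytes!(file1, buf1)
                n2 = readbytes!(file2, buf2)
                n1 == n2 || return false
                # compare only the bytes actually read in this chunk
                view(buf1, 1:n1) == view(buf2, 1:n1) || return false
            end
            return eof(file1) && eof(file2)
        end
    end
end
```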