SubString of a split IOStream line can't be compared with `in`

Dear all,
I might be taking too naive an approach: reading a file line by line, throwing out duplicated columns, and writing the result back to another file.

The first loop iteration runs well, but then I get the error LoadError: MethodError: objects of type IOBuffer are not callable.

Here is what I hope is an MWE.

BTW, since I’m new to Julia, I’d very much appreciate any suggestions for improving style and performance, thanks!

function removeDuplicates(in::IO, out::IO)
    for l in eachline(in)
        splt = split(l, ";")
        println(splt)
        knownDuplicates = Vector{String}()
        for el in splt[4:end]
            println(el)
            println(knownDuplicates)
            if !isempty(knownDuplicates) && in(el, knownDuplicates)
                el = ""
            else
                push!(knownDuplicates, el)
            end
        end

        write(out, join(splt, ";"))

    end
end

function removeDuplicates(inFile::String, outFile::String)
    open(outFile, "w") do out
        open(inFile, "r") do in
            removeDuplicates(in, out)
        end
    end
end


fakeData = """
"5674012";"530489692";"batch_145322";"10/31/2019 15:00:13";1;2;1;2;2;3;4;2;
"5674012";"530489702";"batch_145323";"10/31/2019 15:00:32";1;2;1;2;2;3;4;2;"
"5674012";"530489728";"batch_145327";"10/31/2019 15:01:56";1;2;1;2;2;3;4;2;"
"""

io = (IOBuffer = (in = IOBuffer(fakeData), out = IOBuffer()), 
      filename = (in = "list.txt", out = "nodup_julia.txt"))

@time removeDuplicates(io.IOBuffer.in, io.IOBuffer.out)

output:

SubString{String}["\"5674012\"", "\"530489692\"", "\"batch_145322\"", "\"10/31/2019 15:00:13", "\""]
"10/31/2019 15:00:13
String[]
"
["\"10/31/2019 15:00:13"]
ERROR: LoadError: MethodError: objects of type IOBuffer are not callable
Stacktrace:
 [1] removeDuplicates(in::IOBuffer, out::IOBuffer)
   @ Main c:\Users\janka\Git\juliaVsCpp\removeDuplicatesInRow_Julia.jl:33
 [2] top-level scope
   @ .\timing.jl:210
in expression starting at c:\Users\janka\Git\juliaVsCpp\removeDuplicatesInRow_Julia.jl:63

julia> 

Thank you very much!

edit: my julia version

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)       
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
  WORD_SIZE: 64    
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code-insiders
  JULIA_NUM_THREADS =

Your variable names shadow a lot of regular Julia functions - in this case, the function argument shadows the in function (which is only special insofar as it’s allowed to be used infix). I’d recommend not shadowing types/functions for clarity’s sake (e.g. in or IOBuffer, like you do later in your code).

julia> in                            
in (generic function with 35 methods)
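
To see what goes wrong, here is a minimal standalone sketch (with made-up names) of the same mistake:

# The argument `in` shadows Base.in inside the function body,
# so `in(el, v)` tries to *call the argument* rather than Base.in.
function f(in)
    return in(1, [1, 2, 3])
end

f([0])  # ERROR: MethodError: objects of type Vector{Int64} are not callable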

Oh my god, that was stupid. Thank you very much for your kind help.

May I open another topic regarding performance/style improvements? I guess that makes sense, since the title is no longer accurate, right?

Thanks!

I’d write your code like this:

function removeDuplicates(input::IO, output::IO)
    for line in eachline(input)
        parts = split(line, ";")
        println(parts)
        knownDuplicates = Set{String}()
        for element in parts[4:end]
            println(element)
            println(knownDuplicates)
            push!(knownDuplicates, element)
        end

        write(output, join(parts, ";"))
    end
end

function removeDuplicates(inFile::String, outFile::String)
    open(outFile, "w") do output
        open(inFile, "r") do input
            removeDuplicates(input, output)
        end
    end
end


fakeData = """
"5674012";"530489692";"batch_145322";"10/31/2019 15:00:13";1;2;1;2;2;3;4;2;
"5674012";"530489702";"batch_145323";"10/31/2019 15:00:32";1;2;1;2;2;3;4;2;"
"5674012";"530489728";"batch_145327";"10/31/2019 15:01:56";1;2;1;2;2;3;4;2;"
"""

io = (buffers = (input = IOBuffer(fakeData), output = IOBuffer()), 
      filename = (src = "list.txt", dest = "nodup_julia.txt"))

Though if you’re interested in CSV-like formats, CSV.jl can take custom delimiters etc.
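
For example, something along these lines (untested sketch; keyword names as in the CSV.jl documentation):

using CSV

# CSV.File accepts an IO object as well as a filename;
# `delim` sets the field separator and `header=false` treats every line as data.
rows = CSV.File(IOBuffer(fakeData); delim=';', header=false)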

Thank you very much for your suggestion.

However, after fixing the shadowing bug, I realized my program is not doing what it should.
The println calls were just for debugging. I’d like to remove duplicates (i.e. set them to “”) after their first occurrence. In my original approach that obviously could not work, since I was modifying the local el instead of splt[indexWhereDupeOccurred]. I think that in your proposal all elements are kept, right?
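
To illustrate what I mean (toy example):

v = ["a", "b", "a"]

for el in v
    el = ""        # only rebinds the loop-local `el`; `v` is unchanged
end
v                  # still ["a", "b", "a"]

for (i, el) in enumerate(v)
    v[i] = ""      # this writes into the vector itself
end
v                  # now ["", "", ""]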

Here is a new version (I’m not editing the original post, since the thread title would then be confusing). A Set was slower on my machine than a Vector.

function removeDuplicates(input::IO, output::IO)
    for line in eachline(input)
        splt = split(line, ";")
        # knownDuplicates = Set{String}() # Set is slower?
        knownDuplicates = Vector{String}()

        for (i, v) in enumerate(splt)
            if i < 4; continue; end
            if v in knownDuplicates
                splt[i] = "" # make element empty str
            else
                push!(knownDuplicates, v)
                # don't alter element
            end
        end

        write(output, join(splt, ";"), "\n")

    end
end

function removeDuplicates(inFile::String, outFile::String)
    open(outFile, "w") do output
        open(inFile, "r") do input
            removeDuplicates(input, output)
        end
    end
end


fakeData = join((join(r,";")*";" for r = eachrow(rand(1:20, 100000, 50))),"\n")

outBuffer = IOBuffer();
@time removeDuplicates(IOBuffer(fakeData), outBuffer)

While this performs quite well on my PC (~2.2 seconds), C and C++ implementations are ~100% faster. That’s okay, since they are much more verbose, but maybe there are further performance tweaks for the Julia code as well. Coming from Matlab, it feels strange to write the iterations explicitly, and I was wondering whether some kind of vectorization could be applied - which would probably also look nicer to my eyes. Is there any chance of something like this?

edit: In Python I’ve used list comprehensions a lot - would they make sense here, too?

PS: I’m aware of CSV.jl, but my trials showed that CSV.Lines is slower, and I don’t need the overhead here for this simple line-by-line problem (the data structure remains constant, there are no datatypes to parse, etc.). If you think that’s not true, I’ll for sure give CSV.jl another try :slight_smile:

Thanks!

In Julia, there’s no need for a vectorized API, since loops themselves are fast (unlike in Matlab). If a package provides an API taking a vector, it’s usually for batching purposes or because it loops internally.
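
That said, if you prefer the compact look, the per-line dedup step can be written as a comprehension - it’s still just a loop under the hood, so don’t expect it to be faster. An untested sketch with a made-up helper name:

# Blank out repeated values from column 4 onwards, comprehension-style.
function dedupFields(splt)
    seen = Set{String}()
    return [i < 4     ? v  :              # leave the first three columns alone
            v in seen ? "" :              # blank out a repeated value
            (push!(seen, v); v)           # first occurrence: remember and keep it
            for (i, v) in enumerate(splt)]
end

# usage inside the line loop:
# write(output, join(dedupFields(split(line, ";")), ";"), "\n")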

Your @time call also includes the compilation time of removeDuplicates - the second run will be much faster. If you want to benchmark code, I recommend the BenchmarkTools.jl package, which takes care of compiling your code, running it multiple times, collecting statistics (to remove random jitter/slowness), etc.
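
Something like this (sketch; assumes BenchmarkTools is installed):

using BenchmarkTools

# `$` interpolates the global into the benchmark expression so its lookup
# isn't measured; each sample constructs fresh IOBuffers.
@btime removeDuplicates(IOBuffer($fakeData), IOBuffer())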

Sukera, okay thanks :slight_smile:

However, I’ve always liked the vectorized style because it’s quite compact and easy to understand. Do you know any way to achieve that in the code above, or is this the standard “verbosity level” of Julia?

Thanks!