Shuffle lines in a big file, and trim it

Hi,

In the shell, whatever the size of the file, I do:

cat file.txt | shuf | head -n 5000 > shuffled_trimmed_file.txt

With Julia, I tried to load the file with readdlm(), but the file is too big and the process dies.

I looked at ways to randomly read lines from a file, but found nothing convenient/simple.

A simple way, in case it helps:

# 1 - Create input file
using Random
open("in.txt", "w") do io
    foreach(_ -> println(io, randstring(rand(1:9))), 1:1_000_000)   # 1M lines of 1-9 random chars each
end

# 2 - Read n random lines
using StatsBase
Nlines = countlines("in.txt")
n = 5_000
ix = sort(sample(1:Nlines, n; replace=false))   # sorted line numbers to keep
str = Vector{String}(undef, n)
i = j = 1                                       # i: current line number, j: next free slot
for line in eachline("in.txt")
    if i in ix                                  # note: O(n) lookup on a Vector
        str[j] = line
        j += 1
    end
    j == n + 1 && break                         # all n lines collected
    i += 1
end

# 3 - shuffle the n lines and output to file
shuffle!(str)
open("out.txt", "w") do io
    for i in eachindex(str)
        println(io, str[i])
    end
end
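
Small aside: since ix is sorted, the i in ix lookup (linear on a Vector) can be replaced by walking a single pointer through ix. A minimal sketch of that variant of step 2, reusing ix and n from above:

str = Vector{String}(undef, n)
k = 1                                  # next entry of ix to match
for (i, line) in enumerate(eachline("in.txt"))
    if i == ix[k]
        str[k] = line                  # collect the k-th sampled line
        k += 1
        k > n && break                 # all n lines collected
    end
end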

Another more low-level method:

using Mmap

function process_file2(in_fn, out_fn, n)
    f = open(in_fn, "r")
    fout = open(out_fn, "w")
    mm = Mmap.mmap(f, Vector{UInt8})   # view the file as a plain byte array
    L = length(mm)
    i = 0
    lines = Set{UInt}()                # start offsets of lines already emitted
    while i < n
        ix = rand(1:L)                 # jump to a random byte of the file
        ix2 = ix
        while ix2 < L && mm[ix2] != UInt8('\n')
            ix2 += 1                   # scan forward to the next newline
        end
        ix2 += 1                       # ix2 now points at the start of a line
        ix2 < L || continue            # no line starts here (end of file): retry
        ix2 in lines && continue       # this line was already sampled: retry
        push!(lines, ix2)
        while ix2 < L && mm[ix2] != UInt8('\n')
            write(fout, mm[ix2])       # copy the line byte by byte
            ix2 += 1
        end
        write(fout, '\n')
        i += 1
    end
    close(fout)
    close(f)
end

n = 5_000
process_file2("infile.txt", "outfile.txt", n)

This tries to avoid reading the whole file, and is thus o(Nlines), i.e. sublinear in the number of lines. In practice, it was about 1000x faster than the solution in the previous post.

A somewhat annoying cost is non-uniformity: a line is picked with probability proportional to the length of the line preceding it (and the first line is never picked), so files with lines of different lengths are not sampled uniformly. Also, the number of sampled lines can't be too close to the original number of lines, or the rejection loop gets slow.
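
To make the bias concrete, here is a small self-contained check (the file name and trial count are arbitrary). It builds a toy file where one line is preceded by a long line and another by a short one, then repeats the same random-offset scan as above and counts which line each draw lands on:

using Mmap

# Toy file: a line is hit in proportion to the length of the line before it.
open("bias_demo.txt", "w") do io
    println(io, "x"^90)   # long line: makes the *next* line very likely
    println(io, "a")      # preceded by 91 bytes -> sampled often
    println(io, "b")      # preceded by 2 bytes  -> sampled rarely
end

mm = Mmap.mmap(open("bias_demo.txt"), Vector{UInt8})
L = length(mm)
counts = Dict{Int,Int}()
for _ in 1:100_000
    ix = rand(1:L)
    while ix < L && mm[ix] != UInt8('\n')   # scan to the next newline
        ix += 1
    end
    start = ix + 1
    start < L || continue                   # landed in the last line: skip
    counts[start] = get(counts, start, 0) + 1
end
for (start, c) in sort!(collect(counts); by = first)
    println("line starting at byte ", start, " sampled ",
            round(100c / 100_000; digits = 1), "% of draws")
end

The line preceded by 91 bytes gets roughly 96% of the draws; the one preceded by 2 bytes gets roughly 2%.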


In addition to the suggestions here, note that readdlm is untyped and thus very slow and bad for large files; work with lines instead. I would generally just avoid readdlm, actually; I regret it being a stdlib.

The simplest version of what you want would be this:

using Random
foreach(println, shuffle!(readlines())[1:5000])

This should be reasonably efficient; it's actually shorter than the shell command and does the equivalent work. If you run it from the command line, it works like this:

julia -e 'using Random; foreach(println, shuffle!(readlines())[1:5000])' < file.txt > shuffled_trimmed_file.txt

Or if you want to open named files it gets a bit more verbose:

using Random
open("file.txt", read=true) do in
    open("shuffled_trimmed_file.txt", write=true) do out
        lines = shuffle!(readlines(in))
        for i = 1:5000
            println(out, line[i])
        end
    end
end

This can definitely be golfed to be shorter, but you get the point.
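
For instance, a shorter form of the named-file version (same behavior, just a sketch):

using Random
open("shuffled_trimmed_file.txt", "w") do out
    foreach(l -> println(out, l), shuffle!(readlines("file.txt"))[1:5000])
end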

It would, however, be much more efficient to use reservoir sampling and only keep at most 5000 lines in memory at a time. There’s a very cool package called StreamSampling that implements this for you:

using StreamSampling
lines = itsample(eachline("file.txt"), 5000)

That’s it, and it’s wildly efficient since it never needs to hold more than 5000 lines in memory.
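
For the curious, the underlying idea (classic Algorithm R) is simple enough to hand-roll. A minimal sketch, assuming the input yields strings (the function name here is made up, not part of StreamSampling):

# One pass over the iterator, O(k) memory, uniformly random sample.
function reservoir_sample(itr, k::Integer)
    reservoir = String[]
    for (i, x) in enumerate(itr)
        if i <= k
            push!(reservoir, x)    # fill the reservoir with the first k items
        else
            j = rand(1:i)          # item i replaces a slot with probability k/i
            j <= k && (reservoir[j] = x)
        end
    end
    return reservoir
end

lines = reservoir_sample(eachline("file.txt"), 5000)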


Thanks everyone for your help!

Here is another option if you need a solution that works across languages:

using DuckDB

db = DuckDB.DB()
DuckDB.query(db, "COPY (SELECT * FROM 'input.csv' USING SAMPLE reservoir(5000 ROWS)) TO 'output.csv'")
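
For what it's worth, the USING SAMPLE reservoir(5000 ROWS) clause makes DuckDB do the reservoir sampling itself while it reads the CSV, so the whole file never needs to be loaded into Julia.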