Nerd Sniping: How can I improve the read-in of a human-readable file?

Hi All,

I’m currently trying to read in a huge human-readable file, which can take up to 50% of my total walltime when running a job. I was wondering if there’s a more efficient way to read in such a file?

Naturally, I could always pre-process each file once, reading it in and writing it out in a more efficient format. However, given the large number of files I have, this is not practical.

The layout of the file is as follows,

  1. L lines of up to 10 integers per line. For example, I might have 15 integers over 2 lines of 10 and 5, respectively.
  2. The following line contains a single floating point number
  3. These L+1 lines repeat M times.
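
For example, a single block with 15 integers (so L = 2) might look like this (values made up):

1 4 7 9 12 15 18 21 24 27
30 33 36 38 40
0.7310585786300049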

My current approach is to pack all the integers over the L lines into a single UInt64 and represent the floating-point number as a Float32.

My current code is this,

p = Progress(num_lines, desc=format("Reading {:>12,d} lines", num_lines), showspeed=true)

for i in 1:num_lines
  bitstr = zero(UInt64)
  sp_idx = 1

  for j in 1:nlines
    state_line = readline(io)
    state_parts = parse.(Int8, split(state_line))

    for sp_index in state_parts
      bit_index = sp_index - 1  # Convert to 0-based index
      bitstr |= UInt64(1) << bit_index
      sp_idx += 1
      if sp_idx > num_ints
        break
      end
    end
  end

  bitstrings[i] = bitstr

  coeff_line = readline(io)
  coeffs[i] = parse.(Float32, split(coeff_line))[state_idx]
  next!(p)
end

Is there a more efficient way to do this?

Any help would be appreciated.

Put it in a function! :slight_smile:

3 Likes

Hi @mbauman, thanks for the quick response.

I currently have this within a function, but are you saying to make a separate standalone function to get better performance? So, I make a separate read_loop function and call this within my read_file function?

Ah, ok, that was, as you might expect, just a very quick response based on how the code you posted looked. It’s hard to fully evaluate without the complete example; how you structure your code (and what’s global) can have a dramatic impact on performance.
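
For instance (a toy illustration with made-up names, not your code):

# Untyped globals defeat type inference:
data = rand(1000)
f() = sum(data)   # slow: `data` is an untyped global, so `f` can't specialize on its type
g(v) = sum(v)     # fast: `g` compiles a specialized method for the argument's type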

1 Like

Not a problem, I appreciate the help!

I’ll have a look at what I can do and share any updates here!

Please post a runnable example and an example file (can be dummy data and shorter than your actual files of course). The code you posted is incomplete:

  • you never initialize coeffs and bitstrings
  • state_idx is never defined and seems to mean that the final line with the float actually contains multiple floats?
  • Remove the progress bar so the timings aren’t skewed by unnecessary logging (you can reintroduce it after optimizing, of course)
  • From your code I suppose that you know the structure of each ‘block’ rather well and all ‘blocks’ are similar? I.e. the number of lines with integers (called L in your post and nlines in the code) seems constant. Is the distribution of integers among these lines also the same across blocks? The number of blocks seems known as well (called ‘M’ in the text and num_lines in the code)
  • Advice: Try to choose more descriptive and distinguishable variable names. On first read I almost confused num_lines with nlines and sp_index with sp_idx. For num_lines I would propose num_blocks.
3 Likes

Hi @abraemer thanks for the detailed reply.

You’re correct, I copied across the snippet and should’ve used a minimal example. I’ll write up an example and share it below. I’ll also add a function to generate a fake input file to help run the example too.

1 Like

How confident are you that this is where you should improve efficiency, as opposed to the variable-width reads and String parsing? Have you profiled these parts?

Actually, I am also confused about how that even works, because you can only store the numbers 1-64 like that, but an Int8 goes up to 127. So how are the integers in the file constrained?

As promised, here’s a minimal reproducible example.

The data is sorted, random integers (up to a maximum of 40). I pack all the integers within a block into a single UInt64 because I need to do lots of bitwise operations. Then the float is just stored at the corresponding index in the coefficients array.

The idea is to speed up how I currently read in the data. The current approach packs the integers within a block into a single UInt64 for an efficient representation, and the float is stored in an array. So, within the bitstrings and coefficients arrays, the i-th index represents the i-th block in the original data file.
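
For instance (illustrative), once packed, testing whether integer k appears in block i is a single bitwise operation:

# Illustrative: is integer `k` (1-based) present in block `i`?
is_present = (bitstrings[i] & (UInt64(1) << (k - 1))) != 0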

Here’s the minimal reproducible example,

using StatsBase
using ProgressMeter
using Format

"""
Generates fake data with `num_blocks` blocks.
Each block contains `num_integers` integers and 1 float.

The integers are 1-based (ranging from 1 to 40, inclusive).
"""
function generate_fake_data(
    filename::String,
    num_integers::Int64,
    num_blocks::Int64
)
    open(filename, "w") do io
        for line in 1:num_blocks
            all_integers = sample(1:40, num_integers; replace = false)
            all_integers = sort(all_integers)
            for idx in 1:num_integers
                print(io, all_integers[idx], " ")
                if idx % 10 == 0
                    println(io)
                end
            end 
            if num_integers % 10 != 0
                println(io)
            end
            println(io, rand())
        end 
    end
end 

function read_fake_data(
    filename::String,
    num_integers::Int64,
    num_blocks::Int64
)

    if !isfile(filename)
        error("File not found: $filename")
    end
       
    (bitstrings, coefficients) = open(filename, "r") do io

        p = Progress(num_blocks, desc=format("Reading {:>12,d} blocks...", num_blocks), showspeed=true)

        # We'll pack all integers into a UInt64 (i.e. a bitstring)
        bitstrings = zeros(UInt64, num_blocks)
        coefficients = zeros(Float64, num_blocks)
        
        num_lines_per_block = ceil(Int, num_integers / 10)

        for i in 1:num_blocks
            bitstring = zero(UInt64) # Initialize `bitstring` to store integers in a block
            bit_idx = 1 # starting index for packing bits

            for j in 1:num_lines_per_block
                line = readline(io)
                parsed_line = parse.(Int8, split(line))
                
                for integer_index in parsed_line
                    bit_index = integer_index - 1  # Convert to 0-based index
                    bitstring |= UInt64(1) << bit_index
                    bit_idx += 1
                    if bit_idx > num_integers
                        break
                    end
                end
            end

            bitstrings[i] = bitstring

            coeff_line = readline(io)
            coefficients[i] = parse.(Float64, split(coeff_line))[1]
            next!(p)
        end

        return (bitstrings, coefficients)
    end

    return bitstrings, coefficients
    
end

filename = "fake_data.txt"
num_integers = 22 # number of integers per block
num_blocks = 1_000_000 # total number of blocks

generate_fake_data(filename, num_integers, num_blocks)

bitstrings, coefficients = read_fake_data(filename, num_integers, num_blocks)

println(length(bitstrings), " ", length(coefficients))

I appreciate the help

That’s a good point. If the bottleneck is mainly due to String parsing, then I probably can’t do much more? Unless I read in the data and write it to a more efficient format?
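
For concreteness, something like the following is what I have in mind (made-up helper names, writing the already-parsed arrays to a simple binary sidecar file):

function write_cache(cachefile::String, bitstrings::Vector{UInt64}, coefficients::Vector{Float64})
    open(cachefile, "w") do io
        write(io, Int64(length(bitstrings)))  # header: number of blocks
        write(io, bitstrings)                 # raw UInt64 data
        write(io, coefficients)               # raw Float64 data
    end
end

function read_cache(cachefile::String)
    open(cachefile, "r") do io
        n = read(io, Int64)
        bitstrings = Vector{UInt64}(undef, n)
        read!(io, bitstrings)
        coefficients = Vector{Float64}(undef, n)
        read!(io, coefficients)
        return bitstrings, coefficients
    end
end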

The only way to know is to profile your code, e.g. with @profview inside VS Code’s Julia REPL, to see where it spends the most time.
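
For example, using the MRE above:

# In VS Code's Julia REPL (the Julia extension provides @profview):
@profview read_fake_data(filename, num_integers, num_blocks)

# Or with just the standard library:
using Profile
@profile read_fake_data(filename, num_integers, num_blocks)
Profile.print()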

1 Like

Assuming that profiling backs this up, you can dodge String instantiation and processing by reading bytes into a preallocated buffer with copyline/copyuntil, and by wrapping that buffer in a non-allocating string-like object from StringViews.jl. You said human-readable, integers, and floating point, so I’d infer the file is plain text containing 0-9, ' ', '.', '\n', '\r', all ASCII and simpler to handle than full UTF-8. There’s still a limit to the read efficiency because of parse and the variable-width data on a couple of levels. A binary format could make some of that easier to read, but it may not be worth making unreadable copies under even more assumptions (one very underrated point: text is decimal and arbitrary-precision, so committing to a specific data type forces approximations and limits).
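
A rough sketch of that idea (assumes Julia 1.11+ for copyuntil, reuses the block layout from the MRE above, and drops the progress bar as suggested; untested against your real files):

using StringViews

function read_fake_data_buffered(filename::String, num_integers::Int, num_blocks::Int)
    bitstrings = zeros(UInt64, num_blocks)
    coefficients = zeros(Float64, num_blocks)
    num_lines_per_block = ceil(Int, num_integers / 10)
    buf = IOBuffer()
    open(filename, "r") do io
        for i in 1:num_blocks
            bits = zero(UInt64)
            for _ in 1:num_lines_per_block
                truncate(buf, 0)            # reuse the buffer: no String per line
                copyuntil(buf, io, '\n')
                # `buf.data`/`buf.size` reaches into IOBuffer internals; a common
                # trick with StringViews, but not public API:
                line = StringView(view(buf.data, 1:buf.size))
                for token in eachsplit(line, ' '; keepempty=false)
                    bits |= UInt64(1) << (parse(Int, token) - 1)
                end
            end
            bitstrings[i] = bits
            truncate(buf, 0)
            copyuntil(buf, io, '\n')
            coefficients[i] = parse(Float64, StringView(view(buf.data, 1:buf.size)))
        end
    end
    return bitstrings, coefficients
end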

1 Like

You can take a look at Automa.jl, which can be used to generate very efficient parsers from (a constrained set of) regular expressions.
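
For example, a very rough sketch with Automa v1’s exported API (this only validates one block; a real parser would attach actions to extract the numbers, and I haven’t tested this against your files):

using Automa

block_regex = let
    int = re"[0-9]+"
    intline = int * rep(re" " * int) * opt(re" ") * re"\n"   # up to 10 ints, maybe a trailing space
    floatline = re"[0-9]+\.[0-9]+\n"
    intline * rep(intline) * floatline                       # one block: int lines, then the float
end
eval(generate_buffer_validator(:is_valid_block, block_regex))

is_valid_block("1 4 7\n0.25\n") === nothing  # `nothing` means the input matched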

2 Likes

I have not looked under the hood, but am wondering if @scanf() (from Scanf.jl) might be more efficient? But only if the format is consistent.
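
For reference, its usage looks like this (illustrative values; the first return value is the number of successfully matched items):

using Scanf

# Read one integer and one float with a fixed format:
r, i, x = @scanf("13 0.5", "%d %f", Int, Float64)
# r == 2 (both matched), i == 13, x == 0.5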

1 Like