Reading a file from line x to line y

Hi,

I would like to read a portion of a lot of big files, as I know that the lines that interest me are located between line x and line y.
Currently, I load the files with CSV.File and read parts, but the loading overhead is way too much.
The same process in shell (with sed/grep…) takes 27 minutes; Julia takes 139.
Should I call grep from Julia?

Did you try these?

skipto
limit

Unfortunately, as far as I’ve tested it, skipto/limit has poor performance:

@time csv=CSV.File("data/big_file.txt"; header=false, skipto=1501245, limit=450)
  5.636217 seconds (137.98 M allocations: 2.146 GiB, 0.36% gc time, 0.08% compilation time)
450-element CSV.File:
...

In shell:

time sed -n '1501245,1501694p;1501695q' data/big_file.txt | wc -l
450

real	0m0,061s

You could try using eachline, Iterators.drop and Iterators.take to assemble the lines you need into an in-memory stream, and then read the data from there. Depending on the amount of data you need to scan, you can either join the lines in question into a single string in memory, or use an IOBuffer and an asynchronous task to feed the buffer.
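
A minimal sketch of that idea (file name and delimiter are assumptions; the line numbers match the earlier timing): lazily skip to the first wanted line, take the range, and hand the joined text to CSV for parsing.

using CSV

open("data/big_file.txt") do io
    wanted = Iterators.take(Iterators.drop(eachline(io), 1501244), 450)
    block = join(wanted, '\n')
    CSV.File(IOBuffer(block); header=false, delim=' ')
end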

But you made me wonder… If just grepping the CSV files takes 27 minutes, wouldn’t you be better off by importing the CSV data into an SQL database once, and then fetching the necessary data using queries instead? You could use SQLite or DuckDB, or any SQL solution of your choice…

That looks very complicated for such a simple operation.

For various reasons, I can’t use SQL…

I’m currently experimenting with:

# note: `end` is a reserved word in Julia, so the range bounds need other names
block = readchomp(`sed -n "$start,$stop p;$stop_plus_one q" $filename`)

It’s fast but it doesn’t return an array.

Please check also this thread.

I've only ever used Windows, so take all of this with a grain of salt. Looking it up, sed is a stream editor working on one line at a time, wc -l counts lines, and piped processes execute concurrently (maybe like broadcasting fusing into one loop), so my guess is that the shell command just increments a count for each line read. By contrast, CSV.File must process lines into a tabular structure with a fixed number of named columns, which is a lot of extra work. countlines(...) runs much faster than length(CSV.File(...)), at least on the small files I have lying around. My hunch is you don't want all that processing, you just need to read lines of text.
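
For what it's worth, that comparison is easy to reproduce (file name is a placeholder):

using CSV

@time countlines("data/big_file.txt")                      # touches each byte once
@time length(CSV.File("data/big_file.txt"; header=false))  # parses every field into typed columns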

If the file contains a big dataset, the content of the lines usually identifies the data themselves. In that case, random access plus a binary search to find the initial read position and the stop condition could blow the other methods out of the water.

For example, if each line contains a sequential ID, or any sorted ID (for which you know the IDs of the initial and final lines).

TL;DR: if the data is sorted, use binary search.
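
A rough, untested sketch of the idea (assuming, as in the sample shown further down the thread, that every line starts with a sorted 14-digit timestamp, each line is at least 14 bytes long, and the requested timestamps fall inside the file):

using Mmap

# byte offset where the line containing `pos` starts
line_start(mm, pos) = something(findprev(==(UInt8('\n')), mm, pos), 0) + 1
# leading 14-digit ID of the line containing `pos`
line_id(mm, pos) = (ls = line_start(mm, pos); parse(Int, String(mm[ls:ls+13])))

# smallest byte offset whose line has ID >= target (classic lower-bound binary search)
function seek_line(mm, target)
    lo, hi = 1, length(mm)
    while lo < hi
        mid = (lo + hi) >>> 1
        line_id(mm, mid) < target ? (lo = mid + 1) : (hi = mid)
    end
    return line_start(mm, lo)
end

open("big_file.txt") do f
    mm = Mmap.mmap(f, Vector{UInt8})
    a = seek_line(mm, 20200320201740)        # first wanted timestamp
    b = seek_line(mm, 20200320201746) - 1    # byte just before the line that follows the range
    print(String(mm[a:b]))
end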

Well, that’s a bit vague. I understand if you can’t create intermediate SQL files, but it’s odd if you can’t use SQL at all…

DuckDB, for instance, supports querying directly from a CSV file. I don’t know the performance of this operation, but generally, DuckDB is quite performant. So maybe you can have a look at this: CSV Import – DuckDB, and then you can narrow the query using SQL syntax. The Julia library for DuckDB also supports importing the data into e.g., DataFrames, so the result should be compatible with what you get from the CSV lib.
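
A hedged sketch of that route (untested; assumes the file is space-delimited with no header, and reuses the line range from the earlier timing):

using DuckDB, DataFrames

con = DBInterface.connect(DuckDB.DB)    # in-memory DuckDB instance
df = DataFrame(DBInterface.execute(con, """
    SELECT *
    FROM read_csv('data/big_file.txt', delim=' ', header=false)
    LIMIT 450 OFFSET 1501244
"""))

With a sorted timestamp column, a WHERE filter on the first column would be a more natural query than LIMIT/OFFSET.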

I’d like to keep things simple for something as simple as this…
For now I craft a run command with sed inside, write the extract to a RAM folder, then read the extract as a CSV. I’m now running under 18 minutes \o/ .
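
Roughly, a sketch of that pipeline (the RAM-backed path, delimiter, and line numbers are assumptions):

using CSV

read_start, read_stop = 1501245, 1501694
extract = "/dev/shm/extract.txt"    # tmpfs / RAM-backed location on Linux
run(pipeline(`sed -n "$read_start,$read_stop p; $(read_stop + 1) q" data/big_file.txt`,
             stdout = extract))
csv = CSV.File(extract; header=false, delim=' ')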

Can you provide a few lines from the big_file as a sample, and what is the size of the file in MB and in lines?
(For each problem there are specific optimizations.)

It’s data from the Earth’s magnetic field, reshaped into this form:

20200320201739 29204.51 5900.46 -18887.12
20200320201740 29204.53 5900.45 -18887.09
20200320201741 29204.54 5900.46 -18887.11
20200320201742 29204.54 5900.45 -18887.10
20200320201743 29204.58 5900.45 -18887.11
20200320201744 29204.59 5900.47 -18887.08
20200320201745 29204.60 5900.49 -18887.10
20200320201746 29204.63 5900.49 -18887.09

Files can be around a GiB in size.

Well, this provides exactly the kind of specific additional info which could solve the speed problem:

  1. The file is ASCII, one byte per codepoint.
  2. Lines seem to be of about equal length and contain only tame ASCII characters.
  3. The data in each line gives a sorted indication of which line it is, so the required lines can probably be divined from the data content (allowing random-access binary search).

Points (1)-(3) are not working code, but getting some 500 lines from the middle of such a file should be doable, and probably even much faster than sed.

Regarding point (3), another bit of interesting info is how the chosen range of lines is determined (is there an β€œindex” file which specifies line numbers)?

For example, a vectorized loop counting occurrences of UInt8('\n') can very quickly scan an mmapped file to find the right line, even without random access directly into mid-file.
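
A minimal illustration of that scan (file name is a placeholder): counting newlines over the mmapped bytes allocates no per-line strings, and the same scan, stopped at the n-th newline, finds the start of line n (as in the code further down the thread).

using Mmap

open("big_file.txt") do f
    mm = Mmap.mmap(f, Vector{UInt8})
    println(count(==(UInt8('\n')), mm))   # total number of lines
end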

Looks like I had forgotten to add one necessary option to do this. With GMT.jl (master version for now)

using GMT

D = gmtread("ppt20240301vsec.sec", data=true, inrows="100:200")
Attribute table
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Timecol β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚       1 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
BoundingBox: [1.7092512e9, 1.7092512e9, 0.020555555555555556, 0.04833333333333333, 61.0, 61.0, 29136.36, 29137.42, 5945.51, 5945.67, -18656.91, -18656.72, 35102.42, 88888.0]

101Γ—7 GMTdataset{Float64, 2}
 Row β”‚                Time  col.2  col.3    col.4    col.5     col.6    col.7
─────┼────────────────────────────────────────────────────────────────────────
   1 β”‚ 2024-03-01T00:01:14    0.0   61.0  29137.4  5945.56  -18656.8  88888.0
   2 β”‚ 2024-03-01T00:01:15    0.0   61.0  29137.4  5945.53  -18656.8  35103.3
   3 β”‚ 2024-03-01T00:01:16    0.0   61.0  29137.4  5945.58  -18656.8  88888.0
   4 β”‚ 2024-03-01T00:01:17    0.0   61.0  29137.4  5945.6   -18656.8  88888.0
   5 β”‚ 2024-03-01T00:01:18    0.0   61.0  29137.4  5945.63  -18656.8  88888.0

Note, GMT checks lots of things when reading text tables so this may not be the fastest of the solutions.

Dan, from another thread, my guess is that the first column is time. Number of seconds since ???

20200320201743 29204.58 5900.45 -18887.11

20200320201743 is a DateTime: 2020-03-20T20:17:43
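
For example, with Dates (assuming the first column is yyyymmddHHMMSS):

using Dates

DateTime("20200320201743", dateformat"yyyymmddHHMMSS")   # 2020-03-20T20:17:43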

Did you try to tell CSV the types of the columns?

types: a single Type, AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict("column1"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column’s type should be detected while parsing.
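
A sketch of what that could look like for the sample shown earlier (one integer timestamp plus three floats), so CSV.jl can skip its type-detection pass:

using CSV

csv = CSV.File("data/big_file.txt"; header=false, delim=' ',
               types=[Int64, Float64, Float64, Float64],
               skipto=1501245, limit=450)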

Did you try to read the file using multiple threads?
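
For example (a sketch; ntasks is CSV.jl's knob for multithreaded parsing, and the value here is arbitrary):

using CSV

csv = CSV.File("data/big_file.txt"; header=false, delim=' ', ntasks=8)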

Did you try InMemoryDatasets.jl? This function reads a CSV file: Reading and Writing Β· DLMReader

In my tests it was consistently faster than CSV.jl…

To make some of my earlier suggestions concrete, here is code to print lines x to y:

using Mmap

function process_file(in_fn, startline, stopline)
    f = open(in_fn, "r")
    mm = Mmap.mmap(f, Vector{UInt8})     # raw bytes, nothing copied into memory
    l = 1
    pos = 0
    while l < startline                  # skip the newlines ending lines 1 .. startline-1
        pos = last(findnext([UInt8('\n')], mm, pos+1))
        l += 1
    end
    startpos = pos + 1                   # first byte of line `startline`
    while l <= stopline                  # advance to the newline ending line `stopline`
        pos = last(findnext([UInt8('\n')], mm, pos+1))
        l += 1
    end
    stoppos = pos
    write(stdout, @view mm[startpos:stoppos])
    close(f)
end

Using this code, to replicate the lines from the first post:

julia> @time process_file("big_file.txt", 1501245,1501694)

20200320201743 29204.58 5900.45 -18887.11
:
:
20200320201744 29204.59 5900.47 -18887.08
  0.036912 seconds (1.50 M allocations: 91.657 MiB, 10.68% gc time)

which takes 37ms vs 59ms of the sed command:

$ time sed -n '1501245,1501694p;1501695q' big_file.txt
:
real	0m0.059s

As usual, the different methods converge to the same order of magnitude.

UPDATE: A more Julian version of the above function (taking the same amount of time) returns a vector of line strings:

function process_file(in_fn, startline, stopline)
    open(in_fn, "r") do f
        mm = Mmap.mmap(f, Vector{UInt8})
        # one step of the fold: jump to the next newline (the counter argument is unused)
        skipline = let mm = mm
            (pos, l) -> last(findnext([UInt8('\n')], mm, pos+1))
        end
        startpos = foldl(skipline, 1:startline-1; init=0) + 1          # first byte of line `startline`
        stoppos = foldl(skipline, startline:stopline; init=startpos)   # newline ending line `stopline`
        return readlines(IOBuffer(@view mm[startpos:stoppos]))
    end
end

Have you tried CSV.Chunks?

β€’  skipto::Integer: specifies the row where the data starts in the csv file;
   by default, the next row after the header row(s) is used. If header=0, then
   the 1st row is assumed to be the start of data; providing a skipto argument
   does not affect the header argument. Note that if a row number skipto and
   comment or ignoreemptyrows are provided, the data row will be the first
   non-commented/non-empty row after the row number, meaning if the provided
   row number is a commented row, the data row will actually be the next
   non-commented row.

β€’  footerskip::Integer: number of rows at the end of a file to skip parsing.
   Do note that commented rows (see the comment keyword argument) do not count
   towards the row number provided for footerskip, they are completely ignored
   by the parser
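
A hedged sketch of the CSV.Chunks route (delimiter is an assumption, and ntasks, which controls how many chunks the file is split into, is arbitrary):

using CSV, DataFrames

for chunk in CSV.Chunks("data/big_file.txt"; header=false, delim=' ', ntasks=100)
    df = DataFrame(chunk)
    # ... keep only the rows in the wanted range, and break once past it ...
end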

Thanks, I’ll check that.
For now I use:

file.txt

1 aaa
2 bbb
3 ccc
4 ddd
5 eee
6 fff

Code

input_file = "file.txt"
output_file = "out.txt"

read_start = 2
read_stop = 4
quit = read_stop + 1     # tell sed to quit right after the range

# extract lines read_start..read_stop from input_file into output_file
run(`/bin/sh -c " sed -n \"$read_start, $read_stop p; $quit q\" $input_file > $output_file"`)

Result

2 bbb
3 ccc
4 ddd