Reading (big) ascii files

Paul18fr · April 3, 2019, 2:25pm

Dear All

After reading an article about Julia, I decided to have a look at it, both to deal with huge amounts of data (through HDF5 files, for both input and output) and to perform different types of calculations. Nevertheless, I’m wondering whether Julia can deal with big ASCII files (dozens of GB)?

Generally I have one number per line (ints and floats) and I do not need to load the whole file into RAM; reading one line at a time is sufficient (like readline in Python, for example).

Any feedback on this topic would be appreciated.

Regards

Paul

tamasgal · April 3, 2019, 2:35pm

Welcome Paul!

Not sure how to answer your question, but of course, with Julia you can read a file line by line, as with almost every other language out there that is capable of handling file descriptors.

What have you tried so far?

The following Python code

with open("foo.txt") as fobj:
    for line in fobj:  # iterating the file object yields one line at a time
        print(line, end="")  # each line already ends with a newline

is for example this in Julia:

open("foo.txt") do fobj
    for line in eachline(fobj)
        println(line)
    end
end
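
For the one-number-per-line case described in the question, the same loop can parse each line as it goes. A minimal sketch, assuming every non-blank line holds a single number; the function name sum_numbers and the choice to skip malformed lines are illustrative, not from the thread:

```julia
# Sum the numbers in a text file that holds one number per line,
# keeping only the current line in memory.
function sum_numbers(path::AbstractString)
    total = 0.0
    count = 0
    for line in eachline(path)       # lazy iterator: one line at a time
        s = strip(line)
        isempty(s) && continue       # skip blank lines
        x = tryparse(Float64, s)     # returns nothing if s is not a number
        x === nothing && continue    # assumption: silently skip malformed lines
        total += x
        count += 1
    end
    return total, count
end
```

Integers such as "3" are parsed as 3.0 here; if ints and floats need to stay distinct, one option is to try tryparse(Int, s) first and fall back to Float64.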

stevengj · April 3, 2019, 3:43pm

Note also that if you have a CSV file (comma-separated values — one value per line is a special case of this), then the CSV.jl package can work in a streaming fashion that only reads a portion of the data at a time. (Use limit=N to tell it to read N lines at a time.) See also this discussion.

Paul18fr · April 3, 2019, 4:32pm

Hi,

Thanks for the answers; I’ll have a look at the different possibilities.

Paul

zgornel · April 4, 2019, 10:25am

Alternatively, if there are no newline characters in the file, one can read it in chunks:

chunk_size = 64 # bytes
open("foo.txt", "r") do fid
    while !eof(fid)
        buf = read(fid, chunk_size)  # read a chunk
        # do stuff with the buffer; one could use readuntil as well
        # to read all bytes up to a certain character
    end
end
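
A variant of the chunked loop that reuses one buffer instead of allocating a new array per chunk, using readbytes! from Base. This is only a sketch: the function name count_byte, the default chunk size, and the idea of counting a separator byte are illustrative assumptions.

```julia
# Count occurrences of `target` in a stream, reading fixed-size chunks
# into a single preallocated buffer.
function count_byte(io::IO, target::UInt8; chunk_size::Integer = 1 << 20)
    buf = Vector{UInt8}(undef, chunk_size)  # allocated once, reused for every chunk
    n = 0
    while !eof(io)
        nread = readbytes!(io, buf)         # number of bytes actually read
        for i in 1:nread                    # the last chunk may be shorter
            n += (buf[i] == target)
        end
    end
    return n
end

# Convenience method: open the file, count, and close it again.
count_byte(path::AbstractString, target::UInt8; kwargs...) =
    open(io -> count_byte(io, target; kwargs...), path, "r")
```

For example, count_byte("foo.txt", UInt8(',')) would count the commas in foo.txt without loading the whole file.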

johnh · April 4, 2019, 11:10am

I think this is an interesting topic. Would it make sense to have a simple program which reads the ASCII file and produces a more compact output format, e.g. HDF5 or JuliaDB?

I guess it depends on what is producing the data - if it is a continuous stream of sensor data or stock ticker prices then this falls flat on its face.
(Yes, I do realise stock ticker prices are pretty useless without the name of the stock as well!)

If you are able to tell us what type of work you are doing, it could help frame a response. I do realise that many people work in companies where you can’t just tell everything.

zgornel · April 4, 2019, 3:33pm

Speaking of streaming, it would be cool to have a StreamedArray type where each element is the ‘latest’ value in a buffer/stream. In this way, one could always have an up-to-date and ready-to-process structure. It would be very useful for online algorithms (search, learning, etc.).

Paul18fr · April 4, 2019, 3:53pm

Thanks all for the advice; I’ll have a look at it.

stevengj · April 4, 2019, 4:14pm

See mmap: Memory-mapped I/O · The Julia Language

(If you are working with that much data and access it frequently, it makes sense to prepare it in a binary format designed for your access pattern.)
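
As a rough illustration of that suggestion, the text file can be converted once into raw Float64 binary and then memory-mapped with the Mmap standard library. The function names text_to_binary and mmap_floats and the raw native-endian layout are assumptions made for this sketch; a real pipeline might prefer HDF5 or another self-describing format.

```julia
using Mmap

# Convert a one-number-per-line text file into raw Float64 binary,
# streaming line by line. Returns the number of values written.
function text_to_binary(txt_path::AbstractString, bin_path::AbstractString)
    open(bin_path, "w") do out
        for line in eachline(txt_path)
            s = strip(line)
            isempty(s) && continue
            write(out, parse(Float64, s))  # 8 bytes per value, native byte order
        end
    end
    return filesize(bin_path) ÷ sizeof(Float64)
end

# Memory-map the binary file as a Vector{Float64}; the OS loads pages on demand.
function mmap_floats(bin_path::AbstractString)
    n = filesize(bin_path) ÷ sizeof(Float64)
    return open(io -> Mmap.mmap(io, Vector{Float64}, n), bin_path, "r")
end
```

After the one-time conversion, v = mmap_floats("data.bin") can be indexed like an ordinary vector (v[i], sum(v), and so on) even when the file is larger than RAM.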

zgornel · April 4, 2019, 5:41pm

I’m pretty familiar with memory-mapped files. I was referring to mapping things like /dev/random onto the elements of an array. For integrated devices, it should be straightforward to access their exposed /dev/ I/O interfaces, and pretty convenient to have them directly mappable to a vector.
Just to mock it up a bit:

sa = StreamedArray(open("/dev/dev1","r"), open("/dev/dev2","r"), type=Float32)
# StreamedArray{Float32,1}([0.0f0, 0.0f0])
update!(sa) # reads some bytes from device
# StreamedArray{Float32,1}([1.0f0, 2.0f0])

StefanKarpinski · April 4, 2019, 8:00pm

Unless I’m misunderstanding, a “streamed array” seems like an oxymoron. If the many different kinds of things that can represent arrays have one shared feature it seems to be fast random access to individual elements by indexing. Streaming seems completely at odds with that. Or am I misunderstanding the idea?

zgornel · April 5, 2019, 8:12am

Random access into such an array would just grab the last/top bytes in the stream associated with the accessed index… The array is more like a static slice in time for a bunch of streams.
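
To make the idea concrete, here is one way such a type could look. Everything in it is hypothetical: StreamedArray and update! are not existing Julia APIs, the constructor takes the element type as its first argument rather than as a keyword, and values are refreshed by an explicit update! call rather than on each access.

```julia
# Hypothetical type: element i caches the latest value read from stream i.
struct StreamedArray{T} <: AbstractVector{T}
    streams::Vector{IO}
    latest::Vector{T}
end

# All elements start at zero until the first update!.
StreamedArray(::Type{T}, streams::IO...) where {T} =
    StreamedArray{T}(IO[streams...], zeros(T, length(streams)))

# Minimal read-only AbstractVector interface.
Base.size(sa::StreamedArray) = size(sa.latest)
Base.IndexStyle(::Type{<:StreamedArray}) = IndexLinear()
Base.getindex(sa::StreamedArray, i::Int) = sa.latest[i]

# Read one value of type T from every stream that still has data.
# Note: on a real device, eof may block until data arrives.
function update!(sa::StreamedArray{T}) where {T}
    for (i, io) in enumerate(sa.streams)
        eof(io) || (sa.latest[i] = read(io, T))
    end
    return sa
end
```

Between calls to update!, indexing is ordinary random access into the cached vector, which matches the "static slice in time" description.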
