Reading (big) ASCII files

Dear All

After reading an article about Julia, I decided to have a look at it, both to deal with huge amounts of data (through HDF5 files, for both input and output) and to perform different types of calculations. Nevertheless, I'm wondering whether Julia can deal with big ASCII files (dozens of GB)?

Generally I have one number per line (ints & floats), and I do not need to load the whole file into RAM; reading one line at a time is satisfactory (like readline in Python, for example).

Any feedback on this topic would be appreciated.

Regards

Paul

Welcome Paul!

Not sure how to answer your question, but of course, with Julia you can read a file line by line, as with almost every other language out there which is capable of handling file descriptors :wink:

What have you tried so far?

The following Python code

with open("foo.txt") as fobj:
    for line in fobj.readline():
        print(line)

would look like this in Julia:

open("foo.txt") do fobj
    for line in eachline(fobj)
        println(line)
    end
end
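Since you mentioned one number per line, here is a minimal sketch of parsing as you stream (assuming a file of floats; use Int for integer lines):

open("foo.txt") do fobj
    for line in eachline(fobj)    # eachline strips the trailing newline
        x = parse(Float64, line)  # one number per line
        # process x here...
    end
end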

Note also that if you have a CSV file (comma-separated values — one value per line is a special case of this), then the CSV.jl package can work in a streaming fashion that only reads a portion of the data at a time. (Use limit=N to tell it to read only the first N rows, or iterate with CSV.Rows to stream one row at a time.) See also this discussion.
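For instance, a rough sketch of the streaming interface (assuming a headerless file numbers.txt with one value per line; the file name is illustrative):

using CSV

# CSV.Rows iterates rows lazily with a small memory footprint.
for row in CSV.Rows("numbers.txt"; header=false, types=Float64)
    value = row[1]  # first (and only) column of the current row
    # process value...
end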

Hi,

Thanks for the answers; I'll have a look at the different possibilities.

Paul

Alternatively, if there are no newline characters in the file, one can read it in chunks:

chunk_size = 64 # bytes
open("foo.txt", "r") do fid
    while !eof(fid)
        buf = read(fid, chunk_size)  # read a chunk
        # do stuff with the buffer; one could use readuntil as well
        # to read all bytes up to a certain character
    end
end
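To illustrate the readuntil variant mentioned in the comment, here is a minimal sketch assuming the values are separated by spaces:

open("foo.txt", "r") do fid
    while !eof(fid)
        token = readuntil(fid, ' ')  # read up to the next space (delimiter dropped)
        isempty(token) && continue   # skip empty tokens from repeated spaces
        value = parse(Float64, token)
        # do stuff with value...
    end
end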

I think this is an interesting topic. Would it make sense to have a simple program which reads the ASCII file and produces a more compact output format, e.g., HDF5 or JuliaDB?
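As a minimal sketch of such a conversion (assuming the HDF5.jl package and a file small enough to accumulate in memory; a truly streaming converter would write the dataset in chunks):

using HDF5

values = Float64[]
open("foo.txt") do fobj
    for line in eachline(fobj)
        push!(values, parse(Float64, line))
    end
end
h5write("foo.h5", "values", values)  # one compact binary dataset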

I guess it depends on what is producing the data: if it is a continuous stream of sensor data or stock ticker prices, then this falls flat on its face.
(Yes, I do realise stock ticker prices are pretty useless without the name of the stock as well!)

If you are able to tell us what type of work you are doing, it could help frame a response. I do realise that many people work in companies where you can't just tell everything.


Speaking of streaming, it would be cool to have a StreamedArray type where each element is the ‘latest’ value in a buffer/stream. In this way, one could always have an up-to-date, ready-to-process structure. It would be very useful for online algorithms (search, learning, etc.).

Thanks all for the advice; I'll have a look at it.

See mmap: Memory-mapped I/O · The Julia Language

(If you are working with that much data and access it frequently, it makes sense to prepare it in a binary format designed for your access pattern.)
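For example, a minimal sketch using the standard-library Mmap module (assuming foo.bin is a raw binary file of Float64 values; the name and layout are illustrative):

using Mmap

open("foo.bin", "r") do fid
    n = filesize(fid) ÷ sizeof(Float64)        # number of Float64s in the file
    data = Mmap.mmap(fid, Vector{Float64}, n)  # pages are loaded lazily on access
    # data behaves like an ordinary Vector{Float64}
    @show sum(data)
end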


I'm pretty familiar with memory-mapped files. I was referring to mapping things like /dev/random to the elements of the array, for example. For integrated devices, it should be straightforward to access their exposed /dev/ I/O interfaces, and pretty convenient to have them directly mappable to a vector.
Just to mock it up a bit:

sa = StreamedArray(open("/dev/dev1","r"), open("/dev/dev2","r", type=Float32)
# StreamedArray{Float32,1}([0.0f0, 0.0f0])
update!(sa) # reads some bytes from device
# StreamedArray{Float32,1}([1.0f0, 2.0f0])

Unless I'm misunderstanding, a “streamed array” seems like an oxymoron. If the many different kinds of things that can represent arrays have one shared feature, it seems to be fast random access to individual elements by indexing. Streaming seems completely at odds with that. Or am I misunderstanding the idea?

Random access into such an array would just grab the last/top bytes in the stream associated with the accessed index. The array is more like a static slice in time of a bunch of streams.
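A hypothetical sketch of that semantics (StreamedArray, update!, and the type keyword are all made up for illustration, matching the mock-up above):

mutable struct StreamedArray{T} <: AbstractVector{T}
    streams::Vector{IO}  # one underlying stream per element
    latest::Vector{T}    # most recently read value from each stream
end

StreamedArray(streams::IO...; type=Float32) =
    StreamedArray{type}(collect(IO, streams), zeros(type, length(streams)))

Base.size(sa::StreamedArray) = size(sa.latest)
Base.getindex(sa::StreamedArray, i::Int) = sa.latest[i]  # a static slice in time

function update!(sa::StreamedArray{T}) where {T}
    for (i, s) in enumerate(sa.streams)
        sa.latest[i] = read(s, T)  # grab the next value from each stream
    end
    return sa
end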