Loading first few lines from data file

iamsuddhasattwa · March 3, 2019, 1:24am

Hi, I have a data file with 100 rows of data of the form
1 2
3 4
5 6.1
…

I want to load the first 10 lines as a 10×2 array. However, I do not see any option in readdlm() for doing that, although it should be an easy task.

nilshg · March 3, 2019, 8:08pm

If the file is only 100 rows, surely it won’t take more than a fraction of a second to read, so wouldn’t you just do:

a = readdlm("my_100_row_file.txt")[1:10, :]
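
One caveat on the indexing: a single range such as `[1:10]` indexes a matrix linearly in column-major order, so it returns a vector of stored elements rather than rows, while `[1:10, :]` keeps whole rows. A minimal sketch of the difference, using an in-memory `IOBuffer` with made-up data in place of the real file (the sample values are an assumption for illustration):

```julia
using DelimitedFiles

# Made-up sample data standing in for the real file.
io = IOBuffer("1 2\n3 4\n5 6.1\n7 8\n")
a = readdlm(io)            # whitespace-delimited, all numeric -> 4×2 Float64 matrix

first_rows = a[1:3, :]     # row slice: keeps both columns
linear = a[1:3]            # linear indexing: first three entries of column 1

println(size(first_rows))  # (3, 2)
println(linear)            # [1.0, 3.0, 5.0]
```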

readdlm doesn’t seem to have an option to specify the number of rows to read, but you could easily do something yourself along the lines of:

function read_first_n(filename, n = 10)
    a = Array{Any}(undef, n)
    open(filename) do file
        for (i, ln) in enumerate(eachline(file))
            if i <= n
                a[i] = ln
            else
                break
            end
        end
    end
    return a
end

or if you want to make your life easier just use the CSV package which is a lot more fully featured than the DelimitedFiles stdlib:

using CSV
CSV.File("my_100_row_file.txt"; delim = ' ', header = false, limit = 10)

bennedich · March 3, 2019, 8:31pm

I also don’t see how to do it with readdlm, but a combination of Iterators.take and eachline makes this quite easy to do yourself. In the example below, I read the first 3 lines only, and specify the type to be Float64:

julia> map.(s -> parse(Float64, s), split.(Iterators.take(eachline("test.txt"), 3), ' '))
3-element Array{Array{Float64,1},1}:
 [1.0, 2.0]
 [3.0, 4.0]
 [5.0, 6.1]

StefanKarpinski · March 3, 2019, 8:45pm

Since you can pass any IO handle to readdlm, you can do something like this:

open(`head -n10 $file`) do io
    readdlm(io)
end

Might be possible to just pass the command object directly to readdlm. I’m not at a computer so I haven’t tried any of this. As a philosophical matter, we try to avoid APIs with lots of options (head/tail/etc options to every function that reads a file) in favor of composable constructs (passing IO objects that can do some form of truncation to functions).

iamsuddhasattwa · March 4, 2019, 1:07am

Thank you for the reply. However, after having installed and loaded “Iterators”, I get the following error message. Perhaps there is an error in your code.

x = map.(s -> parse(Float64, s), split.(Iterators.take(eachline("Data.txt"), 3), ' '))

ERROR: MethodError: no method matching size(::##16#18)
Closest candidates are:
size{N}(::Any, ::Integer, ::Integer, ::Integer…) at abstractarray.jl:48
size(::BitArray{1}) at bitarray.jl:39
size(::BitArray{1}, ::Any) at bitarray.jl:43
…
in broadcast_shape(::Function, ::Base.Take{EachLine}, ::Char, ::Vararg{Char,N}) at ./broadcast.jl:31
in broadcast_t(::Function, ::Type{Any}, ::Function, ::Vararg{Any,N}) at ./broadcast.jl:213
in broadcast(::Function, ::Function, ::Base.Take{EachLine}, ::Char) at ./broadcast.jl:230

iamsuddhasattwa · March 4, 2019, 1:46am

Thank you, this one actually worked. But when I said “10” in my posted question, I basically meant some arbitrary number. Is it possible to pass an argument N into the open() command?

iamsuddhasattwa · March 4, 2019, 1:49am

Thank you for the reply. But there seems to be an error with the word undef in your second line. The following error message is created.

ERROR: UndefVarError: undef not defined

bicycle1885 · March 4, 2019, 3:26am

Of course you can. Try it this way:

n = 10
file = "somefile.txt"
open(readlines, `head -n $(n) $(file)`)

bennedich · March 4, 2019, 5:06am

No. I think what’s happening here is that you’re using Julia 0.6? If that’s the case, I’d strongly recommend pausing whatever development you’re doing and focusing on upgrading to Julia 1+.

I like the readdlm version because it avoids having to write the parsing logic yourself. I’m just a bit concerned about going through an external command that way. It doesn’t seem platform independent?

iamsuddhasattwa · March 4, 2019, 5:15am

My version is reported as v"0.5.1-pre+31", so I think it means 0.5.1. Is that too outdated?
Thank you for the tip!

bennedich · March 4, 2019, 5:20am

Yes, that’s an old unmaintained version. We are on Julia 1.1 / 1.2 now. Lots of things have changed, which is why the examples given here don’t work for you. As I said, I think upgrading should be your number one priority.

iamsuddhasattwa · March 4, 2019, 5:33am

I will do that right away.

iamsuddhasattwa · March 4, 2019, 5:34am

Actually, the program gets into some kind of infinite loop when I do that

julia> n = 10;

julia> file = "Data.txt";

julia> x = open(readlines, `head -n$(n) $(file)')

Here, the execution keeps running.

iamsuddhasattwa · March 15, 2019, 7:17pm

Thanks it worked after I updated my Julia version.

iamsuddhasattwa · March 15, 2019, 7:20pm

Thanks it worked after I updated my Julia version. But the return type is Array{String,1}, not Array{Float64,2} as I had hoped.

bennedich · March 15, 2019, 8:02pm

It’s because you’re using readlines which returns strings. As I stated above, personally I’m not a fan of mixing bash and Julia for things that can easily be done in just Julia, so I would do something like this:

julia> rows = 3;

julia> first_rows = Iterators.take(eachline("test.txt"), rows);

julia> data = reduce(vcat, map.(s -> parse(Float64, s), split.(first_rows, ' '))')
3×2 Array{Float64,2}:
 1.0  2.0
 3.0  4.0
 5.0  6.1
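
The same idea can be wrapped in a small function so the row count becomes an argument. The sketch below is only one way to do it: `read_first_rows` is a hypothetical helper name (not part of Base or DelimitedFiles), it assumes whitespace-separated numeric columns of equal width, and it uses `split(line)` with no delimiter so that tabs or repeated spaces are also accepted. The temporary file and its contents are made up for the demonstration.

```julia
# Hypothetical helper: read at most `n` lines of `path` into a matrix of type `T`.
function read_first_rows(path::AbstractString, n::Integer; T::Type = Float64)
    rows = Vector{Vector{T}}()
    open(path) do io                      # the do-block closes the file even though we stop early
        for line in Iterators.take(eachline(io), n)
            isempty(strip(line)) && continue   # blank lines still count toward `n`
            push!(rows, parse.(T, split(line)))
        end
    end
    isempty(rows) && return Matrix{T}(undef, 0, 0)
    # permutedims turns each row vector into a 1×k matrix; vcat stacks them.
    return reduce(vcat, permutedims.(rows))
end

# Usage with a throwaway file containing made-up data.
path, io = mktemp()
write(io, "1 2\n3 4\n5 6.1\n7 8\n")
close(io)

data = read_first_rows(path, 3)
println(size(data))   # (3, 2)
```

Opening the file with a `do` block is a deliberate choice here: when iteration stops before the end of the file, the handle is still closed promptly instead of waiting for garbage collection.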

CameronBieganek · March 30, 2020, 6:52pm

Given this philosophy, is there some way to create a file IO object that limits the number of lines that are read in? I mean, aside from calling the system head with open(`head -n10 $file`). I can’t seem to find a way to create such an IO object, and thus it appears that the only way to limit the number of lines read by readdlm is to use open(`head -n10 $file`).

StefanKarpinski · March 30, 2020, 7:34pm

It seems like it doesn’t really need to be an IO object. For example:

julia> collect(Iterators.take(eachline("/usr/share/dict/words"), 10))
10-element Array{String,1}:
 "A"
 "a"
 "aa"
 "aal"
 "aalii"
 "aam"
 "Aani"
 "aardvark"
 "aardwolf"
 "Aaron"

Paul_Soderlind · March 30, 2020, 8:33pm

(on a lighter note)

Episode 2, season 3 of the British TV series Black Adder spends a lot of time on “aardvark” - number 8 in the list. (It revolves around S. Johnson’s dictionary and how he forgot about “aard…”) It’s (very) good fun.

CameronBieganek · March 30, 2020, 10:24pm

Ideally, I’d like to be able to take advantage of the row parsing that is built into readdlm, rather than having to read the rows in using eachline and parse them myself. It looks like readdlm can’t take in an iterator of strings representing rows:

julia> io() = IOBuffer("a\tb\tc\n1\t2\t3\n4\t5\t6\n7\t8\t9\n")
io (generic function with 1 method)

julia> readdlm(io())
4×3 Array{Any,2}:
  "a"   "b"   "c"
 1     2     3
 4     5     6
 7     8     9

julia> it = Iterators.take(eachline(io()), 2);

julia> readdlm(it)
ERROR: MethodError: no method matching readdlm_auto(::Base.Iterators.Take{Base.EachLine{Base.GenericIOBuffer{Array{UInt8,1}}}}, ::Char, ::Type{Float64}, ::Char, ::Bool)

It is possible to accomplish this with the following unsightly hack:

julia> readdlm(IOBuffer(join(Iterators.take(eachline(io()), 2), '\n')))
2×3 Array{Any,2}:
  "a"   "b"   "c"
 1     2     3

But it would be much nicer if I could just do this:

readdlm(file; limit=2)

This would be useful when you have a lot of large files and you just want to read in the first few lines for each of them. I know that this can be done with CSV.jl, but it seems like readdlm ought to support this also.
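
Until something like that exists, the workaround above can at least be hidden behind a small wrapper. This is a sketch under stated assumptions, not an API of DelimitedFiles: `readdlm_head` is a hypothetical name, and it simply joins the first `n` lines into an in-memory buffer and forwards any extra positional or keyword arguments to `readdlm`. The temporary file and its contents are made up for the demonstration.

```julia
using DelimitedFiles

# Hypothetical helper; `readdlm` itself has no `limit` keyword.
function readdlm_head(path::AbstractString, n::Integer, args...; kwargs...)
    # Only the first `n` lines are read from disk and held in memory.
    text = open(path) do io
        join(Iterators.take(eachline(io), n), '\n')
    end
    return readdlm(IOBuffer(text), args...; kwargs...)
end

# Usage with a throwaway tab-separated file.
path, io = mktemp()
write(io, "a\tb\tc\n1\t2\t3\n4\t5\t6\n7\t8\t9\n")
close(io)

m = readdlm_head(path, 2)
println(size(m))   # (2, 3)
```

Because the remaining arguments are passed through unchanged, a call such as `readdlm_head(path, 3, '\t'; skipstart = 1)` should behave like `readdlm` restricted to the first three lines of the file.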