JuliaDB loadndsparse: many errors

I’ve done a fair amount of work to get my CSV cleaned up (pipe-delimited, no quotes) and to get the data types and column names right and loaded separately. I’ve consulted the out-of-core section of the JuliaDB docs, and I found the TrueFX Jupyter notebook, which was a lifesaver.

Nonetheless, I’m getting numerous deep errors when trying to call loadndsparse:

      From worker 9:    unknown function (ip: 0x7ff61615e300)
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    unknown function (ip: 0x7ff622af5d88)
      From worker 9:    unknown function (ip: 0x7ff622af6344)
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #table#71 at /home/user/.julia/packages/IndexedTables/5U0Ap/src/indexedtable.jl:137
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #table at ./none:0
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #table#72 at /home/user/.julia/packages/IndexedTables/5U0Ap/src/indexedtable.jl:140
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #table at ./none:0
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #convert#86 at /home/user/.julia/packages/IndexedTables/5U0Ap/src/indexedtable.jl:388
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #convert at ./none:0
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #ndsparse#103 at /home/user/.julia/packages/IndexedTables/5U0Ap/src/ndsparse.jl:99
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #ndsparse at ./none:0
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #ndsparse#102 at /home/user/.julia/packages/IndexedTables/5U0Ap/src/ndsparse.jl:65
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #ndsparse at ./none:0
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #ndsparse#106 at /home/user/.julia/packages/IndexedTables/5U0Ap/src/ndsparse.jl:112
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #ndsparse at ./none:0
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #ndsparse#107 at /home/user/.julia/packages/IndexedTables/5U0Ap/src/ndsparse.jl:116
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #ndsparse at ./none:0
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #convert#120 at /home/user/.julia/packages/IndexedTables/5U0Ap/src/ndsparse.jl:314
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #convert at ./none:0
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #_loadtable_serial#3 at /home/user/.julia/packages/JuliaDB/jDAlJ/src/util.jl:183
      From worker 9:    unknown function (ip: 0x7ff6100605d8)
      From worker 9:    #_loadtable_serial at ./none:0 [inlined]
      From worker 9:    #190 at /home/user/.julia/packages/JuliaDB/jDAlJ/src/io.jl:131
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    do_task at /home/user/.julia/packages/Dagger/sdZXi/src/scheduler.jl:259
      From worker 9:    unknown function (ip: 0x7ff610056175)
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    jl_f__apply at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    #112 at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292
      From worker 9:    run_work_thunk at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:79
      From worker 9:    macro expansion at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292 [inlined]
      From worker 9:    #111 at ./task.jl:268
      From worker 9:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 9:    unknown function (ip: 0x7ff622b10c12)
      From worker 9:    unknown function (ip: 0xffffffffffffffff)

Any help would be appreciated.

Here are some errors I managed to grab while attempting to use loadtable instead of loadndsparse:

      From worker 10:   unknown function (ip: 0x7f52aba86c61)
      From worker 10:   unknown function (ip: 0x7f52aba871d2)
      From worker 10:   unknown function (ip: 0x7f52aba87300)
      From worker 10:   jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 10:   unknown function (ip: 0x7f52b841ed88)
      From worker 10:   unknown function (ip: 0x7f52b841f344)
      From worker 10:   jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 10:   do_task at /home/user/.julia/packages/Dagger/sdZXi/src/scheduler.jl:260
      From worker 10:   unknown function (ip: 0x7f52a597f0c5)
      From worker 10:   jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 10:   jl_f__apply at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 10:   #112 at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292
      From worker 10:   run_work_thunk at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:79
      From worker 10:   macro expansion at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292 [inlined]
      From worker 10:   #111 at ./task.jl:268
      From worker 10:   jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 10:   unknown function (ip: 0x7f52b8439c12)
      From worker 10:   unknown function (ip: 0xffffffffffffffff)

I tried running this without loading anything from the CSV package and after running a Pkg update. Here are some of the errors:

      From worker 2:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 2:    unknown function (ip: 0x7f60e83d299f)
      From worker 2:    unknown function (ip: 0x7f60e83d3e3f)
      From worker 2:    unknown function (ip: 0x7f60e83d74a7)
      From worker 2:    unknown function (ip: 0x7f60e83d775d)
      From worker 2:    unknown function (ip: 0x7f60e83d9169)
      From worker 2:    unknown function (ip: 0x7f60e83dc1a0)
      From worker 2:    unknown function (ip: 0x7f60e83e16ae)
      From worker 2:    unknown function (ip: 0x7f60e845068a)
      From worker 2:    unknown function (ip: 0x7f60e8451c61)
      From worker 2:    unknown function (ip: 0x7f60e84521d2)
      From worker 2:    unknown function (ip: 0x7f60e8452300)
      From worker 2:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 2:    unknown function (ip: 0x7f60f4de9d88)
      From worker 2:    unknown function (ip: 0x7f60f4dea344)
      From worker 2:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 2:    #_loadtable_serial#3 at /home/user/.julia/packages/JuliaDB/jDAlJ/src/util.jl:178
      From worker 2:    unknown function (ip: 0x7f60e237f108)
      From worker 2:    #_loadtable_serial at ./none:0 [inlined]
      From worker 2:    #190 at /home/user/.julia/packages/JuliaDB/jDAlJ/src/io.jl:131
      From worker 2:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 2:    do_task at /home/user/.julia/packages/Dagger/sdZXi/src/scheduler.jl:259
      From worker 2:    unknown function (ip: 0x7f60e2322505)
      From worker 2:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 2:    jl_f__apply at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 2:    #112 at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292
      From worker 2:    run_work_thunk at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:79
      From worker 2:    macro expansion at /builddir/build/BUILD/julia/build/usr/share/julia/stdlib/v1.2/Distributed/src/process_messages.jl:292 [inlined]
      From worker 2:    #111 at ./task.jl:268
      From worker 2:    jl_apply_generic at /usr/bin/../lib64/libjulia.so.1 (unknown line)
      From worker 2:    unknown function (ip: 0x7f60f4e04c12)
      From worker 2:    unknown function (ip: 0xffffffffffffffff)

I also struggled to make JuliaDB work. AFAIK, further development is not prioritised at this stage. Also, JuliaDB uses TextParse.jl as its CSV loading engine, not CSV.jl, but I don’t think that knowledge helps in this case.

Thanks for your response. So, is there a way for Julia to work with data larger than memory?

There is, namely JuliaDB, but the barrier to trying it is quite high because the data reading isn’t perfect.

You could try splitting the large CSV into smaller CSV chunks first and then reading them in with JuliaDB.
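For what it’s worth, the splitting step can be sketched in plain Julia with no packages. This is only a rough sketch: split_csv, the chunk_00001.csv naming, and the rows_per_chunk default are made up for illustration, and it assumes one record per line (no embedded newlines), which should hold for a pipe-delimited file without quotes.

```julia
# Hypothetical helper, not part of JuliaDB or any package.
# Splits a delimited text file into smaller files, repeating the header
# line at the top of every chunk. Assumes one record per line.
function split_csv(path::AbstractString, outdir::AbstractString;
                   rows_per_chunk::Int = 1_000_000)
    mkpath(outdir)
    chunk_paths = String[]
    open(path, "r") do input
        header = readline(input)
        out = nothing          # current output stream, opened lazily
        rows_in_chunk = 0
        try
            for line in eachline(input)
                # Start a new chunk on the first row or when the current one is full.
                if out === nothing || rows_in_chunk == rows_per_chunk
                    out === nothing || close(out)
                    n = length(chunk_paths) + 1
                    chunk_path = joinpath(outdir, "chunk_" * lpad(string(n), 5, '0') * ".csv")
                    push!(chunk_paths, chunk_path)
                    out = open(chunk_path, "w")
                    println(out, header)
                    rows_in_chunk = 0
                end
                println(out, line)
                rows_in_chunk += 1
            end
        finally
            out === nothing || close(out)
        end
    end
    return chunk_paths
end
```

Only one line is held in memory at a time, so the input can be larger than RAM. As far as I know, JuliaDB’s loadtable and loadndsparse accept a vector of file paths (and a delim keyword), so the returned list could be passed to them, but check the docs for your version.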

I made R’s disk.frame (https://github.com/xiaodaigh/disk.frame) and have started a DiskFrame.jl project on my laptop to do just that. Watch this space.

idk if this will help:
https://github.com/JuliaData/Feather.jl

Thanks and good luck!

Can feather use an on-disk data source?

This is copied from the first sentence of the docs:

Feather.jl provides a pure Julia library for reading and writing feather-formatted binary files, an efficient on-disk representation of a DataFrame.

There is no ability to read chunk by chunk, but Feather files are just like CSVs in that they live on disk, and loading is lazy. So it might have some advantages, but I’m not sure…

If you need to process column by column, you might want to check out JDF.jl.

You can load only a few columns at a time for analysis, like this:

using DataFrames, JDF

a = DataFrame(a = 1:3, b= 1:3, c=1:3)

savejdf(a, "a.jdf")

column_a_b = loadjdf("a.jdf", cols = [:a, :b])

column_c = loadjdf("a.jdf", cols = [:c])

DiskFrame.jl will be built on top of JDF.jl, so JDF.jl will have chunk-by-chunk reading in the next version.

Yeah, the “high level api” says it can read Data.Source, and I don’t know what meets that type definition. But I’ll give it a shot :wink:

It’s probably going to come to column-by-column. Unfortunately I’m stuck in CSV-land, so there’s some cut in my future…

I don’t understand what you mean? Anyway, if you need help with JDF please let me know.

So it can read from CSV.Rows, but it’s building the file in memory rather than on disk…