JuliaDB Getting Started...with save error


#1

I am taking my first little steps in JuliaDB. I am planning to follow https://juliadb.org/latest, including the tutorial. I want not only to access the data but also to reduce the loading time relative to .csv. (I have long wondered what a good store for financial-securities data [with its sparse-matrix organization] should be. I was just about to try out sqlite, but maybe JuliaDB is ideal for this.)

First, I cut a small piece of the (pricey) CRSP data:

"permno","yyyymmdd","prc","vol","ret","shrout","openprc","numtrd","retx","vwretd","ewretd","eom"
10000,19860108,-2.5,12800,-0.02439,3680,NA,NA,-0.02439,-0.020744,-0.005117,0
10000,19860109,-2.5,1400,0,3680,NA,NA,0,-0.011219,-0.011588,0
10000,19860110,-2.5,8500,0,3680,NA,NA,0,0.000083,0.003651,0
10000,19860113,-2.625,5450,0.05,3680,NA,NA,0.05,0.002749,0.002433,0
10000,19860114,-2.75,2075,0.047619,3680,NA,NA,0.047619,0.000366,0.004474,0
10000,19860115,-2.875,22490,0.045455,3680,NA,NA,0.045455,0.008206,0.007693,0
10000,19860116,-3,10900,0.043478,3680,NA,NA,0.043478,0.004702,0.00567,0
10000,19860117,-3,8470,0,3680,NA,NA,0,-0.001741,0.003297,0
10000,19860120,-3,1000,0,3680,NA,NA,0,-0.003735,-0.001355,0

Julia 1.0.2, JuliaDB 0.9.0. First, let’s load a data sample and save it to disk, to see how much storage it takes and how fast it is:

julia> using JuliaDB

julia> @time sample=loadtable("./sample.csv")
 11.862888 seconds (42.01 M allocations: 2.037 GiB, 7.64% gc time)
Table with 9 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0
10000   19860114  -2.75   2075   0.047619  3680    missing  missing  0.047619  0.000366   0.004474   0
10000   19860115  -2.875  22490  0.045455  3680    missing  missing  0.045455  0.008206   0.007693   0
10000   19860116  -3.0    10900  0.043478  3680    missing  missing  0.043478  0.004702   0.00567    0
10000   19860117  -3.0    8470   0.0       3680    missing  missing  0.0       -0.001741  0.003297   0
10000   19860120  -3.0    1000   0.0       3680    missing  missing  0.0       -0.003735  -0.001355  0

julia> save( sample, "mysample.jdb" )
ERROR: DivideError: integer division error
Stacktrace:
 [1] rem at ./int.jl:233 [inlined]
 [2] padalign(::IOStream, ::Int64) at /Users/ivo/.julia/packages/MemPool/stadz/src/io.jl:13
 [3] mmwrite(::Serialization.Serializer{IOStream}, ::Array{Missing,1}) at /Users/ivo/.julia/packages/MemPool/stadz/src/io.jl:38
...
  • The load time for a 10-line CSV file seems slow. I hope it is a fixed cost and not a variable cost.

  • Is .jdb the recommended file extension?

  • My first attempt was just to see how much storage overhead a JuliaDB database takes. Is save(object, filename) the correct function? What did I do wrong? The error message could be a bit clearer…

regards,

/iaw


#2

save(object, filename) is correct, but you shouldn’t use an extension, AFAIK. The result is saved in binary with just save(sample, "mysample").

Loading from binary is much faster; the CSV load speed is close to Stata’s in my limited experience.
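A minimal round-trip sketch of that comparison (file names are hypothetical, and the first loadtable call also pays Julia’s compilation cost, so time a second run for a fairer number):

```julia
using JuliaDB

@time t = loadtable("sample.csv")      # parse from CSV: slow, dominated by compilation on first call
save(t, "sample_bin")                  # binary save, no file extension
@time t2 = JuliaDB.load("sample_bin")  # reload from the binary format: much faster
```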


#3

Same result without a filename extension: the save still throws an integer division error. Could someone who uses JuliaDB regularly please confirm that this also happens on their computer, and/or tell me whether this is a bug?


#4

The issue arises because somehow TextParse has been changed to use Missing instead of DataValue, whereas JuliaDB has not been ported to Missing yet. Please do open an issue about it.

As a workaround, using CSVFiles to load everything via the IterableTables machinery works fine:

using JuliaDB, CSVFiles
t = CSVFiles.load("test.csv") |> table   # parse with CSVFiles, then convert to a JuliaDB table
save(t, "test")                          # write JuliaDB's binary format
JuliaDB.load("test")                     # read it back

Besides compile time, JuliaDB.load("test") should be reasonably fast (especially if you start Julia with many processes, as it loads in parallel). The issue with loadtable should definitely be fixed, though; it’s very unfortunate that the default loading method fails when there is missing data.
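A sketch of the parallel case (the worker count of 4 is arbitrary; the file name "test" matches the snippet above):

```julia
using Distributed
addprocs(4)                  # add 4 local worker processes
@everywhere using JuliaDB    # load JuliaDB on the master and all workers
t = JuliaDB.load("test")     # chunks of the saved table are read in parallel
```

Starting Julia as `julia -p 4` has the same effect as the addprocs call.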


#5

Done.


#6

The latest update started fixing the problems, though not fully, I think. Is the following correct on my part (and incorrect on JuliaDB’s), or the opposite?

julia> @time sample=loadtable("./sample.csv")
  2.833065 seconds (8.73 M allocations: 433.733 MiB, 3.55% gc time)
Table with 9 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0
10000   19860114  -2.75   2075   0.047619  3680    missing  missing  0.047619  0.000366   0.004474   0
10000   19860115  -2.875  22490  0.045455  3680    missing  missing  0.045455  0.008206   0.007693   0
10000   19860116  -3.0    10900  0.043478  3680    missing  missing  0.043478  0.004702   0.00567    0
10000   19860117  -3.0    8470   0.0       3680    missing  missing  0.0       -0.001741  0.003297   0
10000   19860120  -3.0    1000   0.0       3680    missing  missing  0.0       -0.003735  -0.001355  0

julia> @time save( sample, "cutesample.jdb" )
  1.045374 seconds (4.08 M allocations: 198.137 MiB, 5.16% gc time)
Table with 9 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0
10000   19860114  -2.75   2075   0.047619  3680    missing  missing  0.047619  0.000366   0.004474   0
10000   19860115  -2.875  22490  0.045455  3680    missing  missing  0.045455  0.008206   0.007693   0
10000   19860116  -3.0    10900  0.043478  3680    missing  missing  0.043478  0.004702   0.00567    0
10000   19860117  -3.0    8470   0.0       3680    missing  missing  0.0       -0.001741  0.003297   0
10000   19860120  -3.0    1000   0.0       3680    missing  missing  0.0       -0.003735  -0.001355  0

julia> @time sample=loadtable("./cutesample.jdb")
Error parsing ./cutesample.jdb
ERROR: previous rows had 1 fields but row 3 has 2
error(::String) at ./error.jl:33
guesscolparsers(::String, ::Array{String,1}, ::TextParse.LocalOpts, ::Int64, ::Int64, ::Array{Any,1}, ::Array{String,1}, ::Nothing) at /Users/ivo/.julia/packages/TextParse/WFgcL/src/csv.jl:496
...

#7

Try JuliaDB.load(); see http://juliadb.org/latest/api/io.html#Dagger.load

  • loadtable(): only for CSV files!

using JuliaDB, IndexedTables, DataFrames
@time flights = loadtable("hflights.csv");
@time JuliaDB.save(flights, "hflights.jdb")
@time hflights = JuliaDB.load("hflights.jdb")