JuliaDB Getting Started...with save error


#1

I am taking my first little steps in JuliaDB. I am planning to follow https://juliadb.org/latest, including the tutorial. I want not only to access the data but also to reduce the loading time relative to .csv. (I have long wondered what a good store for financial-securities data [with its sparse-matrix organization] should be. I was just about to try out sqlite, but maybe JuliaDB is ideal for this.)

First, I cut a small piece of the (pricey) CRSP data:

"permno","yyyymmdd","prc","vol","ret","shrout","openprc","numtrd","retx","vwretd","ewretd","eom"
10000,19860108,-2.5,12800,-0.02439,3680,NA,NA,-0.02439,-0.020744,-0.005117,0
10000,19860109,-2.5,1400,0,3680,NA,NA,0,-0.011219,-0.011588,0
10000,19860110,-2.5,8500,0,3680,NA,NA,0,0.000083,0.003651,0
10000,19860113,-2.625,5450,0.05,3680,NA,NA,0.05,0.002749,0.002433,0
10000,19860114,-2.75,2075,0.047619,3680,NA,NA,0.047619,0.000366,0.004474,0
10000,19860115,-2.875,22490,0.045455,3680,NA,NA,0.045455,0.008206,0.007693,0
10000,19860116,-3,10900,0.043478,3680,NA,NA,0.043478,0.004702,0.00567,0
10000,19860117,-3,8470,0,3680,NA,NA,0,-0.001741,0.003297,0
10000,19860120,-3,1000,0,3680,NA,NA,0,-0.003735,-0.001355,0

Julia 1.0.2, JuliaDB 0.9.0. First, let’s load a data sample and save it to disk, to see how much storage it takes and how fast it is:

julia> using JuliaDB

julia> @time sample=loadtable("./sample.csv")
 11.862888 seconds (42.01 M allocations: 2.037 GiB, 7.64% gc time)
Table with 9 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0
10000   19860114  -2.75   2075   0.047619  3680    missing  missing  0.047619  0.000366   0.004474   0
10000   19860115  -2.875  22490  0.045455  3680    missing  missing  0.045455  0.008206   0.007693   0
10000   19860116  -3.0    10900  0.043478  3680    missing  missing  0.043478  0.004702   0.00567    0
10000   19860117  -3.0    8470   0.0       3680    missing  missing  0.0       -0.001741  0.003297   0
10000   19860120  -3.0    1000   0.0       3680    missing  missing  0.0       -0.003735  -0.001355  0

julia> save( sample, "mysample.jdb" )
ERROR: DivideError: integer division error
Stacktrace:
 [1] rem at ./int.jl:233 [inlined]
 [2] padalign(::IOStream, ::Int64) at /Users/ivo/.julia/packages/MemPool/stadz/src/io.jl:13
 [3] mmwrite(::Serialization.Serializer{IOStream}, ::Array{Missing,1}) at /Users/ivo/.julia/packages/MemPool/stadz/src/io.jl:38
...
  • The load time for a 10-line CSV file seems slow. I hope it is a fixed cost and not a variable cost.

  • Is .jdb the recommended file extension?

  • My first attempt was just to see how much storage overhead a JuliaDB database takes. Is save(object, filename) the correct function? What did I do wrong? The error message could be a bit clearer…

regards,

/iaw


#2

save(object, filename) is correct, but you shouldn’t use an extension, AFAIK. The result is saved in binary with just save(sample, "mysample").

Loading from binary is much faster; the CSV load speed is close to Stata’s in my limited experience.
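A minimal round-trip sketch of that comparison (file names are hypothetical, and the first loadtable call also pays Julia’s compilation cost, so time a second run for a fairer number):

```julia
using JuliaDB

@time t = loadtable("sample.csv")      # parse from CSV: slow, dominated by compilation on first call
save(t, "sample_bin")                  # binary save, no file extension
@time t2 = JuliaDB.load("sample_bin")  # reload from the binary format: much faster
```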


#3

Same result without a filename extension: the save still throws an integer division error. Could someone who uses JuliaDB regularly please confirm that this also happens on their computer, and/or tell me whether this is a bug?


#4

The issue arises because somehow TextParse has been changed to use Missing instead of DataValue, whereas JuliaDB has not been ported to Missing yet. Please do open an issue about it.

As a workaround, using CSVFiles to load everything via the IterableTables machinery works fine:

using JuliaDB, CSVFiles
t = CSVFiles.load("test.csv") |> table   # parse with CSVFiles, then convert to a JuliaDB table
save(t, "test")                          # write JuliaDB's binary format
JuliaDB.load("test")                     # read it back

Besides compile time, JuliaDB.load("test") should be reasonably fast (especially if you start Julia with many processes, as it loads in parallel). The issue with loadtable should definitely be fixed, though; it’s very unfortunate that the default loading method fails when there is missing data.
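A sketch of the parallel case (the worker count of 4 is arbitrary; the file name "test" matches the snippet above):

```julia
using Distributed
addprocs(4)                  # add 4 local worker processes
@everywhere using JuliaDB    # load JuliaDB on the master and all workers
t = JuliaDB.load("test")     # chunks of the saved table are read in parallel
```

Starting Julia as `julia -p 4` has the same effect as the addprocs call.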


#5

Done.


#6

The latest update started fixing the problems, though not fully, I think. Is the following correct on my part (and incorrect on JuliaDB’s), or the opposite?

julia> @time sample=loadtable("./sample.csv")
  2.833065 seconds (8.73 M allocations: 433.733 MiB, 3.55% gc time)
Table with 9 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0
10000   19860114  -2.75   2075   0.047619  3680    missing  missing  0.047619  0.000366   0.004474   0
10000   19860115  -2.875  22490  0.045455  3680    missing  missing  0.045455  0.008206   0.007693   0
10000   19860116  -3.0    10900  0.043478  3680    missing  missing  0.043478  0.004702   0.00567    0
10000   19860117  -3.0    8470   0.0       3680    missing  missing  0.0       -0.001741  0.003297   0
10000   19860120  -3.0    1000   0.0       3680    missing  missing  0.0       -0.003735  -0.001355  0

julia> @time save( sample, "cutesample.jdb" )
  1.045374 seconds (4.08 M allocations: 198.137 MiB, 5.16% gc time)
Table with 9 rows, 12 columns:
permno  yyyymmdd  prc     vol    ret       shrout  openprc  numtrd   retx      vwretd     ewretd     eom
────────────────────────────────────────────────────────────────────────────────────────────────────────
10000   19860108  -2.5    12800  -0.02439  3680    missing  missing  -0.02439  -0.020744  -0.005117  0
10000   19860109  -2.5    1400   0.0       3680    missing  missing  0.0       -0.011219  -0.011588  0
10000   19860110  -2.5    8500   0.0       3680    missing  missing  0.0       8.3e-5     0.003651   0
10000   19860113  -2.625  5450   0.05      3680    missing  missing  0.05      0.002749   0.002433   0
10000   19860114  -2.75   2075   0.047619  3680    missing  missing  0.047619  0.000366   0.004474   0
10000   19860115  -2.875  22490  0.045455  3680    missing  missing  0.045455  0.008206   0.007693   0
10000   19860116  -3.0    10900  0.043478  3680    missing  missing  0.043478  0.004702   0.00567    0
10000   19860117  -3.0    8470   0.0       3680    missing  missing  0.0       -0.001741  0.003297   0
10000   19860120  -3.0    1000   0.0       3680    missing  missing  0.0       -0.003735  -0.001355  0

julia> @time sample=loadtable("./cutesample.jdb")
Error parsing ./cutesample.jdb
ERROR: previous rows had 1 fields but row 3 has 2
error(::String) at ./error.jl:33
guesscolparsers(::String, ::Array{String,1}, ::TextParse.LocalOpts, ::Int64, ::Int64, ::Array{Any,1}, ::Array{String,1}, ::Nothing) at /Users/ivo/.julia/packages/TextParse/WFgcL/src/csv.jl:496
...

#7

Try JuliaDB.load(); see http://juliadb.org/latest/api/io.html#Dagger.load

  • loadtable(): only for CSV files!

using JuliaDB, IndexedTables, DataFrames
@time flights = loadtable("hflights.csv");
@time JuliaDB.save(flights, "hflights.jdb")
@time hflights = JuliaDB.load("hflights.jdb")