I have a dataset in a JuliaDB distributed table with about 200 million rows. I’d like to take a random subsample from this table to do some quick exploratory plotting and analysis before running the whole thing. This is easy with an in-memory table, e.g.
using JuliaDB
using StatsBase
n = 1000
nsub = 100
t = table(randn(n), randn(n), names=[:x, :y])
typeof(t)
# IndexedTables.NextTable{IndexedTables.Columns{NamedTuples._NT_x_y{Float64,Float64},NamedTuples._NT_x_y{Array{Float64,1},Array{Float64,1}}}}
tsub = t[sample(1:n, nsub, replace=false)]
but the last line fails with a MethodError if t is a distributed JuliaDB.DNextTable instead of an in-memory IndexedTables.NextTable: there’s no getindex defined for distributed tables. I’m using Julia 0.6.2 and JuliaDB v0.8.4.
Is this a missing feature, or is a getindex with integer/array indices not possible because of how a DNextTable is chunked and stored? Any suggested workarounds? Thanks!