Random subsample from a JuliaDB distributed table: no getindex

I have a dataset in a JuliaDB distributed table with about 200 million rows. I’d like to take a random subsample from this table to do some quick exploratory plotting and analysis before running the whole thing. This is easy with an in-memory table, e.g.

using JuliaDB
using StatsBase

n = 1000
nsub = 100
t = table(randn(n), randn(n), names=[:x, :y])
typeof(t)
# IndexedTables.NextTable{IndexedTables.Columns{NamedTuples._NT_x_y{Float64,Float64},NamedTuples._NT_x_y{Array{Float64,1},Array{Float64,1}}}}
# draw nsub distinct row indices, then index the in-memory table with them
tsub = t[sample(1:n, nsub, replace=false)]

but the last line fails with a MethodError if t is a distributed JuliaDB.DNextTable instead of an in-memory IndexedTables.NextTable: there’s no getindex defined for distributed tables. I’m using Julia 0.6.2 and JuliaDB v0.8.4.
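For reference, here is a minimal sketch of the distributed case. It assumes the table is chunked with `distribute` purely for illustration; the chunk count of 4 and the names `dt` and `idx` are arbitrary choices for this example, not how the real 200-million-row table is built. The indexing call is wrapped in `try`/`catch` so the script prints whatever error is raised instead of stopping.

```julia
using JuliaDB
using StatsBase

n = 1000
nsub = 100
t = table(randn(n), randn(n), names=[:x, :y])

# Assumption for illustration: split the in-memory table into 4 chunks.
dt = distribute(t, 4)

# Row indices drawn without replacement, same as in the in-memory example.
idx = sample(1:n, nsub, replace=false)

try
    tsub = dt[idx]   # the call reported above as failing on a distributed table
    println("indexing succeeded: ", typeof(tsub))
catch err
    # Print the error type and message rather than aborting the script.
    println("indexing failed with ", typeof(err), ": ", err)
end
```

The same `idx` works on the in-memory `t`, so the only difference between the two cases is the table type being indexed.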

Is this a missing feature, or is a getindex with integer/array indices not possible because of how a DNextTable is chunked and stored? Any suggested workarounds? Thanks!
