I have a sparse 800 x 100,000 matrix that contains only binary data (1 or 0). When I load it into a table with MLJ, the scientific type of the columns is inferred as Count. I can change them to OrderedFactor with the coerce function, but this is very slow. Is there any way to make the table interface detect these columns as OrderedFactor automatically instead of defaulting to Count? Any way of getting to this result would work, though I imagine automatic detection would be the fastest.
Code:
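using SparseArrays  # provides sparse
using MLJ           # provides table, coerce, Count, OrderedFactor

function read_sparse(file::AbstractString, contains_labs = true)
    # If each line starts with a label, skip it and begin at the second field.
    f1 = contains_labs ? 2 : 1
    lines = readlines(file)
    m = zeros(Int, length(lines), 100_000)
    for (row, line) in enumerate(lines)
        # split with no delimiter splits on any whitespace (tabs and spaces)
        # and ignores leading and trailing whitespace.
        fields = parse.(Int, split(line))
        # Each remaining field is the index of a column that holds a 1.
        for j in f1:lastindex(fields)
            m[row, fields[j]] = 1
        end
    end
    sparse(m)
end # -> returns a Sparse 800 x 100,000 matrix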
train = read_sparse("train.txt")
train_table = table(train)
X = coerce(train_table, Count => OrderedFactor); # This is the bottleneck
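
For reference, a smaller reproduction of the same behaviour might look like the sketch below. The matrix `small` and the names `small_table` and `small_X` are made up for illustration and are not part of my real data. The sketch assumes that `schema` is available from MLJ and that `MLJ.table` handles a small sparse matrix the same way as the full one.

```julia
using SparseArrays
using MLJ

# Hypothetical 2 x 3 binary matrix standing in for the real 800 x 100,000 one.
small = sparse([1 0 1; 0 1 0])

# Wrap the matrix as a table, as in the code above.
small_table = MLJ.table(small)

# Inspect the inferred scientific types; with integer entries these
# should show up as Count, which is the behaviour described above.
println(schema(small_table))

# The same coercion as in the real code, fast here only because the table is tiny.
small_X = coerce(small_table, Count => OrderedFactor)
println(schema(small_X))
```

On the full table it is the `coerce` call in this pattern that takes the time.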