Reading LIBSVM format

Hi! How can one read files in the LIBSVM format in Julia? The LIBSVM.jl package seems to implement only the algorithms. My current solution would be to read the file in Python, save it in another format, and then load it into Julia, but that's overkill for such a basic task.


Below is a naive implementation I wrote to read the LIBSVM format.
It's not high performance but proved suitable for my needs: parsing the Yahoo Learning to Rank Challenge Set 1 training data takes about 2 minutes (~475,000 observations, 700 columns).
Note that it returns a dense matrix, not a sparse one:

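# Parse LIBSVM-formatted bytes (e.g. from `read(path)`) into a dense feature matrix `x`
# and a label vector `y`. With `has_query=true`, the second token of each line is
# expected to be `qid:<n>`, and the query ids are returned as `q`.
function read_libsvm(raw::Vector{UInt8}; has_query=false)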

    io = IOBuffer(raw)
    lines = readlines(io)

    nobs = length(lines)
    nfeats = 0 # number of features

    y = zeros(Float64, nobs)

    if has_query
        offset = 2 # offset for feature idx: y + query entries
        q = zeros(Int, nobs)
    else
        offset = 1 # offset for feature idx: y
    end

    vals = [Float64[] for _ in 1:nobs]
    feats = [Int[] for _ in 1:nobs]

    for i in eachindex(lines)
        line = lines[i]
        line_split = split(line) # split on any whitespace, dropping empty tokens

        y[i] = parse(Float64, line_split[1]) # labels may be real-valued
        if has_query
            q[i] = parse(Int, split(line_split[2], ":")[2])
        end

        n = length(line_split) - offset
        lfeats = zeros(Int, n)
        lvals = zeros(Float64, n)
        @inbounds for jdx in 1:n
            ls = split(line_split[jdx+offset], ":")
            lvals[jdx] = parse(Float64, ls[2])
            lfeats[jdx] = parse(Int, ls[1])
            nfeats = max(nfeats, lfeats[jdx]) # track the largest feature index seen
        end
        vals[i] = lvals
        feats[i] = lfeats
    end

    # Scatter the parsed (index, value) pairs into a dense matrix.
    x = zeros(Float64, nobs, nfeats)
    @inbounds for i in 1:nobs
        for jdx in eachindex(feats[i])
            x[i, feats[i][jdx]] = vals[i][jdx]
        end
    end

    if has_query
        return (x=x, y=y, q=q)
    else
        return (x=x, y=y)
    end
end
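
If a sparse result is preferred, the same parsing idea can feed the `sparse(I, J, V, m, n)` constructor from the SparseArrays standard library instead of filling a dense matrix. The following is only a minimal sketch under some assumptions: the name `read_libsvm_sparse` is hypothetical (it is not part of LIBSVM.jl), it reads from any `IO`, it skips `qid:` tokens rather than returning them, and it does not handle trailing `#` comments.

```julia
using SparseArrays

# Hypothetical helper (not part of LIBSVM.jl): parse LIBSVM-formatted text
# into a sparse feature matrix and a label vector.
function read_libsvm_sparse(io::IO)
    rows = Int[]        # row index of each stored entry
    cols = Int[]        # column (feature) index of each stored entry
    vals = Float64[]    # value of each stored entry
    y = Float64[]       # one label per observation

    for line in eachline(io)
        tokens = split(line)          # split on any whitespace, drop empty tokens
        isempty(tokens) && continue   # skip blank lines
        push!(y, parse(Float64, tokens[1]))
        i = length(y)                 # current observation (row) number
        for token in tokens[2:end]
            idx, val = split(token, ':')
            idx == "qid" && continue  # assumption: query ids are ignored in this sketch
            push!(rows, i)
            push!(cols, parse(Int, idx))
            push!(vals, parse(Float64, val))
        end
    end

    # The number of columns is the largest feature index seen.
    nfeats = isempty(cols) ? 0 : maximum(cols)
    x = sparse(rows, cols, vals, length(y), nfeats)
    return (x=x, y=y)
end

# Tiny in-memory example: two observations, three features.
sample = """
1 1:0.5 3:2.0
0 2:1.5
"""
data = read_libsvm_sparse(IOBuffer(sample))
```

Here `data.x` is a 2×3 sparse matrix with three stored entries and `data.y` holds the two labels. For a file on disk, something like `open(read_libsvm_sparse, path)` should work, since the function takes an `IO`.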

Thanks! So does it mean that there is indeed no dedicated package for this?

Correct, at least I’m not aware of any dedicated package.
I think the function above could be polished a little and added to LIBSVM.jl, or even turned into a dedicated lightweight package under the JuliaIO organization on GitHub.