Flux: 1D convolutions (on genomic data)

I am trying to recreate this tutorial, which is written in Python, in Julia.

This basically deals with how to apply deep learning to genomic data. The input consists of sequences, where each sequence is a combination of the four bases A, C, G, T. An example of such a sequence is CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA.

To make the sequences into something numerical, we first have to one-hot-encode them. For the one-hot encoding, I do:

sequence = "CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA"
alphabet = ['A', 'C', 'G', 'T']
using Flux: onehot
tokenise(s) = [onehot(c, alphabet) for c in s]
tokenise(sequence)

which gives a 46-element array of 4-dimensional vectors.

The tutorial then continues with a convolutional neural network (CNN) that uses a 1D convolutional layer. However, I was having trouble finding anything similar in Flux.

The Python code in question is:

model = Sequential()
model.add(Conv1D(filters=32, kernel_size=12, input_shape=(train_features.shape[1], 4)))
model.add(MaxPooling1D(pool_size=4))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(2, activation='softmax'))

model.compile(loss='binary_crossentropy', optimizer='adam', 
              metrics=['binary_accuracy'])

I have tried looking at CNNs implemented in Flux running on e.g. MNIST data, but I couldn't find anything related to the 1D case.

Can anyone help with a Flux implementation?


You can just use the Conv layer; it handles convolutions up to 3D.
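
For example, a minimal sketch (my addition; Flux's 1D conv inputs are laid out as (length, channels, batch)):

using Flux

# A 1-tuple kernel size makes Conv act as a 1D convolution,
# analogous to Keras' Conv1D(filters=32, kernel_size=12):
conv1d = Conv((12,), 4 => 32)
x = rand(Float32, 50, 4, 8)   # 8 sequences of length 50 with 4 channels
size(conv1d(x))               # (39, 32, 8)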


So you do not need a Flux version of Keras' Conv1D? How come they have a separate version in Keras/TensorFlow then?

Also, do you have a MWE of this?

Okay, I think I managed to get it working! First I load in the data and one-hot-encode it (pretty similar to my first shot above):

using MLJ
using Flux
import MLJFlux
import MLJIteration
using CairoMakie
using Random: shuffle  # for shuffling the train/test indices below
import Base.@kwdef
using Downloads

url_seqs = "https://raw.githubusercontent.com/abidlabs/deep-learning-genomics-primer/master/sequences.txt"
sequences_raw = split(String(take!(Downloads.download(url_seqs, IOBuffer()))))

url_labels = "https://raw.githubusercontent.com/abidlabs/deep-learning-genomics-primer/master/labels.txt"
labels_raw = parse.(Int64, split(String(take!(Downloads.download(url_labels, IOBuffer())))))

#%%

bases_dna = ['A', 'C', 'G', 'T']

function ohe_row(sequence)
    return collect(sequence) .== permutedims(bases_dna)
end

function onehot(sequences)
    L = 50
    N_bases = 4
    out = BitArray(undef, L, N_bases, size(sequences, 1))
    for (i, sequence) in enumerate(sequences)
        out[:, :, i] = reshape(ohe_row(sequence), (L, N_bases, 1))
    end
    return out
end

#%%

sequences = onehot(sequences_raw);

images = coerce(sequences, GrayImage);
labels = coerce(labels_raw, Binary);

N_images = length(images)
idxs = shuffle(1:N_images);
train = idxs[1:1500];
test = idxs[1501:end];

One slightly weird feature is that the easiest route seems to be converting the genomic data to an L × 4 grayscale image (with L being the sequence length, 50 in this case).
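
A quick sanity check (my addition) that the coercion did what I expect:

scitype(images)       # AbstractVector{GrayImage{50, 4}}
size(first(images))   # (50, 4)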

The convolutional network was a bit tricky to get working, at least regarding the correct dimensions for the convolutional filter. However, I think I got it with the following:

@kwdef struct MyDNABuilder
    kernel_size::Int
    filters::Int
    dense::Int
end

make2d(x::AbstractArray) = reshape(x, :, size(x)[end])  # same as Flux.flatten

function MLJFlux.build(b::MyDNABuilder, rng, n_in, n_out, n_channels)
    kernel_size, filters, dense = b.kernel_size, b.filters, b.dense
    init = Flux.glorot_uniform(rng)
    front = Chain(
        Conv((kernel_size, 4), 1 => filters, pad = (0, 0), bias = true, init = init),
        MaxPool((4, 1)),
        make2d,
    )
    d = Flux.outputsize(front, (n_in..., n_channels, 1)) |> first
    return Chain(front, Dense(d, dense, init = init), Dense(dense, n_out, init = init))
end

clf = MLJFlux.ImageClassifier(
    builder = MyDNABuilder(kernel_size = 12, filters = 32, dense = 16),
    loss = Flux.crossentropy,
    optimiser = ADAM(0.001, (0.9, 0.999)),
    epochs = 50,
    batch_size = 32,
    rng = 123,
    finaliser = softmax,
)

This network specification should be similar to the one used in Python (see my original post).

To capture the training losses as a function of epoch, I followed the MNIST example from MLJFlux:

# To store the traces:
losses = Float32[]
training_losses = Float32[]
parameter_means = Float32[];
epochs = Int64[]

# To update the traces:
update_loss(loss) = push!(losses, loss)
update_training_loss(report_losses) = push!(training_losses, report_losses[end])
update_epochs(epoch) = push!(epochs, epoch)


controls = [
    Step(1),
    NumberLimit(n = 50),
    WithLossDo(),
    WithLossDo(update_loss),
    WithTrainingLossesDo(update_training_loss),
    WithIterationsDo(update_epochs),
];

# The "self-iterating" classifier:
iterated_clf = IteratedModel(
    model = clf,
    controls = controls,
    resampling = Holdout(fraction_train = 0.75),
    measure = log_loss,
);

# Binding the wrapped model to data:
mach = machine(iterated_clf, images[train], labels[train]);

# Training
fit!(mach);

This gives the following training loss curve:

[training loss curve plot]

I do not understand why the loss decreases, then increases around epochs 25–30, and then suddenly drops again. If anyone has any suggestions, please let me know! In comparison, the training loss in the Python example does not exhibit similar behaviour.

The accuracy on the test set is 96.8% using Julia, while it is 97.4% using Python, so the Python version has a slight edge there.

The confusion matrix from Julia is:

              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │      0      │      1      │
├─────────────┼─────────────┼─────────────┤
│      0      │     241     │      3      │
├─────────────┼─────────────┼─────────────┤
│      1      │     13      │     243     │
└─────────────┴─────────────┴─────────────┘

and in Python is:

[[249 10]
 [ 3 238]]
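
For reference, the numbers above come from roughly the following calls on the MLJ side (a sketch, my addition, assuming the fitted machine mach from above):

yhat = predict_mode(mach, images[test])
accuracy(yhat, labels[test])           # ≈ 0.968
confusion_matrix(yhat, labels[test])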

Just a quick side-note:
For one-hot encoding you can simply use Flux.onehotbatch():

julia> Flux.onehotbatch(collect(sequence), alphabet)
4×46 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  …  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1
 1  1  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅     1  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  1  ⋅  1  ⋅  ⋅  ⋅
 ⋅  ⋅  1  ⋅  1  1  1  ⋅  ⋅  ⋅  ⋅  1  1  ⋅  ⋅  ⋅  1  1  ⋅     ⋅  ⋅  ⋅  ⋅  1  1  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  1  1  ⋅
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  1  ⋅  ⋅  1  1  1  ⋅  ⋅  ⋅     ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  1  1  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅

No clue, I assume there's some historical baggage involved, since e.g. Flax uses a similar unified API to Flux's.


The models are not identical. One thing that immediately jumps out is that PyTorch uses Kaiming init for weights and a scaled uniform init for biases, whereas here you're using Glorot init. Flux exposes a similar set of init utilities, so you should be able to replicate the PyTorch behaviour there.

Whoops, that's what I get for replying on mobile and not reading the Colab notebook. The initializations are indeed the same. Have you tried running each model a few times with different seeds to see whether the discrepancy persists?

Thanks for that suggestion! I chose the other method due to speed and the number of allocations. With sequence = sequences_raw[1], I find:

using BenchmarkTools: @btime

@btime Flux.onehotbatch(collect(sequence), bases_dna);
# output: 64.762 μs (411 allocations: 19.00 KiB)
@btime ohe_row(sequence);
# output: 662.865 ns (5 allocations: 496 bytes)

so a factor of 100 in speed, most importantly, and also almost a factor of 100 in the number of allocations.


I have tried to be as precise in the Keras => Flux translation as possible, including using the same initialisations, as you also mention. I should probably try rerunning the models as you say to see the size of the stochastic variability, good idea!

Could you open an issue in Flux about this? We should improve onehotbatch performance, especially if it's this easy.


Sorry for the bikeshedding (I know this doesn't help your actual problem in any way 🙃)

There's another factor of 50 in speed in the onehot encoding if you do:

const bases_dna = ['A', 'C', 'G', 'T'];

function ohe_row!(out, sequence)
    for (i, c) in enumerate(sequence)
        out[i, :] .= bases_dna .== c
    end
    nothing
end

function onehot2(sequences)
    L = 50
    N_bases = 4
    out = BitArray(undef, L, N_bases, size(sequences, 1))
    for i in eachindex(sequences)
        ohe_row!(view(out, :, :, i), sequences[i])  # index (not `view`) to get the string itself
    end
    out
end
julia> @btime onehot2(sequences_raw);
  37.463 μs (3 allocations: 48.97 KiB)
julia> @btime onehot(sequences_raw);
  1.594 ms (14003 allocations: 1.21 MiB)
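
A quick correctness check (my addition) that the two encoders agree:

@assert onehot2(sequences_raw) == onehot(sequences_raw)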

Haha, no, that's not really my main concern, but some extra performance is always good, so thanks!

OK, so on your original question:

Wouldn't the following correspond to the first convolution in the Python version? I.e., treat the base dimension (the dimension of length 4) as a channel dimension rather than a convolution dimension:

function build_first_conv(params::MyDNABuilder)
    init = Flux.glorot_uniform()
    Conv(
        (params.kernel_size,),
        4 => params.filters,
        pad = (0, 0),
        bias = true,
        init = init,
    )
end

That way I get the same number of parameters (1_568) and the same output shape (modulo batch-first vs batch-last: (39, 32, 1500)).
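
A quick check of that parameter count (my addition): 12 × 4 × 32 weights plus 32 biases:

layer = Conv((12,), 4 => 32)
sum(length, Flux.params(layer))   # 12*4*32 + 32 = 1_568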

An additional difference from the Python version seems to be that you're missing a relu activation after the first Dense layer.

For the full model I would then have:

function build_model(params::MyDNABuilder)
    init = Flux.glorot_uniform()
    front = Chain(
        Conv(
            (params.kernel_size,),
            4 => params.filters,
            pad=(0, 0),
            bias = true,
            init = init
        ),
        MaxPool((4,)),
        flatten,
    )
    d = Flux.outputsize(front, (50, 4, 1))[1]
    Chain(
        front,
        Dense(
            d,
            params.dense,
            relu,
            init = init
        ),
        Dense(
            params.dense,
            2,
            init = init
        )
    )
end

Also note that you don't need to convert to a gray image. Just plug in the data as it comes out of the onehot function (convert to Float32 though, for fast gradients).

julia> model = build_model(MyDNABuilder(12, 32, 16))
Chain(
  Chain(
    Conv((12,), 4 => 32),               # 1_568 parameters
    MaxPool((4,)),
    Flux.flatten,
  ),
  Dense(288, 16, relu),                 # 4_624 parameters
  Dense(16, 2),                         # 34 parameters
)                   # Total: 6 arrays, 6_226 parameters, 25.055 KiB.
julia> size(x)
(50, 4, 1500)
julia> typeof(x)
BitArray{3}
julia> size(model(x))
(2, 1500)

Thanks a lot for your answer, @trahflow!

I have taken a look at your solution and there are a couple of things I am in doubt about.
First of all, your model's signature is:

Chain(
  Chain(
    Conv((12,), 4 => 32),               # 1_568 parameters
    MaxPool((4,)),
    Flux.flatten,
  ),
  Dense(288, 16, relu),                 # 4_624 parameters
  Dense(16, 2),                         # 34 parameters
)                   # Total: 6 arrays, 6_226 parameters, 25.055 KiB.

compared to mine:

Chain(
  Chain(
    Conv((12, 4), 1 => 32),             # 1_568 parameters
    MaxPool((4, 1)),
    Flux.flatten,
  ),
  Dense(288, 16, relu),                 # 4_624 parameters
  Dense(16, 2),                         # 34 parameters
)                   # Total: 6 arrays, 6_226 parameters, 25.266 KiB.

They both seem similar when comparing the number of parameters, so aren't these two models actually the same? Or maybe I am missing something.

Btw, you were completely right about me missing a relu in the second-to-last Dense layer, so thanks a lot for spotting that!

I couldn't really get your function to play together with MLJFlux.
In particular, MLJFlux requires defining a new MLJFlux.Builder method with one of these signatures:

MLJFlux.build(builder::MyBuilder, rng, n_in, n_out)
MLJFlux.build(builder::MyBuilder, rng, n_in, n_out, n_channels) # for use with `ImageClassifier`

so the use of channels seems to imply the use of images?

Regarding channels, it also says that

x <: Array{<:Float32, 4} of size (W, H, n_channels, batch_size), where (W, H) = n_in, n_channels is 1 or 3, and batch_size is any integer (for use with ImageClassifier)

so it sounds like you cannot even use 4 channels.

You're right that, since you set the padding to zero, both ways should be the same except for the extra dimension. I just quickly tried though, and the two different convolutions give different results (aside from the extra dimension in your version, and with the same weights of course). Not sure why.

I don't have much experience with MLJFlux. But from what you say, ImageClassifier seems to tie you to actual grayscale or RGB images (where the color dimension = channels).
Wouldn't NeuralNetworkClassifier be a bit more flexible in that regard?

Anyway, I think a non-padded K×4 convolution over an L×4 image with a single channel should be equivalent to a length-K convolution over a length-L sequence with four channels (even though it seems the outputs of the two operations are not equal). So I'm not sure whether it's worth the effort to figure out how to use the latter with MLJFlux. Maybe someone else can say more on why the operations are different.
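
One possible explanation worth checking (a sketch, my addition, untested): Flux's Conv flips its kernels (CrossCor is the unflipped variant), so the 2D version also flips along the height (base) dimension, while the 1D version treats that dimension as channels and leaves it alone. Reversing the shared weights along that dimension should then make the two agree:

using Flux

w = randn(Float32, 12, 4, 1, 32)    # (W, H, Cin, Cout), as for Conv((12, 4), 1 => 32)
c2d = Conv(w, zeros(Float32, 32))
c1d = Conv(reshape(reverse(w, dims = 2), 12, 4, 32), zeros(Float32, 32))  # Conv((12,), 4 => 32)

x = rand(Float32, 50, 4, 1, 1)
y2d = c2d(x)                        # (39, 1, 32, 1)
y1d = c1d(dropdims(x, dims = 3))    # (39, 32, 1)
y1d ≈ dropdims(y2d, dims = 2)       # expected: true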

How do things look with that additional relu?


As I read the documentation, NeuralNetworkClassifier deals with

<:Table(Continuous) with n_in columns

which I understand as just a 2D table?

To keep using MLJFlux as the framework, I am thinking that, since MLJFlux uses the dimensions W x H x C x N and treats my sequences / “images” as 50 x 4 x 1 x N, I should be able to use permutedims(image, (1, 3, 2, 4)) to permute them to 50 x 1 x 4 x N and then use your suggestion about the convolution with:

Conv((12, 1), 4 => 32, pad = 0, bias = true)

This would treat the 4 bases as channels instead of height, as far as I see.
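
A quick shape check of that idea (my addition):

x = rand(Float32, 50, 4, 1, 8)      # W x H x C x N, as MLJFlux sees it
xp = permutedims(x, (1, 3, 2, 4))   # 50 x 1 x 4 x 8
size(Conv((12, 1), 4 => 32)(xp))    # (39, 1, 32, 8)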

When I added the relu to my previous model, I get the following loss curve:

[loss curve plot]

Still with the same weird increase in loss. I'll give my permutation approach a go now and see how that goes.

Thanks for your patience!


Things are starting to look better!

I just tried the new version using your suggestion regarding the convolutions:

function permute_height_channels(A)
    return permutedims(A, (1, 3, 2, 4))
end

@kwdef struct MyDNABuilder2
    kernel_size::Int
    filters::Int
    dense::Int
end

function MLJFlux.build(b::MyDNABuilder2, rng, n_in, n_out, n_channels)
    kernel_size, filters, dense = b.kernel_size, b.filters, b.dense
    init = Flux.glorot_uniform(rng)
    front = Chain(
        permute_height_channels,
        Conv((kernel_size, 1), 4 => filters, pad = 0, bias = true, init = init),
        MaxPool((4, 1)),
        flatten,
    )
    d = Flux.outputsize(front, (n_in..., n_channels, 1)) |> first
    return Chain(
        front,
        Dense(d, dense, relu, init = init),
        Dense(dense, n_out, init = init),
    )
end

The loss curve for this method is:

[loss curve plot]

which already looks a lot better!

Also the confusion matrix is better than before:

              ┌───────────────────────────┐
              │       Ground Truth        │
┌─────────────┼─────────────┬─────────────┤
│  Predicted  │      0      │      1      │
├─────────────┼─────────────┼─────────────┤
│      0      │     245     │      2      │
├─────────────┼─────────────┼─────────────┤
│      1      │      9      │     244     │
└─────────────┴─────────────┴─────────────┘

and the accuracy is 97.8%, which I would say is on par with Python.


I tried implementing everything in pure Flux (no MLJFlux), however without any luck.

The code is meant to be as similar to the previous code as possible.

using Flux
using Random
using Downloads
using Flux.Losses: logitcrossentropy
using Flux: onehotbatch
using Flux: onecold
using Flux.Data: DataLoader

#%%

const bases_dna = ['A', 'C', 'G', 'T']

function ohe_row(sequence)
    return collect(sequence) .== permutedims(bases_dna)
end

function ohe_row!(out, sequence)
    for (i, c) in enumerate(sequence)
        out[i, :] .= bases_dna .== c
    end
end

function onehot(sequences)
    L = 50
    N_bases = 4
    out = BitArray(undef, L, N_bases, size(sequences, 1))
    for i in eachindex(sequences)
        ohe_row!(view(out, :, :, i), sequences[i])  # index (not `view`) to get the string itself
    end
    return out
end

#%%

function get_data()

    url_seqs = "https://raw.githubusercontent.com/abidlabs/deep-learning-genomics-primer/master/sequences.txt"
    sequences_raw = split(String(take!(Downloads.download(url_seqs, IOBuffer()))));

    url_labels = "https://raw.githubusercontent.com/abidlabs/deep-learning-genomics-primer/master/labels.txt"
    labels_raw = parse.(Int64, split(String(take!(Downloads.download(url_labels, IOBuffer())))));

    # One-hot-encode the sequences
    sequences = Float32.(onehot(sequences_raw));

    # One-hot-encode the labels
    labels = onehotbatch(labels_raw, 0:1)

    N_sequences = last(size(sequences))
    idxs = shuffle(1:N_sequences);
    idx_split = round(Int, N_sequences * 0.75)
    train = idxs[1:idx_split];
    test = idxs[idx_split+1:end];

    # Create DataLoaders (mini-batch iterators)
    train_loader = DataLoader((sequences[:, :, train], labels[:, train]), batchsize=32, shuffle=true);
    test_loader = DataLoader((sequences[:, :, test], labels[:, test]), batchsize=32);

    return train_loader, test_loader

end

#%%

function build_model()
    init = Flux.glorot_uniform()
    front = Chain(
        Conv(
            (12, ),
            4 => 32,
            pad = 0,
            bias = true,
            init = init
        ),
        MaxPool((4, )),
        Flux.flatten,
    )
    d = Flux.outputsize(front, (50, 4, 1))[1] # 288
    Chain(
        front,
        Dense(
            d,
            16,
            relu,
            init = init
        ),
        Dense(
            16,
            2,
            init = init
        ),
    )
end



function loss_and_accuracy(data_loader, model)
    acc = 0
    ls = 0.0f0
    num = 0
    for (x, y) in data_loader
        ŷ = model(x)
        ls += logitcrossentropy(ŷ, y, agg=sum)
        acc += sum(onecold(ŷ) .== onecold(y))
        num +=  size(x)[end]
    end
    return ls / num, acc / num
end

loss(ŷ, y) = logitcrossentropy(ŷ, y)

function train()

    train_loader, test_loader = get_data()

    # Construct model
    model = build_model()
    ps = Flux.params(model)  # the model's trainable parameters

    ## Optimizer
    opt = ADAM(0.001, (0.9, 0.999))

    train_losses = Float64[]
    train_accuracies = Float64[]
    test_losses = Float64[]
    test_accuracies = Float64[]

    ## Training
    epochs = 50
    for epoch in 1:epochs

        for (x, y) in train_loader

            gs = Flux.gradient(ps) do
                ŷ = model(x)
                loss(ŷ, y)
            end

            # gs = gradient(() -> logitcrossentropy(model(x), y), ps) # compute gradient
            Flux.Optimise.update!(opt, ps, gs) # update parameters
        end

        # Report on train and test
        train_loss, train_acc = loss_and_accuracy(train_loader, model)
        test_loss, test_acc = loss_and_accuracy(test_loader, model)
        println("Epoch=$epoch")
        println("  train_loss = $train_loss, train_accuracy = $train_acc")
        println("  test_loss = $test_loss, test_accuracy = $test_acc")

        push!(train_losses, train_loss)
        push!(train_accuracies, train_acc)

        push!(test_losses, test_loss)
        push!(test_accuracies, test_acc)

    end

    return train_losses, train_accuracies, test_losses, test_accuracies

end

train_losses, train_accuracies, test_losses, test_accuracies = train();
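
For the plots below, something like this CairoMakie sketch works (my addition; axis labels assumed):

using CairoMakie

fig = Figure()
ax = Axis(fig[1, 1], xlabel = "epoch", ylabel = "loss")
lines!(ax, train_losses, label = "train")
lines!(ax, test_losses, label = "test")
axislegend(ax)
fig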

The loss curve looks like:

[loss curve plot]

And the accuracy curve looks like:

[accuracy curve plot]

Weird, I independently just implemented it in Flux and saw the same behaviour.
I don't have time to debug it right now, but if you figure it out I would be very curious to hear what's causing it.