Hi,

I’ve been following Julia development on and off, and with the buzz around the latest release I decided to give it another try by writing a small neural network library. I work with Torch daily, so I implemented something with a similar API.

However, the performance of my code turned out to be terrible compared to Torch when running a forward-backward pass on a fully connected network with around 500,000,000 parameters.

```
500,000,000 parameters:
torch: 0.43s
julia: 2.84 - 2.98s
68,000,000 parameters:
torch: 0.05 - 0.06s
julia: 0.35 - 0.36s
```

So I tried to break it down to the main operations; you can see the code below. It performs the same as the network I wrote. Are there any obvious mistakes that might kill the performance? Any suggestions on how I can improve things?

```
module Test

T = Float32

W1 = rand(T, 2048, 512 * 512)
W2 = rand(T, 1024, 2048)
W3 = rand(T, 10, 1024)

dW1, dW2, dW3 = zeros(W1), zeros(W2), zeros(W3)
out1, out2, out3 = zeros(T, 2048), zeros(T, 1024), zeros(T, 10)
dOut1, dOut2, dOut = zeros(T, 2048), zeros(T, 1024), zeros(T, 512 * 512)

function mockNN(input::Array{Float32, 1}, error::Array{Float32, 1})
    # Forward
    BLAS.gemv!('N', T(1.0), W1, input, T(0.0), out1)
    BLAS.gemv!('N', T(1.0), W2, out1, T(0.0), out2)
    BLAS.gemv!('N', T(1.0), W3, out2, T(0.0), out3)

    # Backward
    # ∂E/∂inputs and ∂E/∂W
    fill!(dW3, 0)
    fill!(dOut2, 0)
    BLAS.gemv!('N', T(1.0), W3', error, T(0.0), dOut2)
    BLAS.ger!(T(1.0), error, out2, dW3)

    fill!(dW2, 0)
    fill!(dOut1, 0)
    BLAS.gemv!('N', T(1.0), W2', dOut2, T(0.0), dOut1)
    BLAS.ger!(T(1.0), dOut2, out1, dW2)

    fill!(dW1, 0)
    fill!(dOut, 0)
    BLAS.gemv!('N', T(1.0), W1', dOut1, T(0.0), dOut)
    BLAS.ger!(T(1.0), dOut1, input, dW1)
end

input = rand(T, 512 * 512)
error = rand(T, 10)
@time mockNN(input, error)

for i in 1:10
    input = rand(T, 512 * 512)
    error = rand(T, 10)
    @time mockNN(input, error)
end

end
```
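
In case it helps to read the code, here is what the mock network above computes, written as equations. This is only a sketch under a few assumptions: `error` is taken to be ∂E/∂out3, there are no nonlinearities between the layers (as in the code), and the symbols are shorthand rather than names from the code: x is `input`, o_k is `outk`, e is `error`, g_2, g_1 and g_0 are `dOut2`, `dOut1` and `dOut`, and G_k is `dWk`.

```latex
\begin{aligned}
\text{Forward:}\quad  & o_1 = W_1 x, \qquad o_2 = W_2 o_1, \qquad o_3 = W_3 o_2 \\
\text{Backward:}\quad & g_2 = W_3^{\top} e,   \qquad G_3 = e \, o_2^{\top} \\
                      & g_1 = W_2^{\top} g_2, \qquad G_2 = g_2 \, o_1^{\top} \\
                      & g_0 = W_1^{\top} g_1, \qquad G_1 = g_1 \, x^{\top}
\end{aligned}
```

Each matrix-vector product corresponds to one `BLAS.gemv!` call and each outer product to one `BLAS.ger!` call on a zeroed gradient buffer. With the shapes above, the parameter count is 2048 · 512² + 1024 · 2048 + 10 · 1024 = 538,978,304, which is the "around 500,000,000" case in the timings.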