Weight decay adds a term to the cost function saying that the weights should be small in the L2 sense, which can be at odds with having weights that fit the data well. In some situations there is reason to believe that small model parameters are better than large ones, but you can easily imagine that if you let lambda go to infinity the weights will go to zero, and zero weights do not give you a good model.
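To make that concrete, here is a minimal sketch (plain NumPy, with a hypothetical ridge-regression setup and made-up toy data) of how the L2 penalty enters the cost and its gradient; `lam` is the weight-decay strength, and as it grows the fitted weights are pulled toward zero:

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Squared-error data term plus the L2 (weight decay) penalty."""
    residual = X @ w - y
    data_term = 0.5 * np.mean(residual ** 2)
    penalty = 0.5 * lam * np.sum(w ** 2)  # pushes weights toward zero
    return data_term + penalty

def gradient_step(w, X, y, lam, lr=0.05):
    """One gradient step; the lam * w term shrinks the weights each step."""
    grad = X.T @ (X @ w - y) / len(y) + lam * w
    return w - lr * grad

# Toy data: the larger lam is, the more the fitted weights shrink toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
for lam in (0.0, 1.0, 10.0):
    w = np.zeros(3)
    for _ in range(500):
        w = gradient_step(w, X, y, lam)
    print(f"lam={lam}: w={np.round(w, 3)}")
```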
57k training samples sounds like a decent amount. I would try making the model larger until the validation error starts to go up; the size just before that point is probably a sweet spot without the model getting too complicated.
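A rough sketch of that capacity sweep, assuming a hypothetical `train_and_validate(size)` helper that trains a model of the given size on your training set and returns its validation error:

```python
def pick_model_size(sizes, train_and_validate):
    """Grow the model until the validation error stops improving.

    `sizes` is an increasing list of candidate capacities (e.g. hidden units);
    `train_and_validate(size)` is assumed to train on the training samples
    and return the error on a held-out validation set.
    """
    best_size, best_val = None, float("inf")
    for size in sizes:
        val_err = train_and_validate(size)
        print(f"size={size}: validation error {val_err:.4f}")
        if val_err < best_val:
            best_size, best_val = size, val_err
        else:
            break  # validation error went up: the previous size was the sweet spot
    return best_size
```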