GPU performance problem of locally connected layer with Tullio

Have you considered @ein? In a brief test, I found @ein to be faster than @tullio on the GPU with a 1×1 filter; please see here. It would be great to have a fast locally connected layer in Flux.
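For anyone who wants to try the comparison, here is a minimal CPU sketch of the 1×1 case. It assumes that @ein refers to the macro from OMEinsum.jl, and the array layout (width, height, channels, batch) and all variable names are my own choices for illustration, not taken from the linked test. With a 1×1 filter, a locally connected layer reduces to a separate channel-mixing matrix at every spatial position, which is a single contraction over the input channel index:

```julia
using OMEinsum, Tullio

# Hypothetical sizes, chosen only for illustration.
nx, ny, cin, cout, nb = 4, 4, 3, 5, 2

x = rand(Float32, nx, ny, cin, nb)    # input: width × height × channels × batch
w = rand(Float32, nx, ny, cin, cout)  # unshared weights: one cin × cout matrix per pixel

# Contract over the input channel index c; i, j index the spatial position,
# o the output channel, and b the batch element.
@tullio y_tullio[i, j, o, b] := w[i, j, c, o] * x[i, j, c, b]
@ein y_ein[i, j, o, b] := w[i, j, c, o] * x[i, j, c, b]

# Both macros should produce the same result up to floating-point error.
println(y_tullio ≈ y_ein)
```

This only checks that the two formulations agree; it says nothing about speed. To reproduce a GPU timing you would need to move x and w to the device and benchmark both lines there, and the outcome may depend on package versions and array sizes.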