GSoC '17 Proposal | Enabling Julia to target GPUs through Polly and improve code using run-time information

Hello All,

I’m Sanjay Srivallabh, a final-year undergraduate student at BITS-Pilani, Hyderabad Campus, in India. I’m currently pursuing a semester-long dissertation in Polyhedral Compilation under Dr. Ramakrishna at IIT-Hyderabad, India.

I’d like to take GSoC as a learning opportunity and propose to:

  • Enable Julia to run on GPUs through Polly
  • Better optimise code using run-time information
  • Enable Polly to choose between sending code to the GPU or the CPU for best results

I’ll be sharing the link to my proposal in a while. Looking forward to your feedback and suggestions.

Thank You,
Sanjay

Have you checked out CUDAnative.jl? Would this be an alternative to that?

Hello @MikeInnes,

CUDAnative.jl requires you to understand CUDA syntax and the GPU execution model to write a function that runs on GPUs. Polly, on the other hand, can turn any Julia code that conforms to certain specifications into NVPTX code. It’s an auto-parallelization pass in LLVM.
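To make "conforms to certain specifications" a bit more concrete: polyhedral tools such as Polly generally look for loop nests whose bounds and array subscripts are affine functions of the loop counters and fixed sizes. Below is a minimal sketch of such a loop nest, written in C since Polly already works with clang. The function name `matvec` and the sizes `ROWS` and `COLS` are made up for the illustration and are not taken from Polly or Julia.

```c
#include <stddef.h>

enum { ROWS = 4, COLS = 3 };  /* hypothetical fixed sizes for this sketch */

/* y = A * x. The loop bounds are constants and every subscript
 * (A[i][j], x[j], y[i]) is an affine function of i and j. */
void matvec(float A[ROWS][COLS], const float x[COLS], float y[ROWS])
{
    for (size_t i = 0; i < ROWS; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < COLS; j++) {
            acc += A[i][j] * x[j];
        }
        y[i] = acc;
    }
}
```

By contrast, a loop that indexes through another array (something like `A[idx[i]]`) has a data-dependent subscript, and typically falls outside what this kind of analysis can model.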

How will you handle downloads from and uploads to the GPU, and automatically decide whether the latency plus the transfer penalty is worth this optimization?
Correct me if I’m wrong, but it seems like CUDAnative and the Polly approach are orthogonal. I’m guessing that with Polly you only decide what code you could execute on the GPU. But then you still need to take the LLVM code, compile it, link it and execute the GPU kernel, which is what CUDAnative can do!
Or is this all included in Polly already? That’d be pretty magical :slight_smile:

Polly currently uses a simple cost model that can decide this.
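For readers wondering what such a decision involves, here is a rough sketch of the trade-off raised in the question: offloading pays off only when launch latency plus transfer time plus GPU compute time beats the CPU compute time. This is not Polly's actual cost model. The struct, its fields and the function names are all hypothetical, and the throughput figures would have to be measured per machine.

```c
#include <stdbool.h>

/* All fields are made-up, machine-specific inputs for this sketch. */
struct offload_params {
    double launch_latency_s;     /* fixed cost of one kernel launch */
    double transfer_bytes_per_s; /* host <-> device bandwidth */
    double gpu_flops_per_s;      /* sustained GPU throughput */
    double cpu_flops_per_s;      /* sustained CPU throughput */
};

/* Estimated time on the GPU: launch + data movement + compute. */
double gpu_time_s(const struct offload_params *p, double bytes_moved, double flops)
{
    return p->launch_latency_s
         + bytes_moved / p->transfer_bytes_per_s
         + flops / p->gpu_flops_per_s;
}

/* Estimated time on the CPU: compute only, data is already in host memory. */
double cpu_time_s(const struct offload_params *p, double flops)
{
    return flops / p->cpu_flops_per_s;
}

/* Offload only if the GPU estimate, penalties included, is lower. */
bool should_offload(const struct offload_params *p, double bytes_moved, double flops)
{
    return gpu_time_s(p, bytes_moved, flops) < cpu_time_s(p, flops);
}
```

With a 1 ms launch latency, 1 GB/s transfers, and a GPU that is 100x faster than the CPU, a kernel that moves 1 MB but does only 1 MFLOP stays on the CPU in this model, while one that does 10 GFLOP on the same data is offloaded.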

Yes, they are orthogonal.

Yes, they are! It already works with clang; have a look at this

I see. Not sure how this will apply to Julia. Also, one big problem isn’t solved yet: determining where to place the data.
Probably an easier scope would be to use Polly to optimize kernels which are written using for loops over CuArrays. So you could implement that as a pass inside CUDAnative.
CC @maleadt, does this make sense?

The above GSoC project lists Code Generation For Host as only 50% done, and it also doesn’t seem to be mainlined? If that is the case, I would agree with @sdanisch that using the parts of Polly that already work (presumably recognizing parallelizable loops, transforming them for optimal GPU execution, maybe some data placement, etc.), either as a pass in CUDAnative or as a new package building on top of CUDAdrv (or GPUArrays for a vendor-neutral alternative), might be a better choice.

Another reason it might be interesting to implement this as part of CUDAnative (or similar) is that you would obviously need to stick to a subset of Julia which Polly can analyze, and that can be executed on GPUs. CUDAnative is already doing exactly that, so it would be a shame if we introduced yet another new Julia → GPU compilation path.

Either way, looking forward to your proposal!

@Tobias_Grosser1 Could you please clarify what “Code Generation For Host as only 50% done” means? Does it still reflect the current state of the project?

@maleadt Polly generates NVPTX code and stores it as a string within the LLVM IR. It then inserts calls to a runtime to handle data transfers and launch the kernel. Also, a GSoC project has been proposed to extend Polly to generate SPIR-V code for GPUs that don’t support NVPTX.
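The flow described here (kernel text embedded in the host module, plus inserted runtime calls for transfers and the launch) could look roughly like the following after a loop `for (i...) a[i] *= factor;` has been offloaded. This is only a sketch of the shape of the host code: every `rt_*` function and the `dev_buf` type are hypothetical stand-ins, not Polly's real runtime interface, and they are mocked on the host so the example runs without a GPU.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel text stored in the host module. In the real
 * flow this would be NVPTX assembly; it is left as a placeholder here. */
static const char kernel_text[] = "/* device code would go here */";

/* Hypothetical "device" buffer, backed by host memory in this mock. */
typedef struct { float *mem; size_t n; } dev_buf;

static dev_buf rt_alloc(size_t n)
{
    dev_buf b = { malloc(n * sizeof(float)), n };
    return b;
}
static void rt_to_device(dev_buf b, const float *host)
{
    memcpy(b.mem, host, b.n * sizeof(float));
}
static void rt_to_host(float *host, dev_buf b)
{
    memcpy(host, b.mem, b.n * sizeof(float));
}
static void rt_free(dev_buf b) { free(b.mem); }

/* Mock launch: ignores the kernel text and runs the loop body on the host. */
static void rt_launch(const char *kernel, dev_buf b, float factor)
{
    (void)kernel;
    for (size_t i = 0; i < b.n; i++) b.mem[i] *= factor;
}

/* Host function after offloading: the original loop is replaced by
 * allocate, copy in, launch, copy out, free. */
void scale_offloaded(float *a, size_t n, float factor)
{
    dev_buf d = rt_alloc(n);
    if (d.mem == NULL) {                 /* fall back to the plain host loop */
        for (size_t i = 0; i < n; i++) a[i] *= factor;
        return;
    }
    rt_to_device(d, a);
    rt_launch(kernel_text, d, factor);
    rt_to_host(a, d);
    rt_free(d);
}
```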

From what I understand from CUDAnative’s repo, it lets you write functions meant just for the GPU in a syntax similar to that of CUDA kernels in Julia. Polly works at a higher level, turning general (and suitable) Julia code into NVPTX. So, I’m not sure if,

is possible. @Tobias_Grosser1 Could you please share your thoughts on this?

@sanyam: The website at Polly - GPGPU Code Generation is completely outdated. It should be removed and replaced with actual documentation for Polly-ACC. If you are interested in performance numbers, read (or at least skim) this paper: http://grosser.es/bibliography/grosser2016pollyacc.html

So yes, we can do fully automatic GPU code generation with Polly-ACC for two SPEC benchmarks and a variety of computational kernels. There are a lot more opportunities, so I believe this is indeed something pretty exciting for Julia.

Also, yes, this works like magic, “fully automatically without any user interaction”. Similar to real-world magic, it has constraints in what can be done. Still, I believe it would be a great idea to get this to Julia.

Hello All,

Here’s the link to the draft of my GSoC proposal. Please comment on it and let me know your suggestions.

Thank You,
Sanjay

How does it deal with rich CPU objects? I can imagine a C float* getting detected and offloaded, but would the same apply to e.g. a jl_array_t* (containing another data pointer & metadata)? That’s where I figured some manual work at the frontier between host & @polly annotated code would be necessary.
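To illustrate the concern, here is a small sketch of the kind of unwrapping that might be needed at that frontier. The `boxed_array` struct is a hypothetical stand-in with just a data pointer and a length; it is not the real jl_array_t layout, and the function names are made up.

```c
#include <stddef.h>

/* Hypothetical boxed array: a data pointer plus metadata.
 * NOT the actual jl_array_t definition. */
typedef struct {
    float *data;
    size_t length;
} boxed_array;

/* Kernel over a raw pointer: the easy case, where the accesses
 * a[i] are directly visible to the analysis. */
void scale_raw(float *restrict a, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++) {
        a[i] *= factor;
    }
}

/* Frontier code: load the pointer and length out of the box once,
 * outside the loop, then hand plain values to the kernel. */
void scale_boxed(boxed_array *arr, float factor)
{
    float *data = arr->data;
    size_t n = arr->length;
    scale_raw(data, n, factor);
}
```

Whether this unwrapping can be done automatically, or has to be written by hand around the annotated code, is exactly the open question.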

@singam-sanjay: to elaborate on my comment on your proposal, I think it would be better to avoid adding Polly-specific flow to the main codegen, instead trying to use or improve existing mechanisms for outlining compiler functionality, and implementing your Polly-specific functionality as part of a package. This has many advantages (maintainability of both Julia’s codegen and your work, lower barrier to contributing, you get to work in Julia instead of C++, etc). Although that functionality (CodegenParams, CodegenHooks) has been developed for CUDAnative.jl, it is meant to be generic, and extending those mechanisms is not much work (adding more hooks or params, slightly restructuring codegen).

Thanks for the suggestions and info, @maleadt!