[RFC] Optimizations using run-time information for Polly

singam-sanjay · May 19, 2017, 12:43pm

Hello,

As a part of my GSoC project to integrate Julia and Polly-ACC, I’m looking at using run-time information to perform better optimisations.

The first thing would be to inform Polly about the “values” of function parameters. “Values” could include,

Numerical parameters (Int,Float).
Size and Type information of array parameters.

This information can help Polly figure out the most appropriate transformation. A possible scheme would be to leave this information in the IR when functions are annotated with @polly as,

llvm.expect or llvm.assume intrinsics (llvm.expect might be removed in the future)
assignments to specially named variables like %param.0 (which is the value of the first parameter), %param.1.dim1 (number of elements in the 1st dimension of a 2nd function parameter) etc.

Polly’s GPU optimizer, ppcg, is then fed this data.

The second step would be to pick and run the most appropriate code according to the parameters. The implementation of this step depends on the way the multiple machine-code definitions (each appropriate for a particular set of parameters) of the same function are going to be stored.

In this regard, I’d like to know if it’s possible to make Julia maintain multiple function machine-code definitions of the same function for the same sequence of Types of the parameters, yet, differ in being
appropriate for a particular range of values of the parameters. E.g. vector_add(a,b) could have,

An x86 definition for when size(a) < 1,000,000 which runs entirely on the CPU
And a definition with an additional NVPTX kernel for when size(a) > 1,000,000

The pick part of the step would have to reflect the way ppcg would optimize for “case” to select the correct code.

Instead of that,we could update the machine code to include code to handle a new “case” and embed a fence to direct the program control to the code appropriate for a given set of values of the parameters.

Please share your thoughts.

Thanks,
Sanjay

Tobias_Grosser1 · May 26, 2017, 9:59am

Polly could possibly multi-version the code itself. I think using llvm.expect to give common parameter values might be a good start.

tkelman · May 27, 2017, 5:15am

https://github.com/JuliaLang/julia/pull/21849 may be a relevant approach you could try to generalize

singam-sanjay · May 29, 2017, 12:06pm

Here’s my idea to specialise functions,

The generic function is first defined,

function kernel_gemm( alpha, matC, beta, matA, matB )
  for i = ...
    for j = ...
      for k = ...
      ...
      end
    end
  end
end

2.Define wrapper functions that in turn call the general function, and provide hints about the data,

function kernel_gemm_big( alpha, matC, beta, matA, matB )
  @assume prod(size(matA)) > 1000000
  @assume prod(size(matB)) > 1000000
  @assume prod(size(matC)) > 1000000
  @inline kernel_gemm( alpha, matC, beta, matA, matB )
end
function kernel_gemm_small( alpha, matC, beta, matA, matB )
  @assume prod(size(matA)) < 10000
  @assume prod(size(matB)) < 10000
  @assume prod(size(matC)) < 10000
  @inline kernel_gemm( alpha, matC, beta, matA, matB )
end

@assume would be a macro that’s turned into llvm.assume intrinsic, something like _ _ builtin_assume for C. Does Julia have an equivalent of @assume ?

The general function has to be in-lined since both the code and assumptions are required to reside in the same function.

In case of neural networks, the code be highly specialised for each layer since the input and output dimensions remain the same for a particular layer.

Thoughts ?

yuyichao · May 29, 2017, 12:54pm

It doesn’t need to be a macro https://github.com/yuyichao/FunctionWrappers.jl/blob/master/src/FunctionWrappers.jl#L8 works. It should be even easier on 0.6+ now.

singam-sanjay · May 29, 2017, 7:21pm

Oh WOW ! Is it something like inlining assembly in C with __ asm__ ? Where the registers used in the string are mere placeholders ?

yuyichao · May 29, 2017, 7:41pm

It’s not assembly so there’s no register references.

singam-sanjay · May 30, 2017, 2:25pm

I was treating the SSA values as registers, i.e. when inlining assume, %0 would have to be replaced by the argument to assume and other registers would have to renamed so that they don’t interfere with the registers present in the scope.


define void @julia_assume_62320(i8) #0 !dbg !5 {
top:
  %1 = and i8 %0, 1
  %v.i = icmp ne i8 %1, 0
  call void @llvm.assume(i1 %v.i)
  ret void
}
[...]
; Code for the condition to be assumed
[...]
call void julia_assume_62320(i1 %cond_reg)
[...]

to


[...]
; Code for the condition to be assumed
[...]
%59 = and i8 %58, 1
%60 = icmp ne i8 %59, 0
call void @llvm.assume(i1 %60)
[...]

singam-sanjay · May 30, 2017, 5:47pm

@Tobias_Grosser1 ,

For the following Julia code,

function add_1(A)
  assume(size(A,1)==5)
  for i=1:size(A,1)
    @inbounds A[i] += 1;
  end
end

Polly receives,

define void @julia_add_1_62297(i8** dereferenceable(40)) #0 !dbg !5 {
top:
  br label %top.split

top.split:                                        ; preds = %top
  %1 = bitcast i8** %0 to i64**
  %2 = load i64*, i64** %1, align 8, !tbaa !7
  br label %if, !dbg !10

if:                                               ; preds = %top.split, %if
  %"#temp#.03" = phi i64 [ 1, %top.split ], [ %3, %if ]
  %3 = add i64 %"#temp#.03", 1, !dbg !10
  %4 = add i64 %"#temp#.03", -1, !dbg !11
  %5 = getelementptr i64, i64* %2, i64 %4, !dbg !11
  %6 = load i64, i64* %5, align 8, !dbg !11, !tbaa !12
  %7 = add i64 %6, 1, !dbg !11
  store i64 %7, i64* %5, align 8, !dbg !11, !tbaa !12
  %8 = icmp eq i64 %3, 6, !dbg !10
  br i1 %8, label %L20, label %if, !dbg !10

L20:                                              ; preds = %if
  ret void, !dbg !11
}

The loop bounds are available statically.

And for inqualities,

function add_1(A)
  assume(size(A,1)>10); assume(size(A,1)<100);
  for i=1:size(A,1)
    @inbounds A[i] += 1;
  end
end

we have ,

define void @julia_add_1_62320(i8** dereferenceable(40)) #0 !dbg !5 {
top:
  br label %top.split

top.split:                                        ; preds = %top
  %1 = bitcast i8** %0 to i64**
  %2 = load i64*, i64** %1, align 8, !tbaa !7
  %3 = getelementptr i8*, i8** %0, i64 3
  %4 = bitcast i8** %3 to i64*
  %5 = load i64, i64* %4, align 8, !tbaa !7
  %6 = icmp ugt i64 %5, 10, !dbg !10
  call void @llvm.assume(i1 %6), !dbg !10
  %7 = icmp ult i64 %5, 100, !dbg !10
  call void @llvm.assume(i1 %7), !dbg !10
  %8 = icmp eq i64 %5, 0, !dbg !11
  br i1 %8, label %L23, label %if.lr.ph, !dbg !11

if.lr.ph:                                         ; preds = %top.split
  br label %if, !dbg !11

if:                                               ; preds = %if.lr.ph, %if
  %"#temp#.04" = phi i64 [ 1, %if.lr.ph ], [ %9, %if ]
  %9 = add i64 %"#temp#.04", 1, !dbg !11
  %10 = add i64 %"#temp#.04", -1, !dbg !12
  %11 = getelementptr i64, i64* %2, i64 %10, !dbg !12
  %12 = load i64, i64* %11, align 8, !dbg !12, !tbaa !13
  %13 = add i64 %12, 1, !dbg !12
  store i64 %13, i64* %11, align 8, !dbg !12, !tbaa !13
  %14 = icmp eq i64 %"#temp#.04", %5, !dbg !11
  br i1 %14, label %L10.L23_crit_edge, label %if, !dbg !11

L10.L23_crit_edge:                                ; preds = %if
  br label %L23, !dbg !11

L23:                                              ; preds = %L10.L23_crit_edge, %top.split
  ret void, !dbg !12
}

Which is also being detected by Polly,

    Function: julia_add_1_62320
    Region: %if---%L23
    Max Loop Depth:  1
    Invariant Accesses: {
    }
    Context:
    [p_0] -> {  : 11 <= p_0 <= 99 }
    Assumed Context:
    [p_0] -> {  :  }
    Invalid Context:
    [p_0] -> {  : 1 = 0 }
    p0: %5

yuyichao · May 30, 2017, 6:45pm

It creates it’s own function so it doesn’t interfere with anything else.

Topic		Replies	Views
GSoC 17' Proposal \| Enabling Julia to target GPUs through Polly and imrpove code using run time information Community announcement	11	2210	April 3, 2017
When does julia infer optimal code, when not? Performance question	4	685	November 16, 2018
How to prevent unwanted "optimization" in SIMD code? Performance question , llvm	8	224	May 18, 2025
Branching on algorithmic options - advice on code organization New to Julia	3	551	September 8, 2021
Blog: Using Julia on the HPC Teaching & Outreach blog-post	40	2240	October 10, 2022

[RFC] Optimizations using run-time information for Polly

Related topics