[RFC] Optimizations using run-time information for Polly

Hello,

As a part of my GSoC project to integrate Julia and Polly-ACC, I’m looking at using run-time information to perform better optimisations.

The first thing would be to inform Polly about the “values” of function parameters. “Values” could include,

  • Numerical parameters (Int,Float).
  • Size and Type information of array parameters.

This information can help Polly figure out the most appropriate transformation. A possible scheme would be to leave this information in the IR when functions are annotated with @polly as,

  • llvm.expect or llvm.assume intrinsics (llvm.expect might be removed in the future)

  • assignments to specially named variables like %param.0 (which is the value of the first parameter), %param.1.dim1 (number of elements in the 1st dimension of a 2nd function parameter) etc.

Polly’s GPU optimizer, ppcg, is then fed this data.

The second step would be to pick and run the most appropriate code according to the parameters. The implementation of this step depends on the way the multiple machine-code definitions (each appropriate for a particular set of parameters) of the same function are going to be stored.

In this regard, I’d like to know if it’s possible to make Julia maintain multiple function machine-code definitions of the same function for the same sequence of Types of the parameters, yet, differ in being
appropriate for a particular range of values of the parameters. E.g. vector_add(a,b) could have,

  • An x86 definition for when size(a) < 1,000,000 which runs entirely on the CPU
  • And a definition with an additional NVPTX kernel for when size(a) > 1,000,000

The pick part of the step would have to reflect the way ppcg would optimize for “case” to select the correct code.

Instead of that,we could update the machine code to include code to handle a new “case” and embed a fence to direct the program control to the code appropriate for a given set of values of the parameters.

Please share your thoughts.

Thanks,
Sanjay

Polly could possibly multi-version the code itself. I think using llvm.expect to give common parameter values might be a good start.

https://github.com/JuliaLang/julia/pull/21849 may be a relevant approach you could try to generalize

Here’s my idea to specialise functions,

  1. The generic function is first defined,
function kernel_gemm( alpha, matC, beta, matA, matB )
  for i = ...
    for j = ...
      for k = ...
      ...
      end
    end
  end
end

2.Define wrapper functions that in turn call the general function, and provide hints about the data,

function kernel_gemm_big( alpha, matC, beta, matA, matB )
  @assume prod(size(matA)) > 1000000
  @assume prod(size(matB)) > 1000000
  @assume prod(size(matC)) > 1000000
  @inline kernel_gemm( alpha, matC, beta, matA, matB )
end
function kernel_gemm_small( alpha, matC, beta, matA, matB )
  @assume prod(size(matA)) < 10000
  @assume prod(size(matB)) < 10000
  @assume prod(size(matC)) < 10000
  @inline kernel_gemm( alpha, matC, beta, matA, matB )
end

@assume would be a macro that’s turned into llvm.assume intrinsic, something like _ _ builtin_assume for C. Does Julia have an equivalent of @assume ?

The general function has to be in-lined since both the code and assumptions are required to reside in the same function.

In case of neural networks, the code be highly specialised for each layer since the input and output dimensions remain the same for a particular layer.

Thoughts ?

It doesn’t need to be a macro https://github.com/yuyichao/FunctionWrappers.jl/blob/master/src/FunctionWrappers.jl#L8 works. It should be even easier on 0.6+ now.

Oh WOW ! Is it something like inlining assembly in C with __ asm__ ? Where the registers used in the string are mere placeholders ?

It’s not assembly so there’s no register references.

I was treating the SSA values as registers, i.e. when inlining assume, %0 would have to be replaced by the argument to assume and other registers would have to renamed so that they don’t interfere with the registers present in the scope.


define void @julia_assume_62320(i8) #0 !dbg !5 {
top:
  %1 = and i8 %0, 1
  %v.i = icmp ne i8 %1, 0
  call void @llvm.assume(i1 %v.i)
  ret void
}
[...]
; Code for the condition to be assumed
[...]
call void julia_assume_62320(i1 %cond_reg)
[...]

to


[...]
; Code for the condition to be assumed
[...]
%59 = and i8 %58, 1
%60 = icmp ne i8 %59, 0
call void @llvm.assume(i1 %60)
[...]

@Tobias_Grosser1 ,

For the following Julia code,

function add_1(A)
  assume(size(A,1)==5)
  for i=1:size(A,1)
    @inbounds A[i] += 1;
  end
end

Polly receives,

define void @julia_add_1_62297(i8** dereferenceable(40)) #0 !dbg !5 {
top:
  br label %top.split

top.split:                                        ; preds = %top
  %1 = bitcast i8** %0 to i64**
  %2 = load i64*, i64** %1, align 8, !tbaa !7
  br label %if, !dbg !10

if:                                               ; preds = %top.split, %if
  %"#temp#.03" = phi i64 [ 1, %top.split ], [ %3, %if ]
  %3 = add i64 %"#temp#.03", 1, !dbg !10
  %4 = add i64 %"#temp#.03", -1, !dbg !11
  %5 = getelementptr i64, i64* %2, i64 %4, !dbg !11
  %6 = load i64, i64* %5, align 8, !dbg !11, !tbaa !12
  %7 = add i64 %6, 1, !dbg !11
  store i64 %7, i64* %5, align 8, !dbg !11, !tbaa !12
  %8 = icmp eq i64 %3, 6, !dbg !10
  br i1 %8, label %L20, label %if, !dbg !10

L20:                                              ; preds = %if
  ret void, !dbg !11
}

The loop bounds are available statically.

And for inqualities,

function add_1(A)
  assume(size(A,1)>10); assume(size(A,1)<100);
  for i=1:size(A,1)
    @inbounds A[i] += 1;
  end
end

we have ,

define void @julia_add_1_62320(i8** dereferenceable(40)) #0 !dbg !5 {
top:
  br label %top.split

top.split:                                        ; preds = %top
  %1 = bitcast i8** %0 to i64**
  %2 = load i64*, i64** %1, align 8, !tbaa !7
  %3 = getelementptr i8*, i8** %0, i64 3
  %4 = bitcast i8** %3 to i64*
  %5 = load i64, i64* %4, align 8, !tbaa !7
  %6 = icmp ugt i64 %5, 10, !dbg !10
  call void @llvm.assume(i1 %6), !dbg !10
  %7 = icmp ult i64 %5, 100, !dbg !10
  call void @llvm.assume(i1 %7), !dbg !10
  %8 = icmp eq i64 %5, 0, !dbg !11
  br i1 %8, label %L23, label %if.lr.ph, !dbg !11

if.lr.ph:                                         ; preds = %top.split
  br label %if, !dbg !11

if:                                               ; preds = %if.lr.ph, %if
  %"#temp#.04" = phi i64 [ 1, %if.lr.ph ], [ %9, %if ]
  %9 = add i64 %"#temp#.04", 1, !dbg !11
  %10 = add i64 %"#temp#.04", -1, !dbg !12
  %11 = getelementptr i64, i64* %2, i64 %10, !dbg !12
  %12 = load i64, i64* %11, align 8, !dbg !12, !tbaa !13
  %13 = add i64 %12, 1, !dbg !12
  store i64 %13, i64* %11, align 8, !dbg !12, !tbaa !13
  %14 = icmp eq i64 %"#temp#.04", %5, !dbg !11
  br i1 %14, label %L10.L23_crit_edge, label %if, !dbg !11

L10.L23_crit_edge:                                ; preds = %if
  br label %L23, !dbg !11

L23:                                              ; preds = %L10.L23_crit_edge, %top.split
  ret void, !dbg !12
}

Which is also being detected by Polly,

    Function: julia_add_1_62320
    Region: %if---%L23
    Max Loop Depth:  1
    Invariant Accesses: {
    }
    Context:
    [p_0] -> {  : 11 <= p_0 <= 99 }
    Assumed Context:
    [p_0] -> {  :  }
    Invalid Context:
    [p_0] -> {  : 1 = 0 }
    p0: %5

It creates it’s own function so it doesn’t interfere with anything else.