Strange performance characteristics and regression with tuple recursion

Some hints here:
The regression in the M = 2 case is caused by missing vectorization.
In Julia 1.5.2, @code_llvm f(iter) gives a vectorized loop body:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i65 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <4 x i64> [ <i64 -1, i64 0, i64 1, i64 2>, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %vec.ind36 = phi <4 x i65> [ <i65 0, i65 1, i65 2, i65 3>, %vector.ph ], [ %vec.ind.next39, %vector.body ]
  %vec.ind40 = phi <4 x i64> [ <i64 1, i64 3, i64 5, i64 7>, %vector.ph ], [ %vec.ind.next43, %vector.body ]
  %vec.phi = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %18, %vector.body ]
  %vec.phi44 = phi <4 x i64> [ zeroinitializer, %vector.ph ], [ %19, %vector.body ]
  %step.add = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
  %step.add37 = add <4 x i65> %vec.ind36, <i65 4, i65 4, i65 4, i65 4>
  %step.add41 = add <4 x i64> %vec.ind40, <i64 8, i64 8, i64 8, i64 8>
  %8 = zext <4 x i64> %vec.ind to <4 x i65>
  %9 = zext <4 x i64> %step.add to <4 x i65>
  %10 = mul <4 x i65> %vec.ind36, %8
  %11 = mul <4 x i65> %step.add37, %9
  %12 = lshr <4 x i65> %10, <i65 1, i65 1, i65 1, i65 1>
  %13 = lshr <4 x i65> %11, <i65 1, i65 1, i65 1, i65 1>
  %14 = trunc <4 x i65> %12 to <4 x i64>
  %15 = trunc <4 x i65> %13 to <4 x i64>
  %16 = add <4 x i64> %vec.phi, %vec.ind40
  %17 = add <4 x i64> %vec.phi44, %step.add41
  %18 = add <4 x i64> %16, %14
  %19 = add <4 x i64> %17, %15
  %index.next = add i65 %index, 8
  %vec.ind.next = add <4 x i64> %step.add, <i64 4, i64 4, i64 4, i64 4>
  %vec.ind.next39 = add <4 x i65> %step.add37, <i65 4, i65 4, i65 4, i65 4>
  %vec.ind.next43 = add <4 x i64> %step.add41, <i64 8, i64 8, i64 8, i64 8>
  %20 = icmp eq i65 %index.next, %n.vec
  br i1 %20, label %middle.block, label %vector.body

In Julia 1.6.2, the same loop stays scalar (there is no vector.body block and every operation is on plain i64):

julia> @code_llvm debuginfo=:none f(iter)
define i64 @julia_f_2599([1 x i64]* nocapture nonnull readonly align 8 dereferenceable(8) %0) {
top:
  %1 = getelementptr inbounds [1 x i64], [1 x i64]* %0, i64 0, i64 0
  %2 = load i64, i64* %1, align 8
  br label %L2

L2:                                               ; preds = %L2, %top
  %value_phi = phi i64 [ 1, %top ], [ %value_phi8, %L2 ]
  %value_phi2 = phi i64 [ 1, %top ], [ %value_phi10, %L2 ]
  %value_phi6 = phi i64 [ 0, %top ], [ %3, %L2 ]
  %3 = add i64 %value_phi6, %value_phi
  %.not = icmp slt i64 %value_phi, %value_phi2
  %4 = add i64 %value_phi, 1
  %5 = icmp slt i64 %value_phi2, %2
  %value_phi7 = or i1 %.not, %5
  %value_phi8 = select i1 %.not, i64 %4, i64 1
  %not..not = xor i1 %.not, true
  %6 = zext i1 %not..not to i64
  %value_phi10 = add i64 %value_phi2, %6
  br i1 %value_phi7, label %L2, label %L38

L38:                                              ; preds = %L2
  ret i64 %3
}

For M = 4, on the other hand, the vectorization happens… so there is no regression.
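
If you want to compare versions or values of M without reading the IR by eye, a rough check is to search the optimized IR for LLVM vector types such as `<4 x i64>`. Below is a minimal sketch of that idea. The helper names (`ir_has_vectors`, `llvm_ir`, `is_vectorized`) and the `sum_arr` function are hypothetical and only for illustration; `sum_arr` is a stand-in, not the `f` from this thread. The pattern match is a heuristic, so treat a `true` as "some vector instructions were emitted", not as proof that the hot loop was vectorized.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Heuristic: true if the IR text mentions an LLVM vector type like `<4 x i64>`.
# Array types such as `[1 x i64]` use square brackets and do not match.
ir_has_vectors(ir::AbstractString) = occursin(r"<\d+ x [^>]+>", ir)

# Optimized LLVM IR of `f` for the given argument types, captured as a string.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo=:none))

# Combine the two: does the compiled method contain any vector operations?
is_vectorized(f, types) = ir_has_vectors(llvm_ir(f, types))

# Hypothetical stand-in function (an assumption, not the `f` discussed above).
function sum_arr(xs::Vector{Int})
    s = 0
    @inbounds for x in xs
        s += x
    end
    return s
end

# The result depends on the Julia/LLVM version and the target CPU.
println(is_vectorized(sum_arr, (Vector{Int},)))
```

For the case in this thread, the call would be `is_vectorized(f, (typeof(iter),))`, run once per Julia version and once per value of M.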