Funny Benchmark with Julia (no longer) at the bottom

My last (faster) algorithm completely changes how the method works. As such, I don’t know whether it will satisfy the constraints as is.

I’m also unsure if my method is faster in their environment :frowning:

I think the other language implementations should pick up your sorting algorithm soon

If someone gives me a mapping from language names to colors, I will happily use it

You need tokei installed

Or you can use Tokei_jll
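For example (a minimal sketch; it assumes the JLL exports the usual tokei() command wrapper):

using Tokei_jll   # ships a prebuilt tokei binary, no system install needed

# run the bundled binary on the repo root; any regular tokei CLI flags work
run(`$(tokei()) .`)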

Ah man, that thing is missing Odin… anyway, I have updated the Gist

now I really want to also shrink code size …

diff --git a/julia/related.jl b/julia/related.jl
index e21c8a2..b79b866 100644
--- a/julia/related.jl
+++ b/julia/related.jl
@@ -1,9 +1,4 @@
-using JSON3
-using StructTypes
-using Dates
-using StaticArrays
-
-# warmup is done by hyperfine
+using JSON3, Dates, StaticArrays
 
 function relatedIO()
     json_string = read("../posts.json", String)
@@ -33,12 +28,9 @@ struct RelatedPost
     related::SVector{5,PostData}
 end
 
-StructTypes.StructType(::Type{PostData}) = StructTypes.Struct()
-
-function fastmaxindex!(xs::Vector, topn, maxn, maxv)
-    maxn .= 1
-    maxv .= 0
-    top = maxv[1]
+function fastmaxindex!(xs, topn, maxn::AbstractVector{T}, maxv) where T
+    maxn .= one(T)
+    maxv .= top = zero(T)
     for (i, x) in enumerate(xs)
         if x > top
             maxv[1] = x
@@ -59,13 +51,7 @@ function fastmaxindex!(xs::Vector, topn, maxn, maxv)
 end
 
 function related(posts)
-    for T in (UInt8, UInt16, UInt32, UInt64)
-        if length(posts) < typemax(T)
-            return related(T, posts)
-        end
-    end
-end
-function related(::Type{T}, posts) where {T}
+    T = UInt32
     topn = 5
     # key is every possible "tag" used in all posts
     # value is indicies of all "post"s that used this tag

Code size is typically better measured by compressed file size than by number of lines of code.

If you have a way to extract that information, that would be great.

There are folders with configuration files for Swift and Java.

You’d need a way to extract only the source code files, exclude the comments, and compress them.

Just gzip the source file. Comment filtering would be a bit annoying, though, yeah.

I can’t, because I don’t know which files are source files (i.e. I’m not going to code that up manually). I rely on Tokei to count only source files, and only lines with code.
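For what it’s worth, pulling those per-language counts out programmatically could look like this (a sketch; it assumes tokei’s usual JSON schema, one top-level key per language plus a Total entry):

using JSON3, Tokei_jll

stats = JSON3.read(read(`$(tokei()) --output json .`, String))
for (lang, s) in pairs(stats)
    lang == :Total && continue                      # skip the aggregate entry
    println(lang, ": ", s.code, " lines of code")   # code lines only, no comments/blanks
end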

The size of the source is kind of a murky metric for a piece of code, compressed or uncompressed. It is tempting to have a 2-axis chart, but either a regular 1-axis bar chart is enough, or a different 2nd axis can be chosen. For example:

  • x-axis = time/post
  • y-axis = # of posts

and see how the programs scale with the number of posts (a sketch of collecting that data follows below).
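Something like this could collect that scaling data (a rough sketch; it assumes the related function and the ../posts.json file from the benchmark script above):

using JSON3

posts = JSON3.read(read("../posts.json", String))
for n in (1_000, 2_000, 5_000, length(posts))
    subset = posts[1:n]
    related(subset)                      # warm up / trigger compilation
    t = @elapsed related(subset)
    println(n, " posts: ", 1e6 * t / n, " µs/post")
end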

Code size in that benchmark is measured as follows:

How source code size is measured

We start with the source-code markup you can see, remove comments, remove duplicate whitespace characters, and then apply minimum GZip compression. The measurement is the size in bytes of that GZip compressed source-code file.

Thanks to Brian Hurt for the idea of using size of compressed source code instead of lines of code.
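In Julia, that measurement could be approximated like so (a sketch using CodecZlib; comment stripping is language-specific and omitted here, and the compression level may not match theirs exactly):

using CodecZlib

# collapse runs of whitespace to a single character, then measure the
# size of the gzip-compressed result in bytes
squeeze(src) = replace(src, r"\s+" => " ")
gzsize(path) = length(transcode(GzipCompressor, codeunits(squeeze(read(path, String)))))

gzsize("related.jl")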

Size of compressed code isn’t that great a metric if you’re concerned about human readability and writability. If you’ve got a lot of boilerplate that you have to repeat often, it’s easy to compress it away, but you can’t not read it or not write it… it’s annoying crap in the code file.

So from a human-readability perspective you’d also want something like the compression ratio. A language with a very high compression ratio is basically a language that sucks to write or read.
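That ratio is trivial to compute next to the compressed size (a sketch; the file name is just a placeholder):

using CodecZlib

raw = read("Pet.java")   # hypothetical file
ratio = length(raw) / length(transcode(GzipCompressor, raw))
# a high ratio means lots of redundancy the reader still has to wade through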

I mean, nobody likes reading repeated boilerplate, but compressing it also seems pretty fair. It doesn’t completely ignore the boilerplate, but it reduces its cost to the code-length metric, which is also how I tend to interact with boilerplate when I’m reading it.

I glance at it, skim it, and if I see something repeated a bunch, I look at the repeated code less and less each time it’s repeated.


As a concrete example, consider this part of the diff @jling posted:

-using JSON3
-using StructTypes
-using Dates
-using StaticArrays

+using JSON3, Dates, StaticArrays

When the word using is repeated a bunch of times in a very predictable way like this, it’s very easy for my eye to just completely skip it and only look at the words to the right of the using on each line. I don’t need to figure out what is going on each time; I immediately see that all of those are using statements.

In fact, if there were a few more usings in there, I might actually find the verbose version more pleasant to read than the one-liner version.

Yes, but consider this example from the Wikipedia article on boilerplate code, and then multiply it by 25 for some code with 25 different subclasses of Pet.

public class Pet {
    private String name;
    private Person owner;

    public Pet(String name, Person owner) {
        this.name = name;
        this.owner = owner;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Person getOwner() {
        return owner;
    }

    public void setOwner(Person owner) {
        this.owner = owner;
    }
}

Okay, but is the experience of reading that really so significantly different from the experience of reading this hypothetical syntax?

public class Pet {
    private String name
    private Person owner
    Pet(String name, Person owner) = {this.name = name; this.owner = owner}
    String getName() = return name
    void setName(String name) = this.name = name
    Person getOwner() = return owner
    void setOwner(Person owner) = this.owner = owner
}

I’m not really convinced it’s such a big difference.

I am not a Java programmer, but my impression was that Java allowed inheriting interfaces but not implementations. So if you have 25 pets, say Dog, Cat, Bird, Fish, etc., then you have this boilerplate 25 times even though it does absolutely nothing different each time. Then when you gzip it, it will compress at about 50 to 1, which means that when you’re reading through this code, only 2% of it is even worth looking at, in some sense.

So it seems to me that if you want a “human complexity” metric, it should be something that takes into account that when the compression ratio is very high, reading or writing that code really sucks.

EDIT: I looked this up, and apparently I’m wrong: Java does allow some implementation inheritance. But I do think there are languages that don’t, and there you wind up with lots of heavily repeated boilerplate.

I think non-compressed metrics reward golfing too much (e.g. getting everything onto one line if the metric is per line, or using all one-character variable names if it is per character). Compressing devalues these approaches, which I think is appropriate.

(Also, the data-compression limit is the Shannon entropy, so there’s a nice information-theoretic flavor to compression-based metrics.)
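For reference: for a source emitting symbols with probabilities $p(x)$, that limit is

$$H(X) = -\sum_x p(x) \log_2 p(x)$$

bits per symbol on average, which no lossless compressor can beat.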

To be clear, what I’m actually saying is not anything directly against compression as a metric; I like using compression. It’s more that it’s insufficient by itself: the compression ratio is an additional important dimensionless quantity in the question.
