# Count of JOSS reviewers by language and their sectors

I was curious about which languages programmer-researchers in academia are using, so I went to the public list of reviewers of the Journal of Open Source Software (JOSS), available here, and grabbed the data.

Here are the results. Please note that although JOSS publishes any open source software, whether or not the underlying language is itself open source, I strongly suspect its activity is biased toward languages that are themselves open source.

*** The 20 most "best known" languages...
- python         ( 68.74 %)
- r              ( 27.52 %)
- c++            ( 18.85 %)
- c              ( 13.91 %)
- matlab         ( 8.3 %)
- java           ( 7.26 %)
- fortran        ( 5.76 %)
- javascript     ( 4.79 %)
- julia          ( 4.71 %)
- bash           ( 3.07 %)
- go             ( 2.02 %)
- perl           ( 1.65 %)
- c#             ( 1.57 %)
- rust           ( 1.5 %)
- php            ( 1.5 %)
- ruby           ( 1.27 %)
- sql            ( 1.12 %)
- scala          ( 0.9 %)
- haskell        ( 0.82 %)
- cuda           ( 0.75 %)
*** The 20 most "known" languages...
- python         ( 79.43 %)
- r              ( 33.88 %)
- c++            ( 31.41 %)
- c              ( 27.3 %)
- matlab         ( 17.88 %)
- java           ( 16.45 %)
- javascript     ( 12.86 %)
- fortran        ( 10.62 %)
- julia          ( 8.45 %)
- bash           ( 6.36 %)
- perl           ( 4.49 %)
- php            ( 3.89 %)
- c#             ( 3.66 %)
- go             ( 3.14 %)
- rust           ( 2.99 %)
- ruby           ( 2.84 %)
- sql            ( 2.24 %)
- scala          ( 2.09 %)
- html           ( 1.72 %)
- haskell        ( 1.5 %)
*** The 4 most common sectors for the 10 most "known" languages...
python      :   machine learning, bioinformatics, physics, statistics, 
r           :   bioinformatics, machine learning, statistics, genomics, 
c++         :   machine learning, bioinformatics, physics, statistics, 
c           :   machine learning, bioinformatics, astrophysics, statistics, 
matlab      :   machine learning, image processing, statistics, physics, 
java        :   machine learning, bioinformatics, software engineering, data science, 
javascript  :   machine learning, bioinformatics, data science, statistics, 
fortran     :   physics, astrophysics, computational fluid dynamics, computational chemistry, 
julia       :   machine learning, statistics, physics, data science, 
bash        :   bioinformatics, genomics, machine learning, computational biology,
Source code
# Source: reviewer database of JOSS at https://docs.google.com/spreadsheets/d/1PAPRJ63yq9aPC1COLjaQp8mHmEq3rZUzwUYxTulyu78/edit#gid=856801822

using OdsIO

# Loading data..
dataFile = "joss_reviewers_20200724.ods"
db = ods_read(dataFile,range=((4,2),(1340,9)))

# removing email
db = hcat(db[:,1:2],db[:,5:end])

# replacing "nothing"....
# ..with empty string in the first three columns...
for r in eachrow(db)
    for cidx in 1:3
        r[cidx] = isnothing(r[cidx]) ? "" : r[cidx]
    end
end
# ..and with zero in the number of reviews...
for r in eachrow(db)
    for cidx in 4:6
        r[cidx] = isnothing(r[cidx]) ? 0 : r[cidx]
    end
end

# Converting the first 3 columns to strings and the last 3 (review counts) to integers
db = convert(Array{Union{String,Int64},2},db)

# Cleaning..
for r in eachrow(db)
    for cidx in 1:3
        # ugly...
        r[cidx] = replace(replace(replace(replace(replace(r[cidx], '/'=>','), '('=>','), ')'=> ','), '\n'=> ',') , "and"=> ',') |> strip |> lowercase
        r[cidx] = replace(r[cidx],", " => ',') # to avoid empty data
        r[cidx] = replace(r[cidx]," ," => ',') # to avoid empty data
        r[cidx] = replace(r[cidx], r",$" => "") # remove ending comma

    end
end

# Establishing vocabularies
vocLangs = Set{String}()
vocActivities = Set{String}()
for (ridx,r) in enumerate(eachrow(db))
    ##if ridx > 20 break end
    for cidx in 1:2
        #=
        debug = strip.(split(r[cidx],','))
        for l in debug
            if l == ""
                println(l)
                println(ridx)
                println(cidx)
            end
        end
        =#
      if r[cidx] == "" continue end
      push!(vocLangs,strip.(split(r[cidx],','))...)
    end
    for cidx in 3:3
      if r[cidx] == "" continue end
      push!(vocActivities,strip.(split(r[cidx],','))...)
    end
end
vocLangs      = collect(vocLangs)
vocActivities = collect(vocActivities)
langIdx       = Dict{String,Int64}()
[langIdx[l]   = id for (id,l) in enumerate(vocLangs)]
actIdx        = Dict{String,Int64}()
[actIdx[a]    = id for (id,a) in enumerate(vocActivities)]

nLangs             = length(vocLangs)
nActs              = length(vocActivities)
nRecords           = size(db,1)
preferredLangCount = zeros(Int64,nLangs)
competentLangCount = zeros(Int64,nLangs)
actCountByLang     = zeros(Int64,nLangs,nActs)

# Let's count!
for r in eachrow(db)
    plangs = strip.(split(r[1],','))
    olangs = strip.(split(r[2],','))
    langs  = union(Set(plangs),Set(olangs))
    acts   = strip.(split(r[3],','))
    [preferredLangCount[langIdx[l]]       += 1 for l in plangs if l != ""]
    [competentLangCount[langIdx[l]]       += 1 for l in langs if l != ""]
    [actCountByLang[langIdx[l],actIdx[a]] += 1 for l in langs, a in acts if l != "" && a != ""]
end

# Let's report:
n = 20
println("*** The $n most \"best known\" languages...")
sortIdx = reverse(sortperm(preferredLangCount))[1:n]
[println("- $(rpad(vocLangs[i],12))\t ( $(round(100*preferredLangCount[i]/nRecords,digits=2)) %)") for i in sortIdx]
n = 20
println("*** The $n most \"known\" languages...")
sortIdx = reverse(sortperm(competentLangCount))[1:n]
[println("- $(rpad(vocLangs[i],12))\t ( $(round(100*competentLangCount[i]/nRecords,digits=2)) %)") for i in sortIdx]

n = 10
n2 = 4
println("*** The $n2 most common sectors for the $n most \"known\" languages...")
sortIdx = reverse(sortperm(competentLangCount))[1:n]
for i in sortIdx
    lang = vocLangs[i]
    sortIdxActs = reverse(sortperm(actCountByLang[i,:]))[1:n2]
    print("$(rpad(lang,12)): \t")
    [print("$(vocActivities[j]), ") for j in sortIdxActs]
    print("\n")
end
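
The nested `replace` calls in the cleaning step can be collapsed into one regular expression. What follows is only a sketch of an alternative, not what the script above does: `tokens` is a hypothetical helper, and the `\band\b` pattern is my assumption for matching "and" only as a whole word, so that a name containing those letters is not split.

```julia
# Hypothetical helper, not part of the script above:
# split one raw answer into clean, lowercase, non-empty tokens.
function tokens(raw::AbstractString)
    # Turn separators (slash, parentheses, newline, the word "and") into commas.
    s = replace(lowercase(raw), r"[/()\n]|\band\b" => ",")
    # Split on commas, trim whitespace, and drop the empty pieces.
    return filter(!isempty, String.(strip.(split(s, ','))))
end

println(tokens("Python / R (and pandas)"))  # ["python", "r", "pandas"]
```

Dropping the empty pieces at this point would also make the `l != ""` checks in the counting loop unnecessary.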

Interesting, thanks.

How does your code handle “ties” where people list more than one language or work area? Do you just count it as 1 “vote” for each? Not obvious from a quick read-through.

For the first list I count the number of reviewers who have lang x as their primary choice (or, in a very few cases, as one of their primary choices).

For the second list I count the number of reviewers that have lang x in either the primary or the “other languages” they can review.

In both cases it doesn’t matter whether they list 1 or 10 langs: if they have x, I count them once for x, and then I compute the share of reviewers that have lang x out of the total number of reviewers.

The percentages of the different langs don’t need to sum to 100.
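
To make the counting rule concrete, here is a minimal sketch on made-up data. The four reviewers and the `share` helper are hypothetical, and the `unique` call is my assumption for guaranteeing that a reviewer is counted at most once per language.

```julia
# Toy data (hypothetical): each reviewer lists primary and other languages.
reviewers = [
    (primary = ["python"],      other = ["c", "r"]),
    (primary = ["r", "python"], other = String[]),
    (primary = ["fortran"],     other = ["python"]),
    (primary = ["julia"],       other = ["c"]),
]

# Count each reviewer at most once per language,
# then divide by the total number of reviewers.
function share(reviewers, getlangs)
    counts = Dict{String,Int}()
    for r in reviewers
        for l in unique(getlangs(r))
            counts[l] = get(counts, l, 0) + 1
        end
    end
    return Dict(l => 100 * c / length(reviewers) for (l, c) in counts)
end

bestKnown = share(reviewers, r -> r.primary)                 # first list
known     = share(reviewers, r -> vcat(r.primary, r.other))  # second list

println(bestKnown["python"])     # 50.0
println(known["python"])         # 75.0
println(sum(values(bestKnown)))  # 125.0
```

The last line shows why the percentages can exceed 100 in total: the second reviewer has two primary languages and is counted for both.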

Note that it remains just an approximation. For example, one reviewer answered “all except Fortran”… and he will be counted for Fortran :expressionless:

One can either manually clean the db or use more advanced tools from natural language processing… this is just a quick and dirty script…
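
A cheap middle ground between the two, sketched below, is to flag answers that look like exclusions (such as the “all except Fortran” case) so that only those are checked by hand. The word list and the `needsReview` helper are assumptions for illustration; they are not derived from the actual data.

```julia
# Hypothetical list of phrases suggesting a language is being excluded, not listed.
const EXCLUSION_WORDS = ["except", "but not", "excluding", "apart from"]

# Return true when a raw (uncleaned) answer should be checked by hand.
needsReview(answer::AbstractString) =
    any(w -> occursin(w, lowercase(answer)), EXCLUSION_WORDS)

answers = ["Python, R", "all except Fortran", "C/C++ (but not C#)"]
flagged = filter(needsReview, answers)
println(flagged)  # ["all except Fortran", "C/C++ (but not C#)"]
```

This has to run on the raw answers, before the cleaning step replaces parentheses and other separators with commas.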