Count of JOSS reviewers by language and their sectors

sylvaticus · July 24, 2020, 12:35pm

I was curious about which languages programmer-researchers in academia are using, so I went to the public list of reviewers of the Journal of Open Source Software (JOSS), available here, and grabbed the data.

Here are the results. Please note that although JOSS publishes any open source software, whether or not the underlying language is itself open source, I strongly suspect its activity is biased toward languages that are themselves open source.

*** The 20 most "best known" languages...
- python         ( 68.74 %)
- r              ( 27.52 %)
- c++            ( 18.85 %)
- c              ( 13.91 %)
- matlab         ( 8.3 %)
- java           ( 7.26 %)
- fortran        ( 5.76 %)
- javascript     ( 4.79 %)
- julia          ( 4.71 %)
- bash           ( 3.07 %)
- go             ( 2.02 %)
- perl           ( 1.65 %)
- c#             ( 1.57 %)
- rust           ( 1.5 %)
- php            ( 1.5 %)
- ruby           ( 1.27 %)
- sql            ( 1.12 %)
- scala          ( 0.9 %)
- haskell        ( 0.82 %)
- cuda           ( 0.75 %)
*** The 20 most "known" languages...
- python         ( 79.43 %)
- r              ( 33.88 %)
- c++            ( 31.41 %)
- c              ( 27.3 %)
- matlab         ( 17.88 %)
- java           ( 16.45 %)
- javascript     ( 12.86 %)
- fortran        ( 10.62 %)
- julia          ( 8.45 %)
- bash           ( 6.36 %)
- perl           ( 4.49 %)
- php            ( 3.89 %)
- c#             ( 3.66 %)
- go             ( 3.14 %)
- rust           ( 2.99 %)
- ruby           ( 2.84 %)
- sql            ( 2.24 %)
- scala          ( 2.09 %)
- html           ( 1.72 %)
- haskell        ( 1.5 %)
*** The 4 most common sectors for the 10 most "known" languages...
python      :   machine learning, bioinformatics, physics, statistics, 
r           :   bioinformatics, machine learning, statistics, genomics, 
c++         :   machine learning, bioinformatics, physics, statistics, 
c           :   machine learning, bioinformatics, astrophysics, statistics, 
matlab      :   machine learning, image processing, statistics, physics, 
java        :   machine learning, bioinformatics, software engineering, data science, 
javascript  :   machine learning, bioinformatics, data science, statistics, 
fortran     :   physics, astrophysics, computational fluid dynamics, computational chemistry, 
julia       :   machine learning, statistics, physics, data science, 
bash        :   bioinformatics, genomics, machine learning, computational biology,

Source code

# Source: reviewer database of JOSS at https://docs.google.com/spreadsheets/d/1PAPRJ63yq9aPC1COLjaQp8mHmEq3rZUzwUYxTulyu78/edit#gid=856801822

using OdsIO

# Loading data..
dataFile = "joss_reviewers_20200724.ods"
db = ods_read(dataFile,range=((4,2),(1340,9)))

# removing email
db = hcat(db[:,1:2],db[:,5:end])

# replacing "nothing"....
# ..with empty string in the first three columns...
for r in eachrow(db)
    for cidx in 1:3
        r[cidx] = isnothing(r[cidx]) ? "" : r[cidx]
    end
end
# ..and with zero in the number of reviews...
for r in eachrow(db)
    for cidx in 4:6
        r[cidx] = isnothing(r[cidx]) ? 0 : r[cidx]
    end
end

# Converting the first 3 columns to strings and the last 3 to integers
db = convert(Array{Union{String,Int64},2},db)

# Cleaning..
for r in eachrow(db)
    for cidx in 1:3
        # Turn separators ('/', parentheses, newlines, "and") into commas.
        # Note: "and" is replaced wherever it occurs, even inside a longer word.
        s = r[cidx]
        for (old, new) in ('/' => ',', '(' => ',', ')' => ',', '\n' => ',', "and" => ',')
            s = replace(s, old => new)
        end
        s = lowercase(strip(s))
        s = replace(s, ", " => ',') # to avoid empty data
        s = replace(s, " ," => ',') # to avoid empty data
        r[cidx] = replace(s, r",$" => "") # remove ending comma
    end
end

# Establishing vocabularies
vocLangs = Set{String}()
vocActivities = Set{String}()
for (ridx,r) in enumerate(eachrow(db))
    ##if ridx > 20 break end
    for cidx in 1:2
        #=
        debug = strip.(split(r[cidx],','))
        for l in debug
            if l == ""
                println(l)
                println(ridx)
                println(cidx)
            end
        end
        =#
        if r[cidx] == "" continue end
        push!(vocLangs,strip.(split(r[cidx],','))...)
    end
    for cidx in 3:3
        if r[cidx] == "" continue end
        push!(vocActivities,strip.(split(r[cidx],','))...)
    end
end
vocLangs      = collect(vocLangs)
vocActivities = collect(vocActivities)
langIdx       = Dict{String,Int64}(l => id for (id,l) in enumerate(vocLangs))
actIdx        = Dict{String,Int64}(a => id for (id,a) in enumerate(vocActivities))

nLangs             = length(vocLangs)
nActs              = length(vocActivities)
nRecords           = size(db,1)
preferredLangCount = zeros(Int64,nLangs)
competentLangCount = zeros(Int64,nLangs)
actCountByLang     = zeros(Int64,nLangs,nActs)

# Let's count!
for r in eachrow(db)
    plangs = strip.(split(r[1],','))
    olangs = strip.(split(r[2],','))
    langs  = union(Set(plangs),Set(olangs))
    acts   = strip.(split(r[3],','))
    [preferredLangCount[langIdx[l]]       += 1 for l in plangs if l != ""]
    [competentLangCount[langIdx[l]]       += 1 for l in langs if l != ""]
    [actCountByLang[langIdx[l],actIdx[a]] += 1 for l in langs, a in acts if l != "" && a != ""]
end

# Let's report:
n = 20
println("*** The $n most \"best known\" languages...")
sortIdx = reverse(sortperm(preferredLangCount))[1:n]
[println("- $(rpad(vocLangs[i],12))\t ( $(round(100*preferredLangCount[i]/nRecords,digits=2)) %)") for i in sortIdx]
n = 20
println("*** The $n most \"known\" languages...")
sortIdx = reverse(sortperm(competentLangCount))[1:n]
[println("- $(rpad(vocLangs[i],12))\t ( $(round(100*competentLangCount[i]/nRecords,digits=2)) %)") for i in sortIdx]

n = 10
n2 = 4
println("*** The $n2 most common sectors for the $n most \"known\" languages...")
sortIdx = reverse(sortperm(competentLangCount))[1:n]
for i in sortIdx
    lang = vocLangs[i]
    sortIdxActs = reverse(sortperm(actCountByLang[i,:]))[1:n2]
    print("$(rpad(lang,12)): \t")
    [print("$(vocActivities[j]), ") for j in sortIdxActs]
    print("\n")
end
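
To see what the cleaning step does to a single cell, here is a minimal sketch that applies the same replacements to made-up sample strings. The helper name `clean_cell` and the sample inputs are assumptions for illustration; they are not part of the script or of the JOSS data.

```julia
# Hypothetical helper: applies the same replacements as the cleaning loop
# to one cell and returns the comma-separated tokens.
function clean_cell(s::AbstractString)
    for (old, new) in ('/' => ',', '(' => ',', ')' => ',', '\n' => ',', "and" => ',')
        s = replace(s, old => new)
    end
    s = lowercase(strip(s))
    s = replace(s, ", " => ',')
    s = replace(s, " ," => ',')
    s = replace(s, r",$" => "")
    return [String(strip(t)) for t in split(s, ',')]
end

# Made-up cell mixing several separators.
tokens = clean_cell("Python/R (C++)\nJulia and Fortran")
println(tokens)

# Made-up cell where "and" occurs inside a word.
println(clean_cell("Standard ML"))
```

The first call leaves an empty token between "c++" and "julia", which is why the counting loop skips entries equal to `""`. The second call shows a limitation of replacing "and" everywhere: a name that contains those letters gets split in two.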

tbeason · July 24, 2020, 2:20pm

Interesting, thanks.

How does your code handle “ties” where people list more than one language or work area? Do you just count it as 1 “vote” for each? Not obvious from a quick read-through.

sylvaticus · July 24, 2020, 4:32pm

For the first list I count the number of reviewers who have lang x as their primary choice (or x as one of their primary choices, in a very few cases).

For the second list I count the number of reviewers that have lang x in either the primary or the “other languages” they can review.

In both cases it doesn’t matter whether they list 1 or 10 langs: if they have x I count them for x, and then I compute the share of reviewers who have lang x out of the total number of reviewers.

The percentages of the different langs don’t need to sum to 100.
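
The counting rule described above can be sketched on toy data. The four reviewers below are invented, and the names `share`, `best_known` and `known` are hypothetical; this only illustrates the rule and is not the script itself.

```julia
# Toy data (made up): each reviewer lists preferred and other languages.
reviewers = [
    (preferred = ["python"],     other = ["c", "r"]),
    (preferred = ["r"],          other = ["python"]),
    (preferred = ["julia", "c"], other = String[]),
    (preferred = ["python"],     other = String[]),
]

# Share (in %) of reviewers whose list, as returned by `getlangs`, contains `lang`.
# A reviewer counts once for each language listed, however many are listed.
function share(lang, reviewers, getlangs)
    hits = count(r -> lang in getlangs(r), reviewers)
    return round(100 * hits / length(reviewers), digits = 2)
end

best_known(r) = r.preferred                  # first list: primary choices only
known(r)      = union(r.preferred, r.other)  # second list: primary or other

langs = ["python", "r", "c", "julia"]
for lang in langs
    println(rpad(lang, 8), share(lang, reviewers, best_known), " %   ",
            share(lang, reviewers, known), " %")
end

total_best = sum(share(l, reviewers, best_known) for l in langs)
println("sum of \"best known\" shares: ", total_best, " %")
```

Because the third toy reviewer lists two primary languages, the "best known" shares already add up to more than 100.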

Note that this remains just an approximation. For example, there is one guy who said “all except Fortran”… and he will be counted for Fortran.

One can either manually clean the db or use more advanced tools from natural language processing… this is just a quick and dirty script…
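
A cheap middle ground between manual cleaning and full NLP would be to flag free-text cells for manual review instead of counting them. Here is a minimal sketch under stated assumptions: the helper `tokens_or_flag`, its word list and the sample cells are all made up, and the exact text of the real entry is not shown in this thread.

```julia
# Hypothetical helper: returns the language tokens of a cleaned cell, or
# `nothing` when the cell contains a word suggesting free text rather than
# a plain list, so that it can be checked by hand.
function tokens_or_flag(cell::AbstractString; flagwords = ("except", "all", "any"))
    words = split(lowercase(cell), r"[\s,]+", keepempty = false)
    any(w -> w in flagwords, words) && return nothing
    return [String(strip(t)) for t in split(cell, ',') if !isempty(strip(t))]
end

cells = ["python,r", "all except fortran", "c++,fortran"]  # made-up cells
for c in cells
    t = tokens_or_flag(c)
    println(repr(c), " => ", t === nothing ? "needs manual review" : t)
end
```

This only catches the wordings in `flagwords`; anything phrased differently would still slip through.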
