Accurate counter of source files: source lines, comments, documentation strings

Not sure if this is the correct category to post in; let me know if I should change it.

Alright, so I am looking for an accurate way to count the lines of code in a Julia .jl file. Ideally I’d like a count of actual source code lines, comments, and documentation string lines.

Does such a thing exist? So far I have not been able to find something that really can count all three…

We want this answer to make a strong selling point for our Julia packages, which, in contrast to competitors, have a really low count of actual source code lines. But because they are thoroughly documented, they contain lots of docs, so just counting the lines of the Julia files isn’t really helpful to make the point clear!

E.g. cloc counts source/comment/blank lines separately, for many languages including Julia. Not sure how it handles separate doc files and dirs.

PackageAnalyzer.jl might be of interest.

FWIW, you may check and improve the basic code below, which processes a single *.jl file and outputs the number of code, comment, docstring and blank lines, along with the total:

Basic code
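function countlines_docs(filename)
    n = i = 0
    ixd = Tuple{Int64, Int64}[]    # (first, last) line numbers of each docstring block
    open(filename) do io
        while !eof(io)
            i += 1
            # opening line: optional indentation followed by """
            if occursin(r"^\s*\"\"\"", readline(io))
                i1 = i
                n += 1
                # consume lines up to and including the closing """,
                # keeping the line counter i in sync with the file
                while !eof(io)
                    i += 1
                    n += 1
                    occursin(r"^\s*\"\"\"", readline(io)) && break
                end
                push!(ixd, (i1, i))
            end
        end
    end
    return n, ixd
end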

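function countlines_comm(filename, ixd)
    n = i = 0
    ixc = Tuple{Int64, Int64}[]    # (first, last) line numbers of each #= ... =# block
    open(filename) do io
        while !eof(io)
            i += 1
            line = readline(io)
            if occursin(r"^\s*#=", line)
                i1 = i
                n += 1
                # a block closed on its opening line spans that one line only
                if !occursin(r"=#\s*$", line)
                    while !eof(io)
                        i += 1
                        n += 1
                        occursin(r"=#\s*$", readline(io)) && break
                    end
                end
                push!(ixc, (i1, i))
            end
        end
    end
    open(filename) do io
        i = 0
        while !eof(io)
            i += 1
            # single-line comment: optional indentation, then # not followed by =
            if occursin(r"^\s*#(?!=)", readline(io)) && !any(t[1] ≤ i ≤ t[2] for t in ixd) && !any(t[1] ≤ i ≤ t[2] for t in ixc)
                n += 1
            end
        end
    end
    return n, ixc
end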

function countlines_blank(filename, ixd, ixc)
    n = 0
    open(filename) do io
        i = 0
        while !eof(io)
            i += 1
            if occursin(r"^\s*$", readline(io)) && !any(t[1]≤i≤t[2] for t in ixd) && !any(t[1]≤i≤t[2] for t in ixc)
                n += 1
            end
        end
    end
    return n
end


using PrettyTables

function countlines_juliafile(filename)
    n_tot  = countlines(filename)
    n_docs, ixd = countlines_docs(filename)
    n_comm, ixc = countlines_comm(filename, ixd)
    n_blank = countlines_blank(filename, ixd, ixc)
    n_code = n_tot - n_docs - n_comm - n_blank

    header = ["File", "#code", "#comments", "#doc", "#blanks", "total"]
    data = [basename(filename) n_code n_comm n_docs n_blank n_tot]  # same order as header
    pretty_table(data, header=header, header_crayon=crayon"blue bold", alignment=:c, formatters=ft_printf("%i",1:6))

    return n_code, n_docs, n_comm, n_blank, n_tot
end


# TEST EXAMPLE:
filename = raw"C:\Users\jrafa\.julia\config\startup.jl"
countlines_juliafile(filename)

@rafael.guerra This is fantastic and absolutely what I needed!!! (PackageAnalyzer.jl isn’t good enough for me: it doesn’t take into account docstrings. Or, maybe I have misunderstood its docs if it does…)

@rafael.guerra Do you mind if I improve on your code, put it into a DataFrames.jl analysis pipeline, and make it run on a Package, so that it gives details about all directories (src, docs, test) and all files of the package, and then also gives ratios at the end? I can publish it as a simple package and add you as a co-owner in the MIT license.
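
As a rough idea of the per-directory part, here is a minimal sketch that only sums the total lines of all `.jl` files under each top-level directory. The helper name `jl_lines_by_dir` and its `dirs` keyword are made up for illustration; a real pipeline would call the per-file counter above instead of `countlines`.

```julia
# Hypothetical helper (not from any package): total line count of all
# .jl files found under each top-level directory of a package.
function jl_lines_by_dir(pkgroot; dirs = ("src", "docs", "test"))
    totals = Dict{String, Int}()
    for d in dirs
        path = joinpath(pkgroot, d)
        isdir(path) || continue          # skip directories the package lacks
        totals[d] = 0
        for (root, _, files) in walkdir(path), f in files
            endswith(f, ".jl") || continue
            totals[d] += countlines(joinpath(root, f))
        end
    end
    return totals
end
```

Ratios such as `totals["test"] / totals["src"]` can then be computed from the returned dictionary.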

@Datseris, I’m glad I did something useful with my limited means. By all means, I would be very happy if you improve the code. Cheers.

Just came back here to say that PackageAnalyzer.jl v3 now handles docstrings correctly, so it fits my goals perfectly and reports an excellent summary of package code stats. E.g.:


julia> @time analyze(ComplexityMeasures)
  0.204804 seconds (89.49 k allocations: 16.343 MiB, 3.73% gc time)
PackageV1 ComplexityMeasures:
  * repo:
  * uuid: ab4b797d-85ee-42ba-b621-05d793b346a2
  * version: missing
  * is reachable: true
  * tree hash: bf6898c5ef0f416a90664c997f0de46b0c5dcb7f
  * Julia code in `src`: 3813 lines
  * Julia code in `ext`: 0 lines (0.0% of `test` + `src` + `ext`)
  * Julia code in `test`: 2357 lines (38.2% of `test` + `src` + `ext`)
  * documentation in `docs`: 1493 lines (28.1% of `docs` + `src` + `ext`)
  * documentation in README & docstrings: 3943 lines (50.8% of README + `src`)
  * has license(s) in file: MIT
    * filename: LICENSE
    * OSI approved: true
  * has `docs/make.jl`: true
  * has `test/runtests.jl`: true
  * has continuous integration: true
    * GitHub Actions
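
The percentages in this output are consistent with each count being divided by the sum of the groups named in parentheses. The short check below reproduces them from the printed line counts; the variable names and the `pct` helper are mine, and this is only an inference from the numbers, not PackageAnalyzer's code.

```julia
# Line counts copied from the analyze output above.
src, ext, test, docs, readme_docstrings = 3813, 0, 2357, 1493, 3943

# Percentage of `part` within `whole`, rounded to one decimal place.
pct(part, whole) = round(100 * part / whole; digits = 1)

test_pct = pct(test, test + src + ext)                        # 38.2
docs_pct = pct(docs, docs + src + ext)                        # 28.1
doc_pct  = pct(readme_docstrings, readme_docstrings + src)    # 50.8
```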

Glad it’s working! I tried a bunch of tools like cloc, tokei, etc., and none of them handled Julia docstrings correctly, so starting in v3 PackageAnalyzer has its own line-counting implementation based on JuliaSyntax so we can try to handle things correctly. (We still use tokei for other stuff like TOML files.)

If you want to see how particular lines are categorized, you can use PackageAnalyzer.LineCategories. For example, taking a look into the implementation code:

julia> using PackageAnalyzer

julia> file = joinpath(pkgdir(PackageAnalyzer), "src", "LineCategories.jl")
"/Users/eph/.julia/packages/PackageAnalyzer/ddM8Z/src/LineCategories.jl"

julia> PackageAnalyzer.LineCategories(file)
1    | Comment   | # Here, we assign a category to every line of a file, with help from JuliaSyntax
2    | Comment   | # Module to make it easier w/r/t/ import clashes
3    | Code      | module CategorizeLines
4    | Code      | export LineCategories, LineCategory, Blank, Code, Docstring, Comment, categorize_lines!
5    | Blank     |
6    | Code      | using JuliaSyntax: GreenNode, is_trivia, haschildren, is_error, children, span, SourceFile, Kind, kind, @K_str, source_line
7    | Blank     |
8    | Comment   | # Every line will have a single category. This way the total number across all categories
...

Note that I made the implementation choice to give every line exactly one category, which means we have to choose sometimes, since there can be comments on lines with code and so forth.
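
To illustrate what "exactly one category per line" means in practice, here is a small line-based sketch. It is a plain string heuristic, not PackageAnalyzer's JuliaSyntax-based implementation, and the names `LineKind` and `categorize` are invented for this example. The assumed tie-break is that a line containing code plus a trailing comment counts as code.

```julia
# Heuristic sketch only: each line receives exactly one category,
# so the per-category counts always sum to the number of lines.
@enum LineKind Blank Code Comment Docstring

function categorize(lines)
    kinds = LineKind[]
    in_doc = false      # currently inside a """ ... """ block
    in_block = false    # currently inside a #= ... =# block
    for line in lines
        s = strip(line)
        if in_doc
            push!(kinds, Docstring)
            endswith(s, "\"\"\"") && (in_doc = false)
        elseif in_block
            push!(kinds, Comment)
            endswith(s, "=#") && (in_block = false)
        elseif isempty(s)
            push!(kinds, Blank)
        elseif startswith(s, "\"\"\"")
            push!(kinds, Docstring)
            # stay inside the docstring unless it also closes on this line
            in_doc = !(length(s) >= 6 && endswith(s, "\"\"\""))
        elseif startswith(s, "#=")
            push!(kinds, Comment)
            in_block = !endswith(s, "=#")
        elseif startswith(s, "#")
            push!(kinds, Comment)
        else
            push!(kinds, Code)   # assumed tie-break: code wins over a trailing comment
        end
    end
    return kinds
end

lines = ["# a comment", "x = 1  # trailing comment", "", "\"\"\"", "    f()", "\"\"\"", "f() = x"]
kinds = categorize(lines)
n_code = count(==(Code), kinds)
```

Because no line is counted twice, `length(kinds) == length(lines)` holds by construction, which is the property the single-category choice buys.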