Package statistics: Python's average package is 4x larger

The average Python package, at 2212.6 lines, is about 4.4x larger than the median Julia package at 507 lines (2212.6 / 507 = 4.4, for lines of code, as I suspected). I think that might inform discussions about e.g. documentation: since Julia packages are smaller, you tend to use and compose more of them, rather than one or a few bigger ones as in Python. I'm not complaining (just an observation); all else equal, packages should be smaller, for code, though with no minimum limit on docs.

The largest Julia package is, though, way larger at 264,115 lines: 5.5x larger than the largest Python package in those statistics (47,453 lines), and 2.2x larger than the Python code in root, at 118,793 lines (see comment below). Hecke.jl is 130,560 lines (not as much autogenerated code?).

Both statistics are old; Python's may be 8 years old (still interesting to compare to the Python of the past), so this might be way too outdated. Do we have similar statistics for Julia, like that page has for Python? And do you know of good updated statistics for Python or other languages?

[The "biggest" list doesn't ring true: TensorFlow, PyTorch etc. are missing; do the statistics predate them?]

Here are the biggest packages on PyPI:

b2gpopulate (36MB)
ajenti (35MB)
FinPy (29MB)
django-dojo (28MB)
QSTK (27MB)

The total size of packages on PyPI amounted to 4.2 GB. Average package size is 161 KB and standard deviation is 1 MB.
[…]
Minimum and maximum lines were 2 and 47,453 respectively. The number of lines averaged 2212.6 per package and the standard deviation was 8729.7.

Because Python is bloat, and you should probably use the median, and exclude test code.

This is Julia itself; you need to exclude it.

And then there are packages like GitHub - root-project/root: The official repository for ROOT: analyzing, storing and visualizing big data, scientifically,

which has Python glue code:

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Python                757       118793        82223        17006        19564


No, that was actually AWSSDK.jl; the next-largest, AWS.jl, is also huge (then Hecke.jl, then the outdated TensorFlow.jl), and I think they have by now been merged. Julia itself was already excluded, as it's not a package.

You're right, I should use the median, and the same metric for both; for Python, that's just what I found, so I pointed out the inconsistency, but I'm not sure it changes much. Yes, it's better to exclude tests (or compare them separately, or combined), but I'm not sure whether both statistics include them.

Those AWS packages are auto-generated from the API, so they're not really relevant for a comparison like this. They only demonstrate how ridiculously complicated AWS is.


To be fair, I don't know why AWS[SDK].jl is so huge; I suspect autogenerated code and/or comments.

And I'm not sure either why root is so huge (I guess HEP is just this complex). If Python is only 1.6% of it (only the 6th-largest language; C++ has the largest share at 78.5%), then is it 118,793 / 0.016 ≈ 7.4 million lines in total?! I updated my top post with that (Python-part) figure.

Right, but still, Hecke.jl is (or was) 130,560 lines (not sure why it's that large, about half the size of AWSSDK.jl), and thus larger than the Python part of root.

@ChrisRackauckas Do you know how large SciML is in total (in lines of code, or some other metric)? I suspect that if we count ecosystems, it is the largest. It would be interesting to rerun the stats from the Julia blog post (should be simple; the code for that is available). Any guess about the largest "umbrellas", besides SciML?
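In case anyone wants to rerun something along the lines of those stats, here's a minimal sketch (not the blog post's actual script). It assumes the packages are already cloned into a local directory; the `pkgs_dir` path and the choice to count only `src/*.jl` (ignoring test and docs code, per the comments above) are my assumptions:

```julia
using Statistics

# Count lines of all .jl files under a package's src/ directory,
# deliberately skipping test/ and docs/.
function srcloc(pkgdir::AbstractString)
    src = joinpath(pkgdir, "src")
    isdir(src) || return 0
    total = 0
    for (root, _, files) in walkdir(src)
        for f in files
            endswith(f, ".jl") || continue
            total += countlines(joinpath(root, f))
        end
    end
    return total
end

pkgs_dir = "clones"  # hypothetical path: one subdirectory per cloned package
locs = [srcloc(joinpath(pkgs_dir, p)) for p in readdir(pkgs_dir) if isdir(joinpath(pkgs_dir, p))]
locs = filter(x -> x > 0, locs)  # drop directories without a src/ folder

println("packages: ", length(locs))
println("mean:     ", round(mean(locs), digits = 1))
println("median:   ", median(locs))
println("max:      ", maximum(locs))
```

Raw line counts like this include comments and blank lines (unlike tokei's "Code" column), so the numbers would only be roughly comparable to the figures quoted above.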

I don’t know.

Hecke.jl does not contain auto-generated code (unless you count the CI scripts).
