Question about using SnoopPrecompile

Thank you for this excellent work! As a less savvy developer, I have an elementary question that might come up for others.

The current SnoopCompile documentation models a workflow in which the output of SnoopCompile.write is inserted into the package source, and its description makes no reference to the SnoopPrecompile macros. Should package developers now move that content into a @precompile_all_calls block? Is/should this be explained in the documentation?

You should be safe simply deleting all the precompile statements generated by SnoopCompile. SnoopPrecompile is all you need (and it will generate all the necessary precompile statements that you used to add manually or with SnoopCompile).
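For concreteness, here is a minimal sketch of the migration (the module name and its fit function are made up for illustration): the generated statements you used to paste into your source become an ordinary workload.

module MyPkg   # hypothetical package

using SnoopPrecompile

fit(X, y) = X \ y   # stand-in for the package's real functionality

# Before: pasted output of SnoopCompile.write, e.g.
# precompile(Tuple{typeof(fit), Matrix{Float64}, Vector{Float64}})

# After: just run representative code; the equivalent precompile
# statements are generated and cached automatically.
@precompile_all_calls begin
    fit(rand(10, 2), rand(10))
end

end    # MyPkg module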

I see. So the workload I used to run to generate the precompile statements now moves into a @precompile_all_calls block? I think it would help to update the pedagogic documentation to demonstrate this workflow.

And is it the case that any dependencies I need to generate this workload must now become dependencies of my package? For example, since my package (WildBootTests.jl) runs statistical tests, the workload I have run to generate precompile statements via SnoopCompileCore uses packages such as DataFrames and StatsModels. But these packages are not otherwise dependencies of my package; I had left them out of Project.toml to limit the dependency footprint. Must I change that now if I want to use the same workload in conjunction with SnoopPrecompile?

I am a bit confused by the example you are giving. Say you are creating precompile statements for package A, which you are developing. You say you need package B in order to prepare realistic precompile statements for A, but package B is not a dependency of package A. That seems to me to imply that the precompile statements do not actually involve package B and that package B is not actually necessary for generating them. You probably can get away with a much simpler piece of code to generate the precompile statements.

I am probably missing some particularities and difficulties with your setup and how your package interacts with other packages, but nonetheless, I suspect figuring out how to write “typical use case” code for your package without involving DataFrames would be a useful exercise in organizing your code.

Either way, I would suggest focusing on writing “typical use case” code, not on writing precompile statements directly. It is much too easy to write awkward or useless precompile statements that break when something else in your package changes, while simply writing “typical use case” code (with or without SnoopPrecompile) is much more robust.

Concerning updating the documentation: I am certain Tim would appreciate someone sending a pull request with these updates (core devs like him are quite oversubscribed with tasks).

@Krastanov, you sensibly encourage the construction of typical use cases. For my package, one typical use case is: load a CSV file and convert it to a DataFrame using CSV.jl and DataFrames.jl; formulate and fit a linear model to certain variables in the DataFrame using StatsModels.jl; then extract the data used in the model fit (limited to the sample of complete observations in that fit); and pass it, along with a few other objects, all of core types such as Matrix{Float64} and Bool, to the main function exported by my package. This typical use case depends on three packages that the package itself does not.
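Concretely, the workload looks roughly like this (the file name, formula, and column names are invented; the point is only that CSV, DataFrames, and StatsModels all appear in it while WildBootTests itself depends on none of them):

using CSV, DataFrames, StatsModels, WildBootTests

df = CSV.read("tests.csv", DataFrame)        # hypothetical data file
mf = ModelFrame(@formula(y ~ x1 + x2), df)   # keeps only complete observations
resp     = response(mf)                      # response vector
predexog = ModelMatrix(mf).m                 # design matrix, a Matrix{Float64}
R, r = [0 1 0], [0]                          # test the coefficient on x1
# assumes df has no missings in g, so lengths agree with resp
wildboottest(R, r; resp, predexog, clustid=Vector{Int}(df.g))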

If using SnoopPrecompile with use cases that are typical in this sense requires expanding the package’s dependencies, then I merely want to assure myself that I understand that correctly, and to highlight the confusion of a less experienced user, which may be representative of others. I don’t think my grasp is good enough to suggest edits to the docs…or at least it isn’t until I have confirmed my understanding of such points.


I am not sure we use the expression “typical use case” the same way. It is important to distill the meaning of “typical” so that it really covers only your code (well… it is not important, but it does simplify our work). Looking at the README for your library, is the following not enough for a self-contained “typical use case” that covers everything done by your library:

using WildBootTests
N = 50
resp = rand(N)                                 # fake response
predexog = [ones(N);; rand(N)]                 # intercept plus one fake regressor
clustid = [ones(Int, N÷2); ones(Int, N÷2)*2]   # two clusters
R = [0 1]
r = [1]
wildboottest(R, r; resp, predexog, clustid);

It seems you also often use a special array type. A modified version of the code above that converts the arrays to your special array type should be sufficient.

My logic above was: just make some small fake dataset and run the algorithm on it. It does not matter if the results are garbage as long as the compiler gets to trace the execution.

P.S. I cannot speak for everyone, but I am actually happy when someone submits a pull request even if they are not sure exactly how to fix the issue they are facing. A “junk” pull request is still an opportunity to make things better and maybe to gain a helper. I suspect this is a common attitude.

Yes and no. The algorithm will branch and call different functions depending on characteristics of the supplied data and the combination of options invoked. A variety of more realistic examples seems the most promising and expedient way to get good code coverage. Of course, the examples could all be reengineered to avoid those dependencies, in the style you demonstrate, but with a bit more realism than just 0s and 1s. For the moment, I’m simply trying to understand the tradeoff: is it the case that any dependencies I need to generate this workload must now become dependencies of my package?

I think this question is difficult to answer. From the point of view of the compiler, at least for now, there is no value in having extra dependencies. However, there is value in making sure you trace through all the typical branches of your code, and maybe this happens to be easier when you rely on these extra dependencies. I personally would decide what to do based on benchmarks (@time @eval using WildBootTests; @time typical_task()): maybe the lousy fake data from above is actually sufficient and you do not need a more complicated setup.
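For example, reusing the fake-data snippet from above in a fresh session, and comparing the numbers with and without the extra dependencies in the workload:

@time @eval using WildBootTests          # load time
N = 50
resp, predexog = rand(N), [ones(N);; rand(N)]
clustid = [ones(Int, N÷2); fill(2, N÷2)]
@time wildboottest([0 1], [1]; resp, predexog, clustid)   # first call: compilation dominates
@time wildboottest([0 1], [1]; resp, predexog, clustid)   # second call: runtime only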


One opportunity that most people are probably still not exploiting: with Julia 1.8, the precompile work does not have to be in the package that “owns” the code. That means each user can construct a personal “Startup.jl” package and say using Startup at the beginning of a new session. For example, if you like to use a combination of CSV, DataFrames, and StatsModels and want certain things to be fast, you can Pkg.generate("Startup") in your dev folder, and then the Startup package might look like this:

module Startup

using CSV, DataFrames, StatsModels     # all these should be in the Project.toml for Startup
using SnoopPrecompile

@precompile_setup begin
    data = ...
    @precompile_all_calls begin
        call_some_code(data, ...)
    end
end

end    # Startup module

Then the code for those packages you want to be fast will be cached in a *.ji file for Startup.jl.
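For instance, a filled-in version of that skeleton for this particular combination might look like the following (the tiny in-memory CSV and the formula are invented purely to exercise the packages):

module Startup

using CSV, DataFrames, StatsModels
using SnoopPrecompile

@precompile_setup begin
    # Setup code runs at precompile time but is not itself traced;
    # only calls inside @precompile_all_calls end up in the cache.
    io = IOBuffer("y,x\n1.0,2.0\n3.0,4.0\n")
    @precompile_all_calls begin
        df = CSV.read(io, DataFrame)
        mf = ModelFrame(@formula(y ~ x), df)
        X  = ModelMatrix(mf).m
    end
end

end    # Startup module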


It is @precompile_all_calls, not @precomile_all_calls.

Thanks, edited to reduce confusion.

Sounds great! Is this independent of its location in the dev folder, and could this be a package located in a subfolder of a project and dev’d via URL?

Yes, you can put it anywhere you want as long as it’s on your path (and/or it can easily be made to be on your path).
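For example, from wherever the folder lives, using the standard Pkg API (the path here is whatever you chose):

using Pkg
Pkg.develop(path="path/to/Startup")   # make the local package visible to the active environment
using Startup                         # precompiles on first use, then loads the cache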

Cool. A related question:

If I activate a package environment, say of package MyPack and run code with using MyPack, does this trigger the precompilation of MyPack and all the code called therein?

I’m not sure I understand.

run code with using MyPack

Assuming MyPack is already precompiled, the only code this runs is in the __init__ function of MyPack. Everything else is just loaded from disk.

does this trigger the precompilation of MyPack and all the code called therein

If it wasn’t already precompiled, then yes. That compilation happens in a separate process, and then a “snapshot” of the created module(s) gets stored to disk. In your process (the one you executed using MyPack in), all that happens is that this snapshot gets loaded from disk. None of the .jl files in MyPack/src “run” in your process.
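If you want to see those two steps separately, you can trigger the child-process compilation yourself before loading (in recent Julia versions Pkg.precompile can target a named package; plain Pkg.precompile() works everywhere):

using Pkg
Pkg.precompile("MyPack")   # compiles in a separate process and stores the snapshot on disk
using MyPack               # now this only loads the stored snapshot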