Best way to root a pipeline directory structure inside of a package

I’ve written some bioinformatics analysis scripts in Julia. They are fast and impressed everyone! :heart_eyes: Now I would like to share them with my collaborators. The scripts run inside a pipeline that took many years to develop and that assumes a certain file structure, something like:

ScienceAnalysis1
  |-data
  |.   |-img
  |.   |-raw
  |.   |-tmp
  |-figures
  |-scripts

I’m wondering what the best way to share my scripts is, so as to minimize the stress for my non-Julia colleagues.

I’ve been impressed by the slightly trivial but powerful here::here() function in R. With here::here(), you give a non-programmer colleague your .Rmd file, and if there is an .Rproj file in ScienceAnalysis1, it knows where the files are and can walk those directories with basically no fiddling.
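For illustration, a rough Julia analogue of that lookup could walk upward from a starting directory until it finds a marker file. This is only a sketch: the names find_root and here are hypothetical, and using Project.toml as the marker (in place of the .Rproj file) is an assumption.

```julia
# Walk upward from `start` until a directory containing `marker` is found.
# `find_root`, `here`, and the choice of marker are hypothetical.
function find_root(start::AbstractString = pwd(); marker::AbstractString = "Project.toml")
    dir = abspath(start)
    while true
        isfile(joinpath(dir, marker)) && return dir
        parent = dirname(dir)
        # At the filesystem root, dirname returns its input unchanged.
        parent == dir && error("no $marker found at or above $start")
        dir = parent
    end
end

# Build a path relative to the discovered project root.
here(parts...; start = pwd()) = joinpath(find_root(start), parts...)
```

With this, here("data", "raw") called from anywhere inside ScienceAnalysis1 would resolve against the directory holding the marker file, provided such a file exists there.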

In Julia I would love to be able to do the rough equivalent, but perhaps with slightly more elevated code sharing. I envision the perfect sharing instructions:

1.) ] activate ScienceAnalysis1
2.) ] add MyPackageOnPrivateGithubWithLotsOfDeps
3.) julia>
using MyPackageOnPrivateGithubWithLotsOfDeps
generate_mindblowing_figures()

But right now, in script form, it looks worse:

0.) download and move my_script.jl to ScienceAnalysis1
1.) ] activate ScienceAnalysis1
2.) ] add (all the dependencies)
3.) julia>
include("path/to/ScienceAnalysis1/my_script.jl") (the script makes heavy use of keyword arguments whose defaults come from global path variables set with @__DIR__)
generate_mindblowing_figures()

I could make a package, but then the best practice for telling the package where the local directory or the project directory is isn’t clear to me.

I see the “experimental” Pkg.project.
Would setting PROJECT_HOME = Pkg.project().path at the top of the package module work? Does anyone have experience using Pkg.project().path inside a module to tell a package where the “project home” is? Is this a bad idea?

A related question: is it okay to have environment-dependent path keywords in the exported functions, or does this break some kind of principle? Maybe it should be a dynamic function, PROJECT_HOME() = Pkg.project().path?
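A minimal sketch of the dynamic-function variant, assuming the active environment is the project folder. The module name ProjectPaths and the helper datadir are hypothetical. Two details matter here: Pkg.project().path is the path of the Project.toml file itself, so dirname is needed to get the folder, and a top-level assignment in a precompiled package is evaluated when the package is precompiled, whereas a function is evaluated on each call.

```julia
module ProjectPaths  # hypothetical module name

import Pkg

# Evaluated at call time, so it follows whichever environment is active.
# `Pkg.project().path` is the Project.toml file, hence the `dirname`.
project_home() = dirname(Pkg.project().path)

# Hypothetical helper: keyword defaults are also evaluated on every call.
datadir(parts...; home = project_home()) = joinpath(home, "data", parts...)

end # module
```

A caller can still override the root explicitly, for example ProjectPaths.datadir("raw"; home = "path/to/ScienceAnalysis1").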

Another almost perfect option is to set a global variable inside the module: MyPackageOnPrivateGithubWithLotsOfDeps.PROJECT_HOME = Pkg.project().path. It’s just an extra step.

I’d appreciate expert thoughts on this: which option is preferred, and why?

Would this do it for you?

https://docs.julialang.org/en/v1/base/base/#Base.@__DIR__


I am currently making heavy use of @__DIR__, but I’m wondering if I can make the code sharing process even smoother and the code smell a bit better. It feels a little delicate: can I use @__DIR__ inside an imported module? Won’t it give the path of the module where the code is written? It’s not easy for me to reason about.

Pkg.project().path seems better, but it also feels like I’m using something outside its intended purpose. Sorry for being long-winded.

My very first thought reading through your description is: would DrWatson.jl be a better, more dynamic, and less brittle fit for your problem? It has a lot of the functionality (I think) you are looking for built in. Just a thought!

Thanks @TheCedarPrince, I’ve looked into DrWatson before, and I always think the same thing: it’s too beefy for me! I want a minimally invasive solution that doesn’t disrupt the existing pipeline structure. That said, when building a pipeline from scratch it must be amazing to use, especially if it’s set up on a shared filesystem.


If you have a package, using @__DIR__ to reference files within the directory structure of the package seems very safe to me. I use that frequently to write tests, for example.

In any case, I definitely recommend structuring this as a package. Your “ideal workflow” is just what that is.

(The only complication I see is that you intend to keep the repository private. That can make your life harder. Otherwise, just let the users install the package by adding the package URL.)

To be clearer: the user will never call @__DIR__ themselves. It stays internal to your package, where it references the files you need that belong to the package structure. Something like:

module MyPkg
     # Directory of this source file, usually the MyPkg/src directory
     const dir = @__DIR__
     # Test file shipped with the package, used as the default input
     const test_file = joinpath(dir, "..", "tests", "data", "my_test_file.txt")
     function my_awesome_function(user_input; data = test_file)
          # ...
     end
end

That way the function works without the user providing a data file, because it falls back to the bundled test file, and if the user wants, they can pass another file.
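To see why this is safe, and why @__DIR__ reports the location of the source file rather than of the caller, here is a small self-contained check. It is only a sketch meant to be run at top level in a script or the REPL; the file name where.jl and the function where_am_i are made up for the demonstration.

```julia
# Write a tiny source file into a temporary directory (a stand-in for MyPkg/src).
srcdir = mktempdir()
write(joinpath(srcdir, "where.jl"), "where_am_i() = @__DIR__\n")

# Load the file from somewhere else; the current working directory is unchanged.
include(joinpath(srcdir, "where.jl"))

# The macro was expanded while loading where.jl, so the function reports
# that file's directory no matter where it is later called from.
same_dir = realpath(where_am_i()) == realpath(srcdir)
@show same_dir
```

This is also why @__DIR__ inside a package cannot tell the package where the user’s project folder is: it only ever knows about the package’s own files.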


I misread your post at first – my bad! I think you’re right about Pkg. I’m not aware of a way to get the path to the file where a module is imported.

One other thought is that you could just ask for the project dir to be passed in at the top of your package’s workflow. Sure, that means the script needs to call @__DIR__, but it seems like a relatively simple thing to document. I’m not arguing it’s ideal, but it’s an option!

What about this guy?

https://docs.julialang.org/en/v1/base/constants/#Base.PROGRAM_FILE

I will try some stuff out, maybe I will find a useful pattern. It’s still not quite it, but @lmiq thanks again for your generous and patient explanation as always. You are an amazing contributor!


I just wanted to report that my package for drawing multichannel images with lots of color and possibly cell masks ended up using the global variable approach: Package.home holds the project root, and the path keyword defaults are set to functions that return paths derived from this global variable and create their respective directory when called, if necessary.
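A minimal sketch of that pattern, not the actual package code: the module FigurePaths and the names HOME, set_home!, home, figdir, rawdir, and draw_figure are all hypothetical, and the global is stored in a Ref so that it stays type-stable.

```julia
module FigurePaths  # hypothetical stand-in for the real package

# Project root chosen by the user; empty means "use the current directory".
const HOME = Ref("")

set_home!(path::AbstractString) = (HOME[] = abspath(path))
home() = isempty(HOME[]) ? pwd() : HOME[]

# Each helper returns a path under the root and creates the directory if needed.
figdir() = mkpath(joinpath(home(), "figures"))
rawdir() = mkpath(joinpath(home(), "data", "raw"))

# Keyword defaults are evaluated on every call, so they follow set_home!.
# The body is a placeholder: it only reports where a figure would be written.
function draw_figure(name; outdir = figdir())
    return joinpath(outdir, name * ".png")
end

end # module
```

A user would then call FigurePaths.set_home!("path/to/ScienceAnalysis1") once per session, after which FigurePaths.draw_figure("cells") needs no path arguments at all.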

I chose this approach for three reasons. First, I didn’t want my R friends mucking with environments, so a global package install is probably easier. Second, I could reason about the code better. Finally, if they want to do more than one project, they won’t be hit with precompilation again, so they will get a good impression of Julia’s latency (i.e. pure post-1.9 awesomeness).

So I guess the moral of this story for me was to overcome my global bias against globals in Julia? They seem to have a valuable place in drastically reducing the inputs a function call needs, which simplifies public UIs. But let me know if you think there is a better way, now that you can see a little better what the package is trying to do.

Maybe my answer is trivial and my question is also trivial in hindsight?
