DrWatson - the perfect sidekick to your scientific inquiries!

Dear All,

I am very happy to announce the first beta version of our new package, called DrWatson:

https://juliadynamics.github.io/DrWatson.jl/dev/

Motivation

DrWatson is the result of some scientists being “fed up” with the hardship of handling scientific projects. We want to stop repeating the same processes for every project, stop screaming our lungs out in frustration, and make handling our scientific projects easier. Things like:

  • Urgh, I moved my folders and now my load commands don’t work anymore!
  • Do I have to produce a dataframe of my finished simulations AGAIN?!
  • PFfffff I am tired of typing savename = "w=$(w)_f=$(f)_x=$(x).jld2", can’t I do it automatically?

are what we want to battle.

Why this can really help you

I believe that DrWatson is a package that can truly make your life easier as a scientist, by removing the annoying parts of managing a project.
On the Description page of the DrWatson documentation you will find the core principles of the package:

  1. Non-Invasive.
  2. Simple.
  3. Consistent.
  4. Allows increments.
  5. Reproducible.
  6. Modular.
  7. Scientific.

What I really want to focus on are the first two points. Many similar approaches exist that aim to support scientific project management (see the “Inspirations” section of our documentation), but what I have come to notice is that they suffer from a common problem: they just aren’t simple enough. You have to do too much work to use them!

DrWatson takes a radically different approach: instead of complicated pipelines that you have to follow to benefit, DrWatson only asks you to use a couple of functions.

Example 1: savename

As a first example, let’s look at the very common situation of using variable values to create file names. Say you are running a simulation with variables a=3, b=5, model="water". Typically, when saving the simulation data resulting from your script, you would consider writing a name like "prefix_a=$(a)_b=$(b)_model=$(model)". At some point you might get frustrated with having to do this all the time, and you might write a function that takes in a dictionary and produces such a string.

This is what the function savename(c) from DrWatson does. It transforms any key-based Julia container c (Dict, NamedTuple, composite type) into a string like the one above. Besides doing what you want, it is also deterministic and allows for customization.

(of course, you don’t have to use the customization aspects, this is where the Modular aspect of DrWatson shines)
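To make the idea concrete, here is a minimal sketch that does not use DrWatson at all. The helper simple_savename is hypothetical, and its sorting rule and its prefix and suffix keywords are assumptions made for illustration, not the package's API or implementation. It sorts the keys so the result is deterministic and joins key=value pairs with underscores.

```julia
# Hypothetical helper, not part of DrWatson: build a file name from key-value pairs.
function simple_savename(c::AbstractDict; prefix = "", suffix = "")
    # Sort keys by their string form so the same container always gives the same name.
    ks = sort!(collect(keys(c)); by = string)
    name = join(("$(k)=$(c[k])" for k in ks), "_")
    isempty(prefix) || (name = prefix * "_" * name)
    isempty(suffix) || (name = name * "." * suffix)
    return name
end

# NamedTuples reuse the same code by converting them to a Dict first.
simple_savename(nt::NamedTuple; kwargs...) = simple_savename(Dict(pairs(nt)); kwargs...)

params = Dict(:a => 3, :b => 5, :model => "water")
println(simple_savename(params; prefix = "sim", suffix = "jld2"))
# prints sim_a=3_b=5_model=water.jld2
```

Sorting the keys is what makes the name independent of the order in which the container happens to store them.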

Example 2: tagsave

Wouldn’t it be awesome if every saved datafile contains a record of the Git commit of the project, when the file was saved? Wouldn’t it be awesome if achieving this required no additional effort?

If in your code you replace the function save(file, data) with tagsave(file, data) then the saved file will have one additional field called :commit, which will contain the commit ID of your project when you saved the file. And without writing a single extra line of code, all of your saved data tell you the commit they come from! :slight_smile:

Of course, this requires your scientific project to be a Git repository as well. Well, no worries, since DrWatson also offers a simple template for a scientific project which is also a Git repository.
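The idea behind tagging can be sketched in a few lines of plain Julia. The helpers current_commit and with_commit below are hypothetical names and only illustrate the concept; they are not DrWatson's implementation, and the sketch assumes the data is a dictionary with Symbol keys and that git is available on the path.

```julia
# Hypothetical helpers illustrating the idea; this is not DrWatson's implementation.

# Ask Git for the commit ID of the repository located at `dir`.
current_commit(dir = pwd()) = readchomp(Cmd(`git rev-parse HEAD`; dir = dir))

# Return a copy of `data` with the commit ID stored under the key :commit.
function with_commit(data::AbstractDict, commit::AbstractString)
    tagged = Dict{Symbol,Any}(data)
    tagged[:commit] = commit
    return tagged
end

data = Dict(:a => 3, :result => [1.0, 2.0])
# Inside a Git repository one would pass current_commit(); a placeholder is used here.
tagged = with_commit(data, "0123abc")
println(tagged[:commit])
```

The tagged copy can then be handed to whatever save function you already use, which is why the data needs to be a dictionary in the first place.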

Beta Invitation

We are actively looking for beta testers and contributors!

Please consider using DrWatson and helping us with its development. Report problems and give us ideas and feedback through the GitHub page of the repository.

DrWatson is part of JuliaDynamics, so you can also chat in the channel #dynamics-bridged of the Julia Slack.


logo developed by @cormullion (of course)

35 Likes

This is neat - I’ve been contemplating something parallel to this idea. The tagsave concept is definitely critical, but it seems like this is for saving data blobs? What if I want to generate a CSV file or a figure image from a plot? My idea was to have an automatic generation of a data.toml file or something that logs dates, commits and other metadata linked to files, could this do something similar?

My other major requirement is the generation of docs-like files (pdf/html) from notebooks, either from Weave or Literate files, and re-running or doing some sort of testing to make sure code in old notebooks is kept up to date. Any thoughts on that end?

In any case, I’d definitely be interested in testing/contributing.

1 Like

Is this something similar to https://ropensci.github.io/drake/ ?

@kevbonham

Adding the commit ID can be done as long as the file you save is a dictionary. tagsave is just a lightweight wrapper around FileIO.save, but it still needs to somehow add the commit ID. I do not know of another way to do it generally (e.g. for an image; how to “add a string” to an image?).

Seems like a very valid suggestion, please go ahead and open an issue to discuss this further!

I don’t really use these approaches for scientific purposes so I don’t have any experience… But this is exactly why we are looking for beta testers and users and people to contribute opinions.

@mkborregaard

No, from the 6-minute video I just watched they don’t seem similar. DrWatson also seems to have a much simpler and cleaner approach to helping you (e.g. you don’t have to make a “DataFrame” out of every function in your code!)

1 Like

Cool to see that you are making such rapid progress on this!

An integration with the input side of modeling would also be nice. Maybe some integration with DataDeps.jl?

2 Likes

For saving metadata along with the contents, it might be worth adopting JLSO. It wraps other formats (currently BSON), while storing key metadata such as the version of julia and all involved packages that were used in creating the file.

1 Like

Default project setup could use a test directory to encourage people to write unit tests on their src.

6 Likes

Thanks, please open issues for suggestions on the GitHub page so that it is easier for developers to keep track of them! :slight_smile:

1 Like

Interesting. I have found that “stashing” data is a surprisingly difficult problem which I still don’t have a solution for. This seems like it should be very simple but in practice I’ve always found it hard, in part because it’s hard to decide on exactly what the “metadata” is, but also because it’s hard to guarantee that said metadata gets propagated everywhere that something needs to be saved (which is perhaps a sign that I should give up and just be willing to use some global variables). It’s also interesting that this is one of the few problems that seems very much the same in my current job as a data scientist as it did during my career in physics.

I’m eager to take a look at this and see to what extent it suits my needs.

By the way, a “new variant” of this problem for me is that now I frequently have to save to S3 buckets rather than the filesystem (whether actual or emulated). Do you allow saving and loading to arbitrary IO buffers with arbitrary serialization methods?

2 Likes

Right, I don’t think this would work, hence the ideas about having a toml or something that keeps track.
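To make the TOML idea concrete, here is a rough sketch of what such a sidecar log could look like. Everything in it is an assumption for illustration: the helper name log_metadata, the file name data.toml, and the recorded fields are hypothetical, and nothing like this is claimed to exist in DrWatson. It relies only on the TOML and Dates standard libraries (TOML ships with Julia 1.6 and later).

```julia
using TOML, Dates

# Hypothetical sketch: record metadata for an output file in a shared TOML log.
function log_metadata(logfile::AbstractString, datafile::AbstractString; commit::AbstractString)
    # Load the existing log if there is one, otherwise start with an empty table.
    entries = isfile(logfile) ? TOML.parsefile(logfile) : Dict{String,Any}()
    entries[datafile] = Dict{String,Any}("commit" => commit, "saved" => string(now()))
    open(io -> TOML.print(io, entries), logfile, "w")
    return entries
end

logfile = joinpath(mktempdir(), "data.toml")
log_metadata(logfile, "figure1.png"; commit = "0123abc")
log_metadata(logfile, "results.csv"; commit = "0123abc")
print(read(logfile, String))
```

Because the metadata lives next to the files rather than inside them, this works for CSV files and images just as well as for dictionaries.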

Another issue that occurs to me: you have to be sure you tagsave after committing the code that makes it. This is usually the reverse of my workflow. I typically write code, try a few things, then once I get a result, commit. Presumably this would mean my tagsaved file would have the hash of the previous commit.

1 Like

@ExpandingMan

Metadata is an area where we have not developed DrWatson much, mostly because we are considering adding an API to CaosDB, a new database management system. DrWatson is more about managing your scientific project and less about data management, at least in its present state. The data management is as basic as it can get (but still useful, I feel).

There are no restrictions imposed by DrWatson: you continue to use your project as you see fit. Now, if you want to use the function tagsave, well, it is a wrapper around FileIO.save. This was the only way we could make it general enough. I do not know the full extent of the capabilities of FileIO, but it should not be hard for us to dispatch on different IO buffers before calling save. Open an issue if you have a concrete idea/suggestion!

@kevbonham

Yeah, unfortunately this is most often the case for most of us… :smiley: But it is bad practice. And writing a package that takes care of this for you automatically would be too invasive. For now, if you try to get the commit ID of a dirty repo, DrWatson throws a warning (not an error!) and also appends the string _dirty to the commit ID; see the docs around tagsave.

This somewhat compensates for the problem you mention.
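The dirty marker described above can be sketched as follows. The helper names are hypothetical and the logic is only an assumption about one reasonable way to do it, not DrWatson's actual code: here a repository counts as dirty when git status --porcelain prints anything, which includes untracked files.

```julia
# Hypothetical sketch, not DrWatson's implementation.

# A repository is treated as dirty if `git status --porcelain` reports any change.
isdirty_repo(dir = pwd()) = !isempty(readchomp(Cmd(`git status --porcelain`; dir = dir)))

# Append "_dirty" to the commit ID and warn (without throwing) on uncommitted changes.
function commit_label(commit::AbstractString, dirty::Bool)
    dirty || return String(commit)
    @warn "Repository has uncommitted changes; the saved commit ID may not match the code."
    return commit * "_dirty"
end

println(commit_label("0123abc", false))  # prints 0123abc
println(commit_label("0123abc", true))   # prints 0123abc_dirty (after a warning)
```

In a real project one would call commit_label(commit, isdirty_repo()) so the suffix reflects the actual state of the working tree.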

1 Like

This is really something I’ve been thinking about a lot as well. I was using Sumatra for a while, which worked pretty well, but since it’s written in Python, it required a bit of a context switch every time I used it. The requirement to commit my code before saving data also became kind of a pain, even though, as was pointed out by others, it is good practice. I like the idea of throwing a warning and adding the _dirty tag.
I will definitely give this a try. Thanks!

1 Like

Just pinging in, there is a standard way to add almost any information to images: in their metadata. A related effort was discussed in this post. So for an actual image, yea you can add a string (among other things). But this is a specific solution for images, maybe a more general solution (like the TOML idea of @kevbonham) is better.

Arggghh. Meaningful file names, and meaningful directory names. Arrrgggh.
I suppose there’s no stopping them, but we are in an era of S3 compliant object stores.
One day we will get away from huge deep directory trees where the directory names are the metadata.

Look at Arcitecta, for instance (commercial product)

2 Likes

The tagsave thing I really like though!

1 Like

I don’t agree with this approach. I’d much rather have my data in the folder of my science project, in a clear and well organized folder tree. I know what each folder means. The Arcitecta product that you shared is already too complicated for me. The mere fact that it uses a server to handle data is already too complicated for me.

I think there are two things to separate here:

  1. A specific, contained scientific project (which of course could use data from other projects)
  2. A huge collaboration with many projects, many many people and an absurd amount of data.

For example, I do the first, and DrWatson also has the clear goal of helping the first. DrWatson is not a data management system (but as I said, we plan to interface with CaosDB, which is a data management system tailored for scientific projects).

1 Like

I’ve added a note in the docs clarifying the data management part (which is not the purpose of DrWatson) and mentioning CaosDB (whose purpose is data management).

But doesn’t CaosDB fall into the too-complicated category as well? Here is a figure from their paper:

I’ve been procrastinating reading up on https://www.datalad.org/, which was posted by @c42f in Data versioning · Issue #37 · JuliaDynamics/DrWatson.jl · GitHub. Based around git, this might be a better fit for Julians?

2 Likes

Yes, of course, it absolutely is too complicated. Which is why it will be a separate package if we manage to port it to Julia.

2 Likes

@Datseris Thank you for separating out the problem like that. I agree.

1 Like