Newcomer contributor in JuliaGeo and co. - Help me get started!

Hello everyone! I am George, lead dev/maintainer of JuliaDynamics and JuliaMusic. I just started my first postdoc position at the Max Planck Institute for Meteorology. My topic will be spatiotemporal correlations and nonlinear timeseries analysis on data from the CERES dataset, as well as dynamical systems approaches to simplified models of the Earth’s energy budget. In fact my project will be a nice, fun mix of the principles surrounding DynamicalSystems.jl and Earth observations.

I am very excited about this project, and since I have made a habit of releasing my work as open-source Julia packages, I am looking forward to contributing to the relevant existing packages (or even making new ones). So I thought I should take the opportunity to introduce myself to this community and maybe also ask for some getting-started tips.

I’ve checked some of the existing packages in JuliaGeo, and some Google searching also led me to ClimateTools.jl and NCDatasets.jl. Is there anything else noteworthy? It seems ClimateTools.jl is a package I should really study and try to contribute to; it will probably become part of my workflow (@Balinus). I would like someone to address or comment on the following points:

  1. Is it correct to say that meteorology- and climate-related functionality and packages are under JuliaGeo? (There is no JuliaMeteorology or JuliaClimate or similar.) If not, should there be a different org? (Having orgs is extremely helpful, for many reasons.)
  2. Preferred package to load NetCDF files: at first glance there seem to be two Julia packages, NetCDF.jl and NCDatasets.jl (btw, why isn’t the latter included in JuliaGeo?). Can someone lay out the core differences to help me decide which one to use? The documentation of NCDatasets.jl only states that NetCDF.jl has a more “Matlab-like” interface.
  3. Transforming a lat-lon (LL) dataset into a geodesic polyhedron representation of the Earth, so that the data points have equal spatial coverage and form a uniformly area-weighted set. I have looked at Geodesy.jl, which seems to be a very cool package, but its coordinate types don’t seem to offer anything along these lines.
  4. @Balinus, what is the status of the ClimateTools.jl package? It is still maintained, right? Would you consider making it part of JuliaGeo (or another org)? It seems that what I asked in point 3 above would naturally fit into ClimateTools, because of its regrid function.
  5. Do you have any good newcomer issues I could tackle, that would also help me get accustomed to the packages relevant to the work I will be doing?
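On point 3, a common stopgap until a proper geodesic-grid transform exists is to weight each lat-lon cell by the cosine of its latitude, which is proportional to its area. A plain-Julia sketch with made-up data (no geo package assumed):

```julia
using Statistics

# Toy lat-lon field; real data would come from e.g. a CERES NetCDF file.
lats = -89.0:2.0:89.0                    # cell-center latitudes (degrees)
lons = 0.0:2.0:358.0                     # cell-center longitudes (degrees)
field = rand(length(lons), length(lats)) # hypothetical (lon, lat) snapshot

# Cell area on a regular lat-lon grid is proportional to cos(latitude).
w = cosd.(lats)
w ./= sum(w)                             # normalize the latitude weights

# Area-weighted global mean: zonal mean per latitude band, then weight.
zonal = vec(mean(field, dims=1))
global_mean = sum(zonal .* w)
```

This only fixes the averaging, not the sampling; a true geodesic polyhedron grid would also need an actual regridding step.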

Of course, if any of the above questions has no answer at the moment, I am very much willing to create and contribute that answer myself. I think once I get the hang of working with geodata (I come from a different field, condensed matter physics), I will contribute more and more.



Welcome to JuliaGeo @Datseris, it is exciting to have you here. I am particularly interested in the spatial aspect of the spatiotemporal research you mentioned, and have been putting together some nice features in the GeoStats.jl project over the years. ClimateTools.jl uses GeoStats.jl under the hood for some spatial interpolation, and I am aligning features with @Balinus whenever possible. There is still a lot of work to be done in the GeoStats.jl stack, particularly now that it is integrated with the MLJ.jl stack for learning spatial properties. Collaborating with someone who has a strong mathematical background would be awesome. Climate models, and statistics in spherical coordinates, are a non-trivial research arena. :slight_smile:

There are a few open issues in the project that you could take a look at. Example issues that don’t need much theory include (1) porting all plot recipes to the Makie.jl recipe system when it is ready #46, (2) the ability to use Unitful.jl coordinates in the spatial object types defined in the framework #45, and (3) better support for missing values throughout the methods #42. Example issues that require theory and more careful design are coming up as well, including the need for better support for volumetric measurements (e.g. general mesh types a la VTK) #40. This last issue is connected to a whole discussion in the JuliaGeo org, led by @Raf, about improving the integration of different spatial data types across the stack:

Overall, we are trying to come up with a trait system that is general enough to accommodate all the requirements from a spatial statistics (or geostatistics) perspective, and from a high-performance computing perspective (fast indexing, xarrays, etc.). Lately I’ve been busy adding features to GeoStats.jl more directly connected to my research, but as soon as I find the time, I will come back to this important issue raised by org members.


Hello @Datseris, I’m really happy that our packages can help you begin your research!

Perhaps a bit of background. I began developing ClimateTools when I decided to leave Matlab for Python (Julia was at version 0.3). I found Julia on my way and saw an opportunity to design something for my needs, with the hope that someone would use it! So, the current design of ClimateTools might need some refreshing, as the first commits date back to 2016 (I think), and also because at the time I had to code the whole workflow (i.e. extraction of netCDF files, interpolation, post-processing, climate indices calculations, plotting!). This means that some parts of this workflow are suboptimal and should rely on newer approaches (see below). Specifically, the whole loading of the data is a mix of workarounds (but it works!).

One aspect of my old job was to create “climate scenarios”: more specifically, post-processing of climate simulations. In one project I had, the post-processing computations using MATLAB took on the order of months with Quantile-Quantile mapping (~200 simulations, interpolated to a 10 km grid, post-processing of daily values). Despite being very long, this led to an interesting article though! I quickly prototyped a version in Julia and it was about 20x faster, so that was the end of MATLAB for me! I think that statistical treatment of climate timeseries is certainly one focus of ClimateTools.jl.
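For the curious, the core idea of Quantile-Quantile mapping fits in a few lines of plain Julia. This is only a toy sketch of the method on synthetic data, not the actual ClimateTools.jl implementation:

```julia
using Statistics

# Empirical quantile mapping: send each simulated value to the observed value
# at the same empirical quantile, using a common reference period.
function quantile_map(sim::AbstractVector, obs_ref::AbstractVector, sim_ref::AbstractVector)
    probs = 0.01:0.01:0.99
    qobs = quantile(obs_ref, probs)   # observed quantiles (reference period)
    qsim = quantile(sim_ref, probs)   # simulated quantiles (reference period)
    map(sim) do x
        # locate x in the simulated distribution, read off the observed one
        i = searchsortedlast(qsim, x)
        i == 0 && return qobs[1]
        i >= length(probs) && return qobs[end]
        # linear interpolation between the bracketing quantiles
        t = (x - qsim[i]) / (qsim[i+1] - qsim[i])
        qobs[i] + t * (qobs[i+1] - qobs[i])
    end
end

obs = randn(1000)                 # synthetic "observations"
sim = 2.0 .+ 1.5 .* randn(1000)   # biased, too-spread "simulation"
corrected = quantile_map(sim, obs, sim)
```

After correction, the simulated distribution is pulled toward the observed one, which is exactly the bias-correction step described above.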

Another aspect of ClimateTools is to quickly draw figures and maps for analysis purposes. I was planning to offload this aspect to a new package “ClimateMaps.jl” but never got the time to do this migration (it would imply creating “ClimateTypes.jl” for the shared types). This would also mean that loading ClimateTools would be quicker (mapping uses Python’s Basemap).

As pointed out by @juliohm, I now use GeoStats capabilities for regridding (I was using scipy previously). Right now the API is quite simple and only uses InverseDistanceWeighting as the solver (hardcoded), but my aim is to parametrize regrid according to GeoStats capabilities, taking care of spatial constraints.
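For context, inverse distance weighting itself is a very simple scheme. A bare-bones plain-Julia sketch (naive Euclidean distance; a real regridder on the sphere would use great-circle distances, and GeoStats.jl offers far more than this):

```julia
# Inverse distance weighting: estimate the value at `target` as a weighted
# average of known `values`, with weights 1/d^p for distance d to each point.
function idw(target, points, values; p=2)
    num, den = 0.0, 0.0
    for (pt, v) in zip(points, values)
        d = sqrt(sum(abs2, target .- pt))
        d == 0 && return v            # target coincides with a data point
        w = 1 / d^p
        num += w * v
        den += w
    end
    return num / den
end

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
values = [10.0, 20.0, 30.0]
idw((0.0, 0.0), points, values)   # exactly on a data point -> 10.0
idw((0.5, 0.5), points, values)   # equidistant from all three -> 20.0
```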

After some thinking about where Julia and Geo stand right now, I think that packages should apply to specific tasks. In other words, ClimateTools no longer needs to do it all. (Better) alternatives now exist (or are close to being “production-ready” and general enough) for some tasks of the workflow. Hence, I think the future of ClimateTools would be to offload the extraction of netCDF/GRIB/etc. data to GeoData.jl and use ClimateTools.jl for the analysis part (complex post-processing, regridding, plots, maps, climate indices, etc.).

Finally (longer post than I expected!), to answer your questions:

  1. There is JuliaGeo and JuliaAtmosOcean.

  2. For NetCDF files, I prefer NCDatasets.jl (currently, both are used in ClimateTools but my aim is to use solely NCDatasets). Other packages worth mentioning are the current efforts by @Raf: DimensionalData.jl and GeoData.jl. The former aims to replace AxisArrays.jl and the latter is another alternative to ClimateTools.jl (at least for the extraction of netCDF data and spatial considerations).

  3. See here:

  4. Yes, the package is still maintained! I’m currently developing extreme values post-processing. I have no problem making it part of JuliaGeo. Not sure how we get invited though?

  5. The next version of ClimateTools should use a backend for extraction of data. Linking the package to GeoData might be an idea, and you would certainly learn a lot about a lot of packages. Offloading the mapping capabilities to ClimateMaps might also be a first (and easier) step to get into Julia and ClimateTools.



Hi George, congrats on the postdoc! Excited to hear that you’ll be working in earth science.

About the two NetCDF packages, you can have a look at this comment and below to get some history. Back then NetCDF.jl was less developed, and @Alexander-Barth wanted to approach the package and API differently, which led to a separate package. Even though I occasionally commit to NetCDF.jl, it looks like NCDatasets.jl may be the better pick to learn. Things like time handling and value scaling are applied automatically, so you don’t have to worry about them. One downside to both is that they get their binary dependency through Conda, which can get quite large. Having binaries from BinaryBuilder would be a nice improvement. I still wish the two packages would share more of the low-level code though.
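To illustrate the convenience, typical NCDatasets.jl usage looks roughly like this (the file and variable names here are made up):

```julia
using NCDatasets

ds = NCDataset("ceres_toa.nc")   # hypothetical file name
sst = ds["sst"][:, :, 1]         # scale_factor/add_offset applied, _FillValue -> missing
t   = ds["time"][:]              # CF time units decoded into DateTime values
close(ds)
```

Per the history above, with NetCDF.jl you would traditionally handle the scaling and time decoding yourself.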

Regarding the organisation question, I’m also curious to hear what others think, but so far everything in JuliaGeo is quite general and related to spatial data. If we decide to move ClimateTools in, it would be the first field-specific package there, which is maybe a good reason to discuss whether we want that or not. I’m inclined to keep JuliaGeo general, and have it provide tools that are useful across many spatial/earth science domains. I’m a hydrologist, and if I wanted to collaborate on hydrology packages I think I’d rather do that in something like JuliaHydrology (doesn’t exist yet). Besides being generally useful, I’d say we should also make an effort to have the packages in JuliaGeo work well together.


On a higher-level note regarding data access: netCDF data can also be accessed with GMT.jl (either from GMT itself or its GDAL bridge), and its mapping capabilities are … good.
For example, to display the SST layer of an .nc file in the OceanColor site, one can do

using GMT

# file address and SUBDATASET selection
fname = "/vsizip/vsicurl/\"\":sst";

# remotely read just the SST layer
G = gmtread(fname, grd=true);

# and display it
imshow(G, region=:global, projection=:Mollweide, frame=:ag, title="SST - September 2019", figsize=15, colorbar=true, coast=true)


Depending on your exact requirements I would mention another package. In case your data gets too big to be processed in memory and you want a tool to map your (multivariate) time series analysis methods over a large chunked out-of-core dataset you may have a look at ESDL.jl. It is optimized to work with zarr datasets to be cloud-compatible but can work with NetCDF data as well. Distributed as well as threaded parallelism is built in and data access is optimized according to your dataset chunks.
As soon as you start working with larger data sets you will realize that the convention to store gridded data with time as the slowest moving index leads to very slow data access along the time dimension, which is why we store our data in chunks along all dimensions to have good data access performance along all axes.
Although the package was originally built around a single dataset, it is now generic enough to be applied to all kinds of data, like Sentinel 1/2; or see this short CMIP6 demo which directly accesses data hosted on Google Cloud Storage.
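To make the storage-order point concrete: in a column-major (lon, lat, time) array, time is the slowest-moving index, so reading a single-pixel time series strides far through memory while a spatial slice is contiguous. A plain-Julia toy illustration (ESDL.jl not involved):

```julia
# Toy (lon, lat, time) array; Julia arrays are column-major, so the last
# dimension (time) is the slowest-moving index, as in most gridded archives.
A = rand(360, 180, 100)

spatial = A[:, :, 1]         # contiguous: 360*180 adjacent elements
series  = A[180, 90, :]      # strided: jumps of 360*180 elements per step

# elements of memory skipped between two consecutive time steps of one pixel:
stride_elems = strides(A)[3]
```

Chunked storage (zarr, or chunked NetCDF) sidesteps this by storing blocks along all dimensions, so neither access pattern is pathological.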

As mentioned before, unlike ClimateTools.jl we don’t provide a rich set of methods, but focus on implementing an efficient (and extended) mapslices (or mapCube, which is very similar to xarray’s apply_ufunc) for datasets with labelled axes. This way you simply bring your own analysis methods (probably from JuliaDynamics) and map them over a big dataset. A very simple example of how this works is this one, where we estimate the intrinsic dimensionality of each multivariate time series in a large spatiotemporal dataset.
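In miniature, the same pattern can be shown with Base’s mapslices instead of mapCube (toy data; no labelled axes, parallelism, or out-of-core handling here):

```julia
using Statistics

A = rand(36, 18, 240)                  # toy (lon, lat, time) cube, 20 years monthly

# Any function of a single time series can be mapped over the cube,
# e.g. a crude trend proxy: correlation of the series with the time index.
trend(x) = cor(x, collect(1:length(x)))

result = mapslices(trend, A, dims=3)   # one number per pixel
```

The result is a 36×18×1 map; mapCube generalizes exactly this, chunk by chunk.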

Let me know if you have further questions, I think I might try to stop by your office next time I visit MPI-MET.


Thanks everyone for your swift and detailed replies, happy to be here!

Before I reply to each one individually I want to justify putting packages into an organization. From my perspective there are numerous advantages in having the repos in the organization:

  1. Inspires more trust.
  2. Increases the pool of people that are likely to review a PR.
  3. Newcomers have a specific collection to search for packages.
  4. Invites more contributions by non-members (why? because the repo is detached from a single person’s name).
  5. It is much easier to find! Because once you find one package from JuliaGeo, you immediately see that it is from JuliaGeo, and thus you click and you go to the org’s page.

I urge you all to consider putting your packages in an organization (probably JuliaGeo, since JuliaAtmosOcean is actually just one package, which could join JuliaGeo). At the moment there is a lot of scattered material, e.g. GeoStats, ClimateTools, ClimateMaps, NCDatasets all have different owners (and in fact, I only became aware of GeoStats and ClimateMaps after your posts, which would not be the case if they were part of JuliaGeo).

Transferring is quite trivial thanks to GitHub; this is how I’ve done it for people joining JuliaDynamics or JuliaMusic:

  1. Invite the owner of the repo to the org (and all important contributors).
  2. The owner transfers the repo to the org via the settings.
  3. Create a Team on the org with owner level access to the transferred repo. Make original owner (and all important contributors) part of this team.

The last step ensures that the person who initially owned the repo retains all privileges over it, without getting full privileges over the entire org. JuliaClimate is an alternative if JuliaGeo is not fitting, but one should be transparent about the scopes then.

@juliohm, GeoStats.jl will certainly be useful, so I’ll add it to the list of “packages to study”. For the beginner issues you cite, probably the missing values and the Unitful.jl are the ones I could tackle at the moment. Could you please describe them in more detail on the GitHub issue page, so that they are more approachable from a beginner’s view? I’ll also keep an eye on the unified interface for data.

@Balinus ,

Cool, I can help with that for sure. This is what we did with Agents.jl lately when we ported it to JuliaDynamics. I think after porting to version 2.0 we have effectively reduced the complexity of the package by half (which is amazing!!!).

I couldn’t agree more. Separating plotting functionality from the actual scientific computing is extremely important. Not only does it speed everything up and reduce file sizes, it also makes it much easier to run stuff on a cluster. Perhaps you can open an issue outlining in detail what should be done, and I’ll have a look? (Btw, we also separated plotting from Agents.jl when we ported it to JuliaDynamics.)

I also agree fully with that and it (massively) helps newcomers. As Stefan Karpinski once said “there should be one package that does that one thing, and it should be the best package”. (Same goes for the 2 NC reader packages imho).

@visr thanks for pointing to this comment, so I can see some different scopes there. What you propose is definitely useful, i.e. reducing code by adding inter-dependencies. But it should also be made clearer in the docs/readmes which package to use for what reason (i.e. what are their actual target goals).

Although you have a point, I am not sure I would be as concerned as you about this. At the end of the day, if there is enough material that a new organization is necessary, one can just transfer the repos in two clicks. At the moment maybe it is worth considering JuliaClimate as an org, but of course I am a newcomer and I shouldn’t be the one judging that.

In my eyes an organization is more about thematic connection between packages, and ease of finding them, and not so much for functionality connection.

@joa-quim thank you for the suggestion of GMT.jl, seems that it is useful to easily plot an ncdataset. But on the other hand yet one more way to load netCDF data just makes things more complicated for me. I’ll stick with NCDatasets.jl for now.

@fabiangans Cool, thanks! Thankfully I will be working with a small dataset (~1gb) for the start of my project, but as I move along I’ll keep this in mind!

I also like your approach of how one “maps” their analysis onto the datasets. It will certainly come in handy for me.

Please do, the number is 408!


Off-topic comment

From a user’s perspective I agree with that, but from a package-owner perspective it feels like I lose my package upon transfer. Sure, if one digs into the contribution stats, it’s easy to find out whose package it really is. But my visibility is somewhat lost.

It would be nice if somehow the two opposing interests could be more aligned within GitHub’s setup.


I see your point, but this argument goes the other way as well: If I am a user and I want to contribute to someone else’s package then “my effort will be hidden”, as they will take “all the credit” being the “one name”. That’s also unfair, isn’t it? A good solution would be to have a list of “main developers” on the upper part of the README maybe?

In general, I think these things just happen when the packages are ready. In the early days of Julia, a lot of repos ended up in organizations, and now a lot of organizations have a bunch of effectively abandoned packages.

I am not sure I understand the reasoning here. From my perspective, clean code, good documentation, and responsive maintainers invite contributions. These are pretty much orthogonal to repos being in an organization.

If anything, I find it easier to contribute to a package that has a single main contributor with a vision about where the package should be going. A lot of PRs just sit around for a long time because no one is willing to make major decisions (which includes just saying no to a PR, but quickly).


Thank you @Datseris, I will add more detail to the issues on GitHub, and will ping you there if that is ok.

Regarding the move to the organization, I think it makes sense to move a package to an org when there is a shared vision for a project. Right now GeoStats.jl is pretty much my own vision for what spatial statistics should look like. In the future, if I perceive that other people with similar backgrounds are joining the effort and contributing to this vision in non-trivial ways, then it makes sense to start an org. I think I agree with what @Tamas_Papp said: it is better to have a single owner with a clear vision who is quick to make major decisions than a community of people touching the code freely. We have examples of projects in the spatial statistics literature that were touched by multiple people and became spaghetti code. I’d like to avoid this.


Discoverability is indeed important for the whole ecosystem (larger than one org) to thrive, and to avoid unnecessary duplicate efforts. But having a large list of packages in various states under one org may not be the best either. We know that not everybody wants to transfer their packages in, for a variety of reasons. To help discoverability we can expand on, which lists packages in the whole ecosystem, not just JuliaGeo. And of course tagging packages can help in finding them on


Sure, this seems like a good middle-ground solution.



Here’s some info about what needs to be done to separate the mapping/plotting functionality into a new package.



seems that it is useful to easily plot an ncdataset. But on the other hand yet one more way to load netCDF data just makes things more complicated for me.

@Datseris It looks like I gave a slightly misleading message. GMT stands for Generic Mapping Tools and is known mostly for its mapping quality, but the data used in it can come from whatever origin the user wants. I provided that example to show an easy way of loading data from a nc file, but the way the data is obtained is irrelevant. Right, it needs to be formatted into a GMTgrid structure, and here is where the ongoing effort to create a grid abstraction model can be useful. I’ll keep an eye on it and try to integrate it in the GMT.jl workflow.

But GMT has been developed for 30 years and is much more than just mapping. For example, you mentioned somewhere above an interest in a regridding utility. In GMT.jl one can downsample the example grid to 0.2 degrees (it was ~0.08) with

Gresamp = grdresample(G, increment=0.2);

But although this would be good for mapping, it’s not the correct thing to do, because the downsampling introduces aliasing. So it is best to first filter the grid to avoid aliasing. That is done with the grdfilter module:

Gresamp = grdfilter(G, increment=0.2, filter="g20", distflag=4);

The above does a Gaussian filter with a filter width of 20 km, where at each node the 20 km are computed knowing that we are on the sphere.
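To see in one dimension why the filter comes before the resampling, here is a toy plain-Julia version of smooth-then-downsample (this is not GMT code; it uses a simple truncated Gaussian kernel with edge replication):

```julia
# Smoothing removes wavelengths the coarser grid cannot represent,
# which would otherwise alias into spurious long-wavelength signal.
gauss(x, σ) = exp(-x^2 / (2σ^2))

function gaussian_smooth(v::AbstractVector, σ)
    half = ceil(Int, 3σ)                      # truncate kernel at 3σ
    w = [gauss(k, σ) for k in -half:half]
    w ./= sum(w)                              # normalize the kernel
    n = length(v)
    # edges handled by replicating the boundary values (clamp)
    [sum(w[j+half+1] * v[clamp(i+j, 1, n)] for j in -half:half) for i in 1:n]
end

signal = [sin(2π * i / 5) for i in 1:1000]    # wavelength-5 oscillation
smoothed = gaussian_smooth(signal, 2.0)        # damp wavelengths near the new spacing
coarse = smoothed[1:4:end]                     # only then downsample by 4
```

Downsampling the raw signal by 4 would fold the wavelength-5 oscillation into a bogus low-frequency pattern; after smoothing, almost nothing of it survives to alias.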

Hope this provides a better idea of what the GMT.jl package is all about.


@Datseris you may also want to look at GeoData.jl if you are working with raster datasets in models. It generalises load/save and indexing for quite a few geospatial file types, including working with large multi-file datasets. It also does lat/long/time etc arbitrary dim order indexing using DimensionalData.jl. And also plotting.

It was really made for modelling - to avoid hardcoding your models to specific formats and file storage structures, especially for larger-than-memory datasets. That’s what I use it for the most. But it’s also great to have easily plottable model inputs and outputs, as spatial data will propagate through most Base/Statistics methods you apply to the array.

It should be released in the next month or two, but a lot of my modelling packages already use it so it’s relatively stable.


It seems that GitHub could help with these issues [i.e. dilution of credit for authorship] by maintaining provenance for packages as they move from private ownership into organizations and through subsequent reorganizations, and by keeping a CV on personal GitHub accounts detailing contributions to and participation in organizations. Maybe they already do some of this? Documented package history and authorship should also help with the inevitable malware and typosquatting attacks.

Let’s try to keep the conversation on topic please. I merely suggested that there could be benefits for having repos in orgs, but talking about the complications of it is honestly unrelated with the topic, which is: a newcomer in meteorology-related work would like to contribute to Julia packages. Anything outside this should be discussed in a separate topic.


Thanks @joa-quim, now I see that GMT is something useful. I am trying to collect all the packages that can plot a spatial field over the earth, and so far I have GMT and ClimateTools (which will be ClimateMaps in the future). Is there something else?

Not that I know. Of course if you are dealing with a small part of the earth and don’t need to take the earth’s shape into account you can use standard Plots.jl / Makie / etcetera.

Would be nice to eventually have pure Julia support for this, for instance in