Help me think through an efficient workflow for GPS data:

I have 371 files of detrended GPS data, each formatted like this:
[decimalyear, longitude, latitude, value, sigma, station]

My end goal is to plot smoothed data by region to highlight an anomaly I'm investigating for my master's thesis.

My workflow so far is probably clunky and inefficient, and any advice on how to improve it is much appreciated.

First, I compute a centered median of the detrended data (I will probably have to switch to a centered average, since the median is a bit noisy, but that's beside the point right now):

using Statistics

function calculate_centered_median(data, dates, all_possible_dates, window_size)
    @assert isodd(window_size) "Window size must be odd"
    half_window = div(window_size - 1, 2)
    rolling_median = similar(data)
    n = length(data)
    # (all_possible_dates is not used below)

    # gap boundaries: pairs of consecutive dates more than 5 days apart,
    # computed once up front rather than inside the inner loop
    gaps = [(dates[k], dates[k + 1]) for k in 1:n-1 if dates[k + 1] - dates[k] > 5 / 365.25]

    # for each point, restrict the window and do not smooth across gaps
    for i in 1:n
        current_date = dates[i]

        # window bounds, converting days to decimal years
        date_range_start = current_date - half_window / 365.25
        date_range_end = current_date + half_window / 365.25

        # valid data: inside the time bounds and not separated from the
        # current point by a > 5-day gap
        valid_indices = findall(1:n) do j
            dates[j] >= date_range_start &&
                dates[j] <= date_range_end &&
                !any(g -> min(current_date, dates[j]) <= g[1] &&
                          g[2] <= max(current_date, dates[j]), gaps)
        end

        if isempty(valid_indices)
            rolling_median[i] = NaN
        else
            rolling_median[i] = median(data[valid_indices])
        end
    end

    return rolling_median
end

The output is a new CSV with all the same data as the original detrended dataset, plus the windowed value inserted after the value column:
[decyr, lon, lat, value, centeredmedian, sigma, station]

From here I have a naming-convention problem: the 371 files are organized by GPS station (i.e., they're named like AGMT_detrend), but I need them organized by epoch (decimal year). So I iterate through all my files, match rows by decimal year, and write each matching set into its own per-epoch file (e.g., 2005.296_CenterVert.csv). This produces 1826 individual files, each containing all stations observed at that epoch:

using DelimitedFiles, Glob

function matchyr(yr)
    mathold = []
    c_glob = glob("*.txt", data_path)
    for file in c_glob
        c_hold = readdlm(file)
        # ensure the file is not empty and has the expected structure
        if size(c_hold, 2) >= 1
            ffilt = c_hold[:, 1] .== yr   # rows whose decimal year matches
            fdata = c_hold[ffilt, :]
            if !isempty(fdata)
                println("Match in dataset: ", basename(file))
                push!(mathold, fdata)
            end
        else
            println("Warning: file ", basename(file), " does not have the expected structure.")
        end
    end
    return mathold
end
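
For context, the driver that produces the per-epoch files is essentially the loop below (a sketch; all_epochs stands in for my list of 1826 decimal-year epochs). Note that it rereads all 371 station files once per epoch, roughly 680,000 file reads in total, which I suspect is where most of the runtime goes:

using DelimitedFiles

for yr in all_epochs
    matches = matchyr(yr)
    isempty(matches) && continue
    epoch_data = reduce(vcat, matches)   # stack the per-station matrices
    writedlm("$(yr)_CenterVert.csv", epoch_data, ',')
end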

From here, I want to create longitudinal bins and feed everything into a single large file to make visualization easier. This CSV ends up almost 700k rows deep, which may not be ideal, so I may split it into 4-5 CSVs depending on how many bins I settle on. Here's the binning function:

# assign regions in longitude increments
function assign_region(longitude::Float64)
    bin_start = floor(longitude / 24) * 24
    bin_end = bin_start + 24
    return "Region $(Int(bin_start))° to $(Int(bin_end))°"
end

using DataFrames, Glob

# directory
directory = "/path/to/data/"
files = glob("*.txt", directory)

# initialize an empty DataFrame to hold all data
all_data = DataFrame(
    Epoch = Float64[],
    Longitude = Float64[],
    Latitude = Float64[],
    Value = Float64[],
    CenteredMedian = Float64[],
    Sigma = Float64[],
    Station = String[],
    Region = String[]
)
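
The loop that fills it looks roughly like this (a sketch, assuming the seven-column [decyr, lon, lat, value, centeredmedian, sigma, station] layout from the previous step):

using DelimitedFiles

# read each per-epoch file, tag every row with its region, and append it
for file in files
    raw = readdlm(file, ',')
    for r in eachrow(raw)
        push!(all_data, (
            Epoch          = Float64(r[1]),
            Longitude      = Float64(r[2]),
            Latitude       = Float64(r[3]),
            Value          = Float64(r[4]),
            CenteredMedian = Float64(r[5]),
            Sigma          = Float64(r[6]),
            Station        = string(r[7]),
            Region         = assign_region(Float64(r[2]))
        ))
    end
end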

Almost done. Everything is now in a single CSV formatted as [decyr, lon, lat, value, centeredmedian, sigma, station, region].
I then take the median of all the centered medians for stations within the same region at the same epoch, to better visualize the temporal/spatial relationship between regions. The actual value represented here does not matter, only when and where the amplitude peaks. This is how I accomplish it:

using Statistics

function calculate_median_of_medians(data, region, epoch)
    # filter rows matching both region and epoch, excluding station LOU5
    matching_data = filter(row -> row.Region == region &&
                                  row.Epoch == epoch &&
                                  row.Station != "LOU5", data)

    # centered-median values for the matching rows
    medians = matching_data.CenteredMedian

    # return NaN if no matching data is found
    isempty(medians) && return NaN

    return round(median(medians), digits = 5)
end

When all is said and done, I create a plot like this:

I guess what I'm trying to do is make this whole process faster for when I inevitably need to start from scratch because some variable changed. It currently takes a few hours. There are steps prior to the starting point I shared here, but those can't be changed, since they're Fortran scripts from my advisor, and I'm not about to mess with those.

Are there any superfluous aspects I should think about removing? Can any of these steps be condensed? I'm not 100% convinced that writing a file output at each step is necessary, but I don't want to bottleneck my computer with too much data in memory either.

Sorry for the lengthy post, and I apologize if this is outside the scope of this particular domain; I just don't know where else to post or whom to ask for help. I feel like I won't get any better at optimizing my scientific code if I keep doing the same thing over and over without feedback.

Thank you in advance for any input/advice! It is always appreciated and I’m grateful for the time anyone takes to provide any feedback.

  • Rob

I didn’t read all the details, so I apologize if I’m missing the mark here, but this sounds like a good case for partitioned parquet files.

If you had partition directories on decimal year and station, you could easily group and filter on those. Additionally, you could compute column statistics on longitude to give Parquet more info in the metadata as you write out the table.
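
A rough, untested sketch of writing hive-style partitions by hand (assuming the combined all_data DataFrame from your post; Parquet2.jl can then treat the directory tree as one dataset):

using DataFrames, Parquet2

# one parquet file per (epoch, station) partition directory, hive-style,
# so readers can filter on the directory names before touching the data
for g in groupby(all_data, [:Epoch, :Station])
    dir = joinpath("dataset",
                   "epoch=$(first(g.Epoch))",
                   "station=$(first(g.Station))")
    mkpath(dir)
    Parquet2.writefile(joinpath(dir, "part.parquet"), g)
end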

Check out Parquet2.jl’s multi-file dataset docs here:

Awesome, thanks and I’ll check that out!

To speed up working with (subsets of) your data, I would first put the whole lot, from all 371 files, into a single table and store it on disk. Potential formats are HDF5 or a database like SQLite.

This will greatly speed up reading your data compared to many small CSV files. The table's columns should include 'centeredmedian', 'region', etc., even if they are not computed yet; you fill in those values later by updating the table.
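
For the SQLite route, a minimal sketch of the one-time ingest (untested; data_path and the column layout are from the original post):

using SQLite, DBInterface, CSV, DataFrames, Glob

db = SQLite.DB("gps.sqlite")

# one-time ingest: all 371 station files into a single table
for file in glob("*.txt", data_path)
    df = CSV.read(file, DataFrame;
                  header = [:epoch, :lon, :lat, :value, :sigma, :station],
                  delim = ' ', ignorerepeated = true)
    SQLite.load!(df, db, "gps"; ifnotexists = true)
end

# index once, then pull any epoch without rereading 371 files
DBInterface.execute(db, "CREATE INDEX IF NOT EXISTS idx_epoch ON gps(epoch)")
one_epoch = DataFrame(DBInterface.execute(db, "SELECT * FROM gps WHERE epoch = 2005.296"))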

The curves, especially the red one, show a maximum in year 2003, the solar maximum about 22 years ago (we are presently seeing the second maximum since then). Are you seeing ionospheric effects in the GPS data? Would that be expected? Or are these geodetic data, where ionospheric effects are an unwanted disturbance?