Help me think through an efficient workflow for GPS data:

I have 371 files of detrended GPS data, each formatted like this:
[decimalyear, longitude, latitude, value, sigma, station]

My end goal is to plot smoothed data by region to highlight an anomaly I'm investigating for my master's thesis.

My workflow so far is probably clunky and inefficient, and any advice on how to improve it is much appreciated.

First, I compute a centered median of the detrended data (I will probably have to switch to a centered average, since the median is a bit noisy, but that's beside the point right now):

using Statistics

function calculate_centered_median(data, dates, all_possible_dates, window_size)
    @assert isodd(window_size) "Window size must be odd"
    half_window = div(window_size - 1, 2)
    rolling_median = similar(data)
    n = length(data)
    # (all_possible_dates is not used below)

    # gap boundaries: pairs of consecutive dates more than 5 days apart,
    # computed once up front rather than inside the inner loop
    gaps = [(dates[k], dates[k + 1]) for k in 1:n-1 if dates[k + 1] - dates[k] > 5 / 365.25]

    # for each point, restrict the window and do not smooth across gaps
    for i in 1:n
        current_date = dates[i]

        # window bounds, converting days to decimal years
        date_range_start = current_date - half_window / 365.25
        date_range_end = current_date + half_window / 365.25

        # valid data: inside the time bounds and not separated from the
        # current point by a > 5-day gap
        valid_indices = findall(1:n) do j
            dates[j] >= date_range_start &&
                dates[j] <= date_range_end &&
                !any(g -> min(current_date, dates[j]) <= g[1] &&
                          g[2] <= max(current_date, dates[j]), gaps)
        end

        if isempty(valid_indices)
            rolling_median[i] = NaN
        else
            rolling_median[i] = median(data[valid_indices])
        end
    end

    return rolling_median
end

The output is a new CSV with all the same data as the original detrended dataset, plus the windowed value inserted after the value column:
[decyr, lon, lat, value, centeredmedian, sigma, station]

From here I have a naming-convention problem: the 371 files are organized by GPS station (i.e., they're named like AGMT_detrend), but I need them organized by epoch (decimal year). So I iterate through all my files, match rows by decimal year, and write each matching set into its own per-epoch file (e.g., 2005.296_CenterVert.csv). This produces 1826 individual files, each containing all stations observed at that epoch:

using DelimitedFiles, Glob

function matchyr(yr)
    mathold = []
    c_glob = glob("*.txt", data_path)
    for file in c_glob
        c_hold = readdlm(file)
        # ensure the file is not empty and has the expected structure
        if size(c_hold, 2) >= 1
            ffilt = c_hold[:, 1] .== yr   # rows whose decimal year matches
            fdata = c_hold[ffilt, :]
            if !isempty(fdata)
                println("Match in dataset: ", basename(file))
                push!(mathold, fdata)
            end
        else
            println("Warning: file ", basename(file), " does not have the expected structure.")
        end
    end
    return mathold
end
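
For context, the driver that produces the per-epoch files is essentially the loop below (a sketch; all_epochs stands in for my list of 1826 decimal-year epochs). Note that it rereads all 371 station files once per epoch, roughly 680,000 file reads in total, which I suspect is where most of the runtime goes:

using DelimitedFiles

for yr in all_epochs
    matches = matchyr(yr)
    isempty(matches) && continue
    epoch_data = reduce(vcat, matches)   # stack the per-station matrices
    writedlm("$(yr)_CenterVert.csv", epoch_data, ',')
end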

From here, I want to create longitudinal bins and feed everything into a single large file to make visualization easier. This CSV ends up almost 700k rows deep, which may not be ideal, so I may split it into 4-5 CSVs depending on how many bins I settle on. Here's the binning function:

# assign regions in longitude increments
function assign_region(longitude::Float64)
    bin_start = floor(longitude / 24) * 24
    bin_end = bin_start + 24
    return "Region $(Int(bin_start))° to $(Int(bin_end))°"
end

using DataFrames, Glob

# directory
directory = "/path/to/data/"
files = glob("*.txt", directory)

# initialize an empty DataFrame to hold all data
all_data = DataFrame(
    Epoch = Float64[],
    Longitude = Float64[],
    Latitude = Float64[],
    Value = Float64[],
    CenteredMedian = Float64[],
    Sigma = Float64[],
    Station = String[],
    Region = String[]
)
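
The loop that fills it looks roughly like this (a sketch, assuming the seven-column [decyr, lon, lat, value, centeredmedian, sigma, station] layout from the previous step):

using DelimitedFiles

# read each per-epoch file, tag every row with its region, and append it
for file in files
    raw = readdlm(file, ',')
    for r in eachrow(raw)
        push!(all_data, (
            Epoch          = Float64(r[1]),
            Longitude      = Float64(r[2]),
            Latitude       = Float64(r[3]),
            Value          = Float64(r[4]),
            CenteredMedian = Float64(r[5]),
            Sigma          = Float64(r[6]),
            Station        = string(r[7]),
            Region         = assign_region(Float64(r[2]))
        ))
    end
end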

Almost done. Everything is now in a single CSV formatted as [decyr, lon, lat, value, centeredmedian, sigma, station, region].
I then take the median of all the centered medians for stations within the same region at the same epoch, to better visualize the temporal/spatial relationship between regions. The actual value represented here does not matter, only when and where the amplitude peaks. This is how I accomplish it:

using Statistics

function calculate_median_of_medians(data, region, epoch)
    # filter rows matching both region and epoch, excluding station LOU5
    matching_data = filter(row -> row.Region == region &&
                                  row.Epoch == epoch &&
                                  row.Station != "LOU5", data)

    # centered-median values for the matching rows
    medians = matching_data.CenteredMedian

    # return NaN if no matching data is found
    isempty(medians) && return NaN

    return round(median(medians), digits = 5)
end

When all is said and done, I create a plot like this:

I guess what I'm trying to do is make this whole process faster for when I inevitably need to start from scratch because some variable changed. It currently takes a few hours. There are steps prior to the starting point I shared here, but those can't be changed, since they're Fortran scripts from my advisor, and I'm not about to mess with those.

Are there any superfluous aspects I should think about removing? Can any of these steps be condensed? I'm not 100% convinced that writing a file output at each step is necessary, but I don't want to bottleneck my computer with too much data in memory either.

Sorry for the lengthy post, and I apologize if this is outside the scope of this particular domain; I just don't know where else to post or whom to ask for help. I feel like I won't get any better at optimizing my scientific code if I keep doing the same thing over and over without feedback.

Thank you in advance for any input/advice! It is always appreciated and I’m grateful for the time anyone takes to provide any feedback.

  • Rob

I didn’t read all the details, so I apologize if I’m missing the mark here, but this sounds like a good case for partitioned parquet files.

If you had partition directories on decimal year and station, you could easily group and filter on those. Additionally, you could compute column statistics on longitude to give Parquet more info in the metadata as you write out the table.
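
A rough, untested sketch of writing hive-style partitions by hand (assuming the combined all_data DataFrame from your post; Parquet2.jl can then treat the directory tree as one dataset):

using DataFrames, Parquet2

# one parquet file per (epoch, station) partition directory, hive-style,
# so readers can filter on the directory names before touching the data
for g in groupby(all_data, [:Epoch, :Station])
    dir = joinpath("dataset",
                   "epoch=$(first(g.Epoch))",
                   "station=$(first(g.Station))")
    mkpath(dir)
    Parquet2.writefile(joinpath(dir, "part.parquet"), g)
end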

Check out Parquet2.jl’s multi-file dataset docs here:

Awesome, thanks and I’ll check that out!

To speed up working with (subsets of) your data, I would first put the whole lot, from all 371 files, into a single table and store it on disk. Potential formats are HDF5 or a database like SQLite.

This will greatly speed up reading your data compared to many small CSV files. The table's columns should include 'centeredmedian', 'region', etc., even if they are not computed yet; you fill in those values later by updating the table.
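
For the SQLite route, a minimal sketch of the one-time ingest (untested; data_path and the column layout are from the original post):

using SQLite, DBInterface, CSV, DataFrames, Glob

db = SQLite.DB("gps.sqlite")

# one-time ingest: all 371 station files into a single table
for file in glob("*.txt", data_path)
    df = CSV.read(file, DataFrame;
                  header = [:epoch, :lon, :lat, :value, :sigma, :station],
                  delim = ' ', ignorerepeated = true)
    SQLite.load!(df, db, "gps"; ifnotexists = true)
end

# index once, then pull any epoch without rereading 371 files
DBInterface.execute(db, "CREATE INDEX IF NOT EXISTS idx_epoch ON gps(epoch)")
one_epoch = DataFrame(DBInterface.execute(db, "SELECT * FROM gps WHERE epoch = 2005.296"))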

The curves, especially the red one, show a maximum in year 2003, the solar maximum about 22 years ago (we are presently seeing the second maximum since then). Are you seeing ionospheric effects in the GPS data? Would that be expected? Or are these geodetic data, where ionospheric effects are an unwanted disturbance?