I have 371 files of detrended GPS data, each formatted as follows:
[decimalyear, longitude, latitude, value, sigma, station]
My end goal is to plot smoothed data by region to highlight an anomaly I'm investigating for my master's thesis.
My workflow so far is probably clunky and inefficient, and any input/advice on how to improve it is much appreciated.
First, I take my detrended data and compute a centered median (I am going to have to switch to a centered average, since the median is a bit noisy, but that's beside the point right now), like so:
function calculate_centered_median(data, dates, all_possible_dates, window_size)
    @assert isodd(window_size) "Window size must be odd"
    half_window = div(window_size - 1, 2)
    rolling_median = similar(data)
    n = length(data)

    # median calculation
    function find_median(arr)
        sorted = sort(arr)
        len = length(sorted)
        return if isodd(len)
            sorted[div(len + 1, 2)]
        else
            (sorted[div(len, 2)] + sorted[div(len, 2) + 1]) / 2
        end
    end

    # for each point, check for gaps and adjust the window
    for i in 1:n
        current_date = dates[i]
        # bounds of the time window (half_window is in days, dates are decimal years)
        date_range_start = current_date - half_window / 365.25
        date_range_end = current_date + half_window / 365.25
        # valid data within the window
        valid_indices = findall(j ->
                dates[j] >= date_range_start &&
                dates[j] <= date_range_end &&
                # exclude points that fall inside a data gap (> 5 days between consecutive dates)
                !any(gap -> dates[j] > gap[1] && dates[j] < gap[2],
                     # pairs of gap boundaries
                     [(dates[k] + 5 / 365.25, dates[k+1]) for k in 1:n-1
                      if dates[k+1] - dates[k] > 5 / 365.25]),
            1:n)  # range we're searching over
        if isempty(valid_indices)
            rolling_median[i] = NaN
        else
            window_values = data[valid_indices]
            rolling_median[i] = find_median(window_values)
        end
    end
    return rolling_median
end
The output of this step is a new CSV per station that has all the same data as the original detrended dataset, but now also has the windowed value after the value column:
[decyr, lon, lat, value, centeredmedian, sigma, station]
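For context, the read/compute/write step around that function looks roughly like this (a simplified sketch: the file names and the 31-day window are placeholders, not my actual values):

using DelimitedFiles

# sketch of the per-station step: read one detrended file, compute the
# centered median of the value column, and write it back out with the
# new column inserted after the value column
raw    = readdlm("AGMT_detrend.txt")        # [decyr lon lat value sigma station]
dates  = Float64.(raw[:, 1])
values = Float64.(raw[:, 4])
med    = calculate_centered_median(values, dates, dates, 31)  # 31-day window (placeholder)
out    = hcat(raw[:, 1:4], med, raw[:, 5:6])  # -> [decyr lon lat value centeredmedian sigma station]
writedlm("AGMT_median.txt", out)            # placeholder output name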
From here I have a naming-convention problem: each of the 371 files is organized by GPS station (i.e. they're named like AGMT_detrend), but I need them organized by epoch (decimal year). So I iterate through all my files, match each one by decimal year, and if there is a match, the matching rows are fed into their own per-epoch file, e.g. 2005.296_CenterVert.csv. This creates 1826 individual files that each contain all stations observed at that epoch:
using DelimitedFiles, Glob

function matchyr(yr)
    mathold = []
    c_glob = glob("*.txt", data_path)
    for file in c_glob
        c_hold = readdlm(file)
        # ensure c_hold is not empty and has the expected structure
        if size(c_hold, 2) >= 1
            # keep only the rows whose epoch (column 1) matches yr
            column_one = c_hold[:, 1]
            ffilt = column_one .== yr
            fdata = c_hold[ffilt, :]
            if !isempty(fdata)
                println("Match in Dataset: ", basename(file))
                push!(mathold, fdata)
            end
        else
            println("Warning: File ", basename(file), " does not have the expected structure.")
        end
    end
    return mathold
end
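The per-epoch write-out is then just a loop over the epochs, roughly like this (simplified; dyr_list is a placeholder for however I generate the list of 1826 decimal-year epochs):

using DelimitedFiles

# write one file per epoch containing every station observed at that epoch
for yr in dyr_list                       # dyr_list = placeholder list of 1826 epochs
    matches = matchyr(yr)
    isempty(matches) && continue
    # stack the matching rows from all station files and write them to one file
    epoch_rows = reduce(vcat, matches)
    writedlm(joinpath(data_path, "$(yr)_CenterVert.csv"), epoch_rows, ',')
end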
From here, I now want to create longitudinal bins and feed everything into a single, large file to make visualization easier. This CSV ends up being almost 700k rows deep, which may not be ideal, so I may split it into 4-5 CSVs depending on how many bins I settle on. But here's the code for that:
using DataFrames, Glob

# assign regions in 24° longitude increments
function assign_region(longitude::Float64)
    bin_start = floor(longitude / 24) * 24
    bin_end = bin_start + 24
    return "Region $(Int(bin_start))° to $(Int(bin_end))°"
end

# directory of per-epoch files
directory = "/path/to/data/"
files = glob("*.txt", directory)

# initialize an empty DataFrame to hold all data
all_data = DataFrame(
    Epoch = Float64[],
    Longitude = Float64[],
    Latitude = Float64[],
    Value = Float64[],
    Median = Float64[],
    Sigma = Float64[],
    Station = String[],
    Region = String[]
)
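The loop that fills this in and writes the combined file is essentially the following (sketched and simplified; the column indices assume the [decyr, lon, lat, value, centeredmedian, sigma, station] layout, and all_regions.csv is a placeholder name):

using DelimitedFiles, CSV

# read every per-epoch file, tag each row with its longitude bin,
# and append it to the combined DataFrame
for file in files
    rows = readdlm(file, ',')            # [decyr lon lat value centeredmedian sigma station]
    for r in eachrow(rows)
        push!(all_data, (
            Epoch     = Float64(r[1]),
            Longitude = Float64(r[2]),
            Latitude  = Float64(r[3]),
            Value     = Float64(r[4]),
            Median    = Float64(r[5]),
            Sigma     = Float64(r[6]),
            Station   = string(r[7]),
            Region    = assign_region(Float64(r[2]))
        ))
    end
end

# write the combined table out once
CSV.write(joinpath(directory, "all_regions.csv"), all_data)   # placeholder name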
Almost done. Now I have everything in a single CSV formatted like [decyr, lon, lat, value, centeredmedian, sigma, station, region].
I now take the median of all the centered medians for stations within the same region and at the same epoch, to better visualize the temporal/spatial relationship between regions. The actual value represented here does not matter, only when and where the amplitude peaks. This is how I accomplish that:
function calculate_median_of_medians(data, region, epoch)
    # filter data to match both region and epoch, excluding station LOU5
    matching_data = filter(row -> row.Region == region &&
                                  row.Epoch == epoch &&
                                  row.Station != "LOU5", data)
    # centered-median column values that match the criteria
    medians = matching_data.Median
    # return NaN if no matching data found
    if isempty(medians)
        return NaN
    end
    # sort the medians array
    sorted_medians = sort(medians)
    n = length(sorted_medians)
    println("n: $n")
    # calculate the median
    if isodd(n)
        # odd number of elements: return the middle value
        return round(sorted_medians[div(n + 1, 2)], digits=5)
    else
        # even number of elements: return the average of the two middle values
        return round((sorted_medians[div(n, 2)] + sorted_medians[div(n, 2) + 1]) / 2, digits=5)
    end
end
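To build the table I actually plot from, I then loop over every (region, epoch) pair, roughly like this (region_medians and its column names are placeholders for whatever I end up feeding into the plotting code):

using DataFrames

# one median-of-medians per (region, epoch) pair
region_medians = DataFrame(Region = String[], Epoch = Float64[], MedianOfMedians = Float64[])

for region in unique(all_data.Region), epoch in unique(all_data.Epoch)
    m = calculate_median_of_medians(all_data, region, epoch)
    isnan(m) && continue   # skip combinations with no data
    push!(region_medians, (Region = region, Epoch = epoch, MedianOfMedians = m))
end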
When all is said and done, I create a plot like this:
What I'm ultimately trying to do is make this whole process not take forever, for when I inevitably need to start from scratch again because some variable changed. Right now it takes a few hours. There are also steps prior to the starting point I shared here, but those can't be changed since they're Fortran scripts from my advisor, and I'm not about to mess with those.
Are there any superfluous steps I should think about removing? Can any of these be condensed? I'm not 100% convinced that writing a file output for each step is necessary, but I don't want to bottleneck my computer with too much data in memory either.
Sorry for the lengthy post, and I apologize if this is outside of the scope of this particular domain – I just don’t know where else to post or who to ask for help. I feel like I’m not going to get any better at optimizing my scientific code if I just keep doing the same thing over and over without any feedback.
Thank you in advance for any input/advice! It is always appreciated and I’m grateful for the time anyone takes to provide any feedback.
- Rob