Function works correctly for single file, but not in a loop?

GeodeticR · July 13, 2024, 1:21am

Okay, I am at a loss here.

I have this loop here to skip any data that doesn’t meet a certain threshold. But what I cannot seem to figure out is why this isn’t working in the loop I have, but works fine when I enter it as a single file.

Here is what I have for the function:

# simple function to read in my data, and filter based on length
function gmt_read(file)
    
    epoch_hold = gmtread(file, incols = "0,1,2");

    if length(epoch_hold) > 100
        return epoch_hold
    end

end

And here is the relevant part of the for-loop:

for file in glob("*2sigma.txt", twosigpath)
           epoch_hold = gmt_read(file)
end

Now this is where I am not understanding what’s wrong. I tested this loop with a single file first like this:

# file that contains data
fullfile = gmtread("/home/rob/Documents/julia/twosigma/2005.919_2sigma.txt", incols = "0,1,2")

length(fullfile)
5748


# file that is empty
emptyfile = gmtread("/home/rob/Documents/julia/twosigma/2005.916_2sigma.txt", incols = "0,1,2")

length(emptyfile)
0

If I run the above variable I get the correct feedback:

if length(fullfile) > 100
           return fullfile
       end
BoundingBox: [-121.0, -57.6, 31.8, 47.0, 0.1262, 0.2498]

1916×3 GMTdataset{Float64, 2}
  Row │ col.1  col.2   col.3 
──────┼──────────────────────
    1 │ -86.6   31.8  0.21
    2 │ -86.4   31.8  0.21
    3 │ -87.2   32.0  0.2164
    4 │ -87.0   32.0  0.2164
    5 │ -86.8   32.0  0.2164
    6 │ -87.6   32.2  0.2161
    7 │ -87.4   32.2  0.2145
    8 │ -87.2   32.2  0.2164
    9 │ -87.0   32.2  0.2164
   10 │ -86.8   32.2  0.2164
  ⋮   │   ⋮      ⋮      ⋮
 1908 │ -59.6   46.8  0.1852
 1909 │ -59.4   46.8  0.1852
 1910 │ -59.2   46.8  0.1852
 1911 │ -68.2   47.0  0.1852
 1912 │ -68.0   47.0  0.1852
 1913 │ -67.8   47.0  0.1852
 1914 │ -58.2   47.0  0.1852
 1915 │ -58.0   47.0  0.1852
 1916 │ -57.8   47.0  0.1852
            1897 rows omitted


# reversed comparison
 if length(fullfile) < 100
           return fullfile
       end

julia>

Now for the empty set, which also works correctly in Terminal:

if length(emptyfile) > 100
           return emptyfile
       end

julia> 

# reverse comparison
if length(emptyfile) < 100
           return emptyfile
       end

julia> String[]

To get to the main point: if I run this as the loop, exactly as written above, Julia only returns the empty sets and excludes all the datasets with values. This happens regardless of the comparison operator. Does anyone know what the heck I’m doing wrong? Is there some glaringly obvious error that I’m blind to from staring at this for too long?

for file in glob("*2sigma.txt", twosigpath)
           epoch_hold = gmt_read(file)
end

returns every file that does not contain data…

┌ Warning:      file "/home/rob/Documents/julia/twosigma/2005.965_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/PVGhC/src/gmtreadwrite.jl:174
gmtread [WARNING]: File /home/rob/Documents/julia/twosigma/2005.968_2sigma.txt is empty!
┌ Warning:      file "/home/rob/Documents/julia/twosigma/2005.968_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/PVGhC/src/gmtreadwrite.jl:174
gmtread [WARNING]: File /home/rob/Documents/julia/twosigma/2005.971_2sigma.txt is empty!
┌ Warning:      file "/home/rob/Documents/julia/twosigma/2005.971_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/PVGhC/src/gmtreadwrite.jl:174

I double-checked that the path was correct, and that glob() was pulling the correct file names.

for file in glob("*2sigma.txt", twosigpath)
           println(file)
end


/home/rob/Documents/julia/twosigma/2005.974_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.976_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.979_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.982_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.984_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.987_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.990_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.993_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.995_2sigma.txt
/home/rob/Documents/julia/twosigma/2005.998_2sigma.txt

Those are just the last 10 listed files, but there are a total of 1,826 files.

Anyone have any idea what’s going on here??

Benny · July 13, 2024, 1:57am

Not sure what you mean here because the following excerpt is just printed warnings, not returns of any value. The return value of gmtread, whether an instance of GMTdataset or Nothing, is assigned to the variable epoch_hold, and you don’t provide the surrounding code showing what you do with it.

GeodeticR · July 13, 2024, 2:08am

Perhaps I’m not understanding the loop then, because to my understanding there should be values being printed since I’m using return, same as when I run the function with a single dataset, right?

These warnings are telling me that empty datasets are being returned back, but, I’m not getting anything that contains data.

Further, if I run my entire loop, I get this error:

ERROR: MethodError: no method matching getindex(::Nothing, ::Colon, ::Int64)
Stacktrace:
 [1] top-level scope
   @ ~/Documents/julia/gridimage.jl:22

And there should be no empty sets if I’m excluding any datasets that are less than 100 in length.

If you’re interested in seeing where exactly my error is getting thrown, it’s at this line, because it can’t make a cpt with no data.

C = GMT.makecpt(cmap=:roma, 
        range=(minimum(epoch_hold[:, 3]), maximum(epoch_hold[:, 3]), 0.002), 
        reverse = true
);

Benny · July 13, 2024, 2:19am

return does not print at all; returning and printing are completely independent things. You are only seeing some return values being automatically printed by the REPL because your entered expression is a function call or is a block that can exit early with a return statement. Most return values are never printed, and most printed text, including those warnings, is not a return value.

julia> function foo() return "hello I'm a return value" end
foo (generic function with 1 method)

julia> foo() # REPL prints the value of the overall expression
"hello I'm a return value"

julia> for i in 1:10000 foo() end # overall value is nothing

epoch_hold is temporarily assigned to the return value of gmt_read each iteration, you need to figure out what to do with it. If you want to collect all the data, push it to a collection like a vector. If you want to save memory and process the data to something much smaller, do something else in the iteration.

Some of the gmt_read outputs are nothing, it’s not indexable. You’ll have to skip that step. To be clear, functions implicitly return their last expression’s value if no return statement had run by then. Let’s look at your function:

function gmt_read(file)
    epoch_hold = gmtread(file, incols = "0,1,2");
    if length(epoch_hold) > 100
        return epoch_hold
    end
end

The last expression is that if-block. If the branch runs (enough data), then it’ll hit the return statement, making the function return an explicit value. If not, there’s no explicit else branch so it defaults to a value of nothing. That’s where your nothings are coming from.
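
To make this concrete without needing GMT, here is a minimal sketch. The helper keep_if_long is a hypothetical stand-in for gmt_read, and the input vectors are made up for illustration. It shows the implicit nothing and one way to skip it inside a loop while collecting the results that did pass.

```julia
# Hypothetical stand-in for gmt_read: returns the data only if it is long enough.
function keep_if_long(data)
    if length(data) > 3
        return data
    end
    # No else branch: the if-block evaluates to `nothing`,
    # and that becomes the function's return value.
end

inputs = [[1, 2, 3, 4, 5], Int[], [1, 2], [6, 7, 8, 9]]

kept = Vector{Vector{Int}}()
for data in inputs
    result = keep_if_long(data)
    isnothing(result) && continue   # skip short inputs instead of indexing `nothing`
    push!(kept, result)
end

println(length(kept))  # 2
```

The same pattern should carry over to the real loop: check the result before indexing it, and push the ones you want to keep into a vector that lives outside the loop.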

GeodeticR · July 15, 2024, 11:24pm

Ah, okay. Unfortunately, I never figured out how to remove or skip the files that had less than the amount of data necessary to create a cpt and ended up writing a bash script instead.

Is there a way to accomplish the same thing this bash script does in Julia, or, is this process just better suited for bash scripts?

Here is the script, in case anyone is wondering.

#!/usr/bin/env -S bash -e

DIR0="$HOME/Documents/twosigma_rename/"

for file in "$DIR0"*2sigma.txt; do
  if [[ -f "$file" ]]; then
    wc -l "$file" | awk 'NR < 150 {exit 1}' && rm -f "$file"
  fi
done

As for where I left my Julia script, I never got past my issue of returning empty datasets, but, my last attempt before giving up was this:

function gmt_read(file)
    hold = []    
    gmt_hold = gmtread(file);
    if (size(gmt_hold) >= (150, 3))
        push!(hold, gmt_hold)
        println(size(hold))
        return hold 
    else
        nothing
    end
end

This would always throw the following error: ERROR: MethodError: no method matching getindex(::Nothing, ::Colon, ::Int64)

Guess there is just some fundamental understanding I’m not grasping here. But oh well.

mkitti · July 16, 2024, 8:55am

Let’s just start with generating some sample data.

DIR0="$(ENV["HOME"])/Documents/twosigma_demo"
# Create directory and parent directories
mkpath(DIR0)

# Create alpha_2sigma.txt with two rows
write(
    joinpath(DIR0, "alpha_2sigma.txt"),
    """
    0 0 0
    1 1 1
    """
)

# Create beta_2sigma.txt with no rows
write(
    joinpath(DIR0, "beta_2sigma.txt"),
    """
    """
)

# Create gamma_2sigma.txt with 150 rows and 3 columns
write(
    joinpath(DIR0, "gamma_2sigma.txt"),
    join(join.(eachrow(rand(0:9, 150,3)), " "), "\n") * "\n"
)

# Create delta_2sigma.txt with 160 rows and 4 columns
write(
    joinpath(DIR0, "delta_2sigma.txt"),
    join(join.(eachrow(rand(0:9, 160,4)), " "), "\n") * "\n"
)

Next let’s inspect the results.

julia> DIR0="$(ENV["HOME"])/Documents/twosigma_demo"
"/home/mkitti/Documents/twosigma_demo"

julia> readdir(DIR0)
4-element Vector{String}:
 "alpha_2sigma.txt"
 "beta_2sigma.txt"
 "delta_2sigma.txt"
 "gamma_2sigma.txt"

julia> println(read(joinpath(DIR0, "alpha_2sigma.txt"), String))
0 0 0
1 1 1

julia> println(read(joinpath(DIR0, "beta_2sigma.txt"), String))

julia> println(read(joinpath(DIR0, "gamma_2sigma.txt"), String))
7 4 0
1 8 1
8 4 0
0 1 6
9 5 0
0 0 5
...
# abbreviated

julia> println(read(joinpath(DIR0, "delta_2sigma.txt"), String))
2 4 1 1
4 1 7 2
1 3 2 0
3 6 2 7
2 4 6 4
7 5 5 6
1 3 0 7
5 8 2 8
7 5 1 4
2 1 5 9
...
# abbreviated

Now let’s try some simple for loops.

julia> for file in readdir(DIR0)
           println(file)
       end
alpha_2sigma.txt
beta_2sigma.txt
delta_2sigma.txt
gamma_2sigma.txt

julia> for file in readdir(DIR0)
           println(joinpath(DIR0, file))
       end
/home/mkitti/Documents/twosigma_demo/alpha_2sigma.txt
/home/mkitti/Documents/twosigma_demo/beta_2sigma.txt
/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt
/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           lines = readlines(file)
           println(length(lines))
       end
2
0
160
150

Instead of printing the number of lines in each file, let’s try to collect that into a Vector by pushing.

julia> nrows = Int[]
Int64[]

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           lines = readlines(file)
           push!(nrows, length(lines))
       end

julia> nrows
4-element Vector{Int64}:
   2
   0
 160
 150

There’s a simpler way to do the above via map:

julia> map(readdir(DIR0)) do file
           file = joinpath(DIR0, file)
           lines = readlines(file)
           return length(lines)
       end
4-element Vector{Int64}:
   2
   0
 160
 150

Now, let’s say I wanted to print the name of the file which has 150 or more lines.

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           lines = readlines(file)
           if length(lines) >= 150
               println(file)
           end
       end
/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt
/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt

We could then try to examine the number of columns.

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           lines = readlines(file)
           if length(lines) > 0
               first_line = first(lines)
               columns_in_first_line = split(first_line, " ")
               println(length(columns_in_first_line))
           else
               println("No rows!")
           end
       end
3
No rows!
4
3

Next let’s print out the files that have at least 150 rows and at least 3 columns.

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           lines = readlines(file)
           NR = length(lines)
           if NR >= 150
               first_line = first(lines)
               columns_in_first_line = split(first_line, " ")
               NC = length(columns_in_first_line)
               if NC >= 3
                   println(file, " has ", NR, " rows and ", NC, " columns")
               end
           end
       end
/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt has 160 rows and 4 columns
/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt has 150 rows and 3 columns

Instead of printing this, let’s put it into a Vector.

julia> large_files = String[]
String[]

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           lines = readlines(file)
           NR = length(lines)
           if NR >= 150
               first_line = first(lines)
               columns_in_first_line = split(first_line, " ")
               NC = length(columns_in_first_line)
               if NC >= 3
                   push!(large_files, file)
               end
           end
       end

julia> large_files
2-element Vector{String}:
 "/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt"
 "/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt"

Now let’s make this a function.

julia> function get_large_files(DIR0)
           large_files = String[]
           for file in readdir(DIR0)
               file = joinpath(DIR0, file)
               lines = readlines(file)
               NR = length(lines)
               if NR >= 150
                   first_line = first(lines)
                   columns_in_first_line = split(first_line, " ")
                   NC = length(columns_in_first_line)
                   if NC >= 3
                       push!(large_files, file)
                   end
               end
           end
           return large_files
       end

get_large_files (generic function with 1 method)

julia> get_large_files(DIR0)
2-element Vector{String}:
 "/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt"
 "/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt"

We can also return the number of rows and columns.
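
The definition of get_large_files_nrows_ncols is not shown in this post. A minimal sketch of what it might look like, assuming it follows the same pattern as get_large_files with two extra vectors, is:

```julia
# Assumed definition: the original post does not show this function's body.
# It mirrors get_large_files, but also records the row and column counts.
function get_large_files_nrows_ncols(DIR0)
    large_files = String[]
    nrows = Int[]
    ncols = Int[]
    for file in readdir(DIR0)
        file = joinpath(DIR0, file)
        lines = readlines(file)
        NR = length(lines)
        if NR >= 150
            # Count columns from the first line, split on single spaces
            NC = length(split(first(lines), " "))
            if NC >= 3
                push!(large_files, file)
                push!(nrows, NR)
                push!(ncols, NC)
            end
        end
    end
    return large_files, nrows, ncols
end
```

Returning three parallel vectors keeps each one easy to use on its own, at the cost of having to keep their indices in sync.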

julia> large_files, nrows, ncols = get_large_files_nrows_ncols(DIR0)
(["/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt", "/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt"], [160, 150], [4, 3])

julia> large_files
2-element Vector{String}:
 "/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt"
 "/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt"

julia> nrows
2-element Vector{Int64}:
 160
 150

julia> ncols
2-element Vector{Int64}:
 4
 3

Alternatively, we could return the rows and columns as a tuple.

julia> function get_large_files_and_sizes(DIR0)
           large_files = String[]
           large_file_sizes = Tuple{Int,Int}[]
           for file in readdir(DIR0)
               file = joinpath(DIR0, file)
               lines = readlines(file)
               NR = length(lines)
               if NR >= 150
                   first_line = first(lines)
                   columns_in_first_line = split(first_line, " ")
                   NC = length(columns_in_first_line)
                   if NC >= 3
                       push!(large_files, file)
                       push!(large_file_sizes, (NR, NC))
                   end
               end
           end
           return large_files, large_file_sizes
       end
get_large_files_and_sizes (generic function with 1 method)

julia> large_files, sizes = get_large_files_and_sizes(DIR0)
(["/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt", "/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt"], [(160, 4), (150, 3)])

julia> large_files
2-element Vector{String}:
 "/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt"
 "/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt"

julia> sizes
2-element Vector{Tuple{Int64, Int64}}:
 (160, 4)
 (150, 3)

Before continuing I want to comment on comparing tuples.

The following may be surprising.

julia> (160, 4) >= (150, 5)
true

I think you might want to do the following to explicitly compare each pair of numbers and determine whether each one is greater than or equal to the corresponding number on the right.

julia> (160, 4) .>= (150, 5)
(true, false)

julia> all((160, 4) .>= (150, 5))
false
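
The reason for the surprise is that Julia orders tuples lexicographically: the first pair of elements that differ decides the comparison, and later elements are only consulted on a tie. A short sketch of the difference follows; meets_minimum is a hypothetical helper name, not something from GMT.jl.

```julia
# Tuples are ordered lexicographically: the first differing element decides.
println((160, 4) >= (150, 5))        # true, because 160 > 150 settles it
println((150, 4) >= (150, 5))        # false, the first elements tie so 4 vs 5 decides
println(all((160, 4) .>= (150, 3)))  # true, every element passes its own threshold

# Hypothetical helper: true only when every dimension meets its minimum.
meets_minimum(sz, minimum_size) = all(sz .>= minimum_size)
println(meets_minimum((160, 2), (150, 3)))  # false, too few columns
```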

Using the above, let’s rewrite the function

julia> function get_large_files_and_sizes_2(DIR0)
           large_files = String[]
           large_file_sizes = Tuple{Int,Int}[]
           for file in readdir(DIR0)
               file = joinpath(DIR0, file)
               lines = readlines(file)
               NR = length(lines)
               NC = 0
               if NR > 0
                   first_line = first(lines)
                   columns_in_first_line = split(first_line, " ")
                   NC = length(columns_in_first_line)
               end
               _size = (NR, NC)
               if all(_size .>= (150, 3))
                   push!(large_files, file)
                   push!(large_file_sizes, _size)
               end
           end
           return large_files, large_file_sizes
       end
get_large_files_and_sizes_2 (generic function with 1 method)

julia> large_files, large_file_sizes = get_large_files_and_sizes_2(DIR0)
(["/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt", "/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt"], [(160, 4), (150, 3)])

julia> large_files
2-element Vector{String}:
 "/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt"
 "/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt"

julia> large_file_sizes
2-element Vector{Tuple{Int64, Int64}}:
 (160, 4)
 (150, 3)

Now let’s try this with GMT.jl.

julia> using GMT

julia> gmtread(joinpath(DIR0, "alpha_2sigma.txt"))
BoundingBox: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

2×3 GMTdataset{Float64, 2}
 Row │ col.1  col.2  col.3 
─────┼─────────────────────
   1 │   0.0    0.0    0.0
   2 │   1.0    1.0    1.0

julia> gmtread(joinpath(DIR0, "beta_2sigma.txt"))
gmtread [WARNING]: File /home/mkitti/Documents/twosigma_demo/beta_2sigma.txt is empty!
┌ Warning: 	file "/home/mkitti/Documents/twosigma_demo/beta_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/SI3aF/src/gmtreadwrite.jl:189
String[]

julia> gmtread(joinpath(DIR0, "gamma_2sigma.txt"))
BoundingBox: [0.0, 9.0, 0.0, 9.0, 0.0, 9.0]

150×3 GMTdataset{Float64, 2}
 Row │ col.1  col.2  col.3 
─────┼─────────────────────
   1 │   7.0    4.0    0.0
   2 │   1.0    8.0    1.0
   3 │   8.0    4.0    0.0
   4 │   0.0    1.0    6.0
   5 │   9.0    5.0    0.0

julia> gmtread(joinpath(DIR0, "delta_2sigma.txt"))
BoundingBox: [0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0]

160×4 GMTdataset{Float64, 2}
 Row │ col.1  col.2  col.3  col.4 
─────┼────────────────────────────
   1 │   2.0    4.0    1.0    1.0
   2 │   4.0    1.0    7.0    2.0
   3 │   1.0    3.0    2.0    0.0
   4 │   3.0    6.0    2.0    7.0

Let’s make sure we can examine the sizes as we would expect.

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           gmt_hold = gmtread(file)
           println(size(gmt_hold))
       end
(2, 3)
gmtread [WARNING]: File /home/mkitti/Documents/twosigma_demo/beta_2sigma.txt is empty!
┌ Warning: 	file "/home/mkitti/Documents/twosigma_demo/beta_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/SI3aF/src/gmtreadwrite.jl:189
(0, 0)
(160, 4)
(150, 3)

Let’s write a function to retrieve large datasets.

julia> function get_large_datasets_gmt(DIR0)
           large_datasets = GMTdataset{Float64, 2}[]
           for file in readdir(DIR0)
               file = joinpath(DIR0, file)
               dataset = gmtread(file)
               if all(size(dataset) .>= (150,3))
                   push!(large_datasets, dataset)
               end
           end
           return large_datasets
       end
get_large_datasets_gmt (generic function with 1 method)

julia> large_datasets = get_large_datasets_gmt(DIR0);
gmtread [WARNING]: File /home/mkitti/Documents/twosigma_demo/beta_2sigma.txt is empty!
┌ Warning: 	file "/home/mkitti/Documents/twosigma_demo/beta_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/SI3aF/src/gmtreadwrite.jl:189

julia> length(large_datasets)
2

julia> large_datasets[1]
BoundingBox: [0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0]

160×4 GMTdataset{Float64, 2}
 Row │ col.1  col.2  col.3  col.4 
─────┼────────────────────────────
   1 │   2.0    4.0    1.0    1.0
   2 │   4.0    1.0    7.0    2.0
   3 │   1.0    3.0    2.0    0.0
   4 │   3.0    6.0    2.0    7.0
   5 │   2.0    4.0    6.0    4.0
# abbreviated

julia> large_datasets[2]
BoundingBox: [0.0, 9.0, 0.0, 9.0, 0.0, 9.0]

150×3 GMTdataset{Float64, 2}
 Row │ col.1  col.2  col.3 
─────┼─────────────────────
   1 │   7.0    4.0    0.0
   2 │   1.0    8.0    1.0
   3 │   8.0    4.0    0.0
   4 │   0.0    1.0    6.0
   5 │   9.0    5.0    0.0
   6 │   0.0    0.0    5.0
   7 │   7.0    9.0    3.0

I will also note that there is a much more compact way of writing this.

julia> function get_large_datasets_gmt_2(DIR0)
           filter(gmtread.(joinpath.(DIR0, readdir(DIR0)))) do dataset
                  all(size(dataset) .>= (150,3))
           end
       end
get_large_datasets_gmt_2 (generic function with 1 method)

julia> datasets = get_large_datasets_gmt_2(DIR0);
gmtread [WARNING]: File /home/mkitti/Documents/twosigma_demo/beta_2sigma.txt is empty!
┌ Warning: 	file "/home/mkitti/Documents/twosigma_demo/beta_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/SI3aF/src/gmtreadwrite.jl:189

julia> datasets[1]
BoundingBox: [0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0]

160×4 GMTdataset{Float64, 2}
 Row │ col.1  col.2  col.3  col.4 
─────┼────────────────────────────
   1 │   2.0    4.0    1.0    1.0
   2 │   4.0    1.0    7.0    2.0
   3 │   1.0    3.0    2.0    0.0
   4 │   3.0    6.0    2.0    7.0
   5 │   2.0    4.0    6.0    4.0
# abbreviated

julia> datasets[2]
BoundingBox: [0.0, 9.0, 0.0, 9.0, 0.0, 9.0]

150×3 GMTdataset{Float64, 2}
 Row │ col.1  col.2  col.3 
─────┼─────────────────────
   1 │   7.0    4.0    0.0
   2 │   1.0    8.0    1.0
   3 │   8.0    4.0    0.0
   4 │   0.0    1.0    6.0
   5 │   9.0    5.0    0.0

# abbreviated

As for the error that you are getting, the issue arises if you attempt to do the following.

julia> epoch_hold = nothing

julia> epoch_hold[:,1]
ERROR: MethodError: no method matching getindex(::Nothing, ::Colon, ::Int64)
Stacktrace:
 [1] top-level scope
   @ REPL[201]:1

We could avoid the above error by testing for nothing:

julia> epoch_hold = gmtread(joinpath(DIR0, "alpha_2sigma.txt"))
BoundingBox: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

2×3 GMTdataset{Float64, 2}
 Row │ col.1  col.2  col.3 
─────┼─────────────────────
   1 │   0.0    0.0    0.0
   2 │   1.0    1.0    1.0

julia> if !isnothing(epoch_hold)
           epoch_hold[:,1]
       end
2-element Vector{Float64}:
 0.0
 1.0

julia> epoch_hold = nothing

julia> if !isnothing(epoch_hold)
           epoch_hold[:,1]
       end

Benny pointed out that there is a flaw in your function as written. We can demonstrate the flaw as follows.

julia> function gmt_read(file)
           epoch_hold = gmtread(file, incols = "0,1,2");
           if length(epoch_hold) > 100
               return epoch_hold
           end
       end
gmt_read (generic function with 1 method)

julia> gmt_read(joinpath(DIR0, "gamma_2sigma.txt"))[:,1]
150-element Vector{Float64}:
 7.0
 1.0
 8.0
# abbreviated

julia> gmt_read(joinpath(DIR0, "delta_2sigma.txt"))[:,1]
160-element Vector{Float64}:
 2.0
 4.0
# abbreviated 

julia> gmt_read(joinpath(DIR0, "alpha_2sigma.txt"))[:,1]
ERROR: MethodError: no method matching getindex(::Nothing, ::Colon, ::Int64)
Stacktrace:
 [1] top-level scope
   @ REPL[229]:1

julia> gmt_read(joinpath(DIR0, "beta_2sigma.txt"))[:,1]
gmtread [WARNING]: File /home/mkitti/Documents/twosigma_demo/beta_2sigma.txt is empty!
┌ Warning: 	file "/home/mkitti/Documents/twosigma_demo/beta_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/SI3aF/src/gmtreadwrite.jl:189
ERROR: MethodError: no method matching getindex(::Nothing, ::Colon, ::Int64)
Stacktrace:
 [1] top-level scope
   @ REPL[230]:1

The flaw here is that for small datasets your function returns nothing and then you try to index the result.

julia> alpha = gmt_read(joinpath(DIR0, "alpha_2sigma.txt"))

julia> isnothing(alpha)
true

julia> alpha[:, 1]
ERROR: MethodError: no method matching getindex(::Nothing, ::Colon, ::Int64)
Stacktrace:
 [1] top-level scope
   @ REPL[234]:1

The reason this happens is that your function is equivalent to either of the following function definitions.

julia> function gmt_read(file)
           epoch_hold = gmtread(file, incols = "0,1,2");
           if length(epoch_hold) > 100
               return epoch_hold
           end
           return nothing
       end
gmt_read (generic function with 1 method)

julia> function gmt_read(file)
           epoch_hold = gmtread(file, incols = "0,1,2");
           if length(epoch_hold) > 100
               return epoch_hold
           else
               return nothing
           end
       end
gmt_read (generic function with 1 method)

We can reproduce the error as follows.

julia> function gmt_read(file)
           epoch_hold = gmtread(file, incols = "0,1,2");
           if length(epoch_hold) > 100
               return epoch_hold
           else
               return nothing
           end
       end
gmt_read (generic function with 1 method)

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           gmt_hold = gmt_read(file)
           gmt_hold[:,1]
       end
ERROR: MethodError: no method matching getindex(::Nothing, ::Colon, ::Int64)
Stacktrace:
 [1] top-level scope
   @ ./REPL[243]:4

We can correct it as follows.

julia> function gmt_read(file)
           epoch_hold = gmtread(file, incols = "0,1,2");
           return epoch_hold
       end
gmt_read (generic function with 1 method)

julia> for file in readdir(DIR0)
           file = joinpath(DIR0, file)
           gmt_hold = gmt_read(file)
           if length(gmt_hold) > 100
               println(gmt_hold[:,1])
           end
       end
gmtread [WARNING]: File /home/mkitti/Documents/twosigma_demo/beta_2sigma.txt is empty!
┌ Warning: 	file "/home/mkitti/Documents/twosigma_demo/beta_2sigma.txt" is empty or has no data after the header.
└ @ GMT ~/.julia/packages/GMT/SI3aF/src/gmtreadwrite.jl:189
[2.0, 4.0, 1.0, 3.0, 2.0, 7.0, 1.0, 5.0, 7.0, 2.0, 5.0, 8.0, 6.0, 1.0, 8.0, 5.0, 5.0, 3.0, 9.0, 7.0, 7.0, 6.0, 2.0, 6.0, 4.0, 4.0, 7.0, 8.0, 9.0, 0.0, 8.0, 6.0, 7.0, 6.0, 6.0, 7.0, 4.0, 4.0, 1.0, 9.0, 2.0, 4.0, 0.0, 2.0, 6.0, 2.0, 8.0, 5.0, 8.0, 4.0, 8.0, 0.0, 2.0, 3.0, 1.0, 4.0, 6.0, 1.0, 5.0, 1.0, 0.0, 5.0, 6.0, 4.0, 6.0, 6.0, 4.0, 0.0, 0.0, 0.0, 4.0, 8.0, 5.0, 6.0, 3.0, 4.0, 9.0, 7.0, 5.0, 8.0, 0.0, 7.0, 8.0, 0.0, 2.0, 0.0, 7.0, 5.0, 3.0, 8.0, 5.0, 0.0, 9.0, 2.0, 0.0, 1.0, 8.0, 2.0, 3.0, 9.0, 4.0, 9.0, 9.0, 9.0, 3.0, 5.0, 7.0, 1.0, 6.0, 8.0, 3.0, 9.0, 3.0, 7.0, 7.0, 9.0, 1.0, 9.0, 6.0, 0.0, 0.0, 8.0, 3.0, 0.0, 8.0, 2.0, 2.0, 5.0, 1.0, 1.0, 1.0, 0.0, 7.0, 8.0, 6.0, 9.0, 3.0, 4.0, 1.0, 5.0, 3.0, 3.0, 8.0, 1.0, 1.0, 3.0, 1.0, 5.0, 8.0, 7.0, 9.0, 0.0, 0.0, 3.0, 9.0, 7.0, 2.0, 6.0, 1.0, 5.0]
[7.0, 1.0, 8.0, 0.0, 9.0, 0.0, 7.0, 0.0, 8.0, 7.0, 9.0, 6.0, 9.0, 3.0, 8.0, 3.0, 6.0, 1.0, 8.0, 0.0, 5.0, 9.0, 0.0, 7.0, 5.0, 9.0, 3.0, 7.0, 6.0, 9.0, 5.0, 0.0, 1.0, 4.0, 6.0, 6.0, 4.0, 5.0, 5.0, 0.0, 1.0, 9.0, 3.0, 9.0, 6.0, 8.0, 4.0, 7.0, 3.0, 6.0, 8.0, 4.0, 5.0, 0.0, 5.0, 8.0, 5.0, 9.0, 0.0, 7.0, 1.0, 5.0, 6.0, 4.0, 6.0, 7.0, 2.0, 2.0, 2.0, 1.0, 7.0, 9.0, 7.0, 7.0, 9.0, 5.0, 7.0, 0.0, 9.0, 3.0, 4.0, 3.0, 8.0, 6.0, 9.0, 6.0, 4.0, 3.0, 4.0, 7.0, 4.0, 3.0, 5.0, 7.0, 4.0, 4.0, 4.0, 6.0, 9.0, 3.0, 0.0, 2.0, 6.0, 4.0, 0.0, 4.0, 0.0, 6.0, 1.0, 5.0, 4.0, 8.0, 5.0, 9.0, 8.0, 1.0, 4.0, 0.0, 6.0, 9.0, 1.0, 6.0, 8.0, 4.0, 1.0, 0.0, 6.0, 2.0, 6.0, 4.0, 6.0, 4.0, 2.0, 9.0, 5.0, 1.0, 5.0, 0.0, 1.0, 1.0, 9.0, 8.0, 1.0, 9.0, 2.0, 7.0, 8.0, 0.0, 5.0, 0.0]
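
One detail worth keeping in mind with length(gmt_hold) > 100: judging by the 1916×3 dataset earlier in the thread, which reported a length of 5748, length counts every element rather than every row. The sketch below uses a plain matrix as a stand-in, on the assumption that a GMTdataset behaves like a matrix for size and length. The helper has_enough_rows is hypothetical.

```julia
# A plain matrix standing in for a 1916×3 dataset (the values are placeholders).
data = zeros(1916, 3)

println(length(data))    # 5748: rows × columns
println(size(data, 1))   # 1916: rows only

# Hypothetical helper: threshold on rows rather than on total elements.
has_enough_rows(x, min_rows) = size(x, 1) >= min_rows

small = zeros(40, 3)                  # 120 elements but only 40 rows
println(length(small) > 100)          # true
println(has_enough_rows(small, 100))  # false
```

If the threshold is meant to be a number of rows, comparing size(x, 1) is probably closer to the intent than comparing length(x).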

Benny · July 16, 2024, 9:23am

I don’t know bash, but I’m getting deja vu from another thread, where a major takeaway was that reading a particular file format into an in-memory data structure (e.g. with gmtread) is usually much more expensive than bash commands doing simpler, generic text processing. I’d hazard a guess that you’re only reading 150 lines at most in bash before deciding whether to remove a file, whereas gmtread must read, parse, and validate the entire file regardless of whether you keep it.
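
For what it’s worth, the cheap version of that check can be sketched in plain Julia too. The helper has_at_least_lines below is a hypothetical name; it stops reading as soon as it has seen enough lines, so a long file is never read or parsed in full.

```julia
# Hypothetical helper: true once `n` lines have been seen, without reading the rest.
function has_at_least_lines(path, n)
    n <= 0 && return true
    open(path) do io
        count = 0
        for _ in eachline(io)
            count += 1
            count >= n && return true   # early exit from the do-block
        end
        return false
    end
end

# Small demonstration on a temporary file with 200 lines
path, io = mktemp()
write(io, join(1:200, "\n") * "\n")
close(io)

println(has_at_least_lines(path, 150))  # true
println(has_at_least_lines(path, 500))  # false
```

Only the files that pass a check like this would then need to go through gmtread.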

mkitti · July 16, 2024, 11:28am

I’m confused by your bash script because you are piping the output of wc -l to awk rather than the file itself. You then compare NR, the number of records read so far, to 150. Since wc -l always prints a single line, NR is always 1, so the script would treat every file I generated above the same way: awk would exit with status 1 for all of them, and no files would be removed.

I’ll propose a version of your bash script that works for me.

DIR0="$HOME/Documents/twosigma_demo"

for file in "$DIR0"/*2sigma.txt; do
  if [[ -f "$file" ]]; then
      wc -l "$file" | awk '$1 < 150 { exit 1}' && echo "$file" 
  fi
done

The results are as follows.

$ bash demo.sh
/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt
/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt

Here’s a literal translation into Julia:

DIR0="$(ENV["HOME"])/Documents/twosigma_demo"

for file in filter(endswith("2sigma.txt"), joinpath.(DIR0, readdir(DIR0)))
    if isfile(file)
        countlines(file) < 150 || println(file)
    end
end

The results are as follows.

$ julia blah.jl
/home/mkitti/Documents/twosigma_demo/delta_2sigma.txt
/home/mkitti/Documents/twosigma_demo/gamma_2sigma.txt
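
If the goal is to actually delete the small files rather than print the large ones, a sketch along these lines might work. It assumes the intent is to remove files with fewer than 150 lines. The function name prune_small_files and its keyword arguments are hypothetical, and it defaults to a dry run so nothing is deleted until you ask for it.

```julia
# Hypothetical helper: list (and optionally delete) files with fewer than `min_lines` lines.
function prune_small_files(dir; suffix = "2sigma.txt", min_lines = 150, dry_run = true)
    removed = String[]
    for file in filter(endswith(suffix), readdir(dir; join = true))
        isfile(file) || continue
        if countlines(file) < min_lines
            dry_run || rm(file)      # only delete when dry_run is false
            push!(removed, file)
        end
    end
    return removed
end
```

Calling prune_small_files(DIR0) returns the paths that would be removed; passing dry_run = false performs the deletion.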

GeodeticR · July 19, 2024, 4:57pm

@mkitti This is fantastic. Thank you for taking time out of your day to write such a detailed comment. I’ll make sure to go through it line by line to deepen my understanding. Really, this is awesome and I appreciate this greatly.

Rob
