Custom module to import, manipulate and export dataframes issue

Hi, thank you for your help!

I built a module to subset a broader dataframe into smaller dataframes: MWE below.

module df_prep
using Pkg
Pkg.add("DataFrames")
using DataFrames
function df_preps(df1::DataFrame)
    exported_df = df1[findall(in(["b"]),df1.a),:]
end
export exported_df, df_preps
end

I then include the module and load the dataframe to pass to it, but I get the error “AbstractDataFrame is not iterable. Use eachrow(df) to get a row iterator or eachcol(df) to get a column iterator”. MWE below.

using Pkg

Pkg.add("DataFrames")
Pkg.add("CSV")
include("dfprep.jl")

using DataFrames, CSV, .df_prep

df1 = CSV.read("df1.csv", DataFrame)
2×2 DataFrame
│ Row │ a   │ b   │
│     │ str │ str │
├─────┼─────┼─────┤
│ 1   │ b   │ a   │
│ 2   │ a   │ c   │

exporteddf = df_preps(df1)

I know what the error says to use, but I’m not sure what to change. Regardless, thank you for your help, and let me know if I need to clarify the question.

I can’t reproduce this - are you sure you didn’t change your module and were still running an old version of the code when you got the error?

shell> cat "Documents/Julia/dfprep.jl"
module df_prep
using Pkg
Pkg.add("DataFrames")
using DataFrames
function df_preps(df1::DataFrame)
    exported_df = df1[findall(in(["b"]),df1.a),:]
end
export exported_df, df_preps
end

julia> include("Documents/Julia/dfprep.jl")
    Updating registry at `~/.julia/registries/General`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.6/Project.toml`
  No Changes to `~/.julia/environments/v1.6/Manifest.toml`
Main.df_prep

julia> using DataFrames, .df_prep

julia> df = DataFrame(a = ["b", "a"], b = ["a", "c"])
2×2 DataFrame
 Row │ a       b      
     │ String  String 
─────┼────────────────
   1 │ b       a
   2 │ a       c

julia> df_preps(df)
1×2 DataFrame
 Row │ a       b      
     │ String  String 
─────┼────────────────
   1 │ b       a

A few additional comments:

  • I’m assuming your actual df_preps function is more complicated, but I’d say in general it is uncommon to have a separate module for a single function. Are you coming from Matlab where all functions live in a separate file? You could have just defined that function inline.
  • If you do stick to a module, there’s no need to export exported_df: it’s just a local variable inside the df_preps function, so exporting it won’t work (calling exported_df after including your file just gives an UndefVarError).
  • It’s also unusual to have modules do package management like you do above with adding DataFrames. If you think df_preps should have its own dependencies, you should probably turn it into its own package with a Project.toml file. Otherwise you can just do using DataFrames in Main and remove all package operations from your module.
  • On your function itself, df1[findall(in(["b"]),df1.a),:] seems an awfully complicated way to express df1[df1.a .== "b", :], or alternatively using DataFrames functions filter(:a => (==)("b"), df1). You also don’t have to assign this to a variable exported_df given that you never use that name anywhere else. Most style guides for Julia recommend an explicit return at the end of a function. Finally, there’s no need to type annotate your function like df_preps(df1::DataFrame), unless you want to define other methods df_preps(df1::SomeOtherType). Julia will always specialize on the concrete type of df1 passed to the function, so there’s no performance benefit to the type annotation.
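To check that these three spellings really select the same rows, here is a small sketch on the toy data from above (the names by_findall, by_mask and by_filter are just illustrative, not anything from your code):

```julia
using DataFrames

# Toy data matching the example in this thread
df = DataFrame(a = ["b", "a"], b = ["a", "c"])

# Original approach: find the matching row indices, then index with them
by_findall = df[findall(in(["b"]), df.a), :]

# Boolean mask: broadcast the comparison over column a
by_mask = df[df.a .== "b", :]

# filter with a column => predicate pair
by_filter = filter(:a => ==("b"), df)

# All three return a new DataFrame holding the same single row
@assert by_findall == by_mask == by_filter
```

Which one you pick is mostly a matter of taste; the mask version is probably the easiest to read for a simple equality test.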

To summarize, my module would probably look like this:

module df_prep

df_preps(df1) = df1[df1.a .== "b",:]

export df_preps

end

although of course in this case I would have just used this one line directly in my main script rather than writing any functions or modules…
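On the export point, here is a minimal sketch without DataFrames, assuming a made-up module Demo with a hypothetical function keep_b and local variable kept. It only illustrates that the returned value is what reaches the caller, while a name that is local to the function never becomes a binding in the module:

```julia
module Demo

# `kept` exists only while keep_b is running; it is local to the function
function keep_b(xs)
    kept = filter(==("b"), xs)
    return kept
end

# Exporting keep_b makes the function available after `using .Demo`.
# Listing `kept` here has no useful effect: no global binding with that name exists.
export keep_b, kept

end

using .Demo

# The caller gets the data through the return value and picks its own name for it
result = keep_b(["b", "a", "b"])
@assert result == ["b", "b"]

# The module has no global binding called `kept`
@assert !isdefined(Demo, :kept)
```

So returning the value is enough; export only controls which names from the module are visible without a Demo. prefix.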


Cheers for your detailed note!

I tested my MWE and it worked, so I copied and pasted my actual functions into it and that worked too… In response to your comments:

  1. I’m new to coding in general.
  2. I thought you had to return and export the df to access it in another script, but it sounds like you’re saying you just return it?
  3. Okay, so it seems I can just reference the package in the main script for the function to recognize the datatype in the df_prep module?
  4. Thank you for the concise coding recommendation! It wasn’t running without the type annotation, but that was probably a different error that I misinterpreted.

Summary:
So, I have zero understanding of best practices for structuring code: do you have any references for a complete newbie to read?
I currently have 10 csv files that I subset into 36 DataFrames, and I create quite a few “calculated columns” based on prior inputs or to fill in missing data. My df_preps file contains 7 separate functions. What would standard practice suggest for structuring this? I really stripped it down for the MWE.

Cheers for your advice, super helpful!

John.