[ANN] DataConvenience.jl - convenience functions I find useful

It’s here: GitHub - xiaodaigh/DataConvenience.jl: Convenience functions missing in Julia

@replicate n expr

will run expr n times. For example:

using Random
@replicate 1_000_000 randstring(10)
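
For anyone curious how a macro like that works, it can be as simple as expanding to a comprehension. A minimal sketch of the idea (not necessarily the package’s exact implementation):

macro replicate(n, expr)
    # expand to a comprehension that evaluates expr n times and collects the results
    :([$(esc(expr)) for _ in 1:$(esc(n))])
end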

My other favourite function is

cleannames!

which cleans up a DataFrame’s column names by removing characters other than _0-9a-zA-Z. It’s almost like R’s janitor::clean_names.
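
For anyone who wants the idea without the package, a hand-rolled sketch might look like the following (I’m assuming runs of disallowed characters collapse to a single underscore, which may differ from cleannames! in detail):

using DataFrames

# hypothetical hand-rolled version: collapse each run of disallowed
# characters in every column name into a single underscore
cleannames_manual!(df) = rename!(df, [n => replace(n, r"[^0-9a-zA-Z_]+" => "_") for n in names(df)])

df = DataFrame("Total Enroll (2019)" => [1, 2])
cleannames_manual!(df)
names(df)  # ["Total_Enroll_2019_"]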


Thank you for sharing. That cleannames! function is nice. I have a method for renaming columns, but it’s not quite as automated.
Whenever I find myself doing the same thing multiple times, I try to generalize it and put it into a function. Here’s one I’m working through now to clean up web URLs (there may be a parsing package for this, but I didn’t see the functionality in parseHTML or others). Need to learn regexes now!
I have found @Tamas_Papp’s BrowseTables.jl package helpful for this type of iterative cleansing.
Here is an example with a database of US post-secondary schools; it’s a work in progress but getting close:

using DataFrames
using BrowseTables
# df is the full schools DataFrame, loaded earlier
df2 = filter(x -> x.TOT_ENROLL > 5000, df) # brings list from 7K down to 1K schools
# df2: 1,070 rows × 43 columns (omitted printing of 39 columns)
write_html_table("index.html", df2; title="Schools", caption="Schools US over 5K Enrollment")

Then I pull out the df column being cleaned up:

urls = df2[:, :WEBSITE]
1070-element Array{String,1}:
 "www.uwec.edu"                    
 "www.tacomacc.edu"                
 "https://jtcc.edu"                
 "www.losmedanos.edu/default.aspx" 
 ⋮                                 
 "www.lsus.edu/"  
 "https://www.kapiolani.hawaii.edu"
 "suny.buffalostate.edu/"          
 "WWW.SUNYBROOME.EDU" 

Clean it up, convert it to a DataFrame, give it a column name, then join it back with the filtered DataFrame to add the clean column:

arr3 = String[]
for url in urls
    s, low = url, lowercase(url)
    if startswith(low, "https://")
        s, low = s[9:end], low[9:end]   # drop the scheme
    end
    if startswith(low, "www.")
        s = s[5:end]                    # drop "www."
    elseif startswith(low, "www2.") || startswith(low, "www3.") || startswith(low, "www4.")
        s = s[6:end]                    # drop "www2." / "www3." / "www4."
    end
    push!(arr3, s)
end
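
Since all of those prefixes are anchored at the start of the string, the whole loop can also collapse into two regex substitutions (a first sketch at those regexes I need to learn; the i flag makes matching case-insensitive, so the WWW/www variants are covered together):

# strip an optional "https://" scheme, then an optional www/www2/www3/www4 prefix
arr3 = [replace(replace(url, r"^https://"i => ""), r"^www[2-4]?\."i => "") for url in urls]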
# Next, go back and clean the URLs containing a forward slash
df_col = DataFrame(WEBSITE_CLEAN = arr3)
df4 = hcat(df2, df_col)
write_html_table("index.html", df4; title="Schools", caption="Schools USA over 5K Enrollment")

Finally, I need to add another routine to re-process the output array above, find any URL with “/” in it, and trim that off to end up with a clean set of domain names. I could also force everything to lowercase, do the work inside a function instead of global scope, and a few other things, but these hacky methods move things along nicely in Julia, especially compared to the old way (text manipulation in Excel :woozy_face:).
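
That last step can be a one-liner too (a sketch that also does the lowercasing mentioned above): keep only the part before the first slash.

# e.g. "losmedanos.edu/default.aspx" becomes "losmedanos.edu",
# and "SUNYBROOME.EDU" becomes "sunybroome.edu"
domains = [lowercase(first(split(u, '/'))) for u in arr3]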
