[ANN] DataConvenience.jl - convenience functions I find useful

It’s here: GitHub - xiaodaigh/DataConvenience.jl: Convenience functions missing in Julia

@replicate n expr

will run expr n times. For example:

using Random
@replicate 1_000_000 randstring(10)
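
For anyone curious how a macro like that works, it can be as simple as expanding to a comprehension. A minimal sketch of the idea (not necessarily the package’s exact implementation):

macro replicate(n, expr)
    # expand to a comprehension that evaluates expr n times and collects the results
    :([$(esc(expr)) for _ in 1:$(esc(n))])
end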

My other favourite function is

cleannames!

which cleans up a DataFrame’s column names by removing characters other than _0-9a-zA-Z. It’s almost like R’s janitor::clean_names.
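
For anyone who wants the idea without the package, a hand-rolled sketch might look like the following (I’m assuming runs of disallowed characters collapse to a single underscore, which may differ from cleannames! in detail):

using DataFrames

# hypothetical hand-rolled version: collapse each run of disallowed
# characters in every column name into a single underscore
cleannames_manual!(df) = rename!(df, [n => replace(n, r"[^0-9a-zA-Z_]+" => "_") for n in names(df)])

df = DataFrame("Total Enroll (2019)" => [1, 2])
cleannames_manual!(df)
names(df)  # ["Total_Enroll_2019_"]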


Thank you for sharing. That cleannames! function is nice. I have a method for renaming columns, but it’s not quite as automated.
Whenever I find myself doing the same thing multiple times, I try to generalize it and put it into a function. Here’s one I’m working through now to clean up web URLs (there may be a parsing package for this, but I didn’t see the functionality in parseHTML or others). Need to learn regexes now!
I have found @Tamas_Papp’s BrowseTables.jl package helpful for this type of iterative cleansing.
Here is an example with a database of US post-secondary schools; it’s a work in progress but getting close:

using DataFrames
using BrowseTables
# df is the full schools DataFrame, loaded earlier
df2 = filter(x -> x.TOT_ENROLL > 5000, df) # brings list from 7K down to 1K schools
# df2: 1,070 rows × 43 columns (omitted printing of 39 columns)
write_html_table("index.html", df2; title="Schools", caption="Schools US over 5K Enrollment")

Then I pull out the df column being cleaned up:

urls = df2[:, :WEBSITE]
1070-element Array{String,1}:
 "www.uwec.edu"                    
 "www.tacomacc.edu"                
 "https://jtcc.edu"                
 "www.losmedanos.edu/default.aspx" 
 ⋮                                 
 "www.lsus.edu/"  
 "https://www.kapiolani.hawaii.edu"
 "suny.buffalostate.edu/"          
 "WWW.SUNYBROOME.EDU" 

Clean it up, convert it to a DataFrame, give it a column name, then join it back with the filtered DataFrame to add the clean column:

arr3 = String[]
for url in urls
    s, low = url, lowercase(url)
    if startswith(low, "https://")
        s, low = s[9:end], low[9:end]   # drop the scheme
    end
    if startswith(low, "www.")
        s = s[5:end]                    # drop "www."
    elseif startswith(low, "www2.") || startswith(low, "www3.") || startswith(low, "www4.")
        s = s[6:end]                    # drop "www2." / "www3." / "www4."
    end
    push!(arr3, s)
end
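
Since all of those prefixes are anchored at the start of the string, the whole loop can also collapse into two regex substitutions (a first sketch at those regexes I need to learn; the i flag makes matching case-insensitive, so the WWW/www variants are covered together):

# strip an optional "https://" scheme, then an optional www/www2/www3/www4 prefix
arr3 = [replace(replace(url, r"^https://"i => ""), r"^www[2-4]?\."i => "") for url in urls]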
# Next, go back and clean the URLs containing a forward slash
df_col = DataFrame(WEBSITE_CLEAN = arr3)
df4 = hcat(df2, df_col)
write_html_table("index.html", df4; title="Schools", caption="Schools USA over 5K Enrollment")

Finally, I need to add another routine to re-process the output array above, find any URL with “/” in it, and trim that off to end up with a clean set of domain names. I could also force everything to lowercase, do the work inside a function instead of global scope, and a few other things, but these hacky methods move things along nicely in Julia, especially compared to the old way (text manipulation in Excel :woozy_face:).
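
That last step can be a one-liner too (a sketch that also does the lowercasing mentioned above): keep only the part before the first slash.

# e.g. "losmedanos.edu/default.aspx" becomes "losmedanos.edu",
# and "SUNYBROOME.EDU" becomes "sunybroome.edu"
domains = [lowercase(first(split(u, '/'))) for u in arr3]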
