How to do a reduceByKey in Julia

DWSchulze · February 1, 2019, 10:32pm

I have an Array{String,1} and I need to get a count of each String in the Array. In Spark this is done with reduceByKey. What’s the Julia equivalent?

I’ve seen approaches that use DataFrames, but I haven’t managed to convert the Array{String,1} into a DataFrame, if that is even the right approach.

So really two questions: How do I do reduceByKey in Julia and can I convert Array{String,1} to a DataFrame?

Daniel_Berge · February 1, 2019, 10:44pm

The StatsBase package has what you are probably looking for.

using StatsBase

a=rand('a':'f',50)

countmap(a)
Dict{Char,Int64} with 6 entries:
  'f' => 11
  'd' => 8
  'c' => 9
  'e' => 10
  'a' => 6
  'b' => 6
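
countmap covers the counting case. For the more general reduceByKey pattern (combining all values that share a key with an arbitrary binary function), a minimal Base-only sketch might look like the following. The name reduce_by_key is hypothetical, not part of Julia or StatsBase, and the Dict{Any,Any} accumulator trades type stability for brevity.

```julia
# Hypothetical helper (not a Base or StatsBase function): fold the values of
# all key => value pairs that share a key using the binary function `op`.
function reduce_by_key(op, kvs)
    acc = Dict{Any,Any}()          # untyped for brevity; assumption of this sketch
    for (k, v) in kvs
        # First occurrence stores the value; later ones are combined with op.
        acc[k] = haskey(acc, k) ? op(acc[k], v) : v
    end
    return acc
end

# Counting strings, Spark style: map each word to (word, 1), then reduce with +.
words = ["a", "b", "a", "c", "b", "a"]   # made-up sample data
counts = reduce_by_key(+, (w => 1 for w in words))

# Any other associative operation works the same way, e.g. the maximum per key.
maxima = reduce_by_key(max, ["x" => 3, "y" => 1, "x" => 7])
```

With the sample data above, counts["a"] is 3 and maxima["x"] is 7.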

DWSchulze · February 2, 2019, 12:02am

That worked, thanks.

How would I plot a dictionary with keys on the x-axis and values on the y-axis?

asinghvi17 · February 2, 2019, 2:38am

Using Plots.jl:

using StatsBase,       # for the countmap function
      Plots            # to plot
gr()                   # setup plotting backend, GR is default.  
# There is also PyPlot, PlotlyJS, and PGFPlots.

a=rand('a':'f',50)     # create a random distribution of letters a through f to serve as the dataset

dc = countmap(a)       # count the occurrences of each letter; returns a Dict

bar(
  string.(collect(keys(dc))),   # the x-axis - the keys of the dictionary collected into an Array converted to Strings
  collect(values(dc)), # the y-axis - frequency.
  xlabel = "Letter",   # the label on the x-axis
  ylabel = "Frequency",# the label on the y-axis
  title = "Frequency per letter"
)
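
One thing to keep in mind: a Dict has no guaranteed iteration order, so the bars may not come out alphabetically. A possible way around that, sketched here with a hand-written Dict standing in for the countmap result (the values are made up), is to sort the keys first and look up the counts in that order:

```julia
# Stand-in for the result of countmap(a); these values are assumptions.
dc = Dict('c' => 9, 'a' => 6, 'b' => 6)

ks = sort(collect(keys(dc)))   # keys in alphabetical order
xs = string.(ks)               # x-axis labels as Strings
ys = [dc[k] for k in ks]       # counts in the same order as the labels
```

xs and ys can then be passed to bar in place of the collect(keys(dc)) and collect(values(dc)) arguments.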

bennedich · February 2, 2019, 7:42am

I discussed various ways to do it here:

How to count all unique character frequency in a string?

countmap is an excellent choice, if you don’t mind the dependency on StatsBase.

a5vzener · February 4, 2019, 6:11pm

I like Gadfly for plotting. It’s a little easier with DataFrames, but arrays work as well.

using Gadfly
using StatsBase
a=rand('a':'f',100)
m=countmap(a)
d=hcat(collect(keys(m)),collect(values(m))) # create an n×2 array (one row per distinct character): characters in column 1, counts in column 2
d=sortslices(d,dims=1,by=x->x[1]) # sort the whole array by character in column 1
plot(d,x=convert.(Int,d[:,1]),y=d[:,2],Scale.x_discrete(labels=x->Char(x)),Geom.bar(position=:dodge), Guide.xlabel("Character"),Guide.ylabel("Count"),Theme(bar_spacing=1cm))

davidanthoff · February 6, 2019, 1:40am

Here is the Queryverse way of doing it:

using Queryverse   # loads Query.jl (@groupby, @map) and VegaLite.jl (@vlplot)

rand('a':'f',100) |>
  @groupby(_) |>
  @map({char=key(_), count=length(_)}) |>
  @vlplot(:bar, x="count", y="char:n")

It is not entirely clear, though, whether this starts with the kind of data structure you were asking about: here we start with a Vector{Char}, not the Vector{String} from your question.
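
For what it’s worth, counting works the same way when the input is a Vector{String}. A small Base-only sketch, using made-up sample words, that counts the strings and then ranks them by frequency (ties broken alphabetically):

```julia
words = ["pear", "apple", "fig", "apple", "pear", "apple"]   # made-up sample data

# Count each string with a plain Dict; get supplies 0 for unseen keys.
counts = Dict{String,Int}()
for w in words
    counts[w] = get(counts, w, 0) + 1
end

# Turn the Dict into a Vector of word => count pairs, most frequent first.
ranked = sort(collect(counts), by = p -> (-p.second, p.first))
```

Here ranked comes out as ["apple" => 3, "pear" => 2, "fig" => 1].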
