How to do a reduceByKey in Julia


#1

I have an Array{String,1} and I need to get a count of each String in the Array. In Spark this is done with reduceByKey. What’s the Julia equivalent?

I’ve seen approaches using DataFrames, but I can’t convert the Array{String,1} into a DataFrame if that is the right approach.

So really two questions: How do I do reduceByKey in Julia and can I convert Array{String,1} to a DataFrame?


#2

The StatsBase package has what you are probably looking for.

using StatsBase

a=rand('a':'f',50)

countmap(a)
Dict{Char,Int64} with 6 entries:
  'f' => 11
  'd' => 8
  'c' => 9
  'e' => 10
  'a' => 6
  'b' => 6
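
countmap covers the counting case. For a general reduceByKey with an arbitrary reducing function, here is a minimal sketch in plain Julia with no packages. The name reduce_by_key is a hypothetical helper, not something from Base or StatsBase, and the word list is made-up sample data:

```julia
# Hypothetical helper (not part of Base or StatsBase): combine the values of
# all key => value pairs that share a key, using the binary function `op`.
function reduce_by_key(op, kvs)
    acc = Dict{Any,Any}()
    for (k, v) in kvs
        # First time a key is seen, store its value; afterwards fold with op.
        acc[k] = haskey(acc, k) ? op(acc[k], v) : v
    end
    return acc
end

# Made-up sample data of the kind asked about: a Vector{String}.
words = ["apple", "pear", "apple", "fig", "pear", "apple"]

# Spark-style word count: map each word to (word => 1), then reduce with +.
counts = reduce_by_key(+, (w => 1 for w in words))
```

Passing a different function, for example max or min, reduces the values per key in the same way.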

#3

That worked, thanks.

How would I plot a dictionary with keys on the x-axis and values on the y-axis?


#4

Using Plots.jl:

using StatsBase,       # for the countmap function
      Plots            # to plot
gr()                   # set up the plotting backend; GR is the default.
# There is also PyPlot, PlotlyJS, and PGFPlots.

a=rand('a':'f',50)     # create a random distribution of letters a through f to serve as the dataset

dc = countmap(a)       # count occurrences; returns a Dict of letter => count

bar(
  string.(collect(keys(dc))),   # the x-axis - the keys of the dictionary collected into an Array converted to Strings
  collect(values(dc)), # the y-axis - frequency.
  xlabel = "Letter",   # the label on the x-axis
  ylabel = "Frequency",# the label on the y-axis
  title = "Frequency per letter"
) 
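
A Dict does not guarantee any particular iteration order, so the bars may come out in an arbitrary order. If alphabetical order matters, one option is to sort the keys first and look up the counts in that order. A small sketch, using a hand-written Dict in place of the countmap result:

```julia
# Example counts, standing in for the result of countmap(a).
dc = Dict('c' => 9, 'a' => 6, 'b' => 6)

ks = sort(collect(keys(dc)))   # keys in alphabetical order
vs = [dc[k] for k in ks]       # counts in the same order as ks
labels = string.(ks)           # convert the Chars to Strings for the x-axis
```

labels and vs can then be passed to bar in place of the two collect calls.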

#5

I discussed various ways to do it here:

How to count all unique character frequency in a string?

countmap is an excellent choice, if you don’t mind the dependency on StatsBase.


#6

I like Gadfly for plotting. It’s a little easier with DataFrames, but arrays work as well.

using Gadfly
using StatsBase
a=rand('a':'f',100)
m=countmap(a)
d=hcat(collect(keys(m)),collect(values(m))) # create an n×2 matrix, one row per distinct character: column 1 holds the characters, column 2 their counts
d=sortslices(d,dims=1,by=x->x[1]) # sort the rows by the character in column 1
plot(d,x=convert.(Int,d[:,1]),y=d[:,2],Scale.x_discrete(labels=x->Char(x)),Geom.bar(position=:dodge), Guide.xlabel("Character"),Guide.ylabel("Count"),Theme(bar_spacing=1cm))


#7

Here is the Queryverse way of doing it:

using Queryverse

rand('a':'f',100) |>
  @groupby(_) |>
  @map({char=key(_), count=length(_)}) |>
  @vlplot(:bar, x="count", y="char:n")

It is not entirely clear, though, whether this starts with the kind of data structure you were asking for: here we start with a Vector{Char}, not the Vector{String} you asked about.
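
On the DataFrame part of the original question: a Vector{String} can be wrapped in a one-column DataFrame and counted with groupby and combine. This is a minimal sketch assuming DataFrames.jl is installed; the column names word and count and the sample data are made up for illustration:

```julia
using DataFrames  # assumes DataFrames.jl is installed

# Made-up sample data: a Vector{String}.
words = ["apple", "pear", "apple", "fig", "pear", "apple"]

df = DataFrame(word = words)                          # one column named :word
counts = combine(groupby(df, :word), nrow => :count)  # one row per distinct word
```

countmap from StatsBase should accept the same Vector{String} directly, so the DataFrame route is mainly useful when the data is going into a DataFrame anyway.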