Build dataframe from Dict

UPDATE
You can jump here to make a long story short.

Question start point
I'm reading a set of PDF files and, based on their content, adding tags to a dict that maps each file name to its tags. Some files will have zero tags and some will have many. I want to convert the resulting dict into a DataFrame.

dict = Dict{String, Array{String}}()   #(name, tags)
println(dict)

Dict{String,Array{String,N} where N}("عصام محمد فايز الشهري - Mechanical engineering.pdf" => [],"إبراهيم وليد أحمد العريفج - Mechanical engineering.pdf" => ["Erp"],"محمد علي محمد المحمد علي - Mechanical engineering.pdf" => [],"Nawaf Al Yousef.pdf" => [],"محمد ناصر حسني الحارثي - Mechanical engineering.pdf" => [],"مهدي عبدالله مهدي آل شهاب - Mechanical engineering.pdf" => [],"يوسف إبراهيم محمد ال قريشه - Supply chain management.pdf" => ["Erp", "Supply Chain"],"Noufal Al Zaher.pdf" => [],"Ziyad Al Essa.pdf" => ["Erp", "Shipping Documents", "Supply Chain"],"Majed Al Zahrani - General Secondary.pdf" => [])

And trying to convert it to DataFrame as:

using DataFrames
df = DataFrame(dict)

But I got this error:

DimensionMismatch("column length 0 for column(s) Majed Al Zahrani - General Secondary.pdf, Nawaf Al Yousef.pdf, Noufal Al Zaher.pdf, عصام محمد فايز الشهري - Mechanical engineering.pdf, محمد علي محمد المحمد علي - Mechanical engineering.pdf, محمد ناصر حسني الحارثي - Mechanical engineering.pdf and مهدي عبدالله مهدي آل شهاب - Mechanical engineering.pdf is incompatible with column length 3 for column(s) Ziyad Al Essa.pdf is incompatible with column length 1 for column(s) إبراهيم وليد أحمد العريفج - Mechanical engineering.pdf, and is incompatible with column length 2 for column(s) يوسف إبراهيم محمد ال قريشه - Supply chain management.pdf")

Stacktrace:
 [1] (::getfield(DataFrames, Symbol("##DataFrame#91#94")))(::Bool, ::Type{DataFrame}, ::Array{Any,1}, ::DataFrames.Index) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:121
 [2] Type at .\array.jl:0 [inlined]
 [3] #DataFrame#101(::Bool, ::Type{DataFrame}, ::Dict{String,Array{String,N} where N}) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:155
 [4] DataFrame(::Dict{String,Array{String,N} where N}) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:147
 [5] top-level scope at In[34]:1
  • Is the way I used to convert Dict to DataFrame correct?
  • Is there a way to iterate over the Dict so that I can remove the entries whose tag array is empty?

UPDATE

I tried the below:

shortlisted = filter((k, v) -> v > [], dict)

But got the below:

┌ Warning: In `filter(f, dict)`, `f` is now passed a single pair instead of two arguments.
│   caller = top-level scope at In[65]:1
└ @ Core In[65]:1
Dict{String,Array{String,N} where N} with 7 entries:
  "إبراهيم وليد أحمد العري… => ["Bachelor", "English", "Erp"]
  "محمد علي محمد المحمد عل… => ["Bachelor", "English", "Follow Up"]
  "محمد ناصر حسني الحارثي … => ["Ba", "Bas", "Excel"]
  "Amal Al-Wabel CV.PDF"    => ["Bachelor", "Chemicals Permits", "Customs", "En…
  "مهدي عبدالله مهدي آل شه… => ["Ba", "Bsc", "Follow Ups"]
  "يوسف إبراهيم محمد ال قر… => ["Ba", "Bachelor", "Erp", "Supply Chain"]
  "Ziyad Al Essa.pdf"       => ["Bachelor", "Bas", "Erp", "Shipping Documents",…

Then I used:

df = DataFrame(shortlisted)

And got:

DimensionMismatch("column length 13 for column(s) Amal Al-Wabel CV.PDF is incompatible with column length 6 for column(s) Ziyad Al Essa.pdf is incompatible with column length 3 for column(s) إبراهيم وليد أحمد العريفج - Mechanical engineering.pdf, محمد علي محمد المحمد علي - Mechanical engineering.pdf, محمد ناصر حسني الحارثي - Mechanical engineering.pdf and مهدي عبدالله مهدي آل شهاب - Mechanical engineering.pdf, and is incompatible with column length 4 for column(s) يوسف إبراهيم محمد ال قريشه - Supply chain management.pdf")

Stacktrace:
 [1] (::getfield(DataFrames, Symbol("##DataFrame#91#94")))(::Bool, ::Type{DataFrame}, ::Array{Any,1}, ::DataFrames.Index) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:121
 [2] Type at .\array.jl:0 [inlined]
 [3] #DataFrame#101(::Bool, ::Type{DataFrame}, ::Dict{String,Array{String,N} where N}) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:155
 [4] DataFrame(::Dict{String,Array{String,N} where N}) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:147
 [5] top-level scope at In[66]:1

DataFrame columns must all have the same length. Is there a reason you want this in a DataFrame rather than in the dict? One solution might be to have a column for the tags, and then for each PDF, a column with Boolean values (true/false) to indicate whether it has that tag.
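To make that suggestion concrete, here is a minimal sketch of the Boolean layout. The file names and the variable names (`tags_by_file`, `all_tags`) are made up for illustration, and it assumes a recent DataFrames.jl release in which strings are accepted as column names.

```julia
using DataFrames

# Hypothetical sample data in the same shape as the question's dict (file name => tags).
tags_by_file = Dict{String, Vector{String}}(
    "a.pdf" => String[],
    "b.pdf" => ["Erp"],
    "c.pdf" => ["Erp", "Supply Chain"],
)

# Every distinct tag across all files, sorted so the row order is reproducible.
all_tags = sort!(unique(tag for tags in values(tags_by_file) for tag in tags))

# One row per tag, then one Bool column per file. Every column has
# length(all_tags) entries, so the lengths can never disagree.
df = DataFrame(tag = all_tags)
for file in sort(collect(keys(tags_by_file)))
    df[!, file] = [tag in tags_by_file[file] for tag in all_tags]
end

println(df)
```

Files with no tags simply get a column of `false` values, so nothing needs to be filtered out first.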

Thanks, I'll give it a try.
Any comment on the second point: is there a way to iterate over the Dict so that I can remove the entries whose tag array is empty?

julia> for (key, value) in filter(p->!isempty(p.second), dict)
           @show key => value
       end
key => value = "إبراهيم وليد أحمد العريفج - Mechanical engineering.pdf" => ["Erp"]
key => value = "يوسف إبراهيم محمد ال قريشه - Supply chain management.pdf" => ["Erp", "Supply Chain"]
key => value = "Ziyad Al Essa.pdf" => ["Erp", "Shipping Documents", "Supply Chain"]
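If the goal is still a DataFrame, one possible follow-up to this filter is a "long" table with one row per (file, tag) pair, which avoids the equal-length problem altogether. This is only a sketch: the sample data and the names `tagged` and `long` are assumptions, not something from the original posts.

```julia
using DataFrames

# Hypothetical sample data in the same shape as the question's dict.
tags_by_file = Dict{String, Vector{String}}(
    "a.pdf" => String[],
    "b.pdf" => ["Erp"],
    "c.pdf" => ["Erp", "Supply Chain"],
)

# Keep only the entries with at least one tag; `filter` hands each entry to
# the predicate as a single Pair, so the value is `p.second`.
tagged = filter(p -> !isempty(p.second), tags_by_file)

# One row per (file, tag) combination. Both comprehensions walk the same
# unmodified dict, so the two columns line up.
long = DataFrame(
    file = [file for (file, tags) in tagged for _ in tags],
    tag  = [tag for (file, tags) in tagged for tag in tags],
)
sort!(long, [:file, :tag])

println(long)
```

The explicit filter is not strictly needed here, since a file with an empty tag array contributes no rows anyway, but it keeps the intent visible.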

[quote="hasanOryx, post:3, topic:28538"] @kevbonham
I'll give it a try
[/quote]

The DataFrame appeared in a weird way:

To simplify the data, I re-wrote the example as below:

fruits = ["apple", "orange", "banana"]
fruits_matrix = Dict(fruits .=> false)
purchase_fruits = ["apple", "banana"]
available_fruits = Dict{String, Dict{String, Bool}}()
push!(available_fruits, "home" => fruits_matrix)
for (key, value) in available_fruits["home"]
    for entry in purchase_fruits
        if key == entry
            available_fruits["home"][key] = true
        end
    end
end
@show available_fruits
#=
available_fruits = Dict("home" => 
               Dict("orange" => 0,"banana" => 1,"apple" => 1))
Dict{String,Dict{String,Bool}} with 1 entry:
  "home" => Dict("orange"=>0,"banana"=>1,"apple"=>1)
=#
using DataFrames
DataFrame(available_fruits)

The result appeared as:

image

While I'm expecting something like:

image

There's no DataFrame constructor for a dictionary of dictionaries. Furthermore, you seem to want an index for rows, but DataFrames doesn't have this at the moment; no column is special. The DataFrame constructor just takes a Dict, e.g.

julia> d = Dict("apple"=>1, "banana"=>0)
Dict{String,Int64} with 2 entries:
  "banana" => 0
  "apple"  => 1

julia> DataFrame(d)
1×2 DataFrame
│ Row │ apple │ banana │
│     │ Int64 │ Int64  │
├─────┼───────┼────────┤
│ 1   │ 1     │ 0      │

Or a Dict of Arrays of equal length, e.g.:

julia> d2 = Dict("apple"=>rand(1:10, 5), "banana"=>rand(1:10, 5))
Dict{String,Array{Int64,1}} with 2 entries:
  "banana" => [4, 5, 9, 8, 10]
  "apple"  => [8, 8, 8, 8, 5]

julia> DataFrame(d2)
5×2 DataFrame
│ Row │ apple │ banana │
│     │ Int64 │ Int64  │
├─────┼───────┼────────┤
│ 1   │ 8     │ 4      │
│ 2   │ 8     │ 5      │
│ 3   │ 8     │ 9      │
│ 4   │ 8     │ 8      │
│ 5   │ 5     │ 10     │

EDIT: Consider also -

julia> f1 = Dict(:index=>"home", :banana=>true, :apple=>false, :orange=>false)
Dict{Symbol,Any} with 4 entries:
  :index  => "home"
  :banana => true
  :apple  => false
  :orange => false

julia> f2 = Dict(:index=>"bag", :banana=>true, :apple=>false, :orange=>true)
Dict{Symbol,Any} with 4 entries:
  :index  => "bag"
  :banana => true
  :apple  => false
  :orange => true

julia> df = DataFrame(f1)
1×4 DataFrame
│ Row │ apple │ banana │ index  │ orange │
│     │ Bool  │ Bool   │ String │ Bool   │
├─────┼───────┼────────┼────────┼────────┤
│ 1   │ 0     │ 1      │ home   │ 0      │

julia> push!(df, f2)
2×4 DataFrame
│ Row │ apple │ banana │ index  │ orange │
│     │ Bool  │ Bool   │ String │ Bool   │
├─────┼───────┼────────┼────────┼────────┤
│ 1   │ 0     │ 1      │ home   │ 0      │
│ 2   │ 0     │ 1      │ bag    │ 1      │
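For a dictionary of dictionaries like `available_fruits`, the same result can be reached without writing each row by hand. The sketch below is one way to do it, assuming a recent DataFrames.jl release; the `"bag"` entry and the names `available`, `places` and `fruit_names` are invented for the example. As noted above, `index` is just an ordinary column, not a special row index.

```julia
using DataFrames

# Hypothetical nested data in the shape of `available_fruits` from the question.
available = Dict(
    "home" => Dict("apple" => true,  "banana" => true, "orange" => false),
    "bag"  => Dict("apple" => false, "banana" => true, "orange" => true),
)

fruit_names = ["apple", "banana", "orange"]

# Dict iteration order is not guaranteed, so fix the row order explicitly.
places = sort(collect(keys(available)))

# One row per outer key, one Bool column per fruit. A fruit missing from an
# inner dict is treated as `false` here, which is an assumption of this sketch.
df = DataFrame(index = places)
for fruit in fruit_names
    df[!, fruit] = [get(available[place], fruit, false) for place in places]
end

println(df)
```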