Build dataframe from Dict

hasanOryx · September 8, 2019, 11:52am

UPDATE
You can jump here for the short version.

Question start point
I’m reading a set of PDF files and, based on their content, adding tags to a Dict that maps each file name to its tags. Some files will have zero tags and some will have many. I want to convert the resulting Dict into a DataFrame.

dict = Dict{String, Array{String}}()   #(name, tags)
println(dict)

Dict{String,Array{String,N} where N}("عصام محمد فايز الشهري - Mechanical engineering.pdf" => [],"إبراهيم وليد أحمد العريفج - Mechanical engineering.pdf" => ["Erp"],"محمد علي محمد المحمد علي - Mechanical engineering.pdf" => [],"Nawaf Al Yousef.pdf" => [],"محمد ناصر حسني الحارثي - Mechanical engineering.pdf" => [],"مهدي عبدالله مهدي آل شهاب - Mechanical engineering.pdf" => [],"يوسف إبراهيم محمد ال قريشه - Supply chain management.pdf" => ["Erp", "Supply Chain"],"Noufal Al Zaher.pdf" => [],"Ziyad Al Essa.pdf" => ["Erp", "Shipping Documents", "Supply Chain"],"Majed Al Zahrani - General Secondary.pdf" => [])

I then try to convert it to a DataFrame with:

using DataFrames
df = DataFrame(dict)

But I got this error:

DimensionMismatch("column length 0 for column(s) Majed Al Zahrani - General Secondary.pdf, Nawaf Al Yousef.pdf, Noufal Al Zaher.pdf, عصام محمد فايز الشهري - Mechanical engineering.pdf, محمد علي محمد المحمد علي - Mechanical engineering.pdf, محمد ناصر حسني الحارثي - Mechanical engineering.pdf and مهدي عبدالله مهدي آل شهاب - Mechanical engineering.pdf is incompatible with column length 3 for column(s) Ziyad Al Essa.pdf is incompatible with column length 1 for column(s) إبراهيم وليد أحمد العريفج - Mechanical engineering.pdf, and is incompatible with column length 2 for column(s) يوسف إبراهيم محمد ال قريشه - Supply chain management.pdf")

Stacktrace:
 [1] (::getfield(DataFrames, Symbol("##DataFrame#91#94")))(::Bool, ::Type{DataFrame}, ::Array{Any,1}, ::DataFrames.Index) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:121
 [2] Type at .\array.jl:0 [inlined]
 [3] #DataFrame#101(::Bool, ::Type{DataFrame}, ::Dict{String,Array{String,N} where N}) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:155
 [4] DataFrame(::Dict{String,Array{String,N} where N}) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:147
 [5] top-level scope at In[34]:1

Is this the correct way to convert a Dict to a DataFrame?
Is there a way to iterate over the Dict so that I remove the entries whose tag array is empty?

UPDATE

I tried the below:

shortlisted = filter((k, v) -> v > [], dict)

But got the below:

┌ Warning: In `filter(f, dict)`, `f` is now passed a single pair instead of two arguments.
│   caller = top-level scope at In[65]:1
└ @ Core In[65]:1
Dict{String,Array{String,N} where N} with 7 entries:
  "إبراهيم وليد أحمد العري… => ["Bachelor", "English", "Erp"]
  "محمد علي محمد المحمد عل… => ["Bachelor", "English", "Follow Up"]
  "محمد ناصر حسني الحارثي … => ["Ba", "Bas", "Excel"]
  "Amal Al-Wabel CV.PDF"    => ["Bachelor", "Chemicals Permits", "Customs", "En…
  "مهدي عبدالله مهدي آل شه… => ["Ba", "Bsc", "Follow Ups"]
  "يوسف إبراهيم محمد ال قر… => ["Ba", "Bachelor", "Erp", "Supply Chain"]
  "Ziyad Al Essa.pdf"       => ["Bachelor", "Bas", "Erp", "Shipping Documents",…

Then I used:

df = DataFrame(shortlisted)

And got:

DimensionMismatch("column length 13 for column(s) Amal Al-Wabel CV.PDF is incompatible with column length 6 for column(s) Ziyad Al Essa.pdf is incompatible with column length 3 for column(s) إبراهيم وليد أحمد العريفج - Mechanical engineering.pdf, محمد علي محمد المحمد علي - Mechanical engineering.pdf, محمد ناصر حسني الحارثي - Mechanical engineering.pdf and مهدي عبدالله مهدي آل شهاب - Mechanical engineering.pdf, and is incompatible with column length 4 for column(s) يوسف إبراهيم محمد ال قريشه - Supply chain management.pdf")

Stacktrace:
 [1] (::getfield(DataFrames, Symbol("##DataFrame#91#94")))(::Bool, ::Type{DataFrame}, ::Array{Any,1}, ::DataFrames.Index) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:121
 [2] Type at .\array.jl:0 [inlined]
 [3] #DataFrame#101(::Bool, ::Type{DataFrame}, ::Dict{String,Array{String,N} where N}) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:155
 [4] DataFrame(::Dict{String,Array{String,N} where N}) at C:\Users\hasan.DESKTOP-HU2FQ29\.julia\packages\DataFrames\XuYBH\src\dataframe\dataframe.jl:147
 [5] top-level scope at In[66]:1

kevbonham · September 8, 2019, 1:50pm

DataFrame columns must all have the same length. Is there a reason you want this in a DataFrame rather than in the Dict? One solution might be to have a column for the tags and then, for each PDF, a column of Boolean values (true/false) indicating whether it has that tag.
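A minimal sketch of that layout, assuming a recent DataFrames.jl (1.x) where columns can be indexed by String. The file names and the variables `tags_by_file` and `all_tags` are made up for illustration and do not come from the original data:

```julia
using DataFrames

# Hypothetical sample data: file name => tags (names shortened for illustration)
tags_by_file = Dict(
    "a.pdf" => String[],
    "b.pdf" => ["Erp"],
    "c.pdf" => ["Erp", "Supply Chain"],
)

# Every distinct tag, sorted so the row order is deterministic
all_tags = sort!(unique(reduce(vcat, values(tags_by_file))))

# One row per tag, then one Bool column per file.
# Every column has length(all_tags) entries, so the lengths always agree.
df = DataFrame(tag = all_tags)
for name in sort!(collect(keys(tags_by_file)))
    df[!, name] = [t in tags_by_file[name] for t in all_tags]
end
```

Because each file contributes exactly one Boolean per tag, files with no tags simply get a column of `false` values instead of a zero-length column.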

hasanOryx · September 8, 2019, 1:53pm

Thanks, I’ll give it a try.
Any comment on the second point: is there a way to iterate over the Dict so that I remove the entries whose tag array is empty?

kristoffer.carlsson · September 8, 2019, 2:07pm

julia> for (key, value) in filter(p->!isempty(p.second), dict)
           @show key => value
       end
key => value = "إبراهيم وليد أحمد العريفج - Mechanical engineering.pdf" => ["Erp"]
key => value = "يوسف إبراهيم محمد ال قريشه - Supply chain management.pdf" => ["Erp", "Supply Chain"]
key => value = "Ziyad Al Essa.pdf" => ["Erp", "Shipping Documents", "Supply Chain"]
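
Note that `filter` on a Dict passes the predicate a single Pair, which is why the two-argument closure in the question triggered the warning. A small sketch with made-up data (`tags_by_file` is a hypothetical name) showing a destructuring predicate and the in-place `filter!`:

```julia
# Hypothetical sample data: file name => tags
tags_by_file = Dict(
    "a.pdf" => String[],
    "b.pdf" => ["Erp"],
)

# The predicate receives one Pair; destructure it in the argument list
nonempty = filter(((name, tags),) -> !isempty(tags), tags_by_file)

# filter! applies the same test but modifies the Dict in place
filter!(p -> !isempty(p.second), tags_by_file)
```

Filtering out the empty entries does not make the remaining tag arrays the same length, so passing the result straight to `DataFrame` can still fail with a DimensionMismatch.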

hasanOryx · September 8, 2019, 4:01pm

@kevbonham, regarding my earlier “I’ll give it a try”:

The DataFrame did not come out the way I expected. To simplify the data, I rewrote the example as follows:

fruits = ["apple", "orange", "banana"]
fruits_matrix = Dict(fruits .=> false)
purchase_fruits = ["apple", "banana"]
available_fruits = Dict{String, Dict{String, Bool}}()
push!(available_fruits, "home" => fruits_matrix)
for (key, value) in available_fruits["home"]
    for entry in purchase_fruits
        if key == entry
            available_fruits["home"][key] = true
        end
    end
end
@show available_fruits
#=
available_fruits = Dict("home" => 
               Dict("orange" => 0,"banana" => 1,"apple" => 1))
Dict{String,Dict{String,Bool}} with 1 entry:
  "home" => Dict("orange"=>0,"banana"=>1,"apple"=>1)
=#
using DataFrames
DataFrame(available_fruits)

The result appeared as: [output not preserved]

Whereas I was expecting something like: [output not preserved]

kevbonham · September 8, 2019, 6:26pm

There’s no DataFrame constructor for a dictionary of dictionaries. Furthermore, you seem to want an index for rows, but DataFrames doesn’t have this at the moment; no column is special. The DataFrame constructor just takes a flat Dict, e.g.:

julia> d = Dict("apple"=>1, "banana"=>0)
Dict{String,Int64} with 2 entries:
  "banana" => 0
  "apple"  => 1

julia> DataFrame(d)
1×2 DataFrame
│ Row │ apple │ banana │
│     │ Int64 │ Int64  │
├─────┼───────┼────────┤
│ 1   │ 1     │ 0      │

Or a Dict of Arrays, e.g.:

julia> d2 = Dict("apple"=>rand(1:10, 5), "banana"=>rand(1:10, 5))
Dict{String,Array{Int64,1}} with 2 entries:
  "banana" => [4, 5, 9, 8, 10]
  "apple"  => [8, 8, 8, 8, 5]

julia> DataFrame(d2)
5×2 DataFrame
│ Row │ apple │ banana │
│     │ Int64 │ Int64  │
├─────┼───────┼────────┤
│ 1   │ 8     │ 4      │
│ 2   │ 8     │ 5      │
│ 3   │ 8     │ 9      │
│ 4   │ 8     │ 8      │
│ 5   │ 5     │ 10     │

EDIT: Consider also:

julia> f1 = Dict(:index=>"home", :banana=>true, :apple=>false, :orange=>false)
Dict{Symbol,Any} with 4 entries:
  :index  => "home"
  :banana => true
  :apple  => false
  :orange => false

julia> f2 = Dict(:index=>"bag", :banana=>true, :apple=>false, :orange=>true)
Dict{Symbol,Any} with 4 entries:
  :index  => "bag"
  :banana => true
  :apple  => false
  :orange => true

julia> df = DataFrame(f1)
1×4 DataFrame
│ Row │ apple │ banana │ index  │ orange │
│     │ Bool  │ Bool   │ String │ Bool   │
├─────┼───────┼────────┼────────┼────────┤
│ 1   │ 0     │ 1      │ home   │ 0      │

julia> push!(df, f2)
2×4 DataFrame
│ Row │ apple │ banana │ index  │ orange │
│     │ Bool  │ Bool   │ String │ Bool   │
├─────┼───────┼────────┼────────┼────────┤
│ 1   │ 0     │ 1      │ home   │ 0      │
│ 2   │ 0     │ 1      │ bag    │ 1      │
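
For a nested Dict like `available_fruits`, one possible approach is to flatten it column by column, storing the outer key in an ordinary column. This is only a sketch: it assumes DataFrames.jl 1.x (String column indexing) and that every inner Dict has the same keys. The `"bag"` entry and the names `places` and `fruit_names` are added here for illustration:

```julia
using DataFrames

# Nested Dict in the shape used earlier in the thread ("bag" is illustrative)
available_fruits = Dict(
    "home" => Dict("apple" => true,  "banana" => true, "orange" => false),
    "bag"  => Dict("apple" => false, "banana" => true, "orange" => true),
)

# Fix the row and column order up front, since Dict iteration order is unspecified
places = sort!(collect(keys(available_fruits)))
fruit_names = sort!(collect(keys(first(values(available_fruits)))))

# The outer key becomes a regular "index" column; each fruit becomes a Bool column
df = DataFrame(index = places)
for f in fruit_names
    df[!, f] = [available_fruits[p][f] for p in places]
end
```

Each column is built from the same `places` vector, so all columns have one entry per outer key.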
