Creating dataframe with arrays of different length

Hi,

I have 3 arrays that I'd like to turn into a DataFrame, where each column corresponds to one array (column 1 = abc, column 2 = def, column 3 = ghi) and the column names match the array names. I tried hcat, but I get an error because my arrays don't have the same length.

abc = [6,7,8,9,10]
def = ["a","b","c"]
ghi = [87]

Does anyone have any ideas ?

Thank you,

The DataFrames.jl documentation is really good, please read it when you find some time.

This should work fine:

DataFrame(abc=abc, def=def, ghi=ghi)

That won’t work, as OP said this will raise an error:

julia> DataFrame(abc = abc, def = def)
ERROR: DimensionMismatch("column :abc has length 5 and column :def has length 3")

This is just how DataFrames works - columns have to be the same length. You need to ask yourself:

  1. Is a DataFrame the right structure for my data? Why do I want to use a DataFrame if the columns aren’t the same length?

  2. If the answer to (1) is "yes, I really want a DataFrame", what should be stored in the shorter columns in those rows where the longer columns have values? To make it concrete, your table looks like this:

abc def ghi
6 "a" 87
7 "b" ?
8 "c" ?
9 ? ?
10 ? ?

What’s in those question marks? If there’s no natural correspondence between the value of abc and ghi in a given row, then maybe a DataFrame just isn’t the right storage format for your data.
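
If the three arrays really are unrelated, one alternative is to keep them together without forcing a common length. A minimal sketch, assuming a plain NamedTuple is enough (the name `data` is made up for illustration and is not from the thread):

```julia
abc = [6, 7, 8, 9, 10]
def = ["a", "b", "c"]
ghi = [87]

# A NamedTuple keeps each array under its own name; lengths may differ.
data = (abc = abc, def = def, ghi = ghi)

# Iterate over name => array pairs and report each length.
for (name, vals) in pairs(data)
    println(name, " has ", length(vals), " element(s)")
end
```

This keeps the names attached to the arrays, but it is not a table, so it only helps if nothing downstream needs rows.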

Thank you @nilshg, I only read the first part of the question and skipped the details. I didn’t notice the columns had different sizes.

@jnewbie what would you expect as the result when columns have different size?

The ? could be replaced with missing. I need to build that DataFrame so it can be exported as a CSV file for use outside of Julia.

abc def ghi
6 "a" 87
7 "b" missing
8 "c" missing
9 missing missing
10 missing missing

I would like to get this if it is possible.

You could do:

julia> DataFrame(abc = abc,
                 def = [def; fill(missing, df_length-length(def))],
                 ghi = [ghi; fill(missing, df_length-length(ghi))])
5×3 DataFrame
 Row │ abc    def      ghi     
     │ Int64  String?  Int64?  
─────┼─────────────────────────
   1 │     6  a             87
   2 │     7  b        missing 
   3 │     8  c        missing 
   4 │     9  missing  missing 
   5 │    10  missing  missing 

With

df_length=maximum(length.([abc,def,ghi]))

for example :wink:

DataFrame(abc = [abc; fill(missing, df_length-length(abc))],
          def = [def; fill(missing, df_length-length(def))],
          ghi = [ghi; fill(missing, df_length-length(ghi))])

is a bit more generic, if abc isn’t the longest one.
But it’s a great one-liner!
I won’t show you mine :frowning:
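
The repeated `[x; fill(missing, ...)]` pattern above can be factored into a small helper. This is only a sketch: `pad_missing` and `cols` are hypothetical names, not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical helper: append `missing` until the vector has length n.
pad_missing(v, n) = [v; fill(missing, n - length(v))]

abc = [6, 7, 8, 9, 10]
def = ["a", "b", "c"]
ghi = [87]

cols = (abc = abc, def = def, ghi = ghi)   # NamedTuple keeps the column names
n = maximum(length, cols)                  # length of the longest array

# map over a NamedTuple returns a NamedTuple with the same names,
# which DataFrame accepts as a table of columns.
df = DataFrame(map(v -> pad_missing(v, n), cols))
```

One difference from the version above: every column goes through the helper, so even the longest column ends up with an element type that allows missing.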

it works, thanks !

Thanks !
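
Since the goal stated earlier in the thread is a CSV export, here is a minimal sketch of that last step, assuming the CSV.jl package is installed. The output path and the "NA" marker are illustrative choices, not something from the thread.

```julia
using CSV, DataFrames

# The padded table from above, written out literally to keep this self-contained.
df = DataFrame(abc = [6, 7, 8, 9, 10],
               def = ["a", "b", "c", missing, missing],
               ghi = [87, missing, missing, missing, missing])

# Hypothetical output location; replace with a real path as needed.
path = joinpath(mktempdir(), "padded.csv")

# missingstring controls how missing values appear in the file.
CSV.write(path, df; missingstring = "NA")
```

Which marker to use for missing values depends on what the program reading the file expects.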

Another option using LazyStack.jl:

DataFrame(rstack(abc, def, ghi, fill=missing), [:abc, :def, :ghi])

Another option using PaddedViews.jl:

n = maximum(length.((abc, def, ghi)))
abc, def, ghi = PaddedView.(missing, (abc,def,ghi), Ref((n,)), Ref((1,)))
df = DataFrame(abc=abc, def=def, ghi=ghi)