Help needed with DataFrames and Query

data

#1

I am learning how to use DataFrames and find am struggling to use it for what I need it for.

My present problem is illustrated by this attempt to use Quey.jl. I cannot understand why there result of q1 has no values.

And I cannot understand why I cannot use @collect DataFrame in this case.

julia> typeof(bs)
DataFrames.DataFrame

julia> size(bs)
(33067, 7)

julia> bs[1:2,:]
2×7 DataFrames.DataFrame
│ Row │ broadfield            │ field                  │ title            │ pubtype   │ pubyear │ doctype   │ publications │
├─────┼───────────────────────┼────────────────────────┼──────────────────┼───────────┼─────────┼───────────┼──────────────┤
│ 1   │ "Biological Sciences" │ "Anatomy & Morphology" │ "ACTA ZOOLOGICA" │ "Journal" │ 2006    │ "Article" │ 1            │
│ 2   │ "Biological Sciences" │ "Anatomy & Morphology" │ "ACTA ZOOLOGICA" │ "Journal" │ 2007    │ "Article" │ 2            │

julia> names(bs)
7-element Array{Symbol,1}:
 :broadfield  
 :field       
 :title       
 :pubtype     
 :pubyear     
 :doctype     
 :publications

julia> bfs = Set(bs[:broadfield])
Set(Nullable{String}["Mathematics", "Statistics", "Geological Sciences", "Biological Sciences", "Physics", "Computer Science", "Chemistry"])


julia> describe(bs)
broadfield
Summary Stats:
Length:         33067
Type:           Nullable{String}
Number Unique:  7
Number Missing: 0
% Missing:      0.000000

field
Summary Stats:
Length:         33067
Type:           Nullable{WeakRefString{UInt16}}
Number Unique:  55
Number Missing: 0
% Missing:      0.000000

title
Summary Stats:
Length:         33067
Type:           Nullable{WeakRefString{UInt16}}
Number Unique:  5229
Number Missing: 0
% Missing:      0.000000

pubtype
Summary Stats:
Length:         33067
Type:           Nullable{WeakRefString{UInt16}}
Number Unique:  5
Number Missing: 0
% Missing:      0.000000

pubyear
Summary Stats:
Mean:           2010.479421
Minimum:        2005.000000
1st Quartile:   2008.000000
Median:         2011.000000
3rd Quartile:   2013.000000
Maximum:        2015.000000
Length:         33067
Type:           Int16
Number Missing: 0
% Missing:      0.000000

doctype
Summary Stats:
Length:         33067
Type:           Nullable{WeakRefString{UInt16}}
Number Unique:  15
Number Missing: 0
% Missing:      0.000000

publications
Summary Stats:
Mean:           2.720930
Minimum:        1.000000
1st Quartile:   1.000000
Median:         1.000000
3rd Quartile:   3.000000
Maximum:        210.000000
Length:         33067
Type:           Int64
Number Missing: 0
% Missing:      0.000000


julia> q1 = @from i in bs begin
           @where i.broadfield == "Chemistry" 
           @select  i.field
           @collect
       end

0-element Array{Nullable{WeakRefString{UInt16}},1}

julia> q1 = @from i in bs begin
           @where i.broadfield == "Chemistry" 
           @select  i.field
           @collect DataFrame
       end
ERROR: MethodError: Cannot `convert` an object of type Query.EnumerableSelect{Nullable{WeakRefString{UInt16}},Query.EnumerableWhere{NamedTuples._NT_broadfield_field_title_pubtype_pubyear_doctype_publications{Nullable{String},Nullable{WeakRefString{UInt16}},Nullable{WeakRefString{UInt16}},Nullable{WeakRefString{UInt16}},Nullable{Int16},Nullable{WeakRefString{UInt16}},Nullable{Int64}},Query.EnumerableIterable{NamedTuples._NT_broadfield_field_title_pubtype_pubyear_doctype_publications{Nullable{String},Nullable{WeakRefString{UInt16}},Nullable{WeakRefString{UInt16}},Nullable{WeakRefString{UInt16}},Nullable{Int16},Nullable{WeakRefString{UInt16}},Nullable{Int64}},IterableTables.DataFrameIterator{NamedTuples._NT_broadfield_field_title_pubtype_pubyear_doctype_publications{Nullable{String},Nullable{WeakRefString{UInt16}},Nullable{WeakRefString{UInt16}},Nullable{WeakRefString{UInt16}},Nullable{Int16},Nullable{WeakRefString{UInt16}},Nullable{Int64}},Tuple{NullableArrays.NullableArray{String,1},NullableArrays.NullableArray{WeakRefString{UInt16},1},NullableArrays.NullableArray{WeakRefString{UInt16},1},NullableArrays.NullableArray{WeakRefString{UInt16},1},NullableArrays.NullableArray{Int16,1},NullableArrays.NullableArray{WeakRefString{UInt16},1},NullableArrays.NullableArray{Int64,1}}}},##10#12},##11#13} to an object of type DataFrames.DataFrame
This may have arisen from a call to the constructor DataFrames.DataFrame(...),
since type constructors fall back to convert methods.

Regards
Johann


Lack of stable PostgreSQL-communication a major showstopper
#2

When you do a @select in Query and then collect into a DataFrame, you need to provide an expression that constructs a named tuple so that Query knows the column names. In Query there’s a shorthand for doing this that uses curly braces {}. Does replacing your select statement with

  @select {i.field}

do what you want?

Also, when posting here you can quote your code by putting triple backticks around it
```
like this
```
to make it easier to read.


#3
julia> q1 = @from i in bs begin
@where i.broadfield == “Chemistry”
@select i.field
@collect
end

0-element Array{Nullable{WeakRefString{UInt16}},1}

This shouldn’t be empty. I’m afraid your DataFrame has Nullable strings rather than normal strings so that the comparison never holds. I’m not sure how Nullables ended up there, maybe some CSV reader that hasn’t updated yet. If you loaded the data using CSV.jl, could you try just loading it with readtable and see if that works?


#4

If you use CSVFiles.jl to read your CSV files into a DataFrame it should always create the right column types (normal array if there are no missing values in the column, otherwise a DataArray).


#5

Thanks @swt30, @piever and @davidanthoff for your replies.

I am a relative newcomer to Julia and have yet to get to grips with it’s typing system.

The DataFrame was created using ODBC to query a PostgreSQL-database. (Unfortunately there is not yet a good PostgreSQL-specific interface to do that in Julia).

My next question would then be: How is it possible to convert the type of Nullable strings in a DataFrame to normal strings?

Regards
Johann


#6

As yes, PostgreSQL support in Julia is pretty poor at the moment. I have forks of PostgreSQL.jl and DBI.jl that just merge some pending pull requests to get them working on julia 0.6 if you were interested. But I won’t blame you for wanting to stick with something a little more “official”!

In the meantime, the best way I can come up with to manually convert columns from DataArray{Nullable{something}} to DataArray{something} is this:

using DataFrames

"Unwrap any Nullables inside a DataArray or DataFrame, turning nulls into NAs"
function unwrap(x::DataArray{Nullable{T}})::DataArray{T} where {T <: Any}
    get.(x, NA)
end
unwrap(x::DataArray) = x
unwrap(df::DataFrame) = DataFrame(unwrap.(df.columns), names(df))

cleaned_bs = unwrap(bs)

That function will unwrap any column of nullables (first method) while leaving ‘normal’ columns alone (second method). The third method is so you can just call unwrap on your entire DataFrame. Hope that helps—let us know if you have any questions about the way that types are being used in the above code, as it uses some recent-ish additions to Julia (return type annotation and the where syntax).


#7

Thanks @swt30

Your unwrap function did not solve the problem. The types of the columns were more complicated than I (and your code) expected.

So I did clone your forks of PostgreSQL and DBI and could get it working.

Now my Query.jl stuff also works as expected and I can carry on exploring DataFrames.

Regards
Johann


#8

It could be something to do with those WeakRefStrings—I don’t know what those are. Anyway, glad to hear it is working for you.


#9

See this post about how to disable WeakRefString (the data ecosystem is quite inconsistent at the moment, but we hope to stabilize it before Julia 1.0):
https://discourse.julialang.org/t/how-to-change-value-of-nullable-weakrefstring-uint8-using-csv-package?source_topic_id=6077