How to convert DataValues.DataValue to Array? dane_stat.data.values does not work

#1

I have something like the output below. How can I convert dane_stat.data to a plain Julia Array?

julia> using ReadStat
julia> dane_stat=read_sav(savy[j])
julia> typeof(dane_stat.data)
Array{Any,1}
julia> dane_stat.data
958-element Array{Any,1}:
 DataValues.DataValue{Float64}[4.0, 4.0, 7.0, 6.0, 2.0, 5.0, 9.0, 6.0, 4.0, 12.0]
 DataValues.DataValue{Float64}[4.0, 4.0, 7.0, 6.0, 2.0, 5.0, 9.0, 6.0, 4.0, 12.0]
 DataValues.DataValue{Float64}[3.0, 3.0, 6.0, 5.0, 2.0, 4.0, 8.0, 5.0, 3.0, 9.0]
 DataValues.DataValue{Float64}[2.0, 2.0, 3.0, 3.0, 1.0, 3.0, 4.0, 3.0, 2.0, 4.0]
...
 DataValues.DataValue{Float64}[2.0, 2.0, 3.0, 3.0, 1.0, 1.0, 3.0, 3.0, 2.0, 4.0]

Accessing dane_stat.data.values fails:
julia> dane_stat.data.values
ERROR: type Array has no field values
Stacktrace:
 [1] getproperty(::Any, ::Symbol) at .\sysimg.jl:18

julia> danes.data[1].values
1020-element Array{Float64,1}:
  1.0
  1.0
  1.0
  1.0
...

This gives only one column.
dane_stat.data.values does not work. How can I simply get the whole array?
Thanks, Paul

#2

I’m confused. Why not just collect it into a DataFrame?

#3

Yes, your best option right now is probably to use StatFiles.jl and read it into a DataFrame:

using StatFiles, DataFrames

df = load(filename) |> DataFrame

StatFiles.jl is a thin wrapper around ReadStat.jl that provides integration with other data packages. StatFiles.jl is really the package that most users should use, whereas ReadStat.jl is more of a low level library.

#4

Thanks, but a DataFrame is too large (this data is more than 8 GB, and Julia runs slowly in swap). I simply need an Array.
How can I get it without DataFrames?
Paul

#5

A DataFrame is pretty lightweight. It won’t add very much overhead compared to the size of the arrays. How much RAM does your computer have?

#6

8 cores, 8 GB RAM, Win7 64-bit, a fast machine. But for example, a 300 000 KB sav file loaded with
df = load(filename) |> DataFrame
uses 1.7 GB of RAM! That is normal at this size, but in the task list the julia process grows to 5 GB and uses swap. If I create rand(237253,954), the process is 3 times smaller.

julia> varinfo()
name                    size summary
---------------- ----------- --------------------
Base                         Module
Core                         Module
InteractiveUtils 167.330 KiB Module
Main                         Module
df                 1.737 GiB 237253×954 DataFrame

When I use ReadStat I do not have this problem, but there are others: missing data shows up as tiny numbers, around epsilon, as described here: https://github.com/queryverse/ReadStat.jl/issues/48

Paul

#7

Comparing the DataFrame case with the result of the rand call is not a good comparison: the rand call returns an array that can’t hold missing values. If you compare it with Matrix{Union{Float64,Missing}}(undef,237253,954) then you’ll see that the memory requirement is almost identical. That is still not the correct comparison, though: both DataFrame and ReadStat return things as a vector of vectors, and that adds some additional overhead. I think at the end of the day, you’ll see that the memory overhead introduced by DataFrame is really, really negligible.
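
The comparison described above can be checked with a minimal sketch that uses only Base Julia. The dimensions are deliberately much smaller than the 237253×954 table in this thread so that it runs quickly, and the variable names are illustrative assumptions. `Base.summarysize` reports the number of bytes reachable from an object.

```julia
# Illustrative dimensions (an assumption); the table in the thread is 237253×954.
nrows, ncols = 1_000, 50

# What rand returns: plain Float64 storage that cannot hold missing values.
dense = rand(nrows, ncols)

# Same shape, but each element may be a Float64 or missing.
with_missing = Matrix{Union{Float64,Missing}}(undef, nrows, ncols)

# Vector-of-vectors layout: one separately allocated vector per column.
columns = [Vector{Union{Float64,Missing}}(undef, nrows) for _ in 1:ncols]

size_dense = Base.summarysize(dense)
size_with_missing = Base.summarysize(with_missing)
size_columns = Base.summarysize(columns)

println("Matrix{Float64}:                ", size_dense, " bytes")
println("Matrix{Union{Float64,Missing}}: ", size_with_missing, " bytes")
println("Vector of column vectors:       ", size_columns, " bytes")
```

The union-typed arrays carry extra storage that records which entries are missing, which is why they come out larger than the rand result.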

#8

It takes 40 seconds with a file that is not very big, about 1.5 MB. If I have to read large sav files, my machine stalls in swap.

julia> using ReadStat
julia> @time danes=read_sav("My_file.sav");
  0.323099 seconds (1.81 M allocations: 48.456 MiB, 9.83% gc time)
julia> danes.data[1][1]
DataValue{Float64}(1.0)
julia> @time danes.data[1];
  0.000007 seconds (4 allocations: 160 bytes)
julia> exit()

Restart

julia> using StatFiles, DataFrames
julia> @time df=DataFrame(load("My_file.sav"))
 40.178075 seconds (40.84 M allocations: 1.719 GiB, 2.81% gc time)
1020×998 DataFrame. Omitted printing of 990 columns

I still can't read missing data:

julia> danes.data[1][10].value
1.29189526e-315

julia> danes.data[1]
1020-element DataValues.DataValueArray{Float64,1}:
 DataValue{Float64}(1.0)
 DataValue{Float64}(1.0)
 DataValue{Float64}(1.0)
 DataValue{Float64}(1.0)
 DataValue{Float64}(12.0)
 DataValue{Float64}(2.0)
 DataValue{Float64}(2.0)
 DataValue{Float64}(2.0)
 DataValue{Float64}()
 DataValue{Float64}()
 DataValue{Float64}()

The empty row (element 10) reads as 1.29189526e-315.

I want to read the data without DataFrames and then analyze each column separately, so as not to mix strings with numbers.
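
One possible way to do this per column, without DataFrames, is sketched below. It assumes the DataValues.jl package (the element type shown in the output above) is installed. It also assumes that the `.value` field of an empty entry holds uninitialized memory, which would explain numbers like 1.29189526e-315; checking `DataValues.isna` before reading the value avoids that. The sample column and the helper names `to_missing_vector` and `to_float_vector` are hypothetical and not part of ReadStat.jl or DataValues.jl.

```julia
using DataValues

# Hypothetical stand-in for one numeric column such as danes.data[1].
col = [DataValue(1.0), DataValue(12.0), DataValue{Float64}(), DataValue(2.0)]

# Convert a column to a plain vector, marking empty entries as `missing`.
to_missing_vector(c) = [DataValues.isna(x) ? missing : get(x) for x in c]

# Convert a numeric column to Vector{Float64}, using NaN for empty entries.
to_float_vector(c) = Float64[DataValues.isna(x) ? NaN : get(x) for x in c]

v = to_missing_vector(col)
f = to_float_vector(col)

# Numeric columns of equal length can then be combined into one Matrix{Float64}.
m = reduce(hcat, [f, f])

println(v)
println(size(m))
```

Applying a helper like this column by column keeps string columns separate from numeric ones; only the numeric columns would go through `to_float_vector` and `hcat`.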

Paul
