Converting to DataFrame

danicaratelli · August 12, 2020, 9:45pm

I am loading a large .dta file and then converting it to a DataFrame. However, because of its size, the conversion takes a very long time (~20 minutes). I don’t need all of the columns in the file, so I am wondering whether anyone knows of a way to convert only a subset of them.

Specifically:

using DataFrames, StatFiles

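df_1 = load("data.dta");   # step 1: open the .dta file
df_2 = DataFrame(df_1);    # step 2: convert to a DataFrame (slow)
df_3 = df_2[!,:col1];      # step 3: keep only the column I need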

Step 2 takes a long time, so if there were a way to swap steps 2 and 3, namely to select the columns I want first and only then convert to a DataFrame, that would speed things up a lot. Any help would be greatly appreciated.
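A side note on step 3, as a minimal sketch on a made-up table (the names `col1` and `col2` are placeholders): indexing with a single column name gives back the column vector, while indexing with a vector of names or calling `select` gives back a DataFrame.

```julia
using DataFrames

# Toy table standing in for the converted .dta data.
df = DataFrame(col1 = [1, 2, 3], col2 = ["a", "b", "c"])

v   = df[!, :col1]        # the column itself, as a Vector (no copy)
sub = df[!, [:col1]]      # a one-column DataFrame
sel = select(df, :col1)   # a one-column DataFrame (copies the column by default)
```

This only affects what step 3 returns; it does nothing about the cost of step 2.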

Thanks!

anon37204545 · August 12, 2020, 10:59pm

I think you can convert the .dta file to .csv first, manipulate the CSV (e.g. with CSV.jl), and then convert that to a DataFrame.
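A rough sketch of what the CSV route could look like once the export exists. The file written below is only a stand-in for the exported data, and the column names are made up. CSV.jl's `select` keyword restricts reading to the listed columns.

```julia
using CSV, DataFrames

# Stand-in for the CSV exported from the .dta file (hypothetical contents).
path = tempname() * ".csv"
write(path, "col1,col2,col3\n1,a,0.5\n2,b,1.5\n")

# Read only the columns that are needed straight into a DataFrame.
df = CSV.read(path, DataFrame; select = [:col1])
```

Whether this beats the direct .dta route overall depends on how long the one-off export takes.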

pdeffebach · August 13, 2020, 1:08am

I don’t think there is a way to load a subset of the columns.

I agree that you should save the data as .csv. If you have access to Stata, that’s very easy. Otherwise, both R and pandas have .dta readers. However, I’m not confident they are any better than Julia’s, since I think StatFiles uses the same underlying library as they do.

danicaratelli · August 13, 2020, 1:45am

Thank you both. I ended up calling R from Julia via RCall. That seems to work much faster than loading the .dta file directly in Julia.
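The post does not say which R function was used, so the following is only a sketch of one way the RCall route might look. It assumes R and the `haven` package are installed, and it writes a tiny .dta file first so the example stands on its own. `read_dta` accepts a `col_select` argument, which also covers the original request of reading only some columns.

```julia
using RCall, DataFrames

# Create a small stand-in .dta file from R (hypothetical data).
path = tempname() * ".dta"
R"haven::write_dta(data.frame(col1 = c(1.5, 2.5), col2 = c('a', 'b')), $path)"

# Read only col1 in R, then copy the result back into a Julia DataFrame.
R"""
tmp <- as.data.frame(haven::read_dta($path, col_select = "col1"))
"""
df = rcopy(R"tmp")
```

Columns carrying Stata value labels may need extra handling on the R side before `rcopy`; that is not covered here.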

Thanks!

nalimilan · August 13, 2020, 10:55am

It’s weird that it takes much longer than R. @davidanthoff probably knows.
