JuliaDB Questions/Issues

I have a few questions about using JuliaDB that Iโ€™m hoping the fantastic community here can help me answer:

  • I have a .csv file that is 4.3 GB and when I load it into JuliaDB and then save it, it ends up being 8.5 GB. Is it normal for the JuliaDB table to be so much larger than the .csv representation of the same data?
  • The data are in approximately 4.7 million rows and 286 columns. Most of the columns are concretely typed, but many of them are Union types of Missing,Int64. If I replace the Missing types with integers, will this speed up operations on this table? If so, whatโ€™s the best way to do this?
  • Lastly, Iโ€™ve tried to apply a simple filter to this table and Iโ€™ve let the program run for a solid 15 minutes and it never finished (so I stopped it). Is this normal? Iโ€™ve used JuliaDB to process another table that is roughly 2.5 GB and basic filtering operations are quite fast on that dataset (all columns in that dataset are concretely typed, no union types)โ€ฆ
  1. @shashi or @joshday might better understand if that storage size sounds reasonable
  2. For columns of type Union{Int, Missing}, there shouldnโ€™t be much of a slowdown, but it may depend on the exact operation. There are numerous optimizations to account for these kinds of Union types, but there may also be cases where things are slow and need to be fixed, so if you notice certain operations are much faster on plain Int vs. Union{Int, Missing}, please post here on discourse and others (like myself) can help take a look to figure out whatโ€™s going on. That said, if you have a natural โ€œsentinelโ€ Int value you can replace missing values, that can be not only more performant, but conceptually simpler during processing tasks. In fact, it would probably be helpful if, for example, CSV.jl provided a way to directly replace missing values w/ a user-provided sentinel value of the column type. (Iโ€™ve submitted an issue to explore that for CSV.jl)
  3. It sounds like thereโ€™s some kind of real issue here; itโ€™d be most helpful to open an issue here with as much code as possible showing the slowdown; with an easy reproduction, it makes it easiest for package maintainers to find a fix.
1 Like

If the number of columns is in the hundreds, I suspect row iteration can get a bit slow: one solution is to use select keyword to select on which column (or columns) the filtering condition is applied. For example:

julia> t = table((a=rand(10), b=rand(10)));

julia> filter(x -> x > 0.5, t; select = :a)
Table with 4 rows, 2 columns:
a         b
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
0.99744   0.29201
0.616544  0.853681
0.511714  0.107963
0.630841  0.856068

Alternatively, the @filter macro from JuliaDBMeta automatically does the column selection for you.

If you can share the data set and the code to load it, we could better answer your questions. If you cannot, maybe you can describe the columns to us (any strings in there? all Int64, or are there some Int8, Int16? etc).

1 Like

Thanks for the input. The data can be found in the zip file here: https://www2.census.gov/programs-surveys/acs/data/pums/2017/5-Year/csv_pus.zip

It is the first .csv file in the .zip folder. The code that Iโ€™m using to load it looks like this (and only takes a few minutes to execute):

using JuliaDB

acs = loadtable("psam_pusa.csv", type_detect_rows=200)

This results in the following output:

Table with 4691835 rows, 286 columns:
Columns:
#    colname    type
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
1    RT         String
2    SERIALNO   Int64
3    DIVISION   Int64
4    SPORDER    Int64
5    PUMA       Int64
6    REGION     Int64
7    ST         Int64
8    ADJINC     Int64
9    PWGTP      Int64
10   AGEP       Int64
11   CIT        Int64
12   CITWP      Union{Missing, Int64}
13   COW        Union{Missing, Int64}
14   DDRS       Union{Missing, Int64}
15   DEAR       Int64
16   DEYE       Int64
17   DOUT       Union{Missing, Int64}
18   DPHY       Union{Missing, Int64}
19   DRAT       Union{Missing, Int64}
20   DRATX      Union{Missing, Int64}
21   DREM       Union{Missing, Int64}
22   ENG        Union{Missing, Int64}
23   FER        Union{Missing, Int64}
24   GCL        Union{Missing, Int64}
25   GCM        Union{Missing, Int64}
26   GCR        Union{Missing, Int64}
27   HINS1      Int64
28   HINS2      Int64
29   HINS3      Int64
30   HINS4      Int64
31   HINS5      Int64
32   HINS6      Int64
33   HINS7      Int64
34   INTP       Union{Missing, Int64}
35   JWMNP      Union{Missing, Int64}
36   JWRIP      Union{Missing, Int64}
37   JWTR       Union{Missing, Int64}
38   LANX       Union{Missing, Int64}
39   MAR        Int64
40   MARHD      Union{Missing, Int64}
41   MARHM      Union{Missing, Int64}
42   MARHT      Union{Missing, Int64}
43   MARHW      Union{Missing, Int64}
44   MARHYP     Union{Missing, Int64}
45   MIG        Union{Missing, Int64}
46   MIL        Union{Missing, Int64}
47   MLPA       Union{Missing, Int64}
48   MLPB       Union{Missing, Int64}
49   MLPCD      Union{Missing, Int64}
50   MLPE       Union{Missing, Int64}
51   MLPFG      Union{Missing, Int64}
52   MLPH       Union{Missing, Int64}
53   MLPI       Union{Missing, Int64}
54   MLPJ       Union{Missing, Int64}
55   MLPK       Union{Missing, Int64}
56   NWAB       Union{Missing, Int64}
57   NWAV       Union{Missing, Int64}
58   NWLA       Union{Missing, Int64}
59   NWLK       Union{Missing, Int64}
60   NWRE       Union{Missing, Int64}
61   OIP        Union{Missing, Int64}
62   PAP        Union{Missing, Int64}
63   RELP       Int64
64   RETP       Union{Missing, Int64}
65   SCH        Union{Missing, Int64}
66   SCHG       Union{Missing, Int64}
67   SCHL       Union{Missing, Int64}
68   SEMP       Union{Missing, Int64}
69   SEX        Int64
70   SSIP       Union{Missing, Int64}
71   SSP        Union{Missing, Int64}
72   WAGP       Union{Missing, Int64}
73   WKHP       Union{Missing, Int64}
74   WKL        Union{Missing, Int64}
75   WKW        Union{Missing, Int64}
76   WRK        Union{Missing, Int64}
77   YOEP       Union{Missing, Int64}
78   ANC        Int64
79   ANC1P      Int64
80   ANC2P      Int64
81   DECADE     Union{Missing, Int64}
82   DIS        Int64
83   DRIVESP    Union{Missing, Int64}
84   ESP        Union{Missing, Int64}
85   ESR        Union{Missing, Int64}
86   FOD1P      Union{Missing, Int64}
87   FOD2P      Union{Missing, Int64}
88   HICOV      Int64
89   HISP       Int64
90   INDP       Union{Missing, Int64}
91   JWAP       Union{Missing, Int64}
92   JWDP       Union{Missing, Int64}
93   LANP       Union{Missing, Int64}
94   MIGPUMA    Union{Missing, Int64}
95   MIGSP      Union{Missing, Int64}
96   MSP        Union{Missing, Int64}
97   NAICSP     String
98   NATIVITY   Int64
99   NOP        Union{Missing, Int64}
100  OC         Union{Missing, Int64}
101  OCCP       Union{Missing, Int64}
102  PAOC       Union{Missing, Int64}
103  PERNP      Union{Missing, Int64}
104  PINCP      Union{Missing, Int64}
105  POBP       Int64
106  POVPIP     Union{Missing, Int64}
107  POWPUMA    Union{Missing, Int64}
108  POWSP      Union{Missing, Int64}
109  PRIVCOV    Int64
110  PUBCOV     Int64
111  QTRBIR     Int64
112  RAC1P      Int64
113  RAC2P      Int64
114  RAC3P      Int64
115  RACAIAN    Int64
116  RACASN     Int64
117  RACBLK     Int64
118  RACNH      Int64
119  RACNUM     Int64
120  RACPI      Int64
121  RACSOR     Int64
122  RACWHT     Int64
123  RC         Union{Missing, Int64}
124  SCIENGP    Union{Missing, Int64}
125  SCIENGRLP  Union{Missing, Int64}
126  SFN        Union{Missing, Int64}
127  SFR        Union{Missing, Int64}
128  SOCP       String
129  VPS        Union{Missing, Int64}
130  WAOB       Int64
131  FAGEP      Int64
132  FANCP      Int64
133  FCITP      Int64
134  FCITWP     Int64
135  FCOWP      Int64
136  FDDRSP     Int64
137  FDEARP     Int64
138  FDEYEP     Int64
139  FDISP      Int64
140  FDOUTP     Int64
141  FDPHYP     Int64
142  FDRATP     Int64
143  FDRATXP    Int64
144  FDREMP     Int64
145  FENGP      Int64
146  FESRP      Int64
147  FFERP      Int64
148  FFODP      Int64
149  FGCLP      Int64
150  FGCMP      Int64
151  FGCRP      Int64
152  FHICOVP    Int64
153  FHINS1P    Int64
154  FHINS2P    Int64
155  FHINS3C    Union{Missing, Int64}
156  FHINS3P    Int64
157  FHINS4C    Union{Missing, Int64}
158  FHINS4P    Int64
159  FHINS5C    Union{Missing, Int64}
160  FHINS5P    Int64
161  FHINS6P    Int64
162  FHINS7P    Int64
163  FHISP      Int64
164  FINDP      Int64
165  FINTP      Int64
166  FJWDP      Int64
167  FJWMNP     Int64
168  FJWRIP     Int64
169  FJWTRP     Int64
170  FLANP      Int64
171  FLANXP     Int64
172  FMARP      Int64
173  FMARHDP    Int64
174  FMARHMP    Int64
175  FMARHTP    Int64
176  FMARHWP    Int64
177  FMARHYP    Int64
178  FMIGP      Int64
179  FMIGSP     Int64
180  FMILPP     Int64
181  FMILSP     Int64
182  FOCCP      Int64
183  FOIP       Int64
184  FPAP       Int64
185  FPERNP     Int64
186  FPINCP     Int64
187  FPOBP      Int64
188  FPOWSP     Int64
189  FPRIVCOVP  Int64
190  FPUBCOVP   Int64
191  FRACP      Int64
192  FRELP      Int64
193  FRETP      Int64
194  FSCHGP     Int64
195  FSCHLP     Int64
196  FSCHP      Int64
197  FSEMP      Int64
198  FSEXP      Int64
199  FSSIP      Int64
200  FSSP       Int64
201  FWAGP      Int64
202  FWKHP      Int64
203  FWKLP      Int64
204  FWKWP      Int64
205  FWRKP      Int64
206  FYOEP      Int64
207  PWGTP1     Int64
208  PWGTP2     Int64
209  PWGTP3     Int64
210  PWGTP4     Int64
211  PWGTP5     Int64
212  PWGTP6     Int64
213  PWGTP7     Int64
214  PWGTP8     Int64
215  PWGTP9     Int64
216  PWGTP10    Int64
217  PWGTP11    Int64
218  PWGTP12    Int64
219  PWGTP13    Int64
220  PWGTP14    Int64
221  PWGTP15    Int64
222  PWGTP16    Int64
223  PWGTP17    Int64
224  PWGTP18    Int64
225  PWGTP19    Int64
226  PWGTP20    Int64
227  PWGTP21    Int64
228  PWGTP22    Int64
229  PWGTP23    Int64
230  PWGTP24    Int64
231  PWGTP25    Int64
232  PWGTP26    Int64
233  PWGTP27    Int64
234  PWGTP28    Int64
235  PWGTP29    Int64
236  PWGTP30    Int64
237  PWGTP31    Int64
238  PWGTP32    Int64
239  PWGTP33    Int64
240  PWGTP34    Int64
241  PWGTP35    Int64
242  PWGTP36    Int64
243  PWGTP37    Int64
244  PWGTP38    Int64
245  PWGTP39    Int64
246  PWGTP40    Int64
247  PWGTP41    Int64
248  PWGTP42    Int64
249  PWGTP43    Int64
250  PWGTP44    Int64
251  PWGTP45    Int64
252  PWGTP46    Int64
253  PWGTP47    Int64
254  PWGTP48    Int64
255  PWGTP49    Int64
256  PWGTP50    Int64
257  PWGTP51    Int64
258  PWGTP52    Int64
259  PWGTP53    Int64
260  PWGTP54    Int64
261  PWGTP55    Int64
262  PWGTP56    Int64
263  PWGTP57    Int64
264  PWGTP58    Int64
265  PWGTP59    Int64
266  PWGTP60    Int64
267  PWGTP61    Int64
268  PWGTP62    Int64
269  PWGTP63    Int64
270  PWGTP64    Int64
271  PWGTP65    Int64
272  PWGTP66    Int64
273  PWGTP67    Int64
274  PWGTP68    Int64
275  PWGTP69    Int64
276  PWGTP70    Int64
277  PWGTP71    Int64
278  PWGTP72    Int64
279  PWGTP73    Int64
280  PWGTP74    Int64
281  PWGTP75    Int64
282  PWGTP76    Int64
283  PWGTP77    Int64
284  PWGTP78    Int64
285  PWGTP79    Int64
286  PWGTP80    Int64

If I save this via the save function, it results in the output file being 8.5 GB (the original .csv is 4.3 GB). Then, I try to apply a simple filter such as this:

filtered = filter(row -> row.WKHP > 40, acs)

I let this run for about 10 - 15 minutes and it doesnโ€™t finish. I noticed that my memory usage skyrockets as soon as I start loading the data and it doesnโ€™t drop much after the data is loaded but I tried to use the Lazy package, as shown in the docs, but I got the same end result.

I see two issues here: size and speed

First, when I run:

acs = loadtable("C:\\Users\\Max\\Desktop\\psam_pusa.csv", 
    colparsers = vcat(String, repeat([Union{Missing,Int16}], 95), String, repeat([Union{Missing,Int16}], 30),
       String, repeat([Union{Missing,Int16}], 158)))

it yields a 2.5GB file. I think that is a bug worth reporting to JuliaDB or TextParse.
I found a second issue there, because with Int64s instead of the Int16 above the file is only 2.49GB big, which means that something is not stored correctly.

About your filter:

filtered = filter(row -> row.WKHP > 40, acs) should have thrown an error, and it definitely did, but maybe you did not see it. row.WKHP may be Missing and > 40 yields Missing, but a Bool is needed.

I like writing queries using JuliaDBMeta:

@time filtered = @filter acs ismissing(:WKHP) ? false : :WKHP > 40

About the memory consumption, I find the same results with the 2.5GB data set: I load it acs = load("path/file") and 10GB of previously free memory of my 16GB hardware fills up. Running the query takes 5 seconds the first time, 10/17/25/40 seconds if I run it again and again, while them memory writes to the page file like crazy. That seems like a memory leak.

@piever, @joshday, @shashi any thoughts on that?

I am running JuliaDB 0.12 and Julia 1.1.0

This benchmark suggests that filter has a significant performance issue with Union{T,Missing} columns. You could try to convert the column to DataValue and see whether you get better performance that way.

In general a binary storage can take up more space than CSV. For example a small one digit integer might take up two bytes in a CSV (one for the digit, one for a separator character), but 8 in a binary format (64 bit). Not sure whether that is what is going on here, though.

2 Likes

Thanks, David. You will be pleased to know that I am able to perform the operations above with Queryverse.jl and it works fine ; )

It may be a bit off topic, but if anyone is interested, I can filter the exact same data set in Queryverse like this:

using Queryverse

acs = load("psam_pusa.csv", type_detect_rows = 2000)

This yields the following:

4691835x286 CSV file
RT โ”‚ SERIALNO      โ”‚ DIVISION โ”‚ SPORDER โ”‚ PUMA โ”‚ REGION โ”‚ ST โ”‚ ADJINC  โ”‚ PWGTP โ”‚ AGEP โ”‚ CIT โ”‚ CITWP โ”‚ COW โ”‚ DDRS โ”‚ DEAR โ”‚ DEYE โ”‚ DOUT โ”‚ DPHY โ”‚ DRAT โ”‚ DRATX โ”‚ DREM โ”‚ ENG โ”‚ FER โ”‚ GCL โ”‚ GCM โ”‚ GCR โ”‚ HINS1 โ”‚ HINS2 โ”‚ HINS3 โ”‚ HINS4 โ”‚ HINS5
โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€
P  โ”‚ 2013000000084 โ”‚ 6        โ”‚ 1       โ”‚ 2600 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 13    โ”‚ 19   โ”‚ 1   โ”‚ #NA   โ”‚ #NA โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ 2   โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ 1     โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 2
P  โ”‚ 2013000000154 โ”‚ 6        โ”‚ 1       โ”‚ 2500 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 11    โ”‚ 55   โ”‚ 1   โ”‚ #NA   โ”‚ 1   โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ #NA โ”‚ 2   โ”‚ #NA โ”‚ #NA โ”‚ 1     โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 2
P  โ”‚ 2013000000154 โ”‚ 6        โ”‚ 2       โ”‚ 2500 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 13    โ”‚ 56   โ”‚ 1   โ”‚ #NA   โ”‚ 6   โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ #NA โ”‚ 2   โ”‚ #NA โ”‚ #NA โ”‚ 1     โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 2
P  โ”‚ 2013000000154 โ”‚ 6        โ”‚ 3       โ”‚ 2500 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 30    โ”‚ 21   โ”‚ 1   โ”‚ #NA   โ”‚ #NA โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ 1     โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 2
P  โ”‚ 2013000000154 โ”‚ 6        โ”‚ 4       โ”‚ 2500 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 15    โ”‚ 21   โ”‚ 1   โ”‚ #NA   โ”‚ #NA โ”‚ 1    โ”‚ 2    โ”‚ 2    โ”‚ 1    โ”‚ 1    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ 2   โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ 1     โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 2
P  โ”‚ 2013000000156 โ”‚ 6        โ”‚ 1       โ”‚ 1700 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 93    โ”‚ 63   โ”‚ 1   โ”‚ #NA   โ”‚ 3   โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ #NA โ”‚ 2   โ”‚ #NA โ”‚ #NA โ”‚ 1     โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 2
P  โ”‚ 2013000000160 โ”‚ 6        โ”‚ 1       โ”‚ 2200 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 5     โ”‚ 61   โ”‚ 1   โ”‚ #NA   โ”‚ #NA โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 1    โ”‚ 1    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ #NA โ”‚ 1   โ”‚ 5   โ”‚ 1   โ”‚ 2     โ”‚ 2     โ”‚ 1     โ”‚ 2     โ”‚ 2
P  โ”‚ 2013000000160 โ”‚ 6        โ”‚ 2       โ”‚ 2200 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 11    โ”‚ 20   โ”‚ 1   โ”‚ #NA   โ”‚ 1   โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 2
P  โ”‚ 2013000000160 โ”‚ 6        โ”‚ 3       โ”‚ 2200 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 2     โ”‚ 12   โ”‚ 1   โ”‚ #NA   โ”‚ #NA โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ #NA  โ”‚ 2    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ #NA โ”‚ 2     โ”‚ 2     โ”‚ 2     โ”‚ 1     โ”‚ 2
P  โ”‚ 2013000000231 โ”‚ 6        โ”‚ 1       โ”‚ 2400 โ”‚ 3      โ”‚ 1  โ”‚ 1061971 โ”‚ 10    โ”‚ 78   โ”‚ 1   โ”‚ #NA   โ”‚ 2   โ”‚ 2    โ”‚ 1    โ”‚ 2    โ”‚ 2    โ”‚ 2    โ”‚ #NA  โ”‚ #NA   โ”‚ 2    โ”‚ #NA โ”‚ #NA โ”‚ 2   โ”‚ #NA โ”‚ #NA โ”‚ 2     โ”‚ 1     โ”‚ 1     โ”‚ 1     โ”‚ 2
... with 4691825 more rows, and 255 more columns: HINS6, HINS7, INTP, JWMNP, JWRIP, JWTR, LANX, MAR, MARHD, MARHM, MARHT, MARHW, MARHYP, MIG, MIL, MLPA, MLPB, MLPCD, MLPE, MLPFG, MLPH, MLPI, MLPJ, MLPK, NWAB, NWAV, NWLA, NWLK, NWRE, OIP, PAP, RELP, RETP, SCH, SCHG, SCHL, SEMP, SEX, SSIP, SSP, WAGP, WKHP, WKL, WKW, WRK, YOEP, ANC, ANC1P, ANC2P, DECADE, DIS, DRIVESP, ESP, ESR, FOD1P, FOD2P, HICOV, HISP, INDP, JWAP, JWDP, LANP, MIGPUMA, MIGSP, MSP, NAICSP, NATIVITY, NOP, OC, OCCP, PAOC, PERNP, PINCP, POBP, POVPIP, POWPUMA, POWSP, PRIVCOV, PUBCOV, QTRBIR, RAC1P, RAC2P, RAC3P, RACAIAN, RACASN, RACBLK, RACNH, RACNUM, RACPI, RACSOR, RACWHT, RC, SCIENGP, SCIENGRLP, SFN, SFR, SOCP, VPS, WAOB, FAGEP, FANCP, FCITP, FCITWP, FCOWP, FDDRSP, FDEARP, FDEYEP, FDISP, FDOUTP, FDPHYP, FDRATP, FDRATXP, FDREMP, FENGP, FESRP, FFERP, FFODP, FGCLP, FGCMP, FGCRP, FHICOVP, FHINS1P, FHINS2P, FHINS3C, FHINS3P, FHINS4C, FHINS4P, FHINS5C, FHINS5P, FHINS6P, FHINS7P, FHISP, FINDP, FINTP, FJWDP, FJWMNP, FJWRIP, FJWTRP, FLANP, FLANXP, FMARP, FMARHDP, FMARHMP, FMARHTP, FMARHWP, FMARHYP, FMIGP, FMIGSP, FMILPP, FMILSP, FOCCP, FOIP, FPAP, FPERNP, FPINCP, FPOBP, FPOWSP, FPRIVCOVP, FPUBCOVP, FRACP, FRELP, FRETP, FSCHGP, FSCHLP, FSCHP, FSEMP, FSEXP, FSSIP, FSSP, FWAGP, FWKHP, FWKLP, FWKWP, FWRKP, FYOEP, PWGTP1, PWGTP2, PWGTP3, PWGTP4, PWGTP5, PWGTP6, PWGTP7, PWGTP8, PWGTP9, PWGTP10, PWGTP11, PWGTP12, PWGTP13, PWGTP14, PWGTP15, PWGTP16, PWGTP17, PWGTP18, PWGTP19, PWGTP20, PWGTP21, PWGTP22, PWGTP23, PWGTP24, PWGTP25, PWGTP26, PWGTP27, PWGTP28, PWGTP29, PWGTP30, PWGTP31, PWGTP32, PWGTP33, PWGTP34, PWGTP35, PWGTP36, PWGTP37, PWGTP38, PWGTP39, PWGTP40, PWGTP41, PWGTP42, PWGTP43, PWGTP44, PWGTP45, PWGTP46, PWGTP47, PWGTP48, PWGTP49, PWGTP50, PWGTP51, PWGTP52, PWGTP53, PWGTP54, PWGTP55, PWGTP56, PWGTP57, PWGTP58, PWGTP59, PWGTP60, PWGTP61, PWGTP62, PWGTP63, PWGTP64, PWGTP65, PWGTP66, PWGTP67, PWGTP68, PWGTP69, PWGTP70, PWGTP71, PWGTP72, PWGTP73, PWGTP74, PWGTP75, PWGTP76, PWGTP77, PWGTP78, PWGTP79, PWGTP80

I can then filter it like this (note that Iโ€™ve added a filter criterion that doesnโ€™t appear in the operation above):

acs |>
    @filter(_.ESR == 1 || _.ESR == 2) |>
    @filter(_.WKHP > 40) |>
    DataFrame

Which results in the following output:

530509ร—286 DataFrame. Omitted printing of 262 columns
โ”‚ Row    โ”‚ RT     โ”‚ SERIALNO      โ”‚ DIVISION โ”‚ SPORDER โ”‚ PUMA  โ”‚ REGION โ”‚ ST    โ”‚ ADJINC  โ”‚ PWGTP โ”‚ AGEP  โ”‚ CIT   โ”‚ CITWP   โ”‚ COW    โ”‚ DDRS   โ”‚ DEAR  โ”‚ DEYE  โ”‚ DOUT   โ”‚ DPHY   โ”‚ DRAT    โ”‚ DRATX   โ”‚ DREM   โ”‚ ENG     โ”‚ FER     โ”‚ GCL    โ”‚
โ”‚        โ”‚ String โ”‚ Int64         โ”‚ Int64    โ”‚ Int64   โ”‚ Int64 โ”‚ Int64  โ”‚ Int64 โ”‚ Int64   โ”‚ Int64 โ”‚ Int64 โ”‚ Int64 โ”‚ Int64โฐ  โ”‚ Int64โฐ โ”‚ Int64โฐ โ”‚ Int64 โ”‚ Int64 โ”‚ Int64โฐ โ”‚ Int64โฐ โ”‚ Int64โฐ  โ”‚ Int64โฐ  โ”‚ Int64โฐ โ”‚ Int64โฐ  โ”‚ Int64โฐ  โ”‚ Int64โฐ โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1      โ”‚ P      โ”‚ 2013000000312 โ”‚ 6        โ”‚ 1       โ”‚ 700   โ”‚ 3      โ”‚ 1     โ”‚ 1061971 โ”‚ 10    โ”‚ 59    โ”‚ 1     โ”‚ missing โ”‚ 1      โ”‚ 2      โ”‚ 2     โ”‚ 2     โ”‚ 2      โ”‚ 2      โ”‚ missing โ”‚ missing โ”‚ 2      โ”‚ missing โ”‚ missing โ”‚ 2      โ”‚
โ”‚ 2      โ”‚ P      โ”‚ 2013000000776 โ”‚ 6        โ”‚ 2       โ”‚ 1400  โ”‚ 3      โ”‚ 1     โ”‚ 1061971 โ”‚ 6     โ”‚ 58    โ”‚ 1     โ”‚ missing โ”‚ 1      โ”‚ 2      โ”‚ 2     โ”‚ 2     โ”‚ 2      โ”‚ 2      โ”‚ missing โ”‚ missing โ”‚ 2      โ”‚ missing โ”‚ missing โ”‚ 2      โ”‚
โ‹ฎ
โ”‚ 530507 โ”‚ P      โ”‚ 2017001530497 โ”‚ 9        โ”‚ 1       โ”‚ 307   โ”‚ 4      โ”‚ 15    โ”‚ 1011189 โ”‚ 23    โ”‚ 35    โ”‚ 1     โ”‚ missing โ”‚ 1      โ”‚ 2      โ”‚ 2     โ”‚ 2     โ”‚ 2      โ”‚ 2      โ”‚ missing โ”‚ missing โ”‚ 2      โ”‚ missing โ”‚ missing โ”‚ 2      โ”‚
โ”‚ 530508 โ”‚ P      โ”‚ 2017001533163 โ”‚ 9        โ”‚ 1       โ”‚ 306   โ”‚ 4      โ”‚ 15    โ”‚ 1011189 โ”‚ 6     โ”‚ 63    โ”‚ 4     โ”‚ 1965    โ”‚ 6      โ”‚ 2      โ”‚ 2     โ”‚ 2     โ”‚ 2      โ”‚ 2      โ”‚ missing โ”‚ missing โ”‚ 2      โ”‚ 2       โ”‚ missing โ”‚ 2      โ”‚
โ”‚ 530509 โ”‚ P      โ”‚ 2017001533953 โ”‚ 9        โ”‚ 1       โ”‚ 305   โ”‚ 4      โ”‚ 15    โ”‚ 1011189 โ”‚ 7     โ”‚ 68    โ”‚ 4     โ”‚ 1998    โ”‚ 6      โ”‚ 2      โ”‚ 2     โ”‚ 2     โ”‚ 2      โ”‚ 2      โ”‚ missing โ”‚ missing โ”‚ 2      โ”‚ 1       โ”‚ missing โ”‚ 2      โ”‚

Itโ€™s not super fast on my machine (I didnโ€™t measure), but it does the job in just a few minutes.

1 Like

Great :slight_smile:

I do think it is worth trying JuliaDB with DataValue in that case. My understanding is that the JuliaDB filter implementation is very similar to the Query.jl implementation. Query.jl uses DataValue in these cases, and I wouldnโ€™t be surprised if you get similar performance from JuliaDB if you also use DataValue there.

The filtering algorithm in JuliaDB iterates row, so if Query is much faster this is probably a Union{T, Missing} performance problem. Note that JuliaDBMeta has optimized macros for JuliaDB tables, so you could also try:

using JuliaDBMeta: @filter
@filter acs (:ESR == 1 || :ESR == 2) && (:WKHP > 40)

provided that acs is a distributed or in memory table from JuliaDB. This should perform a bit better as it is done in parallel on distributed data and only iterates fields that you need (so :ESR and :WKHP) rather than all the 262 fields you have.

2 Likes

I havenโ€™t yet delved into the world of distributed/parallel programming. Can someone provide the steps as to how I would load a very large CSV file as a distributed table and perform filtering operations on the data in parallel??

I do think it is worth trying JuliaDB with DataValue in that case. My understanding is that the JuliaDB filter implementation is very similar to the Query.jl implementation. Query.jl uses DataValue in these cases, and I wouldnโ€™t be surprised if you get similar performance from JuliaDB if you also use DataValue there.

I tried this and ended up with the same result : (

If you have a single large file there are two ways to go about what you want:

  • Read the file into JuliaDB with loadtable in one go and call Dagger.distribute on that table. The only question is whether your memory can take it. If loadtable fails, maybe consider TextParse.jl or CSV.jl which will yield a DataFrame which you can turn into a JuliaDB table.

  • Split the CSV file first. Write a short piece of code that splits your large file into many small files and use loadtables distributed functionality.

1 Like

Read the file into JuliaDB with loadtable in one go and call Dagger.distribute on that table.

Thanks so much, Iโ€™m going to give this a try!

1 Like