Really, really big arrays and OutOfMemoryError()

clibassi · November 21, 2019, 11:54pm

Hi all,

I’m a relative newbie to Julia and am trying to do some record linkage with it. As part of that, I need to bring two large datasets (1.8 million rows and 800K rows respectively) into an array so I can loop through them and compare the strings in those datasets. I began by attempting to initialize an empty array that can store all the comparisons (i.e. a 1.8m x 800k array). When I attempt to do this I get an OutOfMemoryError(). I have tried initializing it at a few different sizes, and only after reducing the size considerably am I able to create the array (examples below). I’m wondering if what I’m doing is just foolish or if there is another type of array I should be using to accomplish my goal. Many thanks!

baggepinnen · November 22, 2019, 12:00am

How much memory do you have? 18000*900000*3=49GB
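For scale, here is a quick check of that arithmetic alongside the full 1.8m x 800k matrix from the original post. It is only a sketch: `dense_bytes` is a made-up helper, and the one-byte element type (`UInt8`) is an assumption, chosen because that is what the 49GB figure implies.

```julia
# Bytes needed for a dense array with the given element type and dimensions.
# `dense_bytes` is a hypothetical helper written for this illustration.
dense_bytes(T::Type, dims::Integer...) = sizeof(T) * prod(Int64.(dims))

# The reduced size from the reply above, assuming one byte per element.
small = dense_bytes(UInt8, 18_000, 900_000, 3)

# The full comparison matrix from the original post, also one byte per element.
full = dense_bytes(UInt8, 1_800_000, 800_000)

println(small / 1e9, " GB")
println(full / 1e12, " TB")
```

Even at one byte per entry the full matrix is in the terabyte range, so a wider element type such as `Int64` would need eight times as much.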

baggepinnen · November 22, 2019, 12:03am

What comparison do you need to do? I think the solution has to be context dependent.

heliosdrm · November 22, 2019, 12:39am

Perhaps you could avoid creating such a large array at all. You might read the strings of your datasets from the source files one at a time (or in reasonably sized batches), make the comparison(s), and write the results to an output file incrementally, without keeping the whole datasets in memory.

File IO is pretty efficient in Julia, so don’t be wary of reading and writing files frequently. In cases like this it may be much faster than loading and handling big datasets in memory.
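A minimal sketch of that streaming idea, under some assumptions: the file names, the `stream_matches` function and the `similar_enough` comparison are all invented for illustration, and the smaller dataset is assumed to fit in memory as a plain vector of strings. Only the matching pairs are written out, so no comparison matrix is ever allocated.

```julia
# Hypothetical comparison; it stands in for whatever metric is actually needed.
similar_enough(a::AbstractString, b::AbstractString) = lowercase(a) == lowercase(b)

# Stream the larger file line by line, compare each line against the smaller
# dataset, and append matching (row, row) index pairs to an output file.
function stream_matches(path_a, path_b, path_out; keep = similar_enough)
    small = readlines(path_b)      # assumption: the smaller file fits in memory
    nmatches = 0
    open(path_out, "w") do out
        for (i, a) in enumerate(eachline(path_a))   # one line at a time
            for (j, b) in enumerate(small)
                if keep(a, b)
                    println(out, i, '\t', j)        # written incrementally
                    nmatches += 1
                end
            end
        end
    end
    return nmatches
end

# Toy usage with temporary files.
dir = mktempdir()
pa = joinpath(dir, "a.txt")
pb = joinpath(dir, "b.txt")
po = joinpath(dir, "out.tsv")
write(pa, "Alice\nBob\n")
write(pb, "bob\nCarol\nALICE\n")
n = stream_matches(pa, pb, po)
```

This keeps memory use flat, but it still performs every pairwise comparison, so the running time for the full datasets would remain very large unless the number of candidate pairs is reduced first.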

clibassi · November 24, 2019, 4:50pm

I’m running on a machine with 24GB RAM. And the comparison I am trying to do is Levenshtein distance. I think I may just approach it by doing some blocking to cut down the number of comparisons by a lot.
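As a rough sketch of what blocking could look like here: strings are grouped by a cheap key, and Levenshtein distance is computed only within a group. The key used below (lowercased first character), the `maxdist` threshold and the function names are assumptions for illustration, not something taken from the thread. The distance function is a plain two-row dynamic-programming version written in base Julia.

```julia
# Levenshtein distance, keeping only two rows of the DP table in memory.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))    # distances from the empty prefix of `a`
    curr = similar(prev)
    for i in eachindex(s)
        curr[1] = i
        for j in eachindex(t)
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution or match
        end
        prev, curr = curr, prev
    end
    return prev[end]
end

# Hypothetical blocking key: the lowercased first character.
block_key(s::AbstractString) = isempty(s) ? ' ' : lowercase(first(s))

# Compare only strings that share a blocking key; keep pairs within `maxdist`.
function blocked_matches(as, bs; maxdist = 2)
    blocks = Dict{Char, Vector{Int}}()
    for (j, b) in enumerate(bs)
        push!(get!(() -> Int[], blocks, block_key(b)), j)
    end
    matches = Tuple{Int, Int, Int}[]    # (index in as, index in bs, distance)
    for (i, a) in enumerate(as)
        for j in get(blocks, block_key(a), Int[])
            d = levenshtein(a, bs[j])
            d <= maxdist && push!(matches, (i, j, d))
        end
    end
    return matches
end

matches = blocked_matches(["smith", "jones"], ["smyth", "johnson", "smithe", "brown"])
```

The trade-off is that pairs falling into different blocks are never compared, so with this particular key a typo in the first character would be missed. A key should be chosen from whatever field is most reliable in the actual data.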

nilshg · November 24, 2019, 4:57pm

Old question of mine covering similar ground (but not well discoverable from the title):

Why do these two functions benchmark the same?

Sukera · November 24, 2019, 5:44pm

Your immediate problem of not having enough memory may be solved using memory mapping, but the question is whether you really need all those distances at the same time, and whether you can use problem-specific knowledge to avoid calculating them at all. Performance might be a problem though. For us to be able to give advice in that direction, we’re going to need some more information about what you’re doing with that data.
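For reference, a small sketch of memory mapping with the `Mmap` standard library. The file name, the toy dimensions and the `UInt8` element type are assumptions for illustration. The array is backed by a file on disk rather than by RAM, so it still needs that much disk space: at one byte per entry the full 1.8m x 800k matrix would be about 1.44 TB.

```julia
using Mmap

# Toy dimensions; the real matrix would be far larger.
path = joinpath(mktempdir(), "distances.bin")
nrows, ncols = 1_000, 500

# Create a file-backed matrix. With "w+" the file is grown to fit the array.
io = open(path, "w+")
D = Mmap.mmap(io, Matrix{UInt8}, (nrows, ncols))
D[3, 7] = 0x05          # writes go to the mapped file
Mmap.sync!(D)           # flush changes to disk
close(io)

# Map the same file again, read-only, and read the value back.
value = open(path, "r") do f
    D2 = Mmap.mmap(f, Matrix{UInt8}, (nrows, ncols))
    D2[3, 7]
end
```

Access patterns matter with this approach: Julia arrays are column-major, so filling the matrix column by column touches the file sequentially, while row-wise access jumps around the file and is likely to be much slower.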
