Storage format for lots of strings

Sorry to hijack this thread, but I have another problem with scale and was wondering if anyone has dealt with this before. I have roughly 180 billion strings (SMILES) that I would like to store and work with efficiently. I was originally thinking of PostgreSQL partitioned tables, but it’s not clear how to do the partitioning when dealing with strings. Any recommendations would be appreciated, since a bad design gets quite expensive with this many molecules. :joy:

Can you give an example of your use cases? E.g. do you expect to read a lot of them at once? How often do you expect to read them, and from how many different connections? Is there a natural grouping based on some property of the strings? Do you expect to upload a lot more regularly?

1 Like

This would be a one-time upload with the purpose of accessing the data frequently. Each string is a small molecule with properties that I would also like to calculate. The molecules have between 1 and 17 atoms, and the majority of the volume is in the larger molecules. I need to search this database often, looking for specific properties.

1 Like

I don’t know much about this domain, but the way I would approach the problem is by creating one method of mapping the strings to integers (e.g. a string_ids table). Text can take up a lot of space, so depending on your use case you might want to store the actual text offline and do lookups later.

Then I’d have another table that maps those ids to the properties you would search by. The second table could have columns like smile_id, num_atoms, etc. I would aggressively optimize the data types of this table to minimize storage space: with 180 billion rows the id needs a 64-bit integer (32 bits only covers about 4 billion), while an 8-bit integer is plenty for the atom count.
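
Just to make the sizes concrete, here is a minimal Julia sketch of such a record, with made-up field names (natoms, mw); the real set of searchable properties is up to you:

```julia
# Hypothetical compact record for the properties table.
# Int64 for the id: 180 billion rows exceed the 32-bit range (~4.3 billion).
# UInt8 for the atom count: values of 1–17 fit easily.
struct MoleculeRecord
    id::Int64       # key into the string_ids table
    natoms::UInt8   # number of atoms (1–17)
    mw::Float32     # e.g. molecular weight, if you search by it
end

# Stored column-wise, that is 13 bytes per molecule before compression,
# i.e. roughly 2.3 TB for 180 billion rows just for these three columns.
```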

Since you’re not uploading frequently, you can add a lot of indices and partition based on these columns. A good partitioning scheme will really depend on the domain and your access patterns; e.g. if you frequently look at molecules with the same number of atoms together, you could shard based on that.
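
Outside of Postgres, the same idea works at the file level. A rough sketch (using Arrow files purely as an example format; the file names and toy data are made up), with one file per atom count so a query that filters on atom count only touches one partition:

```julia
using Arrow

# Toy batch of molecules; natoms holds heavy-atom counts.
ids    = Int64[1, 2, 3]
smiles = ["O=C1CCCCCN1", "Nc1ccc(cc1)c2nc3ccc(O)cc3s2", "CCO"]
natoms = UInt8[8, 17, 3]

# One file (or one Postgres partition) per atom count.
for n in sort(unique(natoms))
    sel = natoms .== n
    Arrow.write("molecules_natoms_$(Int(n)).arrow",
                (id = ids[sel], smiles = smiles[sel]))
end
```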

1 Like

Could you give an example of typical strings? Are you just storing which atoms are present and in what quantities (à la "C8H10N4O2"), or is there more information in the string about the molecular structure? This may be the sort of thing for which you could devise a specialized binary encoding, which would make it much easier to query and to store.

Typical strings would be Nc1ccc(cc1)c2nc3ccc(O)cc3s2 and O=C1CCCCCN1, etc. The format is called SMILES and is a common way to represent small organic molecules. One downside is that there are several ways to represent the same molecule in SMILES, so maybe it should be converted to a canonical representation like InChI, which is also a string. The string actually has the 2D structure implicit in it.

I really like the idea of a specialized binary encoding. There are a few common ones, like "molecular fingerprints". I’m just not sure whether I lose information by using them.
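
Fingerprints are lossy by design (you cannot reconstruct the structure from one), so they would be a search accelerator stored next to the canonical SMILES/InChI rather than a replacement for it. Here is a minimal sketch of how compactly a fixed-length fingerprint can be stored and compared, assuming a made-up 1024-bit width and leaving the fingerprinting itself to whatever cheminformatics tool you use:

```julia
# Pack a 1024-bit fingerprint into 16 UInt64 words (128 bytes per molecule).
const FP_WORDS = 1024 ÷ 64

# Hypothetical input: `bits` is whatever your fingerprinting tool returns,
# represented here as a Vector{Bool} of length 1024.
function pack_fingerprint(bits::AbstractVector{Bool})
    @assert length(bits) == FP_WORDS * 64
    words = zeros(UInt64, FP_WORDS)
    for (i, b) in enumerate(bits)
        if b
            w, off = divrem(i - 1, 64)
            words[w + 1] |= UInt64(1) << off
        end
    end
    return words
end

# Tanimoto similarity on packed fingerprints, the usual search primitive.
function tanimoto(a::Vector{UInt64}, b::Vector{UInt64})
    inter = sum(count_ones(x & y) for (x, y) in zip(a, b))
    uni   = sum(count_ones(x | y) for (x, y) in zip(a, b))
    return uni == 0 ? 1.0 : inter / uni
end
```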

Arrow.jl is again a good choice, because it stores strings in an "offset list" layout.

Basically, all the characters are stored in one contiguous buffer, and a separate vector of offsets records where each string starts and ends. This gives good storage efficiency and runtime performance.

The in-memory analogue is ArraysOfArrays.jl.
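
For example (file name made up), writing a SMILES column to an Arrow file and reading it back:

```julia
using Arrow

smiles = ["Nc1ccc(cc1)c2nc3ccc(O)cc3s2", "O=C1CCCCCN1"]

# On disk this becomes one contiguous byte buffer plus an offsets vector,
# rather than billions of individually allocated strings.
Arrow.write("smiles.arrow", (smiles = smiles,))

tbl = Arrow.Table("smiles.arrow")   # lazily loads / memory-maps the file
tbl.smiles[2]                       # "O=C1CCCCCN1"
```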

2 Likes

If your workloads are related to time series, as in the original thread you posted in, you deal with them on a daily basis, and you want a standard open-source solution, I would stand by my original recommendation of QuestDB. With its ability to store strings as VARCHAR, support for filesystem-wide compression, and optional high-performance distributed filesystems and plotting solutions, it offers an efficient setup. Plus, it won’t turn you into a tensor trying to manage it all.

However, if your data isn’t time-series-related and you’re not seeking pure bliss but are still interested in a database, I’d suggest looking into ClickHouse instead of the usual suspects like Postgres. Alternatively, if you’re not into columnar formats like Arrow or Parquet but are exploring specialized binary encodings and crave something unique, consider giving ADIOS2 a try, or even a weekend trip to Redwood Shores. It may bring exciting possibilities and a touch of magic to data management, for standard and non-standard workloads alike. Let me add a SMILE. :- )

2 Likes