Vector Database in Julia: bad idea?

Floating an idea:

Has anybody tried to build a vector database (think Pinecone or Weaviate) in pure Julia?

If not, is there a specific reason why it’s a bad idea?


I don’t know of a direct alternative to these vector databases, but several approximate nearest-neighbor packages exist around the ecosystem. For instance,

https://github.com/sadit/SimilaritySearch.jl

Its SearchGraph structure is similar to HNSW but adds auto-tuning (easy/automatic hyperparameter setup); it supports incremental construction (insertions) and several similarity-based operations (single and batched nearest-neighbor search, allknn, etc.).
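A rough sketch of basic usage, going by the package’s README (exact signatures may differ between versions):

```julia
using SimilaritySearch

# Toy dataset: 64-dimensional vectors, one per column
db = MatrixDatabase(rand(Float32, 64, 10_000))
queries = MatrixDatabase(rand(Float32, 64, 5))

# HNSW-like graph index with auto-tuned hyperparameters
G = SearchGraph(; dist=SqL2Distance(), db)
index!(G)  # batch construction; incremental insertions are also supported

# 10 approximate nearest neighbors per query
knns, dists = searchbatch(G, queries, 10)
```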

In particular, I think that Julia is a very interesting language for building such a similarity search engine, e.g., it offers many metric functions out of the box (Distances.jl) and fast user-defined distance functions.
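For instance, a user-defined distance is just a few lines of ordinary Julia and still compiles to tight machine code. A sketch (the metric here is made up for illustration; Distances.jl already ships a weighted squared Euclidean):

```julia
using Distances

# Hypothetical user-defined metric: weighted squared Euclidean
struct MyWeightedSqEuclidean{T} <: Distances.SemiMetric
    w::Vector{T}
end

# Implementing the callable makes it usable like the built-in metrics
function (d::MyWeightedSqEuclidean)(a, b)
    s = zero(eltype(d.w))
    @inbounds @simd for i in eachindex(a, b, d.w)
        s += d.w[i] * abs2(a[i] - b[i])
    end
    return s
end

dist = MyWeightedSqEuclidean(rand(8))
dist(rand(8), rand(8))
```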

It would be interesting to know the minimal requirements to create a vector database and see if achieving it with a reasonable effort is possible.

Totally agree with you. Both in terms of extensibility (in Julia it becomes trivial to define ad hoc algorithms for the embeddings) and performance, Julia would seem a natural choice.

Instead, I see most of the vector DBs going for Go (pun not intended).

Maybe I’ll give it a go. A less-explored use case to start with would be storing the nodes of a graph as embedded points in a latent space.

But I’m a mathematician with zero experience in building databases, so every help and suggestion is welcome :hugs:

I have no idea what’s behind a vector database. However, my experience might help :slight_smile:

Some time ago, we did not have a pure-Julia SGP4 orbit propagator. We had a package that interfaced with a Python library (GitHub - crbinz/SGP4.jl: Julia wrapper of the Python SGP4 library). It worked pretty well! However, after I reimplemented SGP4 in Julia, the speed gain was enormous (at least with the Julia version back in the day). It reached the speed of the SGP4 C library, which is just amazing.

Hence, we may not have a pure-Julia implementation of a vector database simply because nobody has needed to go through the endeavor. However, if you want to try, it can really pay off big time :slight_smile:


FYI: There is HNSW.jl available, and HNSW is, I believe, the key algorithm behind Pinecone. I’m rather new to this, so are you saying the other package is better?
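For reference, HNSW.jl usage looks roughly like this, following its README (I may be missing newer options):

```julia
using HNSW

data = [rand(Float32, 10) for _ in 1:10_000]
hnsw = HierarchicalNSW(data; efConstruction=100, M=16, ef=50)
add_to_graph!(hnsw)  # build the multilayer graph

queries = [rand(Float32, 10) for _ in 1:5]
idxs, dists = knn_search(hnsw, queries, 10)  # 10 approximate NNs per query
```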

It’s certainly possible, and I’m not sure it’s a “bad” idea, just not needed; is there a real need to reinvent the wheel in Julia? There is already Pinecone.jl to access that DB. I’m assuming that database doesn’t run in-process, which is usually a good thing. With in-process databases written in C or C++, such as SQLite and DuckDB, you are taking a security risk, or really trusting that the DB is bug-free, or at least has no memory-corrupting bugs. If you do this in Julia (and it’s in-process), it’s less of a risk because of bounds-checking by default, provided the DB doesn’t disable it (or, if it does, doesn’t screw up), but it goes both ways: all your other code then lives under the same constraint. [Well, you can enable bounds-checking globally, and then you would be safe, unless you include e.g. some C code via JLLs.]
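A toy illustration of what I mean (nothing DB-specific):

```julia
v = zeros(3)
# v[4] = 1.0  # with default bounds-checking: a BoundsError, not memory corruption

# Hot loops often opt out locally for speed:
function sum_unchecked(x)
    s = 0.0
    @inbounds for i in 1:length(x)  # checks elided; caller must guarantee bounds
        s += x[i]
    end
    return s
end

sum_unchecked(v)

# Starting Julia with `--check-bounds=yes` overrides every @inbounds,
# restoring full safety at some speed cost.
```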

While Googling I got this result:

Pinecone.jl is a Julia API for the Pinecone vector database. […] The Q&A tool takes discussions and docs from some of the best Python ML libraries and collates their …

Clicking the link, I no longer see the Julia package mentioned on that page, but at least it was there (and maybe still is elsewhere), so I assume it’s good. I see the package author mentioned it here, and he seems closely connected with the company:

This is an example of trying out VGG from within DeepFace, which has a vector length of 2,622

I’ve seen elsewhere that GPT-3 embeddings have over 1500 dimensions; not sure about the latest ChatGPT/GPT-4, but at least people go higher than that. Is there some inherent limit in the database, or a reason to go much higher? I’m just curious what the highest number of dimensions people use is, and what the pros and cons of many dims are.

The algorithm behind SimilaritySearch.jl (SearchGraph) is competitive and, in some cases, improves on FAISS’s HNSW. Note that we also compared it with Google’s SCANN, which is very fast.

In any case, the auto-tuning gives an advantage to inexperienced users: the user only declares the objective recall, and the algorithm will try to achieve and maintain it throughout the life of the index. Note that always declaring maximum recall is not a good deal, since there is a trade-off between speed and accuracy. The same characteristic reduces indexing time, since the index only needs to be built once.
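In code it looks roughly like this (the exact names may vary a bit between versions of the package):

```julia
using SimilaritySearch

db = MatrixDatabase(rand(Float32, 32, 50_000))
G = SearchGraph(; dist=SqL2Distance(), db)
index!(G)

# Declare the objective recall; the auto-tuner adjusts the search
# hyperparameters to achieve and maintain it, instead of always
# paying for maximum recall
optimize!(G, MinRecall(0.9))
```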

Perhaps a new vector database is too much work, but maybe not if we isolate functionality and reuse the ecosystem. Perhaps most people only need to store datasets in HDF5 and then use a regular similarity search algorithm. I have never used Pinecone or Weaviate, but I benchmark frequently against FAISS, SCANN, and others, so I don’t have the complete picture here.
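For example, persisting embeddings with HDF5.jl is a one-liner each way (a sketch with made-up file and dataset names):

```julia
using HDF5

X = rand(Float32, 128, 100_000)   # one embedding per column
h5write("embeddings.h5", "X", X)  # persist to disk

Y = h5read("embeddings.h5", "X")  # reload later and build any index over it
```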


On the other hand, maybe people want to use/consume it from other languages. Therefore, a tutorial on preparing data, storing vectors in HDF5 (or something similar), indexing, and exposing search functions with a REST/JSON API could be enough.
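A minimal sketch of such a service with HTTP.jl and JSON3.jl (brute-force search for brevity; a real deployment would put an ANN index behind the handler):

```julia
using HTTP, JSON3

# Toy in-memory store: 128-dim random vectors standing in for real embeddings
const DB = [rand(Float32, 128) for _ in 1:10_000]

sqdist(a, b) = sum(abs2, a .- b)

# Brute-force k-NN; swap in an ANN index for real workloads
function knn(q, k)
    d = [sqdist(q, v) for v in DB]
    ids = partialsortperm(d, 1:k)
    return [(; id, dist = d[id]) for id in ids]
end

function handler(req::HTTP.Request)
    body = JSON3.read(req.body)  # expects {"query": [...], "k": 10}
    q = Float32.(body.query)
    k = get(body, :k, 10)
    return HTTP.Response(200, JSON3.write(knn(q, k)))
end

HTTP.serve(handler, "127.0.0.1", 8080)
```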

It could be interesting to have an all-in-one solution, but as others pointed out, it can be much more complicated and would perhaps require a whole team working full-time, which calls for a clearer motivation.

Thanks for mentioning this library. Have you tried composing it with e.g. GitHub - baggepinnen/DynamicAxisWarping.jl: Dynamic Time Warping (DTW) and related algorithms in Julia, at Julia speeds, in order to search and cluster time series data?

No, I haven’t used DTW (nor the package). It would be interesting to see if we can take advantage of composing them. I understand that DTW is not a proper metric, so some tricks are needed. Perhaps using the ExhaustiveSearch struct (brute-force search instead of indexing) with multithreading could be enough to see if we can obtain some interesting results.
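Untested, but if DynamicAxisWarping’s DTW type implements the Distances.jl interface (as I believe it does), the composition might be as simple as:

```julia
using SimilaritySearch, DynamicAxisWarping

# 500 toy univariate time series of length 100
series = VectorDatabase([rand(Float32, 100) for _ in 1:500])
queries = VectorDatabase([rand(Float32, 100) for _ in 1:5])

dist = DTW(radius=10)  # DTW constrained to a Sakoe-Chiba band
E = ExhaustiveSearch(; db=series, dist)  # brute force: no metric properties assumed

knns, dists = searchbatch(E, queries, 3)  # should use the available threads
```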

Of course, my search was not exhaustive, but I found the following paper, which shows the benefit of using metric indexes for the task.

Thanks for the thoughts and the paper. That’s something I overlooked; it changes the picture.

Is it a prerequisite to have a good embedding model runnable in Julia? I guess technically it could just take vectors computed by other models, but practically I can’t imagine Julia beating C++/Rust at this kind of narrow, user-agnostic task (a few numbers in, a few numbers out). So if we had a good stack of other things to go around the vector DB, it would make more sense to choose Julia to begin with.

I’m not sure if this would suit your purposes, but bringing LanceDB into Julia might be faster, and it integrates with other open-source tools through Arrow. It’s written in Rust and is an embedded database rather than a service, but it only has Python and JavaScript clients now.

Is LanceDB serverless like SQLite?

Yes. The format is explained here.

It looks interesting. For a lot of things it might not have an advantage over relational DBs, but I can see where it could be useful for machine learning. I’m wondering if Julia should have the ability to call Rust (and possibly Zig) libraries, the same way it can call C or Fortran libraries.

Well, calling Rust should be possible just with ccall(), right?
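Something along these lines should work (hypothetical library and function names):

```julia
# Rust side, built as crate-type = ["cdylib"]:
#   #[no_mangle]
#   pub extern "C" fn add(a: i32, b: i32) -> i32 { a + b }

const librust = "target/release/libmyrust"  # hypothetical path

add_rs(a, b) = ccall((:add, librust), Cint, (Cint, Cint), a, b)

println(add_rs(2, 3))  # -> 5
```

Anything exposed with a C ABI (Rust’s extern "C", or Zig’s export fn) is callable this way; no special Rust support is needed.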

OK, I’ll try it and see.

I’m pretty interested in this idea. SimilaritySearch.jl seems like a pretty good component of a vector DB; what I’m struggling to find a nice solution for is getting text embeddings and running queries with them.