Vector Database in Julia: bad idea?

Floating an idea:

Has anybody tried to build a vector database (think Pinecone or Weaviate) in pure Julia?

If not, is there a specific reason why it’s a bad idea?


I don’t know of any Julia alternative to these vector databases, but several approximate nearest-neighbor algorithms exist in the ecosystem. For instance,

https://github.com/sadit/SimilaritySearch.jl

Its SearchGraph structure is similar to HNSW but adds auto-tuning (easy/automatic hyperparameter setup). It supports incremental construction (insertions) and several similarity-based operations (single and batched nearest-neighbor search, allknn, etc.).
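From the README, basic usage looks roughly like this (a sketch; exact names may differ between versions):

```julia
using SimilaritySearch

# index 100k random 8-dimensional vectors
db = MatrixDatabase(rand(Float32, 8, 100_000))
G = SearchGraph(; dist=SqL2Distance(), db)
index!(G)  # the index also supports incremental insertions

# batched k-nn search: 10 neighbors for each of 100 queries
queries = MatrixDatabase(rand(Float32, 8, 100))
knns, dists = searchbatch(G, queries, 10)
```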

In particular, I think that Julia is a very interesting language for creating a similarity search engine, e.g., it gives access to many metric functions (Distances.jl) and fast user-defined distance functions.
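For instance, a user-defined metric is just a callable type (a sketch for illustration only; Distances.jl already ships WeightedEuclidean for this particular case):

```julia
using Distances

# a user-defined weighted Euclidean distance; subtyping Metric means
# evaluate(d, a, b) falls back to the call, so Distances.jl-aware code accepts it
struct WeightedL2 <: Metric
    w::Vector{Float32}
end

(d::WeightedL2)(a, b) = sqrt(sum(d.w[i] * (a[i] - b[i])^2 for i in eachindex(a)))

d = WeightedL2(Float32[1, 2, 4])
d(Float32[0, 0, 0], Float32[1, 1, 1])  # sqrt(1 + 2 + 4) ≈ 2.6458
```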

It would be interesting to know the minimal requirements to create a vector database and see if achieving it with a reasonable effort is possible.

Totally agree with you. Both in terms of extensibility (in Julia it becomes trivial to define ad hoc algorithms for the embeddings) and performance, Julia would seem a natural choice.

Instead, I see most of the vector DBs going for Go (pun not intended).

Maybe I’ll give it a go. A less-explored use case to start with would be storing the nodes of a graph as embedded points in a latent space.

But I’m a mathematician with zero experience in building databases, so any help and suggestions are welcome :hugs:

I have no idea what’s behind a vector database. However, my experience may help :slight_smile:

Some time ago, we did not have a pure-Julia SGP4 orbit propagator. We had one package that wrapped the Python library (GitHub - crbinz/SGP4.jl: Julia wrapper of the Python SGP4 library). It worked pretty well! However, after I reimplemented SGP4 in Julia, the speed gain was enormous (at least with the Julia version back in the day). It reached the speed of the C SGP4 library, which is just amazing.

Hence, we may not have a pure-Julia implementation of a vector database simply because nobody has needed to go through the endeavor. However, if you want to try, it can really pay off big time :slight_smile:


FYI: There is HNSW.jl available, and HNSW is, I believe, the key data structure behind Pinecone. I’m rather new to this, so are you saying the other package is better?

It’s certainly possible, and I’m not sure it’s a “bad” idea, just not needed; is there a real need to reinvent the wheel in Julia? There is already Pinecone.jl to access that DB. I’m assuming that database doesn’t run in-process, which is usually a good thing. With in-process databases written in C, such as SQLite (and DuckDB), you are taking a security risk, or really trusting that the DB is bug-free, or at least has no memory-corrupting bugs. If you build one in Julia (and it runs in-process), then bounds-checking by default makes it less of a risk, provided the DB doesn’t disable bounds checks (or, if it does, doesn’t get them wrong). But it goes both ways: all your other in-process code lives under the same constraint. [Well, you can enable bounds-checking globally, and then you would be safe, unless you include e.g. some C code through JLLs.]
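For illustration: out-of-bounds access in Julia throws instead of silently corrupting memory, unless the code explicitly opts out, and even the opt-out can be overridden from the command line:

```julia
v = zeros(3)
v[4]                    # throws BoundsError; no silent memory corruption

f(v) = @inbounds v[4]   # explicit opt-out: undefined behavior, as in C
# starting Julia with `--check-bounds=yes` re-enables checks even inside @inbounds
```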

While Googling I got this result:

Pinecone.jl is a Julia API for the Pinecone vector database. ML Q&A: The Q&A tool takes discussions and docs from some of the best Python ML libraries and collates their …

Clicking the link, I no longer see the Julia package mentioned on that page, but it was at some point (and maybe still is elsewhere), so I assume it’s good. I see the package author mentioned it here, and he seems closely connected with the company:

This is an example of trying out VGG from within DeepFace, which has a vector length of 2,622

I’ve seen elsewhere that GPT-3’s embeddings have over 1,500 dimensions (1,536 for OpenAI’s ada embedding model); I’m not sure about the latest ChatGPT/GPT-4, but people do go higher than that. Is there some inherent limit in the database, or a reason to go much higher? I’m just curious what the highest number of dimensions is that people use, and the pros and cons of many dims.

The algorithm behind SimilaritySearch.jl (SearchGraph) is competitive and, in some cases, improves on FAISS’s HNSW. Note that we also compared it with Google’s SCANN, which is very fast.

In any case, the auto-tuning gives an advantage to inexperienced users: the user only declares the objective recall, and the algorithm will try to achieve and maintain it over the life of the index. Note that always declaring maximum recall is not a good deal, since it trades speed against accuracy. The same characteristic reduces the indexing time, since the index is only tuned once.
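From memory, the README shows something like the following, re-using the index `G` from the earlier sketch (API names may have changed across versions):

```julia
# re-tune the SearchGraph to target roughly 90% recall; the index keeps
# this speed/accuracy trade-off as it grows
optimize!(G, MinRecall(0.9))
```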

Perhaps a new vector database is too much work, but maybe not if we isolate functionality and reuse the ecosystem. Perhaps most people only need to store datasets in HDF5 and then use a regular similarity search algorithm. I have never used Pinecone or Weaviate, but I benchmark frequently against FAISS, SCANN, and others, so I don’t have the complete picture here.


On the other hand, maybe people want to use/consume it from other languages. Therefore, a tutorial on preparing data, storing vectors in HDF5 (or something similar), indexing, and exposing search functions through a REST/JSON API could be enough; see the sketch below.
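A minimal sketch of that pipeline, assuming hypothetical file and dataset names and the SimilaritySearch.jl API from above:

```julia
using HDF5, HTTP, JSON3, SimilaritySearch

# load a d × n Float32 matrix of precomputed vectors (hypothetical names)
X = h5read("embeddings.h5", "vectors")
G = SearchGraph(; dist=SqL2Distance(), db=MatrixDatabase(X))
index!(G)

# expose k-nn search as a tiny JSON endpoint: POST a query vector, get ids back
HTTP.serve("127.0.0.1", 8081) do req
    q = JSON3.read(req.body, Vector{Float32})
    knns, dists = searchbatch(G, MatrixDatabase(reshape(q, :, 1)), 10)
    HTTP.Response(200, JSON3.write((ids=vec(knns), dists=vec(dists))))
end
```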

It could be interesting to have an all-in-one solution, but as others pointed out, it could be much more complicated, perhaps requiring a whole team working full-time, and that needs a clearer motivation.

Thanks for mentioning this library. Have you tried composing it with, e.g., GitHub - baggepinnen/DynamicAxisWarping.jl: Dynamic Time Warping (DTW) and related algorithms in Julia, at Julia speeds, in order to search and cluster time series data?

No, I haven’t used DTW (nor the package). It would be interesting to see whether we can take advantage of composing them. I understand DTW is not a proper metric, so some tricks are needed. Perhaps using the ExhaustiveSearch struct (brute-force search instead of indexing) with multithreading would be enough to see whether we can obtain some interesting results; something like the sketch below.
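A sketch of what that composition might look like (the wrapper type is mine, and API details are from memory):

```julia
using SimilaritySearch, DynamicAxisWarping
using Distances: SemiMetric

# DTW is not a proper metric, so skip the index and brute-force it;
# wrapping dtw's cost in a SemiMetric makes it usable as a distance
struct DTWCost <: SemiMetric end
(::DTWCost)(a, b) = dtw(a, b)[1]  # dtw returns (cost, alignment indices)

series = VectorDatabase([rand(Float32, 100) for _ in 1:1_000])
E = ExhaustiveSearch(; dist=DTWCost(), db=series)
knns, dists = searchbatch(E, VectorDatabase([rand(Float32, 100)]), 5)
```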

Of course, I was not exhaustive, but I found the following paper, which shows the benefit of using metric indexes for the task.

Thanks for the thoughts and the paper. That’s something I had overlooked; it changes the picture.

Is it a prerequisite to have good embedding models runnable in Julia? I guess technically the database can just take vectors computed by other models, but practically I can’t imagine Julia beating C++/Rust at this type of narrow, user-agnostic task (a few numbers in, a few numbers out). So if we have a good stack of other things to go around the vector DB, it would make more sense to choose Julia to begin with.

I’m not sure if this would suit your purposes, but bringing LanceDB into Julia might be faster, and it integrates with other open-source tools through Arrow. It’s written in Rust and is an embedded database rather than a service, but it only has Python and JavaScript clients right now.

Is LanceDB serverless like SQLite?

Yes. The format is explained here.

It looks interesting. For a lot of things it might not have an advantage over relational DBs, but I can see where it could be useful for machine learning. I’m wondering if Julia should have the ability to call Rust (and possibly Zig) libraries the same way it can call C or Fortran libraries.

Well, calling Rust should be possible just with ccall(), right?
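A hypothetical sketch (crate and path names made up): build the Rust crate as a cdylib exporting a C ABI, and Julia treats it like any C library.

```julia
# Rust side, compiled with crate-type = ["cdylib"]:
#
#     #[no_mangle]
#     pub extern "C" fn add(a: i64, b: i64) -> i64 { a + b }
#
# Julia side:
const librust = "target/release/libmyrustlib"  # hypothetical path
rust_add(a, b) = ccall((:add, librust), Int64, (Int64, Int64), a, b)

rust_add(2, 3)  # 5
```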

OK, I’ll try it and see.

I’m pretty interested in this idea. SimilaritySearch.jl seems like a pretty good component of a vector DB; what I’m struggling to find a nice solution for is getting text embeddings and running queries with them.


Just saw this one. I’m slowly trying to build a graph + vector DB from the ground up.

This is relatively well solved by PromptingTools.jl. I’ve been able to use Ollama to generate my own embeddings, but it also supports OpenAI, nomic, etc.
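For reference, getting an embedding is roughly a one-liner per the PromptingTools docs (an OpenAI key, or a running Ollama server plus the matching provider schema, is assumed):

```julia
using PromptingTools

# defaults to OpenAI's embedding endpoint (reads OPENAI_API_KEY);
# passing a provider schema as the first argument switches to e.g. Ollama
msg = aiembed("Julia is fast and composable")
emb = msg.content  # the embedding vector, ready to feed a similarity index
```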

I’m mostly messing around with the storage layer right now, i.e., how to serialize relationships, nodes, and edges. I have a very simple vector search built in, but SimilaritySearch.jl looks awesome, and I hadn’t noticed it before!

If anyone’s interested, I’m considering scaling it up as a potential commercial enterprise and/or keeping it as a simple learning project. Happy to take people on; I suspect it’d make a good open-source platform.

The goals are:

  • Implement the GQL standard (similar to Cypher)
  • Support a first-class Julia interface, i.e. you should be able to use it with no query language at all
  • High horizontal scalability
  • Ultra-simple, no-nonsense, distributed vector search
  • Do it from scratch because I like learning things that way

I do think that Julia is a very good candidate for this kind of thing: making the internals of the database operate on top of the type system seems super flexible, especially for heterogeneously typed nodes/edges/etc.
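Purely to illustrate the no-query-language idea, a hypothetical sketch (every name here is made up):

```julia
# heterogeneously typed nodes and edges as plain Julia structs
struct Person
    name::String
    emb::Vector{Float32}
end

struct Knows
    since::Int
end

g   = GraphDB()                                  # hypothetical store
ada = add_node!(g, Person("Ada", rand(Float32, 128)))
bo  = add_node!(g, Person("Bo",  rand(Float32, 128)))
add_edge!(g, ada, bo, Knows(2021))

# "queries" are ordinary Julia: filter on node/edge types via dispatch,
# and vector search is just another function call
friends = [dst(e) for e in out_edges(g, ada) if payload(e) isa Knows]
nearest(g, Person, rand(Float32, 128); k=5)      # hypothetical vector search
```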


Count me in
