Author:

Alain Rodriguez

Published on:

September 18, 2024

Apache Cassandra: Is it a good database for AI? – Part 2


Introduction


In the first part of the article, we explored the general requirements of AI-specific databases. Now, let’s evaluate how well Apache Cassandra 5.0, the popular open source NoSQL database used by leading tech companies like Apple, Netflix, and Spotify, meets these demands.

Specifically, we’ll delve into Cassandra’s capabilities, strengths, and limitations to determine if it is suitable for modern AI applications.


Is Apache Cassandra a Good Vector Database?


Apache Cassandra is a mature database that has been around for over 15 years. It’s widely recognised for its stability and performance, and the number of bugs has dropped significantly over time.


It is also a popular choice among distributed NoSQL databases in production around the world. Cassandra has been adopted across various industries, from IT and advertising to healthcare, transport and finance, proving its reliability in critical use cases. Its design, combined with contributions from major companies and skilled professionals, has made Cassandra one of the most performant distributed databases available today, meeting the needs of modern applications.


Scalability: Designed for linear scalability from the start, Cassandra excels at handling large-scale deployments. As the number of nodes in a cluster increases, so does its throughput. Clusters with 500 to 1,000 nodes are not uncommon, and with recent improvements, even larger setups are possible. Cassandra’s internode communications have become asynchronous, improving its scalability further. 


Currently, most Cassandra setups require at least three nodes, and careful planning to manage throughput variations due to the system’s limited ability to dynamically adapt. However, Cassandra 5.0 aims to introduce significant improvements in elasticity, including easier and quicker autoscaling. These improvements will allow Cassandra to better support smaller setups and applications with fluctuating throughput, reducing the system’s inertia and improving its adaptability. This is particularly beneficial for smaller companies or those with variable workloads, as it will not only enhance usability but also lead to significant cost reductions.


Reliability: Cassandra’s architecture eliminates single points of failure, with all nodes being equivalent. Properly configured, Cassandra replicates data sufficiently to ensure no loss of availability, even if a node fails or a rack goes down. Some clusters have operated for months with downed nodes without users noticing, showcasing Cassandra’s robustness. Additionally, Cassandra requires no downtime for updates or changes, including upgrades, ensuring high availability. For example, DataStax, with its managed Cassandra service ‘Astra’, guarantees between 99.9% and 99.999% availability.
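To put the availability figures above in perspective, here is a small arithmetic sketch that converts an availability percentage into the downtime it allows per year. It only illustrates the numbers quoted above; it assumes a 365-day year and says nothing about how any provider actually measures availability.

```python
# Convert an availability percentage into the downtime it allows per year.
# Assumption: a 365-day year; real SLAs may use other measurement windows.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


for availability in (99.9, 99.999):
    minutes = allowed_downtime_minutes(availability)
    print(f"{availability}% availability allows about {minutes:.1f} minutes of downtime per year")
```

The gap between the two ends of the range is large: roughly a working day per year at 99.9%, against a few minutes at 99.999%.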


Performance: Cassandra excels in write operations, completing them in microseconds with near-optimal efficiency, requiring no disk access beyond the commit log’s sequential write. This makes it one of the most efficient systems for writes on current hardware. Read performance, while generally efficient, can vary depending on factors such as cluster load, data size, and hardware. Typically, read times range from a millisecond or less to a few tens of milliseconds, which is fast enough for most use cases, including those in AI.


Cassandra is also highly efficient in other important areas:

  • Security: With SSL encryption available for all communications and built-in authorisation and authentication features, Cassandra meets the security demands of the world’s largest companies. Auditing features are also available.
  • Cost effectiveness: Cassandra can run on commodity hardware, making it an affordable option even for large datasets. It uses available resources and scales linearly, requiring minimal personnel to manage even millions or billions of records.
  • Ease of use: Cassandra’s peer-to-peer architecture simplifies its use by treating all nodes as equals. However, the learning curve can be steep, and some operations require significant operator knowledge. Open-source solutions like ‘tlp-medusa’ for backups and ‘tlp-repairs’ for anti-entropy repairs, along with publicly available dashboards and commercial tools like DataStax’s OpsCenter or AxonOps, have made managing Cassandra easier over time.

These features contribute to Cassandra’s overall popularity and make it a solid choice for many applications. But how does it perform when it comes to AI-specific tasks, particularly managing and searching embedding vectors?

Vector Capabilities in Cassandra


Vector Type Available: Cassandra 5.0 introduces a new vector data type, enabling the storage and manipulation of vector data. This feature is crucial for AI applications that rely on vector embeddings. The syntax, along with further examples, is described in the documentation:
https://cassandra.apache.org/doc/latest/cassandra/reference/vector-data-type.html
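As a minimal sketch of what this looks like in practice, the snippet below holds two CQL statements in Python strings, ready to be passed to whichever driver you use. The `vector<float, N>` type is the Cassandra 5.0 syntax. The `demo` keyspace, the `products` table, the `description` column and the 5-dimension size are assumptions made up for illustration.

```python
# Hypothetical schema: keyspace "demo", table "products" and the embedding
# size are illustrative assumptions, not taken from the article.
EMBEDDING_DIMENSIONS = 5

# A table with a fixed-size vector column holding float embeddings.
CREATE_TABLE_CQL = f"""
CREATE TABLE IF NOT EXISTS demo.products (
    id uuid PRIMARY KEY,
    description text,
    item_vector vector<float, {EMBEDDING_DIMENSIONS}>
);
""".strip()

# Vector values are written as a bracketed list with exactly N components.
INSERT_CQL = """
INSERT INTO demo.products (id, description, item_vector)
VALUES (uuid(), 'red running shoes', [0.1, 0.15, 0.3, 0.12, 0.05]);
""".strip()

print(CREATE_TABLE_CQL)
print(INSERT_CQL)
```

Real embeddings usually have hundreds or thousands of dimensions; the size is fixed when the column is declared.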


Vector Searches:
Cassandra 5.0 also adds vector search capabilities. To enable vector searches, you create an index on a vector column, optionally specifying the similarity function to use.

The similarity function can be: DOT_PRODUCT, COSINE, or EUCLIDEAN, which are the most common operations for similarity searches on embedding vectors.
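The index creation can be sketched as follows. The helper below builds the CQL statement as a Python string and rejects any similarity function other than the three listed above. `USING 'sai'` selects a Storage Attached Index, covered later in this article. The helper itself, the index name and the `demo.products` table are hypothetical names used for illustration.

```python
# Similarity functions accepted by the index, as listed in the text.
ALLOWED_SIMILARITY_FUNCTIONS = ("DOT_PRODUCT", "COSINE", "EUCLIDEAN")


def build_vector_index_cql(index_name, table, column, similarity_function="COSINE"):
    """Return a CREATE INDEX statement for a vector column (hypothetical helper)."""
    function = similarity_function.upper()
    if function not in ALLOWED_SIMILARITY_FUNCTIONS:
        raise ValueError(f"unsupported similarity function: {similarity_function}")
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} "
        f"ON {table} ({column}) USING 'sai' "
        f"WITH OPTIONS = {{'similarity_function': '{function}'}};"
    )


# Index and table names are illustrative assumptions.
print(build_vector_index_cql("item_vector_idx", "demo.products", "item_vector", "dot_product"))
```

Validating the function name on the client side is a design choice of this sketch; the database would reject an unknown value anyway.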

When selecting data, you can specify the similarity function, the vector column, and the vector to compare it with.
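A query of this kind can be sketched as below. The helper formats a Python list as a CQL vector literal and builds a `SELECT` that orders rows with `ANN OF` and also returns a similarity score through one of the `similarity_*` CQL functions. The helper names, the `id` column and the `demo.products` table are hypothetical.

```python
# CQL scoring functions, keyed by the similarity name used in this article.
SCORE_FUNCTIONS = {
    "COSINE": "similarity_cosine",
    "DOT_PRODUCT": "similarity_dot_product",
    "EUCLIDEAN": "similarity_euclidean",
}


def format_vector_literal(vector):
    """Render a sequence of numbers as a CQL vector literal, e.g. [0.1, 0.2]."""
    return "[" + ", ".join(repr(float(x)) for x in vector) + "]"


def build_ann_query(table, column, query_vector, limit=3, similarity_function="COSINE"):
    """Return a SELECT that fetches the nearest neighbours of query_vector (hypothetical helper)."""
    score_function = SCORE_FUNCTIONS[similarity_function.upper()]
    literal = format_vector_literal(query_vector)
    return (
        f"SELECT id, {score_function}({column}, {literal}) "
        f"FROM {table} ORDER BY {column} ANN OF {literal} LIMIT {limit};"
    )


# Table and column names are illustrative assumptions.
print(build_ann_query("demo.products", "item_vector", [0.1, 0.15, 0.3, 0.12, 0.05]))
```

In application code, a prepared statement with a bound vector parameter is generally preferable to string formatting; the string form is used here only to make the CQL visible.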

These functions measure the similarity between two vectors in slightly different ways, each with its own mathematical foundation.
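For readers who want to see the math, the three measures can be written in a few lines of plain Python. This is only a sketch of the textbook definitions; the scores Cassandra returns may be normalised differently, so treat the raw values here as illustrative.

```python
import math


def _check_same_length(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def dot_product(a, b):
    """Sum of the pairwise products; grows with both alignment and magnitude."""
    _check_same_length(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores vector magnitude."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b; smaller means more similar."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(dot_product(a, b), cosine_similarity(a, b), euclidean_distance(a, b))
```

When embeddings are normalised to unit length, the dot product and the cosine similarity give the same ranking, which is one reason the dot product is often chosen for normalised embeddings.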

Performant Vector Indexing Capabilities: Distributed systems typically complicate indexing, and Cassandra has historically struggled with secondary indexes, leading many users to avoid them in favour of search engines like Lucene, Solr, or Elasticsearch for better indexing, including text searches. Previous attempts at improving indexing in Cassandra, such as Materialized Views and SASI indexes, were more efficient but still had limitations. However, Cassandra 5.0 introduces a new indexing system called Storage Attached Indexes (SAI), which is a game changer for vector searches and other use cases. SAI indexes are not only faster and easier to use but also update in real time, without requiring an index rebuild before queries, significantly improving performance. Cassandra 5.0 also brings the vector type to the open-source version of Apache Cassandra; it was previously only available in the commercial DataStax versions (DataStax Enterprise, or DSE, and Astra).

With these improvements, Cassandra’s indexing capabilities now surpass many specialised vector databases, enabling AI features and likely changing how data is modelled in Cassandra. Furthermore, SAI’s impressive performance allows it to scale better than most specialised vector databases, making it suitable for a broader range of use cases.

For more information about SAI, you can visit: https://cassandra.apache.org/doc/latest/cassandra/developing/cql/indexing/sai/sai-concepts.html.

Limitations

Despite its advancements, Cassandra still has some limitations:

    • Dissimilarity Searches: Cassandra does not support least-similar (dissimilarity) searches, limiting its applicability in certain AI use cases.

    • Vector Search Limitations: A single request can fetch at most the 1,000 nearest neighbours. This should not be a problem for most use cases.

    • Approximate Nearest Neighbour (ANN): Cassandra uses ANN for vector searches, trading exactness for scalability. Results are generally near-exact, but they are not guaranteed to be the true nearest neighbours.

    • Performance with Updates: Vector searches work best on tables where the item_vector column is not frequently overwritten or deleted. In cases where this column changes, search performance may degrade.
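To get a feel for what "near-exact" means on your own data, one option is to compare ANN results against an exact, exhaustive search on a small sample. The sketch below is a client-side baseline in plain Python, not something Cassandra provides: it scans every item, keeps the `k` most similar by cosine similarity, and computes the share of exact neighbours that an approximate result recovered. All names and the sample data are hypothetical; the 1,000 ceiling mirrors the per-request limit mentioned above.

```python
import heapq
import math

MAX_ANN_LIMIT = 1000  # per-request ceiling mentioned in the limitations above


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, items, k):
    """Return the ids of the k items most similar to query, by exhaustive scan.

    items is an iterable of (item_id, vector) pairs.
    """
    if not 1 <= k <= MAX_ANN_LIMIT:
        raise ValueError(f"k must be between 1 and {MAX_ANN_LIMIT}")
    best = heapq.nlargest(k, items, key=lambda item: cosine_similarity(query, item[1]))
    return [item_id for item_id, _ in best]


def recall(approximate_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result also returned."""
    exact = set(exact_ids)
    return len(exact & set(approximate_ids)) / len(exact)


# Illustrative sample; in practice these would be rows read from the table.
sample = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
exact_ids = exact_top_k([1.0, 0.0], sample, k=2)
print(exact_ids, recall(["a", "b"], exact_ids))
```

An exhaustive scan costs time proportional to the number of items, so it only makes sense on a sample; swapping `heapq.nlargest` for `heapq.nsmallest` would give a least-similar search on that same sample.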

Conclusion

Apache Cassandra is a robust and versatile database, suitable for many applications, including AI.

Cassandra is not a one-size-fits-all solution: it has limitations, and there are cases where it might not be the best choice. Even so, it has proven itself in various demanding scenarios and remains one of the most reliable distributed databases on the market, especially at scale. The alternative, inventing a new database for each specific need, is not practical either. With the introduction of efficient indexing through Storage Attached Indexes (SAI) and vector search capabilities in version 5.0, Cassandra is poised to become one of the leading databases for vector searches. Even though Cassandra 5.0 has not yet been officially released, it is available for testing, and the commercial DataStax versions (DSE and Astra) already include these features for immediate production use.

If you haven’t explored Cassandra yet, it’s worth considering for your generative AI use cases – and many others. The power, resilience, and extensive community support behind Cassandra make it a strong candidate. For those already using Cassandra, the transition to leveraging these new capabilities for AI should be straightforward and beneficial, sparing you the need to learn new databases. Apache Cassandra 5.0 is positioning itself as a top contender in the realm of AI databases, offering powerful features that can handle the demanding needs of modern AI applications.
