Introduction
The world is taking a strong turn towards Artificial Intelligence (AI), specifically through Generative AI (GenAI). The ability to store large amounts of data is crucial for the development and deployment of AI. With the ever-increasing volume of data being generated, it is essential to have a reliable and scalable storage solution that can handle this influx of embedding vectors, the data type AI systems work with.
Before we explore how Apache Cassandra, a popular open-source NoSQL database used by leading tech companies like Apple, Netflix, and Spotify, suits AI storage needs, we will start with a general overview of AI and its storage requirements. This will ensure everyone is on the same page regarding these cutting-edge topics and technologies.
Understanding AI and Its Requirements
The Rise of AI
The concepts of AI date back to the early days of computing, around the 1950s and 1960s. For decades, Machine Learning (ML) has been employed in various fields such as natural language processing, computer vision, and speech recognition.
Despite its long history, AI remained relatively unknown to the general public until the emergence of GenAI, particularly with the introduction of ChatGPT, which quickly reached hundreds of millions of users. This boom in popularity can be attributed to effective marketing and to the efficiency of Generative Pre-trained Transformers (GPT), Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), and other Deep Learning (DL) techniques. Tools like ChatGPT (for generating text using GPT), MidJourney and DALL-E (for generating images), and video generation tools have significantly contributed to this rise.
For more information about these topics, you can check our previous article:
Unleashing the Power of Generative AI: Understanding Transformers – Parser Digital
AI has now matured to a point where it is practical and useful in real-world applications and continues to evolve quickly.
Key AI Vocabulary and Process
The AI field has its own jargon, full of abbreviations, as you can see. Here are some key definitions to help you grasp the big picture of AI processes:
- Artificial Intelligence (AI): An umbrella term for computer software imitating human cognition to perform complex tasks and learn from them.
- Machine Learning (ML): A subfield of AI involving algorithms that learn from the data they are fed, improve their performance, and make decisions without being explicitly programmed for each task.
- Generative AI (GenAI): A subset of AI, sometimes considered complementary to Machine Learning, involving programs that, as the name suggests, generate data according to the training provided. These algorithms can create data similar to what they were trained on, often in response to a specific request, generally formulated in a natural human language (e.g., English or Spanish).
- Embedding: Embeddings transform words, sentences, images, and other data into numerical representations that capture their important properties, meanings, and relationships. By mapping various data types as points within a multidimensional space, embeddings facilitate the clustering of similar data points. These numerical representations allow machines to understand and process this data.
- Vector: Embeddings and vectors can be used interchangeably in the context of ‘vector embeddings’. The use of ‘embeddings’ emphasises the idea of representing data in a meaningful and structured way, while ‘vectors’ refers to the numerical representation itself. While database professionals may refer to ‘vectors’, AI or ML engineers might use ‘embeddings’ or ‘vector embeddings’.
- Similarity Search: Once data is stored as embedding vectors, one of the most common operations is to search for the ‘nearest neighbours’. This means finding data points that are closest in the multidimensional space, often used in recommendation systems (e.g., “We don’t have product X, but product Y is very similar and is available”).
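To make embeddings and similarity search more concrete, here is a minimal Python sketch. The product names and their 3-dimensional vectors are invented for illustration (real embeddings come from a trained model and usually have hundreds or thousands of dimensions), and cosine similarity is only one of several possible measures of closeness.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this example.
catalog = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail shoes": [0.8, 0.2, 0.1],
    "coffee maker": [0.0, 0.1, 0.9],
}


def nearest_neighbours(query, items, k=1):
    """Return the k item names whose vectors are most similar to the query."""
    ranked = sorted(
        items,
        key=lambda name: cosine_similarity(query, items[name]),
        reverse=True,
    )
    return ranked[:k]


# "We don't have running shoes, what is the most similar product?"
query = catalog["running shoes"]
others = {name: vec for name, vec in catalog.items() if name != "running shoes"}
print(nearest_neighbours(query, others))  # ['trail shoes']
```

The two pairs of shoes point in almost the same direction in this small space, so they come out as neighbours, while the coffee maker sits far away.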
The Needs and Use Cases for AI
AI is indispensable for various applications today. Here are some key use cases:
Recommendations: Companies like Netflix or Spotify have been using ML for some time now. At Netflix, no human employee is personally reviewing your viewing history to suggest new content. Manually curating recommendations for every user would require more employees than customers, which is neither practical nor a sustainable business model. In contrast, AI delivers precise suggestions at scale, which makes it a necessity for providing efficient and personalised recommendations.
AI Virtual Assistants: Virtual assistants like Apple Siri, Amazon Alexa or OpenAI’s ChatGPT accept voice commands, interpret what you say, and respond to your request in a meaningful and efficient way. This is radically different from the first generations of virtual assistants that were closer to simple chatbots than actual assistants. They now rely heavily on AI for voice recognition and natural language processing to understand voice commands and respond appropriately.
Language Translation: Tools like Google Translate or DeepL are widely used today, including for translating live conversations. This is another great example of what AI is used for, enabling seamless communication across languages.
Anomaly Detection: AI can be used for anomaly detection, such as identifying issues in monitored systems or detecting fraud.
Classification: AI is also used for automatically sorting large amounts of data. Common examples include spam detection and facial recognition.
All these real-world examples show that AI is a current and practical technology, not just a hypothetical, futuristic topic. While AI is a broad and rapidly evolving field, these examples give a good overview of its uses.
As mentioned, GenAI has recently gained significant attention, building upon existing Machine Learning techniques. It is used to generate texts, images, videos, 3D models, and more. You might be familiar with ChatGPT already. If not, you can give it a try here: https://chat.openai.com/. ChatGPT responses may contain inaccuracies or biases, especially on complex or nuanced queries, because large language models like this are still under development. Even so, its capabilities are already quite impressive.
As other use cases and new AI fields emerge, we can anticipate that they too will rely on embedding vectors (or other forms of multi-dimensional representation) alongside more conventional data. For all this to work, and for AI algorithms to understand and respond to requests in seconds (or faster), AI software must rely on efficient databases whose capabilities match these specific needs.
Find more examples in our previous article:
How to use ChatGPT prompts to enhance AI conversation – Parser Digital
Database Capabilities Required for AI
A vector database dedicated to AI is a storage system that saves and allows querying of multi-dimensional vector data efficiently and accurately. The database is expected to index and store vectors and perform vector operations, such as similarity search (also called “nearest neighbours” search).
New vector databases and vector indexes have emerged in the last few years, offering the capability to store vectors and perform vector searches. Some older, well-known databases like Cassandra or PostgreSQL have added new features or extensions to handle vectors for common AI or ML use cases.
The question is, how do we choose the right database? What criteria should we consider when selecting a vector database?
Here are the specific AI (or ML) needs:
Vector Type Available: To store vectors, a “Vector” type is needed. A vector is essentially a collection of numbers, or more precisely an array of floats. Storing it is the easy part. The need for a specific vector type comes mostly from the operations that will be run on vectors.
Vector Searches: The database must be able to perform similarity searches, which is a foundation for most AI fields. When queried, a vector database must quickly identify the “nearest neighbours” or, the other way around, the most distant embedding vectors (for anomaly detection, for example).
Performant Vector Indexing Capabilities: To achieve good performance on similarity searches, the database requires efficient indexing capabilities.
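The three needs above can be sketched in a few lines of Python. This is a toy, in-memory illustration and not how any real vector database is implemented: the `TinyVectorStore` class, its method names, and the sample readings are all hypothetical. It enforces a fixed dimension (the “vector type”), answers nearest and farthest queries using Euclidean distance, and does so by scanning every stored vector, which is exactly the cost a proper vector index is meant to avoid.

```python
import math


class TinyVectorStore:
    """Illustrative in-memory store for fixed-dimension float vectors."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.rows = {}

    def _check(self, vector):
        # A vector type fixes the number of components up front.
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} values, got {len(vector)}"
            )

    def insert(self, key, vector):
        self._check(vector)
        self.rows[key] = [float(x) for x in vector]

    def _ranked(self, query):
        # Brute force: compute the distance to every stored vector.
        self._check(query)
        return sorted(self.rows, key=lambda k: math.dist(query, self.rows[k]))

    def nearest(self, query, k=1):
        """Keys of the k vectors closest to the query."""
        return self._ranked(query)[:k]

    def farthest(self, query, k=1):
        """Keys of the k vectors most distant from the query."""
        return self._ranked(query)[::-1][:k]


# Hypothetical sensor readings, made up for this example.
store = TinyVectorStore(dimension=2)
store.insert("reading-1", [1.0, 1.0])
store.insert("reading-2", [1.1, 0.9])
store.insert("reading-3", [0.9, 1.2])
store.insert("reading-4", [8.0, -3.0])

typical = [1.0, 1.0]
print(store.nearest(typical, k=2))  # ['reading-1', 'reading-2']
print(store.farthest(typical))      # ['reading-4'], a candidate anomaly
```

Each query here touches every row, so the cost grows linearly with the amount of data. With millions of vectors this becomes too slow, which is why efficient indexing appears in the list of requirements.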
That being said, vectors are mostly yet another data type. Apart from the specific needs mentioned above, a vector database shares the requirements common to every kind of data we store:
Scalability: When I started working a bit more than a decade ago, I was told that only the social network, advertising, and adult industries needed distributed systems and the ability to scale up (and down). Nowadays, the need for a scalable database is common across all kinds of industries and company sizes. It is especially important for AI use cases, as they generally rely on a huge amount of data.
Reliability: Problems in a production system are often expensive. They can damage the image of an organisation, breach Service Level Agreements, or prevent direct sales from happening as they should. This need is once again common across companies, and AI use cases are no exception: for the AI software to work, it needs the data to be available and consistent.
Performance: For most applications, latency and throughput when accessing the data are important. When you ask ChatGPT something, you want the answer to come within seconds, not in 30 minutes. In some cases performance is even critical, as with autopilots, where a delay of a few seconds can have devastating consequences.
We could also mention the need for security, cost-effectiveness, ease of use, etc.
This might be getting a bit long and redundant, but you probably understand the idea. Vector databases are mostly databases with all the commonly needed capabilities to satisfy the needs of modern applications. Plus, they need to handle vectors properly, allow efficient similarity searches on those vectors, and have an efficient indexing system.
Many databases were created specifically for handling vectors, but some are immature or neglect aspects not directly related to vectors, such as scalability and reliability. This ends up affecting vector workloads as well, for example through a lack of real-time indexing or through downtime.
Does Apache Cassandra Fulfil Requirements?
Well, this is for part 2!
But as a quick teaser: Apache Cassandra is now commonly accepted as an efficient, mainstream, mature, secure, scalable, and reliable database. It is widely adopted and used in production by companies of all sizes. However, Apache Cassandra 4 lacks two main features to be a great choice for vectors:
- Efficient Indexing
- Vector handling and similarity searches on vectors
The good news is that Apache Cassandra 5.0 brings precisely those features: Vector type support, similarity search functions (dot product, cosine, and Euclidean distance), and Storage-Attached Indexing (SAI).
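For readers unfamiliar with the three measures named above, here is a short Python sketch of their plain mathematical definitions. It is not Cassandra's implementation, and the values returned by a database's built-in functions may be scaled or normalised differently. The sample vectors are invented to show that the measures can disagree about which vector is “closest”.

```python
import math


def dot_product(a, b):
    """Sum of the pairwise products; grows with both alignment and length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the normalised vectors; compares direction only."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical vectors, made up for this example.
query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # points the same way, but three times longer
close_by = [1.0, 0.5]        # slightly different direction, similar length

for name, vec in [("same_direction", same_direction), ("close_by", close_by)]:
    print(
        name,
        dot_product(query, vec),
        round(cosine_similarity(query, vec), 3),
        euclidean_distance(query, vec),
    )
```

Cosine similarity ranks `same_direction` first because it ignores length, while Euclidean distance ranks `close_by` first because it measures position. Which one fits best depends on how the embeddings were produced.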
In part 2 of this article, we will explore how Apache Cassandra performs in these crucial areas and evaluate how well it meets the requirements to become an outstanding “AI database”.