Artificial intelligence (AI) processing rests on the use of vectorised data. In other words, AI turns real-world information into numerical data that can be searched, manipulated and used to gain insight.

Vector databases are at the heart of this, because that’s how data created by AI modelling is stored and from where it is accessed during AI inference.

In this article, we look at vector databases and how vector data is used in AI and machine learning. We examine high-dimension data, vector embedding, the storage challenges of vector data and the suppliers that offer vector database products. 

What is high-dimension data?

Vector data is a sub-type of so-called high-dimension data. This is data – to simplify significantly – where the number of features or values of a data point far exceeds the number of samples or data points collected.

Low-dimension data – i.e., not many values for each data point – has historically been more common. High-dimension data arises as it becomes possible to capture large amounts of information about each sample. Contemporary AI that processes speech or images, with their many possible attributes and contexts, provides a good example.

What are vectors?

Vectors are one of several types of data in which quantities are represented by arrangements of numbers, from a single value to multi-dimensional arrays.

So, in mathematics, a scalar is a single number, such as 5 or 0.5, while a vector is a one-dimensional array of numbers, such as [0.5, 5]. A matrix extends this into two dimensions, such as:

[[0.5, 5],
 [5, 0.5],
 [0.5, 5]]

Finally, tensors extend this concept into three or more dimensions. A 3D tensor could represent colours in an image (based on values for red, green and blue), while a 4D tensor could add the dimension of time by stringing together or stacking 3D tensors in a video use case.

Tensors, then, are multi-dimensional arrays of numbers that can represent complex data. That is why they lend themselves to use in AI, machine learning and deep learning frameworks such as TensorFlow and PyTorch.
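The progression from scalar to vector to matrix to tensor maps directly onto array shapes in a library such as NumPy, which underpins many machine learning workflows. A minimal sketch (the values and image sizes are invented for illustration):

```python
import numpy as np

scalar = np.float64(0.5)            # a single number: zero dimensions
vector = np.array([0.5, 5.0])       # a 1D array, shape (2,)
matrix = np.array([[0.5, 5.0],
                   [5.0, 0.5],
                   [0.5, 5.0]])     # a 2D array, shape (3, 2)

# A 3D tensor: a tiny 4x4 image with red, green and blue channel values
image = np.zeros((4, 4, 3))

# A 4D tensor: ten such images stacked along a time axis, as in a video
video = np.zeros((10, 4, 4, 3))
```

The `ndim` attribute of each array (`1` for the vector, `2` for the matrix, `3` for the image, `4` for the video) corresponds to the number of dimensions described above.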

What is vector embedding?

In AI, tensors are used to store and manipulate data. Tensor-based frameworks provide tools to create tensors and perform computations on them.

For example, a ChatGPT request in natural language is parsed and processed for word meaning, semantic context and so on, and then represented in multi-dimensional tensor format. In other words, the real-world subject is converted to something on which mathematical operations can be carried out. This is called vector embedding.

To answer the query, this numerical (albeit complex) result of parsing and processing can be compared with tensor-based representations of existing, already vector-embedded data, and an answer supplied. You can transfer that basic concept – ingest and represent; compare and respond – to any AI use case, such as images or buyer behaviour.
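The ingest-and-represent, compare-and-respond idea can be sketched with a toy similarity comparison. The three-dimensional "embeddings" below are hand-made for illustration; a real model would produce hundreds or thousands of dimensions from trained weights:

```python
import numpy as np

# Hypothetical, hand-made embeddings of existing data.
embedded = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A hypothetical embedding of an incoming query, closest to "cat".
query = np.array([0.9, 0.8, 0.15])

best = max(embedded, key=lambda k: cosine_similarity(query, embedded[k]))
```

Here `best` resolves to `"cat"` because its embedding points in nearly the same direction as the query vector; a production system performs essentially this comparison at far larger scale.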

What is a vector database?

Vector databases store high-dimensional vector data. Data points are stored in clusters based on similarity.

Vector databases deliver the kind of speed and performance needed for generative AI use cases. Gartner has said that by 2026, more than 30% of enterprises will have adopted vector databases to build foundation models with relevant business data.

While traditional relational databases are built on rows and columns, data points in a vector database take the form of vectors in many dimensions. Traditional databases are the classic manifestation of structured data: each column represents a variable and each row holds a value for it.

Meanwhile, vector databases can handle values that exist along multiple continua, represented via vectors. So, they don’t have to stick to pre-set variables but can represent the kind of characteristics one might find in what we think of as unstructured data – shades of colour, or the layout of pixels in an image and what they may represent when interpreted as a whole, for example.

It isn’t impossible to transform unstructured data sources into a traditional relational database to prepare them for AI, but it’s not a trivial matter.

The difference is apparent in how search works on traditional and vector databases. On a SQL database, you search for explicit, definite values, such as keywords or numerical values, and you rely on exact matches to retrieve the results you want.

Vector search represents data in a less precise way. There may be no exact match, but if the data is modelled effectively, the search will return results that relate to the thing being looked for, and these may emerge from hidden patterns and relationships that a traditional database would not be able to infer.
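The contrast can be made concrete with a toy dataset. An exact-match lookup of a term that isn't in the data returns nothing, while a nearest-neighbour search over embeddings still surfaces the most related record. The records, vectors and query embedding below are invented for illustration:

```python
import numpy as np

# Toy "database": each record has an assumed 3D embedding
# (real embeddings have hundreds of dimensions).
records = ["red sports car", "blue hatchback", "tabby cat"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.1, 0.1, 0.9],
])

# Exact-match lookup (the SQL-style approach) finds nothing...
exact_hit = "crimson coupe" in records   # False

# ...but nearest-neighbour search over a hypothetical embedding of
# "crimson coupe" still returns the most related record.
query = np.array([0.85, 0.15, 0.05])
distances = np.linalg.norm(vectors - query, axis=1)
nearest = records[int(np.argmin(distances))]
```

The query vector sits closest to the embedding of "red sports car", so that record is returned even though the query string matches nothing exactly.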

What are the storage challenges of vector databases?

AI modelling involves writing vector embeddings into a vector database for very large quantities of often non-mathematical data, such as words, sounds or images. AI inference then uses the model to compare newly supplied queries with that vector-embedded data.

This is carried out by very high-performance processors, most notably graphics processing units (GPUs), which offload very large quantities of processing from server CPUs.

Vector databases can be subject to extreme I/O demands – especially during modelling – and will need the capability to scale massively and potentially offer portability of data between locations to enable the most efficient processing.

Vector databases can be indexed to accelerate searches and can measure the distance between vectors to provide results based on similarity.

That facilitates tasks such as recommendation, semantic search, image recognition and natural language processing.
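One common way such indexing works is to cluster stored vectors around centroids at index-build time, then probe only the cluster nearest the query, rather than scanning every vector – the idea behind inverted-file (IVF) style indexes. A simplified sketch, with random data and an invented cluster count:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 stored vectors, pre-assigned to 10 clusters by nearest centroid,
# as a vector database might do when the index is built.
vectors = rng.normal(size=(1000, 8))
centroids = rng.normal(size=(10, 8))
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2),
    axis=1)

def search(query, k=3):
    # Probe only the cluster whose centroid is closest to the query,
    # instead of measuring distance to all 1,000 vectors.
    cluster = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))
    candidates = np.where(assignments == cluster)[0]
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

hits = search(rng.normal(size=8))
```

Real indexes such as IVF or HNSW add refinements (probing several clusters, graph traversal), but the trade-off is the same: a small loss of exactness in exchange for searching a fraction of the data.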

Who supplies vector databases?

Proprietary and open source vector database products include those from DataStax, Elastic, Milvus, Pinecone, Singlestore and Weaviate.

There are also vector database and vector search extensions to existing databases, such as the open source pgvector extension for PostgreSQL, provision of vector search in Apache Cassandra, and vector database capability in Redis.

There are also platforms with vector database capabilities integrated, such as IBM watsonx.data.

Meanwhile, the hyperscaler cloud providers – AWS, Google Cloud and Microsoft Azure – provide vector database and search in their own offerings as well as from third parties via their marketplaces.
