Vector Database

Next, we need to decide how to store the information sourced in the previous step. RAG is built around retrieving similar information. Unfortunately, storing unstructured text in a traditional database does not support the access patterns we need. Instead, we will use a vector database and store the text as a collection of vector embeddings. Storing data this way allows for similarity searches, which lets our RAG system retrieve only the information relevant to a given question.

For the vector database, we will use Pinecone, a managed, cloud-based vector database. Within Pinecone, we can create indexes to store the vector embeddings. Indexes in Pinecone are similar to indexes in traditional databases. However, while traditional database indexes are optimized to look up exact matches, vector indexes allow for similarity searches based on the distance between high-dimensional embeddings.
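To build intuition for these similarity searches, consider cosine similarity, the metric we will configure for our index below. It scores two embeddings by the angle between them rather than their magnitude: vectors pointing in the same direction score near 1, orthogonal vectors score 0. A minimal sketch (this helper is purely illustrative and not part of the Pinecone client):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Nearly parallel vectors score close to 1; orthogonal vectors score 0
print(cosine_similarity([1.0, 0.0], [1.0, 0.1]))  # close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

Pinecone computes this comparison at scale across millions of stored embeddings, returning the nearest neighbors of a query vector.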

The next step in our pipeline will be creating this Pinecone resource to manage the index.

Pinecone

The PineconeResource will need the ability to create an index if it does not already exist and to retrieve that index so we can upload our embeddings. Creating an index is relatively simple with the Pinecone client. We just need to provide a name, the dimension (the size of each embedding vector), and a metric type (how distance is measured between vector embeddings). Finally, there is the cloud infrastructure Pinecone will use to store the data (we will default to AWS):

```python
from typing import Optional

import dagster as dg
from pinecone import Pinecone
from pydantic import Field


class PineconeResource(dg.ConfigurableResource):
    pinecone_api_key: str = Field(description="Pinecone API key")
    openai_api_key: str = Field(description="OpenAI API key")

    def setup_for_execution(self, context: dg.InitResourceContext) -> None:
        self._pinecone = Pinecone(api_key=self.pinecone_api_key)  # type: ignore

    def create_index(self, index_name: str, dimension: int = 1536):
        # Only create the index if it does not already exist
        if index_name not in self._pinecone.list_indexes().names():
            self._pinecone.create_index(
                name=index_name,
                dimension=dimension,
                metric="cosine",
                spec={"serverless": {"cloud": "aws", "region": "us-east-1"}},
            )

    def get_index(self, index_name: str, namespace: Optional[str] = None):
        index = self._pinecone.Index(index_name)
        if namespace:
            return index, {"namespace": namespace}
        return index, {}
```
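Note that get_index returns a tuple: the index itself, plus a dictionary of keyword arguments that callers can splat into Pinecone operations such as upsert or query. This lets namespaced and non-namespaced calls share one code path. The helper below is a hypothetical standalone mirror of that branch, shown only to illustrate the pattern:

```python
def namespace_kwargs(namespace=None):
    # Mirrors the branch in PineconeResource.get_index: include the
    # namespace key only when a namespace is provided.
    return {"namespace": namespace} if namespace else {}


# A caller would splat these into a Pinecone operation, e.g.
# index.upsert(vectors=batch, **namespace_kwargs("documents"))
print(namespace_kwargs("documents"))  # {'namespace': 'documents'}
print(namespace_kwargs())             # {}
```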

Like our other resources, we will initialize the PineconeResource so it can be used by our Dagster assets:

```python
pinecone_resource = PineconeResource(
    pinecone_api_key=dg.EnvVar("PINECONE_API_KEY"),
    openai_api_key=dg.EnvVar("OPENAI_API_KEY"),
)
```

With our source and vector database resources in place, we can now extract data and upload embeddings in our Dagster assets.

Next steps