Messing with Vectors

At a high level, vector DBs are storage systems built for unstructured data. That’s broad, and it doesn’t sound like something that would generate hype. So why do people care? One of the main use cases is storing embeddings: the numeric vectors that ML models, and LLMs in particular, use to represent text, images and other data. If it can power ChatGPT, it can surely power anything.

Pinecone is one of the leaders in the space. They’ve just raised a monster Series B and they’ve really set developers up for success by highlighting some of the many use cases for AI. How can I leverage this for my purposes?


These fellas were just sold for $53M by Mark Zucc’s Meta

Ever use the Slack plugin for Giphy? I feel like it never does anything close to what I want it to. I’ll build a better one with Pinecone. Shall we? We’ll get started with our setup in a Jupyter notebook and do some basics. First, install your basics:

$ pip install -U pandas pinecone-client sentence-transformers tqdm

Then, let’s set up our index in Pinecone. Note: you’ll be limited to a single index on Pinecone’s free setup. Kind of a bummer, but I only need one ~~bullet~~ index.

from IPython.display import HTML
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd
# Load dataset to a pandas dataframe
df = pd.read_csv(
    "~/Downloads/TGIF-Release-master/data/tgif-v1.0.tsv",
    delimiter="\t",
    names=['url', 'description']
)
print(df.head())


import pinecone

# Connect to pinecone environment
pinecone.init(
    api_key="NICETRY",
    environment="WISEGUY"
)

index_name = 'gif-search'

# check if the gif-search exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=384,
        metric="cosine"
    )

# Connect to gif-search index we created
index = pinecone.Index(index_name)
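
A quick aside on `dimension=384`: it isn't arbitrary. It has to match the length of the vectors the embedding model produces, and the model used further down (`all-MiniLM-L6-v2`) outputs 384-dimensional embeddings. Here's a minimal sketch of a sanity check you could run before upserting. `check_dimension` is a hypothetical helper of mine, not part of the Pinecone client, and the vectors are stand-ins.

```python
from typing import Sequence

INDEX_DIMENSION = 384  # must equal the dimension passed to create_index


def check_dimension(
    vectors: Sequence[Sequence[float]], expected: int = INDEX_DIMENSION
) -> None:
    """Raise ValueError if any vector's length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, expected {expected}"
            )


# Two stand-in vectors of the right size pass silently
check_dimension([[0.0] * 384, [0.1] * 384])
```

With the real model loaded, `retriever.get_sentence_embedding_dimension()` should report the same number.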

Splendid, splendid. Now, it’s time to get our transformer, create embeddings and store them. With our data set transformed, we’ll then be able to query for recommended gifs based on an input.

from sentence_transformers import SentenceTransformer

# Initialize retriever with SentenceTransformer model 
retriever = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
retriever

from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(df))
    # extract batch
    batch = df.iloc[i:i_end]
    # generate embeddings for batch
    emb = retriever.encode(batch['description'].tolist()).tolist()
    # get metadata
    meta = batch.to_dict(orient='records')
    # create IDs
    ids = [f"{idx}" for idx in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

    
# check that we have all vectors in index
index.describe_index_stats()
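
If the index arithmetic in that loop is hard to follow, here's the same batching logic with the model and Pinecone stripped out. Treat it as a sketch: `batch_ranges` and `build_records` are hypothetical names, and the two-number vectors stand in for real 384-dimensional embeddings.

```python
from typing import Dict, Iterator, List, Tuple


def batch_ranges(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) pairs that cover range(total) in batch_size chunks."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def build_records(
    start: int, embeddings: List[List[float]], metadata: List[Dict[str, str]]
) -> List[Tuple[str, List[float], Dict[str, str]]]:
    """Pair each embedding with a string ID and its metadata, like the upsert loop."""
    ids = [str(start + offset) for offset in range(len(embeddings))]
    return list(zip(ids, embeddings, metadata))


# 130 rows in batches of 64: two full batches, then a final batch of 2
print(list(batch_ranges(130, 64)))

# Stand-in data for a batch that starts at row 64
records = build_records(
    64,
    [[0.1, 0.2], [0.3, 0.4]],
    [
        {"url": "a.gif", "description": "a cat"},
        {"url": "b.gif", "description": "a dog"},
    ],
)
print(records[0])
```

Each record is an `(id, vector, metadata)` tuple, which is the shape the loop above passes to `index.upsert`.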

You may notice the above step takes quite a while to run. Why’s that? Every description in the dataset has to be pushed through the model to produce a 384-dimensional embedding, and each batch then goes over the network to Pinecone. The upside is that we only pay this cost once. With the embeddings offloaded to a DB, we don’t need to keep them in memory or compute similarities ourselves over raw stored values; a store optimized for exactly this kind of lookup does it for us.
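
For a rough sense of scale, here's some back-of-the-envelope arithmetic on the raw embeddings. It assumes each value is stored as a 4-byte float32 and ignores IDs, metadata and any index overhead. The row count is a made-up round number, not the actual size of the dataset.

```python
BYTES_PER_FLOAT32 = 4
DIMENSION = 384  # matches the index created earlier


def embedding_bytes(n_vectors: int, dimension: int = DIMENSION) -> int:
    """Raw size in bytes of n_vectors float32 embeddings of the given dimension."""
    return n_vectors * dimension * BYTES_PER_FLOAT32


n_vectors = 100_000  # hypothetical row count, for illustration only
size_mib = embedding_bytes(n_vectors) / 1024 ** 2
print(f"{n_vectors} vectors take roughly {size_mib:.1f} MiB")
```

Swap in `len(df)` for `n_vectors` to get the figure for your own data.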

To find the “most similar” (i.e. “recommended”) gif, we transform our query into an embedding with the same model and compare it against the stored vectors using cosine similarity in that 384-dimensional space. The DB returns the 10 nearest neighbors (a k-NN lookup with k = 10), and we then display those to the user.
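
To make that concrete, here's a brute-force sketch of the idea in plain Python: score every stored vector by cosine similarity, cos(q, v) = (q · v) / (‖q‖ ‖v‖), and keep the top k. This is my illustration, not how Pinecone works internally (a managed index would typically avoid scanning every vector). `cosine_similarity`, `top_k` and the toy 3-dimensional vectors are all assumptions for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int = 10
) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs with the highest cosine similarity to query."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "index" with 3-dimensional vectors instead of 384-dimensional ones
toy_index = {
    "same direction": [2.0, 0.0, 0.0],
    "diagonal": [1.0, 1.0, 0.0],
    "orthogonal": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(nearest)
```

Note that the vector pointing the same way as the query scores highest even though it's twice as long: cosine similarity only cares about direction. In the notebook itself, Pinecone does all of this for us: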

def search_gif(query):
    # Generate embeddings for the query
    xq = retriever.encode(query).tolist()
    # Ask the index for the 10 nearest vectors (cosine metric), along with their metadata
    xc = index.query(vector=xq, top_k=10,
                    include_metadata=True)
    result = []
    for context in xc['matches']:
        url = context['metadata']['url']
        result.append(url)
    return result

def display_gif(urls):
    figures = []
    for url in urls:
        figures.append(f'''
            <figure style="margin: 5px !important;">
              <img src="{url}" style="width: 120px; height: 90px" >
            </figure>
        ''')
    return HTML(data=f'''
        <div style="display: flex; flex-flow: row wrap; text-align: center;">
        {''.join(figures)}
        </div>
    ''')

Let’s give it a go with a basic prompt: “Techie on his laptop”.

teeech

Not too bad. Everything seems to make perfect sense here. Not too techie, but I think that should do. How about something more specific?

teeech

Wow, now that’s just awful. Looks like, despite their recent sale at a $262M loss, Giphy still has some value to be found within every good Slack troll’s plugin repertoire.

That being said, well done to the folks at Pinecone for making this so accessible.

Messing with Vectors

Dipping one's toes into the newest VC phenomenon of Vector Databases
