Wednesday, 7 January 2026

Embeddings and Vector Representation

 

Embeddings convert text into numbers (vectors) that capture meaning.

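Think of it like a GPS coordinate: each piece of text gets a position, and texts with related meanings land close together (the values below are illustrative):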
- "dog" → [0.2, 0.8, 0.1, ...] (1536 numbers)
- "cat" → [0.3, 0.7, 0.2, ...] (close to "dog"!)
- "car" → [0.9, 0.1, 0.8, ...] (far from "dog")

Similar meanings = Similar vectors!
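
To make the analogy concrete, here is a small sketch that measures the straight-line distance between the three toy vectors above. It assumes only the first three illustrative values of each vector (real embeddings from this model have 1536), so the numbers are made up and only the relative ordering matters.

```python
import math

# Toy 3-dimensional vectors built from the illustrative values above.
# Real embeddings are much longer; these numbers are for intuition only.
toy_vectors = {
    "dog": [0.2, 0.8, 0.1],
    "cat": [0.3, 0.7, 0.2],
    "car": [0.9, 0.1, 0.8],
}

def distance(a, b):
    """Euclidean (straight-line) distance between two equal-length vectors."""
    return math.dist(a, b)

dog_cat = distance(toy_vectors["dog"], toy_vectors["cat"])
dog_car = distance(toy_vectors["dog"], toy_vectors["car"])

print(f"dog-cat distance: {dog_cat:.3f}")
print(f"dog-car distance: {dog_car:.3f}")
```

With these toy values, "cat" sits roughly 0.17 away from "dog" while "car" is roughly 1.21 away, which is the "close" versus "far" picture from the list.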


OpenAI Embeddings

from langchain_openai import OpenAIEmbeddings

# Initialize (the API key is read from the OPENAI_API_KEY environment variable)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create embedding for a query
query = "What is machine learning?"
query_vector = embeddings.embed_query(query)

print(f"Query: {query}")
print(f"Vector dimensions: {len(query_vector)}")
print(f"First 5 values: {query_vector[:5]}")

# Embed multiple documents
docs = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "The weather is sunny today"
]

doc_vectors = embeddings.embed_documents(docs)
print(f"\nEmbedded {len(doc_vectors)} documents")

Output:

Query: What is machine learning?
Vector dimensions: 1536
First 5 values: [-0.002476818859577179, -0.012755980715155602, -0.006645360495895147, -0.03157883137464523, 0.028759293258190155]

Embedded 3 documents

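Google Gemini Embeddings

Gemini embeddings come from the langchain-google-genai package and expose the same embed_query and embed_documents methods:

from langchain_google_genai import GoogleGenerativeAIEmbeddings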

You can find the current top embedding models on the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard. Entries marked "Unknown" are the non-open-source models.

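Cosine similarity is the usual way to measure how similar two vectors are. It compares the direction of the vectors and ignores their length: a value near 1 means very similar, and a value near 0 means unrelated.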
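
Here is a minimal sketch of cosine similarity in plain Python, plus a helper that ranks documents against a query. The names `cosine_similarity` and `rank_by_similarity` are hypothetical helpers written for this post, not part of LangChain, and the demo uses made-up 3-dimensional vectors instead of real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, doc_vectors, docs):
    """Return (score, document) pairs sorted from most to least similar."""
    scored = [
        (cosine_similarity(query_vector, vector), doc)
        for vector, doc in zip(doc_vectors, docs)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Made-up stand-ins for real embeddings (the "dog" vector is the query)
toy_query = [0.2, 0.8, 0.1]
toy_docs = ["cat", "car"]
toy_doc_vectors = [[0.3, 0.7, 0.2], [0.9, 0.1, 0.8]]

ranking = rank_by_similarity(toy_query, toy_doc_vectors, toy_docs)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

The same helper should work on the real vectors from the OpenAI section by calling rank_by_similarity(query_vector, doc_vectors, docs). The sketch does not guard against all-zero vectors, which would cause a division by zero.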

