VillageSQL is a drop-in replacement for MySQL with extensions.
All examples in this guide work on VillageSQL.
The `vsql_ai` extension adds `ai_embedding()`, which calls an embedding model and returns the vector as a JSON array you can store and query.
## What Embeddings Are For
An embedding model converts text to a fixed-length vector of floats. The distance between two vectors reflects semantic similarity:

- "dog" and "puppy" → similar vectors
- "dog" and "database index" → distant vectors
## With VillageSQL: `ai_embedding()`
`ai_embedding(provider, model, api_key, text)` calls an embedding model and returns a JSON array string.
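A minimal call might look like this; the API key and input text are placeholders:

```sql
SELECT ai_embedding(
    'openai',                     -- provider
    'text-embedding-3-small',     -- model
    'sk-...',                     -- placeholder: your API key
    'How do I reset my password?' -- text to embed
) AS embedding;
-- Returns a JSON array string such as '[0.0123, -0.0456, ...]'
```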
### Storing embeddings
Store the JSON array in a JSON column:
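For example, sketched against a hypothetical `articles` table (the table and column names are illustrative):

```sql
ALTER TABLE articles ADD COLUMN embedding JSON;

UPDATE articles
SET embedding = ai_embedding(
    'openai', 'text-embedding-3-small', 'sk-...', body)  -- 'sk-...' is a placeholder key
WHERE id = 42;
```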
### Backfilling embeddings on existing rows
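One way to backfill, reusing the hypothetical `articles` table: embed only rows that don't yet have a vector, in small batches, since each embedding requires an API call. MySQL's `UPDATE ... LIMIT` makes this easy to run repeatedly:

```sql
UPDATE articles
SET embedding = ai_embedding('openai', 'text-embedding-3-small', 'sk-...', body)
WHERE embedding IS NULL
LIMIT 50;  -- re-run until no rows remain
```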
## Choosing a Model
| Provider | Model | Dimensions | Use when |
|---|---|---|---|
| `'openai'` | `text-embedding-3-small` | 1536 | Best value — good accuracy, lower cost |
| `'openai'` | `text-embedding-3-large` | 3072 | Higher accuracy needed |
| `'google'` | `gemini-embedding-001` | 3072 | Google ecosystem |
| `'local'` | any Ollama model (e.g., `nomic-embed-text`) | varies | Private data; no API key required |
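Switching providers only changes the first two arguments. For a local Ollama model no API key is required; this sketch assumes an empty string is accepted for the key argument:

```sql
SELECT ai_embedding('local', 'nomic-embed-text', '', 'private customer note') AS embedding;
```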
## Semantic Search
Neither MySQL nor VillageSQL provides a native vector distance function. Full similarity search — finding the nearest vectors in a large table — runs in application code. The typical architecture: store embeddings in MySQL, fetch them, compute cosine similarity in your application, and return the top matches.
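The application-side step can be sketched in Python. This assumes rows were fetched as `(id, embedding_json)` pairs from the JSON column; driver details are omitted:

```python
import json
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, rows, k=5):
    """rows: list of (id, embedding_json) tuples fetched from the database.
    Returns the k (id, score) pairs most similar to query_vec."""
    scored = [(row_id, cosine_similarity(query_vec, json.loads(emb)))
              for row_id, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

This brute-force scan is fine for thousands of rows; beyond that, a dedicated vector index outside the database becomes worthwhile.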
## What to Embed

The quality of your embeddings depends on what you embed. Some patterns:

## Frequently Asked Questions
### Can I compare embeddings from different models?
No. Embeddings from `text-embedding-3-small` and `gemini-embedding-001` live in different vector spaces — their similarity scores are meaningless when compared against each other. Standardize on one model per column.
### How much storage do embeddings take?
A JSON array of 1536 floats is roughly 10–15 KB per row as a text string. For large tables, this adds up. Consider whether you need all embeddings stored or just the ones you’ll query against.

### Does `ai_embedding()` support batch input?

No — each call processes one text input. For bulk embedding, loop through rows in your application and call `UPDATE` in batches.
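A hedged sketch of such a loop, assuming any DB-API-style connection to VillageSQL; the table and column names (`articles`, `id`, `body`, `embedding`) and the key are placeholders:

```python
from itertools import islice

API_KEY = "sk-..."  # placeholder; supply a real key

def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def backfill_embeddings(conn, batch_size=50):
    """Embed every row that does not yet have a vector, one UPDATE per row,
    committing per batch so progress survives interruption."""
    cur = conn.cursor()
    cur.execute("SELECT id, body FROM articles WHERE embedding IS NULL")
    for batch in batched(cur.fetchall(), batch_size):
        for row_id, body in batch:
            cur.execute(
                "UPDATE articles SET embedding = ai_embedding("
                "'openai', 'text-embedding-3-small', %s, %s) "
                "WHERE id = %s",
                (API_KEY, body, row_id),
            )
        conn.commit()
```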
### What happens if the text is too long?
The embedding model truncates input at its token limit (typically 8191 tokens for OpenAI models). The embedding is computed on the truncated text. For long documents, embed a summary or the first few paragraphs rather than the full text.

## Troubleshooting
| Problem | Solution |
|---|---|
| `FUNCTION ai_embedding does not exist` | Run `INSTALL EXTENSION vsql_ai` |
| `ai_embedding()` returns `NULL` | Check API key; verify model name; check `max_execution_time` |
| Similarity search returns wrong results | Confirm all embeddings use the same model; check that embeddings weren’t generated with different content than what you’re searching |
| Embeddings take too long to generate | Use `LIMIT` batches; each API call takes ~1–3 seconds |

