Strange as it may seem after trying ChatGPT,
modern AIs don't "speak" English, or any other human language.
Vectors
Sequences of numbers, usually called "arrays" in programming and "vectors" in mathematics,
for example [123, 4, 56] or [0.34, 0.23, 0.01, 0.98], just much longer.
Digital computers "speak" numbers, and numbers only.
In particular, the GPU and NPU components used to process data
are optimized for very fast processing of such arrays.
So all text, images and other content needs to be "translated" to arrays of numbers,
processed, and the results translated back to a human-understandable form.
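As a rough illustration of this round trip, the sketch below turns a string into an array of numbers and back, using the standard TextEncoder and TextDecoder (UTF-8 bytes). This is only the lowest-level encoding a computer uses for text, not how an AI model tokenizes or embeds it, and the function names are made up for this example.

```typescript
// Text -> numbers: every character becomes one or more UTF-8 byte values (0-255).
function textToNumbers(text: string): number[] {
  return Array.from(new TextEncoder().encode(text));
}

// Numbers -> text: decode the byte values back into a string.
function numbersToText(numbers: number[]): string {
  return new TextDecoder().decode(new Uint8Array(numbers));
}

const numbers = textToNumbers('Hi!');
console.log(numbers);                // [ 72, 105, 33 ]
console.log(numbersToText(numbers)); // Hi!
```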
There are two different concepts related to this translation that look similar
but serve different purposes: "tokens" and "embeddings".
To make any sense of AI APIs, one needs to understand those concepts.
Tokens
are about usage and money: they measure the "quantity" of text sent to and received from an AI API.
Depending on the API used, there are limits on how much content, i.e. text,
can be sent at once, and the price is calculated from the number of tokens. A token is an integer that
usually maps to a few characters: a short word, or part of a longer word.
For example, "Hello World!" is converted to [9906, 4435, 0] = ["Hello", "World", "!"], and "Sequences" to [1542, 45045] = ["Se", "quences"].
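To make the mapping concrete, here is a toy tokenizer that does a greedy longest-match lookup against a tiny, hand-written vocabulary. The vocabulary holds only the pieces and IDs from the example above (attaching the space to " World" is an assumption made so the lookup works). Real tokenizers such as BPE use vocabularies of many thousands of entries and split text by different rules, so treat this purely as an illustration.

```typescript
// Hypothetical mini-vocabulary: text piece -> token ID (IDs taken from the example above).
const vocabulary = new Map<string, number>([
  ['Hello', 9906],
  [' World', 4435],
  ['!', 0],
  ['Se', 1542],
  ['quences', 45045],
]);

// Greedy longest-match: at each position take the longest piece found in the vocabulary.
function toyTokenize(text: string): number[] {
  const tokens: number[] = [];
  let position = 0;
  while (position < text.length) {
    let matched = false;
    for (let end = text.length; end > position; end--) {
      const id = vocabulary.get(text.slice(position, end));
      if (id !== undefined) {
        tokens.push(id);
        position = end;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error(`No token for text at position ${position}`);
  }
  return tokens;
}

console.log(toyTokenize('Hello World!')); // [ 9906, 4435, 0 ]
console.log(toyTokenize('Sequences'));    // [ 1542, 45045 ]
```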
To measure and estimate the cost of AI API calls we can use libraries, e.g. OpenAI's tiktoken:
openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models. @GitHub (Python)
tiktoken - npm (JavaScript)
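import { encoding_for_model } from 'tiktoken';
const prompt = 'Hello World!'; // example text to measure
const encoder = encoding_for_model('gpt-3.5-turbo');
const tokens = encoder.encode(prompt); // array of integer token IDs
console.log(tokens.length); // token count, the unit used for limits and billing
encoder.free(); // release the WASM-backed encoder when done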
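Once the token count is known, estimating the price is simple arithmetic. The sketch below assumes a made-up price of 0.50 USD per million tokens and a made-up limit of 16000 tokens per request. Real prices and limits differ per model and change over time, so these constants are placeholders only.

```typescript
// Hypothetical numbers for illustration only, not real pricing or limits.
const PRICE_PER_MILLION_TOKENS_USD = 0.5;
const MAX_TOKENS_PER_REQUEST = 16000;

// Cost grows linearly with the number of tokens sent.
function estimateCostUsd(tokenCount: number): number {
  return (tokenCount / 1000000) * PRICE_PER_MILLION_TOKENS_USD;
}

// A request is only accepted if it fits within the token limit.
function fitsInRequest(tokenCount: number): boolean {
  return tokenCount <= MAX_TOKENS_PER_REQUEST;
}

const tokenCount = 2000000; // e.g. the total for a large batch of prompts
console.log(estimateCostUsd(tokenCount)); // 1
console.log(fitsInRequest(tokenCount));   // false
```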
Embeddings
are about the "meaning" and "similarity" of content.
Through the process of "training", AI "models" learn to map content to an array of floating-point numbers, usually small values between -1 and 1. Such arrays, called vectors, can then be compared to measure "similarity" or "distance".
For example, "Hello World!" can be converted to a vector of about 1500 numbers like this:
"input": "Hello World!",
"embedding": [
-0.0030342294,
-0.056672804,
0.029482627,
0.042976152,
-0.040828794,
-0.025202423,
-0.012789831,
0.035228256,
-0.031571947,
...
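import OpenAI from 'openai';
// By default the client reads the API key from the OPENAI_API_KEY environment variable
const openai = new OpenAI();
// Request embeddings for one text or a batch of texts
export async function createEmbeddings(input: string | string[]) {
  return await openai.embeddings.create({
    input: input,
    model: 'text-embedding-3-small'
  });
}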
"Since the embeddings capture the semantic meaning of the questions, it is possible to compare different embeddings and see how different or similar they are. Thanks to this, you can get the most similar embedding to a query, which is equivalent to finding the most similar FAQ. Check out our semantic search tutorial for a more detailed explanation of how this mechanism works."
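The FAQ lookup described above can be sketched in a few lines. The three-dimensional vectors below are invented so the example runs without an API call (real embeddings have hundreds or thousands of dimensions), and `findMostSimilar` is a hypothetical helper, not part of any library.

```typescript
interface FaqEntry {
  question: string;
  embedding: number[]; // made-up vector standing in for a real embedding
}

// Cosine similarity: close to 1 means same direction, close to 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, aSquared = 0, bSquared = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    aSquared += a[i] * a[i];
    bSquared += b[i] * b[i];
  }
  return dot / (Math.sqrt(aSquared) * Math.sqrt(bSquared));
}

// Return the entry whose embedding is most similar to the query embedding.
function findMostSimilar(query: number[], entries: FaqEntry[]): FaqEntry {
  return entries.reduce((best, entry) =>
    cosineSimilarity(query, entry.embedding) > cosineSimilarity(query, best.embedding) ? entry : best
  );
}

const faq: FaqEntry[] = [
  { question: 'How do I reset my password?', embedding: [0.9, 0.1, 0.0] },
  { question: 'What are your opening hours?', embedding: [0.0, 0.2, 0.9] },
];

// Pretend this is the embedding of "I forgot my password".
const queryEmbedding = [0.8, 0.2, 0.1];
console.log(findMostSimilar(queryEmbedding, faq).question); // How do I reset my password?
```

In a real application the vectors would come from an embeddings API, and the linear scan would be replaced by a vector database once the collection grows.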
Calculating Similarity between Embeddings
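// Dot product: sum of the element-wise products of two equal-length vectors
export function calcDotProduct(a: number[], b: number[]) {
  return a.map((value, index) => value * b[index]).reduce((sum, x) => sum + x, 0);
}
// Cosine similarity: dot product divided by the product of the two vector magnitudes
export function calcCosineSimilarity(a: number[], b: number[]) {
  const product = calcDotProduct(a, b);
  const aMagnitude = Math.sqrt(a.map(value => value * value).reduce((sum, x) => sum + x, 0));
  const bMagnitude = Math.sqrt(b.map(value => value * value).reduce((sum, x) => sum + x, 0));
  return product / (aMagnitude * bMagnitude);
}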
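Cosine similarity looks only at the direction of two vectors and ignores their length. For comparison, here is a minimal sketch of the other common measure, Euclidean distance, using made-up two-dimensional vectors. Both function names are chosen for this example, and the cosine function is repeated so the snippet runs on its own.

```typescript
// Euclidean distance: straight-line distance between two points; 0 means identical.
function calcEuclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, value, index) => sum + (value - b[index]) ** 2, 0));
}

// Cosine similarity: dot product divided by the product of the magnitudes.
function calcCosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, value, index) => sum + value * b[index], 0);
  const aMagnitude = Math.sqrt(a.reduce((sum, value) => sum + value * value, 0));
  const bMagnitude = Math.sqrt(b.reduce((sum, value) => sum + value * value, 0));
  return dot / (aMagnitude * bMagnitude);
}

const shortVector = [3, 4];
const longVector = [6, 8]; // same direction, twice the length

console.log(calcEuclideanDistance(shortVector, longVector)); // 5
console.log(calcCosineSimilarity(shortVector, longVector));  // 1
```

The two vectors are 5 units apart yet count as perfectly similar by cosine, which is why the choice of measure matters when comparing embeddings.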
The Building Blocks of LLMs: Vectors, Tokens and Embeddings @ TheNewStack
a vector is a single-dimensional array, in this case of numbers only
In LLMs, vectors are used to represent text or data in a numerical form that the model can understand and process. This representation is known as an embedding. Embeddings are high-dimensional vectors that capture the semantic meaning of words, sentences or even entire documents.
Vector Databases
(online only, free tier available)
PostgreSQL + pgvector extension
Milvus
Weaviate
Faiss
Vespa
Redis