DraganSr: AI LLMs: Tokens, Embeddings, Vector Databases

Sunday, May 26, 2024

Strange as it may seem after trying ChatGPT,
modern AIs don't "speak" English, or any other human language.

Vectors

Sequences of numbers, usually called arrays in programming, and "vectors" in mathematics.

e.g. [123, 4, 56] or [0.34, 0.23, 0.01, 0.98], just much longer

Digital computers "speak" numbers, and numbers only.

In particular, the GPU and NPU components used to process data
are optimized for very fast processing of such arrays.

So all "text", images and other content needs to be "translated" to such arrays of numbers, processed, and the results translated back to human-understandable form.
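
As a rough illustration of the "translate to numbers and back" idea, here is a minimal sketch using UTF-8 bytes. This is only an analogy: LLMs do not work on raw bytes like this, they use the tokens and embeddings described below. The variable names are made up for the example.

```typescript
// Illustration only: UTF-8 bytes are one simple way to turn text into numbers.
const text = 'Hello World!';

// Text -> array of numbers (one number per UTF-8 byte)
const numbers: number[] = Array.from(new TextEncoder().encode(text));
console.log(numbers); // [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33]

// Array of numbers -> text again
const roundTrip: string = new TextDecoder().decode(new Uint8Array(numbers));
console.log(roundTrip); // Hello World!
```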

Two concepts related to this translation look similar
but serve different purposes: "tokens" and "embeddings".

To make any sense of AI APIs, one needs to understand those concepts.

Tokens

Tokens are about usage and money: they measure the "quantity" of text sent to and received from an AI API.

Depending on the API used, there are limits on how much content, i.e. text,
can be sent at once, and the price is calculated from the number of tokens. A token is an integer that usually maps to a few characters: a short word, or part of a longer word.

For example "Hello World!" is converted to [9906, 4435, 0] = ["Hello", "World", "!"]
and "Sequences" is [1542, 45045] = ["Se", "quences"]
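
To make the text-to-integer mapping concrete, here is a toy sketch. The vocabulary is hypothetical and tiny, and it reuses the three ids from the "Hello World!" example. It assumes the space is attached to the following word, as is typical for BPE vocabularies. The greedy longest-match lookup is a simplification, not the real BPE merge procedure, and `toyEncode` / `toyDecode` are made-up names.

```typescript
// Hypothetical mini vocabulary; real tokenizers have many thousands of entries.
const toyVocabulary = new Map<string, number>([
    ['Hello', 9906],
    [' World', 4435], // assumption: the leading space belongs to the token
    ['!', 0],
]);

// Greedy longest-match lookup: a simplification of what a real tokenizer does
function toyEncode(text: string): number[] {
    const ids: number[] = [];
    let position = 0;
    while (position < text.length) {
        let matched = false;
        // Try the longest remaining piece first, then shorter ones
        for (let end = text.length; end > position; end--) {
            const id = toyVocabulary.get(text.slice(position, end));
            if (id !== undefined) {
                ids.push(id);
                position = end;
                matched = true;
                break;
            }
        }
        if (!matched) {
            throw new Error(`No token for text at position ${position}`);
        }
    }
    return ids;
}

// Reverse lookup: token ids back to text
function toyDecode(ids: number[]): string {
    const pieces = new Map<number, string>();
    toyVocabulary.forEach((id, piece) => pieces.set(id, piece));
    return ids.map(id => pieces.get(id) ?? '').join('');
}

console.log(toyEncode('Hello World!')); // [9906, 4435, 0]
```

The number of tokens, here 3, is what counts toward limits and price, not the number of characters.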

OpenAI Platform: Tokenizer

To measure and estimate the cost of AI API calls we can use libraries, e.g. OpenAI tiktoken

openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models. @GitHub (Python)

tiktoken - npm (JavaScript)

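import { encoding_for_model } from 'tiktoken'
const prompt = 'Hello World!';
const encoder = encoding_for_model('gpt-3.5-turbo');
const tokens = encoder.encode(prompt); // token ids, one integer per token
console.log(tokens.length); // number of tokens the prompt will be counted as
encoder.free(); // release the memory held by the encoder when done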
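
Once the token count is known, estimating cost and checking limits is simple arithmetic. The sketch below assumes prices are quoted per one million tokens, separately for input and output. The prices and the context window size are made-up numbers, and `estimateCostUsd` and `fitsContextWindow` are hypothetical helper names, so check the actual price list of the API used.

```typescript
// Hypothetical prices in USD per one million tokens
interface TokenPricing {
    inputPerMillion: number;
    outputPerMillion: number;
}

// Cost = tokens * price per token, summed over input and output
function estimateCostUsd(inputTokens: number, outputTokens: number, pricing: TokenPricing): number {
    return (inputTokens * pricing.inputPerMillion + outputTokens * pricing.outputPerMillion) / 1e6;
}

// Does the prompt plus the maximum expected reply fit within the model's limit?
function fitsContextWindow(inputTokens: number, maxOutputTokens: number, contextWindow: number): boolean {
    return inputTokens + maxOutputTokens <= contextWindow;
}

const examplePricing: TokenPricing = { inputPerMillion: 0.5, outputPerMillion: 1.5 }; // made-up numbers
console.log(estimateCostUsd(1200, 300, examplePricing));
console.log(fitsContextWindow(1200, 300, 4096));
```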

Embeddings

Embeddings are about the "meaning" and "similarity" of content.

Through the process of "training", AI "models" learn to map content to an array of floating point numbers, typically small values between -1 and 1. Such arrays, called vectors, can then be compared to find their "similarity" or "distance".

For example, "Hello World!" can be converted to a vector of about 1500 numbers (1536 for text-embedding-3-small), like this
"input": "Hello World!",
"embedding": [
-0.0030342294,
-0.056672804,
0.029482627,
0.042976152,
-0.040828794,
-0.025202423,
-0.012789831,
0.035228256,
-0.031571947,

...

Embeddings - OpenAI API

import OpenAI from 'openai'
const openai = new OpenAI();
export async function createEmbeddings(input: string| string[]) {
    return await openai.embeddings.create({
        input: input,
        model: 'text-embedding-3-small'
    })
}
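
The response contains one embedding per input, each with an index pointing back to its position in the input. A minimal sketch of pulling out the plain vectors follows. The `EmbeddingsResponseLike` interface is a local, simplified assumption about the response shape (only the fields used here), the sample vectors are made up and far too short, and `extractVectors` is a hypothetical helper.

```typescript
// Simplified assumption about the response shape: only the fields used below
interface EmbeddingItem {
    index: number;       // position of the corresponding input
    embedding: number[]; // the vector itself
}
interface EmbeddingsResponseLike {
    data: EmbeddingItem[];
}

// Return the vectors in the same order as the inputs
function extractVectors(response: EmbeddingsResponseLike): number[][] {
    return [...response.data]
        .sort((a, b) => a.index - b.index)
        .map(item => item.embedding);
}

// Made-up sample; real vectors have far more numbers
const sampleResponse: EmbeddingsResponseLike = {
    data: [
        { index: 1, embedding: [0.03, -0.01, 0.02] },
        { index: 0, embedding: [-0.003, -0.057, 0.029] },
    ],
};
const vectors = extractVectors(sampleResponse);
console.log(vectors.length, vectors[0].length);
```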

Getting Started With Embeddings

"An embedding is a numerical representation of a piece of information, for example, text, documents, images, audio, etc. The representation captures the semantic meaning of what is being embedded

Since the embeddings capture the semantic meaning of the questions, it is possible to compare different embeddings and see how different or similar they are. Thanks to this, you can get the most similar embedding to a query, which is equivalent to finding the most similar FAQ. Check out our semantic search tutorial for a more detailed explanation of how this mechanism works."

LLM AI Embeddings | Microsoft Learn

Word embedding - Wikipedia

Calculating Similarity between Embeddings

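// Sum of element-wise products; assumes both vectors have the same length
export function calcDotProduct(a: number[], b: number[]) {
    return a.map((value, index) => value * b[index]).reduce((a, b) => a + b, 0);
}

// Cosine similarity: dot product divided by the product of the vector magnitudes.
// The result is between -1 and 1; values near 1 mean the vectors point the same way.
export function calcCosineSimilarity(a: number[], b: number[]) {
    const product = calcDotProduct(a, b);
    const aMagnitude = Math.sqrt(a.map(value => value * value).reduce((a, b) => a + b, 0));
    const bMagnitude = Math.sqrt(b.map(value => value * value).reduce((a, b) => a + b, 0));
    return product / (aMagnitude * bMagnitude);
}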
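
To show how such a similarity score gets used, here is a small self-contained sketch that ranks stored items against a query vector and returns the closest ones. The 3-number vectors are made up for readability (real embeddings come from a model and are much longer), and `findMostSimilar` and `StoredItem` are hypothetical names. It carries its own cosine similarity helper so it can run on its own.

```typescript
// Cosine similarity, written as a single loop
function cosineSimilarity(a: number[], b: number[]): number {
    if (a.length !== b.length) throw new Error('Vectors must have the same length');
    let dot = 0, aSquares = 0, bSquares = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        aSquares += a[i] * a[i];
        bSquares += b[i] * b[i];
    }
    return dot / (Math.sqrt(aSquares) * Math.sqrt(bSquares));
}

interface StoredItem {
    text: string;
    vector: number[];
}

// Score every item against the query, sort by score descending, keep the top K
function findMostSimilar(query: number[], items: StoredItem[], topK: number) {
    return items
        .map(item => ({ text: item.text, score: cosineSimilarity(query, item.vector) }))
        .sort((x, y) => y.score - x.score)
        .slice(0, topK);
}

// Made-up vectors, not produced by any model
const items: StoredItem[] = [
    { text: 'car', vector: [0.1, 0.9, 0.2] },
    { text: 'cat', vector: [0.9, 0.1, 0.0] },
    { text: 'kitten', vector: [0.8, 0.2, 0.1] },
];
const results = findMostSimilar([1, 0, 0], items, 2);
console.log(results.map(r => r.text)); // ['cat', 'kitten']
```

Comparing the query against every stored vector is fine for a handful of items. With large collections, that linear scan is the part a vector database replaces with an index.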

The Building Blocks of LLMs: Vectors, Tokens and Embeddings @ TheNewStack

A vector is a one-dimensional array, in this case of numbers only.

Tokens are the basic units of data processed by LLMs. In the context of text, a token can be a word, part of a word (subword), or even a character — depending on the tokenization process.

In LLMs, vectors are used to represent text or data in a numerical form that the model can understand and process. This representation is known as an embedding. Embeddings are high-dimensional vectors that capture the semantic meaning of words, sentences or even entire documents.