Sunday, May 26, 2024

AI course: RAG with Llamaindex

 DLAI - Building Agentic RAG with Llamaindex

What Is Retrieval-Augmented Generation aka RAG | NVIDIA Blogs

Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.

LlamaIndex, Data Framework for LLM Applications

LlamaIndex is the leading data framework for building LLM applications

AI LLMs: Tokens, Embeddings, Vector Databases

Strange as it may seem after trying ChatGPT,
modern AIs don't "speak" English, or any other human language.

They work with sequences of numbers, usually called arrays in programming and "vectors" in mathematics,

e.g. [123, 4, 56] or [0.34, 0.23, 0.01, 0.98], just much longer.

Digital computers "speak" numbers, and numbers only.

In particular, the GPU and NPU components used to process data
are optimized for very fast processing of such arrays.

So all text, images and other content need to be "translated" to such arrays of numbers,
processed, and the results translated back to human-understandable form.
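
As a rough illustration of this round trip, the sketch below turns a string into an array of numbers and back using the standard TextEncoder and TextDecoder classes. It works on raw UTF-8 bytes, which is an assumption made for simplicity; real models use learned tokenizers and embeddings rather than plain byte values. The function names are hypothetical.

```typescript
// Toy illustration only: UTF-8 bytes stand in for the "arrays of numbers".
// Real LLMs use tokenizers and embeddings instead of raw bytes.

// Translate text into an array of numbers (one number per UTF-8 byte).
function textToNumbers(text: string): number[] {
    return Array.from(new TextEncoder().encode(text));
}

// Translate an array of byte values back into human-readable text.
function numbersToText(numbers: number[]): string {
    return new TextDecoder().decode(Uint8Array.from(numbers));
}

const encoded = textToNumbers('Hi!');
console.log(encoded);                // [ 72, 105, 33 ]
console.log(numbersToText(encoded)); // Hi!
```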

Two concepts related to this translation look similar
but serve different purposes: "tokens" and "embeddings".
To make any sense of AI APIs, one needs to understand both.


Tokens are about usage and money: they measure the "quantity" of text sent to and received from an AI API.
Depending on the API used, there are limits on how much content, i.e. text,
can be sent at once, and the price is calculated from the number of tokens. A token is an integer that
usually maps to a few characters: a short word, or part of a longer word.
For example, "Hello World!" is converted to [9906, 4435, 0] = ["Hello", "World", "!"]
and "Sequences" to [1542, 45045] = ["Se", "quences"].

To measure and estimate the cost of AI API calls we can use libraries, e.g. OpenAI's tiktoken:

openai/tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI's models. @GitHub (Python)

tiktoken - npm (JavaScript)

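import { encoding_for_model } from 'tiktoken'
const prompt = 'Hello World!';
const encoder = encoding_for_model('gpt-3.5-turbo');
const tokens = encoder.encode(prompt); // array of integer token ids
console.log(tokens.length); // token count, the unit used for limits and pricing
encoder.free(); // release the WASM-backed encoder when done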
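
Once the token count is known, a cost estimate is simple arithmetic. The sketch below assumes a flat price per million tokens. The function name and the price used in the example are hypothetical placeholders, so check the provider's current price list for real figures.

```typescript
// Estimate the cost of a request from its token count.
// pricePerMillionTokens is a placeholder value, not a real price.
function estimateCost(tokenCount: number, pricePerMillionTokens: number): number {
    return (tokenCount / 1e6) * pricePerMillionTokens;
}

// Example: 3 tokens ("Hello World!") at a hypothetical 0.50 per million tokens.
const helloCost = estimateCost(3, 0.5);
console.log(helloCost.toFixed(8)); // 0.00000150
```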


Embeddings are about "meaning" and "similarity" of content.

Through the process of "training", AI "models" learn to map content to an array of floating point numbers, typically small values between -1 and 1, and such arrays, called vectors, can be compared to find their "similarity" or "distance".

For example, "Hello World!" can be converted to a vector of about 1,500 numbers, like this:
"input": "Hello World!",
"embedding": [

import OpenAI from 'openai'
const openai = new OpenAI(); // reads the API key from the OPENAI_API_KEY environment variable
export async function createEmbeddings(input: string | string[]) {
    return await openai.embeddings.create({
        input: input,
        model: 'text-embedding-3-small'
    });
}

"An embedding is a numerical representation of a piece of information, for example, text, documents, images, audio, etc. The representation captures the semantic meaning of what is being embedded

Since the embeddings capture the semantic meaning of the questions, it is possible to compare different embeddings and see how different or similar they are. Thanks to this, you can get the most similar embedding to a query, which is equivalent to finding the most similar FAQ. Check out our semantic search tutorial for a more detailed explanation of how this mechanism works."

Calculating Similarity between Embeddings

// Sum of element-wise products of two equal-length vectors
export function calcDotProduct(a: number[], b: number[]) {
    return a.map((value, index) => value * b[index]).reduce((sum, x) => sum + x, 0);
}
// Dot product divided by the product of the two vector lengths
function calcCosineSimilarity(a: number[], b: number[]) {
    const product = calcDotProduct(a, b);
    const aMagnitude = Math.sqrt(a.map((value) => value * value).reduce((sum, x) => sum + x, 0));
    const bMagnitude = Math.sqrt(b.map((value) => value * value).reduce((sum, x) => sum + x, 0));
    return product / (aMagnitude * bMagnitude);
}
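
To get a feel for what the score means, here is a small self-contained sketch with made-up two-dimensional vectors (real embeddings have hundreds or thousands of dimensions). The function name is chosen for this example only.

```typescript
// Cosine similarity: dot product divided by the product of the magnitudes.
function cosineSimilarity(a: number[], b: number[]): number {
    const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
    const magA = Math.sqrt(a.reduce((sum, value) => sum + value * value, 0));
    const magB = Math.sqrt(b.reduce((sum, value) => sum + value * value, 0));
    return dot / (magA * magB);
}

console.log(cosineSimilarity([1, 0], [2, 0]));  // 1  (same direction)
console.log(cosineSimilarity([1, 0], [0, 3]));  // 0  (perpendicular, unrelated)
console.log(cosineSimilarity([1, 0], [-1, 0])); // -1 (opposite direction)
```

Only the direction of the vectors matters, not their length, which is why [1, 0] and [2, 0] score a perfect 1.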

The Building Blocks of LLMs: Vectors, Tokens and Embeddings @ TheNewStack

A vector is a one-dimensional array, in this case of numbers only.

Tokens are the basic units of data processed by LLMs. In the context of text, a token can be a word, part of a word (subword), or even a character — depending on the tokenization process.

In LLMs, vectors are used to represent text or data in a numerical form that the model can understand and process. This representation is known as an embedding. Embeddings are high-dimensional vectors that capture the semantic meaning of words, sentences or even entire documents.

Vector Databases

Chroma: the AI-native open-source embedding database
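
Conceptually, a vector database stores embeddings alongside the original content and returns the entries closest to a query vector. The class below is a hypothetical in-memory sketch of that idea using brute-force search. It is not Chroma's API, and real vector databases add persistence and indexes for speed. The three-dimensional vectors are made up for illustration.

```typescript
interface StoredItem {
    text: string;
    vector: number[];
}

// Hypothetical in-memory store: brute-force cosine search over all items.
class TinyVectorStore {
    private items: StoredItem[] = [];

    add(text: string, vector: number[]): void {
        this.items.push({ text, vector });
    }

    // Return the k stored texts most similar to the query vector.
    query(queryVector: number[], k: number): string[] {
        return this.items
            .map((item) => ({
                text: item.text,
                score: TinyVectorStore.cosine(queryVector, item.vector),
            }))
            .sort((x, y) => y.score - x.score) // highest similarity first
            .slice(0, k)
            .map((entry) => entry.text);
    }

    private static cosine(a: number[], b: number[]): number {
        const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
        const norm = (v: number[]) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
        return dot / (norm(a) * norm(b));
    }
}

const store = new TinyVectorStore();
store.add('How do I reset my password?', [0.9, 0.1, 0.0]);
store.add('What are your opening hours?', [0.1, 0.9, 0.1]);
store.add('How do I change my email?', [0.7, 0.2, 0.1]);

// A made-up query vector pointing mostly along the first axis.
console.log(store.query([1, 0, 0], 2));
// [ 'How do I reset my password?', 'How do I change my email?' ]
```

In practice the vectors would come from an embedding model, and the query vector would be the embedding of the user's question.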

EV fast charging startup funded by Google

Google funded company plan to beat Tesla with 1,000's of 500kW chargers - YouTube

Gravity to add 500 kW EV charger trees on streets, targets Tesla @electrek

NY-based startup and EV infrastructure specialist Gravity has launched a new line of universal EV charger “trees” it hopes will bring convenient charging sessions curbside on city streets. The deployment will start modestly, but Gravity is targeting a street charging network that is “more expansive than Tesla’s current Supercharger network.”