How to Improve LLMs with RAG | by Shaw Talebi

Imports

We start by installing and importing the necessary Python libraries.

!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes
# if not running on Colab ensure transformers is installed too

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

Setting up the Knowledge Base

We can configure our knowledge base by defining our embedding model, chunk size, and chunk overlap. Here, we use the ~33M parameter bge-small-en-v1.5 embedding model from BAAI, which is available on the Hugging Face hub. Other embedding model options are available on this text embedding leaderboard.

# import any embedding model on HF hub
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = None # we won't use LlamaIndex to set up LLM
Settings.chunk_size = 256
Settings.chunk_overlap = 25
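
To make the two chunking settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words stand in for tokens, and `chunk_words` is a hypothetical helper written for illustration. LlamaIndex's own splitter works on real tokens and tries to respect sentence boundaries, so its output will differ.

```python
def chunk_words(text, chunk_size=256, chunk_overlap=25):
    """Split text into word chunks of at most chunk_size, sharing chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# toy example: 10 "words", windows of 4 with 1 word shared between neighbors
toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_words(toy_text, chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

A larger overlap makes it less likely that a sentence is cut off from its surrounding context at a chunk boundary, at the cost of storing more redundant text.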

Next, we load our source documents. Here, I have a folder called “articles”, which contains PDF versions of 3 Medium articles I wrote on fat tails. If running this in Colab, you must download the articles folder from the GitHub repo and manually upload it to your Colab environment.

For each file in this folder, the function below will read the text from the PDF, split it into chunks (based on the settings defined earlier), and store each chunk in a list called documents.

documents = SimpleDirectoryReader("articles").load_data()

Since the blogs were downloaded directly as PDFs from Medium, they resemble a webpage more than a well-formatted article. Therefore, some chunks may include text unrelated to the article, e.g., webpage headers and Medium article recommendations.

In the code block below, I refine the chunks in documents, removing most of the chunks before or after the meat of an article.

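print(len(documents)) # prints: 71

# markers of non-article text (paywall banner, publication header, read-time line)
unwanted = ["Member-only story", "The Data Entrepreneurs", " min read"]

# build a new list instead of calling remove() while iterating, which can skip chunks
documents = [doc for doc in documents if not any(s in doc.text for s in unwanted)]

print(len(documents)) # the original run printed 61; a full filter may remove a few more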
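
A caveat on this kind of filtering: calling `remove()` on a list while looping over the same list shifts the remaining items, so the loop can skip some of them. The sketch below shows the effect on made-up strings (not the actual article chunks) and contrasts it with a list comprehension, which checks every item.

```python
# made-up chunk texts for illustration; "X" marks text we want to drop
texts = ["X header", "X footer", "body one", "X banner", "body two"]

# pattern 1: remove while iterating (skips the item right after each removal)
looped = list(texts)
for t in looped:
    if "X" in t:
        looped.remove(t)

# pattern 2: build a new list, checking every item exactly once
filtered = [t for t in texts if "X" not in t]

print(looped)    # "X footer" survives because the loop stepped over it
print(filtered)  # only the two body chunks remain
```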

Finally, we can store the refined chunks in a vector database.

index = VectorStoreIndex.from_documents(documents)

Setting up a Retriever

With our knowledge base in place, we can create a retriever using LlamaIndex's VectorIndexRetriever(), which returns the top 3 most similar chunks to a user query.

# set number of docs to retrieve
top_k = 3

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=top_k,
)
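
Conceptually, the retriever embeds the query and returns the chunks whose embeddings are closest to it. The sketch below illustrates that idea with cosine similarity over hand-written 3-dimensional vectors. The vectors, chunk names, and the `top_k_chunks` helper are all made up for illustration and are not LlamaIndex's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# hypothetical embeddings; a real embedding model produces hundreds of dimensions
toy_chunk_vecs = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [1.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
toy_query_vec = [1.0, 0.1, 0.0]

toy_hits = top_k_chunks(toy_query_vec, toy_chunk_vecs, k=3)
for name, score in toy_hits:
    print(name, round(score, 3))
```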

Next, we define a query engine that uses the retriever and the query to return a set of relevant chunks.

# assemble query engine
query_engine = RetrieverQueryEngine(
retriever=retriever,
node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
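
The similarity cutoff acts as a filter on the retrieved chunks: anything scoring below the threshold is dropped, so the engine may return fewer than top_k chunks (or none). Here is a minimal sketch of that filtering step, using made-up scores and a hypothetical `apply_cutoff` helper rather than LlamaIndex's own classes.

```python
def apply_cutoff(scored_chunks, similarity_cutoff=0.5):
    """Keep only (text, score) pairs whose score is at least the cutoff."""
    return [(text, score) for text, score in scored_chunks if score >= similarity_cutoff]


# made-up retrieval results: (chunk text, similarity score), best first
toy_results = [
    ("chunk about fat tails", 0.82),
    ("chunk about kappa", 0.61),
    ("page footer", 0.34),
]

kept = apply_cutoff(toy_results, similarity_cutoff=0.5)
print(len(toy_results), "retrieved,", len(kept), "kept")
```

Because of this, code that reads the results is safer looping over whatever nodes came back than assuming exactly top_k of them.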

Using the Query Engine

Now, with our knowledge base and retrieval system set up, let's use it to return chunks relevant to a query. Here, we'll pass the same technical question we asked ShawGPT (the YouTube comment responder) in the previous article.

query = "What is fat-tailedness?"
response = query_engine.query(query)

The query engine returns a response object containing the text, metadata, and indexes of relevant chunks. The code block below returns a more readable version of this information.

# reformat response
context = "Context:\n"
# loop over the nodes actually returned; the similarity cutoff can leave fewer than top_k
for node in response.source_nodes[:top_k]:
    context = context + node.text + "\n\n"

print(context)

Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma [2].
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to which
rare events drive the aggregate statistics of a distribution. From this point of
view, fat-tailedness lives on a spectrum from not fat-tailed (i.e. a Gaussian) to
very fat-tailed (i.e. Pareto 80 – 20).
This maps directly to the idea of Mediocristan vs Extremistan discussed
earlier. The image below visualizes different distributions across this
conceptual landscape [2].

print("mean kappa_1n = " + str(np.mean(kappa_dict[filename])))
print("")
Mean κ (1,100) values from 1000 runs for each dataset. Image by author.
These more stable results indicate Medium followers are the most fat-tailed,
followed by LinkedIn Impressions and YouTube earnings.
Note: One can compare these values to Table III in ref [3] to better understand each
κ value. Namely, these values are comparable to a Pareto distribution with α
between 2 and 3.
Although each heuristic told a slightly different story, all signs point toward
Medium followers gained being the most fat-tailed of the 3 datasets.
Conclusion
While binary labeling data as fat-tailed (or not) may be tempting, fat-
tailedness lives on a spectrum. Here, we broke down 4 heuristics for
quantifying how fat-tailed data are.
Pareto, Power Laws, and Fat Tails
What they don’t teach you in statistics
towardsdatascience.com
Although Pareto (and more generally power law) distributions give us a
salient example of fat tails, this is a more general notion that lives on a
spectrum ranging from thin-tailed (i.e. a Gaussian) to very fat-tailed (i.e.
Pareto 80 – 20).
The spectrum of Fat-tailedness. Image by author.
This view of fat-tailedness provides us with a more flexible and precise way of
categorizing data than simply labeling it as a Power Law (or not). However,
this begs the question: how do we define fat-tailedness?
4 Ways to Quantify Fat Tails

Adding RAG to LLM

We start by downloading the fine-tuned model from the Hugging Face hub.

# load fine-tuned model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
device_map="auto",
trust_remote_code=False,
revision="main")
config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

As a baseline, we can see how the model responds to the technical question without any context from the articles. To do this, we create a prompt template using a lambda function, which takes in a viewer comment and returns a prompt for the LLM. For more details on where this prompt comes from, see the previous article of this series.

# prompt (no context)
intstructions_string = f"""ShawGPT, functioning as a virtual data science \
consultant on YouTube, communicates in clear, accessible language, escalating \
to technical depth upon request. It reacts to feedback aptly and ends \
responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, \
providing concise acknowledgments to brief expressions of gratitude or \
feedback, thus keeping the interaction natural and engaging.
Please respond to the following comment.
"""
prompt_template = lambda comment: f'''[INST] {intstructions_string} \n{comment} \n[/INST]'''
comment = "What is fat-tailedness?"
prompt = prompt_template(comment)
print(prompt)

[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.
Please respond to the following comment.
What is fat-tailedness? 
[/INST]

Next, we can pass this prompt to the model using the code below.

model.eval()

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"),
                         max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])

Here is the model's response (no context).

Great question!

Fat-tailedness is a statistical property of a distribution. In simple terms,
it refers to the presence of extreme outliers or heavy tails in the 
distribution.
For instance, consider the distribution of heights in a population. A normal 
distribution would have most people clustered around an average height with a 
few people deviating slightly from the mean. However, in a fat-tailed 
distribution, you would observe a larger number of people being significantly 
taller or shorter than the average.
In finance, fat-tailedness is often used to describe the presence of extreme 
market moves or outliers in financial returns. It's an important concept in 
risk management and financial modeling.
I hope this explanation helps! Let me know if you have any questions.
–ShawGPT

Although the response's style and formatting are great, the model's explanation of fat-tailedness differs from how I defined it in my video and blog series.

Let's see what happens to the model's response when we include the appropriate context. To do this, we create another prompt template, which can also take in context from the RAG system.

# prompt (with context)
prompt_template_w_context = lambda context, comment: f"""[INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.

{context}
Please respond to the following comment. Use the context above if it is helpful.
{comment}
[/INST]
"""
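
The same kind of template can also be written as a regular function, which is easier to test and document than a lambda. The sketch below is one way to do that. The `build_prompt_with_context` name and the shortened instruction text are assumptions for illustration, not the prompt used in this article.

```python
# shortened stand-in for the full ShawGPT instructions (illustration only)
INSTRUCTIONS = "ShawGPT answers YouTube comments clearly and signs off with '–ShawGPT'."


def build_prompt_with_context(context, comment, instructions=INSTRUCTIONS):
    """Wrap instructions, retrieved context, and the comment in [INST] ... [/INST] tags."""
    return (
        f"[INST]{instructions}\n\n"
        f"{context}\n"
        "Please respond to the following comment. Use the context above if it is helpful.\n\n"
        f"{comment}\n"
        "[/INST]"
    )


toy_prompt = build_prompt_with_context(
    "Context:\nFat tails live on a spectrum.",
    "What is fat-tailedness?",
)
print(toy_prompt)
```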

Next, we pass the prompt, with the context from the query engine and the viewer comment, to the model.

prompt = prompt_template_w_context(context, comment)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)

print(tokenizer.batch_decode(outputs)[0])
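
Note that for a causal language model, `generate()` returns the prompt tokens followed by the new tokens, so the decoded string contains the prompt as well as the answer. Here is a minimal sketch of pulling out just the reply, assuming the `[/INST]` closing tag used in the prompt and a `</s>` end-of-sequence marker. `extract_response` is a hypothetical helper, and the decoded string below is made up.

```python
def extract_response(decoded, end_tag="[/INST]", eos="</s>"):
    """Return the text after the last end_tag, without the end-of-sequence marker."""
    _, _, reply = decoded.rpartition(end_tag)  # whole string if end_tag is absent
    return reply.replace(eos, "").strip()


# made-up stand-in for tokenizer.batch_decode(outputs)[0]
toy_decoded = "<s> [INST] Please respond to the following comment.\nHi! [/INST] Hello there! –ShawGPT</s>"
toy_reply = extract_response(toy_decoded)
print(toy_reply)
```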

Here is the new response (with context).

Great question!

Fat-tailedness is a property of a distribution that describes how much the
presence of extreme values (rare events) affects the overall statistics of the 
distribution.
In the context of the video, the author discusses how Mediocristan and 
Extremistan distributions differ in terms of fat-tailedness. Mediocristan 
distributions have a relatively small impact from extreme values, while 
Extremistan distributions are heavily influenced by extreme values.
The author also mentions that fat-tailedness is a more general concept than 
specific distributions like Pareto or Power Laws. Instead, it lives on a 
spectrum ranging from thin-tailed (Gaussian) to very fat-tailed (Pareto 80-20).
I hope that helps clarify things a bit! Let me know if you have any questions.
–ShawGPT

This does a much better job of capturing my explanation of fat tails than the no-context response, and it even calls out the niche concepts of Mediocristan and Extremistan.

Here, I gave a beginner-friendly introduction to RAG and shared a concrete example of how to implement it using LlamaIndex. RAG allows us to improve an LLM system with updatable and domain-specific knowledge.

While much of the recent AI hype has centered around building AI assistants, a powerful (yet less popular) innovation has come from text embeddings (i.e., the things we used to do retrieval). In the next article of this series, I will explore text embeddings in more detail, including how they can be used for semantic search and classification tasks.
