Handling feedback loops in recommender systems: Deep Bayesian Bandits | by Sachin Hosmani | Jul, 2024

Understanding the fundamentals of exploration and Deep Bayesian Bandits to tackle feedback loops in recommender systems

Image by ChatGPT-4o

Recommender system models are typically trained to optimize for user engagement, such as clicks and purchases. The well-meaning intent behind this is to favor items that the user has engaged with in the past. However, this creates a feedback loop that over time can manifest as the “cold-start problem”. Simply put, items that have historically been popular with a user tend to remain the model’s favorites, while new but highly relevant items receive little exposure. In this article, I introduce exploration techniques from the basics and finally explain Deep Bayesian Bandits, a highly effective algorithm described in a paper by Guo, Dalin, et al. [1].

Let us use a simple ad recommender system as a running example throughout this article.

A simple three-component ad recommender system. Image by author

It is a three-component system

  • Retrieval: a component that efficiently retrieves candidates for ranking
  • Ranking: a deep neural network that predicts the click-through rate (CTR) as the score of a given ad for a user
    score = predict_ctr(user features, ad features)
  • Auction: a component that
    – retrieves candidate ads for the user
    – scores them using the ranking model
    – selects the highest-scoring ad and returns it*

In this article we will focus exclusively on the ranking model.

*Real-world auction systems also take the ad’s bid amount into account, but we ignore that for simplicity.
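In Python, the auction flow might look like the following sketch; `predict_ctr` here is a hypothetical stand-in for the ranking model (a real system would invoke the DNN), and the feature encoding is made up purely for illustration.

```python
import math

def predict_ctr(user_features, ad_features):
    # Hypothetical stand-in for the ranking model: squashes a simple
    # interaction of the features into (0, 1). A real system would run
    # the deep neural network here.
    return 1.0 / (1.0 + math.exp(-0.1 * sum(user_features) * sum(ad_features)))

def run_auction(user_features, candidate_ads):
    # Score every retrieved candidate with the ranking model and return
    # the highest-scoring ad (real systems also weigh in the bid amount,
    # which we ignore for simplicity).
    scored = [(predict_ctr(user_features, ad["features"]), ad)
              for ad in candidate_ads]
    return max(scored, key=lambda pair: pair[0])[1]
```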

Ranking model architecture

The ranking model is a deep neural network that predicts the click-through rate (CTR) of an ad, given the user and ad features. For simplicity, I propose a fully connected deep neural network below, but it could be enriched with techniques such as wide & deep networks, DCN, and DeepFM without losing the applicability of the methods I explain in this article.

Deep neural network with a binary classifier head predicting pCTR. Image by author
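As a rough, untrained sketch of this architecture (the layer sizes and initialization are arbitrary assumptions), the network is just a stack of fully connected ReLU layers with a sigmoid output:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(layer_sizes):
    # One (weights, bias) pair per fully connected layer; randomly
    # initialized here purely for illustration (no training shown).
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def pctr(params, x):
    # ReLU hidden layers followed by a sigmoid output: the predicted
    # click-through probability for one concatenated user-ad feature vector.
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    logit = (x @ W + b)[0]
    return 1.0 / (1.0 + np.exp(-logit))
```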

Training data

The ranking model is trained on data that includes clicks as binary labels and the concatenation of user and ad features. The exact set of features used is not important for this article, but I have assumed that some features related to the advertiser’s brand are present to help the model learn the user’s affinity toward brands.

Training data with sample features. Image by author

Let us imagine that we have successfully trained our ranking model on our ad-click dataset, and that the model has learned that one of our users, Jane, loves buying handbags from the handbag company “Vogue Voyage”. But there is a new handbag company, “Radiant Clutch”, on the market, and it sells fantastic handbags. However, even though “Radiant Clutch” runs ad campaigns to reach users like Jane, Jane never sees their ads. This is because our ranking model has learned so firmly that Jane likes “Vogue Voyage” handbags that only their ads get shown to her. Sometimes Jane clicks on them, and when the model is trained further on these new clicks, this only reinforces the model’s belief. It becomes a vicious cycle that leaves some items languishing in the dark.

The feedback loop in action, causing the cold-start problem: Radiant Clutch’s handbags don’t stand a chance. Image by author, thumbnails generated with ChatGPT-4o

If we reflect on this, we realize that the model did nothing wrong in learning that Jane likes “Vogue Voyage” handbags. The problem is simply that the model is not being given the opportunity to learn about Jane’s interest in handbags from other companies.

Exploration vs exploitation

This is a good moment to introduce the exploration-exploitation trade-off.

Exploitation: During the ad auction, once we get our CTR predictions from our ranking model, we simply select the highest-scoring ad. This is a 100% exploitation strategy because we act entirely on our current best knowledge to achieve the greatest immediate reward.

Exploration: What has been missing from our approach is the willingness to take some risk and show an ad even if it wasn’t assigned the highest score. If we did, the user might click on it, and the ranking model, when updated with this data, would learn something new about the user. But if we never take the risk, the model will never learn anything new. This is the motivation behind exploration.

Exploration and exploitation are forces we must try to balance. If we don’t explore enough, we run into the cold-start problem, and if we explore too much, we risk showing highly irrelevant ads to users, losing their trust and our money.

Now that we have set the stage for exploration, let us dive into some concrete techniques for controlled exploration.

ε-greedy policy

The idea here is simple. In our auction service, when we have the scores of all the candidate ads, instead of just taking the highest-scoring ad, we do the following:

  1. select a random number r in [0, 1)
  2. if r < ε, select a random ad from our candidates (exploration)
  3. else, select the top-scored ad (exploitation)

where ε is a constant that we carefully select in [0, 1) knowing that the algorithm will explore with probability ε and exploit with probability 1 − ε.
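The three steps above can be sketched in a few lines of Python, assuming the auction service already has (ad, pCTR) pairs in hand:

```python
import random

def epsilon_greedy_select(scored_ads, epsilon, rng=random):
    # scored_ads: list of (ad_id, pCTR) pairs from the ranking model.
    if rng.random() < epsilon:
        # explore: pick any candidate uniformly at random
        return rng.choice(scored_ads)[0]
    # exploit: pick the candidate with the highest predicted CTR
    return max(scored_ads, key=lambda pair: pair[1])[0]
```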

Exploration with ε probability: pick any candidate ad at random. Image by author
Exploitation with 1 − ε probability: pick the highest CTR ad. Image by author

This is a very simple yet powerful technique. However, it can be too naive because when it explores, it completely randomly selects an ad. Even if an ad has an absurdly low pCTR prediction that the user has repeatedly disliked in the past, we might still show the ad. This can be a bit harsh and can lead to a serious loss in revenue and user trust. We can certainly do better.

Upper confidence bound (UCB)

Our motivation for exploration was to ensure that all ad candidates have an opportunity to be shown to the user. But as we give some exposure to an ad, if the user still doesn’t engage with it, it becomes prudent to cut future exposure to it. So, we need a mechanism by which we select the ad based on both its score estimate and also the amount of exposure it has already received.

Imagine our ranking model could produce not just the CTR score but also a confidence interval for it*.

*how this is achieved is explained later in the article

The model predicts a confidence interval along with the score. Image by author

Such a confidence interval is typically inversely proportional to the amount of exposure the ad has received because the more an ad is shown to the user, the more user feedback we have about it, which reduces the uncertainty interval.

Increased exposure to an ad leads to a decrease in the confidence interval in the model’s score prediction. Image by author

During auction, instead of selecting the ad with the greatest pCTR, we select the ad with the highest upper confidence bound. This approach is called UCB. The philosophy here is “Optimism in the face of uncertainty”. This approach effectively takes into account both the ad’s score estimate and also the uncertainty around it.

UCB in action: Ad-1 wins auction at first on account of its large confidence interval, but as the model learns about it, its UCB falls leading to Ad-2 winning auction. Image by author
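The selection rule itself is tiny. Here is a sketch assuming the model returns a point estimate plus a standard-error-style uncertainty; the z-multiplier is a tunable assumption of mine, not something prescribed by the paper:

```python
def ucb_select(scored_ads, z=1.96):
    # scored_ads: (ad_id, pctr, stderr) triples, where stderr is the model's
    # uncertainty estimate (how it is obtained is covered later).
    # z = 1.96 corresponds to a ~95% interval; the exact value is a choice.
    return max(scored_ads, key=lambda a: a[1] + z * a[2])[0]
```

Note how a wide interval can let a lower point estimate win the auction, which is exactly the "optimism in the face of uncertainty" behavior we want.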

Thompson sampling

The UCB approach went with the philosophy of “(complete) optimism in the face of uncertainty”. Thompson sampling softens this optimism a little. Instead of using the upper confidence bound as the score of an ad, why not sample a score in the posterior distribution?

For this to be possible, imagine our ranking model could produce not just the CTR and the confidence interval but an actual score distribution*.

*how this is achieved is explained later in the article

The model can predict a distribution of scores for one ad. Image by author

Then, we just sample a score from this distribution and use that as the score during auction.
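A sketch of that sampling step, assuming for illustration that each ad's score posterior is Gaussian with a mean and standard deviation produced by the model:

```python
import random

def thompson_select(scored_ads, rng=random):
    # scored_ads: (ad_id, mean_pctr, stdev) triples. Sample one score from
    # each ad's (assumed Gaussian) posterior and let the highest sample
    # win the auction.
    samples = [(rng.gauss(mu, sd), ad_id) for ad_id, mu, sd in scored_ads]
    return max(samples)[1]
```

An ad with a wide distribution will occasionally draw a high sample and win, while a well-understood ad wins roughly in proportion to how good its mean really is.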

Ad-1 wins auction due to a high sampled score from its wide distribution. Image by author
Ad-1 has received exposure and the model has less uncertainty about it. Ad-2 wins the auction due to its higher score distribution mass. Image by author
Ad-2’s score distribution stdev further shrinks as it gets more exposure. Image by author

For the UCB and Thompson sampling techniques to work, we must update our model as often as possible. Only then can it update its uncertainty estimates in response to user feedback. The ideal setup is a continuous learning setup where user feedback events are sent in near-real time to the model to update its weights. However, periodically updating the model’s weights in a stateful manner is also a viable option if continuous learning infrastructure is too expensive to set up.

A high-level continuous learning setup utilizing streaming infrastructure. Image by author, thumbnail generated by ChatGPT-4o

In the UCB and Thompson sampling approaches, I explained the idea of our model producing not just one score but an uncertainty measure as well (either as a confidence interval or a distribution of scores). How can this be possible? Our DNN can produce just one output after all! Here are the approaches discussed in the paper.

Bootstrapping

Bootstrapping in statistics simply means sampling with replacement. What this means for us is that we apply bootstrapping on our training dataset to create several closely related but slightly different datasets and train a separate model with each dataset. The models learned would thereby be slight variants of each other. If you have studied decision trees and bagging, you would already be familiar with the idea of training multiple related trees that are slight variants of each other.

Bootstrapped datasets are used to train separate models, resulting in a distribution of scores. Image by author

During auction, for each ad, we get one score from each bootstrapped model. This gives us a distribution of scores which is exactly what we wanted for Thompson sampling. We can also extract a confidence interval from the distribution if we choose to use UCB.
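Building the bootstrapped training sets is a one-liner per model; a sketch (the ensemble size and seed are arbitrary):

```python
import random

def bootstrap_datasets(dataset, n_models, rng=random.Random(42)):
    # Sample with replacement to build one training set per model; each set
    # has the same size as the original but a slightly different composition,
    # so the models trained on them become slight variants of each other.
    n = len(dataset)
    return [[dataset[rng.randrange(n)] for _ in range(n)]
            for _ in range(n_models)]
```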

The biggest drawback with this approach is the sheer computational and maintenance overhead of training and serving several models.

Multi-head bootstrapping

To mitigate the costs of several bootstrapped models, this approach unifies the several models into one multi-head model with one head for each output.

Multi-head model. Image by author

The key cost reduction comes from the fact that all the layers except the last are shared.

Training is done as usual on bootstrapped subsets of data. While each bootstrapped subset of data should be used to update the weights of all the shared layers, care must be taken to update the weights of just one output head with each subset of data.

Constrained influence of each bootstrapped subset of data on one head during backprop. Image by author
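The serving side of the multi-head idea can be sketched as one trunk pass followed by one cheap sigmoid per head; all shapes and the random initialization here are illustrative, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_head_scores(x, shared_layers, heads):
    # One pass through the shared ReLU trunk...
    for W, b in shared_layers:
        x = np.maximum(x @ W + b, 0.0)
    # ...then one sigmoid score per head: a distribution of pCTR estimates
    # at a fraction of the cost of running several full models.
    return np.array([1.0 / (1.0 + np.exp(-(x @ w + b))) for w, b in heads])
```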

Stochastic gradient descent (SGD)

Instead of using separate bootstrapped datasets to train different models, we can just use one dataset but train each model with SGD from a random weight initialization, thus exploiting the inherent stochasticity of SGD. Each model trained this way becomes a variant of the others.

Multi-head SGD

In the same way that a multi-head architecture brought the number of models trained with bootstrapping down to one, we can use a multi-head architecture with SGD. We just have to randomly initialize the weights at each head so that, upon training on the whole dataset, each head learns to be a slight variant of the others.

Forward-propagation dropout

Dropout is a well-known regularization technique where during model training, some of the nodes of a layer are randomly dropped to prevent chances of overfitting. We borrow the same idea here except that we use it during forward propagation to create controlled randomness.

We modify our ranking model’s last layer to introduce dropout. Then, when we want to score an ad, we pass it through the model several times, each time getting a slightly different score on account of the randomness introduced by dropout. This gives us the distribution and confidence interval that we seek.

The same model produces a distribution of scores through random dropout. Image by author
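A sketch of this inference-time dropout, assuming we have the activations feeding the final layer for one ad; the shapes, drop probability, and number of passes are all illustrative:

```python
import numpy as np

def dropout_scores(activations, W_out, b_out, n_passes=50, p_drop=0.5,
                   rng=np.random.default_rng(0)):
    # activations: inputs to the final layer for one ad. Each pass draws a
    # fresh Bernoulli mask (dropout kept ON at inference) so the sigmoid
    # score differs slightly every time, yielding a score distribution.
    scores = []
    for _ in range(n_passes):
        mask = rng.random(activations.shape) >= p_drop
        h = np.where(mask, activations / (1.0 - p_drop), 0.0)  # inverted dropout
        scores.append(1.0 / (1.0 + np.exp(-(h @ W_out + b_out))))
    return np.array(scores)
```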

One significant disadvantage of this approach is that it requires several full forward passes through the network which can be quite costly during inference time.

Hybrid

In the hybrid approach, we perform a key optimization to give us the advantages of dropout and bootstrapping while bringing down the serving and training costs:

  • With dropout applied to just the last-but-one layer, we don’t have to run a full forward pass several times to generate our score distribution. We can do one forward pass until the dropout layer and then do several invocations of just the dropout layer in parallel. This gives us the same effect as the multi-head model where each dropout output acts like a multi-head output.

Also, with dropout deactivating one or more nodes randomly, it serves as a Bernoulli mask on the higher-order features at its layer, thus producing an effect equivalent to bootstrapping with different subsets of the dataset.
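A sketch of this optimization: the trunk runs once, and the dropout masks are applied in a vectorized batch over just the final layer, so each mask plays the role of one "head" (all sizes here are assumptions):

```python
import numpy as np

def hybrid_scores(x, shared_layers, W_out, b_out, n_samples=20, p_drop=0.5,
                  rng=np.random.default_rng(0)):
    # A single forward pass through the shared trunk...
    for W, b in shared_layers:
        x = np.maximum(x @ W + b, 0.0)
    # ...then many cheap invocations of just the dropout + output layer,
    # vectorized as a batch of Bernoulli-masked copies of the activations.
    masks = rng.random((n_samples, x.shape[-1])) >= p_drop
    h = np.where(masks, x / (1.0 - p_drop), 0.0)
    return 1.0 / (1.0 + np.exp(-(h @ W_out + b_out)))
```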

So which technique should you choose? Unfortunately, there is no easy answer. The best way is to experiment under the constraints of your problem and see what works best. But if the findings from the authors of the Deep Bayesian Bandits paper are anything to go by,

  1. ε-greedy unsurprisingly gives the lowest CTR improvement due to its unsophisticated exploration; however, its simplicity and low cost make it very alluring.
  2. UCB generally outperformed Thompson sampling.
  3. Bootstrap UCB gave the highest CTR return but was also the most computationally expensive due to the need to work with multiple models.
  4. The hybrid model which relied on dropout at the penultimate layer needed more training epochs to perform well and was on par with SGD UCB’s performance but at lower computational cost.
  5. The model’s PR-AUC measured offline was inversely related to the CTR gain: this is an important observation showing that offline performance can easily be attained by giving the model easier training data (for example, data not containing significant exploration), but that will not always translate to online CTR uplifts. This underscores the importance of robust online tests.

That said, the findings can be quite different for a different dataset and problem. Hence, real-world experimentation remains vital.

In this article, I introduced the cold-start problem created by feedback loops in recommender systems. Following the Deep Bayesian Bandits paper, we framed our ad recommender system as a k-arm bandit and saw many practical applications of reinforcement learning techniques to mitigate the cold-start problem. We also scratched the surface of capturing uncertainty in our neural networks which is a good segue into Bayesian networks.

[1] Guo, Dalin, et al. “Deep Bayesian Bandits: Exploring in Online Personalized Recommendations”. Proceedings of the 14th ACM Conference on Recommender Systems. 2020.