Introduction
In pairwise ranking, the model learns the relative order of an item while considering the other items in the search results. For example, if a search query returns 10 results and only 2 of them receive a conversion event (click, reply, like, subscribe, comment, etc.), the model learns patterns that rank those two converting items above the rest of the results.
In this tutorial, we'll learn how to use XGBoost to build pairwise ranking models. Internally, XGBoost uses a combination of pairwise and pointwise loss to train ranking models.
Pointwise vs Pairwise Loss
To understand how pairwise loss is calculated, let's first look at a simpler loss function: pointwise loss. In a pointwise model, each sample is treated independently of the others. To measure pointwise loss, we use squared loss as shown:
\[f(y, s) = \sum_{i=1}^{n}(y_{i} - s_{i})^{2}\]
where \(y_i\) refers to the actual relevance label, \(s_i\) refers to the predicted score and \(n\) refers to the number of items in the dataset. In the pairwise case, instead of an absolute prediction, the loss function considers the relevance of a pair as shown:
\[f(y, s)=\sum_{y_{i}>y_{j}}\log_{2}\left( 1 + e^{-\Theta\left( s_{i}-s_{j} \right)} \right)\]
where \(y_i\), \(y_j\) refer to the actual relevance labels, \(s_i\), \(s_j\) refer to the predicted scores of the items, and \(\Theta\) is a hyper-parameter. As you may have already guessed:
- When \(y_i > y_j\) and \(s_i - s_j > 0\), the model incurs a smaller loss.
- When \(y_i > y_j\) and \(s_i - s_j < 0\), the model incurs a greater loss.
The above equation is nothing but the logistic loss with a slight modification: the pairwise score difference is used in place of the classic weighted equation ( \(b_0 + b_1x + ...\) ).
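To make the two cases above concrete, here is a minimal sketch (not from the original article) of the pairwise logistic loss for a pair where item i is more relevant than item j; the choice of theta=1.0 is an assumption made purely for illustration:

import numpy as np

def pairwise_logistic_loss(s_i, s_j, theta=1.0):
    """Loss for a pair where item i is more relevant than item j (y_i > y_j).

    `theta` is the shape hyper-parameter; theta=1.0 is an assumed default here.
    """
    return np.log2(1.0 + np.exp(-theta * (s_i - s_j)))

# Correctly ordered pair (s_i > s_j) -> smaller loss
print(pairwise_logistic_loss(s_i=2.0, s_j=0.5))   # ~0.29
# Incorrectly ordered pair (s_i < s_j) -> larger loss
print(pairwise_logistic_loss(s_i=0.5, s_j=2.0))   # ~2.45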
What are these pairs though? Let's understand them using an example.
Creating pairs with LambdaMART
LambdaMART is a tree-based version of LambdaRank (it builds an ensemble of weak trees). Basically, LambdaRank introduces a rank-based metric, \(\Delta\text{NDCG}\), into the loss function to generate an optimised ranking of items. It takes into account the difference in NDCG obtained by swapping items within a query.
With NDCG, the loss function becomes:
\[f(y, s)=\sum_{y_{i}>y_{j}}\log_{2}\left( 1 + e^{-\Theta\left( s_{i}-s_{j} \right)} \right)\cdot\left|\Delta\text{NDCG}_{ij}\right|\]
where \(\Delta\text{NDCG}_{ij}\) is calculated by swapping the items within a given query.
To explain how the algorithm works, I’m going to use a data set which looks as follows:
label | item_name | item_id | position | query_id | feature_0 | feature_1 | feature_2 |
0 | adidas shoes | 323 | 1 | 0 | 0.034 | 0.120 | 0.109 |
0 | adidas x2 | 876 | 2 | 0 | 0.107 | 0.001 | 0.612 |
1 | nike pegasus | 988 | 3 | 0 | 0.987 | 0.543 | 0.766 |
1 | protein shaker | 632 | 1 | 1 | 0.102 | 0.231 | 0.221 |
The algorithm works like this:
- For each query_id:
  - Calculate the ideal DCG (based on sorting the label column, with the highest relevance first and so on).
  - Swap each item in the query and calculate the change in DCG (i.e. the new DCG after swapping minus the ideal DCG), as shown in the sketch below. This change will be negative or positive depending on whether the swap leads to an improvement or not.
  - Calculate the average gain in DCG for each item after making all the swaps.
  - The average gain calculated is used as the label in the model.
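The following is a simplified sketch of the DCG quantities involved, not XGBoost's exact internal computation. It assumes the standard exponential gain and 1/log2(position + 1) discount, and uses the labels of query_id=0 from the example table:

import numpy as np

def dcg(labels):
    """DCG: gain (2^label - 1) discounted by log2(position + 1)."""
    positions = np.arange(1, len(labels) + 1)
    return np.sum((2 ** np.asarray(labels) - 1) / np.log2(positions + 1))

def delta_dcg_from_swap(labels, i, j):
    """Change in DCG when the items at positions i and j are swapped."""
    swapped = list(labels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return dcg(swapped) - dcg(labels)

# Labels for query_id=0 from the example table, in their current display order.
labels_q0 = [0, 0, 1]
print(dcg(sorted(labels_q0, reverse=True)))  # ideal DCG = 1.0 (relevant item first)
print(delta_dcg_from_swap(labels_q0, 0, 2))  # +0.5: moving the relevant item to the top improves DCG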
Fortunately, all this heavy lifting happens internally inside XGBoost. Let's train a model using the data shown above. I've installed xgboost using: pip install xgboost==2.0.1. The latest version also contains logic to handle position bias, i.e. a click on an item shown at a lower position is a stronger relevance signal than a click at a higher position, hence we should weight it higher.
import pandas as pd
import xgboost as xgb
from sklearn import metrics

features = ['feature_0', 'feature_1', 'feature_2']

ranker = xgb.XGBRanker(
    tree_method="hist",
    objective="rank:ndcg",
    lambdarank_unbiased=True,           # correct for position bias in the clicks
    lambdarank_num_pair_per_sample=10,  # number of pairs sampled per item
    eval_metric=["ndcg@1", "ndcg@8"],
    lambdarank_bias_norm=2,
    n_estimators=1000,
    early_stopping_rounds=30)           # configured on the estimator in xgboost 2.x

# Rows must be grouped by query: XGBoost expects the qid column to be sorted.
ranker.fit(
    traindf[features],
    traindf['label'],
    qid=traindf['query_id'],
    eval_set=[(traindf[features], traindf['label']),
              (testdf[features], testdf['label'])],
    eval_qid=[traindf['query_id'], testdf['query_id']],
    verbose=50)
testdf['pred'] = ranker.predict(testdf[features])
In the above code:
- traindf and testdf are two dataframes containing data as shown above.
- objective="rank:ndcg": since NDCG is not differentiable, XGBoost uses LambdaMART to directly optimise NDCG.
- lambdarank_unbiased: handles position bias.
- lambdarank_num_pair_per_sample: creates the pairs for each sample. For example, if a query has 3 items and we set lambdarank_num_pair_per_sample=2, then for that query we'll end up with 6 pairs. Tuning this parameter affects the model performance.
- qid: refers to the column which contains the integer encoded query_id.
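For completeness, here is a minimal sketch of how traindf could be assembled from the example table (testdf would be built the same way); the construction below is only an illustration, and the one requirement carried over from above is that rows stay sorted by query_id before calling fit:

import pandas as pd

# Hypothetical construction of traindf matching the example table above.
traindf = pd.DataFrame({
    'label':     [0, 0, 1, 1],
    'query_id':  [0, 0, 0, 1],  # integer encoded query id
    'feature_0': [0.034, 0.107, 0.987, 0.102],
    'feature_1': [0.120, 0.001, 0.543, 0.231],
    'feature_2': [0.109, 0.612, 0.766, 0.221],
})

# XGBoost expects query groups to be contiguous, so keep rows sorted by query_id.
traindf = traindf.sort_values('query_id').reset_index(drop=True)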
Once we have generated the predictions, we can evaluate the model using the NDCG@k metric. For our example dataset, we'll use k=3:
score = testdf.groupby('query_id').apply(
    lambda x: metrics.ndcg_score([x['label']], [x['pred']], k=3)
).mean()
print(score)
0.413
The NDCG metric is commonly used for ranking tasks because it rewards correct predictions near the top of the ranking more than those further down. For example, an item correctly ranked at position 1 will boost the NDCG score more than an item correctly predicted at rank 10, since the metric discounts the gain of an item by its position.
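To see the discounting in action, here is a small sketch (using the same sklearn metrics module imported earlier) comparing a single relevant item ranked first versus last out of 10 results; the scores are made up for illustration:

from sklearn import metrics

# One query with 10 items; only the first item is relevant (label 1).
true_relevance = [[1] + [0] * 9]

# Scores that put the relevant item at rank 1 vs. rank 10.
scores_top    = [[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]]
scores_bottom = [[1, 10, 9, 8, 7, 6, 5, 4, 3, 2]]

print(metrics.ndcg_score(true_relevance, scores_top))     # 1.0
print(metrics.ndcg_score(true_relevance, scores_bottom))  # ~0.29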
Summary
In this article, we discussed the basics of pairwise ranking and LambdaMART. XGBoost provides an out-of-the-box implementation of the LambdaMART algorithm while handling position bias. You can read more about XGBoost learning to rank in its official documentation.
Hi Manish! Thanks for the explanation, but I have a doubt. Let's say I have user-product purchase pair data where the user features are constant for each product bought. Since you are labelling each purchase as 1 or 0, how does the model learn from the same features with different labels for a single query_id (user_id in my case)?
Hi Shanu, if I understand correctly, with user_id in place, the feature vector will be a concatenation of user_features + item_features. Like you said, the user_features of a user will be constant, but the item_features will change for each item. So if there are 5 user_features and 7 item_features, the total feature vector used in model training will be of size 12.