Introduction
In pairwise ranking, the model learns the relative order of an item while considering the other items in the search results. For example, if a search query returns 10 results and only 2 of them receive a conversion event (click, reply, like, subscribe, comment, etc.), the model learns patterns that rank those two converting items above the rest of the results.
In this tutorial, we'll learn how to use XGBoost to build pairwise ranking models. Internally, XGBoost uses a combination of pairwise and pointwise loss to train ranking models.
Pointwise vs Pairwise Loss
To understand how pairwise loss is calculated, let's first look at a simpler loss function: pointwise loss. In a pointwise model, each sample is treated independently of the others. To measure pointwise loss, we use squared loss as shown:
\[f(y, s) = \sum_{i=1}^{n}(y_{i} - s_{i})^{2}\]
where \(y_i\) refers to the actual relevance label, \(s_i\) refers to the predicted score and \(n\) refers to the number of items in the dataset. In the pairwise case, instead of an absolute prediction, the loss function considers the relevance of a pair as shown:
\[f(y, s)=\sum_{y_{i}>y_{j}}\log_{2}\left( 1 + e^{-\Theta\left( s_{i}-s_{j} \right)} \right)\]
where \(y_i\), \(y_j\) refer to the actual relevance labels, \(s_i\), \(s_j\) refer to the predicted scores of the items, and \(\Theta\) is a hyper-parameter. As you may have already guessed:
- When \(y_i > y_j\) and \(s_i - s_j > 0\), the model incurs a smaller loss.
- When \(y_i > y_j\) and \(s_i - s_j < 0\), the model incurs a greater loss.
The above equation is nothing but the logistic loss with a slight modification: the pairwise score difference is used in place of the classic weighted equation ( \(b_0 + b_1x + ...\) ).
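To make the two cases above concrete, here is a minimal sketch (not from the original article) of the pairwise logistic loss for a pair where item i is more relevant than item j; the choice of theta=1.0 is an assumption made purely for illustration:

import numpy as np

def pairwise_logistic_loss(s_i, s_j, theta=1.0):
    """Loss for a pair where item i is more relevant than item j (y_i > y_j).

    `theta` is the shape hyper-parameter; theta=1.0 is an assumed default here.
    """
    return np.log2(1.0 + np.exp(-theta * (s_i - s_j)))

# Correctly ordered pair (s_i > s_j) -> smaller loss
print(pairwise_logistic_loss(s_i=2.0, s_j=0.5))   # ~0.29
# Incorrectly ordered pair (s_i < s_j) -> larger loss
print(pairwise_logistic_loss(s_i=0.5, s_j=2.0))   # ~2.45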
What are these pairs though? Let's understand them using an example.
Creating pairs with LambdaMART
LambdaMART is a tree-based version of LambdaRank (it builds an ensemble of weak trees). Basically, LambdaRank introduces a rank-based metric, \(\Delta\text{NDCG}\), into the loss function to generate an optimised ranking of items. It takes into account the difference in NDCG obtained by swapping items within a query.
With NDCG, the loss function becomes:
\[f(y, s)=\sum_{y_{i}>y_{j}}\log_{2}\left( 1 + e^{-\Theta\left( s_{i}-s_{j} \right)} \right)\cdot\left|\Delta\text{NDCG}_{ij}\right|\]
where \(\Delta\text{NDCG}_{ij}\) is calculated by swapping the items within a given query.
To explain how the algorithm works, I’m going to use a data set which looks as follows:
label | item_name | item_id | position | query_id | feature_0 | feature_1 | feature_2 |
0 | adidas shoes | 323 | 1 | 0 | 0.034 | 0.120 | 0.109 |
0 | adidas x2 | 876 | 2 | 0 | 0.107 | 0.001 | 0.612 |
1 | nike pegasus | 988 | 3 | 0 | 0.987 | 0.543 | 0.766 |
1 | protein shaker | 632 | 1 | 1 | 0.102 | 0.231 | 0.221 |
The algorithm works like this:
- For each query_id:
  - Calculate the ideal DCG (based on sorting the label column, with the highest relevance first and so on).
  - Swap each item in the query and calculate the change in DCG (i.e. the new DCG after swapping minus the ideal DCG), as shown in the sketch below. This change will be negative or positive depending on whether the swap leads to an improvement or not.
  - Calculate the average gain in DCG for each item after making all the swaps.
  - The average gain calculated is used as the label in the model.
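The following is a simplified sketch of the DCG quantities involved, not XGBoost's exact internal computation. It assumes the standard exponential gain and 1/log2(position + 1) discount, and uses the labels of query_id=0 from the example table:

import numpy as np

def dcg(labels):
    """DCG: gain (2^label - 1) discounted by log2(position + 1)."""
    positions = np.arange(1, len(labels) + 1)
    return np.sum((2 ** np.asarray(labels) - 1) / np.log2(positions + 1))

def delta_dcg_from_swap(labels, i, j):
    """Change in DCG when the items at positions i and j are swapped."""
    swapped = list(labels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return dcg(swapped) - dcg(labels)

# Labels for query_id=0 from the example table, in their current display order.
labels_q0 = [0, 0, 1]
print(dcg(sorted(labels_q0, reverse=True)))  # ideal DCG = 1.0 (relevant item first)
print(delta_dcg_from_swap(labels_q0, 0, 2))  # +0.5: moving the relevant item to the top improves DCG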
Fortunately, all this heavy lifting happens internally inside XGBoost. Let's train a model using the data shown above. I've installed xgboost using: pip install xgboost==2.0.1. The latest version also contains logic to handle position bias, i.e. a click on an item shown at a lower position is a stronger relevance signal than a click at a higher position, hence we should weight it higher.
import pandas as pd
import xgboost as xgb
from sklearn import metrics

features = ['feature_0', 'feature_1', 'feature_2']

ranker = xgb.XGBRanker(
    tree_method="hist",
    objective="rank:ndcg",
    lambdarank_unbiased=True,           # correct for position bias in the clicks
    lambdarank_num_pair_per_sample=10,  # number of pairs sampled per item
    eval_metric=["ndcg@1", "ndcg@8"],
    lambdarank_bias_norm=2,
    n_estimators=1000,
    early_stopping_rounds=30)           # configured on the estimator in xgboost 2.x

# Rows must be grouped by query: XGBoost expects the qid column to be sorted.
ranker.fit(
    traindf[features],
    traindf['label'],
    qid=traindf['query_id'],
    eval_set=[(traindf[features], traindf['label']),
              (testdf[features], testdf['label'])],
    eval_qid=[traindf['query_id'], testdf['query_id']],
    verbose=50)
testdf['pred'] = ranker.predict(testdf[features])
In the above code:
- traindf and testdf are two dataframes containing data as shown above.
- objective="rank:ndcg": since NDCG is not differentiable, XGBoost uses LambdaMART to directly optimise NDCG.
- lambdarank_unbiased: handles position bias.
- lambdarank_num_pair_per_sample: creates the pairs for each sample. For example, if a query has 3 items and we set lambdarank_num_pair_per_sample=2, then for that query we'll end up with 6 pairs. Tuning this parameter affects the model performance.
- qid: refers to the column which contains the integer encoded query_id.
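For completeness, here is a minimal sketch of how traindf could be assembled from the example table (testdf would be built the same way); the construction below is only an illustration, and the one requirement carried over from above is that rows stay sorted by query_id before calling fit:

import pandas as pd

# Hypothetical construction of traindf matching the example table above.
traindf = pd.DataFrame({
    'label':     [0, 0, 1, 1],
    'query_id':  [0, 0, 0, 1],  # integer encoded query id
    'feature_0': [0.034, 0.107, 0.987, 0.102],
    'feature_1': [0.120, 0.001, 0.543, 0.231],
    'feature_2': [0.109, 0.612, 0.766, 0.221],
})

# XGBoost expects query groups to be contiguous, so keep rows sorted by query_id.
traindf = traindf.sort_values('query_id').reset_index(drop=True)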
Once we have generated the predictions, we can evaluate the model using the NDCG@k metric. For our example dataset, we'll use k=3:
score = testdf.groupby('query_id').apply(
    lambda x: metrics.ndcg_score([x['label']], [x['pred']], k=3)
).mean()
print(score)
0.413
The NDCG metric is commonly used for ranking tasks because it rewards correct predictions near the top of the ranking more than those further down. For example, an item correctly ranked at position 1 will boost the NDCG score more than an item correctly predicted at rank 10, since the metric discounts the gain of an item by its position.
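To see the discounting in action, here is a small sketch (using the same sklearn metrics module imported earlier) comparing a single relevant item ranked first versus last out of 10 results; the scores are made up for illustration:

from sklearn import metrics

# One query with 10 items; only the first item is relevant (label 1).
true_relevance = [[1] + [0] * 9]

# Scores that put the relevant item at rank 1 vs. rank 10.
scores_top    = [[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]]
scores_bottom = [[1, 10, 9, 8, 7, 6, 5, 4, 3, 2]]

print(metrics.ndcg_score(true_relevance, scores_top))     # 1.0
print(metrics.ndcg_score(true_relevance, scores_bottom))  # ~0.29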
Summary
In this article, we discussed the basics of pairwise ranking and LambdaMART. XGBoost provides an out-of-the-box implementation of the LambdaMART algorithm while handling position bias. You can read more about XGBoost learning to rank in its official documentation.
Hi Manish! Thanks for the explanation, but I have a doubt. Let's say I have user-product purchase pair data where the user features are constant for each product bought. Since you are labelling each purchase as 1 or 0, how does the model learn from the same features with different labels for a single query_id (user_id in my case)?
Hi Shanu, if I understand correctly, with user_id in place, the feature vector will be a concatenation of user_features + item_features. Like you said, the user_features of a user will be constant, but the item_features will change for each item. So if there are 5 user_features and 7 item_features, the total feature vector used in model training will be of size 12.