Facebook Gradient Boosted Decision Trees (GBDT): separating positive and negative labeled points with a single line

https://www.quora.com/Why-do-people-use-gradient-boosted-decision-trees-to-do-feature-transform

Why is linearity/non-linearity important?
Most of our classification models try to find a single line that separates the two sets of points. I say "a line" because that makes sense in a two-dimensional space; in higher-dimensional spaces the separator is called a hyperplane. For the moment, we will work in 2D spaces since they are easy to visualize, so we can think of the classifiers as trying to find lines in this 2D space that separate the points. Since the set of points above cannot be separated by a single line, most of our classifiers will not work on this dataset.

How to solve?
There are two ways to solve:

  1. One way is to explicitly use non-linear classifiers. Non-linear classifiers do not try to separate the set of points with a single line; they use either a non-linear separator or a set of linear separators (forming a piecewise non-linear separator).
  2. Another way is to transform the input space in such a way that the non-linearity is eliminated and the points become linearly separable.

Second Method
Let us try to find a transformation for the given set of points such that the non-linearity is removed. If we look carefully at the set of points given to us, we note that the label is negative exactly when one of the dimensions is negative; if both dimensions are positive or both are negative, the label is positive. Therefore, let x = (x1, x2) be a 2D point and let f be a function that transforms this 2D point into another 2D point as follows: f(x) = (x1, x1∗x2). Let us see what happens to the set of points given to us when we pass them through this transformation:

【Dimension lifting: a non-linear transform makes the classes linearly separable】

Oh, look at that! Now the two sets of points can be separated by a line! This tells us that all our linear classification models will now work in this transformed space.
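To make this concrete, here is a minimal sketch (my own illustration, not from the original answer): an XOR-like toy dataset where the label is positive exactly when the two coordinates share a sign, a linear classifier that fails on the raw points, and the same classifier succeeding after the transform f(x) = (x1, x1∗x2). The synthetic data and the use of scikit-learn's LogisticRegression are assumptions made just for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR-like toy data: label is positive iff x1 and x2 have the same sign
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# a single line (linear classifier) cannot separate the raw points
linear = LogisticRegression().fit(X, y)
print("accuracy in the original space:   ", linear.score(X, y))      # close to 0.5

# apply f(x) = (x1, x1*x2): the sign of the second coordinate now matches the label
X_t = np.column_stack([X[:, 0], X[:, 0] * X[:, 1]])
linear_t = LogisticRegression().fit(X_t, y)
print("accuracy in the transformed space:", linear_t.score(X_t, y))  # close to 1.0
```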

Is it easy to find such transformations?
No. While we were lucky to find a transformation for the above case, in general it is not so easy to find such a transformation. In the above case I mapped the points from 2D to 2D, but one can also map these 2D points to a higher-dimensional space. Cover's theorem states that if you map the points into a sufficiently high-dimensional space, then with high probability the non-linearity is erased and the points can be separated by a hyperplane in that high-dimensional space.
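As a small follow-up sketch in the same spirit (again my own illustration, reusing X and y from the previous snippet): instead of hand-crafting f, one can lift the 2D points into a larger polynomial feature space, which happens to contain the x1∗x2 term, and let a linear classifier find a separating hyperplane there.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# degree-2 lift of (x1, x2) -> (1, x1, x2, x1^2, x1*x2, x2^2); the x1*x2 term
# is exactly the feature that made the toy dataset linearly separable above
model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)                                        # X, y from the previous snippet
print("accuracy after the polynomial lift:", model.score(X, y))      # close to 1.0
```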

Evaluating boosted decision trees for billions of users

【Facebook click prediction: predicting the probability of clicking a notification】

We trained a boosted decision tree model for predicting the probability of clicking a notification using 256 trees, where each of the trees contains 32 leaves. Next, we compared the CPU usage for feature vector evaluations, where each batch was ranking 1,000 candidates on average. The batch size value N was tuned to be optimal based on the machine L1/L2 cache sizes. We saw the following performance improvements over the flat tree implementation:

  • Plain compiled model: 2x improvement.
  • Compiled model with annotations: 2.5x improvement.
  • Compiled model with annotations, ranges and common/categorical features: 5x improvement.

The performance improvements were similar for different algorithm parameters (128 or 512 trees, 16 or 64 leaves).
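For reference, here is a rough sketch of what a "flat tree" evaluation baseline can look like: each tree stored as parallel arrays and walked node by node per feature vector, with the model score summed over all trees. This is my own illustration of the general structure, not Facebook's actual implementation, and the toy tree is made up for the example.

```python
import numpy as np

def eval_flat_tree(tree, x):
    """Walk one flat tree (parallel arrays) for feature vector x; return its leaf value."""
    node = 0
    while tree["feature"][node] >= 0:                    # -1 marks a leaf node
        if x[tree["feature"][node]] < tree["threshold"][node]:
            node = tree["left"][node]
        else:
            node = tree["right"][node]
    return tree["value"][node]

def eval_model(trees, x):
    """Model score is the sum over all boosted trees (e.g. 256 trees, 32 leaves each)."""
    return sum(eval_flat_tree(t, x) for t in trees)

# toy example: a single depth-1 tree splitting on feature 0 at threshold 0.5
toy_tree = {
    "feature":   np.array([0, -1, -1]),
    "threshold": np.array([0.5, 0.0, 0.0]),
    "left":      np.array([1, -1, -1]),
    "right":     np.array([2, -1, -1]),
    "value":     np.array([0.0, -1.0, 1.0]),
}
print(eval_model([toy_tree], np.array([0.7])))           # -> 1.0
```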

Facebook

Facebook's paper gives empirical results which show that stacking a logistic regression (LR) on top of gradient boosted decision trees (GBDT) beats just directly using the GBDT on their dataset. Let me try to provide some intuition on why that might be happening.

First, let's assume that GBDT has learnt the weight of tree T to be W_T. Let's also assume that the output of a single tree's leaf l (say, determined by averaging all training observations in that leaf) is P_l. Also, let l(x) be an indicator variable which is 1 iff observation x falls into the region identified by leaf l. We can then write:

GBDT(x) = ∑_T W_T ∗ ( ∑_{l∈T} P_l ∗ l(x) )

If we denote by T(l) the tree to which leaf l belongs, we can rewrite it as:

GBDT(x) = ∑_l W_{T(l)} ∗ P_l ∗ l(x)

If we use W_l to denote W_{T(l)} ∗ P_l, we can rewrite the previous equation as:

GBDT(x) = ∑_l W_l ∗ l(x)

Also, by construction, an LR stacked on GBDT has the following hypothesis:

Stacked(x) = ∑_l C_l ∗ l(x)

So both of these compute a linear weighted sum over binary features that indicate which leaves the input falls into. So why is LR + GBDT better than just using GBDT alone?

【Stacked LR + GBDT outperforms GBDT alone】

LR is a convex optimization problem, so it can actually choose the weight C_l of each feature optimally. In GBDT, on the other hand, the weights W_l aren't guaranteed to be optimal. In fact, since W_l = W_{T(l)} ∗ P_l, where W_{T(l)} is chosen greedily (e.g. using line search) and P_l is chosen based only on the local neighborhood, it's quite conceivable that the W_l aren't globally optimal after all. So overall, stacked LR + GBDT will never perform worse than directly using GBDT (the LR could always fall back to the GBDT's own weights), and it can actually outperform it on certain datasets.
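A minimal sketch of the stacking itself, under assumed choices (scikit-learn's GradientBoostingClassifier, a synthetic dataset, and one-hot encoded leaf indices standing in for the indicator features l(x)); this illustrates the construction, not the paper's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# batch-trained GBDT; each leaf becomes one binary feature l(x)
gbdt = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=32, random_state=0)
gbdt.fit(X_train, y_train)

# apply() returns, per example, the index of the leaf it reaches in every tree
enc = OneHotEncoder(handle_unknown="ignore")
leaves_train = gbdt.apply(X_train).reshape(len(X_train), -1)
leaves_test = gbdt.apply(X_test).reshape(len(X_test), -1)

# LR re-learns the per-leaf weights C_l via a convex fit
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves_train), y_train)

print("GBDT alone:       ", gbdt.score(X_test, y_test))
print("LR on GBDT leaves:", lr.score(enc.transform(leaves_test), y_test))
```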

But there is another really important advantage of using the stacked model.

Facebook's paper also shows that data freshness is a very important factor for them and that prediction accuracy degrades as the delay between the training and test sets increases. It is unrealistic/costly to do online training of GBDT, whereas online LR is much easier to do. By stacking an online LR on top of a periodically recomputed batch GBDT, they retain much of the benefit of online training in a pretty cheap way.
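Continuing the previous sketch (and again only as an assumed illustration, not the paper's system): the batch-trained GBDT can be kept fixed as a feature transformer while only the linear layer is updated online, here with SGDClassifier.partial_fit on logistic loss standing in for online LR.

```python
from sklearn.linear_model import SGDClassifier

# logistic regression fit incrementally by SGD; gbdt, enc, X_test, y_test come
# from the previous snippet, and the mini-batch streaming is only simulated
online_lr = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

for start in range(0, len(X_test), 256):
    Xb, yb = X_test[start:start + 256], y_test[start:start + 256]
    leaves_b = gbdt.apply(Xb).reshape(len(Xb), -1)       # fixed GBDT as feature transform
    online_lr.partial_fit(enc.transform(leaves_b), yb, classes=classes)  # cheap update
```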

Overall, it's a pretty clever idea that works really well in the real world.
