MLR (mixed logistic regression) is a piece-wise linear model that is widely used in advertising CTR estimation. MLR adopts a divide-and-conquer strategy: it first divides the feature space into multiple local regions, then fits a linear model in each region; the output is the weighted sum of the per-region linear models. The two steps are learned jointly with the single goal of minimizing the loss function. For details, see the Large Scale Piece-wise Linear Model (LS-PLM) paper. The MLR algorithm has three distinct advantages:
- Nonlinearity: with enough partitions, the MLR algorithm can fit arbitrarily complex nonlinear functions.
- Scalability: like the LR algorithm, the MLR algorithm scales well to massive samples and ultra-high-dimensional models.
- Sparsity: with L1 and L2,1 regularization terms, the MLR algorithm can achieve good sparsity.
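LS-PLM pairs an element-wise L1 term with a group-wise L2,1 term so that whole features can be zeroed out across all regions at once. A minimal NumPy sketch of the two norms, assuming parameters are stored as an m*N matrix with one feature per column (an illustrative layout, not Angel's internal one):

```python
import numpy as np

def l1_norm(W):
    # L1 norm: sum of absolute values; pushes individual weights to zero
    return np.abs(W).sum()

def l21_norm(W):
    # L2,1 norm: sum of the L2 norms of the feature columns; pushes an
    # entire column to zero, dropping that feature from all m regions at once
    return np.linalg.norm(W, axis=0).sum()

# toy 2x3 parameter matrix: feature 1 (middle column) is already all-zero
W = np.array([[0.5, 0.0, -1.0],
              [3.0, 0.0,  4.0]])
print(l1_norm(W))   # 8.5
print(l21_norm(W))  # sqrt(9.25) + 0 + sqrt(17) ~= 7.16
```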
For a given sample x, the prediction model p(y=1|x) has two parts. The first part divides the feature space into m regions, and the second part gives the predicted value for each region:

$$
p(y=1|x) = g\left(\sum_{j=1}^{m} \sigma(u_j^T x)\,\eta(w_j^T x)\right)
$$

Note: \(\sigma(u_j^T x)\) is the partition function, with parameters \(u_j\); \(\eta(w_j^T x)\) is the fitting function, with parameters \(w_j\); and the function \(g(\cdot)\) ensures that the model satisfies the definition of a probability function.

The MLR algorithm uses softmax as the partition function, the sigmoid function as the fitting function, and \(g(z) = z\), which gives the model of MLR as follows:

$$
p(y=1|x) = \sum_{j=1}^{m} \frac{\exp(u_j^T x)}{\sum_{i=1}^{m}\exp(u_i^T x)} \cdot \frac{1}{1+\exp(-w_j^T x)}
$$
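The model above can be checked with a short NumPy sketch of the MLR prediction for a single sample; the names U, W, and mlr_predict are illustrative, not part of Angel's API:

```python
import numpy as np

def mlr_predict(x, U, W):
    """p(y=1|x) for one sample. U: m x N softmax (partition) parameters,
    W: m x N sigmoid (fitting) parameters."""
    z = U @ x
    z -= z.max()                          # shift for numerical stability
    pi = np.exp(z) / np.exp(z).sum()      # softmax region weights, sum to 1
    eta = 1.0 / (1.0 + np.exp(-(W @ x)))  # per-region sigmoid prediction
    return float(pi @ eta)                # weighted sum, lies in (0, 1)

rng = np.random.default_rng(0)
m, N = 4, 6
U, W = rng.normal(size=(m, N)), rng.normal(size=(m, N))
x = rng.normal(size=N)
p = mlr_predict(x, U, W)
assert 0.0 < p < 1.0
```

With m = 1 the softmax weight is 1 and the prediction reduces to plain logistic regression, which is a quick sanity check on the implementation.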
The schematic diagram of the MLR model is as follows:
This model can be understood from two perspectives:
- The MLR can be regarded as a three-layer neural network with gates. There are m sigmoid neurons in the hidden layer, and the output of each neuron passes through a gate; the softmax output value acts as the switch of that gate.
- The MLR can be regarded as an ensemble model composed of m simple sigmoid models, with the softmax output values as the combination coefficients.
In many cases, a sub-model needs to be built on part of the data, and prediction is then made by combining multiple models. MLR uses softmax to divide the data (a soft division) and predicts with a single unified model. Another advantage of MLR is feature combination: some features are active for the sigmoid part while others are active for the softmax part, and multiplying the two is equivalent to performing feature combination at a lower level.
Note: since each sigmoid output lies between 0 and 1, and the softmax outputs lie between 0 and 1 and are normalized (they sum to 1), the combined value also lies between 0 and 1 (the maximum value 1 is attained only when every sigmoid outputs 1; in all other cases the combination is strictly below 1), so it can be regarded as a probability.
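This boundedness argument is easy to verify numerically; a tiny sketch with hand-picked softmax and sigmoid outputs:

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])   # softmax outputs: nonnegative, sum to 1
eta = np.array([0.9, 0.4, 0.7])  # sigmoid outputs: each in (0, 1)
p = float(pi @ eta)              # combined prediction
assert 0.0 <= p <= 1.0
assert np.isclose(float(pi @ np.ones(3)), 1.0)  # all sigmoids = 1 -> p = 1
```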
For the sample (x, y), the cross-entropy loss function is:

$$
L = -\log p(y|x)
$$

Note: cross entropy usually manifests itself as \(-[y\log\hat{p} + (1-y)\log(1-\hat{p})]\), where \(\hat{p} = p(y=1|x)\) is the probability of y = 1 given x. If \(p(y|x)\) instead represents the probability of the observed label y given x (i.e., not only the probability of y = 1), cross entropy takes the compact form above.

Writing \(\sigma_j(x)\) for the softmax weight of region j, \(\eta_j(x)\) for its sigmoid value, and \(p = \sum_{j}\sigma_j(x)\,\eta_j(x)\), the derivative for a single sample (with \(y \in \{0,1\}\)) is:

Gradient:

$$
\frac{\partial L}{\partial u_j} = \frac{p-y}{p(1-p)}\,\sigma_j(x)\big(\eta_j(x)-p\big)\,x,
\qquad
\frac{\partial L}{\partial w_j} = \frac{p-y}{p(1-p)}\,\sigma_j(x)\,\eta_j(x)\big(1-\eta_j(x)\big)\,x
$$
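The per-sample gradients can be validated against finite differences. The sketch below assumes the convention y in {0, 1} with loss L = -(y log p + (1-y) log(1-p)); the analytic formulas are derived from that convention rather than taken from Angel's source:

```python
import numpy as np

def forward(U, W, x):
    z = U @ x
    z -= z.max()
    pi = np.exp(z) / np.exp(z).sum()      # softmax region weights
    eta = 1.0 / (1.0 + np.exp(-(W @ x)))  # per-region sigmoids
    return pi, eta, float(pi @ eta)

def loss(U, W, x, y):
    _, _, p = forward(U, W, x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grads(U, W, x, y):
    # analytic per-sample gradients for u_j (rows of U) and w_j (rows of W)
    pi, eta, p = forward(U, W, x)
    dldp = (p - y) / (p * (1.0 - p))
    gU = dldp * (pi * (eta - p))[:, None] * x[None, :]
    gW = dldp * (pi * eta * (1.0 - eta))[:, None] * x[None, :]
    return gU, gW

rng = np.random.default_rng(1)
m, N = 3, 5
U, W = rng.normal(size=(m, N)), rng.normal(size=(m, N))
x, y = rng.normal(size=N), 1.0
gU, gW = grads(U, W, x, y)

# finite-difference check of one entry in each gradient matrix
eps = 1e-6
U2 = U.copy(); U2[0, 0] += eps
num_u = (loss(U2, W, x, y) - loss(U, W, x, y)) / eps
W2 = W.copy(); W2[1, 2] += eps
num_w = (loss(U, W2, x, y) - loss(U, W, x, y)) / eps
assert abs(num_u - gU[0, 0]) < 1e-4 and abs(num_w - gW[1, 2]) < 1e-4
```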
Model Storage:
- The model parameters of the MLR algorithm are the softmax function parameters u1, ..., um and the sigmoid function parameters w1, ..., wm, where each uj and wj is an N-dimensional vector and N is the dimension of the data, that is, the number of features. Two m*N matrices are used to represent the softmax matrix and the sigmoid matrix, respectively.
- The truncated (intercept) values of the softmax function and the sigmoid function are represented by two m*1 matrices.
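The storage layout above amounts to two m*N weight matrices plus two m*1 intercept columns; a sketch with illustrative sizes:

```python
import numpy as np

m, N = 8, 10000                     # regions and feature dimension (illustrative)
softmax_matrix = np.zeros((m, N))   # row j holds u_j
sigmoid_matrix = np.zeros((m, N))   # row j holds w_j
softmax_bias = np.zeros((m, 1))     # truncated (intercept) values for softmax
sigmoid_bias = np.zeros((m, 1))     # truncated (intercept) values for sigmoid

total_params = (softmax_matrix.size + sigmoid_matrix.size
                + softmax_bias.size + sigmoid_bias.size)
print(total_params)  # 2*m*N + 2*m = 160016
```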
Model Calculation:
- The MLR model is trained by gradient descent, and the algorithm proceeds iteratively. At the beginning of each iteration, each worker pulls the latest model parameters from the PS (parameter server), computes gradients with its own training data, and pushes the gradients to the PS.
- The PS receives the gradient values pushed by all workers, averages them, and updates the PSModel.
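The pull/compute/push/average loop can be sketched in a few lines; ps_update is a hypothetical stand-in for what the PS does each iteration, not Angel's actual API:

```python
import numpy as np

def ps_update(model, worker_grads, lr):
    # PS side: average the gradients pushed by all workers and apply
    # one gradient-descent step to the shared model
    avg = np.mean(worker_grads, axis=0)
    return model - lr * avg

model = np.array([1.0, -2.0])
worker_grads = [np.array([0.2, 0.0]),   # gradients pushed by 3 workers
                np.array([0.0, 0.4]),
                np.array([0.4, 0.2])]
new_model = ps_update(model, worker_grads, lr=0.5)
# average gradient is [0.2, 0.2], so the model becomes [0.9, -2.1]
print(new_model)
```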
The data format is set by the "ml.data.type" parameter, and the number of data features, that is, the dimension of the feature vector, is set by the "ml.feature.num" parameter.
MLR on Angel supports "libsvm" and "dummy" data formats as follows:
- dummy format
Each line of text represents a sample in the format "y index1 index2 index3 ...", where index is the feature ID. For training data, y is the category of the sample and takes the values 1 and -1; for prediction data, y is the sample's ID value. For example, a positive sample [2.0, 3.1, 0.0, 0.0, -1, 2.2] is expressed as "1 0 1 4 5", where "1" is the category and "0 1 4 5" means that dimensions 0, 1, 4, and 5 of the feature vector are nonzero. Similarly, a negative sample [2.0, 0.0, 0.1, 0.0, 0.0, 0.0] is represented as "-1 0 2".
- libsvm format
Each line of text represents a sample in the format "y index1:value1 index2:value2 index3:value3 ...", where index is the feature ID and value is the corresponding feature value. For training data, y is the category of the sample and takes the values 1 and -1; for prediction data, y is the sample's ID value. For example, a positive sample [2.0, 3.1, 0.0, 0.0, -1, 2.2] is expressed as "1 0:2.0 1:3.1 4:-1 5:2.2", where "1" is the category and "0:2.0" means the value of feature 0 is 2.0. Similarly, a negative sample [2.0, 0.0, 0.1, 0.0, 0.0, 0.0] is represented as "-1 0:2.0 2:0.1".
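Both line formats are simple to parse; a sketch of readers for the two formats (helper names are illustrative — Angel parses these internally):

```python
import numpy as np

def parse_dummy(line, dim):
    # "y i1 i2 ..." -> label and a binary feature vector:
    # the listed indices are the nonzero dimensions
    parts = line.split()
    x = np.zeros(dim)
    for tok in parts[1:]:
        x[int(tok)] = 1.0
    return float(parts[0]), x

def parse_libsvm(line, dim):
    # "y i1:v1 i2:v2 ..." -> label and a real-valued feature vector
    parts = line.split()
    x = np.zeros(dim)
    for tok in parts[1:]:
        i, v = tok.split(":")
        x[int(i)] = float(v)
    return float(parts[0]), x

y1, x1 = parse_libsvm("1 0:2.0 1:3.1 4:-1 5:2.2", 6)
y2, x2 = parse_dummy("-1 0 2", 6)
assert y1 == 1.0 and x1[4] == -1.0
assert y2 == -1.0 and x2[0] == 1.0 and x2[1] == 0.0
```

Note that the dummy format only records which dimensions are nonzero, so the original magnitudes are not recoverable from it.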
Several steps must be done before editing the submit script and running it.
- confirm that Hadoop and Spark are ready in your environment
- unzip sona-&lt;version&gt;-bin.zip to a local directory (SONA_HOME)
- upload the sona-&lt;version&gt;-bin directory to HDFS (SONA_HDFS_HOME)
- Edit $SONA_HOME/bin/spark-on-angel-env.sh, set SPARK_HOME, SONA_HOME, SONA_HDFS_HOME and ANGEL_VERSION
Here's an example submit script; remember to adjust the parameters and fill in the paths according to your own task.
#test description
actionType=train # or "predict"
jsonFile=path-to-jsons/mixedlr.json
modelPath=path-to-save-model
predictPath=path-to-save-predict-results
input=path-to-data
queue=your-queue
HADOOP_HOME=my-hadoop-home
source ./bin/spark-on-angel-env.sh
export HADOOP_HOME=$HADOOP_HOME
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.instances=10 \
--conf spark.ps.cores=2 \
--conf spark.ps.memory=10g \
--jars $SONA_SPARK_JARS \
--files $jsonFile \
--driver-memory 20g \
--num-executors 20 \
--executor-cores 5 \
--executor-memory 30g \
--queue $queue \
--class org.apache.spark.angel.examples.JsonRunnerExamples \
./lib/angelml-$SONA_VERSION.jar \
jsonFile:./mixedlr.json \
dataFormat:libsvm \
data:$input \
modelPath:$modelPath \
predictPath:$predictPath \
actionType:$actionType \
numBatch:500 \
maxIter:2 \
lr:4.0 \
numField:39