认为crf里的解码tag方法有误。我修改后的请看一下。 #703

iterate7 · 2017-12-04T08:42:01Z

/**
* 维特比后向算法标注
*
* @param table
*/
public void tag(Table table)
{
int size = table.size();
if (size == 0) return;
int tagSize = id2tag.length;
double[][] net = new double[size][tagSize];
for (int i = 0; i < size; ++i)
{
LinkedList<double[]> scoreList = computeScoreList(table, i); //一个char对应的状态BMSE的概率。double[]，特征个数！
for (int tag = 0; tag < tagSize; ++tag)//4个状态！
{
net[i][tag] = computeScore(scoreList, tag);
}
}

    if (size == 1)
    {
        double maxScore = -1e10;
        int bestTag = 0;
        for (int tag = 0; tag < net[0].length; ++tag)
        {
            if (net[0][tag] > maxScore)
            {
                maxScore = net[0][tag];
                bestTag = tag;
            }
        }
        table.setLast(0, id2tag[bestTag]);
        return;
    }

    //viterbi的核心步骤！
    //note: 改进了viterbi解码的算法。发现原来的net结构不够清晰。
    int[][] from = new int[size][tagSize];//path
    double pro[][] = new double[size][tagSize];
    for (int i = 1; i < size; ++i)
    {
        for (int now = 0; now < tagSize; ++now)
        {
            double maxScore = -1e10;
            for (int pre = 0; pre < tagSize; ++pre)
            {
                double score = pro[i-1][pre] + matrix[pre][now] + net[i][now];//?net代替了发射概率，肯定是不准确的。
                if (score > maxScore)
                {
                    maxScore = score;
                    from[i][now] = pre;
                    pro[i][now] = maxScore; 
                }
            }
        }
    }
	   
	
 
	
  double maxScore = -1e10;
  int maxTag = 0;
  for (int tag = 0; tag < net[size - 1].length; ++tag)
  {
      if (pro[size - 1][tag] > maxScore)
      {
          maxScore = pro[size - 1][tag];
          maxTag = tag;
      }
  }
	table.setLast(size-1, id2tag[maxTag]);
	for (int i = size - 1; i >=1 ; --i)
	{
      table.setLast(i-1, id2tag[from[i][maxTag]]);
      maxTag = from[i][maxTag];
	 }
	
    // 反向回溯最佳路径

// double maxScore = -1e10;
// int maxTag = 0;
// for (int tag = 0; tag < net[size - 1].length; ++tag)
// {
// if (net[size - 1][tag] > maxScore)
// {
// maxScore = net[size - 1][tag];
// maxTag = tag;
// }
// }
//
// table.setLast(size - 1, id2tag[maxTag]);
// maxTag = from[size - 1][maxTag];
// for (int i = size - 2; i > 0; --i)
// {
// table.setLast(i, id2tag[maxTag]);
// maxTag = from[i][maxTag];
// }
// table.setLast(0, id2tag[maxTag]);
}

The text was updated successfully, but these errors were encountered:

YuTingLiu · 2017-12-05T15:24:51Z

我的问题是：比使用CRF++测试的得分低，不知道是不是算法的原因

iterate7 · 2017-12-06T00:13:03Z

我也遇到类似问题，所以才修改了crf的解码方法。目前测试的几个案例（和crf++分词不同的句子），用修改后的解码方法发现都一样了。所以才发出来一起看看。

hankcs · 2017-12-08T18:52:02Z

感谢指正，当时写viterbi的时候确实遗漏了记录最后两个时刻的16条路径的最大得分。现在已经用滚动数组实现，更省内存。欢迎测试，如果有句子的结果与你的或CRF++不一致，欢迎报告。

TylunasLi · 2018-01-04T15:50:22Z

我今天在SIGHAN2005 pku上测试了一下，修改之前的代码和CRF++0.58 viterbi解码一致，新代码F值反而降低了0.1%，检查原因是句子开头maxScoreAt[0] 值全部为0。所以这次修改后评分不升反降了？

hankcs · 2018-01-05T01:03:45Z

有意思，我也发现修改后有部分句子变差了。

maxScoreAt是个滚动数组，下标0不是句子开头。如果句子长度是偶数的话，反而代表结尾的分值。无论如何，都不应该为0的。此处有bug，你能上传一下模型和触发代码吗？
标准的viterbi的确是这么写的，跟CRF++不同之处有可能是我对S和E的特别约束导致的。麻烦试试这段有没有差别：

    public void tag(Table table)
    {
        int size = table.size();
        if (size == 1)
        {
            table.setLast(0, "S");
            return;
        }
        double[][] net = new double[size][4];
        for (int i = 0; i < size; ++i)
        {
            LinkedList<double[]> scoreList = computeScoreList(table, i);
            for (int tag = 0; tag < 4; ++tag)
            {
                net[i][tag] = computeScore(scoreList, tag);
            }
        }
        int[][] from = new int[size][4];
        double[][] maxScoreAt = new double[2][4]; // 滚动数组
        int curI = 0;
        for (int i = 1; i < size; ++i)
        {
            curI = i & 1;
            int preI = 1 - curI;
            for (int now = 0; now < 4; ++now)
            {
                double maxScore = -1e10;
                for (int pre = 0; pre < 4; ++pre)
                {
                    double score = maxScoreAt[preI][pre] + matrix[pre][now] + net[i][now];
                    if (score > maxScore)
                    {
                        maxScore = score;
                        from[i][now] = pre;
                        maxScoreAt[curI][now] = maxScore;
                    }
                }
                net[i][now] = maxScore;
            }
        }
        // 反向回溯最佳路径
        double maxScore = -1e10;
        int maxTag = 0;
        for (int tag = 0; tag < 4; ++tag)
        {
            if (maxScoreAt[curI][tag] > maxScore)
            {
                maxScore = maxScoreAt[curI][tag];
                maxTag = tag;
            }
        }
        table.setLast(size - 1, id2tag[maxTag]);
        maxTag = from[size - 1][maxTag];
        for (int i = size - 2; i > 0; --i)
        {
            table.setLast(i, id2tag[maxTag]);
            maxTag = from[i][maxTag];
        }
        table.setLast(0, id2tag[maxTag]);
    }

TylunasLi · 2018-01-05T15:48:27Z

我的意思是对前两个标签预测的时候，没有带上net[0]的score，导致效果变差了，这样改了一下：

        double[][] maxScoreAt = new double[2][4]; // 滚动数组
        System.arraycopy(net[0],0,maxScoreAt[0],0,4); // 初始preI=0,  maxScoreAt[preI][pre] = net[0][pre]

hankcs · 2018-01-05T16:40:34Z

太好了，发现了个新bug。麻烦提个PR如何？

hankcs added a commit that referenced this issue Dec 8, 2017

改正CRF的Tag方法：#703

8e338c6

hankcs added the bug label Dec 8, 2017

hankcs closed this as completed Dec 8, 2017

TylunasLi pushed a commit to TylunasLi/HanLP that referenced this issue Dec 30, 2017

改正CRF的Tag方法：hankcs#703

bea909c

hankcs added a commit that referenced this issue Jan 13, 2018

改正CRF的Tag方法：#703 (comment)

968f1e1

hankcs added a commit that referenced this issue Jan 13, 2018

按照 #703 (comment) 的思路修改CRFModel的tag方法，以解决：#745

ce02d03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

认为crf里的解码tag方法有误。我修改后的请看一下。 #703

认为crf里的解码tag方法有误。我修改后的请看一下。 #703

iterate7 commented Dec 4, 2017

YuTingLiu commented Dec 5, 2017

iterate7 commented Dec 6, 2017

hankcs commented Dec 8, 2017

TylunasLi commented Jan 4, 2018

hankcs commented Jan 5, 2018

TylunasLi commented Jan 5, 2018 •

edited

Loading

hankcs commented Jan 5, 2018

认为crf里的解码tag方法有误。我修改后的请看一下。 #703

认为crf里的解码tag方法有误。我修改后的请看一下。 #703

Comments

iterate7 commented Dec 4, 2017

YuTingLiu commented Dec 5, 2017

iterate7 commented Dec 6, 2017

hankcs commented Dec 8, 2017

TylunasLi commented Jan 4, 2018

hankcs commented Jan 5, 2018

TylunasLi commented Jan 5, 2018 • edited Loading

hankcs commented Jan 5, 2018

TylunasLi commented Jan 5, 2018 •

edited

Loading