Stopword dictionary loading problem #530

Closed
cicido opened this issue May 12, 2017 · 4 comments

Comments

cicido commented May 12, 2017

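I am currently using HanLP 1.3.0. While reading CoreStopWordDictionary.java, I found the following dictionary-loading statement:
dictionary = new StopWordDictionary(new File(HanLP.Config.CoreStopWordDictionaryPath));

The core dictionary, the user-defined custom dictionary, and the others are all loaded in the following way. Taking the core dictionary (CoreDictionary.java) as an example:
br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8"));
That is, they all go through the unified IOUtil interface.
StopWordDictionary, however, uses File directly, which is inconsistent. Would you consider making CoreStopWordDictionary consistent with the rest?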
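I ask because I have defined my own JarIOAdapter.java:
import com.hankcs.hanlp.corpus.io.IIOAdapter;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class JarIOAdapter implements IIOAdapter
{
    @Override
    public InputStream open(String path) throws FileNotFoundException
    {
        /*
         Loading the resource the first way fails in a distributed environment,
         so the second way is used instead.
         */
        //return ClassLoader.getSystemClassLoader().getResourceAsStream(path);
        return JarIOAdapter.class.getClassLoader().getResourceAsStream(path);
    }

    @Override
    public OutputStream create(String path) throws FileNotFoundException
    {
        return new FileOutputStream(path);
    }
}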
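The goal here is to separate the code from the dictionary data by packaging hanlp.properties and the data directory into a jar of their own. However, because CoreStopWordDictionary.java does not read files through the unified interface, the stopword dictionary file cannot be found.
Does the author intend to split the code and the dictionary data into two jars? I have nearly finished this on my side and can submit the code.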
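
For readers following along, here is a minimal sketch of a classpath-based opener. It is not HanLP code: the class name ClasspathOpener is hypothetical, and it deliberately does not implement IIOAdapter so that it compiles on its own. It illustrates one detail worth keeping in mind with the adapter above: ClassLoader.getResourceAsStream returns null when the resource is missing instead of throwing, so a null check makes the failure visible at the point of lookup.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical helper, not part of HanLP.
class ClasspathOpener {
    /**
     * Opens a resource by its '/'-separated classpath name, for example
     * "data/dictionary/stopwords.txt" (no leading slash).
     */
    static InputStream open(String path) throws FileNotFoundException {
        // Use the loader that loaded this class rather than the system class
        // loader, mirroring the choice made in the post above.
        InputStream in = ClasspathOpener.class.getClassLoader().getResourceAsStream(path);
        if (in == null) {
            // getResourceAsStream signals "not found" with null, not an exception.
            throw new FileNotFoundException("Resource not found on classpath: " + path);
        }
        return in;
    }
}
```

With a check like this, a dictionary that cannot be located fails with a clear message instead of a later NullPointerException inside the reader.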

Author

cicido commented May 12, 2017

After analyzing the code, I found that the real problem is in MDAG.java:
public MDAG(File dataFile) throws IOException
{
    BufferedReader dataFileBufferedReader = new BufferedReader(new InputStreamReader(IOAdapter == null ?
            new FileInputStream(dataFile) :
            //IOAdapter.open(dataFile.getAbsolutePath())
            IOAdapter.open(dataFile.getPath())
            , "UTF-8"));
    // ... (rest of the constructor omitted in the original post)
}

Changing the original IOAdapter.open(dataFile.getAbsolutePath()) to IOAdapter.open(dataFile.getPath()) is enough to fix it.
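
To illustrate why this one-line change matters for a jar-based adapter, here is a small standalone sketch. The class ResourcePathDemo and the helper toResourceName are hypothetical, and the path is only an example. File.getPath() returns the path as it was given, while File.getAbsolutePath() resolves it against the working directory, and the resulting file-system path is not a name a class loader can look up inside a jar. The helper also swaps the platform separator for '/', on the assumption that the path may have been built on Windows.

```java
import java.io.File;

// Hypothetical demo class, not part of HanLP.
class ResourcePathDemo {
    /** Converts a relative File into a '/'-separated name usable for class loader lookups. */
    static String toResourceName(File file) {
        // File stores paths with the platform separator, so normalize it.
        return file.getPath().replace(File.separatorChar, '/');
    }

    public static void main(String[] args) {
        File dataFile = new File("data/dictionary/stopwords.txt"); // example path only

        // Relative path, as configured.
        System.out.println(dataFile.getPath());
        // Resolved against the current working directory; machine-specific.
        System.out.println(dataFile.getAbsolutePath());
        // Name suitable for ClassLoader.getResourceAsStream.
        System.out.println(toResourceName(dataFile));
    }
}
```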

Owner

hankcs commented May 14, 2017

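Thanks for the suggestion.

  • The version you are using is too old; the latest version is already correct: https://github.com/hankcs/HanLP/blob/master/src/main/java/com/hankcs/hanlp/collection/MDAG/MDAG.java#L171
  • Splitting into two jars is a good proposal, but it needs more thought. The purpose of the portable version is to let newcomers get started quickly and Maven users deploy quickly. Experienced users, however, usually customize the configuration file to get the behavior they want, and since the configuration file differs from person to person, it is not a good fit for the Maven jar either. Switching to two jars could also cause trouble for newcomers, because the data and program versions can easily get out of sync and cause problems.
  • You could build it as a plugin; plenty of users do like to put the data inside a jar. I will actively support it, including recommending it on the wiki.
  • Any comments are welcome; feel free to continue the discussion.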

Author

cicido commented May 15, 2017

My version is actually 1.3.2; I wrote 1.3.0 above by mistake.
Also, what I posted above for MDAG.java was exactly this:
public MDAG(File dataFile) throws IOException
{
    BufferedReader dataFileBufferedReader = new BufferedReader(new InputStreamReader(IOAdapter == null ?
            new FileInputStream(dataFile) :
            //IOAdapter.open(dataFile.getAbsolutePath())
            IOAdapter.open(dataFile.getPath())
            , "UTF-8"));
    // ... (rest of the constructor omitted in the original post)
}
Earlier, in order to load the dictionary data from a jar, I changed IOAdapter.open(dataFile.getAbsolutePath()) to IOAdapter.open(dataFile.getPath()).
I have written up the whole process on oschina:
https://my.oschina.net/u/940663/blog/898790

Owner

hankcs commented May 16, 2017

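Thanks for the suggestion. Constructing MDAG from a File argument is indeed incompatible with InputStream-based loading. It has now been changed to read directly from the InputStream opened by the IOAdapter; you are welcome to test it.
If there are still problems, feel free to reopen the issue.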
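
As a rough illustration of the stream-based approach described here, the sketch below reads a word list from any InputStream, so the caller decides whether the bytes come from the file system, a jar, or somewhere else. It is not the actual HanLP change: the class StreamWordListReader and the method readWords are hypothetical names, and the one-word-per-line format is an assumption.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader, not part of HanLP.
class StreamWordListReader {
    /** Reads one word per line as UTF-8, skipping blank lines, and closes the stream. */
    static List<String> readWords(InputStream in) throws IOException {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for whatever an adapter would open.
        InputStream in = new ByteArrayInputStream(
                "of\nthe\n\nand\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readWords(in));
    }
}
```

Because the reader never sees a File, the same code path serves both a file-system adapter and a jar-based one.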
