Releases: polm/fugashi
v1.3.0: M1 Wheels! Finally!
This release addresses one of the longest-standing issues, #55. Many thanks to @nikitalita for figuring out how to cross-compile MeCab for wheels.
There are no other changes.
v1.2.1: Python 3.11 Support
This release adds wheels for Python 3.11, with no other changes.
v1.2.0: Add nbestToNodeList, drop Python 3.6 and earlier
This release of fugashi adds one new feature: Tagger.nbestToNodeList returns the top N possible tokenizations of a string as node lists. Many thanks to @teowenshen for the implementation (#61).
This release also drops support for Python 3.6 and earlier versions. While the current source should still work with 3.5 and 3.6, wheels are not provided, and it is recommended you upgrade your Python version to one that has not reached end-of-life status. If you must use an older version, you can continue using v1.1.2.
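As a rough illustration of the new method, the sketch below collects the surface forms for each candidate tokenization. It assumes fugashi 1.2.0 or later with a dictionary such as unidic-lite installed, and that the method takes the text followed by the number of candidates. top_tokenizations is a hypothetical helper name, not part of fugashi.

```python
def top_tokenizations(text, n=3):
    """Return up to n candidate tokenizations, each as a list of surfaces."""
    # Imported lazily so this sketch can be loaded without fugashi installed.
    from fugashi import Tagger

    tagger = Tagger()  # assumes a dictionary (e.g. unidic-lite) is available
    # Assumed call shape: nbestToNodeList(text, num) -> list of node lists
    candidates = tagger.nbestToNodeList(text, n)
    return [[node.surface for node in nodes] for nodes in candidates]
```

Calling top_tokenizations("外国人参政権", 3) should give up to three lists of surfaces, one per candidate tokenization.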
v1.1.2: Python 3.10 Support, Cleaner Builds
This release adds long overdue wheels for Python 3.10. There are no changes in functionality or API.
On the backend, in addition to fixing issues with the 3.10 version number and quoting, the build process was cleaned up considerably. Many thanks to @lambdadog for the bugfixes and cleanup!
This release does not include wheels for M1 Macs - those may be working, but I've been unable to confirm it. See #55 for details or to help out.
v1.1.1: Bug Fixes and API Cleanup
This release has a number of stability and API improvements.
- fugashi-build-dict didn't work in its initial release; that has been fixed.
- Calls to parseToNode no longer invalidate old node surfaces (#38)
- Initialization errors now throw an Exception rather than printing output directly (https://github.com/explosion/spaCy/releases/tag/v3.0.7)
Note that the fix to #38 has a number of side effects that may need more extensive evaluation. In particular:
- memory use will grow very slowly over the life of a Tagger object
- execution speed will be a bit slower, up to around 10%
It's expected that these will both be addressed before long; despite the issues, the current fix has been deemed suitable for a release because in the vast majority of use cases it will behave more correctly than the previous release.
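Since initialization errors are now raised rather than printed, calling code can catch them. The following is a minimal sketch, assuming fugashi is installed; try_make_tagger is a hypothetical helper, and it catches the broad Exception class because the notes above do not name a more specific type.

```python
def try_make_tagger(args=""):
    """Return a GenericTagger, or None if MeCab could not be initialized."""
    # Imported lazily so this sketch can be loaded without fugashi installed.
    from fugashi import GenericTagger

    try:
        return GenericTagger(args)
    except Exception as err:  # the release notes only promise "an Exception"
        print(f"Could not initialize tagger: {err}")
        return None
```

A caller could then fall back to another dictionary path or report the problem, instead of parsing printed output.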
Experimental Support for Dictionary Building Added
One feature fugashi hasn't had until now is the ability to build user dictionaries. This feature can be important for improving tokenization quality in many applications. This release adds fugashi-build-dict, a wrapper for MeCab's mecab-dict-index command. You can use it like this:
fugashi-build-dict -d [system-dic-dir] -u mydic.dic input.csv
If you're familiar with MeCab's user dictionary creation process, nothing has changed; any feedback on use or reports of errors you encounter would be appreciated. If you're not familiar with the dictionary process, just wait a bit - a guide should be released soon.
fugashi v1.0.0
fugashi v1.0 has arrived. 🎊
This release does not include any major changes to the code. The main purpose of this release is to make it clear that the API has reached a point where it can remain stable moving forward. While there will surely be more patches to clean things up or add minor features, I don't have any major changes planned.
This release does include one small change: previously, __repr__ marked UNKs. This behavior is useful in some situations, but it's easier to add it to generic behavior than take it out, so I removed it. Now you can (mostly) reconstruct the input with ''.join([str(nn) for nn in nodes]).
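To make the reconstruction concrete, here is a small sketch that assumes fugashi and a dictionary are installed. reconstruct is a hypothetical helper name. The "mostly" is presumably because whitespace between tokens is not part of any node's surface, so it would not be restored.

```python
def reconstruct(text):
    """Tokenize text and join the node strings back into a single string."""
    # Imported lazily; requires fugashi plus a dictionary at call time.
    from fugashi import Tagger

    tagger = Tagger()
    nodes = tagger.parseToNodeList(text)
    # str(node) is the bare surface now that UNKs are no longer marked
    return "".join([str(nn) for nn in nodes])
```

For input without spaces, the result should match the original string.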
Thanks for using fugashi, and if there's anything you'd like to see in it please feel free to open an issue.
Command line scripts and callable Taggers
This isn't a drastic release, but since I've been dragging out the patch numbers it seemed like a good time to bump the minor version. This is v0.2.0! 🎉
The first feature in this release is the addition of command line scripts. Since it's possible to install fugashi without MeCab, you might not have a command-line binary. This fixes that so you can use fugashi as a replacement for mecab. There's also the fugashi-info script, which is similar to mecab -D in that it prints dictionary information. I hope it will be useful when dealing with bugs and installation issues.
The other feature is that Tagger instances are now callable. One of the best features of fugashi is it makes it much easier to work with MeCab nodes, but the function associated with that - parseToNodeList - had an unfortunately long name. I didn't want to call it parse since that already has meaning in MeCab, but giving it a different name felt odd... so I realized the easiest thing is to make the Tagger instance itself callable. Here's an example of the change this makes possible:
from fugashi import Tagger

tagger = Tagger()
text = "麩菓子は、麩を主材料とした日本の菓子。"  # any Japanese string

# before
for word in tagger.parseToNodeList(text):
    print(word.surface)

# after
for word in tagger(text):
    print(word.surface)
Feels better, doesn't it? I imagine this will be particularly helpful for compact expressions like list comprehensions. And parseToNodeList is still there, so existing code can be used unmodified.
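For readers curious how a callable instance works in Python, here is a toy sketch. ToyTagger is a hypothetical stand-in that only splits on whitespace; it is not fugashi's actual implementation, just an illustration of __call__ delegating to the longer method name and of the list-comprehension style this enables.

```python
class ToyTagger:
    """Hypothetical stand-in for a tagger; splits on whitespace only."""

    def parseToNodeList(self, text):
        # A real tagger would return MeCab nodes; plain strings suffice here.
        return text.split()

    def __call__(self, text):
        # Calling the instance delegates to the long-named method.
        return self.parseToNodeList(text)


tagger = ToyTagger()
lengths = [len(word) for word in tagger("a toy example")]
print(lengths)  # [1, 3, 7]
```

Because the long-named method stays in place, both call styles return the same result.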
Lately I've been working more on optimizing SudachiPy than fugashi, but there are still ease-of-use improvements to be made here, and if it works here it can be useful in other tokenizers too. If there's anything you'd like to see let me know.
Bundled UniDic Support
This release adds support for installing UniDic from PyPI, whether the easy-to-install unidic-lite or the full-fledged unidic package. Special thanks to @chezou for helping with testing on Windows, which had quoting issues due to backslashes in paths.
This release greatly simplifies installing and using fugashi. Assuming no major issues are found, the next release should be 1.0.0.
OSX Build Bugfix Release
This release includes a fix for builds on OSX. See #16 for details; thanks to @HiromuHota for the report and help with the fix.