c17ddac
ResourceHelper can now read input files as a Spark Dataset, which implicitly enables HDFS paths and allows larger annotator input files. Set 'TXTDS' as the input format Param to have annotators read files this way. Supported in: Lemmatizer, EntityExtractor, RegexMatcher, Sentiment Analysis models, Spell Checker and Dependency Parser.
Enhancements and progress
#64
EntityExtractor refactored. This annotator takes an input file containing a list of entities and looks for them in the target text. It has been refactored to be easier to use and, in particular, faster, by using a Trie search algorithm. Examples are included in the Python notebooks.
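To illustrate the idea behind a Trie search over a list of entities, here is a minimal Python sketch. It is not the EntityExtractor implementation: the names `TrieNode`, `build_trie` and `find_entities` are hypothetical, and it assumes whitespace tokenization, case-insensitive matching and a longest-match policy, none of which are stated in these notes.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node per token; is_entity marks the end of a complete entity phrase."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_entity = False


def build_trie(entities: List[str]) -> TrieNode:
    """Insert every entity phrase into the trie, one token per level."""
    root = TrieNode()
    for entity in entities:
        tokens = entity.lower().split()
        if not tokens:
            continue  # skip blank lines in the entity list
        node = root
        for token in tokens:
            node = node.children.setdefault(token, TrieNode())
        node.is_entity = True
    return root


def find_entities(tokens: List[str], root: TrieNode) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for the longest entity found at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node = root
        longest_end = None
        j = i
        # Walk down the trie for as long as the next token continues a known phrase.
        while j < len(tokens) and tokens[j].lower() in node.children:
            node = node.children[tokens[j].lower()]
            j += 1
            if node.is_entity:
                longest_end = j
        if longest_end is not None:
            matches.append((i, longest_end, " ".join(tokens[i:longest_end])))
            i = longest_end  # resume after the matched phrase
        else:
            i += 1
    return matches


trie = build_trie(["New York", "New York City", "Paris"])
found = find_entities("I flew from Paris to New York City yesterday".split(), trie)
```

Because entities that share a prefix share a path in the trie, each text position is checked against the whole entity list in one walk, instead of comparing it with every entity in turn.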
4920e5c
CRF NER benchmarking progress. CRF NER documentation and official release coming soon.
Bug fixes
Issue #41 <> d3b9086
Fixed default resources not being loaded properly when using the library through --spark-packages. Improved input reading from resources and folder resources, with a fallback to disk and better error handling.
0840585
Corrected param names in DocumentAssembler
Issue #58 <> 5a53395
Deleted a leftover deprecated function that was misleading.
c02591b
Added filtering to ensure that no empty sentences reach the unnormalized Vivekn Sentiment Analysis.