This is a tokenizer for processing software-specific natural language text from Stack Overflow. Such text is both social, e.g., people write ungrammatical and informal language on Stack Overflow much as they do in tweets, and domain-specific, e.g., printf() should be recognized as a single token rather than the three tokens printf, '(' and ')'.
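As a rough illustration of the idea (not the tokenizer's actual implementation, which uses a much richer set of Twitter-tokenizer-derived patterns), a minimal regex sketch can keep software-specific tokens such as function calls intact while splitting ordinary words and punctuation:

```python
import re

# Hypothetical sketch: keep code-like tokens (e.g. "printf()", dotted names)
# as single tokens; split everything else into words and punctuation.
TOKEN_RE = re.compile(r"""
      \w+\(\)          # function call, e.g. printf()
    | \w+(?:\.\w+)+    # dotted name, e.g. java.util.List
    | \w+              # plain word
    | [^\w\s]          # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Split software-specific natural language text into tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("u can call printf() here"))
# ['u', 'can', 'call', 'printf()', 'here']
```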
This tokenizer is a modified version of a Twitter tokenizer. Acknowledgement goes to Brendan O'Connor, Kevin Gimpel and Daniel Mills, the authors of the original Twitter tokenizer.