What is Stanford.NLP.Segmenter?
Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.
The Stanford Word Segmenter currently supports Chinese. The provided segmentation schemes have been found to work well for a variety of applications.
The Stanford NLP Group recommends at least 1 GB of memory for documents that contain long sentences.
The segmenter is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation and a Java API. The segmenter code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available.
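As a rough sketch of the command-line component, the downloaded package ships a wrapper script (`segment.sh` on Unix-like systems) around the Java invocation. The exact arguments and the sample file name below are assumptions based on the standard Stanford Segmenter distribution, not guaranteed by this page:

```shell
# Hedged sketch: run the Chinese segmenter from the unpacked download directory.
# Arguments (assumed): segmentation scheme (ctb = Chinese Treebank,
# pku = Peking University), input file, character encoding, n-best size.
./segment.sh ctb test.simp.utf8 UTF-8 0 > segmented.txt

# The memory recommendation above maps to the Java heap flag inside the
# script; for documents with long sentences you would raise it, e.g.:
#   java -mx1g ...   ->   java -mx2g ...
```

The choice between the `ctb` and `pku` schemes depends on which segmentation convention your downstream application expects; both models are included in the distribution.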