Stanford.NLP.NET


Stanford Log-linear Part-Of-Speech Tagger for .NET

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token). Computational applications generally use more fine-grained POS tags, such as 'noun-plural'.

Several downloads are available. The basic download contains two trained tagger models for English. The full download contains three trained English tagger models, an Arabic tagger model, a Chinese tagger model, and a German tagger model. Both versions include the same source and other required files. The tagger can be retrained on any language, given POS-annotated training text for the language.

Part-of-speech name abbreviations: The English taggers use the Penn Treebank tag set. Here are some links to documentation of the Penn Treebank English POS tag set: 1993 Computational Linguistics article in PDF, AMALGAM page, Aoife Cahill's list. See the included README-Models.txt in the models directory for more information about the tagsets for the other languages.
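As a quick orientation for reading tagger output, the sketch below glosses a handful of Penn Treebank tags. It is a partial list for illustration only, not part of the Stanford.NLP.NET API; the names pennTags and describeTag are hypothetical, and the linked documentation remains the reference for the full tag set.

```fsharp
// A few Penn Treebank tags with short glosses (partial list, for illustration only).
let pennTags =
    dict [
        ("DT",  "determiner")
        ("NN",  "noun, singular or mass")
        ("NNS", "noun, plural")
        ("NNP", "proper noun, singular")
        ("VBZ", "verb, 3rd person singular present")
        ("VBP", "verb, non-3rd person singular present")
        ("JJ",  "adjective")
        ("RB",  "adverb")
        ("IN",  "preposition or subordinating conjunction")
        ("CC",  "coordinating conjunction")
    ]

// Look up a tag, falling back to a placeholder for tags missing from the partial list.
let describeTag (tag: string) =
    match pennTags.TryGetValue(tag) with
    | true, gloss -> gloss
    | _ -> "(not in this partial list)"

printfn "%s" (describeTag "NNS")
```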

The tagger is licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation, running as a server, and a Java API. The tagger code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. If you don't need a commercial license but would like to support maintenance of these tools, the Stanford NLP Group welcomes gift funding.

The Stanford POS Tagger library can be installed from NuGet:
PM> Install-Package Stanford.NLP.POSTagger

F# Sample of POS Tagging

#r "IKVM.OpenJDK.Core.dll"
#r "IKVM.OpenJDK.Util.dll"
#r "stanford-postagger-3.7.0.dll"

open java.io
open java.util
open edu.stanford.nlp.ling
open edu.stanford.nlp.tagger.maxent

// Path to the folder with models
let modelsDirectory =
    __SOURCE_DIRECTORY__ + @"\..\..\data\paket-files\nlp.stanford.edu\stanford-postagger-full-2016-10-31\models"

// Loading POS Tagger
let tagger = MaxentTagger(modelsDirectory + @"\wsj-0-18-bidirectional-nodistsim.tagger")

// Split the text into sentences, tag each one, and print it as word/TAG tokens
let tagTextFromReader (reader:Reader) =
    let sentences = MaxentTagger.tokenizeText(reader).toArray()

    sentences |> Seq.iter (fun sentence ->
        let taggedSentence = tagger.tagSentence(sentence :?> ArrayList)
        printfn "%O" (SentenceUtils.listToString(taggedSentence, false))
    )

// Text for tagging
let text = """A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language
and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although
generally computational applications use more fine-grained POS tags like 'noun-plural'."""

tagTextFromReader <| new StringReader(text)
Output:
A/DT Part-Of-Speech/NNP Tagger/NNP -LRB-/-LRB- POS/NNP Tagger/NNP -RRB-/-RRB- is/VBZ a/DT piece/NN of/IN 
software/NN that/WDT reads/VBZ text/NN in/IN some/DT language/NN and/CC assigns/VBZ parts/NNS of/IN 
speech/NN to/TO each/DT word/NN -LRB-/-LRB- and/CC other/JJ token/JJ -RRB-/-RRB- ,/, such/JJ as/IN 
noun/JJ ,/, verb/JJ ,/, adjective/JJ ,/, etc./FW ,/, although/IN generally/RB computational/JJ 
applications/NNS use/VBP more/RBR fine-grained/JJ POS/NNP tags/NNS like/IN `/`` noun-plural/JJ '/'' ./.
val it : unit = ()
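
Each token in the output above has the form word/TAG. The following is a minimal sketch, not part of the Stanford.NLP.NET API, of how such a line could be split back into (word, tag) pairs for further processing. The helper names splitTagged and parseTaggedLine are hypothetical, and the sketch assumes that the last slash in each token separates the word from its tag.

```fsharp
// Split one "word/TAG" token at its LAST slash, so a word that itself
// contains '/' would still keep its tag intact (an assumption of this sketch).
let splitTagged (token: string) =
    match token.LastIndexOf('/') with
    | -1 -> (token, "")                                   // no tag found
    | i -> (token.Substring(0, i), token.Substring(i + 1))

// Split a whole tagged line on spaces and parse every token.
let parseTaggedLine (line: string) =
    line.Split([| ' ' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.map splitTagged

// Usage: a short excerpt in the same format as the tagger output.
parseTaggedLine "A/DT piece/NN of/IN software/NN ./."
|> Array.iter (fun (word, tag) -> printfn "%s -> %s" word tag)
```

In the sample above, the string returned by SentenceUtils.listToString could be passed to parseTaggedLine instead of being printed directly.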

C# Sample of POS Tagging

using java.io;
using java.util;
using edu.stanford.nlp.ling;
using edu.stanford.nlp.tagger.maxent;
using Console = System.Console;

namespace Stanford.NLP.POSTagger.CSharp
{
    class Program
    {
        static void Main()
        {
            var jarRoot = @"..\..\..\..\data\paket-files\nlp.stanford.edu\stanford-postagger-full-2016-10-31";
            var modelsDirectory = jarRoot + @"\models";

            // Loading POS Tagger
            var tagger = new MaxentTagger(modelsDirectory + @"\wsj-0-18-bidirectional-nodistsim.tagger");

            // Text for tagging
            var text = "A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text "
                       + "in some language and assigns parts of speech to each word (and other token),"
                       + " such as noun, verb, adjective, etc., although generally computational "
                       + "applications use more fine-grained POS tags like 'noun-plural'.";

            var sentences = MaxentTagger.tokenizeText(new StringReader(text)).toArray();
            foreach (ArrayList sentence in sentences)
            {
                var taggedSentence = tagger.tagSentence(sentence);
                Console.WriteLine(SentenceUtils.listToString(taggedSentence, false));
            }
        }
    }
}

Read more about the Stanford POS Tagger on the official page.
