Simple CoreNLP

Package: Stanford.NLP.CoreNLP

This page is a direct translation of the original Simple CoreNLP page.

In addition to the fully-featured annotator pipeline interface to CoreNLP, Stanford provides a simple API for users who do not need much customization. It is intended for CoreNLP users who want "just use nlp" to work as quickly and easily as possible, and who do not need to control the details of how the algorithms behave.

An example usage is given below:

#r "IKVM.OpenJDK.Core.dll"
#r "IKVM.OpenJDK.Util.dll"
#r "stanford-corenlp-3.8.0.dll"

open System
open java.util
open edu.stanford.nlp.simple

// Path to the folder with models extracted from `stanford-corenlp-3.8.0-models.jar`
let jarRoot = (__SOURCE_DIRECTORY__)+ @"/../../../data/paket-files/nlp.stanford.edu/"
                                    + @"stanford-corenlp-full-2017-06-09/models/"
System.IO.Directory.SetCurrentDirectory(jarRoot)

// Custom properties for annotators
let props = Properties()
props.setProperty("ner.useSUTime","0") |> ignore

let sent : Sentence = new Sentence("Lucy is in the sky with diamonds.")
let nerTags : List = sent.nerTags(props)
let firstPOSTag : string = sent.posTag(0)
> 
val sent : Sentence = Lucy is in the sky with diamonds.
val nerTags : List = seq ["PERSON"; "O"; "O"; "O"; ...]
val firstPOSTag : string = "NNP"

The API is included in CoreNLP releases from 3.6.0 onwards. Visit the download page to download CoreNLP, and make sure to set the current directory to the folder that contains the models.

Note: If you use the Simple CoreNLP API, your current directory should always be set to the root folder of the unzipped models, because Simple CoreNLP loads models lazily. Read more about model loading.

Advantages and Disadvantages

This interface offers a number of advantages (and a few disadvantages – see below) over the default annotator pipeline:

  • Intuitive Syntax: Conceptually, documents and sentences are stored as objects, and have functions corresponding to the annotations you would like to retrieve from them.

  • Lazy Computation: Annotations are run only when requested. This allows you to "change your mind" later in a program and request new annotations.

  • No NullPointerExceptions: Lazy computation allows us to ensure that no function will ever return null. Items which may not exist are wrapped inside an Optional to clearly mark that they may be empty.

  • Fast, Robust Serialization: All objects are backed by protocol buffers, meaning that serialization and deserialization are both very easy and very fast. In addition to being easily readable from other languages, our experiments show this to be over an order of magnitude faster than the default Java serialization.

  • Maintains Thread Safety: Like the CoreNLP pipeline, this wrapper is thread-safe.

In exchange for these advantages, users should be aware of a few disadvantages:

  • Less Customizability: Although the ability to pass properties to annotators is supported, it is significantly more clunky than the annotation pipeline interface, and is generally discouraged.

  • Possible Nondeterminism: There is no guarantee that the same algorithm will be used to compute the requested function on each invocation. For example, if a dependency parse is requested, followed by a constituency parse, we will compute the dependency parse with the Neural Dependency Parser, and then use the Stanford Parser for the constituency parse. If, however, you request the constituency parse before the dependency parse, we will use the Stanford Parser for both.
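The order dependence described under Possible Nondeterminism can be observed with a sketch along these lines. It assumes the assemblies are referenced and the current directory points at the models folder, as in the first example on this page. The value names (depFirst, parseFirst, and so on) are illustrative only, and no particular outcome of the comparison is assumed.

```fsharp
open edu.stanford.nlp.simple

// Assumption: assemblies are referenced and the current directory is the
// models folder, as in the first example on this page.
let text = "Lucy is in the sky with diamonds."

// Order A: request the dependency parse first, then the constituency parse.
let depFirst = new Sentence(text)
let governorA = depFirst.governor(1)   // dependency information is computed here
let treeA = depFirst.parse()           // constituency parse is computed here

// Order B: request the constituency parse first, then the dependency parse.
let parseFirst = new Sentence(text)
let treeB = parseFirst.parse()
let governorB = parseFirst.governor(1)

// Print both results side by side; they may or may not agree.
printfn "Order A governor of token 1: %O" governorA
printfn "Order B governor of token 1: %O" governorB
printfn "Same constituency tree: %b" (treeA.ToString() = treeB.ToString())
```

If results must be reproducible, it is probably safest to request annotations in a fixed order throughout a program.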

Usage

There are two main classes in the interface: Document and Sentence. Tokens are represented as elements of per-sentence lists; e.g., to get the lemma of a token, get the list of lemmas from the sentence and index into it at the token's position. A constructor is provided for both the Document and Sentence classes. For the former, the text is treated as an entire document that may contain multiple sentences. For the latter, the text is forced to be interpreted as a single sentence.

An example program using the interface is given below:

open System
open edu.stanford.nlp.simple

// Create a document. No computation is done yet.
let doc : Document = new Document("add your text here! It can contain multiple sentences.");
let sentences = doc.sentences().toArray()
for sentObj in sentences do  // Will iterate over two sentences
    let sent : Sentence = sentObj :?> Sentence
    // We're only asking for words -- no need to load any models yet
    Console.WriteLine("The second word of the sentence '{0}' is {1}", sent, sent.word(1));
    // When we ask for the lemma, it will load and run the part of speech tagger
    Console.WriteLine("The third lemma of the sentence '{0}' is {1}", sent, sent.lemma(2));
    // When we ask for the parse, it will load and run the parser
    Console.WriteLine("The parse of the sentence '{0}' is {1}", sent, sent.parse());
> 
The second word of the sentence 'add your text here!' is your
The third lemma of the sentence 'add your text here!' is text
The parse of the sentence 'add your text here!' is (ROOT (S (VP (VB add) (NP (PRP$ your) (NN text)) (ADVP (RB here))) (. !)))
The second word of the sentence 'It can contain multiple sentences.' is can
The third lemma of the sentence 'It can contain multiple sentences.' is contain
The parse of the sentence 'It can contain multiple sentences.' is (ROOT (S (NP (PRP It)) (VP (MD can) (VP (VB contain) (NP (JJ multiple) (NNS sentences)))) (. .)))
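The difference between the two constructors, and the convention that tokens are elements of per-sentence lists, can be sketched as follows. This assumes the same setup as the first example (referenced assemblies, current directory set to the models folder). The value names are illustrative only.

```fsharp
open edu.stanford.nlp.simple

// Assumption: same setup as the first example on this page.
let text = "add your text here! It can contain multiple sentences."

// Document: the text is split into sentences
// (two for this input, as in the example output above).
let asDocument = new Document(text)
printfn "Sentences in document: %d" (asDocument.sentences().size())

// Sentence: the same text is forced to be read as a single sentence.
let asSentence = new Sentence(text)

// Tokens are list elements: fetch the whole list of lemmas and index into it...
let lemmas = asSentence.lemmas()
let thirdLemmaFromList = lemmas.get(2) :?> string

// ...or use the per-index accessor for the same position.
let thirdLemmaDirect = asSentence.lemma(2)

printfn "%s / %s" thirdLemmaFromList thirdLemmaDirect
```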

Supported Annotators

The interface is not guaranteed to support all of the annotators in the CoreNLP pipeline, but most common annotators are supported. The table below lists them and how to invoke them. Functionality is a plain-English description of the task to be performed. Annotator in CoreNLP is the analogous CoreNLP annotator for that task. Implementation class and Function give the class and function used in this wrapper to perform the same task.

Functionality               | Annotator in CoreNLP | Implementation class | Function
Tokenization                | tokenize             | Sentence             | .words() / .word(int)
Sentence Splitting          | ssplit               | Document             | .sentences() / .sentence(int)
Part of Speech Tagging      | pos                  | Sentence             | .posTags() / .posTag(int)
Lemmatization               | lemma                | Sentence             | .lemmas() / .lemma(int)
Named Entity Recognition    | ner                  | Sentence             | .nerTags() / .nerTag(int)
Constituency Parsing        | parse                | Sentence             | .parse()
Dependency Parsing          | depparse             | Sentence             | .governor(int) / .incomingDependencyLabel(int)
Coreference Resolution      | dcoref               | Document             | .coref()
Natural Logic Polarity      | natlog               | Sentence             | .natlogPolarities() / .natlogPolarity(int)
Open Information Extraction | openie               | Sentence             | .openie() / .openieTriples()
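As a rough illustration of how several of these functions combine, the sketch below prints one line per token with its word, part-of-speech tag, lemma, NER tag, governor, and incoming dependency label. It assumes the same setup as the first example. The value names are illustrative, and the %O format specifier is used for the values whose exact return type is not shown on this page, so the code does not depend on whether they are wrapped in an Optional.

```fsharp
open edu.stanford.nlp.simple

// Assumption: same setup as the first example on this page.
let tagged = new Sentence("Lucy is in the sky with diamonds.")
let tokenCount = tagged.words().size()

for i in 0 .. tokenCount - 1 do
    // word, posTag and lemma return strings; the remaining values are
    // formatted with %O so any object (including an Optional) prints.
    printfn "%d\t%s\t%s\t%s\t%O\t%O\t%O"
        i
        (tagged.word(i))
        (tagged.posTag(i))
        (tagged.lemma(i))
        (tagged.nerTag(i))
        (tagged.governor(i))
        (tagged.incomingDependencyLabel(i))
```

Because annotations are computed lazily, the first iteration is likely to be noticeably slower than the rest while the required models load.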

Miscellaneous Extras

Some potentially useful utility functions are implemented in the SentenceAlgorithms class. These can be called from a Sentence object with, e.g.:

open edu.stanford.nlp.ie.machinereading.structure

let sent2 : Sentence = new Sentence("your text should go here");
sent2.algorithms().headOfSpan(new Span(0, 2));  // Should return 1
> 
val sent2 : Sentence = your text should go here
val it : int = 1

Some useful algorithms are:

  • headOfSpan(Span): Finds the index of the head word of the given span. For example, for a span covering United States president Barack Obama, it would return the index of Obama.

  • dependencyPathBetween(int, int): Returns the dependency path between the words at the two given indices. The path is returned as a list of strings, meant primarily as input to a featurizer.
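A call to dependencyPathBetween might look like the following minimal sketch, which assumes the same setup as the first example. The indices 0 and 3 are arbitrary choices for illustration, and the exact contents of the returned path are not assumed here.

```fsharp
open edu.stanford.nlp.simple

// Assumption: same setup as the first example on this page.
let pathSentence = new Sentence("your text should go here")

// Path between the first token ("your") and the fourth token ("go").
let path = pathSentence.algorithms().dependencyPathBetween(0, 3)

// The result is a java.util.List; print each step of the path in order.
for step in path.toArray() do
    printfn "%O" step
```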
