Can't make Stanford POS tagger work in NLTK


A lot has changed since this question was asked. Here is my working code, after I hit the same error: basically, increasing the Java heap size solved it.

import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path

from nltk.tag.stanford import StanfordPOSTagger

path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger = StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options = '-mx4096m'  # set a higher memory limit for long sentences

sentence = 'This is testing'
print(tagger.tag(sentence.split()))

The best thing to do is simply to download the latest version of the Stanford POS tagger where the dependency problem is now fixed (March 2018).

wget https://nlp.stanford.edu/software/stanford-postagger-full-2017-06-09.zip

Suggestion : 2

I'm trying to work with the Stanford POS tagger within NLTK, using the example shown here: http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford

I'm able to load everything smoothly:

>>> import os
>>> from nltk.tag import StanfordPOSTagger
>>> os.environ['STANFORD_MODELS'] = '/path/to/stanford/folder/models'
>>> st = StanfordPOSTagger('english-bidirectional-distsim.tagger', path_to_jar='/path/to/stanford/folder/stanford-postagger.jar')

but at the first execution:

>>> st.tag('What is the airspeed of an unladen swallow ?'.split())

it gives me the following error:

Loading default properties from tagger /path/to/stanford/folder/models/english-bidirectional-distsim.tagger
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
    at edu.stanford.nlp.io.IOUtils.<clinit>(IOUtils.java:41)
    at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:146)
    at edu.stanford.nlp.tagger.maxent.TaggerConfig.<init>(TaggerConfig.java:128)
    at edu.stanford.nlp.tagger.maxent.MaxentTagger.main(MaxentTagger.java:1836)
Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 4 more

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/miguelwon/anaconda/lib/python2.7/site-packages/nltk/tag/stanford.py", line 66, in tag
    return sum(self.tag_sents([tokens]), [])
  File "/Users/miguelwon/anaconda/lib/python2.7/site-packages/nltk/tag/stanford.py", line 89, in tag_sents
    stdout=PIPE, stderr=PIPE)
  File "/Users/miguelwon/anaconda/lib/python2.7/site-packages/nltk/internals.py", line 134, in java
    raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed : [u'/usr/bin/java', '-mx1000m', '-cp', '/path/to/stanford/folder/stanford-postagger-full-2015-12-09/stanford-postagger.jar', 'edu.stanford.nlp.tagger.maxent.MaxentTagger', '-model', '/Users/miguelwon/Documents/Kaggel/RTE/stanford-postagger-full-2015-12-09/models/english-bidirectional-distsim.tagger', '-textFile', '/var/folders/vb/dy__dnps7qz35slpmfkc25g40000gn/T/tmpwieb0M', '-tokenize', 'false', '-outputFormatOptions', 'keepEmptySentences', '-encoding', 'utf8']
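The NoClassDefFoundError means the 2015-12-09 tagger jar expects the slf4j logging classes on the classpath but cannot find them. One commonly suggested workaround is to add an slf4j jar alongside the tagger jar via the CLASSPATH environment variable before constructing the tagger. This is a sketch with placeholder paths; whether NLTK forwards CLASSPATH to its java call depends on the NLTK version, so the more reliable fix remains downloading the newer tagger release, where the dependency problem is resolved.

```python
import os

# Placeholder paths -- point these at your actual files. Newer tagger
# releases bundle an slf4j jar in their lib/ folder.
os.environ['CLASSPATH'] = os.pathsep.join([
    '/path/to/stanford-postagger.jar',
    '/path/to/slf4j-api.jar',
])
```

With CLASSPATH set, NLTK's jar discovery (e.g. find_jar) can also locate the tagger jar without an explicit path_to_jar argument.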


Suggestion : 3

You can insert one or more tagger models into the jar file and give options to load a model from there. Start in the home directory of the unpacked tagger download, then make a copy of the jar into which we'll insert a tagger model: cp stanford-postagger.jar stanford-postagger-withModel.jar. You can also specify input files in a few different formats; this is part of the trainFile property. To learn more about the formats you can use and what the other options mean, look at the javadoc for MaxentTagger.
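Since a jar is just a zip archive, the copy-and-insert step can also be done from Python with the standard library (equivalent to cp followed by jar -uf). The sketch below creates stand-in files in a temp directory so it runs end to end; with a real download, point src_jar and model at your actual stanford-postagger.jar and .tagger file and drop the stand-in setup.

```python
import os
import shutil
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
src_jar = os.path.join(workdir, 'stanford-postagger.jar')
model = os.path.join(workdir, 'left3words-wsj-0-18.tagger')

# Stand-in jar and model so the sketch is self-contained; unnecessary
# when working with a real tagger download.
with zipfile.ZipFile(src_jar, 'w') as jar:
    jar.writestr('META-INF/MANIFEST.MF', 'Manifest-Version: 1.0\n')
with open(model, 'wb') as f:
    f.write(b'dummy model bytes')

# 1. Copy the jar; 2. append the model under models/ inside the copy.
dst_jar = os.path.join(workdir, 'stanford-postagger-withModel.jar')
shutil.copy(src_jar, dst_jar)
with zipfile.ZipFile(dst_jar, 'a') as jar:
    jar.write(model, arcname='models/left3words-wsj-0-18.tagger')
```

The tagger can then be told to load the model from the jar's own classpath rather than from disk.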

$ cat > short.txt
This is a short sentence.
So is this.
$ java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat slashTags 2> /dev/null
This_DT is_VBZ a_DT short_JJ sentence_NN ._.
So_RB is_VBZ this_DT ._.
$ java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat slashTags -tagSeparator \# 2> /dev/null
This#DT is#VBZ a#DT short#JJ sentence#NN .#.
So#RB is#VBZ this#DT .#.
$ java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat tsv 2> /dev/null
This DT
is VBZ
a DT
short JJ
sentence NN
. .

So RB
is VBZ
this DT
. .

$ java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat xml 2> /dev/null
<sentence id="0">
   <word wid="0" pos="DT">This</word>
   <word wid="1" pos="VBZ">is</word>
   <word wid="2" pos="DT">a</word>
   <word wid="3" pos="JJ">short</word>
   <word wid="4" pos="NN">sentence</word>
   <word wid="5" pos=".">.</word>
</sentence>
<sentence id="1">
   <word wid="0" pos="RB">So</word>
   <word wid="1" pos="VBZ">is</word>
   <word wid="2" pos="DT">this</word>
   <word wid="3" pos=".">.</word>
</sentence>
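The slashTags output shown above can be parsed back into (word, tag) pairs in plain Python. A small sketch (not part of the tagger distribution): splitting from the right handles tokens like ._. where the word itself contains no separator but the tag does.

```python
def parse_slash_tags(line, sep='_'):
    """Split a slashTags-style line into (word, tag) tuples.

    rsplit from the right so a token like '._.' (word '.', tag '.')
    still splits on the last separator.
    """
    return [tuple(tok.rsplit(sep, 1)) for tok in line.split()]

print(parse_slash_tags('This_DT is_VBZ a_DT short_JJ sentence_NN ._.'))
# -> [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('short', 'JJ'), ('sentence', 'NN'), ('.', '.')]
```

Pass sep='#' to parse the output produced with -tagSeparator \#.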
Exception in thread "main" java.lang.NoSuchFieldError: featureFactoryArgs
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.<clinit>(AbstractSequenceClassifier.java:127)
    at edu.stanford.nlp.ie.crf.CRFClassifier.<clinit>(CRFClassifier.java:173)
Exception in thread "main" java.lang.NoSuchMethodError: edu.stanford.nlp.tagger.maxent.TaggerConfig.getTaggerDataInputStream(Ljava/lang/String;)Ljava/io/DataInputStream;
Caused by: java.lang.NoSuchMethodError: edu.stanford.nlp.util.Generics.newHashMap()Ljava/util/Map;
    at edu.stanford.nlp.pipeline.AnnotatorPool.<clinit>(AnnotatorPool.java:27)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.getDefaultAnnotatorPool(StanfordCoreNLP.java:305)
edu.stanford.nlp.io.RuntimeIOException:
    english-left3words-distsim.tagger.dict (The system cannot find the file specified)
    english-left3words-distsim.tagger.tags (The system cannot find the file specified)
    english-left3words-distsim.tagger.ex (The system cannot find the file specified)

Suggestion : 4

The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories, and the collection of tags used for a particular task is known as a tagset. Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations including: predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems. Part-of-speech tagging is also an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.

>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man day time year car moment world family house boy child country job
state girl place war way case question
>>> text.similar('bought')
made done put said found had seen given left heard been brought got
set was called felt in that told
>>> text.similar('over')
in on to of and for with from at by that into as up out down through
about all is
>>> text.similar('the')
a his this their its her an that our any all one these my in your no
some other and
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
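For intuition, str2tuple's observable behavior can be mimicked with a few lines of plain Python: split the token on the last separator, uppercase the tag, and return (token, None) when no separator is present. This is a stand-in for illustration, not NLTK's actual source.

```python
def str2tuple_py(s, sep='/'):
    # 'fly/NN' -> ('fly', 'NN'); a token with no separator -> (token, None).
    # rsplit keeps any earlier separators inside the word; the tag is
    # uppercased, mirroring NLTK's behavior on lowercase tags like 'np-tl'.
    if sep in s:
        word, tag = s.rsplit(sep, 1)
        return (word, tag.upper())
    return (s, None)

print(str2tuple_py('fly/NN'))
# -> ('fly', 'NN')
```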
>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]