Result difference in Stanford NER tagger: NLTK (Python) vs. Java

JAVA Result:

"ERwin": "PERSON"

Python Result:

In [6]: NERTagger.tag("Involved in all aspects of data modeling using ERwin as the primary software for this.".split())
Out[6]:
[(u'Involved', u'O'),
 (u'in', u'O'),
 (u'all', u'O'),
 (u'aspects', u'O'),
 (u'of', u'O'),
 (u'data', u'O'),
 (u'modeling', u'O'),
 (u'using', u'O'),
 (u'ERwin', u'O'),
 (u'as', u'O'),
 (u'the', u'O'),
 (u'primary', u'O'),
 (u'software', u'O'),
 (u'for', u'O'),
 (u'this.', u'O')]
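One thing the output above does show is that `.split()` leaves the final period attached to the last token (`this.`), whereas a tokenizer would normally split it off. Whether that accounts for the `ERwin` difference is not established here; the sketch below only illustrates how the two token streams differ. `simple_tokenize` is a hypothetical regex stand-in, not the Stanford tokenizer.

```python
import re

sentence = ("Involved in all aspects of data modeling using ERwin "
            "as the primary software for this.")


def simple_tokenize(text):
    # Hypothetical stand-in for a real tokenizer:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


whitespace_tokens = sentence.split()
regex_tokens = simple_tokenize(sentence)

print(whitespace_tokens[-1])   # this.
print(regex_tokens[-2:])       # ['this', '.']
print(len(whitespace_tokens), len(regex_tokens))  # 15 16
```

If the Java run and the Python run feed the classifier different tokens, the surrounding features differ too, so comparing both sides on identical tokens is a reasonable first check.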

I'm looking at StanfordNERTagger in nltk.tag to see if there's anything I can modify. Below is the wrapper code:

class StanfordNERTagger(StanfordTagger):
    """
    A class for Named-Entity Tagging with Stanford Tagger. The input is the paths to:

    - a model trained on training data
    - (optionally) the path to the stanford tagger jar file. If not specified here,
      then this jar file must be specified in the CLASSPATH environment variable.
    - (optionally) the encoding of the training data (default: UTF-8)

    Example:

        >>> from nltk.tag import StanfordNERTagger
        >>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz') # doctest: +SKIP
        >>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) # doctest: +SKIP
        [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
         ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
         ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
    """

    _SEPARATOR = '/'
    _JAR = 'stanford-ner.jar'
    _FORMAT = 'slashTags'

    def __init__(self, *args, **kwargs):
        super(StanfordNERTagger, self).__init__(*args, **kwargs)

    @property
    def _cmd(self):
        # Adding -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer
        # -tokenizerOptions tokenizeNLs=false for not using stanford Tokenizer
        return ['edu.stanford.nlp.ie.crf.CRFClassifier',
                '-loadClassifier', self._stanford_model, '-textFile',
                self._input_file_path, '-outputFormat', self._FORMAT,
                '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer',
                '-tokenizerOptions', '\"tokenizeNLs=false\"']

    def parse_output(self, text, sentences):
        if self._FORMAT == 'slashTags':
            # Join together into one big list
            tagged_sentences = []
            for tagged_sentence in text.strip().split("\n"):
                for tagged_word in tagged_sentence.strip().split():
                    word_tags = tagged_word.strip().split(self._SEPARATOR)
                    tagged_sentences.append((''.join(word_tags[:-1]), word_tags[-1]))

            # Separate it according to the input
            result = []
            start = 0
            for sent in sentences:
                result.append(tagged_sentences[start:start + len(sent)])
                start += len(sent)
            return result

        raise NotImplementedError
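To make the `slashTags` parsing easier to poke at without Java, here is a minimal standalone sketch of the same logic as `parse_output`. The function name `parse_slash_tags` and the sample `raw` string are made up for illustration; real classifier output may differ in detail.

```python
def parse_slash_tags(text, sentences, separator="/"):
    """Standalone copy of the slashTags branch of parse_output (illustrative)."""
    flat = []
    for line in text.strip().split("\n"):
        for tagged_word in line.strip().split():
            parts = tagged_word.split(separator)
            # Everything before the last separator is the word,
            # the last piece is the tag.
            flat.append(("".join(parts[:-1]), parts[-1]))

    # Regroup the flat list using the lengths of the input sentences.
    result, start = [], 0
    for sent in sentences:
        result.append(flat[start:start + len(sent)])
        start += len(sent)
    return result


# Hypothetical classifier output for two already-tokenized sentences.
sentences = [["Rami", "Eid", "is", "studying"], ["in", "NY"]]
raw = "Rami/PERSON Eid/PERSON is/O studying/O\nin/O NY/LOCATION"

print(parse_slash_tags(raw, sentences))
# [[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O')],
#  [('in', 'O'), ('NY', 'LOCATION')]]

# A token that itself contains the separator loses it when the pieces are re-joined.
print(parse_slash_tags("1/2/O", [["1/2"]]))
# [[('12', 'O')]]
```

Two things follow from the code as written: the regrouping step assumes the classifier returns exactly as many tokens as were passed in (which is presumably why `_cmd` forces `WhitespaceTokenizer`), and words containing `/` come back with the slash removed.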

Suggestion : 2

With both Stanford NER and spaCy, you can train your own custom models for Named Entity Recognition, using your own data.

We will use the Named Entity Recognition tagger from Stanford, along with NLTK, which provides a wrapper class for the Stanford NER tagger.

Named Entity Recognition, or NER, is a type of information extraction, widely used in Natural Language Processing (NLP), that aims to extract named entities from unstructured text.

Named entities can refer to people's names, brands, organization names, location names, and even things like monetary units, among others.

mkvirtualenv ner-analysis
pip install nltk

import nltk
from nltk.tag.stanford import StanfordNERTagger

PATH_TO_JAR = '/Users/christina/Projects/stanford_nlp/stanford-ner/stanford-ner.jar'
PATH_TO_MODEL = '/Users/christina/Projects/stanford_nlp/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz'
tagger = StanfordNERTagger(model_filename=PATH_TO_MODEL, path_to_jar=PATH_TO_JAR, encoding='utf-8')
sentence = "First up in London will be Riccardo Tisci, onetime Givenchy darling, favorite of Kardashian-Jenners everywhere, who returns to the catwalk with men’s and women’s wear after a year and a half away, this time to reimagine Burberry after the departure of Christopher Bailey."
words = nltk.word_tokenize(sentence)

Suggestion : 3

Stanford NER tagger: an NER tagger, open-sourced by Stanford engineers, that you can use with NLTK and that is used in this tutorial.

You need to train your own model. To do so, create a dummy-french-corpus.tsv file in {yourAppFolder}/stanford-ner-tagger/train with the following syntax:

This guide shows how to use NER tagging for English and non-English languages with NLTK and the Stanford NER tagger (Python). You can also use it to improve the Stanford NER tagger.

Neither NLTK, spaCy, nor SciPy handles French NER tagging out of the box. Fortunately, you can train models for new languages, but the respective documentation is really light on that point.

Once installed, make sure your $JAVA_HOME environment variable is set:

echo $JAVA_HOME

If you haven’t done it yet, create a virtual environment to work on:

mkvirtualenv .venv-ner --python=/usr/bin/python3
workon .venv-ner

Download NLTK:

pip install nltk

Run it:

python ner_english.py

Output should be:

[('Twenty', 'O'), ('miles', 'O'), ('east', 'O'), ('of', 'O'), ('Reno', 'ORGANIZATION'), (',', 'O'), ('Nev.', 'LOCATION'), (',', 'O'), ('where', 'O'), ('packs', 'O'), ('of', 'O'), ('wild', 'O'), ('mustangs', 'O'), ('roam', 'O'), ('free', 'O'), ('through', 'O'), ('the', 'O'), ('parched', 'O'), ('landscape', 'O'), (',', 'O'), ('Tesla', 'ORGANIZATION'), ('Gigafactory', 'ORGANIZATION'), ('1', 'ORGANIZATION'), ('sprawls', 'O'), ('near', 'O'), ('Interstate', 'LOCATION'), ('80', 'LOCATION'), ('.', 'O'), ('The', 'O'), ('Gigafactory', 'O'), (',', 'O'), ('whose', 'O'), ('construction', 'O'), ('began', 'O'), ('in', 'O'), ('June', 'DATE'), ('2014', 'DATE'), (',', 'O'), ('is', 'O'), ('not', 'O'), ('only', 'O'), ('outrageously', 'O'), ('large', 'O'), ('but', 'O'), ('also', 'O'), ('on', 'O'), ('its', 'O'), ('way', 'O'), ('to', 'O'), ('becoming', 'O'), ('the', 'O'), ('biggest', 'O'), ('manufacturing', 'O'), ('plant', 'O'), ('on', 'O'), ('earth', 'O'), ('.', 'O'), ('Now', 'O'), ('30', 'PERCENT'), ('percent', 'PERCENT'), ('complete', 'O'), (',', 'O'), ('its', 'O'), ('square', 'O'), ('footage', 'O'), ('already', 'O'), ('equals', 'O'), ('about', 'O'), ('35', 'O'), ('Costco', 'ORGANIZATION'), ('stores', 'O'), ('.', 'O')]

Suggestion : 4

We have 3 mailing lists for the Stanford Named Entity Recognizer, all of which are shared with other JavaNLP tools (with the exclusion of the parser). Each address is at @lists.stanford.edu:

Also available are caseless versions of these models, better for use on texts that are mainly lower or upper case, rather than following the conventions of standard English.

Ruby: tiendung has written a Ruby binding for the Stanford POS tagger and Named Entity Recognizer.

NLTK (2.0+) contains an interface to Stanford NER written by Nitin Madnani: documentation (note: set the character encoding or you get ASCII by default!), code, on GitHub.


java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz -testFile german-ner.tsv
java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz -tokenizerOptions latexQuotes=false -textFile german-ner.txt
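For comparing the Java side against the NLTK wrapper, it can help to build the same command line from Python. The sketch below only assembles the argument list from the flags shown above; `build_ner_command` and `run_ner` are hypothetical helper names, and actually running the command assumes Java is installed and the Stanford jars are in the working directory.

```python
import subprocess


def build_ner_command(model, input_path, labelled=False,
                      tokenizer_options=None, classpath="*"):
    """Assemble a CRFClassifier command line as an argument list."""
    cmd = ["java", "-cp", classpath,
           "edu.stanford.nlp.ie.crf.CRFClassifier",
           "-loadClassifier", model]
    if tokenizer_options:
        cmd += ["-tokenizerOptions", tokenizer_options]
    # -testFile expects a labelled column file, -textFile expects plain text.
    cmd += ["-testFile" if labelled else "-textFile", input_path]
    return cmd


def run_ner(cmd):
    """Run the command and return its standard output (requires Java and the jars)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


model = "edu/stanford/nlp/models/ner/german.conll.hgc_175m_600.crf.ser.gz"
cmd = build_ner_command(model, "german-ner.txt",
                        tokenizer_options="latexQuotes=false")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting, so the `*` classpath reaches Java unexpanded, as the quoted `"*"` does on the command line.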

Suggestion : 5

An alternative to NLTK's named entity recognition (NER) classifier is provided by the Stanford NER tagger. This tagger is largely seen as the standard in named entity recognition, but since it uses an advanced statistical learning algorithm, it's more computationally expensive than the option provided by NLTK.

A big benefit of the Stanford NER tagger is that it provides us with a few different models for pulling out named entities. We can use any of the following:

The parameters passed to the StanfordNERTagger class include:

Here's how we set it up to tag a sentence with the 3 class model:

# -*- coding: utf-8 -*-

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/usr/share/stanford-ner/stanford-ner.jar',
                       encoding='utf-8')

text = 'While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.'

tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)

print(classified_text)

Once we've tokenized by word and classified the sentence, we see the tagger produces a list of tuples as follows:

[('While', 'O'), ('in', 'O'), ('France', 'LOCATION'), (',', 'O'), ('Christine', 'PERSON'), ('Lagarde', 'PERSON'), ('discussed', 'O'), ('short-term', 'O'), ('stimulus', 'O'), ('efforts', 'O'), ('in', 'O'), ('a', 'O'), ('recent', 'O'), ('interview', 'O'), ('with', 'O'), ('the', 'O'), ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'), ('Journal', 'ORGANIZATION'), ('.', 'O')]
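Since the tagger labels each token separately, multi-word entities such as "Wall Street Journal" come back as several tuples. A minimal sketch of one way to merge them, assuming that consecutive tokens with the same non-`O` tag belong to one entity (`group_entities` is a made-up helper, not part of NLTK):

```python
from itertools import groupby

classified_text = [
    ('While', 'O'), ('in', 'O'), ('France', 'LOCATION'), (',', 'O'),
    ('Christine', 'PERSON'), ('Lagarde', 'PERSON'), ('discussed', 'O'),
    ('short-term', 'O'), ('stimulus', 'O'), ('efforts', 'O'), ('in', 'O'),
    ('a', 'O'), ('recent', 'O'), ('interview', 'O'), ('with', 'O'),
    ('the', 'O'), ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'),
    ('Journal', 'ORGANIZATION'), ('.', 'O'),
]


def group_entities(tagged_tokens, outside='O'):
    """Merge runs of consecutive tokens that share the same non-O tag."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((' '.join(word for word, _ in run), tag))
    return entities


print(group_entities(classified_text))
# [('France', 'LOCATION'), ('Christine Lagarde', 'PERSON'), ('Wall Street Journal', 'ORGANIZATION')]
```

Because the tags carry no begin/inside markers, two distinct entities of the same type that sit next to each other would be merged into one by this approach.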