POS Tagger

Objective

To apply part-of-speech tagging to English text.

Tool used

The Stanford POS Tagger, a Java implementation of a maximum-entropy (CMM) part-of-speech (POS) tagger.

Background

Part-of-speech tagging is the process of assigning a part of speech, such as noun, verb, pronoun, preposition, adverb, adjective or another lexical class marker, to each word in a sentence. Tags play an important role in natural language applications like speech recognition, natural language parsing, information retrieval and information extraction.

In this post I will show a simple implementation of a POS tagger using the tool mentioned above. Before jumping into the working model, let's first understand a few things about the Stanford POS Tagger.

The tagger has three modules, namely Tagging, Training, and Testing.

Tagging allows you to use a predefined model (simply put, a model file) to assign part-of-speech tags to text.

Training lets you build a new model file from text that has already been tagged manually. After training, the new model can be used for Tagging.

Testing allows you to evaluate the tagger's results against the correct tags.
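To make the Testing idea concrete, here is a minimal sketch in plain Java of how token-level accuracy can be computed by comparing predicted tags against the correct ones. It does not use the Stanford library. The class name TaggingAccuracy, the method accuracy, and the sample tag lists (including the assumed correct tags for "talk and jump") are hypothetical and only for illustration.

```java
import java.util.Arrays;
import java.util.List;

class TaggingAccuracy {

    // Returns the fraction of positions where the predicted tag
    // equals the correct (gold) tag.
    static double accuracy(List<String> predicted, List<String> gold) {
        if (predicted.size() != gold.size()) {
            throw new IllegalArgumentException("Tag sequences must have the same length");
        }
        if (gold.isEmpty()) {
            return 0.0; // nothing to evaluate
        }
        int correct = 0;
        for (int i = 0; i < gold.size(); i++) {
            if (predicted.get(i).equals(gold.get(i))) {
                correct++;
            }
        }
        return (double) correct / gold.size();
    }

    public static void main(String[] args) {
        // Hypothetical example: assumed correct tags vs. tags produced by a tagger
        // for the three tokens of "talk and jump".
        List<String> gold = Arrays.asList("VB", "CC", "VB");
        List<String> predicted = Arrays.asList("NN", "CC", "NN");
        System.out.printf("Token accuracy: %.2f%%%n", 100 * accuracy(predicted, gold));
    }
}
```

An accuracy figure such as 97.32% can be read in the same way: the share of tokens whose predicted tag matches the manually assigned one.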

For now we will look only at Tagging, which uses a predefined model.

The Stanford POS Tagger ships with two predefined models for the English language:

1. bidirectional-distsim-wsj-0-18.tagger. Its accuracy is 97.32%.

2. left3words-wsj-0-18.tagger. This model runs a lot faster. Its accuracy is 96.92%.

MaxentTagger is one of the classes defined in the Stanford POS Tagger. The constructor of the MaxentTagger class takes the path to a predefined model as an argument for the tagging process.

Since we want to tag English text, we can use either bidirectional-distsim-wsj-0-18.tagger or left3words-wsj-0-18.tagger.

To tag a String of text and to get back a String with tagged words:

<code>

String taggedString = tagger.tagString("Here’s a tagged string.");

</code>
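The returned String holds one word/TAG token per word, as in the sample outputs in the test cases later in this post (for example John/NNP is/VBZ working/VBG). The following is a minimal sketch, in plain Java without the Stanford library, of splitting such a String back into word and tag pairs. It assumes "/" is the separator, as in the outputs shown in this post; the class name TaggedOutputParser and the method parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedOutputParser {

    // Splits a tagged string such as "John/NNP is/VBZ" into {word, tag} pairs.
    // Assumes "/" separates word and tag; the last "/" in a token is used,
    // so a word that itself contains a slash stays intact.
    static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = tagged.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        for (String token : trimmed.split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                // No usable separator: keep the token and leave the tag empty.
                pairs.add(new String[] {token, ""});
            } else {
                pairs.add(new String[] {token.substring(0, slash), token.substring(slash + 1)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("John/NNP is/VBZ working/VBG")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Having the pairs in a list makes it easy to filter by tag, for example to keep only the nouns or verbs of a sentence.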

For demo purposes, let's tag a string entered at the command prompt:

<code>

Scanner sn = new Scanner(System.in);
System.out.println("Enter a String…");
System.out.println(tagger.tagString(sn.nextLine()));

</code>

Reference Tagset

Tag    Description
CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NNP    Proper noun, singular
NNPS   Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
TO     To
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb
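A tagset like this is handy to keep as a lookup table, so that tagged output can be shown with readable descriptions. The following is a minimal sketch in plain Java; the class name TagsetLookup and the method describe are hypothetical, and only a handful of tags are included to keep it short.

```java
import java.util.HashMap;
import java.util.Map;

class TagsetLookup {

    // A small subset of the tagset; the remaining tags can be added the same way.
    private static final Map<String, String> DESCRIPTIONS = new HashMap<>();

    static {
        DESCRIPTIONS.put("DT", "Determiner");
        DESCRIPTIONS.put("NN", "Noun, singular or mass");
        DESCRIPTIONS.put("PRP", "Personal pronoun");
        DESCRIPTIONS.put("RB", "Adverb");
        DESCRIPTIONS.put("TO", "To");
        DESCRIPTIONS.put("VBD", "Verb, past tense");
    }

    // Returns the description for a tag, or "Unknown tag" if it is not in the map.
    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "Unknown tag");
    }

    public static void main(String[] args) {
        // Tags taken from the sample output "She/PRP ran/VBD to/TO the/DT station/NN quickly/RB".
        String[] tags = {"PRP", "VBD", "TO", "DT", "NN", "RB"};
        for (String tag : tags) {
            System.out.println(tag + ": " + describe(tag));
        }
    }
}
```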

Implementation code

package postagger;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import java.util.Scanner;

/**
 * @author shakthydoss
 */
public class MyPOS {

    public static void main(String[] args) throws Exception {
        // Load a predefined English model. Assumes the model file sits in a
        // models/ directory relative to the working directory.
        MaxentTagger tagger = new MaxentTagger("models/bidirectional-distsim-wsj-0-18.tagger");

        // Read one line from standard input and print it with POS tags attached.
        Scanner sn = new Scanner(System.in);
        System.out.println("\nEnter a String :");
        System.out.println("\nAfter tagging..\n" + tagger.tagString(sn.nextLine().trim()));
    }
}

Test case

I experimented with different texts. I observed that tagging results are good when the input English text is grammatical, and poor when it is not. This is because the Stanford POS Tagger is a maximum-entropy (CMM) model that uses the surrounding words and tags as context when choosing each tag, so an isolated word or an ungrammatical fragment gives it little context to work with.

Input                            Output                                                Comment
John is working                  John/NNP is/VBZ working/VBG                           Good
working                          working/VBG                                           Good
playing                          playing/NN                                            Bad
She ran to the station quickly   She/PRP ran/VBD to/TO the/DT station/NN quickly/RB    Good
talk and jump                    talk/NN and/CC jump/NN                                Bad (input is bad grammar)

Output

[Image: pos tagger]


4 Comments

  1. Peterpan wrote
    at 8:28 AM - 6th July 2011

    Thank you for knowledge

  2. Rajagopal wrote
    at 5:36 PM - 10th April 2013

    Very well written.

    I came across the Stanford POS Tagger recently, and I have been trying to implement it through my PHP page. Do you know if this POS tagger comes in a PHP API version? Or is there a way I can access the Java program through my PHP code?

    Thanks a lot for the info 🙂

  3. shakthydoss wrote
    at 5:42 PM - 10th April 2013

    Thank you, Rajagopal.

    The link below will help you implement a POS tagger in PHP:
    http://phpir.com/part-of-speech-tagging/

    However, if you would like to use Java, implement the POS tagger as a web service in Java and then call the service from PHP or JavaScript.

  4. Rajagopal wrote
    at 6:46 PM - 10th April 2013

    Thanks Shakthydoss, for that quick reply.

    I did check out the link you recommended earlier. The implementation of it was really easy and it is very efficient. However, I found several errors and unforgivable approximation in the Brill Tagger which I don’t find in the Stanford one. For instance, “I hope” has been tagged as “I_NN hope_NN” when it should be “I_PRP hope_VB|VBP”. I am currently working with the Brill Tagger, but I would sure like to use the Stanford one.

    As for your suggestion for using a web service, I think I will go down that route and see how viable it is for my purposes. Once again, thank you so much for your help.
