Information Extraction (Name and Email ID) from Free Running Text

Prologue

Hi. I was recently given a few assignments by my master, திரு. சுதர்சன் சாந்தியப்பன் (visiting professor at SRM University). On completing the assignments he advised us to publish our work on the internet, so here I am doing just that with my solutions.

The objective of this post is to extract names and email IDs from a free running document.

I assume that:

1. The document has a semi-structured form.

2. The expected pattern can be extracted by a rule-based grammar.

So let's see how it is going to work. I want to extract names and email IDs from a web page. First I went to the website (http://www.ias.ac.in/php/assoc_all.php3?alpha=A), manually copied the entire text, and pasted it into Notepad. Instead of saving the file as .txt I saved it as .html, although it does not contain any HTML tags. I then wrote a simple Java program that reads the HTML file line by line. When it encounters a string that matches one of the regular expression patterns, it displays the result.

Because the document is semi-structured, a rule for extracting each pattern can be inferred from the text. Here the name and the email ID are the expected patterns. Each name is preceded by the label ‘Name:’, so capturing this label and the text that follows it extracts the name. Likewise, each email ID is preceded by the label ‘Email:’, so capturing this label and the text that follows it extracts the email ID.

Evaluation

Regular expression pattern for name: "Name : (.+)"

Regular expression pattern for mail: "Email : (.+)"
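
To make the role of the capture group concrete, here is a minimal sketch that applies the two expressions to single lines held in memory. The class name LabelPatternDemo, the helper valueAfterLabel, and the sample lines are assumptions made up for illustration; they are not taken from the real page. It uses group(1) to show that the part in parentheses holds only the value after the label.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelPatternDemo {

    // Same expressions as in this post; the capture group (.+) holds the value.
    static final Pattern NAME = Pattern.compile("Name : (.+)", Pattern.CASE_INSENSITIVE);
    static final Pattern MAIL = Pattern.compile("Email : (.+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the text after the label,
    // or null when the line does not contain that label.
    static String valueAfterLabel(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample lines, only for illustration.
        String nameLine = "Name : A. Example";
        String mailLine = "email : a.example@example.org";

        System.out.println(valueAfterLabel(NAME, nameLine)); // value of the Name line
        System.out.println(valueAfterLabel(MAIL, mailLine)); // matches despite lower-case label
        System.out.println(valueAfterLabel(NAME, mailLine)); // no Name label on this line
    }
}
```

Because of Pattern.CASE_INSENSITIVE the label "email :" is matched as well, and a line without the label simply yields no match.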

Program code:

package Extraction;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameAndMailExtraction {

    public static void main(String[] args) throws IOException {

        // Label followed by its value; (.+) captures the rest of the line.
        Pattern namePattern = Pattern.compile("Name : (.+)", Pattern.CASE_INSENSITIVE);
        Pattern mailPattern = Pattern.compile("Email : (.+)", Pattern.CASE_INSENSITIVE);

        // Backslashes in a Windows path must be escaped inside a Java string.
        // try-with-resources closes both files even if an exception is thrown.
        try (BufferedReader br = new BufferedReader(
                     new FileReader("C:\\Users\\shakthydoss\\Desktop\\sample.html"));
             BufferedWriter out = new BufferedWriter(new FileWriter("out.txt"))) {

            String s;
            while ((s = br.readLine()) != null) {
                Matcher nameMatch = namePattern.matcher(s);
                Matcher mailMatch = mailPattern.matcher(s);

                if (nameMatch.find()) {
                    // group() is the whole match, label included.
                    out.write(nameMatch.group().trim());
                    out.newLine();
                    System.out.println(nameMatch.group());
                }

                if (mailMatch.find()) {
                    out.write(mailMatch.group().trim());
                    out.newLine();
                    out.newLine(); // blank line between records
                    System.out.println(mailMatch.group());
                    System.out.println(" ");
                }
            }
        }
    }
}

Here is the output:

[Image: output2]
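
The program writes names and email IDs as separate lines. If the results are needed as records, the same two patterns could be used to pair each name with the email that follows it. The sketch below is only one possible way to do that, not part of the assignment solution: the class ContactPairing, the method pair, and the sample lines are all hypothetical, and it assumes that every Email line belongs to the most recent Name line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContactPairing {

    static final Pattern NAME = Pattern.compile("Name : (.+)", Pattern.CASE_INSENSITIVE);
    static final Pattern MAIL = Pattern.compile("Email : (.+)", Pattern.CASE_INSENSITIVE);

    // Walks the lines in order and emits "name <email>" whenever an Email line
    // is seen after a Name line. Assumption: an Email line always belongs to
    // the most recent Name line; names without an email are dropped.
    static List<String> pair(List<String> lines) {
        List<String> contacts = new ArrayList<>();
        String currentName = null;
        for (String line : lines) {
            Matcher nameMatch = NAME.matcher(line);
            Matcher mailMatch = MAIL.matcher(line);
            if (nameMatch.find()) {
                currentName = nameMatch.group(1).trim();
            } else if (mailMatch.find() && currentName != null) {
                contacts.add(currentName + " <" + mailMatch.group(1).trim() + ">");
                currentName = null; // wait for the next Name line
            }
        }
        return contacts;
    }

    public static void main(String[] args) {
        // Made-up lines in the same semi-structured shape as the copied page.
        List<String> lines = Arrays.asList(
                "Name : A. Example",
                "Address : 1 Sample Road",
                "Email : a.example@example.org",
                "Name : B. Sample",
                "Email : b.sample@example.org");

        for (String contact : pair(lines)) {
            System.out.println(contact);
        }
    }
}
```

Lines that carry neither label, such as the Address line in the sample, are skipped, which is the same behaviour as in the program above.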


