Information Extraction (Phone Numbers) from Free-Running Text

Prologue

Hi, I was recently given a few assignments by my master, திரு. சுதர்சன் சாந்தியப்பன் (visiting professor at SRM University). On completing them, he advised us to publish our work on the internet, so here I am doing just that with my solutions.

Objective: To extract phone numbers from resumes.

Assumptions:

  1. All documents are unstructured
  2. Predictable patterns exhibit some commonality

Observation & Legends:

The experiment showed that most people have their own format (style) for writing a contact number, even though a standard format exists.

Before writing the rules, I collected the different ways contact numbers are written in resumes. After studying this collection closely, I rearranged it so that numbers with a similar structure fall into the same cluster, and then wrote a rule for each cluster.

For example, here are some of the different ways contact numbers are written:

  2872 4672
  +33/641472652
  078887 865 638
  (852) 9036 3344
  +39-041-5349-042
  93 783434
  565-2972
  +39/3283575424
  9835-5675
  413-434-2323
  1390-130-3861
  +32 02 90654739
  (853)36980616
  82-11-661-4906
  3343715090

After rearranging the collection, contact numbers with a similar structure fall into the same cluster.

 

These can be considered one cluster:

  2872 4672
  9835-5675
  565-2972

So the rule for this cluster is (\d{3,4}\s?-?\d{4})
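A minimal sketch of how the rule above could be tried out in Java. The class name `ClusterOneDemo` and the sample sentence are only illustrative assumptions; the pattern is the cluster rule with its backslashes doubled, as a Java string literal requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterOneDemo {
    // Cluster rule: 3 or 4 digits, optional space, optional hyphen, 4 digits.
    static final Pattern RULE = Pattern.compile("\\d{3,4}\\s?-?\\d{4}");

    // Collects every substring of the text that the rule matches.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = RULE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [2872 4672, 9835-5675, 565-2972]
        System.out.println(extract("Home 2872 4672, office 9835-5675 or 565-2972"));
    }
}
```

Note that the same rule also accepts strings such as a year range written with a hyphen, so its matches still need to be checked before they are treated as phone numbers.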

 

These can be considered one cluster:

  (852) 9036 3344
  (853)36980616

So the rule for this cluster is (\(?\d{3}?\)?\s?\d{4}\s?\d{4})

 

These can be considered one cluster:

  93 783434
  82-11-661-4906

So the rule for this cluster is (\d{2}?\s?-?\d{2}?\s?-?\d{3}?\s?-?\d*)

 

This can be considered one cluster:

  078887 865 638

So the rule for this cluster is (\d{6}\s?\d{3}?\s?\d{3})

 

These can be considered one cluster:

  +39/3283575424
  +33/641472652

So the rule for this cluster is (\+\d{2}?/?\d{9,10})

 

This can be considered one cluster:

  +39-041-5349-042

So the rule for this cluster is (\+\d{2}?-?\d{3}?-?\d{4}?-?\d{3})

 

These can be considered one cluster:

  1390-130-3861
  +39-041-5349-042
  413-434-2323

So the rule for this cluster is ((\+\d{2})?-?\d{3,4}-\d{3,4}-\d{3,4})

 

These can be considered one cluster:

  +32 02 90654739
  3343715090

So the rule for this cluster is ((\+\d{2})?\s?\d{2}?\s?\d{8})
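To see how the eight cluster rules relate to the sample numbers, here is a small sketch that reports the first rule matching a whole sample. The class `ClusterRules` and the method `firstMatchingRule` are hypothetical names for illustration, and the patterns assume the rules are read with `\d`, `\s`, `\(` and `\+` escapes.

```java
import java.util.regex.Pattern;

public class ClusterRules {
    // The eight cluster rules, in the order they appear in the text.
    static final Pattern[] RULES = {
        Pattern.compile("\\d{3,4}\\s?-?\\d{4}"),
        Pattern.compile("\\(?\\d{3}?\\)?\\s?\\d{4}\\s?\\d{4}"),
        Pattern.compile("\\d{2}?\\s?-?\\d{2}?\\s?-?\\d{3}?\\s?-?\\d*"),
        Pattern.compile("\\d{6}\\s?\\d{3}?\\s?\\d{3}"),
        Pattern.compile("\\+\\d{2}?/?\\d{9,10}"),
        Pattern.compile("\\+\\d{2}?-?\\d{3}?-?\\d{4}?-?\\d{3}"),
        Pattern.compile("(\\+\\d{2})?-?\\d{3,4}-\\d{3,4}-\\d{3,4}"),
        Pattern.compile("(\\+\\d{2})?\\s?\\d{2}?\\s?\\d{8}")
    };

    // Returns the index of the first rule that matches the entire sample, or -1.
    static int firstMatchingRule(String sample) {
        for (int i = 0; i < RULES.length; i++) {
            if (RULES[i].matcher(sample).matches()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "2872 4672", "(853)36980616", "82-11-661-4906", "078887 865 638",
            "+39/3283575424", "+39-041-5349-042", "413-434-2323", "+32 02 90654739"
        };
        for (String s : samples) {
            System.out.println(s + " -> rule index " + firstMatchingRule(s));
        }
    }
}
```

The rules overlap: for instance 3343715090 is already accepted by the third rule (index 2) before the eighth rule is reached, so the order in which rules are tried affects which one reports the match.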

 

These can be considered one cluster:

  Tel : 3456 3456
  TEL: 67543 898
  Telephone: 67543 898
  telephone: 67543 898
  TELEPHONE: 67543 898
  Phone : 23456343
  PHONE:
  Mobile: 98869366455
  MOBILE: + 91 98346563434

So the rule for this cluster is ((Tel)?(TELEPHONE)?(Phone)?(PHONE)?(Mobile)?(MOBILE)?.*)

Problem: the rule ((Tel)?(TELEPHONE)?(Phone)?(PHONE)?(Mobile)?(MOBILE)?.*) sometimes fails, because there are cases where the contact label (for example, Phone) is on one line and the actual number is on the next line.

E.g.

Phone :

9884965355

Sometimes blank space alone is captured by this rule, so it is advisable to leave this rule out.
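One possible way around the label-on-one-line problem, sketched under assumptions of my own: when a label has nothing after it on its line, look ahead to the next non-blank line. The class `LabelLookahead`, the method `valuesAfterLabels` and the label pattern are hypothetical and are not part of the program shown later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelLookahead {
    // Hypothetical label pattern: a contact label, an optional colon,
    // then whatever follows on the same line (captured in group 1).
    static final Pattern LABEL = Pattern.compile(
            "(?i)\\b(?:telephone|tel|phone|mobile)\\b\\s*:?\\s*(.*)");

    // Returns the text found after each label, taking it from the next
    // non-blank line when the label line itself carries no value.
    static List<String> valuesAfterLabels(List<String> lines) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = LABEL.matcher(lines.get(i));
            if (!m.find()) {
                continue;
            }
            String rest = m.group(1).trim();
            int j = i + 1;
            while (rest.isEmpty() && j < lines.size()) {
                rest = lines.get(j).trim();
                j++;
            }
            if (!rest.isEmpty()) {
                values.add(rest);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Phone :", "", "9884965355", "Tel : 3456 3456");
        // Prints [9884965355, 3456 3456]
        System.out.println(valuesAfterLabels(lines));
    }
}
```

The returned values are only candidates; they would still have to be checked against the number rules, since the line after a label is not guaranteed to hold a phone number.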

In addition, I have written some more rules that follow individual countries' standard formats for writing phone numbers, to hunt for contact numbers.

Regular expression for Argentina

  1. (std code), optional: can be prefixed with one or more '+' symbols; can be enclosed in brackets
  2. (o), optional: can be enclosed in brackets
  3. (phone number), required: 8 digits have to be there; the 8 digits can be split into 3:3:2, 4:4, 3:2:3, etc.; can be enclosed in brackets

So I have written a regular expression that matches all the above constraints:

(((\+*)?[5][4]?[0-9]{0,1})?(\s?.?\s?[0-9]{2,3}\s?.?)?(\s?[0-9]{1}-?[0-9]{3}.?[0-9]{4}.?))
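As a rough check of the Argentina rule, the sketch below runs it over the two-number line used in the test case later in this post. The class name `ArgentinaRuleDemo` is an assumption for illustration, and the pattern assumes the `+` and `s` in the rule are escaped as `\+` and `\s` while `.` stays a wildcard (which is what lets it cover the brackets).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArgentinaRuleDemo {
    // Argentina rule: optional country code, optional area code, required number.
    static final Pattern RULE = Pattern.compile(
            "(((\\+*)?[5][4]?[0-9]{0,1})?(\\s?.?\\s?[0-9]{2,3}\\s?.?)?"
            + "(\\s?[0-9]{1}-?[0-9]{3}.?[0-9]{4}.?))");

    // Returns every match, trimmed because the trailing '.?' may pick up a space.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = RULE.matcher(text);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints [+54 (11) 4383.4933, +54 (911) 6003.7885]
        System.out.println(extract("Tel: +54 (11) 4383.4933 / Mobile: +54 (911) 6003.7885"));
    }
}
```

Because `.` matches any character, the rule is more permissive than the constraints listed above; that is convenient for brackets and dots but also raises the chance of false matches.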

Regular expression for Canada

  1. 3 digits, required: can be enclosed in brackets
  2. 3 digits, required: can be enclosed in brackets; can begin with a special character such as '.', '-' or a space
  3. 4 digits, required: can be enclosed in brackets; can begin with a special character such as '.', '-' or a space

So I have written a regular expression that matches all the above constraints:

(\(?\s?[0-9]{3}?\s?\)?\s?-?\s?\(?\s?[0-9]{3}?\s?\)?\s?-?\s?\(?\s?[0-9]{4}?\s?\)?)
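A small sketch for trying the Canada rule on whole strings. `CanadaRuleDemo` and `looksCanadian` are hypothetical names, and the pattern assumes the rule's brackets and spaces are written as `\(`, `\)` and `\s`.

```java
import java.util.regex.Pattern;

public class CanadaRuleDemo {
    // Canada rule: 3 + 3 + 4 digits, each group optionally bracketed,
    // separated by optional spaces and an optional hyphen.
    static final Pattern RULE = Pattern.compile(
            "(\\(?\\s?[0-9]{3}?\\s?\\)?\\s?-?\\s?\\(?\\s?[0-9]{3}?\\s?\\)?"
            + "\\s?-?\\s?\\(?\\s?[0-9]{4}?\\s?\\)?)");

    // True when the whole candidate string fits the rule.
    static boolean looksCanadian(String candidate) {
        return RULE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"413-434-2323", "(413) 434 2323", "4134342323", "12-34"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksCanadian(s));
        }
    }
}
```

In this form the pattern accepts spaces and hyphens between the groups but not a dot, so a number written as 413.434.2323 would need an extra optional `\.` in the separators.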

Regular expression for Germany

  1. Std code, optional: can be prefixed with one or more '+' symbols; can be enclosed in brackets
  2. 2 (or 3) digits, required: can be enclosed in brackets
  3. 3 (or 2) digits, required: can be enclosed in brackets; a space may be present
  4. 2 (or 3) digits, required: can be enclosed in brackets; a space may be present
  5. 2 digits, required: can be enclosed in brackets; a space may be present

So I have written a regular expression that matches all the above constraints:

(\+*\d*\s*\(?\[?\d?\]?\)?\s?\d?\d\d?\s?\d\d\d?\s?\d\d?\s?\d\d?\s?\d*\s?)

Regular expression in generalised form

A very generalised rule is required for phone numbers that are not written in a correct form. As mentioned earlier, most people follow their own format (hence the name unstructured).

  1. Min 1 digit to max anything, required: can be prefixed with one or more '+' symbols; can be enclosed in brackets; a space may be present
  2. Min 1 digit to max anything, optional: can be enclosed in brackets; a space may be present; can begin with a special character such as '.' or '-'
  3. Min 1 digit to max anything, optional: can be enclosed in brackets; a space may be present; can begin with a special character such as '.' or '-'
  4. Min 1 digit to max anything, optional: can be enclosed in brackets; a space may be present; can begin with a special character such as '.' or '-'

So I have written a regular expression that matches all the above constraints:

(\+*\(?\[?-?\d?\d?\d*-?\]?\)?\s?\(?\[?-?\d?\d?\d*-?\]?\)?\s?\(?\[?-?\d?\d?\d*-?\]?\)?\s?\(?\[?-?\d?\d?\d*-?\]?\)?)

Evaluation

To test the approach, a sample of 100 resumes was taken and the phone numbers were extracted from them. The multiple regular expressions were grouped into a single expression to extract the phone numbers from the resumes.

Explanation about the rules ..

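The above regular expressions have to be grouped into a single expression. It is shown here as a Java string literal, so every backslash is doubled:

"((((\\+*)?[5][4]?[0-9]{0,1})?(\\s?.?\\s?[0-9]{2,3}\\s?.?)?(\\s?[0-9]{1}-?[0-9]{3}.?[0-9]{4}.?))|(\\(?\\s?[0-9]{3}?\\s?\\)?\\s?-?\\s?\\(?\\s?[0-9]{3}?\\s?\\)?\\s?-?\\s?\\(?\\s?[0-9]{4}?\\s?\\)?)|(\\+*\\d*\\s*\\(?\\[?\\d?\\]?\\)?\\s?\\d?\\d\\d?\\s?\\d\\d\\d?\\s?\\d\\d?\\s?\\d\\d?\\s?\\d*\\s?)|(\\+*\\(?\\[?-?\\d*-?\\]?\\)?\\s?\\(?\\[?-?\\d*-?\\]?\\)?\\s?\\(?\\[?-?\\d*-?\\]?\\)?\\s?\\(?\\[?-?\\d*-?\\]?\\)?\\s?\\(?\\[?-?\\d*-?\\]?\\)?)| (\\d{3,4}\\s?-?\\d{4})|(\\(?\\d{3}?\\)?\\s?\\d{4}\\s?\\d{4})|(\\d{2}?\\s?-?\\d{2}?\\s?-?\\d{3}?\\s?-?\\d*)|(\\d{6}\\s?\\d{3}?\\s?\\d{3})|(\\+\\d{2}?/?\\d{9,10})|(\\+\\d{2}?-?\\d{3}?-?\\d{4}?-?\\d{3})|((\\+\\d{2})?-?\\d{3,4}-\\d{3,4}-\\d{3,4})|((\\+\\d{2})?\\s?\\d{2}?\\s?\\d{8}))";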

Sometimes the above rule extracts a pattern that resembles a phone number but is not one; for example, 2001-2002 is a valid phone number according to the rule. So there should be a mechanism to eliminate this kind of error.

These errors are avoided by checking the length and the starting string of each match; a match is displayed as output only if it satisfies these constraints.
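The two ideas above, grouping rules with `|` and then discarding year-like or too-short matches, can be sketched as follows. To keep it short the sketch combines only two of the cluster rules; the class `CombineAndFilter` and its method names are hypothetical, while the length and prefix checks follow the ones used in the program code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CombineAndFilter {
    // Two of the cluster rules, chosen only to keep the example small.
    static final List<String> RULES = Arrays.asList(
            "\\+\\d{2}?/?\\d{9,10}",
            "\\d{3,4}\\s?-?\\d{4}");

    // Wraps each rule in parentheses and joins them with '|'.
    static Pattern combine(List<String> rules) {
        List<String> wrapped = new ArrayList<>();
        for (String rule : rules) {
            wrapped.add("(" + rule + ")");
        }
        return Pattern.compile(String.join("|", wrapped));
    }

    // Rejects matches of 7 characters or fewer and matches that look like years.
    static boolean isPlausible(String candidate) {
        String c = candidate.trim();
        if (c.length() <= 7) {
            return false;
        }
        return !(c.startsWith("19") || c.startsWith("200")
                || c.startsWith("(19") || c.startsWith("(200"));
    }

    static List<String> extract(String text) {
        List<String> phones = new ArrayList<>();
        Matcher m = combine(RULES).matcher(text);
        while (m.find()) {
            if (isPlausible(m.group())) {
                phones.add(m.group().trim());
            }
        }
        return phones;
    }

    public static void main(String[] args) {
        // Prints [+39/3283575424, 565-2972]; the year range is filtered out.
        System.out.println(extract("2001-2002 Tel : +39/3283575424 or 565-2972"));
    }
}
```

The prefix check is a heuristic: it removes year ranges, but it would also drop a genuine number that happens to start with 19 or 200.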

Program code

// Imports needed by this fragment.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.JFileChooser;
import javax.swing.JFrame;

// Body of the action handler in the ExtractPhone form: jTextArea1 shows the
// input text and jTextArea2 shows the extracted numbers.
jTextArea1.setText("");
jTextArea2.setText("");
JFileChooser jFileChooser1 = new JFileChooser();
jFileChooser1.setCurrentDirectory(new File("Testing sample"));
jFileChooser1.setFileFilter(new javax.swing.filechooser.FileFilter() {
    public boolean accept(File f) {
        return f.getName().toLowerCase().endsWith(".txt") || f.isDirectory();
    }
    public String getDescription() {
        return "txt";
    }
});
int r = jFileChooser1.showOpenDialog(new JFrame());
if (r != JFileChooser.APPROVE_OPTION) {
    return; // the dialog was cancelled, so there is no file to read
}
String path = jFileChooser1.getSelectedFile().getAbsolutePath();

// Compile the combined expression once, outside the read loop.
String ex2 = "((((\\+*)?[5][4]?[0-9]{0,1})?(\\s?.?\\s?[0-9]{2,3}\\s?.?)?(\\s?[0-9]{1}-?[0-9]{3}.?[0-9]{4}.?))|(\\(?\\s?[0-9]{3}?\\s?\\)?\\s?-?\\s?\\(?\\s?[0-9]{3}?\\s?\\)?\\s?-?\\s?\\(?\\s?[0-9]{4}?\\s?\\)?)|(\\+*\\d*\\s*\\(?\\[?\\d?\\]?\\)?\\s?\\d?\\d\\d?\\s?\\d\\d\\d?\\s?\\d\\d?\\s?\\d\\d?\\s?\\d*\\s?)|(\\+*\\(?\\[?-?\\d*-?\\]?\\)?\\s?\\(?\\[?-?\\d*-?\\]?\\)?\\s?\\(?\\[?-?\\d*-?\\]?\\)?\\s?\\(?\\[?-?\\d*-?\\]?\\)?\\s?\\(?\\[?-?\\d*-?\\]?\\)?)| (\\d{3,4}\\s?-?\\d{4})|(\\(?\\d{3}?\\)?\\s?\\d{4}\\s?\\d{4})|(\\d{2}?\\s?-?\\d{2}?\\s?-?\\d{3}?\\s?-?\\d*)|(\\d{6}\\s?\\d{3}?\\s?\\d{3})|(\\+\\d{2}?/?\\d{9,10})|(\\+\\d{2}?-?\\d{3}?-?\\d{4}?-?\\d{3})|((\\+\\d{2})?-?\\d{3,4}-\\d{3,4}-\\d{3,4})|((\\+\\d{2})?\\s?\\d{2}?\\s?\\d{8}))";
Pattern pt2 = Pattern.compile(ex2);

// try-with-resources closes the reader even if reading fails.
try (BufferedReader br = new BufferedReader(new FileReader(path))) {
    String line;
    while ((line = br.readLine()) != null) {
        if (line.length() == 0) {
            continue;
        }
        jTextArea1.setText(jTextArea1.getText() + "\n" + line);
        Matcher mt2 = pt2.matcher(line);
        while (mt2.find()) {
            String temp = mt2.group().trim();
            // Skip short matches and year-like matches such as 1998 or 2001-2002.
            if (temp.length() <= 7 || temp.startsWith("19") || temp.startsWith("200")
                    || temp.startsWith("(200") || temp.startsWith("(19")) {
                continue;
            }
            // Drop a stray leading character picked up by the wildcard parts of the rule.
            if (temp.startsWith(")") || temp.startsWith("a")) {
                temp = temp.substring(1);
            }
            jTextArea2.setText(jTextArea2.getText() + "\n" + temp.trim());
        }
    }
} catch (IOException ex) {
    Logger.getLogger(ExtractPhone.class.getName()).log(Level.SEVERE, null, ex);
}

Test case:

Regx = the combined regular expression given in the Evaluation section.

The pairs below show a sample of the contact numbers that are extracted by the above regex.

Input:  Contact : (852) 9036 3344
Output: (852) 9036 3344

Input:  Phone :
        9886735437
Output: 9886735437

Input:  Tel : +39/3283575424
Output: +39/3283575424

Input:  Phone no : 078887 865 638
Output: 078887 865 638

Input:  Tel: +54 (11) 4383.4933 / Mobile: +54 (911) 6003.7885
Output: +54 (11) 4383.4933
        +54 (911) 6003.7885


Download: http://www.mediafire.com/?8tyk61vl37z8onw


