Entity Extraction – Email id

 

The goal is to find a simple and reliable way to identify email ids in text documents. I am going to use a regular expression and define a rule for the strings (email ids) I am looking for.
Examples:
shakthydoss@gmail.com
student-244722@wilp.bits.pilani.edu.com
gns4f-3895494981@sale.craigslist.org
Before blindly writing some junk regex rule, I took time to understand the format that email ids follow.
Every email id has two parts, the local part and the domain part, separated by the symbol @ (at). The characters before the ‘@’ are referred to as the local part; the characters after the ‘@’ are referred to as the domain part. In this post I assume that the domain part always ends with a dot (.) followed by either two or three letters, as in .in or .com; longer endings such as .info exist and would need a wider range.
So the regular expression for identifying the email address should have rules to match the local part and domain part.
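Before writing any regex, the two-part structure can be checked with plain string handling. The snippet below is only a sketch: `split_address` is a hypothetical helper name, and splitting at the last '@' is my assumption for the simple addresses used in this post.

```python
# Sample addresses taken from the examples in this post.
samples = [
    "shakthydoss@gmail.com",
    "student-244722@wilp.bits.pilani.edu.com",
    "gns4f-3895494981@sale.craigslist.org",
]


def split_address(address):
    """Hypothetical helper: split at the last '@' into (local, domain)."""
    local, _, domain = address.rpartition("@")
    return local, domain


parts = [split_address(s) for s in samples]
for local, domain in parts:
    print(f"local={local!r}  domain={domain!r}")
```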
Regular expression for domain name.
Let's write a simple rule and experiment with its ability to capture domain names. As we go further, let's refine and retune the rule to make it powerful enough to match any domain that it may come across.
To get started, let's consider the simple case of matching a domain like @gmail.com,
so the regex rule could be something like

@([a-z]+)(\.[a-z]{2,3})

which will match @gmail.com, @yahoo.com and @hotmail.com.
I used an online regex editor, http://regexpal.com/, to test and evaluate the rules.
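The same check can be reproduced offline. Here is a minimal sketch using Python's `re` module, assuming the dot in the rule is meant as a literal dot (written `\.`); the list of test strings is my own selection and includes a few harder domains to show where this first rule stops.

```python
import re

# First-draft domain rule, with the dot escaped so that it matches
# a literal '.' rather than any character.
simple_domain = re.compile(r"@([a-z]+)(\.[a-z]{2,3})")

candidates = ["@gmail.com", "@yahoo.com", "@hotmail.com",
              "@yahoo.co.in", "@99acres.com", "@gov-au.com"]

# fullmatch() succeeds only when the rule covers the whole string.
results = {c: bool(simple_domain.fullmatch(c)) for c in candidates}
for domain, ok in results.items():
    print(f"{domain:15} {'match' if ok else 'no match'}")
```

Only the first three strings are covered in full by the rule; the others contain extra dots, digits or a dash.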
As you can see, the above expression matches domain names according to the rule we defined. But on going deeper, the rule does not match in full domain names such as @yahoo.co.in, @wilp.bits.pilani.edu.in, @99achers.com, @99acres.com and @gov-au.com.
This is because we have not said anything about dots, underscores, dashes and numbers. The regex engine cannot account for these characters with the rule it has in hand, so it leaves such strings unmatched even when they are valid domain names.
Let's refine the rule to make it match the valid domain names it missed.

@([a-z0-9-.]+)(\.[a-z]{2,3})$

As expected, it matches all the domains that contain dots, dashes and numbers.
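To see how the refined rule divides each domain between its two groups, here is a small Python sketch. It assumes the escaped-dot form of the rule; I have also moved the dash to the end of the character class (`[a-z0-9.-]`), which accepts the same characters and avoids any ambiguity about ranges.

```python
import re

# Refined domain rule: letters, digits, dots and dashes, then a literal
# dot followed by two or three letters at the end of the string.
refined_domain = re.compile(r"@([a-z0-9.-]+)(\.[a-z]{2,3})$")

domains = ["@gmail.com", "@yahoo.co.in", "@wilp.bits.pilani.edu.in",
           "@99acres.com", "@gov-au.com"]

for d in domains:
    m = refined_domain.match(d)
    print(d, "->", m.groups() if m else None)
```

For @yahoo.co.in the first group backtracks to `yahoo.co` so that the second group can take the final `.in`.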


Rule explanation

@ Matches the literal character ‘@’.
([a-z0-9-.]+) A group containing a character class of lowercase letters a to z, digits 0 to 9, dash and dot; the class can occur one or more times.
(\.[a-z]{2,3})$ A group that matches a literal dot followed by either two or three lowercase letters, at the end of the string.
Now regular expression for local part.
Like the domain name, the local part of an email id can contain letters, digits, dots, underscores and dashes. So the regex rule for the local part should be capable of matching such strings.

([\w\d.-])+


Rule explanation
([\w\d.-])+ A group containing a character class of word characters (\w: letters, digits and underscore), digits (\d, already covered by \w), dot and dash; the group can occur one or more times.
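The local-part rule can be tried on its own in the same way. This is a sketch under two assumptions of mine: the rule is read as `([\w\d.-])+`, and the `re.ASCII` flag is added so that `\w` is limited to a-z, A-Z, 0-9 and underscore (by default Python's `\w` also accepts non-ASCII letters). The test strings other than the two local parts from the examples are made up.

```python
import re

# Local-part rule; re.ASCII restricts \w to [a-zA-Z0-9_].
local_part = re.compile(r"([\w\d.-])+", re.ASCII)

tests = ["shakthydoss", "student-244722", "first.last_name", "bad local!"]

checks = {s: bool(local_part.fullmatch(s)) for s in tests}
for s, ok in checks.items():
    print(f"{s!r:20} {'match' if ok else 'no match'}")

# Because the '+' sits outside the parentheses, group(1) holds only the
# last character matched; group(0) is the whole match.
m = local_part.fullmatch("student-244722")
print(m.group(0), m.group(1))
```

If the whole local part is needed as a captured group, moving the quantifier inside the parentheses, as in `([\w.-]+)`, is one way to get it.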
Result 
Let's combine the local part and domain part rules to match complete email ids.

([\w\d.-])+@([a-z0-9-.]+)(\.[a-z]{2,3})$


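Also note that in JavaScript-based regex editors (such as http://regexpal.com/ ) the anchor meta-characters ^ and $ match only at the start and end of the whole test text unless the multiline flag (m) is set. So when evaluating the rule against several lines of sample text, either set that flag or remove the anchor meta-characters from the rule.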
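For the original goal of pulling email ids out of a text document, the combined rule can be applied with Python's `re.finditer`. The sketch below is an adaptation, not a definitive implementation: the trailing `$` is dropped so that addresses can be found in the middle of a sentence, the dash is placed last in the domain character class, `re.ASCII` is added, and the sample sentence is made up around the example addresses from this post.

```python
import re

# Combined rule without the '$' anchor, so that it can match addresses
# anywhere inside a longer string.
email_rule = re.compile(r"([\w\d.-])+@([a-z0-9.-]+)(\.[a-z]{2,3})", re.ASCII)

text = ("Contact shakthydoss@gmail.com or "
        "student-244722@wilp.bits.pilani.edu.com; replies go to "
        "gns4f-3895494981@sale.craigslist.org.")

# group(0) is the full matched address; the numbered groups hold the pieces.
emails = [m.group(0) for m in email_rule.finditer(text)]
for e in emails:
    print(e)
```

The sentence-ending full stop after the last address is left out of the match, because the rule requires two or three letters after the final dot.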

