Entity Extraction – URL

 

For this entity extraction task, my goal is to write a simple regex rule that identifies the most common URLs in text documents.
Examples: http://shakthydoss.com , https://support.company.com , http://172.16.7.41/home/ ,
http://172.6.7.41/home?name=shakthydoss&year=2013
As I said earlier, I first took time to understand the structure that a URL is composed of.

Every URL consists of the following units:

  1. The scheme name (commonly called the protocol), then
  2. A colon and two slashes, then
  3. A host, normally given as a domain name but sometimes as a literal IP address, then
  4. Optionally a port number, then
  5. The full path of the resource, then
  6. Optionally one or more query parameters
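
To make these units concrete, here is a small sketch that splits one URL with Python's standard `urllib.parse.urlsplit`. This is only an illustration of the structure, not the regex approach this post builds; the `:8080` port is an assumption added so that every unit is present.

```python
from urllib.parse import urlsplit

# The :8080 port is an illustrative assumption; the post's examples carry no port.
url = "http://172.16.7.41:8080/home?name=shakthydoss&year=2013"
parts = urlsplit(url)

print(parts.scheme)    # unit 1: the scheme (protocol)
print(parts.hostname)  # unit 3: the host, here a literal IP address
print(parts.port)      # unit 4: the optional port, as an integer
print(parts.path)      # unit 5: the path of the resource
print(parts.query)     # unit 6: the optional query parameters
```
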
My plan is to write an individual regex rule for each unit and then combine all the rules, so that the result is powerful enough to match the common URLs it may come across.

 

Regular expression for matching protocol 
My requirement is to match either http or https, so I will use the pipe symbol, which tells the regex engine to choose either this or that.

(http|https)
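
A quick way to try this rule is sketched below with Python's `re` module (the choice of Python and the name `PROTOCOL` are mine, not part of the original rule). The shorter pattern `https?` should accept the same two words.

```python
import re

PROTOCOL = re.compile(r"(http|https)")

# fullmatch succeeds only if the whole word is consumed by the pattern.
results = {word: PROTOCOL.fullmatch(word) is not None
           for word in ["http", "https", "ftp"]}
print(results)
```
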


Regular expression to match colon and two slashes 
This is straightforward: :// can be used directly as literal text.


Regular expression to match host name or ip address 
We will reuse the same rule for the host name that we discussed in my previous post, Entity extraction for email,
so the regex rule will look like this:

(http|https)://(([a-z-.]+)(\.[a-z]{2,3}))
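
The sketch below tries this rule on three of the sample URLs in Python. I am assuming the dot before the top-level domain is meant to be escaped (`\.`), and I place the hyphen last in the character class so it is read literally; `HOST_URL` and `samples` are illustrative names.

```python
import re

# Assumed reconstruction: escaped dot before the TLD, hyphen last in the class.
HOST_URL = re.compile(r"(http|https)://(([a-z.-]+)(\.[a-z]{2,3}))")

samples = [
    "http://shakthydoss.com",
    "https://support.company.com",
    "http://172.16.7.41/home/",
]

matched = {}
for s in samples:
    m = HOST_URL.match(s)
    matched[s] = m.group(0) if m else None
    print(s, "->", matched[s])
```

The two domain-name URLs match in full, while the URL with a literal IP address yields no match at all.
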


As you can see, the regex rule failed to match the literal IP address. This is because we have not told the regex engine anything about IP addresses.
So we have to write a simple regex rule for matching an IP address, which has 4 segments separated by dots, where each segment is a number of at most 3 digits.

([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})
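
One thing worth keeping in mind is that this rule checks only the shape of the address, not whether each segment lies between 0 and 255. The sketch below shows that, and pairs the rule with Python's standard `ipaddress` module as one possible stricter check; the helper `is_real_ipv4` is a hypothetical name of mine, and the pattern is written with `\d` and escaped dots.

```python
import ipaddress
import re

IP = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")

def is_real_ipv4(candidate):
    """Return True only if Python's ipaddress module accepts it as IPv4."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False

found = IP.search("http://172.16.7.41/home/").group(1)
print(found, is_real_ipv4(found))

# The regex accepts this string because it only looks at digit counts.
shape_only = IP.fullmatch("999.999.999.999") is not None
print(shape_only, is_real_ipv4("999.999.999.999"))
```
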


 

As you can see, the regex rule failed to match an IP address that comes with a port number. Let us refine our regex rule to match the port number. The constraint to keep in mind is that the port may or may not be present with the IP address. However, if it is present, I expect it to have a minimum of 2 and a maximum of 5 digits.

(([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})(:[\d]{2,5})?)

As you may have noticed, I used the repetition metacharacter ‘?’ to indicate that the preceding item (the port number) may occur zero or one time.
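
Here is a small Python sketch of the optional port group in action. The port values `8080` and `8` are made up for illustration, and the pattern is written with `\d` and escaped dots.

```python
import re

# Group 2 is the IP address, group 3 the optional ":port" part.
IP_PORT = re.compile(r"((\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(:\d{2,5})?)")

ports = {}
for s in ["172.16.7.41", "172.16.7.41:8080", "172.16.7.41:8"]:
    m = IP_PORT.match(s)
    ports[s] = m.group(3)
    print(s, "-> ip:", m.group(2), "port:", m.group(3))
```

Without a port, group 3 is simply `None`. The single-digit port `:8` is also left unmatched, because the rule asks for at least 2 digits.
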
Now we have to combine both rules, so that we match either a host name or an IP address that may optionally come with a port number.
After combining, the regex rule will look something like this:

((http|https)://)(([a-z-.]+)(\.[a-z]{2,3}))?(([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})(:[\d]{2,5})?)?
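
The combined rule can be tried out as in the sketch below (Python, with escaped dots and `\d` assumed; the `:8080` ports are illustrative). It also exposes a limitation of this particular combination: the port group belongs to the IP branch only, so a port after a domain name is left out of the match.

```python
import re

HOST_OR_IP = re.compile(
    r"((http|https)://)"                                    # scheme and ://
    r"(([a-z.-]+)(\.[a-z]{2,3}))?"                          # optional host name
    r"((\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(:\d{2,5})?)?"   # optional IP and port
)

by_name = HOST_OR_IP.match("http://shakthydoss.com").group(0)
by_ip = HOST_OR_IP.match("http://172.16.7.41:8080/home").group(0)
name_with_port = HOST_OR_IP.match("http://support.company.com:8080").group(0)

print(by_name)
print(by_ip)
print(name_with_port)  # the port after a domain name is not captured
```
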


 

Regex rule for matching the full path of the resource 
Here the intention of the regular expression is to match the page name or directory that the URL may point to.
So the regex rule would look like this:

((http|https)://)(([a-z-.]+)(\.[a-z]{2,3}))?(([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})(:[\d]{2,5})?)?(/\w{1,}(\.\w{2,5})?)*(/)?
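
To look at the path part in isolation, the sketch below tests just the last piece of the rule in Python, assuming `\w` and an escaped dot. The sample paths other than `/home/` are made up for illustration.

```python
import re

# Zero or more "/segment" parts, each with an optional ".ext", then an optional "/".
PATH = re.compile(r"(/\w{1,}(\.\w{2,5})?)*(/)?")

path_ok = {p: PATH.fullmatch(p) is not None
           for p in ["/home/", "/docs/index.html", "/my-page"]}
print(path_ok)
```

The last path is rejected as a whole because `\w` covers letters, digits and the underscore, but not the hyphen.
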

Now it is time to define a rule for matching the query parameters that may come along with URLs.

((http|https)://)(([a-z-.]+)(\.[a-z]{2,3}))?(([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})(:[\d]{2,5})?)?(/\w{1,}(\.\w{2,5})?)*(/)?(\?\w{1,}=\w{1,}(&\w{1,}=\w{1,})*)?
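
The query part can be checked on its own in the same way. In this Python sketch I assume the leading question mark is escaped (`\?`) so that it is read as a literal character; the second query string is a made-up example of percent-encoding.

```python
import re

# "?key=value" followed by zero or more "&key=value" pairs.
QUERY = re.compile(r"(\?\w{1,}=\w{1,}(&\w{1,}=\w{1,})*)?")

plain = QUERY.fullmatch("?name=shakthydoss&year=2013") is not None
encoded = QUERY.fullmatch("?q=hello%20world") is not None
print(plain, encoded)
```

The percent-encoded value is not accepted, since `%` is not a `\w` character.
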


 

Result
At the end of the day, I combined all the rules we have written so far for the individual units of a URL to match the complete URL.
The results were quite impressive at identifying the URLs hidden in free running text. However, while evaluating the results I found that the above regex rule fails to match encoded URLs.
So if you can write a rule to match encoded URLs, let me know in the comment section.
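
For readers who want to reproduce the result, here is a minimal Python sketch that runs the complete rule over a piece of free text. The sentence in `text` is made up around the sample URLs from the top of this post, the pattern assumes escaped dots, `\d`, `\w` and `\?`, and the encoded URL at the end is a hypothetical example.

```python
import re

URL = re.compile(
    r"((http|https)://)"                                    # scheme and ://
    r"(([a-z.-]+)(\.[a-z]{2,3}))?"                          # host name
    r"((\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})(:\d{2,5})?)?"   # IP address and port
    r"(/\w{1,}(\.\w{2,5})?)*(/)?"                           # path
    r"(\?\w{1,}=\w{1,}(&\w{1,}=\w{1,})*)?"                  # query parameters
)

text = (
    "Visit http://shakthydoss.com or https://support.company.com , "
    "the intranet at http://172.16.7.41/home/ , and "
    "http://172.6.7.41/home?name=shakthydoss&year=2013 for details."
)

# group(0) is the whole matched URL; findall would return the inner groups instead.
urls = [m.group(0) for m in URL.finditer(text)]
for u in urls:
    print(u)

# A percent-encoded URL is cut short at the first "%".
truncated = URL.search("http://shakthydoss.com/my%20page").group(0)
print(truncated)
```
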

4 Comments

  1. Mahesh Chinnaswamy wrote
    at 4:47 PM - 31st August 2013 Permalink

    Hi shakthy

    I used the above regex rule you mentioned for an FTP site…
    but unfortunately it fails to match….
    Can you help me match sites other than http?

  2. shakthydoss wrote
    at 4:49 PM - 31st August 2013 Permalink

    Hi Mahesh ,
    The above regex rule is only meant for http and https URLs. Using it you can’t match FTP URLs.

    Please write to me with the details of the URLs you would like to match other than http(s).
    Let me try to come up with a rule for you.

  3. shakthydoss wrote
    at 5:14 PM - 31st August 2013 Permalink

    Hi Mahesh ..

    The above regex rule is written with the intention of matching http and https URLs. Using it you can't match FTP URLs.
    Please write to me with the details of the URLs you would like to match other than http(s).
    Let me try to come up with a rule for you.

  4. Mahesh Chinnaswamy wrote
    at 5:19 PM - 31st August 2013 Permalink

    Thanks for your prompt response shakthyd

    I have mailed the details.
