Word Cloud on Thirukural

Building a word cloud seemed scary until I learned that I could do it myself with a few statistical packages available in R.

For this expedition I decided to build a word cloud on Thirukural. Thirukural is one of the finest masterpieces of Tamil literature and is believed to have been written during the Tamil Sangam period. It consists of 133 chapters, each containing 10 couplets, which gives 133 * 10 = 1330 couplets in total. Depending on their message or meaning, the couplets are categorized into three groups, namely Aram, Porul and Inbam. Let me stop giving facts about Thirukural here and jump to the actual purpose of this post.

[Image: wordcloud]

This post explains the process I followed to build a word cloud on Thirukural. All source code and the corpus used in this exercise can be found on my GitHub account.

Steps in building the word cloud on Thirukural:

  1. Document preparation.

  2. Building Term Frequency matrix.

  3. Constructing the word cloud.

Document preparation.

I quickly jumped on the Internet to find archives (text, .pdf) of the Thirukural couplets in Tamil only. To my surprise I couldn't find a single corpus that suited my needs. After surfing the Internet for a few minutes I decided to crawl some web pages and create the documents I was interested in.

Crawling a web page is a trivial job only if the DOM structure is neatly arranged and the target text is reachable through HTML or CSS selectors. I picked the http://www.gokulnath.com/thirukurals/1 website to crawl for two major reasons.

1. URL pattern

2. HTML selectors

URL pattern:

The URL pattern of gokulnath.com is very convenient for a web crawler, i.e. I can loop over the URL pattern and extract the couplets from each chapter just by incrementing the number in the last segment of the URL.

http://www.gokulnath.com/thirukurals/1 points to chapter 1
http://www.gokulnath.com/thirukurals/2 points to chapter 2
http://www.gokulnath.com/thirukurals/133 points to chapter 133
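The loop over the URL pattern can be sketched in a few lines of base R. This is only an illustration of the idea, not the crawler I actually wrote; the variable names base_url and chapter_urls are made up for this example.

```r
# Template for the chapter pages; %d is replaced by the chapter number.
base_url <- "http://www.gokulnath.com/thirukurals/%d"

# sprintf() is vectorised, so this builds all 133 chapter URLs in one call.
chapter_urls <- sprintf(base_url, 1:133)

# Look at the first and last URL to check the pattern.
print(chapter_urls[c(1, 133)])
```

A crawler then only needs to visit each element of chapter_urls in turn.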

HTML selectors:

After reaching a chapter page it is time to extract the couplets inside the chapter. The HTML selectors of gokulnath.com are not as great as I had hoped, but they are reasonably better than those of the other sites I had looked at so far. Here the couplets are placed inside a table along with other information that I am not looking for, so I had to make my crawl rules focus on the target text and extract only what I want.

After analyzing the DOM structure I wrote a simple web crawler that iterates over the URL pattern and extracts the couplets from each chapter. After extraction, I programmed the crawler to write all the couplets from the same chapter to a separate file. Thus at the end of the crawler's run I had 133 files on my machine, each containing 10 couplets.
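The one-file-per-chapter part of the crawler could look roughly like the base R sketch below. The extraction step depends on the page's table layout, so it is replaced here by a hypothetical stub, fetch_couplets(), that just returns 10 placeholder lines; write_corpus() and the file naming are also assumptions made for this example.

```r
# Hypothetical stand-in for the real extraction step: the actual crawler
# would download `url` and pull the 10 couplets out of the page's table.
fetch_couplets <- function(url) {
  sprintf("couplet %d from %s", 1:10, url)
}

# Write the couplets of each chapter to its own UTF-8 text file.
write_corpus <- function(urls, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (i in seq_along(urls)) {
    couplets <- fetch_couplets(urls[i])
    path <- file.path(out_dir, sprintf("chapter_%03d.txt", i))
    con <- file(path, open = "w", encoding = "UTF-8")
    writeLines(couplets, con)  # one couplet per line
    close(con)
  }
  invisible(list.files(out_dir))
}

urls <- sprintf("http://www.gokulnath.com/thirukurals/%d", 1:133)
corpus_dir <- file.path(tempdir(), "thirukural-corpus")
files <- write_corpus(urls, corpus_dir)
print(length(files))
```

Writing through an explicit UTF-8 connection matters here because the couplets are Tamil text.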

It took me 11 minutes to write the crawler.

It took the crawler less than 1 minute to finish the entire crawl job.

[Image: thirukural-corpus]

Term Frequency Document.

The next step in this expedition is to build a Term Frequency Document (TFD) matrix from the corpus generated by the web crawler. Here rows correspond to terms and columns correspond to documents, and each cell holds the frequency of that term in the corresponding document (column). The TFD matrix thus lets us sum each row, which gives the total frequency of a particular term across the whole corpus.

A more detailed explanation of the TFD matrix can be found in my other post.

In this case, after building the TFD matrix we have 133 columns and 6566 rows. We then sum each row to find the total frequency of each term across the entire corpus, and these totals go as input to the word cloud function.
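To make the row-sum idea concrete, here is a small base R sketch on a toy corpus. The three "documents" are made-up strings, not real couplets, and in the real run the matrix comes from the "tm" package rather than from this hand-rolled code.

```r
# Toy corpus: three tiny documents (made-up strings, not real couplets).
docs <- c(D1 = "a b a", D2 = "b c", D3 = "a c c")

# Split each document into terms on whitespace.
tokens <- strsplit(docs, "\\s+")
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are term counts.
tfd <- vapply(
  tokens,
  function(tk) tabulate(match(tk, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(tfd) <- terms
print(tfd)

# Total frequency of each term across the whole corpus.
term_freq <- rowSums(tfd)
print(term_freq)
```

The named vector term_freq plays the role of the totals that are later handed to the word cloud function.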

Sample TFD matrix.

        D1  D2          D113
அகர     1   0   0   0   1
ஆதி     1   0   1   0   0
..      0   0   0   0   0
        1   0   1   0   0
6566    0   0   1   1   0

Constructing the word cloud.

As I said in the introduction, building a word cloud seemed scary until I learned that I could do it myself with the statistical packages provided in R. Here we will use the packages “tm” and “wordcloud”. The “tm” package is used for text mining activities such as cleaning up the text data; typically it is used to strip white space, stem the text and build the TFD matrix. Since my corpus is in Tamil I will not worry about stemming. The “wordcloud” package is the presentation layer and is used to draw the word cloud charts. Using the “wordcloud” parameters I took the liberty of customizing the colors and the number of words drawn on the graph: only words with a frequency of at least 3 are drawn, and at most 133 words can appear in the cloud.

library(wordcloud)
wordcloud(lords, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

Important parameters in the above code snippet:

scale controls the difference between the largest and smallest font sizes.

max.words limits the number of words in the cloud.

rot.per controls the proportion of words drawn vertically (rotated by 90 degrees).
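The effect of a minimum frequency together with a word limit can be sketched in base R without drawing anything. The frequencies below are made up, and select_terms() is a hypothetical helper that only roughly mirrors what the min.freq and max.words arguments of wordcloud() do; it is not the package's own code.

```r
# Made-up term frequencies standing in for the row sums of the TFD matrix.
term_freq <- c(alpha = 7, beta = 3, gamma = 2, delta = 5, epsilon = 1)

# Keep terms that occur at least min_freq times, most frequent first,
# and cap the result at max_words terms.
select_terms <- function(freq, min_freq = 3, max_words = 133) {
  kept <- sort(freq[freq >= min_freq], decreasing = TRUE)
  head(kept, max_words)
}

print(select_terms(term_freq))                 # terms with frequency >= 3
print(select_terms(term_freq, max_words = 2))  # only the two most frequent
```

Terms that fall below the minimum frequency, or beyond the word limit, simply never reach the drawing step.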

That's it, I have created the word cloud. All source code and the corpus used in this exercise can be found on my GitHub account.

[Image: thiruvalluvar]

I searched the Internet a lot and did not find a word cloud for Thirukural. If that is true, I think I would be the world's first person to create the world's first word cloud on Thirukural! Isn't that cool? Let me know your thoughts in the comments.

 


5 Comments

  1. சாந்தியப்பன் சுதர்சன் wrote
    at 9:37 PM - 23rd February 2014 Permalink

    Very nice effort Shakthy! Keep doing well.

  2. சாந்தியப்பன் சுதர்சன் wrote
    at 4:07 PM - 23rd February 2014 Permalink

    Very nice effort Shakthy! Keep doing well.

  3. magesh wrote
    at 2:59 PM - 3rd November 2014 Permalink

    sakthi u r wordcloud on tirukural is very interesting and nice plz this my pg project plz guide me……………….

  4. shakthydoss wrote
    at 3:02 PM - 3rd November 2014 Permalink

    Let me know how can I help you.

  5. Anbarasu Ramachandran wrote
    at 6:10 AM - 11th October 2015 Permalink

    Hi Sakthi…I tried to create the word cloud using your code. Could you pls help me, why I get this kind of a word cloud? Am using R studio and R version 3.1.2.
