Preface
In Latent Semantic Analysis – Part 1, I covered the procedure for building the Term Document Matrix (TDM), which is a prerequisite for building the LSI model. Now let us see how this TDM is supplied to SVD to obtain the U, S, and V matrices.
Objective
To build a Latent Semantic Analysis (LSA) model to find statistical synonyms of a word (Term-Term Similarity) from a huge corpus.
Observation and legends:
Once we have built our TDM, we call upon a powerful but little-known technique called Singular Value Decomposition (SVD) to analyze the matrix for us. SVD is useful because it produces the best possible reconstruction of the matrix with the least possible information: it throws out noise, which does not help, and emphasizes strong patterns and trends, which do. The trick in using SVD is figuring out how many dimensions or "concepts" to keep when approximating the matrix. In our case, once we have supplied the TDM to SVD we end up with three matrices named U, S, and V.
The U matrix corresponds to the term vector coordinates, and the V matrix corresponds to the document vector coordinates. Since our objective is to find Term-Term Similarity, we will concentrate on the U matrix. As I already mentioned, I built the TDM in Java, so it is convenient to carry out the remaining work in Java as well.
SVD is not included in the basic JDK; an external library is needed to compute it. JAMA is a linear algebra package for Java, and with it we can easily compute the SVD.
But before computing the SVD, let us get the query term whose statistical synonyms (similar terms) have to be found. Each token in the query has to be passed through the stop-word list and the stemming function. Once that is done, find the known tokens (those already present in the count matrix), sum their corresponding coordinates, and apply the weighting function to the result.
In my experiment, after computing the query vector Q I kept its coordinates at the last index of my TDM; this helps me find the cosine relation between the last index and the remaining indexes.
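A minimal sketch of this query step, assuming the known tokens' rows of the count matrix are summed and then re-weighted, could look like the following. The names countMatrix, termIndex, isStopWord, stem, and applyWeight are placeholders for the stop-word, stemming, and weighting routines from Part 1, not the actual code shipped with this post.
<code>
import java.util.Map;

// Hypothetical sketch of the query step described above. countMatrix, termIndex,
// isStopWord, stem and applyWeight are assumed names standing in for the Part 1 routines.
double[] buildQueryRow(String query, double[][] countMatrix, Map<String, Integer> termIndex) {
    int docCount = countMatrix[0].length;
    double[] q = new double[docCount];
    for (String token : query.toLowerCase().split("\\s+")) {
        if (isStopWord(token)) continue;            // skip stop words
        Integer row = termIndex.get(stem(token));   // only tokens already present in the count matrix
        if (row == null) continue;
        for (int d = 0; d < docCount; d++) {
            q[d] += countMatrix[row][d];            // sum their corresponding coordinates
        }
    }
    return applyWeight(q);                          // apply the same weighting used for the TDM rows
}
// The returned row is then appended as the last row of the weighted TDM.
</code>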
With the help of the JAMA library, I passed the weighted TDM into SVD and then retrieved the U and S matrices.
<code>
import Jama.Matrix;
import Jama.SingularValueDecomposition;

// Decompose the weighted term-document matrix: A = U * S * V'
Matrix A = new Matrix(Weighted_TDM);
SingularValueDecomposition svd = A.svd();
Matrix U = svd.getU();   // left singular vectors (term coordinates)
Matrix S = svd.getS();   // diagonal matrix of singular values
</code>
After getting the U and S matrices, we multiply them; with the resulting matrix (say, the T matrix) we can find Term-Term Similarity using the cosine relation:
cos A = (V · W) / (||V|| ||W||)
For example, to find the angle between two vectors, say

v = 2i + 3j + k

and

w = 4i + j + 2k

we compute:

v · w = 8 + 3 + 2 = 13
||v|| = √(4 + 9 + 1) = √14
||w|| = √(16 + 1 + 4) = √21

Hence

cos A = 13 / (√14 · √21) ≈ 0.758, which gives A ≈ 40.7°.
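For readers who prefer code, the same computation can be written as a small helper; the method name rowCosineAngle is my own illustrative choice, not part of the downloadable code.
<code>
// Angle (in degrees) between two vectors, from cos A = (V . W) / (||V|| ||W||).
static double rowCosineAngle(double[] v, double[] w) {
    double dot = 0, normV = 0, normW = 0;
    for (int i = 0; i < v.length; i++) {
        dot   += v[i] * w[i];
        normV += v[i] * v[i];
        normW += w[i] * w[i];
    }
    double cosA = dot / (Math.sqrt(normV) * Math.sqrt(normW));
    return Math.toDegrees(Math.acos(cosA));
}
</code>
With v = (2, 3, 1) and w = (4, 1, 2), this returns roughly 40.7°, matching the hand computation above.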
If the angle is between 0 and 90 degrees, then there exists a relation (some similarity) between the two vectors. In the same way we compute the cosine relation between the query coordinates and every other coordinate: the smaller the angle, the more similar the terms.
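Putting the pieces together with JAMA, a sketch of the final comparison might look like this; rowCosineAngle is the helper sketched above, and keeping the query as the last row of T follows the setup described earlier.
<code>
// Term coordinates: rows of T = U * S are term vectors; the last row is the query.
// (Optionally, one could keep only the first k columns of U and S to work with k "concepts".)
Matrix T = U.times(S);
double[][] t = T.getArray();
int queryRow = t.length - 1;
for (int i = 0; i < queryRow; i++) {
    double angle = rowCosineAngle(t[queryRow], t[i]);
    // collect (term i, angle) pairs and sort ascending:
    // the smaller the angle, the more similar the term is to the query.
}
</code>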
It was tedious for me to work from the command prompt this time, so I decided to build the LSI model with GUI components.
Steps for execution:
1. Enter the corpus directory in the specified field.
2. Click the Extract Keywords button.
3. Enter the query keywords in the specified field.
4. Click the Compute Weighted TDM button.
5. Click the Process Query button.
6. Click the Compute SVD from Weighted TDM button.
The result of each step is displayed separately in its own tab.
Once the corpus directory is set and the Extract Keywords button is clicked, the keywords from the corpus (those that are not stop words, after stemming) are displayed in the List of words in corpus tab.
When the Compute Weighted TDM button is clicked, followed by the Process Query button (after entering the query in the specified field), the count matrix, the weighted TDM, and the sum of the query vector coordinates are shown separately in the Count matrix, Weighted TDM, and Query Vector Coordinates tabs.
The term similarity with the corresponding cosine angle can be seen after clicking Compute SVD from Weighted TDM.
Test Case
To test the model, different types of input (query terms) were given to find their statistically similar terms. The tabulation below shows some of the test-case results.
Query: networking company
Similar terms and their corresponding angles:
expert      cosine A = 36.808665072949225
router      cosine A = 40.57505845078519
activity    cosine A = 43.89788624801361
cisco       cosine A = 46.739184943348484
backbone    cosine A = 48.831127830382684
bandwidth   cosine A = 48.83112783038286
...

Query: corporate in India
Similar terms and their corresponding angles:
corporate   cosine A = 0.0
mba/pgdm    cosine A = 39.23152048359223
deliver     cosine A = 50.76847951640772
exchange    cosine A = 50.76847951640775
practice    cosine A = 56.78908923910091
industrial  cosine A = 58.909069642326905
potential   cosine A = 58.90906964232696
opportuniti cosine A = 58.90906964232697
organiza    cosine A = 59.529640534020224
...
Result
To my understanding, construction of the TDM plays an important role in the LSI model. Even though we removed stop words and inflections, the accuracy increased only after manually removing noise words that are neither stop words nor inflections. In my experiment I encountered noise words such as numbers, page numbers, dates, and some symbol notations. It was also observed that the number of keywords is smaller when the corpus belongs to a single domain; if the corpus spans different domains, the list of keywords increases tremendously.
Thus we have successfully found statistically similar terms for the given query.
Download : LSI Model