Latent Semantic Analysis – Part 2


In Latent Semantic Analysis – Part 1 I covered the procedure for building the Term Document Matrix (TDM), which is a prerequisite for building an LSI model. Now let's see how this TDM is supplied to SVD to obtain the U, S, and V matrices.


The goal is to build a Latent Semantic Analysis (LSA) model that finds statistical synonyms of a word (Term-Term Similarity) from a huge corpus.


Observation and legends:

Once we have built our TDM, we call upon a powerful but little-known technique called Singular Value Decomposition (SVD) to analyze the matrix for us. SVD is useful because it makes the best possible reconstruction of the matrix with the least possible information. To do this, it throws out noise, which does not help, and emphasizes strong patterns and trends, which do help. The trick in using SVD is figuring out how many dimensions or “concepts” to use when approximating the matrix. In our case, once we have supplied the TDM to SVD, we end up with three matrices named U, S, and V.
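In standard notation, and assuming the weighted TDM is an m x n matrix A (m terms, n documents), the decomposition and its approximation with k "concepts" can be written as below. The symbols A_k, sigma_i, u_i and v_i are introduced here for illustration and do not appear in the original experiment.

```latex
A = U S V^{T}, \qquad
A_k = U_k S_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0
```

Here sigma_i are the singular values on the diagonal of S, and u_i, v_i are the i-th columns of U and V. Among all matrices of rank at most k, A_k is the closest to A in the least-squares sense, which is what "best possible reconstruction with the least possible information" means.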

The U matrix generally corresponds to the term vector coordinates, and the V matrix corresponds to the document vector coordinates. Since our objective is to find Term-Term Similarity, we will concentrate on the U matrix. As I already mentioned, I started building the TDM in Java, so it is convenient to do the remaining work in Java as well.

SVD is not included in the basic JDK; a separate library is needed to compute it. JAMA is a linear algebra package for Java, and with this library we can easily compute the SVD.

But before computing the SVD, let's get the query terms for which statistical synonyms or similar terms have to be found. Each token in the query has to be passed through the stop word list and the stemming function. Once this is done, find the known tokens (those already present in the count matrix), sum their corresponding coordinates, and then apply the weighting function to the result.
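The summing step could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the count matrix is assumed to hold one row per term and one column per document, and the tokens are assumed to be already stop-word filtered and stemmed. The weighting function comes from Part 1 and is not reproduced here.

```java
import java.util.List;
import java.util.Map;

final class QueryVector {
    /**
     * Sums the count-matrix rows of every query token that is a known term.
     * Assumed layout: counts[i][j] = occurrences of term i in document j.
     * Tokens missing from termIndex are skipped.
     */
    static double[] sumKnownTermRows(List<String> queryTokens,
                                     Map<String, Integer> termIndex,
                                     double[][] counts) {
        int numDocs = counts.length == 0 ? 0 : counts[0].length;
        double[] q = new double[numDocs];
        for (String token : queryTokens) {
            Integer row = termIndex.get(token);
            if (row == null) {
                continue; // unknown token: not present in the count matrix
            }
            for (int j = 0; j < numDocs; j++) {
                q[j] += counts[row][j];
            }
        }
        return q; // still unweighted; the weighting function is applied afterwards
    }
}
```

The returned array has one entry per document, so it has the same shape as any term row of the count matrix.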

In my experiment, after computing the query vector Q, I placed the Q coordinates at the last index of my TDM. This helps me find the cosine relation of the last index with the remaining indexes.
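Placing Q at the last index might look like the sketch below. It assumes "last index" means an extra last row of a term-by-document array; the class and method names are hypothetical.

```java
final class AppendQueryRow {
    /** Returns a copy of tdm with the query vector q added as the last row. */
    static double[][] append(double[][] tdm, double[] q) {
        if (tdm.length > 0 && tdm[0].length != q.length) {
            throw new IllegalArgumentException("query length must equal the number of documents");
        }
        double[][] out = new double[tdm.length + 1][];
        for (int i = 0; i < tdm.length; i++) {
            out[i] = tdm[i].clone(); // copy so the original matrix is left untouched
        }
        out[tdm.length] = q.clone();
        return out;
    }
}
```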

With the help of the JAMA library, I passed the TDM into SVD and then retrieved the U and S matrices.


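import Jama.Matrix;
import Jama.SingularValueDecomposition;

// Weighted_TDM is the double[][] weighted TDM, with the query Q at its last index
Matrix A = new Matrix(Weighted_TDM);
SingularValueDecomposition svd = A.svd();
Matrix U = svd.getU(); // term vector coordinates
Matrix S = svd.getS(); // diagonal matrix of singular values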


After getting the U and S matrices, we have to multiply U by S. With the resultant matrix (say T), we can find Term-Term Similarity using the cosine relation.
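Because S is diagonal, the product U * S simply scales column j of U by the j-th singular value. The sketch below shows this on plain arrays, without JAMA. The class name and the parameter k are assumptions for illustration: k lets you keep only the first k "concepts", and passing k = s.length gives the full product. With JAMA, the two inputs could presumably be obtained from U.getArray() and svd.getSingularValues().

```java
final class TermCoordinates {
    /**
     * Computes the first k columns of T = U * S for a diagonal S,
     * where s[j] is the j-th singular value.
     */
    static double[][] scale(double[][] u, double[] s, int k) {
        if (k < 1 || k > s.length) {
            throw new IllegalArgumentException("k must be between 1 and " + s.length);
        }
        double[][] t = new double[u.length][k];
        for (int i = 0; i < u.length; i++) {
            for (int j = 0; j < k; j++) {
                t[i][j] = u[i][j] * s[j]; // scale column j by its singular value
            }
        }
        return t;
    }
}
```

Each row of the result is then the coordinate vector of one term (or of the query, if it was appended to the TDM).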

cos A  =  (v . w) / (||v|| ||w||)

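For example, to find the angle between two vectors, say
       v  =  2i + 3j + k
       w  =  4i + j + 2k
we compute:

v . w  =  8 + 3 + 2 = 13
||v||  =  sqrt(4 + 9 + 1) = sqrt(14)
||w||  =  sqrt(16 + 1 + 4) = sqrt(21)
cos A  =  13 / (sqrt(14) sqrt(21)) ≈ 0.758, so A ≈ 40.7 degrees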

If the angle is between 0 and 90 degrees, there exists a relation (some similarity) between the two vectors. In the same way, we compute the cosine relation between the query coordinates and every other term's coordinates. The smaller the angle, the greater the similarity between the terms.
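The cosine relation can be sketched in a few lines of plain Java. The class and method names are hypothetical, and returning NaN for a zero-length vector is a choice made for this sketch, not something the original program is known to do.

```java
final class CosineAngle {
    /** Returns the angle in degrees between v and w, or NaN if either vector is all zeros. */
    static double angleDegrees(double[] v, double[] w) {
        if (v.length != w.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normV2 = 0, normW2 = 0;
        for (int i = 0; i < v.length; i++) {
            dot += v[i] * w[i];
            normV2 += v[i] * v[i];
            normW2 += w[i] * w[i];
        }
        if (normV2 == 0 || normW2 == 0) {
            return Double.NaN; // the angle is undefined for a zero vector
        }
        double cos = dot / Math.sqrt(normV2 * normW2);
        cos = Math.max(-1.0, Math.min(1.0, cos)); // guard against rounding just outside [-1, 1]
        return Math.toDegrees(Math.acos(cos));
    }

    public static void main(String[] args) {
        double[] v = {2, 3, 1};
        double[] w = {4, 1, 2};
        System.out.println(angleDegrees(v, w));
    }
}
```

For v = 2i + 3j + k and w = 4i + j + 2k this gives an angle of roughly 40.7 degrees, so by the rule above the two vectors are related.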

It was tough for me to work from the command prompt this time, so I decided to build the LSI model with GUI components.

Steps for execution.

1. Enter the corpus directory in the specified field.

2. Click Extract Keywords button

3. Enter the query keywords in the specified field.

4. Click Compute Weighted TDM button

5. Click Process Query button

6. Click Compute SVD from Weighted TDM button

The result of each event is displayed separately in a different tab.

Once the corpus directory is set and the Extract Keywords button is clicked, the keywords from the corpus (stemmed, with stop words removed) are displayed in the List of words in corpus tab.


When the Compute Weighted TDM button is clicked, followed by the Process Query button after entering the query in the specified field, the count matrix, the weighted TDM, and the sum of the query vector coordinates are shown separately in the Count matrix tab, the Weighted TDM tab, and the Query Vector Coordinates tab.


Similar terms and their corresponding cosine angles can be seen after clicking Compute SVD from Weighted TDM.
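A rough sketch of how such a ranked list could be produced is given below. It assumes the term coordinates are held in a plain array t whose last row is the query, keeps only terms whose angle to the query is below 90 degrees, and sorts them with the smallest angle first. All names are hypothetical; this is not the code behind the GUI.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TermRanking {
    /** A term paired with its angle (in degrees) to the query row. */
    static final class Match {
        final String term;
        final double angle;
        Match(String term, double angle) { this.term = term; this.angle = angle; }
    }

    /** Angle in degrees between two vectors; NaN if either is all zeros. */
    static double angleDegrees(double[] v, double[] w) {
        double dot = 0, normV2 = 0, normW2 = 0;
        for (int i = 0; i < v.length; i++) {
            dot += v[i] * w[i];
            normV2 += v[i] * v[i];
            normW2 += w[i] * w[i];
        }
        if (normV2 == 0 || normW2 == 0) return Double.NaN;
        double cos = dot / Math.sqrt(normV2 * normW2);
        return Math.toDegrees(Math.acos(Math.max(-1.0, Math.min(1.0, cos))));
    }

    /** Ranks every term row of t against the last row (the query), smallest angle first. */
    static List<Match> rank(double[][] t, String[] terms) {
        int q = t.length - 1; // the query is assumed to be the last row
        List<Match> matches = new ArrayList<>();
        for (int i = 0; i < q; i++) {
            double angle = angleDegrees(t[i], t[q]);
            if (angle < 90.0) { // NaN also fails this comparison and is dropped
                matches.add(new Match(terms[i], angle));
            }
        }
        matches.sort(Comparator.comparingDouble((Match m) -> m.angle));
        return matches;
    }
}
```

A term whose coordinates point in exactly the same direction as the query comes out with an angle of 0.0 and is listed first.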

Test Case

To test the model, different types of input (query terms) were given to find their statistically similar terms. The tables here show some of the test case results.



Query: networking company

Similar terms and their corresponding angle (in degrees)

expert    angle A = 36.808665072949225
          angle A = 40.57505845078519
          angle A = 43.89788624801361
          angle A = 46.739184943348484
          angle A = 48.831127830382684
          angle A = 48.83112783038286




Query: corporate in India

Similar terms and their corresponding angle (in degrees)

corporate    angle A = 0.0
             angle A = 39.23152048359223
             angle A = 50.76847951640772
             angle A = 50.76847951640775
             angle A = 56.78908923910091
             angle A = 58.909069642326905
             angle A = 58.90906964232696
             angle A = 58.90906964232697
             angle A = 59.529640534020224



To my understanding, construction of the TDM plays an important role in the LSI model. Even though we removed stop words and inflections, the accuracy increased only after manually removing noise words that are neither stop words nor inflections. In my experiment I encountered noise words such as numbers, page numbers, dates, and some symbol notations. It was also observed that the number of keywords is smaller when the corpus belongs to a single domain; if the corpus spans different domains, the list of keywords increases tremendously.
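Part of this clean-up could be automated. The sketch below is only one possible approach and not what was done in the experiment, where the noise words were removed by hand: it treats any token without a single letter (plain numbers, numeric dates, symbols) as noise and also accepts a hand-made set of extra noise words. The class name, method names, and the rule itself are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class NoiseFilter {
    // Assumed rule: a useful keyword contains at least one letter.
    private static final Pattern LETTER = Pattern.compile("\\p{L}");

    /** True for tokens with no letters at all, e.g. "2010", "12/05/2010", "--". */
    static boolean hasNoLetters(String token) {
        return !LETTER.matcher(token).find();
    }

    /** Drops letter-free tokens and any token listed in manualNoise. */
    static List<String> filter(List<String> tokens, Set<String> manualNoise) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (hasNoLetters(token) || manualNoise.contains(token)) {
                continue; // treated as noise
            }
            kept.add(token);
        }
        return kept;
    }
}
```

Tokens that mix letters and digits are kept by this rule, so such cases would still need the manual list.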


Thus we have successfully found statistically similar terms for the input query.

Download : LSI Model


  1. Amu wrote
    at 9:05 AM - 29th September 2010 Permalink

    Why didnt post the coding for LSI part 2…

  2. Arthi... wrote
    at 4:34 PM - 18th November 2010 Permalink

    Hi dude…
    Ur work is awesome…
    Keep it up…
    All the best for everything….
    Let success follows u..
    Everything will be good for you.

  3. sudheer wrote
    at 6:31 PM - 28th February 2011 Permalink

    hi ur work is good . by the way about you..i mean where u live in….

  4. shakthydoss wrote
    at 7:39 AM - 1st March 2011 Permalink

    Thank you sudheer
    I live in Chennai , Tamil Nadu.

  5. santosh kumar jaiswal wrote
    at 5:43 PM - 19th April 2011 Permalink

    plz send latent semantic analysis project
    my e mail

  6. shakthydoss wrote
    at 3:49 AM - 20th April 2011 Permalink

    There is download link at the end of this post , take a look of that.
    thank you.

  7. priya wrote
    at 6:00 AM - 26th March 2012 Permalink

    Sakthy will u post coding of lsi part II

  8. santosh kumar jaiswal wrote
    at 5:51 PM - 21st April 2011 Permalink

    how can i got coding from jar files….

  9. sang wrote
    at 7:44 AM - 23rd August 2011 Permalink

    hi,,,sir….while executing LSI2 in the LSIMODEL am getting 0 matrix for weighted TDM….is any prob in coding part..pls reply ill help to my project work…thank u…

  10. sangeetha.s wrote
    at 7:48 AM - 23rd August 2011 Permalink

    hi sir ,while executing LSI Part-2 am getting some prob in LSI model…in weighted TDM matrix am getting matrix as 0….after executing also….y…pls reply soon,,,,is that any prob in coding part….

  11. sangeetha.s wrote
    at 7:51 AM - 23rd August 2011 Permalink

    part-1 if am getting 6 keywords,,part-2 am getting 7 keywords..y dis difference…it also changing query to keyword ah…pls reply …

  12. TV TUAN wrote
    at 9:48 AM - 21st September 2011 Permalink

    hi, how can you nomalize the tfidf of the query that has more than one words for each document ? what formular did u use ?

  13. priya wrote
    at 4:22 AM - 27th March 2012 Permalink

    hi sakhi send me the codings of latent semantic indexing part-II. it will be very helpful for my project. im doing clustering of web pages.

  14. deepankar wrote
    at 6:32 AM - 4th April 2012 Permalink

    Same technique that used for converting corpus into vector is used for converting query(This is also a kind of document) into Vector.

  15. Reenie Mahajan wrote
    at 2:59 PM - 6th May 2014 Permalink

    Thanks a lot for posting such good stuff regarding Latent semantic Analysis. I found it very very helpful in understanding the concept for my dissertation work. no where else I found such explanation which could make me clear how latent semantic analysis is used to find term term similarity. Thanks a lot..

  16. shakthydoss wrote
    at 3:10 PM - 6th May 2014 Permalink

    Thanks. Reenie
    I’m glad to know that.
