Skip to content Skip to sidebar Skip to footer

How To Plot A MDS From A Similarity Matrix?

I'm using a similarity matrix with values between 0 and 1 (1 means that the elements are equals), and I'm trying to plot a MDS with python and scikit-learn. I found multiple exampl

Solution 1:

I did a comparison with XLSTAT (an excel extension) in order to try a lot of scenarios and compare how to do what.

First : my input matrix was a "similarity" matrix, because I could interpret it as: "A and A are 100% equal". As MDS is taking a dissimilarity matrix as input, I must apply a transformation.

  1. In the litterature Ricco Rakotomalala's french course on data science (p 208-209), the easy way is to substract the maximum value to each cell (make a "1 - cell" operation). So you can easily make a python program, or (as I keep a trace of each matrix) an AWK pre-processing program :

similarity-to-dissimilarity-simple.awk

# We keep the tags around the CSV matrix
# X ; Word1 ; Word2 ; ...
# Header
NR == 1 {
    # First column is just "X" (or space)
    printf("%s", "X");

    # For each column, print the word
    for (i = 2; i <= NF; i++)
    {
    col = $i;
    printf("%s%s", OFS, col);
    }

    # End of line
    printf("\n");
}

# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
    # First column is the word/tag
    col = $1;
    printf("%s", col);

    # For each column, process the number
    for (i = 2; i <= NF; i++)
    {
    # dissimilarity = (1 - similarity)
    NUM = $i;
    VAL = 1 - NUM;
    printf("%s%s", OFS, VAL);
    }

    printf("\n");
}

It can be called using the command :

awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-simple.awk input.csv > output-simple.csv
  1. A more complex way of calculating (I can't find back the reference, sorry :( ) is based on another transformation on each cell :

(sii + si'i' - 2 * sii')^1/2

This method seems to be perfectly adapted if the diagonal does not contain the same value (I saw there a co-occurrence matrix... it should apply to his cas). In my case, as the diagonal is ALWAYS full of 1, I reduced it to :

(2 - 2 * sii')^1/2

The AWK program to make this transformation (I implemented the simplified one, because of my data) is therefore :

similarity-to-dissimilarity-complex.awk

# Header
# X ; Word1 ; Word2 ; ...
NR == 1 {
    # First column is just "X" (or space)
    printf("%s", "X");

    # For each column, print the word
    for (i = 2; i <= NF; i++)
    {
    col = $i;
    printf("%s%s", OFS, col);
    }

    # End of line
    printf("\n");
}

# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
    # First column is the word
    col = $1;
    printf("%s", col);

    # For each column, process the number
    for (i = 2; i <= NF; i++)
    {
    # dissimilarity = (2 - 2 * similarity)^-1/2
    NUM = $i;
    VAL = sqrt(2 - 2 * NUM);
    printf("%s%s", OFS, VAL);
    }

    printf("\n");
}

And you can call it with this command :

awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-complex.awk input.csv > output-complex.csv

When I used the Kruskal's stress to check which version was better... in my case, the simple similarity to dissimilarity (1 - cell) was the best (I kept a stress between 0,34 and 0,32... which is not good... where the complex shows bigger values than 0,34, which is worse).


Post a Comment for "How To Plot A MDS From A Similarity Matrix?"