How To Plot A MDS From A Similarity Matrix?
Solution 1:
I compared my results with XLSTAT (an Excel extension) so that I could try many scenarios and work out how to do each step.
First: my input matrix was a "similarity" matrix, because I could read it as "A and A are 100% equal". Since MDS takes a dissimilarity matrix as input, I had to apply a transformation.
- In the literature (Ricco Rakotomalala's French course on data science, p. 208-209), the easy way is to subtract each cell from the maximum value (here, a "1 - cell" operation). You can easily do this in a Python program or, as I keep a trace of each matrix, in an AWK pre-processing program:
similarity-to-dissimilarity-simple.awk
# We keep the tags around the CSV matrix
# X ; Word1 ; Word2 ; ...
# Header
NR == 1 {
    # First column is just "X" (or space)
    printf("%s", "X");
    # For each column, print the word
    for (i = 2; i <= NF; i++)
    {
        col = $i;
        printf("%s%s", OFS, col);
    }
    # End of line
    printf("\n");
}
# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
    # First column is the word/tag
    col = $1;
    printf("%s", col);
    # For each column, process the number
    for (i = 2; i <= NF; i++)
    {
        # dissimilarity = (1 - similarity)
        NUM = $i;
        VAL = 1 - NUM;
        printf("%s%s", OFS, VAL);
    }
    printf("\n");
}
It can be called using the command :
awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-simple.awk input.csv > output-simple.csv
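For those who prefer the Python route mentioned above, here is a minimal sketch of the same "1 - cell" transformation using only the standard library. The function names (`similarity_to_dissimilarity`, `convert_csv`) and the sample data are made up for illustration, and it assumes the same semicolon-separated layout with a header row and a label column.

```python
import csv
import io


def similarity_to_dissimilarity(rows):
    """Return a copy of a labelled matrix with every cell replaced by 1 - cell.

    `rows` is a list of lists: the first row is the header, and the first
    cell of each following row is the word/tag, which is kept unchanged.
    """
    header, *body = rows
    converted = [list(header)]
    for label, *cells in body:
        converted.append([label] + [1.0 - float(c) for c in cells])
    return converted


def convert_csv(text, delimiter=";"):
    """Parse delimited text, apply the transformation, return delimited text."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(similarity_to_dissimilarity(rows))
    return out.getvalue()


# Hypothetical 2 x 2 similarity matrix, in the same layout as input.csv
sample = "X;A;B\nA;1;0.25\nB;0.25;1\n"
result = convert_csv(sample)
print(result)
```

To process a real file, read its contents into a string and pass it to `convert_csv`, then write the returned text to the output file.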
- A more complex way of calculating it (I can't find the reference again, sorry) is based on another transformation applied to each cell:
dissimilarity(i, j) = sqrt(similarity(i, i) + similarity(j, j) - 2 * similarity(i, j))
This method seems well suited to cases where the diagonal does not contain the same value everywhere (I saw a co-occurrence matrix there... it should apply to his case). In my case, as the diagonal is ALWAYS full of 1, I reduced it to:
dissimilarity(i, j) = sqrt(2 - 2 * similarity(i, j))
The AWK program for this transformation (I implemented the simplified one, because of my data) is therefore:
similarity-to-dissimilarity-complex.awk
# Header
# X ; Word1 ; Word2 ; ...
NR == 1 {
    # First column is just "X" (or space)
    printf("%s", "X");
    # For each column, print the word
    for (i = 2; i <= NF; i++)
    {
        col = $i;
        printf("%s%s", OFS, col);
    }
    # End of line
    printf("\n");
}
# Other lines are processed
# WordN ; 1 ; 0.5 ; 0.2 ; ...
NR != 1 {
    # First column is the word
    col = $1;
    printf("%s", col);
    # For each column, process the number
    for (i = 2; i <= NF; i++)
    {
        # dissimilarity = (2 - 2 * similarity)^(1/2)
        NUM = $i;
        VAL = sqrt(2 - 2 * NUM);
        printf("%s%s", OFS, VAL);
    }
    printf("\n");
}
And you can call it with this command :
awk -F ";" -v OFS=";" -f similarity-to-dissimilarity-complex.awk input.csv > output-complex.csv
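As a rough Python counterpart, the sketch below applies the square-root transformation to a matrix held as a list of lists. It assumes the general form sqrt(s(i,i) + s(j,j) - 2 * s(i,j)), which gives the same result as the AWK script's sqrt(2 - 2 * similarity) when the diagonal is all ones. The function name and the sample matrix are hypothetical.

```python
import math


def sqrt_dissimilarity(sim):
    """Convert a square similarity matrix (list of lists) to dissimilarities.

    Uses d[i][j] = sqrt(s[i][i] + s[j][j] - 2 * s[i][j]); with a diagonal of
    ones this equals sqrt(2 - 2 * s[i][j]).
    """
    n = len(sim)
    dis = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Clamp at zero so rounding noise never reaches math.sqrt as a negative.
            radicand = max(0.0, sim[i][i] + sim[j][j] - 2.0 * sim[i][j])
            dis[i][j] = math.sqrt(radicand)
    return dis


# Hypothetical similarity matrix with a unit diagonal
similarity = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
dissimilarity = sqrt_dissimilarity(similarity)
for row in dissimilarity:
    print(row)
```

Because the diagonal terms are read from the matrix itself, the same function should also work for a matrix whose diagonal is not constant, such as a co-occurrence matrix.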
When I used Kruskal's stress to check which version was better, in my case the simple similarity-to-dissimilarity transformation (1 - cell) was the best: I got a stress between 0.32 and 0.34, which is not good, whereas the complex one gave values above 0.34, which is worse.
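If you want to reproduce this kind of check outside XLSTAT, here is a minimal sketch of Kruskal's stress-1 in Python. It assumes you already have the 2D coordinates produced by your MDS tool, and it uses the input dissimilarities directly as the target values (a metric-MDS assumption; a non-metric MDS would use fitted disparities instead, and XLSTAT's exact definition may differ). The function name and the sample points are made up for illustration.

```python
import math


def kruskal_stress1(dissimilarity, coords):
    """Stress-1 between target dissimilarities and distances in an embedding.

    stress = sqrt( sum (d_ij - delta_ij)^2 / sum d_ij^2 ) over pairs i < j,
    where d_ij is the Euclidean distance between embedded points i and j,
    and delta_ij is the input dissimilarity.
    """
    num = 0.0
    den = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            num += (d - dissimilarity[i][j]) ** 2
            den += d ** 2
    return math.sqrt(num / den)


# Hypothetical embedding: three points on a line
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

# Dissimilarities that the embedding reproduces exactly
exact = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# Dissimilarities that cannot all be matched by these points
distorted = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

perfect_stress = kruskal_stress1(exact, points)
distorted_stress = kruskal_stress1(distorted, points)
print(perfect_stress, distorted_stress)
```

A value of 0 means the embedding reproduces the dissimilarities exactly; the larger the value, the worse the fit.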