| Size: 3175 Comment:  |  ← Revision 6 as of 2009-07-07 00:47:58  ⇥ Size: 4178 Comment: converted to 1.6 markup | 
| Deletions are marked like this. | Additions are marked like this. | 
| Line 10: | Line 10: | 
| Just a few suggestions according to my own experience. For a more formal and detailed treatment go to Wikipedia or a maths manual. | Just a few suggestions according to my own experience. For a more formal and detailed treatment go to Wikipedia or a maths manual.<<BR>> For a very rich directory on string similarity go to [[http://www.dcs.shef.ac.uk/~sam/stringmetrics.html|this link]] | 
| Line 14: | Line 15: | 
| * '''Euclidean distance''', which is D = √ ((x,,1,, - x,,2,,)^2^ + (y1,,1,, - y,,2,,)^2^ + (z,,1,, - z,,2,,) + (...)^2^ + ...)[[BR]] it is the geometric distance you are used to calculate in 2 dimensions, just extended over N-dimensions[[BR]] | * '''Euclidean distance''', which is D = √ ((x,,1,, - x,,2,,)^2^ + (y1,,1,, - y,,2,,)^2^ + (z,,1,, - z,,2,,) + (...)^2^ + ...)<<BR>> it is the geometric distance you are used to calculate in 2 dimensions, just extended over N-dimensions<<BR>> | 
| Line 19: | Line 20: | 
| * '''Correlation-based''' [[BR]] this distance is ideal when you want to group objects with inter-dependent behavioral trend [[BR]] for inst, for D = 1 - Correlation Index => similar = whenever A goes up, always B goes up as well; dissimilar = whenever A goes up, B always goes down[[BR]] whereas, for D = 1 - Abs (Correlation Index) => similar = whenever A goes up, always B goes up as well, AND whenever A goes up, B always goes down[[BR]] | * '''Correlation-based''' <<BR>> this distance is ideal when you want to group objects with inter-dependent behavioral trend <<BR>> for inst, for D = 1 - Correlation Index => similar = whenever A goes up, always B goes up as well; dissimilar = whenever A goes up, B always goes down<<BR>> whereas, for D = 1 - Abs (Correlation Index) => similar = whenever A goes up, always B goes up as well, AND whenever A goes up, B always goes down<<BR>> | 
| Line 26: | Line 27: | 
| * '''Hamming''' = number of discordant array elements (e.g. 1010 vs 1100 => D = 2)[[BR]] this distance was developed to measure transmission errors in binary coded strings[[BR]] * '''Jaccard-based''' = 1 - (number of concordant 1s in the two arrays)/(number of positions with 1 in either of the two arrays)[[BR]] (e.g. 1010 vs 1100 => D = 1 - 1/3 = 2/3)[[BR]] this distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated to the array[[BR]] (e.g. given a list of Transcription Factors, you can associate to every gene a binary array to express whether they are regulated by each TF or not)[[BR]] ''Hamming distance is not good in my opinion when the strings compared have a very unequal 1/0 content, and the meaning of 1s and 0s is related to set-membership (as in the example above).''[[BR]] ~-''E.g., consider these strings, and the choice made by Hamming and Jaccard:''[[BR]] * 11 000 000[[BR]]-~ * ~-is more similar to 10 000 001 according to Hamming (D,,H,, = 2, D,,J,, = 2/3)-~ * ~-is more similar to 10 000 001 according to Jaccard (D,,H,, = 3, D,,J,, = 3/5)-~ * ~-110 100 000-~ * ~-is more similar to 011 000 000 according to Hamming (D,,H,, = 3, D,,J,, = 3/4)-~ * ~-is more similar to 110 111 101 according to Jaccard (D,,H,, = 4, D,,J,, = 4/7)-~ ''This is particularly important as some computational guys would probably go for Hamming as the first choice, without checking for its validity in the context of use.'' | * '''Hamming''' = number of discordant array elements (e.g. 1010 vs 1100 => D = 2)<<BR>> this distance was developed to measure transmission errors in binary coded strings<<BR>> * '''Jaccard-based''' = 1 - (number of concordant 1s in the two arrays)/(number of positions with 1 in either of the two arrays)<<BR>> (e.g. 1010 vs 1100 => D = 1 - 1/3 = 2/3)<<BR>> this distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated to the array<<BR>> (e.g. given a list of Transcription Factors, you can associate to every gene a binary array to express whether they are regulated by each TF or not)<<BR>> (e.g. given a list of interaction partners, you can associate to every protein a binary array to express whether they are interacting with each partner or not) ~-''Hamming distance is not good, in my opinion, when the strings compared have a very unequal 1/0 content, and the meaning of 1s and 0s is related to set-membership (as in the examples above).''-~<<BR>> ~-''E.g., consider these strings, and the choice made by Hamming and Jaccard:''-~<<BR>> * ~-{{{11 000 000}}}<<BR>>-~ * ~-{{{is more similar to 10 000 001 according to Hamming (D,,H,, = 2, D,,J,, = 2/3)}}}-~ * ~-{{{is more similar to 10 000 001 according to Jaccard (D,,H,, = 3, D,,J,, = 3/5)}}}-~ * ~-{{{110 100 000}}}-~ * ~-{{{is more similar to 011 000 000 according to Hamming (D,,H,, = 3, D,,J,, = 3/4)}}}-~ * ~-{{{is more similar to 110 111 101 according to Jaccard (D,,H,, = 4, D,,J,, = 4/7)}}}-~ ~-Mapping the first problem to interactions, let's say <<BR>> {{{A1 interacts with B, C}}}<<BR>> {{{A2 interacts with B, D}}}<<BR>> {{{A3 interacts with B, C, E, F, G}}}<<BR>> is A1 neighborhood more similar to A2 (Hamming's choice) or A3 (Jaccard's choice)<<BR>>-~ <<BR>> In addition, neither of these two distances is suited when some array positions are noisy, whereas others are information-rich, as in the following fictional examples:<<BR>> {{{110 010 001}}}<<BR>> {{{110 100 000}}}<<BR>> {{{110 000 110}}}<<BR>> {{{000 101 100}}}<<BR>> {{{001 101 011}}}<<BR>> {{{000 111 000}}}<<BR>> clearly, 1st and 2nd position are always co-conserved, as well as 4th and 6th; the co-conservation of other residues is irregular; therefore, we may think that the co-conserved positions hold more information than the other ones. | 
Distances: my personal experience
Just a few suggestions according to my own experience. For a more formal and detailed treatment go to Wikipedia or a maths manual.
 For a very rich directory on string similarity go to this link 
Quantitative Data Arrays
For quantitative data (e.g. transcription signals from microarray experiments) you can use
- Euclidean distance, which is D = √ ((x1 - x2)2 + (y11 - y2)2 + (z1 - z2) + (...)2 + ...) 
 it is the geometric distance you are used to calculate in 2 dimensions, just extended over N-dimensions
 be careful, I'd suggest to use it only when:- you are sure that the data are in the same magnitude scale 
- you do not want to consider anti-correlated patterns similar 
 
- Correlation-based 
 this distance is ideal when you want to group objects with inter-dependent behavioral trend
 for inst, for D = 1 - Correlation Index => similar = whenever A goes up, always B goes up as well; dissimilar = whenever A goes up, B always goes down
 whereas, for D = 1 - Abs (Correlation Index) => similar = whenever A goes up, always B goes up as well, AND whenever A goes up, B always goes down
 for these reason, I usually prefer Pearson-based distance when clustering genes by expression signals (Affymetrix arrays)
Binary Arrays
- Hamming = number of discordant array elements (e.g. 1010 vs 1100 => D = 2) 
 this distance was developed to measure transmission errors in binary coded strings
 
- Jaccard-based = 1 - (number of concordant 1s in the two arrays)/(number of positions with 1 in either of the two arrays) 
 (e.g. 1010 vs 1100 => D = 1 - 1/3 = 2/3)
 this distance is useful when array positions correspond to elements which can belong (value = 1) or not belong (value = 0) to the set associated to the array
 (e.g. given a list of Transcription Factors, you can associate to every gene a binary array to express whether they are regulated by each TF or not)
 (e.g. given a list of interaction partners, you can associate to every protein a binary array to express whether they are interacting with each partner or not)
Hamming distance is not good, in my opinion, when the strings compared have a very unequal 1/0 content, and the meaning of 1s and 0s is related to set-membership (as in the examples above).
 E.g., consider these strings, and the choice made by Hamming and Jaccard:
 
- 11 000 000 
 - is more similar to 10 000 001 according to Hamming (D,,H,, = 2, D,,J,, = 2/3) 
- is more similar to 10 000 001 according to Jaccard (D,,H,, = 3, D,,J,, = 3/5) 
 
- 110 100 000 - is more similar to 011 000 000 according to Hamming (D,,H,, = 3, D,,J,, = 3/4) 
- is more similar to 110 111 101 according to Jaccard (D,,H,, = 4, D,,J,, = 4/7) 
 
Mapping the first problem to interactions, let's say 
 A1 interacts with B, C
 A2 interacts with B, D
 A3 interacts with B, C, E, F, G
 is A1 neighborhood more similar to A2 (Hamming's choice) or A3 (Jaccard's choice)
 
 In addition, neither of these two distances is suited when some array positions are noisy,  whereas others are information-rich, as in the following fictional examples:
 110 010 001
 110 100 000
 110 000 110
 000 101 100
 001 101 011
 000 111 000
 clearly, 1st and 2nd position are always co-conserved, as well as 4th and 6th;  the co-conservation of other residues is irregular;  therefore, we may think that the co-conserved positions hold more information than the other ones. 
