NEURAL NETWORKS Ivan F Wilde Mathematics Department King’s College London London, WC2R 2LS, UK ivan. wilde@kcl. ac. uk www.

jntuworld. com www. jntuworld.

We Will Write a Custom Essay about Java and Web Technologies Essay**For You For Only $13.90/page!**

order now

com www. jwjobs. net Contents 1 Matrix Memory . . . . . .

. . . . . . .

. . . . .

. . .

. . . .

. 1 2 Adaptive Linear Combiner ….

……

…….

… 21 3 Arti? cial Neural Networks …..

……..

……

. 35 ..

…..

….

….

…

……..

45 5 Multilayer Feedforward Networks . . . . . .

. . .

. . . . . .

. 75 6 Radial Basis Functions 95 4 The Perceptron ……

……..

……..

7 Recurrent Neural Networks . . . . . . . .

. . . . .

. . . . . .

. 103 8 Singular Value Decomposition Bibliography . .

. . .

. . . . .

. . . . . . . .

115 . . . .

. . . . . .

. . . . .

. . . . . . .

. . . . .

. . 121 www. jntuworld.

com www. jntuworld. com www.

jwjobs. net Chapter 1 Matrix Memory We wish to construct a system which possesses so-called associative memory. This is de? able generally as a process by which an input, considered as a “key”, to a memory system is able to evoke, in a highly selective fashion, a speci? c response associated with that key, at the system output.

The signalresponse association should be “robust”, that is, a “noisy” or “incomplete” input signal should none the less invoke the correct response—or at least an acceptable response. Such a system is also called a content addressable memory. 3 mapping stimulus response Figure 1. 1: A content addressable memory.The idea is that the association should not be de? ed so much between the individual stimulus-response pairs, but rather embodied as a whole collection of such input-output patterns—the system is a distributive associative memory (the input-output pairs are “distributed” throughout the system memory rather than the particular input-output pairs being somehow represented individually in various di? erent parts of the system). To attempt to realize such a system, we shall suppose that the input key (or prototype) patterns are coded as vectors in Rn , say, and that the responses are coded as vectors in Rm .

For example, the input might be a digitized photograph comprising a picture with 100 ? 00 pixels, each of which may assume one of eight levels of greyness (from white (= 0) to black 1 www. jntuworld. com www. jntuworld. com www. jwjobs.

net 2 Chapter 1 (= 7)). In this case, by mapping the screen to a vector, via raster order, say, the input is a vector in R10000 and whose components actually take values in the set {0, . .

. , 7}. The desired output might correspond to the name of the person in the photograph. If we wish to recognize up to 50 people, say, then we could give each a binary code name of 6 digits—which allows up to 26 = 64 di? erent names.Then the output can be considered as an element of R6 . Now, for any pair of vectors x ? Rn , y ? Rm , we can e? ect the map x > y via the action of the m ? n matrix M (x,y) = y xT where x is considered as an n ? 1 (column) matrix and y as an m ? 1 matrix.

Indeed, M (x,y) x = y xT x = ? y, where ? = xT x = x 2 , the squared Euclidean norm of x. The matrix yxT is called the outer product of x and y . This suggests a model for our “associative system”. Suppose that we wish to consider p input-output pattern pairs, (x(1) , y (1) ), (x(2) , y (2) ), . . . , (x(p) , y (p) ). Form the m ? n matrix p y (i) x(i) T .

M= i=1 M is called the correlation memory matrix (corresponding to the given pattern pairs). Note that if we let X = (x(1) · · · x(p) ) and Y = (y (1) · · · y (p) ) be the n ? p and m ? p matrices with columns given by the vectors x(1) , . . . , x(p) and y (1) , . .

. , y (p) , respectively, then the matrix p=1 y (i) x(i) T is just Y X T . i Indeed, the jk -element of Y X T is p p T T Yji Xki Yji (X )ik = (Y X )jk = i=1 i=1 p (i) (i) yj xk = i=1 which is precisely the jk -element of M . When presented with the input signal x(j ) , the output is p (j ) Mx y (i) x(i) T x(j ) = i=1 p = y (j ) x(j ) T x(j ) + x(i) T x(j ) )y (i) .

i=1 i=j Department of Mathematics www. jntuworld. com www. jntuworld. com www. jwjobs. net 3 Matrix Memory In particular, if we agree to “normalize” the key input signals so that x(i) T x(i) = x(i) 2 = 1, for all 1 ? i ? p, then the ? rst term on the right hand side above is just y (j ) , the desired response signal. The second term on the right hand side is called the “cross-talk” since it involves overlaps (i.

e. , inner products) of the various input signals. If the input signals are pairwise orthogonal vectors, as well as being normalized, then x(i) T x(j ) = 0 for all i = j .

In this case, we get M x(j ) = y (j ) that is, perfect recall. Note that Rn contains at most n mutually orthogonal vectors. Operationally, one can imagine the system organized as indicated in the ? gure. key patterns are loaded at beginning vv input signal v output . .

. . . . ? Rn ? Rm Figure 1. 2: An operational view of the correlation memory matrix. • start with M = 0, • load the key input patterns one by one T M i = 1, .

. . , p, • ? nally, present any input signal and observe the response. Note that additional signal-response patterns can simply be “added in” at any time, or even removed—by adding in ? (j ) x(j ) T .

After the second stage above, the system has “learned” the signal-response pattern pairs. The collection of pattern pairs (x(1) , y (1) ), . . . , (x(p) , y (p) ) is called the training set.

King’s College London www. jntuworld. com www. jntuworld. com www. jwjobs. net 4 Chapter 1 Remark 1. 1.

In general, the system is a heteroassociative memory x(i) y (i) , 1 ? i ? p. If the output is the prototype input itself, then the system is said to be an autoassociative memory. We wish, now, to consider a quantitative account of the robustness of the autoassociative memory matrix.For this purpose, we shall suppose that the prototype patterns are bipolar vectors in Rn , i. e. , the components of (i) the x(i) each belong to {? 1, 1}. Then x(i) 2 = n=1 xj 2 = n, for each j v 1 ? i ? p, so that (1/ n)x(i) is normalized.

Suppose, further, that the prototype vectors are pairwise orthogonal (—this requires that n be even). The correlation memory matrix is 1 M= n p x(i) x(i) T i=1 and we have seen that M has perfect recall, M x(j ) = x(j ) for all 1 ? j ? p. We would like to know what happens if M is presented with x, a corrupted version of one of the x(j ) .In order to obtain a bipolar vector as output, we process the output vector M x as follows: Mx ?(M x) where ? : Rn > {? 1, 1}n is de? ned by 1, if zk ? 0 ?1, if zk < 0 ?(z )k = for 1 ? k ? n and z ? Rn .

Thus, the matrix output is passed through a (bipolar) signal quantizer, ?. To proceed, we introduce the notion of Hamming distance between pairs of bipolar vectors. Let a = (a1 , . . . , an ) and b = (b1 , . .

. , bn ) be elements of {? 1, 1}n , i. e. , bipolar vectors. The set {? 1, 1}n consists of the 2n vertices of a hypercube in Rn . Then n aT b = i=1 ai bi = ? ? ? where ? s the number of components of a and b which are the same, and ? is the number of di? ering components (ai bi = 1 if and only if ai = bi , and ai bi = ? 1 if and only if ai = bi ). Clearly, ? + ? = n and so aT b = n ? 2? . De? nition 1.

2. The Hamming distance between the bipolar vectors a, b, denoted ? (a, b), is de? ned to be n ?(a, b) = 1 2 i=1 |ai ? bi |. Department of Mathematics www. jntuworld.

com www. jntuworld. com www. jwjobs. net 5 Matrix Memory 1 Evidently (thanks to the factor 2 ), ? (a, b) is just the total number of mismatches between the components of a and b, i. e. , it is equal to ? above.

Hence aT b = n ? 2? (a, b). Note that the Hamming distance de? nes a metric on the set of bipolar vectors. Indeed, ? (a, b) = 1 a ? b 1 , where · 1 is the ? 1 -norm de? ned 2 on Rn by z 1 = n |zi |, for z = (z1 , . .

. , zn ) ? Rn . The ? 1 -norm is also i=1 known as the Manhatten norm—the distance between two locations is the sum of lengths of the east-west and north-south contributions to the journey, inasmuch as diagonal travel is not possible in Manhatten. Hence, using x(i) T x = n ? 2? (x(i) , x), we have 1 Mx = n = 1 n p x(i) x(i) T x i=1 p i=1 (n ? 2? i (x)) x(i) , where ? i (x) = ? x(i) , x), the Hamming distance between the input vector x and the prototype pattern vector x(i) . Given x, we wish to know when x x(m) , that is, when x Mx (m) .

According to our bipolar quantization rule, it will certainly ? (M x) = x be true that ? (M x) = x(m) whenever the corresponding components of M x (m) and x(m) have the same sign. This will be the case when (M x)j xj > 0, that is, whenever 1 1 (m) (m) (n ? 2? m (x)) xj xj + n n =1 p (i) (m) (n ? 2? i (x)) xj xj >0 i=1 i=m for all 1 ? j ? n. This holds if p (i) (m) (n ? 2? i (x)) xj xj i=1 i=m < n ? 2? m (x) ((? )) for all 1 ? j ? (—we have used the fact that if s > |t| then certainly s + t > 0). We wish to ? nd conditions which ensure that the inequality (? ) holds. By the triangle inequality, we get p p (n ? i=1 i=m (i) (m) 2? i (x)) xj xj ? i=1 i=m |n ? 2? i (x)| (?? ) King’s College London www.

jntuworld. com www. jntuworld. com www. jwjobs. net 6 Chapter 1 (i) (m) since |xj xj | = 1 for all 1 ? j ? n. Furthermore, using the orthogonality of x(m) and x(i) , for i = m, we have 0 = x(i) T x(m) = n ? 2? (x(i) , x(m) ) so that |n ? 2? i (x)| = |2? (x(i) , x(m) ) ? 2? i (x)| = 2|? (x(i) , x(m) ) ? ?(x(i) , x)| ? 2? x(m) , x), where the inequality above follows from the pair of inequalities ? (x(i) , x(m) ) ? ?(x(i) , x) + ? (x, x(m) ) and ?(x(i) , x) ? ?(x(i) , x(m) ) + ? (x(m) , x). Hence, we have |n ? 2? i (x)| ? 2? m (x) (??? ) for all i = m.

This, together with (?? ) gives p p (i) (m) i=1 i=m (n ? 2? i (x)) xj xj ? i=1 i=m |n ? 2? i (x)| ? 2(p ? 1)? m (x). It follows that whenever 2(p ? 1)? m (x) < n ? 2? m (x) then (? ) holds which means that M x = x(m) . The condition 2(p ? 1)? m (x) < n ? 2? m (x) is just that 2p? m (x) < n, i. e.

, the condition that ? m (x) < n/2p. Now, we observe that if ? (x) < n/2p, then, for any i = m, n ? 2? i (x) ? 2? m (x) n < p by (??? ), above, and so n ? 2? i (x) < n/p. Thus ?i (x) > 1 2 n? n p n p? 1 2 p n ?, 2p = assuming that p ? 2, so that p ? 1 ? 1. In other words, if x is within Hamming distance of (n/2p) from x(m) , then its Hamming distance to every other prototype input vector is greater (or equal to) (n/2p). We have thus proved the following theorem (L. Personnaz, I.

Guyon and G. Dreyfus, Phys. Rev.

A 34, 4217–4228 (1986)). Department of Mathematics www. jntuworld. com www. jntuworld. com www. jwjobs.

net 7 Matrix Memory Theorem 1. 3.Suppose that {x(1) , x(2) , . . . , x(p) } is a given set of mutually orthogonal bipolar patterns in {? 1, 1}n . If x ? {? 1, 1}n lies within Hamming distance (n/2p) of a particular prototype vector x(m) , say, then x(m) is the nearest prototype vector to x.

Furthermore, if the autoassociative matrix memory based on the patterns {x(1) , x(2) , . . . , x(p) } is augmented by subsequent bipolar quantization, then the input vector x invokes x(m) as the corresponding output. This means that the combined memory matrix and quantization system can correctly recognize (slightly) corrupted input patterns.The nonlinearity (induced by the bipolar quantizer) has enhanced the system performance— small background “noise” has been removed. Note that it could happen that the output response to x is still x(m) even if x is further than (n/2p) from x(m) . In other words, the theorem only gives su? cient conditions for x to recall x(m) .

As an example, suppose that we store 4 patterns built from a grid of 8 ? 8 pixels, so that p = 4, n = 82 = 64 and (n/2p) = 64/8 = 8. Each of the 4 patterns can then be correctly recalled even when presented with up to 7 incorrect pixels. Remark 1. 4.

If x is close to ? (m) , then the output from the combined autocorrelation matrix memory and bipolar quantizer is ? x(m) . 1 Proof. Let M = n p=1 (? x(i) )(? x(i) )T . Then clearly M = M , i.

e. , the i autoassociative correlation memory matrix for the patterns ? x(1) , . .

. , ? x(p) is exactly the same as that for the patterns x(1) , . .

. , x(p) . Applying the above theorem to the system with the negative patterns, we get that ? (M x) = ? x(m) , whenever x is within Hamming distance (n/2p) of ? x(m) . But then ? (M x) = x(m) , as claimed. A memory matrix, also known as a linear associator, can be pictured as a network as in the ? gure. 1 x2 x3 input vector (n components) M11 y1 y2 M21 Mm1 .

. . . . . output vector (m components) ym xn Mmn Figure 1. 3: The memory matrix (linear associator) as a network. King’s College London www.

jntuworld. com www. jntuworld. com www. jwjobs. net 8 Chapter 1 “Weights” are assigned to the connections. Since yi = j Mij xj , this suggests that we assign the weight Mij to the connection joining input node j to output node i; Mij = weight(j > i). The correlation memory matrix trained on the pattern pairs (x(1) , y (1) ),.

. . , (x(p) , y (p) ) is given by M = p =1 y (m) x(m) T , which has typical term m p (y (m) x(m) T )ijMij = m=1 p (m) (m) xj . = yi m=1 Now, Hebb’s law (1949) for “real” i. e. , biological brains says that if the excitation of cell j is involved in the excitation of cell i, then continued excitation of cell j causes an increase in its e? ciency to excite cell i.

To encapsulate a crude version of this idea mathematically, we might hypothesise that the weight between the two nodes be proportional to the excitation values of the nodes. Thus, for pattern label m, we would postulate that the weight, (m) (m) weight(input j > output i), be proportional to xj yi . We see that Mij is a sum, over all patterns, of such terms.For this reason, the assignment of the correlation memory matrix to a content addressable memory system is sometimes referred to as generalized Hebbian learning, or one says that the memory matrix is given by the generalized Hebbian rule. Capacity of autoassociative Hebbian learning We have seen that the correlation memory matrix has perfect recall provided that the input patterns are pairwise orthogonal vectors. Clearly, there can be at most n of these. In practice, this orthogonality requirement may not be satis? ed, so it is natural ask for some kind of guide as to the number of patterns that can be stored and e? ctively recovered. In other words, how many patterns can there be before the cross-talk term becomes so large that it destroys the recovery of the key patterns? Experiment con? rms that, indeed, there is a problem here.

To give some indication of what might be reasonable, consider the autoassociative correlation memory matrix based on p bipolar pattern vectors x(1) , . . .

, x(p) ? {? 1, 1}n , followed by bipolar quantization, ?. On presentation of pattern x(m) , the system output is (m) ?(M x 1 )=? n p x(i) x(i) T x(m) . i=1 Department of Mathematics www. jntuworld.

com www. jntuworld. com www.

jwjobs. net 9Matrix Memory (m) Consider the k th bit. Then ? (M x(m) )k = xk that is whenever 1 (m) (m) (m) T (m) 1 + x x xk x nk n 1 n (m) whenever xk (M x(m) )k > 0, p (m) (i) xk xk x(i) T x(m) > 0 i=1 i=j Ck or 1 + Ck > 0. In order to consider a “typical” situation, we suppose that the patterns are random. Thus, x(1) , . .

. , x(p) are selected at random from {? 1, 1}n , with all pn bits being chosen independently and equally likely. The output bit ? (M x(m) )k is therefore incorrect if 1 + Ck < 0, i. e. , if Ck < ? 1. We shall estimate the probability of this happening.

We see that Ck is a sum of many terms Ck = n (m) (i) (i) (m) where Xm,k,i,j = xk xk xj xj p n Xm,k,i,j i=1 j =1 i=m . We note ? rstly that, with j = k , (m) (i) (i) (m) Xm,k,i,k = xk xk xk xk = 1. Next, we see that, for j = k , each Xm,k,i,j takes the values ±1 with equal probability, namely, 1 , and that these di? erent X s form an independent 2 family. Therefore, we may write Ck as Ck = p? 1 1 +S n n where S is a sum of (n ? 1)(p ? 1) independent random variables, each 1 taking the values ±1 with probability 2 . Each of the X s has mean 0 and 2 = 1. By the Central Limit Theorem, it follows that the random variance ? variable (S/(n ? )(p ? 1))/(? / (n ? 1)(p ? 1)) = S/ (n ? 1)(p ? 1) has an approximate standard normal distribution (for large n). King’s College London www.

jntuworld. com www. jntuworld.

com www. jwjobs. net 10 Chapter 1 Hence, if we denote by Z a standard normal random variable, p? 1 S + < ? 1 = Prob S < ? n ? (p ? 1) n n S ?n ? (p ? 1) = Prob < (n ? 1)(p ? 1) (n ? 1)(p ? 1) Prob(Ck < ? 1) = Prob S = Prob (n ? 1)(p ? 1) ? Prob Z < ? ? Prob Z < ? n . p Suppose that we require that the probability of an incorrect bit be no greater than 0. 01 (or 1%). Then, from statistical tables, we ? d that Prob(Z > n/p) ? 0. 01 requires that n/p ? 2.

326. That is, we require n/p ? (2. 326)2 or p/n ? 0. 185. Now, to say that any particular bit is incorrectly recalled with probability 0.

01 is to say that the average number of incorrect bits (from a large sample) is 1% of the total. We have therefore shown that if we are prepared to accept up to 1% bad bits in our recalled patterns (on average) then we can expect to be able to store no more than p = 0. 185n patterns in our autoassociative system. That is, the storage capacity (with a 1% error tolerance) is 0.

185n. Generalized inverse matrix memoryWe have seen that the success of the correlation memory matrix, or Hebbian learning, is limited by the appearance of the cross-talk term. We shall derive an alternative system based on the idea of minimization of the output distortion or error. Let us start again (and with a change of notation). We wish to construct an associative memory system which matches input patterns a(1) ,. . .

, a(p) (from Rn ) with output pattern vectors b(1) ,. . . , b(p) (in Rm ), respectively. The question is whether or not we can ? nd a matrix M ? Rm? n , the set of m ? n real matrices, such that M a(i) = b(i) Department of Mathematics ww. jntuworld. com www.

jntuworld. com www. jwjobs.

net 11 Matrix Memory for all 1 ? i ? p. Let A ? Rn? p be the matrix whose columns are the vectors a(1) , . . .

, a(p) , i. e. , A = (a(1) · · · a(p) ), and let B ? Rm? p be the matrix with (j ) (j ) columns given by the b(i) s, B = (b(1) · · · b(p) ), thus Aij = ai and Bij = bi . Then it is easy to see that M a(i) = b(i) , for all i, is equivalent to M A = B . The problem, then, is to solve the matrix equation M A = B, for M ? Rm? n , for given matrices A ? Rn? p and B ? Rm? p . First, we observe that for a solution to exist, the matrices A and B cannot be arbitrary.

Indeed, if A = 0, then so is M A no matter what M is—so the equation will not hold unless B also has all zero entries. Suppose next, slightly more subtly, that there is some non-zero vector v ? Rp such that Av = 0. Then, for any M , M Av = 0.

In general, it need not be true that Bv = 0. Suppose then that there is no such non-zero v ? Rp such that Av = 0, i. e.

, we are supposing that Av = 0 implies that v = 0. What does this mean? We have p (Av )i = Aij vj j =1 = v1 Ai1 + v2 Ai2 + · · · + vp Aip (1) (2) ( p) = v1 ai + v2 ai + · · · + vp ai = ith component of (v1 a(1) + · · · + vp a(p) ).In other words, Av = v1 a(1) + · · · + vp a(p) . The vector Av is a linear combination of the columns of A, considered as elements of Rn . Now, the statement that Av = 0 if and only if v = 0 is equivalent to the statement that v1 a(1) + · · · + vp a(p) = 0 if and only if v1 = v2 = · · · = vp = 0 which, in turn, is equivalent to the statement that a(1) ,. . . ,a(p) are linearly independent vectors in Rn .

Thus, the statement, Av = 0 if and only if v = 0, is true if and only if the columns of A are linearly independent vectors in Rn . Proposition 1. 5. For any A ? Rn? p , the p ? matrix ATA is invertible if and only if the columns of A are linearly independent in Rn . Proof. The square matrix ATA is invertible if and only if the equation ATAv = 0 has the unique solution v = 0, v ? Rp . (Certainly the invertibility of ATA implies the uniqueness of the zero solution to ATAv = 0. For the converse, ? rst note that the uniqueness of this zero solution implies that King’s College London www.

jntuworld. com www. jntuworld. com www. jwjobs. net 12 Chapter 1 ATA is a one-one linear mapping from Rp to Rp .

Moreover, using linearity, one readily checks that the collection ATAu1 , . . , ATAup is a linearly independent set for any basis u1 , . . . , up of Rp . This means that it is a basis and so ATA maps Rp onto itself. Hence ATA has an inverse. Alternatively, one can argue that since ATA is symmetric it can be diagonalized via some orthogonal transformation. But a diagonal matrix is invertible if and only if every diagonal entry is non-zero. In this case, these entries are precisely the eigenvalues of ATA. So ATA is invertible if and only if none of its eigenvalues are zero. ) Suppose that the columns of A are linearly independent and that ATAv = 0.Then it follows that v T ATAv = 0 and so Av = 0, since v T ATAv = n 2 2 i=1 (Av )i = Av 2 , the square of the Euclidean length of the n-dimensional vector Av . By the argument above, Av is a linear combination of the columns of A, and we deduce that v = 0. Hence ATA is invertible. On the other hand, if ATA is invertible, then Av = 0 implies that ATAv = 0 and so v = 0. Hence the columns of A are linearly independent. We can now derive the result of interest here. Theorem 1. 6. Let A be any n ? p matrix whose columns are linearly independent. Then for any m ? p matrix B , there is an m ? n matrix M such that M A = B .Proof. Let M = B (ATA)? 1 AT . m? p p? n p? p Then M is well-de? ned since ATA ? Rp? p is invertible, by the proposition. Direct substitution shows that M A = B . So, with this choice of M we get perfect recall, provided that the input pattern vectors are linearly independent. Note that, in general, the solution above is not unique. Indeed, for any matrix C ? Rm? n , the m ? n matrix M ? = C 1 n ? A(ATA)? 1 AT satis? es l M ? A = C A ? A(ATA)? 1 ATA = C (A ? A) = 0 ? Rm? p . Hence M + M ? satis? es (M + M ? )A = B . Can we see what M looks like in terms of the patterns a(i) , b(i) ? The answer is “yes and no”.We have A = (a(1) · · · a(p) ) and B = (b(1) · · · b(p) ). Then n (ATA)ij = n AT Akj = ik k=1 n = Aki Akj k=1 (i) (j ) ak ak k=1 Department of Mathematics www. jntuworld. com www. jntuworld. com www. jwjobs. net 13 Matrix Memory which gives ATA directly in terms of the a(i) s. Let Q = ATA ? Rp? p . Then M = BQ? 1 AT , so that p Bik (Q? 1 )k? A? j Mij = k,? =1 p (k ) (? ) bi (Q? 1 )k? aj = since AT = Aj? . ?j k,? =1 This formula for M , valid for linearly independent input patterns, expresses M more or less in terms of the patterns. The appearance of the inverse, Q? 1 , somewhat lessens its appeal, however.To discuss the case where the columns of A are not necessarily linearly independent, we need to consider the notion of generalized inverse. De? nition 1. 7. For any given matrix A ? Rm? n , the matrix X ? Rn? m is said to be a generalized inverse of A if (i) AXA = A, (ii) XAX = X , (iii) (AX )T = AX , (iv) (XA)T = XA. The terms pseudoinverse or Moore-Penrose inverse are also commonly used for such an X . Examples 1. 8. 1. If A ? Rn? n is invertible, then A? 1 is the generalized inverse of A. 2. If A = ? ? R1? 1 , then X = 1/? is the generalized inverse provided ? = 0. If ? = 0, then X = 0 is the generalized inverse. . The generalized inverse of A = 0 ? Rm? n is X = 0 ? Rn? m . 4. If A = u ? Rm? 1 , u = 0, then one checks that X = uT /(uT u) is a generalized inverse of u. The following result is pertinent to the theory. Theorem 1. 9. Every matrix possesses a unique generalized inverse. Proof. We postpone discussion of existence (which can be established via the Singular Value Decomposition) and just show uniqueness. This follows King’s College London www. jntuworld. com www. jntuworld. com www. jwjobs. net 14 Chapter 1 by repeated use of the de? ning properties (i),. . . ,(iv). Let A ? Rm? n be given and suppose that X, Y ?Rn? m are generalized inverses of A. Then X = XAX, by (i), T = X (AX ) , T by (iii), T = XX A = XX T AT Y T AT , = XX T AT AY, = XAX AY, = XAY, by (i)T , by (iii), by (iii), by (ii), = XAY AY, T by (i), T = XAA Y Y, by (iv), = AT X T AT Y T Y, T T = A Y Y, = Y AY, = Y, by (iv), T by (i) , by (iv), by (ii), as required. Notation For given A ? Rm? n , we denote its generalized inverse by A# . It is also often written as Ag , A+ or A† . Proposition 1. 10. For any A ? Rm? n , AA# is the orthogonal projection onto ran A, the linear span in Rm of the columns of A, i. e. , if P = AA# ? Rm? , then P = P T = P 2 and P maps Rm onto ran A. Proof. The de? ning property (iii) of the generalized inverse A# is precisely the statement that P = AA# is symmetric. Furthermore, P 2 = AA# AA# = AA# , by condition (i), =P so P is idempotent. Thus P is an orthogonal projection. For any x ? Rm , we have that P x = AA# x ? ran A, so that P : Rm > ran A. On the other hand, if x ? ran A, there is some z ? Rn such that x = Az . Hence P x = P Az = AA# Az = Az = x, where we have used condition (i) in the penultimate step. Hence P maps R onto ran A. Proposition 1. 11. Let A ? Rm? n . (i) If rank A = n, then A# = (ATA)? AT . (ii) If rank A = m, then A# = AT (AAT )? 1 . Department of Mathematics www. jntuworld. com www. jntuworld. com www. jwjobs. net 15 Matrix Memory Proof. If rank A = n, then A has linearly independent columns and we know that this implies that ATA is invertible (in Rn? n ). It is now a straightforward matter to verify that (ATA)? 1 AT satis? es the four de? ning properties of the generalized inverse, which completes the proof of (i). If rank A = m, we simply consider the transpose instead. Let B = AT . Then rank B = m, since A and AT have the same rank, and so, by the argument above, B # = (B T B )? B T . However, AT # = A# T , as is easily checked (again from the de? ning conditions). Hence A# = A# T T = (AT )# T = B # T = B (B T B )? 1 = AT (AAT )? 1 which establishes (ii). De? nition 1. 12. The · A F -norm 2 F on Rm? n is de? ned by = Tr(ATA) for A ? Rm? n , where Tr(B ) is the trace of the square matrix B ; Tr(B ) = norm is called the Frobenius norm and sometimes denoted · i Bii . This 2. We see that n A 2 F (ATA)ii T = Tr(A A) = i=1 n m AT Aji ij = i=1 j =1 m n A2 ij = i=1 j =1 since AT = Aji . Hence ij A F = (sum of squares of all entries of A) . We also note, here, that clearly A F = AT F .Suppose that A = u ? Rm? 1 , an m-component vector. Then A 2 = F m 2 i=1 ui , that is, A F is the usual Euclidean norm in this case. Thus the notation · 2 for this norm is consistent. Note that, generally, A F is just the Euclidean norm of A ? Rm? n when A is “taken apart” row by row and considered as a vector in Rmn via the correspondence A – (A11 , A12 , . . . , A1n , A21 , . . . , Amn ). The notation A 2 is sometimes used in numerical analysis texts (and in the computer algebra software package Maple) to mean the norm of King’s College London www. jntuworld. com www. jntuworld. com www. jwjobs. net 16 Chapter 1A as a linear map from Rn into Rm , that is, the value sup{ Ax : x ? Rn with x = 1}. One can show that this value is equal to the square root of the largest eigenvalue of ATA whereas A F is equal to the square root of the sum of the eigenvalues of ATA. Remark 1. 13. Let A ? Rm? n , B ? Rn? m , C ? Rm? n , and X ? Rp? p . Then it is easy to see that (i) Tr(AB ) = Tr(BA), (ii) Tr(X ) = Tr(X T ), (iii) Tr(AC T ) = Tr(C T A) = Tr(AT C ) = Tr(CAT ). The equalities in (iii) can each be veri? ed directly, or alternatively, one notices that (iii) is a consequence of (i) and (ii) (replacing B by C T ). Lemma 1. 14. For A ? Rm? , A# AAT = AT . Proof. We have (A# A)AT = (A# A)T AT by condition (iv) = (A(A# A))T = AT by condition (i) as required. Theorem 1. 15. Let A ? Rn? p and B ? Rm? p be given. Then X = BA# is an element of Rm? n which minimizes the quantity X A ? B F . Proof. We have XA ? B 2 F = (X ? BA# )A + B (A# A ? 1 p ) l = (X ? BA# )A 2 F = (X ? BA# )A 2 F 2 F + B (A# A ? 1 p ) l 2 F + B (A# A ? 1 p ) l 2 F # + 2 Tr AT (X ? BA# )T B (A# A ? 1 p ) l + 2 Tr (X ? BA# )T B (A A ? 1 p )AT . l = 0 by the lemma Hence XA ? B 2 F = (X ? BA# )A 2 F + B (A# A ? 1 p ) l which achieves its minimum, B (A# A ? 1 p ) l Department of Mathematics ww. jntuworld. com 2, F 2 F when X = BA# . www. jntuworld. com www. jwjobs. net 17 Matrix Memory Note that any X satisfying XA = BA# A gives a minimum solution. If AT has full column rank (or, equivalently, AT has no kernel) then AAT is invertible. Multiplying on the right by AT (AAT )? 1 gives X = BA# . So under this condition on AT , we see that there is a unique solution X = BA# minimizing X A ? B F . In general, one can show that BA# is that element with minimal · F norm which minimizes X A ? B F , i. e. , if Y = BA# and Y A ? B F = B A# A ? B F , then B A# F < Y F . Now let us return to our problem of ? ding a memory matrix which stores the input-output pattern pairs (a(i) , b(i) ), 1 ? i ? p, with each a(i) ? Rn and each b(i) ? Rm . In general, it may not be possible to ? nd a matrix M ? Rm? n such that M a(i) = b(i) , for each i. Whatever our choice of M , the system output corresponding to the input a(i) is just M a(i) . So, failing equality M a(i) = b(i) , we would at least like to minimize the error b(i) ? M a(i) . A measure of such an error is b(i) ? M a(i) 2 the squared 2 Euclidean norm of the di? erence. Taking all p patterns into account, the total system recall error is taken to be i=1 b(i) ? M a(i) 2 . 2 Let A = (a(1) · · · a(p) ) ? Rn? p and B = (b(1) · · · b(p) ) ? Rm? p be the matrices whose columns are given by the pattern vectors a(i) and b(i) , respectively. Then the total system recall error, above, is just B ? MA 2 F. We have seen that this is minimized by the choice M = BA# , where A# is the generalized inverse of A. The memory matrix M = BA# is called the optimal linear associative memory (OLAM) matrix. Remark 1. 16. If the patterns {a(1) , . . . , a(p) } constitute an orthonormal family, then A has independent columns and so A# = (ATA)? 1 AT = 1 p AT , so l hat the OLAM matrix is BA# = BAT which is exactly the correlation memory matrix. In the autoassociative case, b(i) = a(i) , so that B = A and the OLAM matrix is given as M = AA# . We have seen that AA# is precisely the projection onto the range of A, i. e. , onto the subspace of Rn spanned by the prototype patterns. In this case, we say that M is given by the projection rule. King’s College London www. jntuworld. com www. jntuworld. com www. jwjobs. net 18 Chapter 1 Any input x ? Rn can be written as AA# x x= + (1 ? AA# )x . l OLAM system output “novelty” Kohonen calls 1 ? AA# the novelty ? ter and has applied these ideas to l image-subtraction problems such as tumor detection in brain scans. Nonnull novelty vectors may indicate disorders or anomalies. Pattern classi? cation We have discussed the distributed associative memory (DAM) matrix as an autoassociative or as a heteroassociative memory model. The ? rst is mathematically just a special case of the second. Another special case is that of so-called classi? cation. The idea is that one simply wants an input signal to elicit a response “tag”, typically coded as one of a collection of orthogonal unit vectors, such as given by the standard basis vectors of Rm . In operation, the input x induces output M x, which is then associated with that tag vector corresponding to its maximum component. In other words, if (M x)j is the maximum component of M x, then the output M x is associated with the j th tag. Examples of various pattern classi? cation tasks have been given by T. Kohonen, P. Lehti? , E. Oja, A. Kortekangas and K. M? kisara, Demonstration of o a pattern processing properties of the optimal associative mappings, Proceedings of the International Conference on Cybernetics and Society, Washington, D. C. , 581–585 (1977). See also the article “Storage and Processing of Information in Distributed Associative Memory Systems” by T. Kohonen, P. Lehti? and E. Oja in “Parallel Models of Associative Memory” edited o by G. Hinton and J. Anderson, published by Lawrence Erlbaum Associates, (updated edition) 1989. ) In one such experiment, ten people were each photographed from ? ve di? erent angles, ranging from 45? to ? 45? , with 0? corresponding to a fully frontal face. These were then digitized to produce pattern vectors with eight possible intensity levels for each pixel.A distinct unit vector, a tag, was associated with each person, giving a total of ten tags, and ? fty patterns. The OLAM matrix was constructed from this data. The memory matrix was then presented with a digitized photograph of one of the ten people, but taken from a di? erent angle to any of the original ? ve prototypes. The output was then classi? ed according to the tag associated with its largest component. This was found to give correct identi? cation. The OLAM matrix was also found to perform well with autoassociation. Pattern vectors corresponding to one hundred digitized photographs were Department of Mathematics ww. jntuworld. com www. jntuworld. com www. jwjobs. net 19 Matrix Memory used to construct the autoassociative memory via the projection rule. When presented with incomplete or fuzzy versions of the original patterns, the OLAM matrix satisfactorily reconstructed the correct image. In another autoassociative recall experiment, twenty one di? erent prototype images were used to construct the OLAM matrix. These were each composed of three similarly placed copies of a subimage. New pattern images, consisting of just one part of the usual triple features, were presented to the OLAM matrix.The output images consisted of slightly fuzzy versions of the single part but triplicated so as to mimic the subimage positioning learned from the original twenty one prototypes. An analysis comparing the performance of the correlation memory matrix with that of the generalized inverse matrix memory has been o? ered by Cherkassky, Fassett and Vassilas (IEEE Trans. on Computers, 40, 1429 (1991)). Their conclusion is that the generalized inverse memory matrix performs better than the correlation memory matrix for autoassociation, but that the correlation memory matrix is better for classi? cation.This is contrary to the widespread belief that the generalized inverse memory matrix is the superior model. King’s College London www. jntuworld. com www. jntuworld. com www. jwjobs. net 20 Chapter 1 Department of Mathematics www. jntuworld. com www. jntuworld. com www. jwjobs. net Chapter 2 Adaptive Linear Combiner We wish to consider a memory matrix for the special case of one-dimensional output vectors. Thus, we consider input pattern vectors x(1) , . . . , x(p) ? R? , say, with corresponding desired outputs y (1) , . . . , y (p) ? R and we seek a memory matrix M ? R1?? such that M x(i) = y (i) , for 1 ? i ? . Since M ? R1?? , we can think of M as a row vector M = (m1 , . . . , m? ). The output corresponding to the input x = (x1 , . . . , x? ) ? R? is just ? mi xi . y = Mx = i=1 Such a system is known as the adaptive linear combiner (ALC). weights on connections x1 x2 input signal components m1 . . . xn m2 y output mn “adder” forms n mi xi i=1 Figure 2. 1: The Adaptive Linear Combiner. We have seen that we may not be able to ? nd M which satis? es the exact input-output relationship M x(i) = y (i) , for each i. The idea is to look for an M which is in a certain sense optimal. To do this, we seek m1 , . . . , m? 1 www. jntuworld. com www. jntuworld. com www. jwjobs. net 22 Chapter 2 such that (one half) the average mean-squared error p E? 11 2p i=1 |y (i) ? z (i) |2 is minimized—where y (i) is the desired output corresponding to the input vector x(i) and z (i) = M x(i) is the actual system output. We already know, from the results in the last chapter, that this is achieved by the OLAM matrix based on the input-output pattern pairs, but we wish here to develop an algorithmic approach to the construction of the appropriate memory matrix. We can write out E in terms of the mi as follows p E= 1 2p i=1 ? |y (i) ? j =1 p (i) mj xj |2 ? i=1 j,k=1 ? = 1 2 j,k=1 ? (i) (i) y (i)2 + 1 2p mj mk xj xk ? 2 (i) y (i) mj xj j =1 ? mj Ajk mk ? 1 bj mj + 2 c j =1 (i) (i) (i) 1 1 1 where Ajk = p p=1 xj xk , bj = p p=1 y (i) xj and c = p p=1 y (i)2 . i i i Note that A = (Ajk ) ? R??? is symmetric. The error E is a non-negative quadratic function of the mi . For a minimum, we investigate the equalities ? E /? mi = 0, that is, ?E = 0= ?mi ? k=1 Aik mk ? bi , for 1 ? i ? ?, where we have used the symmetry of (Aik ) here. We thus obtain the so-called Gauss-normal or Wiener-Hopf equations ? Aik mk = bi , k=1 for 1 ? i ? ?, or, in matrix form,Am = b , with A = (Aik ) ? R??? , m = (m1 , . . . , m? ) ? R? and b = (b1 , . . . , b? ) ? R? . If A is invertible, then m = A? 1 b is the unique solution. In general, there may be many solutions. For example, if A is diagonal with A11 = 0, then necessarily b1 = 0 (otherwise E could not be non-negative as a function of the mi ) and so we see that m1 is arbitrary. To relate this to the OLAM matrix, write E as E = 21p M X ? Y 2 , F Department of Mathematics www. jntuworld. com www. jntuworld. com www. jwjobs. net 23 Adaptive Linear Combiner where X = (x(1) · · · x(p) ) ? R?? p and Y = (y (1) · · · y (p) ) ?R1? p . This, we know, is minimized by M = Y X # ? R1?? . Therefore m = M T must be a solution to the Wiener-Hopf equations above. We can write A, b and c in 1 1 terms of the matrices X and Y . One ? nds that A = p XX T , bT = p Y X T 1 and c = p Y TY . The equation Am = b then becomes XX T m = XY T giving m = (XX T )? 1 XY T , provided that A is invertible. This gives M = mT = Y X T (XX T )? 1 = Y X # , as above. One method of attack for ? nding a vector m? minimizing E is that of gradient-descent. The idea is to think of E (m1 , . . . , m? ) as a bowl-shaped surface above the ? -dimensional m1 , . . , m? -space. Pick any value for m. The vector grad E , when evaluated at m, points in the direction of maximum increase of E in the neighbourhood of m. That is to say, for small ? (and a vector v of given length), E (m+? v )? E (m) is maximized when v points in the same direction as grad E (as is seen by Taylor’s theorem). Now, rather than increasing E , we wish to minimize it. So the idea is to move a small distance from m to m ? ? grad E , thus inducing maximal “downhill” movement on the error surface. By repeating this process, we hope to eventually reach a value of m which minimizes E .The strategy, then, is to consider a sequence of vectors m(n) given algorithmically by m(n + 1) = m(n) ? ? grad E , for n = 1, 2, . . . , with m(1) arbitrary and where the parameter ? is called the learning rate. If we substitute for grad E , we ? nd m(n + 1) = m(n) + ? b ? Am(n) . Now, A is symmetric and so can be diagonalized. There is an orthogonal matrix U ? R??? such that U AU T = D = diag(? 1 , . . . , ?? ) and we may assume that ? 1 ? ?2 ? · · · ? ?? . We have 1 1 E = 2 mT Am ? bT m + 2 c 1 = 1 mT U TU AU TU m ? bT U TU m + 2 c 2 = 1 z T Dz ? v T z + 1 c 2 2 where z = U m and v = U b ? = 1 2 i=1 ? 2 ?i zi ?