
Image inputs that are either very large or have a diffuse (in pixel terms) distribution of edges may warrant larger pooling sizes, particularly in the lower layers of the network. However, this approach reduces the dimension of the signal transmitted to subsequent layers and, as a consequence, can result in excessive information loss. On this basis, the max pooling shape was not explored within the hyperparameter optimisation phase and remained static at 2×2. Given this, and its presence within the Brownlee architecture, a 2×2 grid was deemed an appropriate pooling shape.

Number of filters

The filter selection range was distributed around the number of filters observed within the Brownlee model, with two additional values included above and below to provide what was deemed a reasonable, but nevertheless constrained, search space (i.e. 16, 24, 32, 40, 48).

Optimisation

Three different optimisers were examined: Adam, Stochastic Gradient Descent (SGD) and RMSProp. Initial experimentation, supported by research findings (Kingma & Ba 2014), demonstrated that the Adam optimiser consistently outperformed both SGD and RMSProp. On this basis the choice of optimiser was not included within the optimal hyperparameter search, thereby reducing the size of the search space.

Activation Functions

Activation functions with linear components are now commonplace within neural networks, especially deep networks, owing to their superior performance. While the literature (Clevert et al. 2015) suggests that the ELU activation function outperforms the related ReLU function, the ReLU function was also considered within the optimisation given its prevalence.

Regularisation: L1 and L2

The Keras framework offers three applications of traditional regularisation: kernel, bias and activity regularisation. Initial experimentation suggested that limited benefits could be realised by regularising the kernel and activity functions, and hence they were excluded.

For each application, three penalty types are available within the Keras framework: L1, L2 and a combination of L1 and L2. Neurons with L1 regularisation typically settle on a sparse subset of the most important inputs and become effectively invariant to the other, noisy or less informative, inputs. In contrast, the final weight vectors observed with L2 regularisation are typically small and diffuse, owing to the quadratic penalty arising from large weights. Due to the multiplicative interactions between inputs and weights, this has the appealing property of incentivising the neural network to use all the inputs rather than over-rely on a select few. Finally, a combination of L1 and L2 can also be applied, enjoying the benefits of both methods. All three regularisation strategies were included within the optimisation phase with a default penalty weight of 0.1. Subsequently, a more refined penalty weight was explored with the best performing regularisation approach.

Regularisation: Dropout and DropConnect

As discussed within the further research section, Dropout regularisation has become an extremely popular regularisation method within the deep learning community, no doubt informed by both its efficacy and its simplicity. In order to ensure an acceptable proportion of 'dropped out' nodes, the dropout rate was supplied to the optimisation as a uniformly distributed range between 0.2 and 0.8.
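The search space described in this section could be expressed roughly as follows. This is a minimal sketch using the keras-tuner library rather than the study's actual implementation: the kernel size, input shape, number of output classes and overall layer layout are illustrative assumptions, while the filter counts, activation choices, bias-regulariser options with a 0.1 penalty, the fixed 2×2 pooling, the fixed Adam optimiser and the 0.2-0.8 dropout range follow the values stated above.

    import keras_tuner as kt
    from tensorflow import keras
    from tensorflow.keras import layers, regularizers

    def build_model(hp):
        # Hyperparameters searched, as described in this section
        n_filters = hp.Choice("n_filters", [16, 24, 32, 40, 48])
        activation = hp.Choice("activation", ["relu", "elu"])
        reg_type = hp.Choice("bias_regularizer", ["l1", "l2", "l1_l2"])
        dropout_rate = hp.Float("dropout_rate", min_value=0.2, max_value=0.8)  # uniform

        # Default penalty weight of 0.1 for the chosen regularisation strategy
        if reg_type == "l1":
            reg = regularizers.l1(0.1)
        elif reg_type == "l2":
            reg = regularizers.l2(0.1)
        else:
            reg = regularizers.l1_l2(l1=0.1, l2=0.1)

        model = keras.Sequential([
            # Kernel size and input shape are illustrative assumptions
            layers.Conv2D(n_filters, (3, 3), activation=activation,
                          bias_regularizer=reg, input_shape=(28, 28, 1)),
            layers.MaxPooling2D((2, 2)),   # pooling shape held static at 2x2
            layers.Flatten(),
            layers.Dropout(dropout_rate),
            layers.Dense(10, activation="softmax"),  # class count is an assumption
        ])
        # Optimiser fixed to Adam, which consistently outperformed SGD and RMSProp
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

A tuner such as kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20) could then sample configurations from this space; the tuner type and trial budget here are arbitrary and shown only to indicate how the build function would be consumed.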