1) **For those new to the book, how would you summarise your approach to presenting machine learning?**

The book provides in-depth coverage of some of the main directions in machine learning around classification, regression and, also, aspects of unsupervised learning such as probabilistic graphical models. Each chapter starts from the more basic notions, in a way that can be followed by a newcomer to the field, and builds steadily to more advanced topics. The chapters are written to be as self-contained as possible. So, if readers want to learn only the basics, they can do so by reading two or three chapters. For example, one can start with the three chapters that deal with the basic notions related to a) parametric modelling, regression, and fundamental machine learning concepts such as the bias-variance trade-off, overfitting, and cross-validation (Chapter 3), b) classification basics (Chapter 7) and, finally, c) deep neural networks (Chapter 18). Although the book does not follow a black-box approach, the required maths (especially for the more basic chapters) are standard college probability and linear algebra. Furthermore, care is taken to explain the involved formulae via physical/geometric arguments that help the reader understand what lies behind the maths and the “cold” symbols. Once readers grasp the basics, they can move on to other chapters, depending on the emphasis and their interests. Also, every chapter is accompanied by computer exercises in both MATLAB and Python. Chapter 18 also includes computer exercises in TensorFlow.

For a limited time, you can access *Chapter 3: Learning in Parametric Modeling: Basic Concepts and Directions* on ScienceDirect.

**2) One of the big changes in the new edition is your extended coverage of Deep Learning. Can you describe your approach to this topic and the new content you have covered?**

Indeed, the first edition was published in 2015, which basically means that it covered advances up to 2014. However, 2015 and the years after were those in which a big boom took place in this field, not only in the sense of new methods and algorithms, but also in the sense of consolidation of what had been proposed earlier. The path that I have followed in the related chapter is a historical one. That is, neural networks and the related concepts are presented by following the evolution that took place in the field over the years. Thus, the chapter starts by commenting on the discovery of the neuron, the building block of our brain, by Ramón y Cajal in the late 19th century. Then, it moves on to the first model of an artificial neuron, i.e., the McCulloch-Pitts neuron, and presents Rosenblatt’s perceptron algorithm; these are the early milestones in the field. It then progressively “builds” multilayer perceptrons and the backpropagation algorithm, which is the fourth and most recent milestone. Next, it steps into the more recent trends, including up-to-date optimisation algorithms, such as Nesterov’s variants and the Adam algorithm, convolutional neural networks (CNNs), recurrent neural networks (RNNs), adversarial examples and learning, and the use of the attention mechanism. Finally, it reviews generative adversarial networks (GANs), variational autoencoders and capsule networks, and ends with a case study related to neural machine translation.

**3) What other changes have you made in the 2nd edition?**

Besides Chapter 18, which has basically been rewritten, Chapter 13, which is dedicated to Bayesian learning, has been enriched with new sections on nonparametric Bayesian learning; it now includes Gaussian processes as well as Dirichlet processes, with a detailed reference to the Chinese restaurant and Indian buffet processes. Furthermore, certain parts of all chapters have been rewritten to be clearer, with more examples. Also, in Chapter 11, the notion of random Fourier features is now treated.

**4) You have end of chapter exercises that use MATLAB and Python. Can you describe the nature of these and how they aid learning?**

In the second edition, all the computer exercises have also been given in Python. Code for all exercises, in both MATLAB and Python, is freely available via the book’s website. It is of paramount importance that readers experiment with the code while reading the book.

**5) You have been researching and teaching pattern recognition and machine learning for over 20 years. The field has changed and grown in importance immensely since you started in the field. How do you see the field evolving in the next few years?**

It is very difficult to predict the future. No doubt, after the advent, or rather the “rediscovery”, of neural networks, nothing is the same as before. However, after almost 15 years of intense research, it seems that the field has reached a level of saturation, and a number of important and highly challenging open problems need to be addressed: for example, issues related to interpretability, to adaptability to new data sets without the need for retraining, and to the need for huge training sets and computing power for training. New topics are becoming of interest, such as federated learning and manifold and geometric learning. Also, hardware implementation on neuromorphic and non-von Neumann types of computers is a challenging field for the future. My feeling and dream is to see these powerful algorithms run on, e.g., mobile phones, without having to resort to the cloud and powerful GPUs. Of course, the most challenging task is to move away from what machine learning currently is, that is, a powerful “predictor”. The vision is to search for what is known as strong AI, which will strive to achieve more human-like intelligence that cares for causality and some form of reasoning. Some of these issues, as well as related ethical concerns, are discussed in the introductory chapter.

**About the book**

- Presents the physical reasoning, mathematical modeling and algorithmic implementation of each method

- Updates on the latest trends, including sparsity, convex analysis and optimization, online distributed algorithms, learning in RKH spaces, Bayesian inference, graphical and hidden Markov models, particle filtering, deep learning, dictionary learning and latent variables modeling

- Provides case studies on a variety of topics, including protein folding prediction, optical character recognition, text authorship identification, fMRI data analysis, change point detection, hyperspectral image unmixing, target localization, and more

For a limited time, you can access *Chapter 3: Learning in Parametric Modeling: Basic Concepts and Directions* on ScienceDirect. Want your own copy? Enter code


Such methods have been used for several decades in a number of scientific disciplines, such as statistical learning, pattern recognition, adaptive and statistical signal processing, system identification and control, image analysis and, more recently, data mining and information retrieval, computer vision and robotics.

The name Machine Learning has its deep roots in computer science, where early pioneers envisioned constructing machines that learn in a way analogous to how the brain learns from data; that is, from experience. ML is closely related to, yet different in focus from, the scientific field of artificial intelligence (AI), which mainly relies on symbolic computations and rule-based reasoning, via a set of rules built into the system by the designer. In contrast, in ML, the emphasis is on establishing input-output relations, expressed via an adopted model that is learned from the data.

Take as an example the task of Optical Character Recognition (OCR), whose goal is to convert images of handwritten or printed text into a machine-encoded sequence. At the heart of such systems lies the design of a *classifier*, whose purpose is to *recognize* each character of the written text; in other words, to identify which one of the known characters of the alphabet is present in the corresponding text image. This is a typical ML task. The classifier is a system that relates an input text image to an output label (code number), which is uniquely associated with a character in the alphabet. The classifier is defined in terms of a set of unknown parameters. These are *learned* via a set of so-called *training* data; that is, a set of text images involving characters known to the designer. Hence, the system is trained via a set of known examples, and once the involved parameters have been estimated, the system is ready to perform label predictions when text images with unknown characters are presented at its input (scanner). OCR systems are widely used for data entry from printed records, such as passports, invoices and books, and are used for storage as well as for a number of ML-related tasks, such as automatic machine translation, text-to-speech conversion and text mining.
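The train-then-predict cycle described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not a real OCR system: the "images" are toy feature vectors, and the classifier's learned parameters are simply one mean vector per class, with prediction assigning a new image the label of the nearest mean.

```python
import numpy as np

# Hypothetical stand-ins for character images: each "image" is a flattened
# feature vector, and each label is the code number of a known character.
rng = np.random.default_rng(0)
train_images = np.vstack([rng.normal(0.0, 0.3, (20, 16)),   # samples of character "A"
                          rng.normal(1.0, 0.3, (20, 16))])  # samples of character "B"
train_labels = np.array([0] * 20 + [1] * 20)                # 0 -> "A", 1 -> "B"

# "Training": estimate the classifier's parameters (here, one mean vector
# per class) from the labeled examples.
class_means = {c: train_images[train_labels == c].mean(axis=0)
               for c in np.unique(train_labels)}

def predict(image):
    """Assign the label whose class mean is closest to the input image."""
    return min(class_means, key=lambda c: np.linalg.norm(image - class_means[c]))

# Prediction: a new, unlabeled "image" drawn near the "B" cluster.
print(predict(rng.normal(1.0, 0.3, 16)))
```

Real classifiers treated in the book are far richer, but the workflow is the same: parameters estimated from labeled training data, then label prediction on unseen inputs.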

ML aims at unraveling hidden regularities that underlie the available data set, via the use of various models.

With the advent of the information/knowledge society, ML has emerged as a core technology that runs across a number of scientific disciplines and in almost any engineering application. To mention a few examples, applications range from neuroscience to digital communications, from robotics to fMRI data analysis, from speech recognition and music information retrieval to medical applications and assistive technologies.

The more recent trend is that of big data applications, in areas such as smart grids, social networks and the internet of things. In big data applications, ML techniques have to learn from data residing in databases as big as a few petabytes (10^15 bytes) of stored information. Such applications have pushed ML to confront new challenges and problems.

The goal of this book is to offer a unifying approach to the major methods and algorithms that are currently used across these types of applications.

There are two major philosophies in handling the task of learning from data. One relies on deterministic models, which treat the unknown quantities as fixed, yet unknown, entities. This line of philosophy also encompasses probabilistic models of the so-called frequentist school of thought. The essence of these methods is that they build around the notion of parameter optimization, and different cost functions can be adopted for this task. A typical example, familiar to all engineers, computer scientists and statisticians from the early years of their graduate studies, is the method of least-squares (LS) estimation. A prediction model of an output variable is expressed as a weighted average of a set of input variables, and it is defined in terms of a set of unknown weights/parameters. The weights are estimated so as to minimize the sum of squared errors over a set of input-output measurements. LS is just one instance of a cost function to be optimized. A number of alternatives, and various ways of optimizing a cost function with respect to a set of unknown parameters, are considered in depth in this book. Depending on the nature of the available data and the available computational resources, different costs and optimization algorithms can be adopted for each specific learning task.
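The LS idea just described can be made concrete with a short sketch (the data here are hypothetical, generated for illustration from an assumed weighted-average model plus noise):

```python
import numpy as np

# Hypothetical toy data: outputs generated as a noisy weighted average of inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))            # 100 input measurements, 3 variables each
true_w = np.array([2.0, -1.0, 0.5])      # weights used only to generate the data
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Least squares: choose w to minimize the sum of squared errors ||y - X w||^2.
# The closed-form solution satisfies the normal equations (X^T X) w = X^T y.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w_ls, 2))  # close to the weights that generated the data
```

Other cost functions (and other optimizers, e.g., iterative gradient schemes instead of the closed form) slot into the same template, which is exactly the variety the book explores.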

The other major direction is to learn from the data via probabilistic arguments, where the involved model parameters are treated as random variables, each one associated with a probability distribution, known as a prior. In this way, one can build up a hierarchy of probabilistic models. This modeling philosophy is referred to as the Bayesian inference approach. To its full extent, such an approach can, in principle, bypass the need for optimization by imposing appropriate priors at a high enough level of the adopted model hierarchy. Take, as an example, the case where the output variable, given the values of the input variables, is modeled via a Gaussian probability density function (pdf). A Gaussian pdf is defined in terms of its mean value and its variance. According to the previously mentioned frequentist type of treatment, the unknown mean and variance are considered fixed, and their values are obtained by maximizing the joint probability distribution over a number of received measurements (known as the Maximum Likelihood (ML) method).

In contrast, according to the Bayesian rationale, the mean and variance are also considered random variables, described via a new set of probability distributions that are defined in terms of a new set of unknown parameters, known as hyperparameters. In the sequel, these hyperparameters can also be treated as random variables, defined in terms of another set of hyperparameters, and so on. This is how the hierarchy of models is formed. To its full extent, at the highest level of the hierarchy, the remaining unknown parameters, which define a prior probability distribution, can be assigned some values. Alternatively, one can involve an optimization step of a maximum likelihood type.
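The two treatments can be contrasted numerically for the Gaussian-mean example. This is a minimal sketch under simplifying assumptions: the variance is taken as known, the prior on the mean is Gaussian, and the data and hyperparameter values are hypothetical.

```python
import numpy as np

# Hypothetical data: samples from a Gaussian with unknown mean.
rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=1.0, size=50)
sigma2 = 1.0                      # noise variance, assumed known for simplicity
n = len(data)

# Frequentist / Maximum Likelihood treatment: the mean is a fixed unknown;
# maximizing the likelihood over the data yields the sample average.
mu_ml = data.mean()

# Bayesian treatment: the mean is itself a random variable with a Gaussian
# prior N(mu0, tau2); mu0 and tau2 are hyperparameters one level up the hierarchy.
mu0, tau2 = 0.0, 10.0
# Conjugacy gives a Gaussian posterior whose mean blends the prior with the data.
post_var = 1.0 / (1.0 / tau2 + n / sigma2)
post_mean = post_var * (mu0 / tau2 + data.sum() / sigma2)

print(mu_ml, post_mean)  # with many data points, the two estimates nearly coincide
```

With few data points the posterior mean is pulled toward the prior mean `mu0`; as the data accumulate, the likelihood dominates and the Bayesian estimate approaches the ML one.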

Both previously stated schools of thought are treated in depth in this book. In practice, the method of choice depends very much on the type of task at hand, the number of data points available for training, and the complexity and memory requirements imposed by each method. These issues are becoming critical factors in big data applications. Every student and researcher needs to become familiar with all (as far as possible) the basic theories and methods in order to be prepared for future scientific developments; and, no doubt, these cannot be predicted!

The book starts from the basics, necessary for any newcomer to the field, and builds steadily to more advanced and state-of-the-art concepts and methods. This development is carried out along two of the major pillars of ML, namely classification and regression. These are the two generic tasks that serve as an umbrella for a large class of problems faced in ML, known as supervised learning, which is the focus of this book. Such techniques involve training data (measurements/observations) for both the output and the input variables. Unsupervised learning/clustering is another direction of ML, where only input data is available. Such techniques are introduced in the book in terms of a classical method, the k-means algorithm; however, they are not part of the major focus. Clustering is extensively treated in a separate companion book. For the newcomer to the field, the required mathematical prerequisite is an understanding of basic probability and linear algebra. This is sufficient to cover all the basic material addressed in the book. In order to refresh the memory, a chapter summarizing the fundamentals of probability and statistics is provided at the beginning of the book. Necessary linear algebra definitions and formulae are also summarized in an appendix. More advanced methods may require some further mathematical skills, and the related material is discussed and explained in appendices; such methods are intended for more experienced readers who are interested in delving deeper into more advanced and state-of-the-art techniques.

Chapters are as self-contained as possible. This serves a two-fold purpose. Researchers who are interested in specific types of methods/problems can quickly identify what they need most and focus on the respective chapters. At the same time, the book can serve the needs of different courses, such as pattern recognition, adaptive and statistical signal processing, Bayesian learning and graphical models, and sparsity-aware learning.

Special attention is paid to various aspects of learning from big data, such as online learning, distributed learning and dimensionality reduction. Two of the most powerful techniques in ML, namely nonlinear modeling in reproducing kernel Hilbert spaces and deep neural network architectures, are treated in separate chapters. Deep learning was selected by the MIT Technology Review as one of the ten breakthrough technologies for the year 2013. The goal in deep learning is to build input-output relation models that mimic the many-layered structure of neurons in the neocortex, which accounts for about 80% of our brain, where thinking occurs and what we call intelligence is formed.

A number of case studies are discussed and serve as a vehicle to demonstrate the application of ML methods in the context of practical applications. Some examples are: echo cancellation and channel equalization, image denoising and deblurring, time-frequency analysis of echo signals transmitted by bats, optical character recognition (OCR), change point detection, text authorship identification, protein folding prediction in bioinformatics, hyperspectral image unmixing, and fMRI data analysis.

In writing this book I wanted to address the needs of advanced graduate and postgraduate students as well as researchers in the field of ML. The book is the outcome of many years of research, participation in international projects and teaching experience in computer science and engineering departments at different universities, both for graduate and postgraduate courses, as well as many short courses for industry-related audiences.

The book is written to satisfy the needs of readers who want to learn the methods in depth, and proofs are provided either in the text or in the problems. Readers who are not interested in proofs can simply bypass them; as far as possible, the various methods are also explained in terms of physical reasoning that facilitates understanding, without one having to resort to proofs. Moreover, a number of MATLAB exercises are given as part of the problems, and the MATLAB code is also available via the website of the book.

Sergios’ book *Machine Learning: A Bayesian and Optimization Perspective* is available for purchase on Google Play. Through July 31st, get 40% off this and all Elsevier titles on Google Play.

**About the Author**

Sergios Theodoridis is Professor of Signal Processing and Machine Learning in the Department of Informatics and Telecommunications of the University of Athens. He is the co-author of the bestselling book *Pattern Recognition* and the co-author of *Introduction to Pattern Recognition: A MATLAB Approach*. He serves as Editor-in-Chief of the IEEE Transactions on Signal Processing, and he is co-Editor-in-Chief, with Rama Chellappa, of the Academic Press Library in Signal Processing.

He has received a number of awards including the 2014 IEEE Signal Processing Magazine Best Paper Award, the 2009 IEEE Computational Intelligence Society Transactions on Neural Networks Outstanding Paper Award, the 2014 IEEE Signal Processing Society Education Award, the EURASIP 2014 Meritorious Service Award, and he has served as a Distinguished Lecturer for the IEEE Signal Processing Society and the IEEE Circuits and Systems Society. He is a Fellow of EURASIP and a Fellow of IEEE.
