# Machine Learning – Learning From Data

*Machine Learning* (ML) is the scientific discipline that encompasses methods and related algorithms whose goal is to *learn from data*. In other words, ML aims to uncover the hidden regularities that underlie an available data set through the use of various models. Once a model has been learned, it can be used to make predictions on previously unseen data.

Such methods have been used for several decades in a number of scientific disciplines, for example statistical learning, pattern recognition, adaptive and statistical signal processing, system identification and control, and image analysis, and more recently in data mining, information retrieval, computer vision and robotics.

The name Machine Learning has deep roots in computer science, where early pioneers envisioned constructing machines that learn in a way analogous to how the brain learns from data, that is, from experience. ML is closely related to, yet different in focus from, the scientific field of artificial intelligence (AI), which mainly relies on symbolic computation and rule-based reasoning, via a set of rules built into the system by the designer. In ML, by contrast, the emphasis is on establishing input-output relations, expressed via an adopted model that is learned from the data.

Take as an example the task of Optical Character Recognition (OCR), whose goal is to convert images of handwritten or printed text into a machine-encoded sequence. At the heart of such systems lies the design of a *classifier*, whose purpose is to *recognize* each character of the written text; in other words, to identify which one of the known characters in the alphabet is present in the corresponding text image. This is a typical ML task. The classifier is a system that relates the input text images to an output label (code number), which is uniquely associated with a character in the alphabet. The classifier is defined in terms of a set of unknown parameters. These are *learned* from a set of so-called *training* data, that is, a set of text images containing characters known to the designer. Hence, the system is trained on a set of known examples, and once the parameters have been estimated, the system is ready to predict labels when text images with unknown characters are presented at its input (scanner). OCR systems are widely used for data entry from printed records such as passports, invoices and books. Their output is used for storage as well as for a number of ML-related tasks, such as automatic machine translation, text-to-speech conversion and text mining.
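To make the train-then-predict cycle concrete, here is a minimal sketch of one very simple classifier, the nearest-centroid rule. It is an illustration only, not the method used in real OCR systems: the two numeric features per character and the labels `"I"` and `"O"` are hypothetical stand-ins for measurements extracted from text images. The learned parameters are simply one mean vector per class.

```python
from collections import defaultdict


def train_nearest_centroid(features, labels):
    """Learn one mean vector (centroid) per class from labeled training data."""
    sums = {}
    counts = defaultdict(int)
    for x, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + xi for s, xi in zip(sums[label], x)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Hypothetical training set: two made-up features per character image.
train_x = [[0.9, 0.2], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]]
train_y = ["I", "I", "O", "O"]

# Training estimates the parameters (the centroids) from known examples.
centroids = train_nearest_centroid(train_x, train_y)

# Prediction assigns a label to a previously unseen input.
predicted = predict(centroids, [0.85, 0.25])
```

Once `centroids` has been estimated, `predict` can be applied to any new feature vector without further access to the training data.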

With the advent of the information/knowledge society, ML has emerged as a core technology that runs across a number of scientific disciplines and almost every engineering application. To mention a few examples, applications range from neuroscience to digital communications, from robotics to fMRI data analysis, and from speech recognition and music information retrieval to medical applications and assistive technologies.

A more recent trend is that of big data applications, including areas such as smart grids, social networks and the Internet of Things. In big data applications, ML techniques have to learn from data residing in databases as large as a few petabytes (1 petabyte = 10^15 bytes) of stored information. Such applications have pushed ML to confront new challenges and problems.

The goal of this book is to offer a unifying approach to the major methods and algorithms that are currently used across these types of applications.

There are two major philosophies in handling the task of learning from data. One relies on deterministic models, which treat the unknown model parameters as fixed yet unknown quantities. This line of philosophy also encompasses probabilistic models of the so-called frequentist school of thought. The essence of these methods is that they are built around the notion of parameter optimization, and different cost functions can be adopted for such a task. A typical example, familiar to all engineers, computer scientists and statisticians from the early years of their graduate studies, is the method of least-squares (LS) estimation. A prediction model of an output variable is expressed as a weighted sum of a set of input variables, and it is defined in terms of a set of unknown weights/parameters. The weights are estimated so as to minimize the sum of squared errors over a set of input-output measurements. LS is just one instance of a cost function to be optimized. A number of alternatives, and various ways of optimizing a cost function with respect to a set of unknown parameters, are considered in depth in this book. Depending on the nature of the available data and on the available computational resources, different costs and optimization algorithms can be adopted for each specific learning task.
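As a small illustration of the least-squares idea, the sketch below fits the simplest such model, a single input variable plus a constant term, using the standard closed-form solution. The data values are made up for the example, and the function names are illustrative rather than taken from the book.

```python
def fit_least_squares(xs, ys):
    """Fit y ~ w0 + w1 * x by minimizing the sum of squared errors (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums that appear in the closed-form solution.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def sum_squared_errors(xs, ys, w0, w1):
    """The LS cost: total squared difference between measurements and predictions."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up input-output measurements, roughly following y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

w0, w1 = fit_least_squares(xs, ys)
best_cost = sum_squared_errors(xs, ys, w0, w1)
```

Any other choice of weights gives a cost at least as large as `best_cost`. Swapping `sum_squared_errors` for a different cost function changes the estimates, and in general no closed form exists, so an iterative optimizer is needed instead.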

The other major direction is to learn from the data via probabilistic arguments, where the model parameters are treated as random variables, each associated with a probability distribution known as a prior. In this way, one can build up a hierarchy of probabilistic models. This modeling philosophy is referred to as the Bayesian inference approach. Taken to its full extent, such an approach can, in principle, bypass the need for optimization by imposing appropriate priors at a high enough level of the adopted model hierarchy.

Take, as an example, the case where the output variable, given the values of the input variables, is modeled via a Gaussian probability density function (pdf). A Gaussian pdf is defined in terms of its mean value and its variance. According to the frequentist type of treatment mentioned previously, the unknown mean and variance are considered fixed, and their values are chosen to maximize the joint pdf of the received measurements, viewed as a function of these parameters. This is known as the Maximum Likelihood method (also commonly abbreviated ML, not to be confused with Machine Learning).
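For the Gaussian case, the Maximum Likelihood estimates have a well-known closed form: the sample mean, and the average squared deviation from it (dividing by the number of samples N, not N - 1). The following sketch computes both and also evaluates the log-likelihood, so that one can check that the estimates do score at least as well as nearby parameter values. The sample values are invented for the illustration.

```python
import math


def gaussian_ml_estimates(samples):
    """Maximum Likelihood estimates of the mean and variance of a Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    # The ML variance estimate divides by n (not n - 1).
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, variance


def gaussian_log_likelihood(samples, mean, variance):
    """Log of the joint Gaussian pdf of independent samples."""
    n = len(samples)
    squared_dev = sum((x - mean) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_dev / (2 * variance)


# Invented measurements.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu_hat, var_hat = gaussian_ml_estimates(samples)
ll_at_estimates = gaussian_log_likelihood(samples, mu_hat, var_hat)
```

Here the parameters are treated as fixed unknowns and a single best value is reported for each, which is exactly the point where the Bayesian treatment departs.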

In contrast, according to the Bayesian rationale, the mean and variance are themselves considered random variables, described via a new set of probability distributions that are defined in terms of a new set of unknown parameters, known as hyperparameters. In turn, these hyperparameters can also be treated as random variables, defined in terms of another set of hyperparameters, and so on. This is how the hierarchy of models is formed. At the highest level of the hierarchy, the remaining unknown parameters, which define a prior probability distribution, can be assigned fixed values. Alternatively, one can involve an optimization step of the Maximum Likelihood type.
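A one-level version of this idea can be sketched with the textbook conjugate case: Gaussian measurements with a known noise variance, and a Gaussian prior on the unknown mean. The prior's own mean and variance play the role of hyperparameters and are simply fixed by hand here, which is one of the two options described above. All numerical values are assumptions chosen for the illustration.

```python
def posterior_of_mean(samples, noise_var, prior_mean, prior_var):
    """Posterior of a Gaussian mean, given a Gaussian prior and known noise variance.

    prior_mean and prior_var are the hyperparameters of the prior.
    Returns the mean and variance of the (Gaussian) posterior.
    """
    n = len(samples)
    # Precisions (inverse variances) of the prior and of the data add up.
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    # Posterior mean: precision-weighted blend of the prior mean and the data.
    post_mean = post_var * (prior_mean / prior_var + sum(samples) / noise_var)
    return post_mean, post_var


# Invented measurements and hand-picked hyperparameters.
samples = [4.8, 5.1, 5.3, 4.9]
post_mean, post_var = posterior_of_mean(
    samples, noise_var=1.0, prior_mean=0.0, prior_var=10.0
)
sample_mean = sum(samples) / len(samples)
```

The result is a full distribution over the mean rather than a single value. Its mean lies between the prior mean and the sample mean, and its variance is smaller than the prior variance, reflecting what the data have added.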

Both of these schools of thought are treated in depth in this book. In practice, the method of choice depends very much on the type of task at hand, the number of data points available for training, and the complexity and memory requirements imposed by each method. These issues are becoming critical factors in big data applications. Every student and researcher needs to become familiar, as far as possible, with all the basic theories and methods in order to be prepared for future scientific developments; and no doubt, these are something that cannot be predicted!

The book starts from the basics, necessary for any newcomer to the field, and builds steadily to more advanced and state-of-the-art concepts and methods. This development is carried out along two of the major pillars of ML, namely classification and regression. These are the two generic tasks that form an umbrella for a large class of problems faced in ML, known as supervised learning, which is the focus of this book. Such techniques involve training data (measurements/observations) for both the output and the input variables. Unsupervised learning/clustering is another direction of ML, where only input data are available. Such techniques are introduced in the book through a classical method, the k-means algorithm; however, they are not part of the major focus. Clustering is treated extensively in a separate companion book.

For the newcomer to the field, the required mathematical prerequisite is an understanding of basic probability and linear algebra. This is sufficient to cover all the basic material addressed in the book. To refresh the memory, a chapter summarizing the fundamentals of probability and statistics is provided at the beginning of the book. Necessary linear algebra definitions and formulae are also summarized in an appendix. More advanced methods may require some further mathematical skills, and the related material is discussed and explained in appendices; such methods are intended for more experienced readers who are interested in delving deeper into more advanced and state-of-the-art techniques.
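For readers who have not met it, the k-means algorithm mentioned above can be sketched in a few lines. This is a minimal version of the standard iteration (often called Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its points, and repeat until the assignments stop changing. The data points and the choice of initial centroids are assumptions made for the example; note that no output labels are used.

```python
def k_means(points, initial_centroids, max_iters=100):
    """Cluster points by alternating assignment and centroid-update steps."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(
                range(len(centroids)),
                key=lambda k: sum(
                    (p - c) ** 2 for p, c in zip(point, centroids[k])
                ),
            )
            for point in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Invented two-dimensional input data forming two visible groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]

# Assumed initialization: one point taken from each end of the list.
centroids, assignments = k_means(points, [points[0], points[3]])
```

The result depends on the initial centroids, so in practice the procedure is usually run from several different starting points.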

Chapters are as self-contained as possible. This serves a twofold purpose. Researchers who are interested in a specific type of method or problem can quickly identify what they need most and focus on the respective chapters. At the same time, the book can serve the needs of different courses, such as pattern recognition, adaptive and statistical signal processing, Bayesian learning and graphical models, and sparsity-aware learning.

Special attention is paid to various aspects associated with learning from big data, such as online learning, distributed learning and dimensionality reduction. Two of the most powerful techniques in ML, namely nonlinear modeling in reproducing kernel Hilbert spaces and deep neural network architectures, are treated in separate chapters. Deep learning was selected by MIT Technology Review as one of its ten breakthrough technologies for 2013. The goal in deep learning is to build input-output relation models that mimic the many-layered structure of neurons in the neocortex, which accounts for about 80% of our brain and is where thinking occurs and what we call intelligence is formed.

A number of case studies are discussed and serve as a vehicle to demonstrate ML methods in the context of practical applications. Some examples are: echo cancellation and channel equalization, image de-noising and de-blurring, time-frequency analysis of echolocation signals transmitted by bats, optical character recognition (OCR), change point detection, text authorship identification, protein folding prediction in bioinformatics, hyperspectral image unmixing, and fMRI data analysis.

In writing this book I wanted to address the needs of advanced graduate and postgraduate students as well as researchers in the field of ML. The book is the outcome of many years of research, participation in international projects and teaching experience in computer science and engineering departments at different universities, both for graduate and postgraduate courses, as well as many short courses for industry-related audiences.

The book is written to satisfy the needs of the reader who wants to learn the methods in depth, and proofs are provided either in the text or in the problems. Readers who are not interested in proofs can simply bypass them; as much as possible, the various methods are also explained in terms of physical reasoning that facilitates understanding without resorting to proofs. Moreover, a number of MATLAB exercises are given as part of the problems, and the MATLAB code will also be available via the website of the book.

Sergios’ book *Machine Learning: A Bayesian and Optimization Perspective* is available for purchase on Google Play. Through July 31st, get 40% off on this and all Elsevier titles on Google Play.

**About the Author**

Sergios Theodoridis is Professor of Signal Processing and Machine Learning in the Department of Informatics and Telecommunications of the University of Athens. He is the co-author of the bestselling book *Pattern Recognition* and of *Introduction to Pattern Recognition: A MATLAB Approach*. He serves as Editor-in-Chief for the IEEE Transactions on Signal Processing, and he is co-Editor-in-Chief, with Rama Chellappa, of the Academic Press Library in Signal Processing.

He has received a number of awards, including the 2014 IEEE Signal Processing Magazine Best Paper Award, the 2009 IEEE Computational Intelligence Society Transactions on Neural Networks Outstanding Paper Award, the 2014 IEEE Signal Processing Society Education Award, and the EURASIP 2014 Meritorious Service Award. He has served as a Distinguished Lecturer for the IEEE Signal Processing Society and the IEEE Circuits and Systems Society. He is a Fellow of EURASIP and a Fellow of IEEE.
