Statistics

High dimensional pattern learning applied to symbolic time-series

Published on

Author: Amir Dib

While the adoption of machine learning in applied contexts has grown rapidly over the last decade, using it in certain industrial settings remains challenging. The main reason is the clash between established historical procedures and the uncertainty and lack of transparency of a machine learning pipeline's decision process. Another reason is that the input required by a traditional machine learning model does not match the type or quality of the available data. Most industrial databases were not developed for statistical analysis but to comply with regulatory requirements and to perform administrative tasks. In particular, non-numerical or symbolic features are common, as they are a versatile way of recording events of interest. Examples of such data are textual documents, sequences of log events and DNA sequences. The complexity of learning relevant information from symbols is typically dominated by the exponential number of possible patterns. The applied setting and primary motivation of this thesis is the design of efficient, human-readable and computationally tractable methods for predictive maintenance on the French train fleet. To that end, we propose to go beyond standard approaches by combining traditional machine learning algorithms with pattern mining techniques, allowing human experts to understand and interact with the algorithmic layer of the predictive maintenance pipeline. The main objective of this thesis is to tackle these issues by proposing approaches that can be applied generally to symbolic sequence data, produce human-readable output and can be trained at a reasonable computational cost. We begin by constructing a complete machine learning pipeline for predictive maintenance on a large fleet of rail vehicles that can be computed at a reasonable cost and provides valuable insight into the underlying symbol dynamics of the degradation process.
As a second contribution, we propose a new method for symbolic data sets, based on a Bayesian generative model for patterns, that can increase score accuracy in an interpretable fashion for any symbolic data set. As a third contribution, we introduce a new progressive mining method based on local complexities to obtain sharper statistical bounds on pattern frequency. Finally, we propose a new and general stochastic optimization method based on alternative sampling. This method can be applied to the specific use case of Bayesian learning in the Variational Inference setting. In this instance, we provide theoretical and empirical evidence that this approach outperforms the most advanced methods.