Outline the paper
With the rapid development of emerging technologies such as artificial intelligence, medical advances, and the IoT, ever more data has become available. This has led major industries to embrace data-driven decision-making in order to draw accurate conclusions. Hence, in a variety of real-world applications it is of great interest to recognize and isolate data that exhibits abnormal or exceptional behaviour, which often reveals interesting facts; examples include fraud discovery, image processing, signal analysis, network intrusion detection, detection of measurement errors in sensor data, and machine learning modelling, to name a few. A data point that appears to be inconsistent with the remainder of the data set is known as an outlier. In this work, we propose a novel yet effective learning algorithm for outlier detection in multivariate data where the number of attributes is greater than or equal to 3.
Figure 1: (a) Synthetic data 1. (b) Synthetic data 2. All outliers have been identified correctly (denoted by red triangles).
Who will it help?
The adoption of machine learning has been revolutionary across all industries. Hence, there is growing demand for data analysis, data preparation, and pre-processing techniques, which concern everyone working in those fields, especially in technology and medicine. Over the years, many outlier detection methods have been introduced in different research communities, with each author claiming their method to be flawless. Yet most (if not all) of these methods have some drawbacks. This algorithm, however, is an entirely new approach that has been shown mathematically, experimentally, and through data analysis to be very promising and competitive. This novel algorithm, Rotation-based Outlier Detection (ROD), is parameter-free, requires no statistical distribution assumptions, and is intuitive in three-dimensional space.
What is the future of this research?
One potential future use is as a distance metric for the clustering algorithms used in data mining across all industries. ROD can also be used to investigate ways of constructing efficient data analysis methodologies, which play a significant role in gaining actionable insights from data, since every data analytics project aims to make decision-making easier. ROD is mainly intended as a robust algorithm for anomaly and outlier detection in multivariate data. However, ROD is, after all, a metric that can be utilized in areas other than outlier detection. A potential use case is the KNN algorithm, which originally uses the Euclidean distance as a proximity measure between data points, a measure that is not robust in high dimensions. Since ROD has exhibited robustness in high dimensions, it is a strong candidate, provided it is adapted properly. However, KNN is extremely fast, whereas the time complexity of ROD is high. To utilize ROD in such fast algorithms, the underlying theory and mathematical properties must be established, investigated, and tested experimentally. For example, choosing sub-dimensions of interest should reduce the time complexity noticeably, though this is still an open question that requires extensive work.
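To illustrate what using a different proximity measure in KNN would involve, here is a minimal sketch of a KNN classifier whose distance function is a parameter. It is not the ROD algorithm: the names `knn_predict`, `euclidean`, and `manhattan` are hypothetical, and Manhattan distance stands in only to show the swap. A ROD-derived measure is not defined in this summary, so it is left as a comment rather than guessed at.

```python
import math
from collections import Counter
from typing import Callable, Hashable, Sequence

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance, the usual default proximity measure in KNN."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Point, b: Point) -> float:
    """Manhattan distance, used here only as a stand-in alternative metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(
    train_X: Sequence[Point],
    train_y: Sequence[Hashable],
    query: Point,
    k: int = 3,
    metric: Metric = euclidean,
) -> Hashable:
    """Return the majority label among the k training points nearest to query.

    `metric` is any function taking two points and returning a non-negative
    number. Assumption: a ROD-based measure, if one were derived, could be
    passed here in the same way; this sketch does not implement it.
    """
    if len(train_X) != len(train_y):
        raise ValueError("train_X and train_y must have the same length")
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training indices by distance to the query (one metric call per point).
    ranked = sorted(range(len(train_X)), key=lambda i: metric(train_X[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy three-attribute data: two well-separated groups.
    X = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (5, 5, 5), (5, 6, 5), (6, 5, 5)]
    y = ["a", "a", "a", "b", "b", "b"]
    print(knn_predict(X, y, (0.5, 0.5, 0.0), k=3))
    print(knn_predict(X, y, (5.0, 5.0, 6.0), k=3, metric=manhattan))
```

Each prediction calls the metric once per training point, which is why the cost of the metric matters: a measure that is expensive to evaluate is multiplied across the whole training set for every query.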
Publication Title: A Novel Algorithm for Outlier Detection in Multivariate Data – Rotation-based Outlier Detection.
Publication Date: 3 November 2020
Journal: IEEE Transactions on Knowledge and Data Engineering
Link to publication: https://ieeexplore.ieee.org/document/9250609