Graduate Project Title

Visual Data Mining

Document Type

Graduate Project

Date of Degree Completion

Fall 2017

Degree Name

Master of Science (MS)

Department

Computational Science

Committee Chair

Boris Kovalerchuk

Second Committee Member

Christos Graikos

Third Committee Member

Razvan Andonie

Abstract

Occlusion is one of the major problems for interactive visual knowledge discovery and data mining in the process of finding patterns in multidimensional data. This project proposes a hybrid method, called GLC-S, that combines visual and analytical means to deal with occlusion in visual knowledge discovery. GLC-S visualizes n-D data in 2D in a set of Shifted Paired Coordinates (SPC). A set of Shifted Paired Coordinates for n-D data consists of n/2 pairs of common Cartesian coordinates that are shifted relative to each other to avoid their overlap. Each n-D point A is represented as a directed graph A* in SPC, where each node is the 2D projection of A in the respective pair of Cartesian coordinates.
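The mapping of an n-D point to its SPC graph can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names `spc_nodes` and `spc_edges`, the representation of each shift as an (x, y) offset of a pair's origin, and the restriction to even n are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple

Node = Tuple[float, float]


def spc_nodes(point: Sequence[float], shifts: Sequence[Node]) -> List[Node]:
    """Project an n-D point onto n/2 shifted coordinate pairs.

    Assumption: pair k uses coordinates (2k, 2k+1) and its origin is
    moved by shifts[k]; the abstract does not specify these details.
    """
    if len(point) % 2 != 0:
        raise ValueError("this sketch handles only an even number of dimensions")
    if len(shifts) != len(point) // 2:
        raise ValueError("one shift is needed per coordinate pair")
    nodes = []
    for k, (sx, sy) in enumerate(shifts):
        x, y = point[2 * k], point[2 * k + 1]
        nodes.append((x + sx, y + sy))  # 2D location of node k on the plot
    return nodes


def spc_edges(nodes: Sequence[Node]) -> List[Tuple[Node, Node]]:
    """Connect consecutive nodes with directed edges (node k -> node k+1)."""
    return list(zip(nodes[:-1], nodes[1:]))


nodes = spc_nodes([1, 2, 3, 4, 5, 6], [(0, 0), (10, 0), (20, 0)])
edges = spc_edges(nodes)
```

Here a 6-D point becomes a graph with three nodes and two directed edges, one node per coordinate pair.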

The proposed GLC-S method significantly decreases the cognitive load of analyzing n-D data and simplifies pattern discovery in n-D data. The GLC-S method iteratively splits n-D data into non-overlapping clusters (hyper-rectangles) around local centers and visualizes only data within these clusters at each iteration. Each cluster is required to contain cases of only one class and to be the largest cluster with this property in the SPC visualization.

Such sequential splitting allows: (1) avoiding occlusion, (2) finding visually local classification patterns and rules, and (3) combining local sub-rules into a global rule that classifies all given data of two or more classes. Computational experiments with Wisconsin Breast Cancer data (9-D), User Knowledge Modeling data (6-D), and Letter Recognition data (17-D) from the UCI Machine Learning Repository confirm this capability. At each iteration, these data were split into training (70%) and validation (30%) data. It required 3 iterations for the Wisconsin Breast Cancer data, 4 iterations for the User Knowledge Modeling data, and 5 iterations for the Letter Recognition data, yielding 3, 4, and 5 local sub-rules, respectively, that covered over 95% of all n-D data points with 100% accuracy in both training and validation experiments. After each iteration, the data used in that iteration are removed and the remaining data are used in the next iteration. This removal process also helps to decrease occlusion. The GLC-S algorithm refuses to classify remaining cases that are not covered by these rules, i.e., cases that do not belong to the found hyper-rectangles. The interactive visualization process in SPC allows adjusting the sides of a hyper-rectangle to maximize its size without overlapping the hyper-rectangles of the opposing classes.
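The way local hyper-rectangle sub-rules might combine into a global rule that refuses uncovered cases can be sketched as follows. This is an illustrative assumption, not the GLC-S code: the `Rule` tuple layout, the function names, and the example bounds and class labels are all hypothetical, and the interactive search for the largest pure hyper-rectangle is not modeled.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical rule layout: (class label, lower bounds, upper bounds).
Rule = Tuple[str, Sequence[float], Sequence[float]]


def covers(rule: Rule, point: Sequence[float]) -> bool:
    """True if the point lies inside the rule's hyper-rectangle."""
    _, lower, upper = rule
    return all(lo <= v <= hi for lo, v, hi in zip(lower, point, upper))


def classify(rules: Sequence[Rule], point: Sequence[float]) -> Optional[str]:
    """Apply local sub-rules in order; return None to refuse uncovered cases."""
    for rule in rules:
        if covers(rule, point):
            return rule[0]
    return None


def split_covered(rule: Rule, data: Sequence[Sequence[float]]):
    """Separate the cases covered at this iteration from those left for the next."""
    covered = [p for p in data if covers(rule, p)]
    remaining = [p for p in data if not covers(rule, p)]
    return covered, remaining


# Made-up 2-D rules and labels, for illustration only.
rules: List[Rule] = [("class A", [0, 0], [3, 3]), ("class B", [5, 5], [9, 9])]
covered, remaining = split_covered(rules[0], [[1, 1], [6, 6]])
```

Returning `None` for a point outside every hyper-rectangle mirrors the refusal behavior described above, and `split_covered` mirrors removing the data used in one iteration before the next one starts.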

The GLC-S method splits data using a fixed split of the n coordinates into pairs. This hybrid visual and analytical approach avoids throwing all data of several classes into a single visualization plot, which typically ends up as a messy, highly occluded picture that hides useful patterns. Instead, the approach reveals these hidden patterns.

The visualization process in SPC is reversible (lossless), i.e., all n-D information is visualized in 2D and can be restored from the 2D visualization for each n-D case. This hybrid visual analytics method allowed classifying n-D data in a way that can be communicated to the user in an understandable, visual form.
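The reversibility claim can be illustrated with a small round-trip sketch. The names `to_spc` and `from_spc` and the encoding of shifts as (x, y) offsets are assumptions for illustration; only even n is handled, and with floating-point inputs the round trip is exact only up to rounding.

```python
from typing import List, Sequence, Tuple

Node = Tuple[float, float]


def to_spc(point: Sequence[float], shifts: Sequence[Node]) -> List[Node]:
    """Forward map: one shifted 2D node per coordinate pair (assumed layout)."""
    if len(point) != 2 * len(shifts):
        raise ValueError("point length must be twice the number of shifts")
    return [(point[2 * k] + sx, point[2 * k + 1] + sy)
            for k, (sx, sy) in enumerate(shifts)]


def from_spc(nodes: Sequence[Node], shifts: Sequence[Node]) -> List[float]:
    """Inverse map: subtract each shift to recover the original n-D values."""
    if len(nodes) != len(shifts):
        raise ValueError("one shift is needed per node")
    point: List[float] = []
    for (x, y), (sx, sy) in zip(nodes, shifts):
        point.extend([x - sx, y - sy])
    return point


shifts = [(0, 0), (10, 5)]
original = [1, 2, 3, 4]
restored = from_spc(to_spc(original, shifts), shifts)
```

Because the shifts are known and every coordinate appears in exactly one node, no information is discarded by the forward map.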
