Document Type

Thesis

Date of Degree Completion

Spring 2022

Degree Name

Master of Science (MS)

Department

Computational Science

Committee Chair

Boris Kovalerchuk

Second Committee Member

Razvan Andonie

Third Committee Member

Szilard Vajda

Abstract

This research contributes to interpretable machine learning via visual knowledge discovery in General Line Coordinates (GLC). The concepts of hyperblocks as interpretable dataset units and GLC are combined to create a visual self-service machine learning model. Two variants of GLC, known as Dynamic Scaffold Coordinates (DSC), are proposed: DSC1 and DSC2. Both can losslessly map multiple dataset attributes onto a single two-dimensional (X, Y) Cartesian plane using a dynamic scaffolding graph construction algorithm.

Hyperblock analysis is used to determine visually appealing dataset attribute orders and to reduce line occlusion. It is shown that hyperblocks can generalize decision tree rules and that a series of DSC1 or DSC2 plots can losslessly visualize n-D data in accordance with a decision tree model. For large decision trees with many branches, such as those for the MNIST handwritten digits dataset, hyperblock discovery was hampered. In these cases, dimensionality reduction techniques such as principal component analysis, singular value decomposition, and t-distributed stochastic neighbor embedding were used to create new attributes of interest for visual class separation.

A major benefit of DSC1 and DSC2 is their highly interpretable nature. They allow domain experts to control or establish new machine learning models through visual pattern discovery. A software package, the Dynamic Scaffold Coordinates Visualization System (DSCViz), was created to showcase the DSC1 and DSC2 systems. DSCViz expands the end-user’s capabilities by offering several functions such as real-time drag and zoom, scaling techniques, sample clipping, attribute reordering, and the ability to hide classes or change their colors. DSC2 was used to estimate and visualize the worst-case validation splits in the Wisconsin Breast Cancer, Iris, and Seeds datasets. DSC2 was also applied to the MNIST handwritten digits dataset to determine its feasibility for large datasets. In general, the technique of estimating worst-case validation splits is important for every high-risk application.
