#### Graduate Project Title

#### Document Type

Graduate Project

#### Date of Degree Completion

Fall 2017

#### Degree Name

Master of Science (MS)

#### Department

Computational Science

#### Committee Chair

Boris Kovalerchuk

#### Second Committee Member

Christos Graikos

#### Third Committee Member

Razvan Andonie

#### Abstract

Occlusion is one of the major problems for interactive visual knowledge discovery and data mining in the process of finding patterns in multidimensional data. This project proposes a hybrid method, called GLC-S, that combines visual and analytical means to deal with occlusion in visual knowledge discovery. GLC-S visualizes n-D data in 2D in a set of Shifted Paired Coordinates (SPC). A set of Shifted Paired Coordinates for n-D data consists of n/2 pairs of common Cartesian coordinates that are shifted relative to each other to avoid their overlap. Each n-D point A is represented as a directed graph A* in SPC, where each node is the 2D projection of A in a respective pair of the Cartesian coordinates.
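The SPC representation can be illustrated with a minimal sketch. The function names, the shift values, and the handling of an odd number of coordinates (repeating the last coordinate to complete the final pair) are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def to_spc_graph(point: Sequence[float],
                 shifts: Sequence[Point2D]) -> List[Point2D]:
    """Map an n-D point to the ordered 2D nodes of its SPC graph.

    `shifts` holds one hypothetical (dx, dy) offset per coordinate pair.
    """
    coords = list(point)
    if len(coords) % 2 == 1:
        # Assumption: for odd n, repeat the last coordinate to form a pair.
        coords.append(coords[-1])
    pairs = [(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)]
    if len(shifts) != len(pairs):
        raise ValueError("need exactly one shift per coordinate pair")
    # Each node is the 2D projection of the point in its shifted pair.
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(pairs, shifts)]


def spc_edges(nodes: Sequence[Point2D]) -> List[Tuple[Point2D, Point2D]]:
    """Directed edges connect consecutive nodes of the graph."""
    return list(zip(nodes, nodes[1:]))


def from_spc_graph(nodes: Sequence[Point2D],
                   shifts: Sequence[Point2D],
                   n: int) -> List[float]:
    """Restore the n-D point from its 2D nodes by undoing the shifts."""
    coords: List[float] = []
    for (u, v), (dx, dy) in zip(nodes, shifts):
        coords.extend([u - dx, v - dy])
    return coords[:n]  # drop the padded coordinate when n is odd


shifts = [(0, 0), (10, 0)]
nodes = to_spc_graph((1, 2, 3, 4), shifts)
restored = from_spc_graph(nodes, shifts, 4)
```

Here `restored` equals the original point, which is the sense in which the mapping loses no information once the shifts are known.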

The proposed GLC-S method significantly decreases the cognitive load of analyzing n-D data and simplifies pattern discovery in n-D data. The GLC-S method iteratively splits n-D data into non-overlapping clusters (hyper-rectangles) around local centers and visualizes only the data within these clusters at each iteration. Each cluster must contain cases of only one class and be the largest cluster with this property in the SPC visualization.
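The purity requirement on a hyper-rectangle can be sketched as follows. The abstract does not say how the largest pure cluster is found (the process is interactive), so the greedy growth strategy and all names below are assumptions for illustration and should not be read as the GLC-S procedure itself.

```python
from typing import List, Sequence, Tuple

Box = List[Tuple[float, float]]  # one (low, high) interval per coordinate


def contains(box: Box, p: Sequence[float]) -> bool:
    """True if point p lies inside the hyper-rectangle (bounds inclusive)."""
    return all(lo <= x <= hi for (lo, hi), x in zip(box, p))


def is_pure(box: Box, points, labels, target) -> bool:
    """True if every point inside the box belongs to the target class."""
    return all(lab == target
               for p, lab in zip(points, labels) if contains(box, p))


def grow_pure_box(center_idx: int, points, labels) -> Box:
    """Greedy sketch: expand a box around a local center, nearest
    same-class points first, skipping any expansion that would admit
    a case of another class."""
    target = labels[center_idx]
    center = points[center_idx]
    box: Box = [(x, x) for x in center]
    same_class = [i for i, lab in enumerate(labels) if lab == target]
    same_class.sort(key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(points[i], center)))
    for i in same_class:
        trial = [(min(lo, x), max(hi, x))
                 for (lo, hi), x in zip(box, points[i])]
        if is_pure(trial, points, labels, target):
            box = trial
    return box


points = [(1, 1), (2, 2), (3, 1), (8, 8), (9, 9), (10, 10)]
labels = ["a", "a", "a", "b", "b", "a"]
box = grow_pure_box(0, points, labels)
```

In this toy data the class "a" point at (10, 10) is left outside the box, because including it would also enclose the two class "b" points.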

Such sequential splitting allows (1) avoiding occlusion, (2) finding visually local classification patterns and rules, and (3) combining local sub-rules into a global rule that classifies all given data of two or more classes. Computational experiments with the Wisconsin Breast Cancer data (9-D), User Knowledge Modeling data (6-D), and Letter Recognition data (17-D) from the UCI Machine Learning Repository confirm this capability. At each iteration, these data were split into training (70%) and validation (30%) data. The method required 3 iterations for the Wisconsin Breast Cancer data, 4 iterations for the User Knowledge Modeling data, and 5 iterations for the Letter Recognition data, producing 3, 4, and 5 local sub-rules, respectively, that covered over 95% of all n-D data points with 100% accuracy in both training and validation experiments. After each iteration, the data used in that iteration are removed and the remaining data are used in the next iteration. This removal also helps to decrease occlusion. The GLC-S algorithm refuses to classify the remaining cases that are not covered by these rules, i.e., that do not belong to the found hyper-rectangles. The interactive visualization process in SPC allows adjusting the sides of a hyper-rectangle to maximize its size without overlapping the hyper-rectangles of the opposing classes.
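The way local sub-rules combine into a global rule with a refusal option can be sketched as below. The rule representation (a list of hyper-rectangles paired with class labels), the function names, and the toy data are hypothetical; they are not taken from the project.

```python
from typing import List, Optional, Sequence, Tuple

Box = List[Tuple[float, float]]  # one (low, high) interval per coordinate
Rule = Tuple[Box, str]           # hyper-rectangle and the class it predicts


def contains(box: Box, p: Sequence[float]) -> bool:
    """True if point p lies inside the hyper-rectangle (bounds inclusive)."""
    return all(lo <= x <= hi for (lo, hi), x in zip(box, p))


def classify(rules: List[Rule], p: Sequence[float]) -> Optional[str]:
    """Global rule: return the class of the first sub-rule covering p,
    or None to refuse classification when no sub-rule covers it."""
    for box, label in rules:
        if contains(box, p):
            return label
    return None


def remove_covered(rules: List[Rule], points, labels):
    """Keep only the cases not covered by any sub-rule found so far;
    these are the data passed on to the next iteration."""
    kept = [(p, lab) for p, lab in zip(points, labels)
            if classify(rules, p) is None]
    return [p for p, _ in kept], [lab for _, lab in kept]


def coverage_and_accuracy(rules: List[Rule], points, labels):
    """Fraction of cases covered, and accuracy on the covered cases."""
    preds = [classify(rules, p) for p in points]
    covered = [(pr, lab) for pr, lab in zip(preds, labels) if pr is not None]
    coverage = len(covered) / len(points)
    accuracy = (sum(pr == lab for pr, lab in covered) / len(covered)
                if covered else 0.0)
    return coverage, accuracy


rules = [([(1, 3), (1, 2)], "benign"), ([(8, 9), (8, 9)], "malignant")]
points = [(2, 1), (8, 9), (5, 5), (3, 2)]
labels = ["benign", "malignant", "benign", "benign"]
coverage, accuracy = coverage_and_accuracy(rules, points, labels)
remaining_points, remaining_labels = remove_covered(rules, points, labels)
```

Reporting coverage separately from accuracy mirrors how the results above are stated: accuracy is measured only on the cases that fall inside some hyper-rectangle, while the uncovered cases are counted as refusals.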

The GLC-S method splits data using a fixed split of the n coordinates into pairs. This hybrid visual and analytical approach avoids throwing all data of several classes into a single visualization plot, which typically ends up as a messy, highly occluded picture that hides useful patterns. This approach allows revealing these hidden patterns.

The visualization process in SPC is reversible (lossless), i.e., all n-D information is visualized in 2D and can be restored from the 2D visualization for each n-D case. This hybrid visual analytics method allows classifying n-D data in a way that can be communicated to users in an understandable, visual form.
