*Gaussian-Based Visualization of Gaussian and Non-Gaussian Model-Based Clustering*

**Description:**

*Authors*: **Christophe Biernacki**, **Matthieu Marbac** and **Vincent Vandewalle**. *License*: GPL-2. *Download ClusVis 1.0.1*: link.

*Reference*: todo

- Introduction.
- Visualization of model-based clustering of categorical data.
- Visualization of model-based clustering of functional data.

All the experiments are run with the R package ClusVis 1.0.

**ClusVis** implements a method to visualize, on a map, the results of Gaussian or non-Gaussian model-based clustering. It assumes that the clustering is performed with a mixture model \(f\) defined on the native space of the variables (possibly non-continuous). The visualization relies on a constrained Gaussian mixture \(g\) whose component centers have to be estimated. The aim is to estimate the parameters of \(g\) so that \(g\) is as similar as possible to \(f\) for the clustering purpose. In clustering, the target is to model the probabilities of classification, so we want \(f\) and \(g\) to be similar with respect to this quantity. However, \(f\) and \(g\) are not defined on the same space, so the parameters of \(g\) cannot be estimated directly from the distribution \(f\). Instead, \(f\) and \(g\) each define a distribution for the ratios of probabilities of classification, denoted \(p\) and \(q\) respectively, and the parameters of \(g\) are chosen to minimize the Kullback-Leibler divergence from \(q\) to \(p\). The distribution \(g\) is chosen to be a constrained Gaussian mixture in order to have a one-to-one relation between \(g\) and \(q\), to facilitate the interpretation, and to make the parameters easy to estimate. The use of alternative continuous distributions is also discussed. The distribution \(g\) is represented on a map by projection, and the proposed method provides two graphs. The first graph represents the component overlaps and the risk of misclassification. The second graph displays the scatter-plot of the observations and different levels of the probabilities of classification.
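To make the link between a constrained Gaussian mixture and the ratios of probabilities of classification more concrete, here is a minimal base-R sketch. It assumes identity covariance matrices (an assumption of this sketch; the text only says "constrained") and uses made-up proportions and centers. The helper names `post_prob` and `log_ratio` are hypothetical and are not part of ClusVis.

```r
# Toy constrained Gaussian mixture g: K = 3 components in R^2.
# Assumption of this sketch: every component has identity covariance.
# All numbers are illustrative, not ClusVis output.
prop <- c(0.5, 0.3, 0.2)                    # mixing proportions
mu   <- rbind(c(2, 0), c(0, 2), c(0, 0))    # component centers (one per row)

# Probabilities of classification t_k(y) of a point y under g
post_prob <- function(y, prop, mu) {
  logdens <- log(prop) - 0.5 * rowSums(sweep(mu, 2, y)^2)
  w <- exp(logdens - max(logdens))          # subtract the max for numerical stability
  w / sum(w)
}

# Log-ratios log(t_k(y) / t_K(y)) for k = 1, ..., K - 1
log_ratio <- function(y, prop, mu) {
  tk <- post_prob(y, prop, mu)
  K <- length(tk)
  log(tk[-K] / tk[K])
}

y  <- c(1, 0.5)
lr <- log_ratio(y, prop, mu)

# With identity covariances the log-ratios are linear in y:
# log(pi_k / pi_K) + (mu_k - mu_K)' y - (||mu_k||^2 - ||mu_K||^2) / 2
K <- nrow(mu)
diff_mu <- sweep(mu[-K, , drop = FALSE], 2, mu[K, ])
lr_closed <- log(prop[-K] / prop[K]) + as.vector(diff_mu %*% y) -
  0.5 * (rowSums(mu[-K, , drop = FALSE]^2) - sum(mu[K, ]^2))
```

Under this assumption the log-ratios are an affine function of `y`, which gives some intuition for why a Gaussian choice for \(g\) keeps the relation between \(g\) and \(q\) simple.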

**ClusVis** is illustrated with the analysis of the Congress data set, composed of the votes of each of the 435 members of the U.S. House of Representatives on 16 key votes. For each vote, three levels are considered: yea, nay or unknown disposition.

**Loadings and Model-based clustering of categorical data with Rmixmod**

```
rm(list=ls())
# Data and Package loadings
library(ClusVis)
require(Rmixmod)
```

`Loading required package: Rmixmod`

`Loading required package: Rcpp`

```
Rmixmod version 2.1.1 loaded
R package of mixmodLib version 3.2.2
Condition of use
----------------
Copyright (C) MIXMOD Team - 2001-2013
MIXMOD is publicly available under the GPL license (see www.gnu.org/copyleft/gpl.html)
You can redistribute it and/or modify it under the terms of the GPL-3 license.
Please understand that there may still be bugs and errors. Use it at your own risk.
We take no responsibility for any errors or omissions in this package or for any misfortune that may befall you or others as a result of its use.
Please report bugs at: http://www.mixmod.org/article.php3?id_article=23
More information on : www.mixmod.org
```

```
data(congress)
# clustering
resmix <- mixmodCluster(congress, 4)
```

**Inference of the parameters used to visualize**

The parameters used for the visualization are estimated directly from the results of the function `mixmodCluster`.

` resvisu <- clusvisMixmod(resmix)`

**Component interpretation graph**

ClusVis implements a component interpretation graph which represents, on the most discriminating space, the centers of the projected Gaussian components (numbers). This graph shows the proximities between components, because these centers are estimated with respect to the overlaps between the components of \(f\). Moreover, we also represent the 95% confidence level (a black border separating the area outside the confidence level, in white, from the area inside the confidence level, in gray levels) and the curves of iso-probability of classification of the resulting mixture (gray levels). Finally, the accuracy of this representation is given by the difference between entropies \(E(f) - E(\tilde{g})\) and the percentage of inertia by axis. We now illustrate this proposal on the running example.

`plotDensityClusVisu(resvisu, add.obs = FALSE, positionlegend = "bottomright")`
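The entropy difference used as an accuracy indicator can be illustrated with a small base-R sketch. It assumes that the entropy is the mean Shannon entropy of the probabilities of classification; the exact normalization used by ClusVis is not stated here, and the function `classification_entropy` and both toy matrices are hypothetical.

```r
# Mean entropy of a matrix of probabilities of classification
# (one row per observation, one column per component).
# Assumption of this sketch: plain Shannon entropy averaged over observations.
classification_entropy <- function(tik) {
  terms <- ifelse(tik > 0, tik * log(tik), 0)   # convention: 0 * log(0) = 0
  -mean(rowSums(terms))
}

# Made-up probabilities under f (native space) and under the projected mixture
tik_f <- rbind(c(0.90, 0.05, 0.05),
               c(0.10, 0.80, 0.10),
               c(0.30, 0.30, 0.40))
tik_g <- rbind(c(0.95, 0.03, 0.02),
               c(0.05, 0.90, 0.05),
               c(0.20, 0.30, 0.50))

# A value close to zero suggests that the map preserves the amount of overlap
entropy_gap <- classification_entropy(tik_f) - classification_entropy(tik_g)
```

In this toy example the projected probabilities are sharper than the original ones, so the gap is positive, meaning the map would understate the overlap between components.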

**Iso-probabilities of classification and scatter-plot**

ClusVis implements a scatter-plot of the observation memberships. The observations are represented by their coordinates arising from the LDA of \(g\). The partition defined by the MAP rule is given by the color of each dot. Moreover, the uncertainty of classification is conveyed by the curves of iso-probability of classification. Finally, the accuracy of the visualization is given by the difference between entropies and the percentages of inertia.

`plotDensityClusVisu(resvisu, add.obs = TRUE, positionlegend = "bottomright")`
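As a reminder of how the MAP rule turns probabilities of classification into a partition, here is a minimal base-R sketch. The matrix `tik` is made up for illustration and is not extracted from `resmix` or `resvisu`.

```r
# Made-up probabilities of classification: 3 observations, 3 components
tik <- rbind(c(0.7, 0.2, 0.1),
             c(0.1, 0.3, 0.6),
             c(0.4, 0.5, 0.1))

# MAP rule: assign each observation to its most probable component
map_partition <- apply(tik, 1, which.max)

# A simple per-observation uncertainty: one minus the largest probability
uncertainty <- 1 - apply(tik, 1, max)
```

Observations with a large `uncertainty` are those that would fall close to the boundaries drawn by the iso-probability curves.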

We now consider the bike sharing system (BSS) data. We analyze station occupancy data collected over the course of one month on the bike sharing system in Paris. The data were collected over 5 weeks, between February 24 and March 30, 2014, on 1189 bike stations. The station status information, in terms of available bikes and docks, was downloaded every hour during the study period for the seven systems from the open-data APIs provided by the JCDecaux company. To accommodate the varying station sizes (in terms of the number of docking points), the number of available bikes was normalized by the station size, which gives a loading profile for each station. The final data set contains 1189 loading profiles, one per station, sampled at 1448 time points. Notice that the sampling is not perfectly regular; there is an hour, on average, between two consecutive sample points. The daily and weekly habits of inhabitants introduce a periodic behavior in the BSS station loading profiles, with a natural period of one week. It is then natural to use a Fourier basis to smooth the curves, with basis functions corresponding to sine and cosine functions of periods equal to fractions of this natural period of the data. Using such a procedure, the profiles of the stations were projected on a basis of 25 Fourier functions.
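The Fourier smoothing step can be sketched in base R as follows. This is only an illustration on a simulated profile: the helper `fourier_design` is hypothetical, the basis is not normalized the way the `fda` package normalizes its Fourier basis, and the signal is made up rather than taken from the bike sharing data.

```r
# Simulated loading profile: two weeks of hourly samples, weekly period of 168 hours
period <- 168
tt <- 0:(2 * period - 1)

# Design matrix: a constant plus sine/cosine pairs whose periods are
# fractions of the natural period (period, period / 2, ..., period / n_pairs)
fourier_design <- function(tt, period, n_pairs) {
  cols <- lapply(seq_len(n_pairs), function(j) {
    cbind(sin(2 * pi * j * tt / period), cos(2 * pi * j * tt / period))
  })
  cbind(1, do.call(cbind, cols))
}

X <- fourier_design(tt, period, n_pairs = 12)   # 1 + 2 * 12 = 25 basis functions

# Noisy profile with a weekly component and a daily one (period / 7)
set.seed(1)
profile <- 0.5 + 0.2 * sin(2 * pi * tt / period) +
  0.1 * cos(2 * pi * 7 * tt / period) + rnorm(length(tt), sd = 0.02)

coefs    <- qr.solve(X, profile)      # least-squares Fourier coefficients
smoothed <- as.vector(X %*% coefs)    # smoothed profile on the sampling grid
```

Each station is then summarized by its 25 coefficients, which is the kind of representation a functional clustering method works on.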

**Loadings and Model-based clustering of functional data with funFEM**

```
rm(list=ls())
# Data and Package loadings
require(funFEM)
```

`Loading required package: funFEM`

`Loading required package: MASS`

`Loading required package: fda`

`Loading required package: splines`

`Loading required package: Matrix`

```
Attaching package: 'fda'
```

```
The following object is masked from 'package:graphics':
matplot
```

`Loading required package: elasticnet`

`Loading required package: lars`

`Loaded lars 1.2`

```
data(velib)
# clustering
basis<- create.fourier.basis(c(0, 181), nbasis=25)
fdobj <- smooth.basis(1:181,t(velib$data),basis)$fd
res = funFEM(fdobj,K=10,model='AkjB')
```

`The "ward" method has been renamed to "ward.D"; note new "ward.D2"`

The figure produced by the code below presents the curves assigned to each of the 10 components by the MAP rule.

```
# Visualization of group means
fdmeans <- fdobj
fdmeans$coefs <- t(res$prms$my)
colset <- c("darkorange1", "dodgerblue2", "black", "chartreuse2", "darkorchid2", "gold2", "deeppink2", "deepskyblue1", "firebrick2", "cyan1")
par(mfrow = c(10, 1), mar = c(1, 4, 1, 1))
for (k in 1:10) {
  # Curves assigned to component k by the MAP rule
  tmp <- fdobj
  who <- which(res$cls == k)
  tmp$coefs <- tmp$coefs[, who]
  # Show only the mean of component k: the other means get
  # zero line width, blank line type and white color
  colo <- rep(0, 10)
  colo[k] <- 2
  color <- rep("white", 10)
  color[k] <- colset[k]
  lt <- rep(0, 10)
  lt[k] <- 1
  plot.fd(tmp, col = colset[k], lwd = 0.1, ylim = c(-0.2, 1.1))
  plot(fdmeans, col = color, xaxt = 'n', lwd = colo, lty = lt, add = TRUE)
}
```