Thursday, December 29, 2016
Introduction
Providing a better understanding of multidimensional data is an important task, and this article shows how to approach it with backpropagation neural networks. One part of exploratory data analysis (EDA) is multidimensional data visualization, which requires an effective algorithm for mapping nD data onto a 2D plane. There are many techniques, including Kohonen self-organizing maps, SVD, and PCA, but their results are sometimes difficult to interpret. Here we present another one, which seems simpler to understand. The technique is illustrated on the well-known Iris dataset.
Analysis
The Iris dataset contains four input features (sepal length, sepal width, petal length, petal width) and one output, the species. Thus the input space is 4D and cannot be visualized directly. Using 2D plots we can show only pairwise relationships, as below.
# Load the necessary libraries
library(neuralnet)
library(caret)
library(ggplot2)

# Print the simple pairs plot
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "pairs",
            pch = 21, auto.key = list(columns = 3))
It seems quite understandable, but, for instance, we cannot tell from it whether the input space contains clusters. Even 4D data can be confusing, and drawing conclusions becomes still harder as the number of dimensions grows.
Backpropagation neural networks can effectively reduce data dimensionality. To do this, we pass the data through a “bottleneck” of lower dimension than the source data. Such a network has five layers: an input layer, a first hidden layer that compresses the data, a second hidden “bottleneck” layer, a third hidden layer that decompresses the data, and an output layer of the same size as the input layer. Thus this type of neural network has a symmetrical architecture (it is commonly known as an autoencoder).
In our case the “bottleneck” will contain just two neurons, which is exactly what we need to visualize the data in 2D. The network will let only the most important information pass through this layer, and later we will see that this information alone is enough to restore the source data distribution.
First, we create a training dataset with the same inputs and outputs and apply simple preprocessing (min-max normalization):
train <- iris[, 1:4]
train <- cbind(train, train)
colnames(train) <- c("SL.I", "SW.I", "PL.I", "PW.I",   # inputs from the iris dataset...
                     "SL.O", "SW.O", "PL.O", "PW.O")   # ...and the same values as outputs

# Preprocess the train dataset with the caret package
preProcValues <- preProcess(train, method = c("range"))
train2 <- predict(preProcValues, train)

Next, we create the neural network model that compresses the 4D source data into 2D:
set.seed(1)
model2d <- neuralnet(SL.O + SW.O + PL.O + PW.O ~ SL.I + SW.I + PL.I + PW.I,
                     train2, hidden = c(3, 2, 3), algorithm = 'rprop+',
                     threshold = 0.01)
# print(paste("Training error =", model2d$result.matrix["error", 1]))
plot(model2d, rep = 1)
Then we take the activation values of the middle hidden layer to create a new, compressed 2D dataset:
result2d <- compute(model2d, train2[, 1:4])          # run the input set through the network
out2d <- as.data.frame(result2d$neurons[[3]][, 2:3]) # bottleneck activations (column 1 is the bias)
out2d <- cbind(out2d, iris[, 5])                     # append labels from the source dataset
colnames(out2d) <- c("HL3.1", "HL3.2", "SPECIES")
ggplot(out2d, aes(x = HL3.1, y = HL3.2, color = SPECIES)) + geom_point()  # visualize it
Thus, we have compressed the 4D input space into a 2D dataset (whose coordinates have no direct physical meaning), visualized it, and labeled it. As we can see, the three clusters corresponding to the iris species can be clearly distinguished visually.
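The earlier claim that the bottleneck passes enough information to restore the source distribution can be checked directly by comparing the network's outputs with the normalized inputs it was trained to reproduce. Here is a minimal self-contained sketch: it refits the same model, and the helper `range01` stands in for caret's "range" preprocessing (plain min-max scaling).

```r
library(neuralnet)

# Min-max scaling to [0, 1]; equivalent to caret's preProcess(method = "range")
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
x <- as.data.frame(lapply(iris[, 1:4], range01))
train2 <- cbind(x, x)
colnames(train2) <- c("SL.I", "SW.I", "PL.I", "PW.I",
                      "SL.O", "SW.O", "PL.O", "PW.O")

set.seed(1)
model2d <- neuralnet(SL.O + SW.O + PL.O + PW.O ~ SL.I + SW.I + PL.I + PW.I,
                     train2, hidden = c(3, 2, 3), algorithm = 'rprop+',
                     threshold = 0.01)

# Reconstruction error: mean squared difference between the network's
# outputs and the normalized inputs it was asked to reproduce
recon <- compute(model2d, train2[, 1:4])$net.result
mse <- mean((recon - as.matrix(train2[, 5:8]))^2)
mse
```

A small value here means the two bottleneck neurons retain most of the information needed to reproduce the four input features.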
Let’s try to push the data through only one neuron!
set.seed(1)
model1d <- neuralnet(SL.O + SW.O + PL.O + PW.O ~ SL.I + SW.I + PL.I + PW.I,
                     train2, hidden = c(3, 1, 3), algorithm = 'rprop+',
                     threshold = 0.01)
# print(paste("Training error =", model1d$result.matrix["error", 1]))
plot(model1d, rep = 1)
result1d <- compute(model1d, train2[, 1:4])        # run the inputs through the network
out1d <- as.data.frame(result1d$neurons[[3]][, 2]) # single bottleneck activation (column 1 is the bias)
out1d <- cbind(out1d, iris[, 5])
colnames(out1d) <- c("HL1", "SPECIES")
ggplot(out1d, aes(x = SPECIES, y = HL1, color = SPECIES)) +
  geom_violin(trim = FALSE) + geom_jitter(position = position_jitter(0.2))
As we can see, this works worse. The model has a greater training error, and its clustering does not look good: we can no longer see three clear clusters, since the data for the versicolor and virginica species overlap in the 1–1.5 interval of the “bottleneck” neuron’s values.
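The error comparison can be made explicit by reading the final training error from each model's result.matrix (its "error" row). A self-contained sketch that refits both models from scratch (again using a base-R `range01` helper in place of caret's "range" preprocessing):

```r
library(neuralnet)

# Refit both models and compare their final training errors
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
x <- as.data.frame(lapply(iris[, 1:4], range01))
train2 <- cbind(x, x)
colnames(train2) <- c("SL.I", "SW.I", "PL.I", "PW.I",
                      "SL.O", "SW.O", "PL.O", "PW.O")
f <- SL.O + SW.O + PL.O + PW.O ~ SL.I + SW.I + PL.I + PW.I

set.seed(1)
model2d <- neuralnet(f, train2, hidden = c(3, 2, 3), algorithm = 'rprop+',
                     threshold = 0.01)
set.seed(1)
model1d <- neuralnet(f, train2, hidden = c(3, 1, 3), algorithm = 'rprop+',
                     threshold = 0.01)

# The "error" row of result.matrix holds the final training error per repetition
errors <- c(bottleneck2 = model2d$result.matrix["error", 1],
            bottleneck1 = model1d$result.matrix["error", 1])
errors
```

With these settings the one-neuron bottleneck is expected to end training with the larger error, matching the visual impression above, though the exact figures depend on the random initialization.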