Thursday, December 29, 2016

Using backpropagation neural networks to compress and visualize multidimensional data in R.


Providing a better understanding of multidimensional data is an important task. This article shows how to approach it with backpropagation neural networks. One part of exploratory data analysis (EDA) is multidimensional data visualization, which requires an effective algorithm for mapping nD data onto a 2D plane. There are many techniques, including Kohonen self-organizing maps, SVD and PCA, but their results are sometimes difficult to interpret. Here is another one that seems simpler to understand. The technique is illustrated on the well-known Iris dataset.


The Iris dataset contains four input features (sepal length, sepal width, petal length, petal width) and one output, the species. Its input space is therefore 4D and cannot be visualized directly. With 2D plots we can show only pairwise relationships, as below.
# Load the necessary libraries
library(caret)     # preProcess() and featurePlot()
library(neuralnet) # backpropagation neural networks
library(ggplot2)   # plotting

# Print the simple pairs plot
featurePlot(x = iris[, 1:4], y = iris$Species, plot = "pairs", pch = 21, auto.key = list(columns = 3))

It looks quite understandable, but, for instance, we cannot tell whether the input space as a whole contains clusters. Even 4D data can be confusing, and drawing conclusions gets harder still with more dimensions.
Backpropagation neural networks can effectively reduce data dimensionality. To do this, we pass the data through a “bottleneck” of lower dimension than the source data. The network has five layers: an input layer, a first hidden layer that compresses the data, a second hidden “bottleneck” layer, a third hidden layer that decompresses the data, and an output layer of the same size as the input layer. This type of neural network therefore has a symmetrical architecture.
In our case the “bottleneck” will contain just two neurons, which is what we need to obtain 2D data. The network lets only the most important information pass through this layer. Later we will see that this information alone is enough to restore the source data distribution.
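As a rough illustration of this symmetric architecture, here is a base-R sketch that counts the trainable parameters of the 4-3-2-3-4 network we train below, assuming fully connected layers with one bias term per receiving neuron (the layer sizes are from this article; the counting itself is just an illustration):

```r
# Symmetric autoencoder layer sizes: input, compress, bottleneck, decompress, output
layers <- c(4, 3, 2, 3, 4)

# Each receiving neuron has one weight per source neuron plus one bias,
# so a layer of size n fed by a layer of size m contributes (m + 1) * n weights
n_weights <- sum((head(layers, -1) + 1) * tail(layers, -1))
n_weights  # 48 trainable parameters in total
```

With only 48 parameters, this network is small enough to train in seconds on the 150-row Iris dataset.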
First, we create a training dataset with identical inputs and outputs and apply simple preprocessing (normalization):
train <- iris
train <- train[,1:4]
train <- cbind(train, train)
colnames(train) <- c("SL.I","SW.I", "PL.I", "PW.I", # Inputs from iris dataset...
                     "SL.O","SW.O", "PL.O", "PW.O") # and the same outputs

# Preprocess train dataset with caret package
preProcValues <- preProcess(train, method = c("range"))
train2 <- predict(preProcValues, train)
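For readers without caret, the same range normalization can be sketched in base R; this is an illustrative equivalent, not the code used in the rest of the article:

```r
# Rescale every column to [0, 1]: (x - min) / (max - min),
# which is what caret's method = "range" does
range01 <- function(x) (x - min(x)) / (max(x) - min(x))

iris_scaled <-[, 1:4], range01))
stopifnot(all(iris_scaled >= 0), all(iris_scaled <= 1))
```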
Next, we create the neural network model that compresses the 4D source data into a 2D dataset:
model2d <- neuralnet(SL.O + SW.O + PL.O + PW.O ~ SL.I + SW.I + PL.I + PW.I, 
                    train2, hidden=c(3,2,3), algorithm = 'rprop+', threshold = 0.01)
# print(paste("Training error = ", model2d$result.matrix[1]))

plot(model2d, rep = 1)

Third, we get the activation values from the middle hidden layer to create a new compressed 2D-dataset:
result2d <- compute(model2d, train2[,1:4]) # Run the input set through the neural network
out2d <-$neurons[[3]][,2:3]) # Get the 2D data from the middle layer
out2d <- cbind(out2d, iris[,5]) # Append labels from source dataset
colnames(out2d) <- c("HL3.1","HL3.2", "SPECIES")
ggplot(out2d, aes(x=HL3.1, y=HL3.2, color=SPECIES)) + geom_point() # Visualize it

Thus, we have compressed the 4D input space into a 2D dataset with no direct physical meaning, then visualized and labeled it. As we can see, the three clusters corresponding to the iris species can be clearly distinguished.
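We can also check how much information survives the bottleneck by comparing the network's outputs with the original normalized data; with the objects above this would be `mean((as.matrix(train2[, 5:8]) - result2d$net.result)^2)`. Here is a self-contained sketch of that reconstruction error, with toy matrices standing in for the real ones:

```r
# Mean squared reconstruction error between original and reconstructed data
reconstruction_mse <- function(actual, reconstructed) {
  mean((actual - reconstructed)^2)
}

# Toy stand-ins for train2[, 5:8] and result2d$net.result
actual        <- matrix(c(0.10, 0.90, 0.50, 0.30), nrow = 2)
reconstructed <- matrix(c(0.12, 0.88, 0.52, 0.28), nrow = 2)
reconstruction_mse(actual, reconstructed)  # 4e-04
```

A low reconstruction error means the two bottleneck neurons really do carry most of the information in the four input features.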
Let’s try to push the data through only 1 neuron!
model1d <- neuralnet(SL.O + SW.O + PL.O + PW.O ~ SL.I + SW.I + PL.I + PW.I, 
                     train2, hidden=c(3,1,3), algorithm = 'rprop+', threshold = 0.01)
# print(paste("Training error = ", model1d$result.matrix[1]))

plot(model1d, rep=1)

result1d <- compute(model1d, train2[,1:4]) # Run the input set through the neural network
out1d <-$neurons[[3]][,2]) # Get the 1D data from the middle layer
out1d <- cbind(out1d, iris[,5])
colnames(out1d) <- c("HL1", "SPECIES")

ggplot(out1d, aes(x=SPECIES, y=HL1, color=SPECIES)) +
        geom_violin(trim = FALSE)

As we can see, this works worse. The model has a greater mean square error, and its clustering does not look good: there are no three clear clusters, because the values for the versicolor and virginica species overlap in the 1–1.5 range of the “bottleneck” neuron's output.


The example above shows that backpropagation neural networks can be very helpful for compressing and visualizing multidimensional data, and this technique can be clearer than some well-known methods. Applying the 2D model to the Iris dataset gave us a good picture of the existing data clusters. The 1D model performed worse in comparison, but in some cases even a 1D model can make sense: for example, when modeling house prices, the data from the single middle neuron (the “bottleneck”) could summarize the most important factors affecting the price.