Running a K-Means Cluster Analysis

A k-means cluster analysis was conducted to identify subgroups of college students based on the similarity of their responses to an evaluation questionnaire about classes they attended. The dataset included a total of 5820 records.

The quantitative clustering variables below were included. All clustering variables were standardized to have a mean of 0 and a standard deviation of 1.

  • Class: Course code; possible values from {1-13}
  • Repeat: Number of times the student is taking this course; values taken from {0,1,2,3,…}
  • Attendance: Code of the level of attendance; values from {0, 1, 2, 3, 4}
  • Difficulty: Level of difficulty of the course as perceived by the student; values taken from {1,2,3,4,5}

Possible values for the variables below are {1,2,3,4,5}

  • Q1: The semester course content, teaching method and evaluation system were provided at the start.
  • Q2: The course aims and objectives were clearly stated at the beginning of the period.
  • Q3: The course was worth the amount of credit assigned to it.
  • Q4: The course was taught according to the syllabus announced on the first day of class.
  • Q5: The class discussions, homework assignments, applications and studies were satisfactory.
  • Q6: The textbook and other course resources were sufficient and up to date.
  • Q7: The course allowed field work, applications, laboratory, discussion and other studies.
  • Q8: The quizzes, assignments, projects and exams contributed to helping the learning.
  • Q9: I greatly enjoyed the class and was eager to actively participate during the lectures.
  • Q10: My initial expectations about the course were met at the end of the period or year.
  • Q11: The course was relevant and beneficial to my professional development.
  • Q12: The course helped me look at life and the world with a new perspective.

Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations. A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters was plotted for each of the nine cluster solutions in an elbow curve to provide guidance for choosing the number of clusters to interpret.
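The elbow-curve procedure above can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the variable names and the random predictors are assumptions standing in for the standardized questionnaire variables:

```python
# Sketch of the elbow-curve procedure: fit k-means for k = 1-9 on the
# training split and record the average within-cluster distance.
# Synthetic standardized data stands in for the questionnaire variables.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(123)
X = scale(rng.normal(size=(200, 4)))   # stand-in for the standardized variables

clus_train, clus_test = train_test_split(X, test_size=.3, random_state=123)

clusters = range(1, 10)                # k = 1-9, as in the text
meandist = []
for k in clusters:
    model = KMeans(n_clusters=k, n_init=10, random_state=123).fit(clus_train)
    # average within-cluster sum of squares (inertia per observation)
    meandist.append(model.inertia_ / len(clus_train))

# plotting meandist against k produces the elbow curve; the "elbow"
# suggests how many clusters are worth interpreting
```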


The elbow curve suggested that 2 or 3 cluster solutions might be interpreted. The results below are for an interpretation of the 2-cluster solution.

Canonical discriminant analysis was used to reduce the clustering variables down to a few canonical variables that accounted for most of the variance in the clustering variables. A scatterplot of the first two canonical variables by cluster (see below) indicated that the two clusters did not overlap, but neither cluster was densely packed, suggesting relatively high within-cluster variance.


The means of the clustering variables showed that students in cluster-1 were taking their classes for the first time (negative average on Repeat), had a high attendance record compared to cluster-2, and had a favorable impression of their classes. Students in cluster-2 had a poor attendance record, were often repeating their classes, and had a negative impression of the classes they were taking.

Clustering variable means by cluster
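A table of clustering-variable means like the one above can be produced with a simple groupby. This is a sketch on synthetic data; the column names are illustrative, not the actual dataset's:

```python
# Illustrative sketch: compute clustering-variable means by cluster.
# Synthetic data; 'Repeat', 'Attendance', 'Q9' are assumed column names.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=['Repeat', 'Attendance', 'Q9'])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(df)
df['cluster'] = km.labels_

cluster_means = df.groupby('cluster').mean()   # one row of means per cluster
print(cluster_means)
```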

In order to validate the clusters, an Analysis of Variance (ANOVA) was conducted to test for significant differences between the clusters on Q9 (students who greatly enjoyed their class and were eager to participate).

OLS Regression Results

A Tukey HSD test was used for post hoc comparisons between the clusters. Results indicated significant differences between the clusters on Q9.

Multiple Comparison of Means - Tukey
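The ANOVA and Tukey post hoc steps can be sketched with statsmodels. Synthetic Q9 scores stand in for the real data here, so the numbers are illustrative only:

```python
# Hedged sketch of the validation step: OLS/ANOVA of Q9 on cluster
# membership, followed by a Tukey HSD post hoc comparison.
# Data are synthetic; two clusters with different Q9 means.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import MultiComparison

rng = np.random.default_rng(42)
sub = pd.DataFrame({
    'cluster': np.repeat([0, 1], 50),
    'Q9': np.concatenate([rng.normal(2.5, 1, 50), rng.normal(4.0, 1, 50)]),
})

# ANOVA via OLS, as in the "OLS Regression Results" table
anova = smf.ols('Q9 ~ C(cluster)', data=sub).fit()
print(anova.summary())

# Tukey HSD multiple comparison of means
tukey = MultiComparison(sub['Q9'], sub['cluster']).tukeyhsd()
print(tukey.summary())
```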


Machine Learning – Lasso Regression Using Python

A lasso regression analysis was conducted to identify a subset of predictors from a pool of 23 categorical and quantitative variables that best predicted a quantitative target variable, improving interpretability by selecting a model with fewer predictors. The target variable in this case was school connectedness in adolescents. Categorical predictors included gender and a series of five binary categorical variables for race and ethnicity (Hispanic, White, Black, Native American, and Asian).

Data were collected by asking individual questions about whether the adolescent had ever used alcohol, marijuana, cocaine, or inhalants. Additional categorical predictors included the availability of cigarettes in the home, whether or not either parent was on public assistance, and any experience with being expelled from school. Quantitative predictor variables included age, alcohol problems, and a measure of deviance covering behaviors such as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school. Scales for violence, depression, self-esteem, parental presence, parental activities, family connectedness, and grade point average were also included.

Data were randomly split into a training set that included 70% of the observations and a test set that included 30% of the observations.


Applying the lasso regression to the data assigns a regression coefficient to each predictor. Predictors with a regression coefficient of zero were eliminated; 18 were retained.


During the estimation process, self-esteem and depression were most strongly associated with school connectedness, followed by engaging in violent behavior and GPA. Depression and violent behavior were negatively associated with school connectedness and self-esteem and GPA were positively associated with school connectedness. Other predictors associated with greater school connectedness included older age, Hispanic and Asian ethnicity, family connectedness, and parental involvement in activities. Other predictors associated with lower school connectedness included being male, Black and Native American ethnicity, alcohol, marijuana, and cocaine use, availability of cigarettes at home, deviant behavior, and history of being expelled from school.

Regression Coefficient Progression

As predictors are added to the model, the mean squared error (MSE) follows the usual pattern: initially it decreases rapidly, then it levels off, where adding more predictors does not lead to much further reduction in the MSE.


The MSE was 18.15 for the training data and 17.29 for the test data; while slightly lower on the test set, the overall prediction accuracy was fairly stable across the two datasets.


#from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
import os
# sklearn.cross_validation was removed in later scikit-learn releases;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error

os.chdir(r"D:\MOOC\Machine Learning\Wesleyan Univ\DataSets")

#Load the dataset
data = pd.read_csv("tree_addhealth.csv")

#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)

# Data Management - Drop records with missing data
data_clean = data.dropna()
recode1 = {1: 1, 2: 0}
data_clean['MALE'] = data_clean['BIO_SEX'].map(recode1)

#select predictor variables and target variable as separate data sets
# NOTE: the predictor list was truncated in the original listing; the
# remaining predictors described in the text belong in this list as well
predvar = data_clean[['MALE', 'HISPANIC', 'WHITE', 'BLACK', 'NAMERICAN', 'ASIAN']]
target = data_clean.SCHCONN1

# standardize predictors to have mean 0 and standard deviation 1
predictors = pd.DataFrame(preprocessing.scale(predvar.astype('float64')),
                          columns=predvar.columns)

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, target, test_size=.3, random_state=123)

# specify the lasso regression model with 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients
#dict creates a dictionary object, zip pairs names with coefficients
print(dict(zip(predictors.columns, model.coef_)))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
plt.figure()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.xlabel('-log(alpha)')
plt.ylabel('Regression Coefficients')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean squared error for each fold
# (cv_mse_path_ was renamed mse_path_ in later scikit-learn releases)
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
plt.show()

# MSE from training and test data
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE', train_error)
print('test data MSE', test_error)

# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square', rsquared_train)
print('test data R-square', rsquared_test)


Random Forest Analysis Using Python

Random forest analysis was performed to evaluate the importance of a series of predictor variables in predicting a binary, categorical variable. The following predictor variables were included as possible contributors to a random forest evaluating the target variable of regular smoking:

  • Age,
  • Gender,
  • (race/ethnicity) Hispanic, White, Black, Native American and Asian.
  • Alcohol use,
  • Marijuana use,
  • Cocaine use,
  • Inhalant use,
  • Availability of cigarettes in the home,
  • Whether or not either parent was on public assistance,
  • Any experience with being expelled from school,
  • Alcohol problems,
  • Deviance,
  • Violence,
  • Depression,
  • Self-esteem,
  • Parental presence,
  • Parental activities,
  • Family connectedness,
  • School connectedness
  • Grade point average.

The 24 predictor variables had different importance levels in the prediction; the most important was marijuana use, followed by GPA and school connectedness. The least important predictors were Asian and Native American ethnicity (see below).


The accuracy of the prediction was 84%.


The accuracy plot below shows that growing additional trees adds little to the overall accuracy of the model, suggesting that interpretation of a single decision tree may be just as appropriate.


Accuracy Plot
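A hedged sketch of the random forest fit and the accuracy-versus-trees curve described above, using synthetic data in place of the AddHealth predictors (all names here are illustrative):

```python
# Sketch: fit a random forest, inspect feature importances, and trace
# test accuracy as the number of trees grows (the "accuracy plot").
# Synthetic data stands in for the real predictors and the TREG1 target.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=.5, size=500) > 0).astype(int)

pred_train, pred_test, tar_train, tar_test = train_test_split(
    X, y, test_size=.4, random_state=0)

clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf = clf.fit(pred_train, tar_train)
print(clf.feature_importances_)        # relative importance of each predictor
print(accuracy_score(tar_test, clf.predict(pred_test)))

# accuracy as a function of the number of trees
accuracy = []
for n in range(1, 26):
    m = RandomForestClassifier(n_estimators=n, random_state=0)
    m = m.fit(pred_train, tar_train)
    accuracy.append(accuracy_score(tar_test, m.predict(pred_test)))
# plotting accuracy against range(1, 26) reproduces the accuracy plot
```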

Machine Learning – Decision Tree using Python

Decision tree analysis was performed to test nonlinear relationships among a set of predictors and a binary, categorical target variable. For the present analyses, the Gini Index was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.
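The Gini-based growing and cost-complexity pruning described above can be sketched with scikit-learn's ccp_alpha mechanism, a modern API that may differ from the tooling used in the course; the data here are synthetic:

```python
# Hedged sketch: grow a tree with the Gini index, then prune it using
# scikit-learn's cost-complexity path. Synthetic binary predictors
# stand in for the real data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(300, 2))               # two binary predictors
y = ((X[:, 0] == 1) & (rng.random(300) < .7)).astype(int)

# grow the full tree with the Gini index
full = DecisionTreeClassifier(criterion='gini', random_state=7).fit(X, y)

# candidate alphas along the cost-complexity pruning path
path = full.cost_complexity_pruning_path(X, y)

# refit with a mid-range alpha to obtain a smaller subtree
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(criterion='gini', ccp_alpha=alpha,
                                random_state=7).fit(X, y)
```

In practice the alpha would be chosen by cross-validation rather than taken from the middle of the path as done here for brevity.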

For simplicity only two predictor variables were used as possible contributors: Marijuana Use (marever1) and Cigarette Availability (cigavail). Marijuana Use was the most important predictor.


The prediction has a 75% accuracy rate.


The ‘Marijuana Use’ score was the first variable to separate the sample into two subgroups.









From a sample of 85 valid records, 63 said they had never used marijuana and 22 said they had. Of the 63, 50 did not have cigarettes available at home; of those 50, 46 were not regular smokers and 4 were. The remaining 13 did have cigarettes available; 11 were non-regular smokers and 2 were regular smokers.

The left half of the tree covers the 22 who had used marijuana in the past. Fifteen of them had cigarettes available; 6 were non-regular smokers and 9 were. The remaining 7 did not have cigarettes available; 3 were non-regular smokers and 4 were.

# -*- coding: utf-8 -*-

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
# sklearn.cross_validation was removed in later scikit-learn releases
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

os.chdir(r"D:\MOOC\Machine Learning\Wesleyan Univ\Week-1")

# Data Engineering and Analysis
#Load the dataset
AH_data = pd.read_csv("tree_addhealth2.csv")
data_clean = AH_data.dropna()

# Modeling and Prediction
#Split into training and testing sets
predictors = data_clean[['marever1', 'cigavail']]
#'VIOL1','PASSIST','SCHCONN1','GPA1','EXPEL1','cocever1','inhever1','age'

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=.4)

#Build model on training data
classifier = DecisionTreeClassifier().fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

#showing correct/incorrect classifications (true/false positives and negatives)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))

#shows the accuracy score
print(sklearn.metrics.accuracy_score(tar_test, predictions))

print(DataFrame(classifier.feature_importances_, columns=["Imp"],
                index=predictors.columns).Imp)

#Displaying the decision tree
from sklearn import tree
# export_graphviz writes text, so use StringIO rather than BytesIO
from io import StringIO
#from IPython.display import Image

out = StringIO()
tree.export_graphviz(classifier, out_file=out,
                     feature_names=predictors.columns)

import pydotplus
graph = pydotplus.graph_from_dot_data(out.getvalue())
with open('DecisionTree.jpg', 'wb') as f:
    f.write(graph.create_jpg())

MOOC Websites

With many prestigious universities offering free MOOCs, below you can find a listing of the leading sites. Check them out, I’m sure you will find a class you like.

  1. Coursera
  2. Udacity
  3. edX
  4. Course Sites
  5. Open 2 Study
  6. Desire 2 Learn
  7. MS Virtual Academy
  8. Canvas (students)
  9. Canvas (Instructor)
  10. Stanford University
  11. NovoED
  12. Building Mobile Apps (Harvard)
  13. Coders Joint
  14. EdCast
  15. Eduonix
  16. OpenClassroom (Stanford)
  17. Big Data University

  1. Lumosity
  2. Google Mapping
  3. Focus Training
  4. qGIS
  5. Mind Tools
  6. Boundless Geo
  7. Udemy (ESRI)
  8. MyBringBack
  9. EMS (Big Data)