A Portfolio of my Work with Analytics, Finance and Technology: February 2016

Saturday, 13 February 2016

Multivariable Regression for Concrete Compression Testing through R

Today's project focuses on creating a linear model that describes the influence of multiple ingredients on concrete's ability to withstand loads. The linear model was built with R.

The data come from Chung-Hua University, China. The input variables are cement, slag, fly ash, water, superplasticizer (SP), coarse aggregate and fine aggregate, each measured in kg/m3 of concrete. The output variable is compressive strength after 28 days, measured in MPa. Results show that water is the strongest influencer of compressive strength and slag is the weakest. Superplasticizer had little to no impact and was removed from the model. The fitted model gives compressive strength as follows:

Compressive strength
= 0.04970*(Cement) - 0.04519*(Slag) + 0.03859*(Fly ash) - 0.27055*(Water) - 0.06986*(Coarse Aggregate) - 0.05358*(Fine Aggregate)
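Each coefficient in the equation above can be read as the change in compressive strength (MPa) per additional kg/m3 of that ingredient, with the others held fixed. The sketch below illustrates that reading. The equation as printed has no intercept term, so the sketch only computes differences between mixes, not absolute strengths. The helper name strength_change and the example quantities are made up for illustration.

```r
# Coefficients copied from the equation above (MPa per kg/m^3).
coefs <- c(Cement = 0.04970, Slag = -0.04519, Fly.ash = 0.03859,
           Water = -0.27055, CA = -0.06986, FA = -0.05358)

# Hypothetical helper: change in predicted strength (MPa) for a change in mix.
# `delta` is a named vector of changes in kg/m^3; unnamed ingredients stay fixed.
strength_change <- function(delta, coefs) {
  stopifnot(all(names(delta) %in% names(coefs)))
  sum(coefs[names(delta)] * delta)
}

# Made-up example: add 10 kg/m^3 of water and 20 kg/m^3 of cement.
strength_change(c(Water = 10, Cement = 20), coefs)
```

Under the model, the extra water costs more strength than the extra cement gains, which is consistent with water having the largest coefficient.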

Normalized Histogram of Residuals

The coefficient of determination shows a strong fit (R2 = 0.8962) and the p-values are low for each variable. The normalized histogram shows that the residuals are approximately normally distributed. This distribution of residuals supports the linear model and reduces the concern of systematic error.
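A histogram is a visual check only. As a possible complement, the sketch below adds a normal Q-Q plot and a Shapiro-Wilk test. It runs on simulated values that stand in for the model residuals (the real residuals are not reproduced here), so the numbers it prints say nothing about the concrete data.

```r
set.seed(1)

# Simulated stand-in for residuals(concreter2); NOT the real residuals.
r <- rnorm(100, mean = 0, sd = 2.5)

# Q-Q plot: points close to the red line suggest approximate normality.
qqnorm(r, main = "Normal Q-Q Plot of Residuals")
qqline(r, col = "red")

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(r)
sw$p.value
```

With the real model, r would be replaced by the residuals of the final regression.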

The problem was approached by creating a multivariable linear regression of all the input variables:

Initial Regression

The coefficient of determination is high. Some of the p-values, however, do not show strong evidence against the null hypothesis, notably those for slag, fine aggregate and SP. Fortunately, the step() function performs stepwise selection by AIC and drops variables that do not improve the model.
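To show what step() does, here is a minimal sketch on simulated data rather than the concrete dataset. The variable names (x1, x2, noise) are invented: y depends on x1 and x2 only, so noise is a candidate for removal. Called on a full model without a scope, step() searches backward and keeps a change only when it lowers the AIC.

```r
set.seed(42)
n <- 100

# Simulated data: y depends on x1 and x2, while `noise` is unrelated to y.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

# Start from the full model and let step() drop terms that do not lower the AIC.
full <- lm(y ~ x1 + x2 + noise, data = sim)
reduced <- step(full, direction = "backward", trace = 0)

# Compare the two models and see which terms survived.
AIC(full)
AIC(reduced)
formula(reduced)
```

Note that step() selects by AIC and not by p-value, so a variable with a weak p-value can still be kept if removing it would raise the AIC.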

Final Regression

The coefficients are listed in the Estimate column of the summary output. The full code is as follows:

#Multivariable regression of Concrete Compression Test
#By Matthew Mano (matthewm3109@gmail.com)

#import data
concrete<-read.csv("slump.csv")

#remove incomplete tests (rows with missing values)
concretec<-concrete[complete.cases(concrete),]

#generate linear model
concreter<-lm(CS~Cement+Slag+Fly.ash+Water+SP+CA+FA, data=concretec)

#get information of initial model
summary(concreter)

#remove unnecessary variables (stepwise selection by AIC)
concreter2<-step(concreter)

#get information of secondary model
summary(concreter2)

#extract residuals of secondary model (in MPa)
r<-residuals(concreter2)

#graphing residuals in histogram (prob=TRUE plots densities, so the curve below overlays)
hist(r, prob=TRUE, main="Normalized Histogram of Residuals", xlab="Residual (MPa)")

#adding reference normal curve
curve(dnorm(x, mean=mean(r), sd=sd(r)), add=TRUE, col="red")
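Ranking ingredients by raw coefficient depends on the units and the range of each variable. One common way to compare influence is to refit the model on standardized variables. The sketch below does this on simulated data: the ranges, the intercept of 40 and the noise level are invented, and only the Cement and Water columns are mimicked.

```r
set.seed(7)
n <- 100

# Simulated mix data; the ranges and the intercept of 40 are made up.
mix <- data.frame(Cement = runif(n, 140, 370), Water = runif(n, 160, 240))
mix$CS <- 40 + 0.05 * mix$Cement - 0.27 * mix$Water + rnorm(n, sd = 2)

# Raw model: coefficients are in MPa per kg/m^3.
raw <- lm(CS ~ Cement + Water, data = mix)

# Standardized model: every column is rescaled to mean 0 and sd 1,
# so coefficients are in standard deviations and directly comparable.
mix_std <- as.data.frame(scale(mix))
std <- lm(CS ~ Cement + Water, data = mix_std)

coef(raw)
coef(std)
```

Applied to the real data, the same rescaling of concretec before calling lm() would give coefficients that can be ranked by absolute size.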

The links to the code, csv file and original dataset are attached. If you have any ideas for improvement or would like to get in contact, please comment or email me directly at matthewm3109@gmail.com.

Link to code & csv: http://bit.ly/1QRzyjr
Link to original data: https://archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test

Sunday, 7 February 2016

Scatterplot Matrices to Analyse Water Parameters with R

So far, scatterplot matrices are the most useful tool I have ever seen in any software. A scatterplot matrix graphically summarizes the pairwise relationships between the variables in a dataset. Most impressively, the version used here also reports the correlation coefficient for every pair of variables. Scatterplot matrices are also easy to generate.
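The correlation coefficients shown in such a matrix can also be computed directly with base R's cor(). The sketch below uses simulated values, not the River Avon readings, and the column names and numbers are assumptions chosen only to produce a negative temperature-conductivity relationship.

```r
set.seed(3)
n <- 50

# Simulated stand-in for water-quality columns; names and values are assumptions.
temp <- runif(n, 14, 22)
cond <- 0.9 - 0.02 * temp + rnorm(n, sd = 0.03)
demo <- data.frame(Temperature = temp, Conductivity = cond,
                   pH = rnorm(n, mean = 8, sd = 0.2))

# Pearson correlation for every pair of columns, as a symmetric matrix.
cm <- cor(demo, method = "pearson")
round(cm, 2)
```

The diagonal is always 1, and each off-diagonal entry corresponds to one panel of the scatterplot matrix.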

The goal of today's project is to identify the physical water quality parameters with the strongest fit. The data were collected from the River Avon, UK. Salinity and conductivity had a perfect fit, which was expected. Salinity and temperature had a moderate downhill (negative) linear relationship. Conductivity and temperature also had a moderate downhill (negative) linear relationship. Since conductivity, temperature and salinity likely influence each other, these parameters should be further analysed. Next steps could involve finding a regression plane between the three variables.
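A regression plane between three variables can be fitted with lm() using two predictors. The sketch below is one possible setup on simulated data. The column names, value ranges and coefficients are invented, and treating salinity as the response is an assumption, not a choice made in the original analysis.

```r
set.seed(11)
n <- 60

# Simulated readings; names, ranges and coefficients are made up.
river <- data.frame(Temperature = runif(n, 14, 22),
                    Conductivity = runif(n, 0.4, 0.8))
river$Salinity <- 0.5 * river$Conductivity - 0.002 * river$Temperature +
  rnorm(n, sd = 0.005)

# Fit the plane: Salinity = b0 + b1 * Conductivity + b2 * Temperature.
plane <- lm(Salinity ~ Conductivity + Temperature, data = river)
coef(plane)

# Predict salinity for a hypothetical reading.
predict(plane, newdata = data.frame(Conductivity = 0.6, Temperature = 18))
```

Because salinity and conductivity are almost perfectly correlated in the real data, using both as predictors of temperature would make the coefficients unstable. Keeping one of them as the response avoids that.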

Water Parameters in River Avon, UK

The water parameters measured are temperature (in Celsius), pH, conductivity (mS), dissolved oxygen (%) and salinity (ppt). The readings were taken at different locations along the river during the summer of 2015 (June, July and August).

The code is as follows:

#River Avon Water Parameters
#by Matthew Mano (matthewm3109@gmail.com)

#import data
water<-read.csv("waterQuality3.csv",header=TRUE)

library("psych") #psych is a REALLY useful package

#scatterplot matrix of columns 3 to 7; c() selects the columns to plot
pairs.panels(water[c(3, 4, 5, 6, 7)], gap = 0)

As shown, the code is short but powerful.

Link for data source: http://bit.ly/1LvM5XY
Link to download csv and r file: http://bit.ly/1LvMr0S