Saturday, 13 February 2016

Multivariable Regression for Concrete Compression Testing with R

Today's project focuses on creating a linear model that describes the influence of multiple ingredients on concrete's ability to withstand loads. The model was built in R.


Data comes from Chung-Hua University, Taiwan. Input variables measured were cement, slag, fly ash, water, super plasticizer (SP), coarse aggregate and fine aggregate, all measured in kg/m3 of concrete. The output variable is compressive strength after 28 days, measured in MPa. Results show that water is the strongest influencer of compressive strength and slag the weakest. Super plasticizer had little to no impact and was removed from the model completely. The compressive strength was determined to follow the equation below:

 Compressive strength
 = 0.04970*(Cement) - 0.04519*(Slag) + 0.03859*(Fly ash) - 0.27055*(Water) - 0.06986*(Coarse Aggregate) - 0.05358*(Fine Aggregate) 
Normalized Histogram of Residuals 


The coefficient of determination shows a strong fit (R2 = 0.8962) and the probability values are low for each variable. The normalized histogram shows a roughly normal distribution of residuals. The distribution of residuals strongly supports the linear model and reduces the risk of systematic error.
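As an extra check beyond the histogram, the residuals can also be inspected with a Q-Q plot and a Shapiro-Wilk test. A minimal sketch, reusing the residual vector r defined in the full code further down:

#additional residual checks (sketch)
qqnorm(r)        #quantile-quantile plot of the residuals
qqline(r)        #reference line; points close to the line suggest normality
shapiro.test(r)  #formal normality test; a large p-value is consistent with normal residuals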

The problem was approached by creating a multivariable linear regression of all the input variables:

Initial Regression

The initial model already shows a high R2. Some of the probability values, however, do not show strong evidence against the null hypothesis - notably slag, fine aggregate and SP. Fortunately, the step() function performs stepwise selection and keeps only the variables that improve the model (judged by AIC).
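For reference, the p-values can be read straight from the coefficient table, and step() can print the elimination it performs at each stage. A minimal sketch, assuming the concreter model object created in the full code below:

#inspect p-values and watch the stepwise selection (sketch)
summary(concreter)$coefficients  #estimates, standard errors, t values and p-values
step(concreter, trace=1)         #prints the AIC at each step as variables are dropped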

Final Regression 

The coefficients are listed in the Estimate column of the regression output. The full code is as follows:

#Multivariable regression of Concrete Compression Test
#By Matthew Mano (matthewm3109@gmail.com)

#import data
concrete<-read.csv("slump.csv")
#remove incomplete tests
concretec<-concrete[complete.cases(concrete),]
#generate linear model 
concreter<-lm(CS~ Cement+Slag+Fly.ash+Water+SP+CA+FA, data=concretec)
#get information of initial model 
summary(concreter)
#remove unnecessary variables 
concreter2=step(concreter)
#get information of secondary model 
summary(concreter2)
r<-residuals(concreter2)



#graphing residuals in histogram
hist(r, prob=TRUE,main="Normalized Histogram of Residuals",xlab="Residuals (MPa)")
#adding reference normal curve
curve(dnorm(x, mean=mean(r), sd=sd(r)), add=TRUE, col="red")
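With the final model in hand, predict() can estimate the strength of a new mix without plugging the numbers into the equation by hand. A minimal sketch: the mix quantities below are invented purely for illustration, using the same column names as slump.csv:

#predict compressive strength for a hypothetical mix (illustrative values only)
newmix<-data.frame(Cement=300, Slag=100, Fly.ash=100, Water=180, SP=6, CA=900, FA=800)
predict(concreter2, newdata=newmix)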


The links to the code, csv file and original dataset are attached. If you have any ideas for improvement or would like to get in contact, please comment or email me directly at matthewm3109@gmail.com.

Link to code & csv: http://bit.ly/1QRzyjr
Link to original data: https://archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test

Sunday, 7 February 2016

Scatterplot Matrices to Analyse Water Parameters with R

So far, the scatterplot matrix is the most useful tool I have ever seen in any software. Scatterplot matrices graphically summarize the important relationships between vectors. Most impressively, they can display the correlation coefficients between all possible pairs of variables in a dataset. They are also easy to generate.

The goal for today's project is to identify the physical water quality parameters with the strongest fit. The data was collected from the River Avon, UK. Salinity and conductivity had a perfect fit, which was expected. Salinity and temperature had a moderate downhill (negative) linear relationship, and conductivity and temperature also had a moderate downhill (negative) linear relationship. Since conductivity, temperature and salinity likely influence each other, these parameters should be analysed further. Next steps could involve finding a regression plane between the three variables.

Water Parameters in River Avon, UK


Water parameters measured are temperature (in Celsius), pH, conductivity (mS), dissolved oxygen (%) and salinity (ppt). The readings were taken at different locations along the river during the summer of 2015 (June, July and August).

The code is as follows:

#River Avon Water Parameters 
#by Matthew Mano (matthewm3109@gmail.com)

water<-read.csv("waterQuality3.csv",header=T)
library("psych") #psych is a REALLY useful package 

pairs.panels(water[c(3, 4, 5, 6, 7)], gap = 0) #c() selects columns 3 to 7 for the scatterplot matrix; gap=0 removes the space between panels

As shown, the code is simple but powerful.
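As a sketch of the next step mentioned above, a regression plane between conductivity, temperature and salinity could be fitted with lm(). The column names Conductivity, Temp and Salinity are assumptions here and would need to match the actual headers in waterQuality3.csv:

#sketch of a regression plane (column names are assumed, adjust to the csv headers)
plane<-lm(Conductivity ~ Temp + Salinity, data=water)
summary(plane) #coefficients of the fitted plane and its R-squared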

Link for data source: http://bit.ly/1LvM5XY
Link to download csv and r file: http://bit.ly/1LvMr0S

Sunday, 24 January 2016

Use of Bar Graphs in R Programming to Investigate Ethnicities in Toronto (Part 2)

The purpose of this post is to learn to make bar graphs in R. As practice, I used public data on the demographics of Toronto's many neighborhoods. Using R, I found the four neighborhoods with the greatest percentage of the population speaking Chinese, Tamil and Tagalog. Results show that visible minorities tend to live close to each other, just outside the downtown core of a major city.



Languages in Toronto by Neighborhood


Cross-referencing with Google Earth shows that, for Chinese speakers, the prominent neighborhoods were in the northeast end of Toronto. For Tamil speakers, prominent neighborhoods were in the east end, and for Tagalog speakers, in the north end. The public data came from a 2011 survey conducted by Wellbeing Toronto, a program run by the City of Toronto. A link to all the Excel and code files is attached below.


For the code, I had to learn to create bar graphs using the barplot() function, add new columns to a data frame and order values by size. My code is as follows:

#Language in Toronto 
#By Matthew Mano 

lang<-read.csv("torontoLanguage.csv", header=T)

Import the file 

lang$PCh<-(lang$Ch/lang$Tot)*100
lang$PTl<-(lang$Tam/lang$Tot)*100
lang$PTg<-(lang$Tag/lang$Tot)*100

Create three additional columns with the percent population

lang.newch<-lang[order(-lang$PCh),c(1,6)]
lang.newtl<-lang[order(-lang$PTl),c(1,7)]
lang.newtg<-lang[order(-lang$PTg),c(1,8)]

Order the percent population by decreasing size 
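As a quick standalone illustration of the ordering trick (toy numbers, not the Toronto data), order(-x) returns the positions that sort x from largest to smallest:

x<-c(12, 45, 7)
order(-x)      #returns 2 1 3: the positions of the largest, middle and smallest values
x[order(-x)]   #returns 45 12 7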

par(mfrow=c(3,1))

Display 3 charts in 1 window

barplot(height=lang.newch$PCh[1:4],names.arg=lang.newch$Ne[1:4],ylab="Percent out of Total Population (%)",border=NA)

Create a barplot of the first four neighborhoods for Chinese speakers. Did the same for the other two languages. 

title(main="Toronto Neighborhoods with the Greatest Percent of Chinese Speakers")
barplot(height=lang.newtl$PTl[1:4],names.arg=lang.newtl$Ne[1:4],ylab="Percent out of Total Population (%)",border=NA)
title(main="Toronto Neighborhoods with the Greatest Percent of Tamil Speakers")
barplot(height=lang.newtg$PTg[1:4],names.arg=lang.newtg$Ne[1:4],ylab="Percent out of Total Population (%)",border=NA)

title(main="Toronto Neighborhoods with the Greatest Percent of Tagalog Speakers",)

Link for downloads: http://bit.ly/20xMtvZ

Saturday, 23 January 2016

Use of Bar Graphs in R Programming to Investigate Ethnicities in Toronto

The goal for my next project is to look at the major ethnicities in Toronto - by neighborhood. I'll measure ethnicity by looking at languages spoken at home. Ignoring English, some of the major languages spoken in Toronto are Cantonese, Tagalog and Tamil. The purpose is to identify the neighborhoods where speakers of each language make up the greatest percentage of the total population. The dataset I will be using comes from Wellbeing Toronto, a program run by the City of Toronto. The CSV and Excel files are attached.

Download Link for CSV files:  http://bit.ly/20xMtvZ



Sunday, 17 January 2016

Basic Line Graph Coding in R


Using public information available from the Toronto Police, I analyzed vehicle collisions in Toronto from 1998-2012. I created a basic line graph in R to visualize the data.


I graphed the percent of fatal collisions because I believe that fatal collisions are a stronger indicator of danger than the total number of collisions. Collisions are unavoidable. But if drivers follow the law and, more importantly, car engineers develop safer cars, there should be no fatalities. The graph above shows a decrease in both the number of collisions and the percent of fatal collisions. The fall in fatal collisions could be due to strides in safety engineering, greater penalties for reckless driving and more public awareness of safe driving.

Here is my code:

#Vehicle Collisions in Toronto
#By Matthew Mano (matthewm3109@gmail.com)
collisions=read.csv("TorontoCollisions.csv", header=T) #upload csv file
year=collisions$Year #Identify specific columns
total=collisions$C
fatal=collisions$F
percent=(fatal/total)*100 #Find percent of Fatal Collisions to Total Number of Collisions
plot(year,total, type="l", xlab="Year", ylab="Number of Collisions") #Use type="l" to create a line plot
par(new=TRUE) #keep working on the original graph
plot(year,percent,type="l",xaxt="n",yaxt="n",xlab="",ylab="",col="red")
mtext("Percent that were Fatal Collisions (%)",side=4,line=3)
axis(4) #Use right hand axis to display percent
legend("topright",col=c("black","red"),lty=1,legend=c("Number of Collisions","Percent that were Fatal Collisions")) #Add legend
title(main="Vehicle Collisions in Toronto") #Add title


The trickiest part was generating a graph with two y-axes. I plotted the percent normally, then I added two special lines of code: 

mtext("Percent that were Fatal Collisions (%)",side=4,line=3)

Added the axis name to the margins. 

axis(4) #Use right hand axis to display percent 

Used the 4th 'side' of the graph to draw the tick marks and labels for the percent scale on the right-hand axis.
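Here is a tiny standalone sketch of the same two-axis idea with toy numbers (not the collision data), in case the trick is easier to see in isolation:

x<-1:10
y1<-x*100                   #primary series
y2<-seq(5,1,length.out=10)  #secondary series on a different scale
par(mar=c(5,4,4,4))         #leave room on the right for the second axis
plot(x,y1,type="l",xlab="x",ylab="Primary axis")
par(new=TRUE)               #draw the next plot over the current one
plot(x,y2,type="l",col="red",xaxt="n",yaxt="n",xlab="",ylab="")
axis(4)                     #right-hand scale for the second series
mtext("Secondary axis",side=4,line=3)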

I added a link to download my full code. Please let me know if I can improve this or add anything else, and feel free to ask any questions.

Link to download code & download CSV and Excel file: http://bit.ly/1oBVFnj

Saturday, 16 January 2016

Basic Line Graphs

Hey everyone,

The goal today is to familiarize myself with basic graphing functions in R. As a proud Torontonian, I'll be using public data pulled from the Toronto Police Service. I'll be analyzing the number of traffic collisions from 1998-2012.

Link to public documents: http://www.torontopolice.on.ca/publications/#reports
Link for csv/excel download: https://drive.google.com/folderview?id=0B4ylNmE1KdhQR05mcm5Cc0MzOXM&usp=sharing

I'll be uploading my code soon. If you have any questions or any ideas for improvement, shoot me an email at matthewm3109@gmail.com.

Thanks,

Matthew Mano