Thursday, 17 November 2016

Looking at Oil with R (Part 2) - Moving Averages, Continuously Compounded Returns and Oil Futures.


Introduction 

In my previous post, I looked at techniques R offers for forecasting commodity prices. I learned how to decompose time series data to separate cyclical and long-term trends. I decided to further my research by investigating the relationship between oil futures and the current oil price. I found that when current oil quotes are low, oil futures tend to be consistently higher than the current quote. When oil prices are high, there is a greater degree of inconsistency; for the most part, the expectation was that the price would drop.

I also found that the correlation between oil futures and the current oil price is a good indicator of the public's confidence that oil prices will continue along their current path. The average correlation between 2005 and 2016 is 0.71. A decrease in the correlation between the continuously compounded returns of oil and oil futures is an indicator of a crude oil surplus; the correlation fell to around 0.5 during the 2007 financial crisis. Confidence in the oil market has been rising steadily since 2015. The increase in positive correlation while oil prices are decreasing means that the public is confident that the price of oil will continue to drop.

Technique 
As in my previous post, all my data comes from the Quandl API. For oil quotes, I relied on the OPEC Basket Price; my values for the daily crude oil futures also came from Quandl. As an approximation, I averaged the open and close prices.

library("Quandl")
OilFuture = Quandl("CHRIS/CME_CL1",trim_start="01-02-2004", trim_end="10-01-2016", type="zoo")
Oil = Quandl("OPEC/ORB", trim_start="01-02-2004", trim_end="10-01-2016", type="zoo")
AvgOilFuture = ((OilFuture[,2]+OilFuture[,1])/2)
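
The MergedOil object used in the figures below is not shown in the original code; a minimal sketch of how it might be built (the object names are my assumption) is:

#Merge the OPEC basket price with the averaged futures price, keeping only matching dates
MergedOil = merge(Oil, AvgOilFuture, all = FALSE)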


Figure 1, Oil Futures vs. Oil

First, I wanted a high-level view of what I was trying to accomplish: comparing OPEC prices against oil futures. Plotting oil futures against oil prices from 2004 to 2016 gave me some back-of-the-envelope insights (Figure 1). The blue line represents the expectation that the future oil price will equal the current price.
  • At low prices (<$80), the expectation was that prices would increase. 
  • At high prices (>$80), the expectation carried a greater degree of uncertainty. For the most part, the expectation was that the price would drop. 
I got the idea of converting the price to continuously compounded returns from an excellent piece by Guangming Lang. Converting the price to returns meant looking at the rate of increase as opposed to the prices themselves.
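
The ccret object used in the moving-average plots below is not defined in the original post; following the same idea, a minimal sketch would compute the continuously compounded (log) returns of both merged series:

#Continuously compounded returns: first difference of the log prices (a sketch)
ccret = diff(log(MergedOil))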

plot(MergedOil[,1],MergedOil[,2],pch=".",col="red",main = "Oil Futures vs. Oil Price",xlab = "OPEC Crude Oil Basket Price",ylab = "Crude Oil Futures")
lines(MergedOil[,1],MergedOil[,1],col="blue")
Figure 2, Experimenting with Moving Averages
I ran into some difficulty trying to grasp the concept of moving averages. Simply put, for a given time point, a moving average takes a segment of values around it and averages them, then moves on to the next point. Increasing the 'width' of the segment can minimize outliers, reduce the impact of short-term panic and remove random noise that is not relevant to the underlying trend. Increasing the width too much can erase useful information, so it is a balancing game.

I experimented with different widths in Figure 2. At 5 points, short-term fluctuations are not smoothed out. At 500 points, long-term trends are erased. A width of 100 points struck the best balance.

plot(rollapply(ccret[,1],5,mean),col='red',xlab="Year",ylab="Continuously Compounded Returns",main="Returns from OPEC Basket Price with Moving Average")
lines(rollapply(ccret[,1],50,mean),col='black')
lines(rollapply(ccret[,1],100,mean),col='blue')
lines(rollapply(ccret[,1],500,mean),col='green')
legend('top', c("5 Points","50 Points","100 Points","500 Points"), lty=1, col=c('red','black','blue','green'), bty='n', cex=.9)


Figure 3, Comparison between Oil and Oil Futures



The next step was to compare the actual crude oil price and its futures - with the moving average applied to remove noise. Figure 3 shows that the returns from both datasets are nearly equal: when the futures quote increases, the current oil quote increases proportionally. There is also a way to quantify the correlation between the two datasets.
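
The movingavg object is not defined in the original post; presumably it applies the 100-point moving average chosen above to both return series, along these lines:

library("zoo")    #rollapply() comes from the zoo package
#100-point moving average applied to each column of the return series (a sketch)
movingavg = rollapply(ccret, width = 100, FUN = mean)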

plot(movingavg[,1], xlab="Time", ylab="Continuously Compounded Returns",main="Comparison between Crude Oil Current Prices and Futures")
lines(movingavg[,2],col="red")
legend('topleft', c("OPEC Crude Oil Basket Price","Crude Oil Futures"), lty=1, col=c('black','red'), bty='n', cex=.9)


Figure 4, Correlation between Oil and Oil Futures

Figure 4 illustrates the correlation between oil and oil futures. There is an overall moderate positive correlation between the two datasets. Additionally, the weakest relationships were during the 2007-2008 financial crisis and the 2014-2015 oil glut. A weak relationship means that values are more spread out around the expected value.

Simply put, a weak positive relationship means that investors are uncertain that the oil market will continue on its current trajectory. I can also determine that the 2007 recession took the investing public by surprise (judging by the steep fall), but confidence was decreasing well before the 2014-2015 oil glut.

Also, the correlation has been growing stronger since 2015. This means that the belief that the price of oil will stay on its current trajectory has been growing steadily stronger. Unfortunately, the current trajectory for the price of oil is downward.

#Rolling 100-point correlation between the returns of oil and oil futures
correlate_oil = function(x) cor(x)[1, 2]
oil_cor = rollapply(ccret, width=100, FUN=correlate_oil, by.column=FALSE, align="right")
plot(oil_cor, ylab="Correlation", xlab="Time", main="Correlation between Continuously Compounded Returns of Oil and Oil Futures")

I was pleasantly surprised by my conclusions. I learned that during a major economic crisis, the public loses confidence that the oil price will increase. I also learned that the 2014-2015 oil glut could have been predicted, as confidence started to dip around 2013. This is surprising, as major oil companies could have prepared by reducing production instead of later laying off employees. In July 2016, Calgary, a major city dependent on the petroleum industry, had an unemployment rate close to 9% because of the layoffs.

The next step in my research will be to look at other commodities - such as gold, copper and even rice - to forecast oil prices.

Resources



Freakonomics. (2008, July). Forecasting Oil Prices: It’s Easy to Beat the Experts. Retrieved November 2016, from Freakonomics: http://freakonomics.com/2008/07/21/forecasting-oil-prices-its-easy-to-beat-the-experts/

Lang, G. (2014, October). Analyze Stock Price Data Using R (Part 3). Retrieved November 2016, from Master R: http://masterr.org/r/analyze-stock-price-data-using-r-part3/

Alquist, R., & Kilian, L. (2011). Forecasting the Price of Oil. Board of Governors of the Federal Reserve System.


Sunday, 23 October 2016

Looking at Oil with R (Part 1)

As a chemical engineer, I believe that the price of crude oil impacts my job market. Since the second quarter of 2014, the price of crude oil has dipped significantly. There are numerous factors that can explain the downward trend. Recently, the New York Times reported that the problem was record-high production from North America, South America and OPEC (the Organization of the Petroleum Exporting Countries) (Krauss, 2016).

Over the next several weeks, I'll be looking at the features available in R to forecast commodity prices. I am doing this for three reasons: (1) To better understand the economics behind my field, (2) to build a strong portfolio as a data engineer and (3) to make a buck or two by hopefully investing some of my own money in the future.

A really useful package is the Quandl package. Quandl publishes financial data for the public to download freely.

install.packages("Quandl")

Another useful tool is STL (Seasonal and Trend decomposition using Loess). Developed in 1990, STL specializes in separating seasonal trends from a series. The stl() function ships with base R, so the package to install here is forecast, which will be used later to forecast the decomposed trend.

install.packages("forecast")

Load the packages:

library("Quandl")
library("STL") 

Get data from Quandl and create a summary plot:

oil <- Quandl("OPEC/ORB", trim_start="01-01-2000", trim_end="10-01-2016", type="xts")
plot(oil,xlab="Date",ylab="Cost per Barrel ($USD)", main="OPEC Basket Price")


Ideally, I would prefer the WTI (West Texas Intermediate) oil price because it covers light, sweet crude - which industries prefer. However, Quandl offers a more complete dataset for the OPEC basket price.

Oil production fluctuates throughout the year - possibly due to weather, holidays and consumer behaviour. Assuming a cyclic pattern occurring every year (365 days), I isolated the seasonal variation.

#Assume a yearly cycle (365 daily observations) and decompose the series
attr(oil, 'frequency') <- 365
plot(y <- decompose(as.ts(oil)))



Because this is an additive decomposition, simply adding the random component to the seasonal and trend components reproduces the original dataset. The important information is the overall trend - without the influence of seasons. We can forecast this decomposed trend.
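
This is easy to check: re-adding the three components should reproduce the original series, apart from the NA values at the ends of the trend. A quick sketch:

#Recombine the decomposed components; the difference from the original should be ~0
parts <- decompose(as.ts(oil))
recombined <- parts$trend + parts$seasonal + parts$random
max(abs(recombined - as.ts(oil)), na.rm = TRUE)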

fit <- stl(as.ts(oil), t.window=1, s.window="periodic", robust=TRUE)
fcast <- forecast(fit, method="ets")
plot(fcast, ylab="Cost per Barrel ($USD)")


Forecasting the decomposed trend and then adding back the seasonal variation generates the model above. At this point, there is still much room for improvement: there is too much variability within the confidence limits.

Krauss, C. (2016, November 2). Oil Prices: What’s Behind the Volatility? Simple Economics. Retrieved November 6, 2016, from New York Times: http://www.nytimes.com/interactive/2016/business/energy-environment/oil-prices.html


Shoutout to the Reddit community for their help


Tuesday, 20 September 2016

Using R to Model the Spread of Diseases

Using data from a class assignment, I attempted to use R to model the spread of diseases in a fixed population. I was given sample data assuming an unlimited population and asked to predict the number of infected for every round within a group of 90 people. My simulation showed that within 20 interactions, 99% of the population will be infected (Figure 1).
Figure 1, Spread of Test Disease within Fixed Population
I extracted data from Excel (Table 1) into R for better flexibility.

Table 1, Information on Infected 
From the given data, the first step is to find the transmission parameter. The transmission parameter describes how quickly a disease spreads: the greater the parameter, the faster the spread. The parameter can be found via the equation below: 

n(t+1) = (1 + b)*n(t) 
Where b is the transmission parameter, n(t) is the infected population in the present round and n(t+1) is the infected population in the next round, or interaction. 

Using R, I iterated over different values of b until I found the parameter that minimized the sum of squared residuals against the data in Table 1. In order to understand least squares regression better, I coded it explicitly instead of simply using R's built-in functions. 

I plotted the squared residuals with respect to the transmission parameter to illustrate this (Figure 2).
Figure 2, Least Square Regression for Different Transmission Parameters
The minimum squared residual occurs at b = 0.366. I then applied this parameter to the logistic growth equation, which takes into account the maximum infected population (N). 

n(t+1)= (-b/N * n(t) + 1 + b)*n(t)

Plotting the predicted number of infected at this stage is easy. I learned that a disease spreads fastest during the middle stages; as it approaches the maximum population, the growth rate slows. The next stage would be to consider susceptibility and recovery rates.  

The coding is given below:

q1_data = read.csv("a1q1.csv", header = T) #read data 
length = length(q1_data$Cumulative.Cases) #take length of data 
y=q1_data$Cumulative.Cases #Base vector of the cumulative cases 
z=y #create a copy of the cumulative case vector 
z=c(z,NA) #add NA value to the copy vector 
rss=0 #initialize variable to hold residual sum square value 

m = seq(0, 2,0.001) #range of B values tested 
rss_list = vector() #create vector to hold the rss from different b values
for (b in m)
{
  for(a in 1:6)
  {
    z[a+1] = z[a]*(1+b)
    rss = (z[a]-y[a])^2 + rss
  }
  rss_list=c(rss_list,rss)
  rss=0 #reset rss value 
}
plot(m,rss_list, type="l", xlab="Transmission Parameter", ylab="Squared Residual with Observed Data",ylim=c(0,10000),xlim=c(0,0.8))

m[which.min(rss_list)] #find b value that corresponds to the minimum 
title("Graphical Representation of Least Square Regression")




Monday, 6 June 2016

Thoughts on Oracle, SQL and the Certification Exam

Having in-depth knowledge of SQL is useful for anyone who wants to excel in data mining and predictive analysis. Knowing SQL is also necessary for Business Intelligence (BI). For example, my friend actively uses SQL as part of his BI summer position at Bell, a Canadian telecommunications giant.

I decided to get certified in SQL because I wanted to learn more about the field I am interested in pursuing and get an edge in the industry. Additionally, Oracle's latest database management system, Oracle Database 12c, provides greater support for cloud computing. Finally, having the certification shows I have working knowledge of SQL and PL/SQL.

One of the major obstacles I faced while preparing for the test was a lack of good resources, compared to Oracle's Java certifications. I highly recommend reading Database 12c Administrator Certified Associate Study Guide by Biju Thomas. I also took advantage of the free online resources provided with the book - sample tests and flashcards. I would also recommend keeping OCA Oracle Database 12c: SQL Fundamentals I Exam Guide as a reference and secondary resource. A few concepts were missing from Oracle's Exam Guide, but overall it was a solid guide.

All in all, I am happy I got this certification. I am confident in my skill set and, hopefully, I can build on it in the near future.

Saturday, 5 March 2016

Effect of Pests on Sugar Cane Yield using ANOVA with R

Today, I used R to evaluate the variation in sugar cane weight caused by different pests. The goal was to understand the fundamentals of ANOVA (Analysis of Variance) and how to interpret the results.
The sugar cane data came from a 1934 study published in the Indian Journal of Statistics. Pests were found to have a statistically significant impact on sugar cane yield. A post hoc Tukey test showed that there were no significant differences between the different types of pests. 

Comparison of Sugar Cane Samples 
A boxplot was the best option for graphically representing the data. The difference in means between the control and the remaining samples strongly suggests that the pests had a significant effect on the yield. An ANOVA on these results found significant variation among conditions (F ratio > 1, p-value < 0.05).

Analysis of Variance Results 

At a confidence coefficient of 0.95, or 95%, a post hoc Tukey test showed that the sugar yield did not change significantly depending on the type of pest. The Tukey test supported the observations from the boxplot, showing that only the control group was significantly different from the remaining groups. In simpler terms, I cannot be 95% sure there is a difference in yield between the various types of pests.

Tukey Test Results

The coding is as follows:

#Variation in Sugar Yield from Pests
#By Matthew Mano (matthewm3109@gmail.com)

#Import dataset 
sugar<-read.csv("sugar.csv")
#Rename variables and remove unnecessary ones 
sugar$X1[sugar$X1==1]="Control"
sugar$X1[sugar$X1==2]="TopShoot Borer"
sugar$X1[sugar$X1==3]="Stem Borer"
sugar$X1[sugar$X1==5]="Root Borer"
sugar$X1[sugar$X1==6]="Termites"
sugar<-sugar[!((sugar$X1==7)|(sugar$X1==8)|(sugar$X1==4)|(sugar$X1==9)|(sugar$X1==10)|(sugar$X1==11)),]
#Create a Boxplot 
boxplot(X64.5~X1,data=sugar,main="Influence of Pests on Sugar Cane Yield",xlab="Pests",ylab="Sugar Cane Weight (lb)",col="gray88")
#Analysis of Variation 
sugarc<-aov(X64.5 ~ X1, data = sugar)
summary(sugarc)
#TukeyHSD
TukeyHSD(sugarc)
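
A possible extension (not part of the original analysis) is to plot the Tukey confidence intervals, which makes it easy to see that only the comparisons involving the control group exclude zero:

#Plot the 95% family-wise confidence intervals for the pairwise differences
plot(TukeyHSD(sugarc), las=1)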

The link for the dataset is attached: http://bit.ly/1LELKm1



P. C. Mahalanobis and S. S. Bose, “A Statistical Note on the Effect of Pests on the Yield of Sugarcane and the Quality of Cane-Juice,” The Indian Journal of Statistics, vol. 1, no. 4, pp. 399–406, 1934.






Saturday, 13 February 2016

Multivariable Regression for Concrete Compression Testing through R

Today's project focuses on creating a linear model that would describe the influence of multiple ingredients on concrete's ability to withstand loads. The linear model was built with R.


The data comes from Chung-Hua University, Taiwan. The input variables measured were cement, slag, fly ash, water, super plasticizer (SP), coarse aggregate and fine aggregate, all measured in kg/m3 of concrete. The output variable is compressive strength after 28 days, measured in MPa. The results show that water is the strongest influencer of compressive strength and slag the weakest. Super plasticizer had little to no impact and was removed from the model entirely. The compressive strength was determined to follow the equation below:

Compressive strength = 0.04970*(Cement) - 0.04519*(Slag) + 0.03859*(Fly ash) - 0.27055*(Water) - 0.06986*(Coarse Aggregate) - 0.05358*(Fine Aggregate) 
Normalized Histogram of Residuals 


The coefficient of determination shows a strong fit (R² = 0.8962), and the probability values are low for each variable. The normalized histogram shows a normal distribution of residuals, which strongly supports the linear model and reduces the risk of systematic error.

The problem was approached by creating a multivariable linear regression of all the input variables:

Initial Regression

The fit is strong, but some of the probability values do not show strong evidence against the null hypothesis - notably slag, fine aggregate and SP. Fortunately, the step() function performs stepwise selection (based on AIC) and drops variables that do not improve the model.

Final Regression 

The coefficients are listed in the column. The full coding are as follows:

#Multivariable regression of Concrete Compression Test
#By Matthew Mano (matthewm3109@gmail.com)

#import data
concrete<-read.csv("slump.csv")
#remove incomplete tests
concretec<-concrete[complete.cases(concrete),]
#generate linear model 
concreter<-lm(CS~ Cement+Slag+Fly.ash+Water+SP+CA+FA, data=concretec)
#get information of initial model 
summary(concreter)
#remove unnecessary variables 
concreter2=step(concreter)
#get information of secondary model 
summary(concreter2)
r<-residuals(concreter2)



#graphing residuals in histogram
hist(r, prob=TRUE,main="Normalized Histogram of Residuals",xlab="Standard Deviations")
#adding reference normal curve
curve(dnorm(x, mean=mean(r), sd=sd(r)), add=TRUE, col="red")
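
As a quick sanity check, the final model can be used to predict the compressive strength of a hypothetical mix. The ingredient amounts below are purely illustrative (kg/m3):

#Predict compressive strength (MPa) for an illustrative, made-up mix
newmix<-data.frame(Cement=300, Slag=100, Fly.ash=80, Water=180, SP=10, CA=850, FA=750)
predict(concreter2, newdata=newmix)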


The links to the code, csv file and original dataset are attached. If you have any ideas for improvement or would like to get in contact, please comment or email me directly at matthewm3109@gmail.com.

Link to code & csv: http://bit.ly/1QRzyjr
Link to original data: https://archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test

Sunday, 7 February 2016

Scatterplot Matrices to Analyse Water Parameters with R

So far, scatterplot matrices are the most useful function I have ever seen in any software. Scatterplot matrices graphically summarize the important relationships between vectors. Most impressively, they can display the correlation coefficients between all possible pairs of vectors in a dataset. They are also easy to generate.

The goal for today's project is to identify the physical water quality parameters with the strongest fit. The data was collected from the River Avon, UK. Salinity and conductivity had a perfect fit, which was expected. Salinity and temperature had a moderate downhill (negative) linear relationship, and conductivity and temperature also had a moderate downhill (negative) linear relationship. Since conductivity, temperature and salinity likely influence each other, these parameters should be analysed further. A next step could be finding a regression plane between the three variables (a sketch of this is shown at the end of the post).

Water Parameters in River Avon, UK


The water parameters measured are temperature (°C), pH, conductivity (mS), dissolved oxygen (%) and salinity (ppt). The readings were taken at different locations along the river during the summer of 2015 (June, July and August).

Coding is as follows:

#River Avon Water Parameters 
#by Matthew Mano (matthewm3109@gmail.com)

water<-read.csv("waterQuality3.csv",header=T)
library("psych") #psych is a REALLY useful package 

pairs.panels(water[c(3, 4, 5, 6, 7)], gap = 0) #c() selects which columns appear in the scatterplot matrix 

As explained, coding is simple but powerful.
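
As a next step, a regression plane for the three related parameters could be fitted with a simple linear model. The column names below are assumed and may differ in the actual csv:

#Fit salinity as a function of conductivity and temperature (column names assumed)
waterfit<-lm(Salinity ~ Conductivity + Temperature, data=water)
summary(waterfit)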

Link for data source: http://bit.ly/1LvM5XY
Link to download csv and r file: http://bit.ly/1LvMr0S

Sunday, 24 January 2016

Use of Bar Graphs in R Programming to Investigate Ethnicities in Toronto (Part 2)

The purpose of this post is to learn to make bar graphs in R. As practice, I used public data on the demographics of Toronto's many neighborhoods. I used R to find the four neighborhoods with the greatest percentage of residents speaking Chinese, Tamil or Tagalog. The results show that visible minorities tend to live close to each other, just outside the downtown core of a major city.



Languages in Toronto by Neighborhood


Cross-referencing with Google Earth shows that the top neighborhoods for Chinese speakers were in the northeast end of Toronto. For Tamil speakers, the prominent neighborhoods were in the east end; for Tagalog speakers, in the north end. The public data came from a 2011 survey conducted by Wellbeing Toronto, a program run by the City of Toronto. I have attached a link to all Excel and coding files.


For the coding, I had to learn to create bar graphs using the barplot function, create columns in a dataset and order values by size. My coding is as follows:

#Language in Toronto 
#By Matthew Mano 

lang<-read.csv("torontoLanguage.csv", header=T)

Import the file 

lang$PCh<-(lang$Ch/lang$Tot)*100
lang$PTl<-(lang$Tam/lang$Tot)*100
lang$PTg<-(lang$Tag/lang$Tot)*100

Create three additional columns with the percent population

lang.newch<-lang[order(-lang$PCh),c(1,6)]
lang.newtl<-lang[order(-lang$PTl),c(1,7)]
lang.newtg<-lang[order(-lang$PTg),c(1,8)]

Order the percent population by decreasing size 

par(mfrow=c(3,1))

Display 3 charts in 1 window

barplot(height=lang.newch$PCh[1:4],names.arg=lang.newch$Ne[1:4],ylab="Percent out of Total Population (%)",border=NA)

Create a barplot of the first four neighborhoods for Chinese speakers. Did the same for the other two languages. 

title(main="Toronto Neighborhoods with the Greatest Percent of Chinese Speakers")
barplot(height=lang.newtl$PTl[1:4],names.arg=lang.newtl$Ne[1:4],ylab="Percent out of Total Population (%)",border=NA)
title(main="Toronto Neighborhoods with the Greatest Percent of Tamil Speakers")
barplot(height=lang.newtg$PTg[1:4],names.arg=lang.newtg$Ne[1:4],ylab="Percent out of Total Population (%)",border=NA)

title(main="Toronto Neighborhoods with the Greatest Percent of Tagalog Speakers",)

Link for downloads: http://bit.ly/20xMtvZ