```{r setup, include=FALSE,message=FALSE,warning=FALSE}
# OPTIONS -----------------------------------------------
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE)
# PACKAGES-----------------------------------------------
# Tutorial packages
library(vembedr)
library(skimr)
library(yarrr)
library(RColorBrewer)
library(GGally)
library(tidyverse)
library(plotly)
library(readxl)
library(rvest)
library(biscale)
library(tidycensus)
library(cowplot)
library(units)
library(olsrr)
data("HousesNY", package = "Stat2Data")
```
# Correlation
## Basics
To find the correlation between two variables, use the `cor` function. For example:
```{r}
cor(HousesNY$Price,HousesNY$Beds)
```
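By default, `cor` computes the Pearson correlation: the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on a small made-up data set. The `price` and `beds` vectors are invented for illustration and are not taken from `HousesNY`.

```{r}
# Made-up illustration data (NOT from HousesNY)
price <- c(120, 150, 90, 200, 170, 110)
beds  <- c(3, 4, 2, 5, 4, 3)

# Pearson correlation from cor() (method = "pearson" is the default)
r.builtin <- cor(price, beds)

# The same quantity from its definition: cov(x, y) / (sd(x) * sd(y))
r.manual <- cov(price, beds) / (sd(price) * sd(beds))

# The two values should agree up to floating-point error
all.equal(r.builtin, r.manual)
```

If the relationship is monotone but not linear, `cor(x, y, method = "spearman")` gives a rank-based alternative.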
To see the correlation between ALL pairs of columns at once, we can make a "correlation matrix".
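As a minimal sketch, passing a data frame of numeric columns to `cor` returns the correlation between every pair of columns. The `toy.houses` data frame here is invented for illustration; with the tutorial data you would pass the numeric columns of `HousesNY` instead.

```{r}
# Made-up illustration data (NOT from HousesNY)
toy.houses <- data.frame(
  price = c(120, 150, 90, 200, 170, 110),
  beds  = c(3, 4, 2, 5, 4, 3),
  lot   = c(0.8, 0.5, 1.2, 0.9, 0.4, 1.0)
)

# cor() on a data frame returns a square, symmetric matrix
# with 1s on the diagonal (each column against itself)
toy.cormat <- cor(toy.houses)
round(toy.cormat, 2)

# Correlation of every column with the response, largest first
sort(toy.cormat[, "price"], decreasing = TRUE)
```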
Importantly, remember this website: <https://www.tylervigen.com/spurious-correlations>. Just because a variable is correlated with our response does not mean it HAS to be in the model. It simply means that you might want to consider whether there is a reason for that correlation.
Also, correlation only measures the strength of the LINEAR relationship between two variables. All of these scatterplots have the same correlation! (Meet the datasaurus.)
[Image: the datasaurus scatterplots]
This is easier to see in the gif:

[Image: animated gif]
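To make the "linear only" point concrete, here is a small sketch with invented vectors `x` and `y`. In it, `y` is completely determined by `x`, yet the correlation is zero because the relationship is curved rather than linear.

```{r}
# A perfectly predictable, but non-linear, relationship
x <- -10:10
y <- x^2

# The Pearson correlation is zero (up to floating-point error):
# the upward and downward halves of the curve cancel out
cor(x, y)

# Restricting to x >= 0 makes the relationship monotone,
# and the correlation becomes strongly positive
cor(x[x >= 0], y[x >= 0])
```

This is why it is worth plotting the data as well as computing the correlation.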
## Covariance/Correlation matrix plots
Looking at correlations is a quick (but often misleading) way to assess what is happening. Essentially, we look at the correlation between each pair of columns in the data. You can plot the correlations between all the NUMERIC columns using the `corrplot` function from the corrplot package.
```{r}
library(corrplot)
# Filter to a new data frame with only numeric columns
house.numeric.columns <- HousesNY[, sapply(HousesNY, is.numeric)]
# Plot the lower triangle of the correlation matrix as ellipses
corrplot(cor(house.numeric.columns), method = "ellipse", type = "lower")
```
Another option is the `ggcorrmat` function in the ggstatsplot package. There are many more examples here: <https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/ggcorrmat.html>.
```{r}
library(ggstatsplot)
ggcorrmat(HousesNY)
```
There are LOADS of other ways to make correlation plots here: <https://www.r-graph-gallery.com/correlogram.html>. Feel free to choose a favourite.
For example, the GGally package does this with its `ggpairs` function, although it does not cope well with large datasets.
```{r,message=FALSE,warning=FALSE}
# Choose column names - let's say I don't care about location
colnames(HousesNY)
# Create plot - note I have message=FALSE and warning=FALSE set at the top of my code chunk
ggpairs(HousesNY[, c("Price", "Beds", "Baths", "Size", "Lot")])
```
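Because a pairs plot draws every point in every panel, one common workaround for large datasets is to plot a random subset of the rows. This is a sketch of that general idea rather than a GGally feature. The simulated `big.data` data frame and the sample size of 500 are assumptions made for illustration.

```{r}
# Simulated "large" dataset, invented for illustration
set.seed(42)                          # make the random sample reproducible
n <- 100000
big.data <- data.frame(a = rnorm(n))
big.data$b <- big.data$a + rnorm(n)   # b is correlated with a
big.data$c <- rnorm(n)                # c is independent noise

# Keep a random subset of rows for plotting
keep.rows  <- sample(nrow(big.data), size = 500)
small.data <- big.data[keep.rows, ]

# The subset could then be passed to ggpairs(small.data).
# Compare the correlations in the full data and in the subset:
round(cor(big.data), 2)
round(cor(small.data), 2)
```

The subset's correlations are estimates from fewer points, so expect them to differ slightly from the full-data values.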
<br> <br>