fasantosa/trainORR

🚂🚄 trainORR 🚄🚂

Analysing train emissions in the United Kingdom using the ORR datasets

IJC437 Introduction to Data Science & IJC445 Data Visualisation

This repository is part of the coursework for IJC437 Introduction to Data Science and IJC445 Data Visualisation. The visualisations produced for IJC445 differ from those produced for IJC437; please refer to each module's folder. The aim of this study is to cluster the train operators based on their characteristics and then analyse the differences in emissions produced among the trains. Two ORR datasets are used: Passenger rail usage, tables 1223, 1243, and 1253 (Office of Rail and Road, 2025), and Rail environment, tables 6103, 6108, and 6123 (Office of Rail and Road, 2024).

File Structure

trainORR/
├── README.md                 # General information about the project and how to use it
├── LICENSE                   # Licence applied to this project
├── data/
│   ├── train.xlsx            # Processed train characteristics data for IJC437 and IJC445
│   ├── emission.xlsx         # Processed train emissions data for IJC437 and IJC445
│   └── c_train.Rda           # Generated train cluster data from IJC437 for IJC445
├── img/
│   ├── colour_report.png     # Colour report generated using Viz Palette
│   ├── methods.jpg           # Methodology used in IJC437
│   ├── IJC437/
│   │   ├── 3d-dbscan.png     # 3D visualisation to show DBSCAN results
│   │   ├── 3d-kmean.png      # 3D visualisation to show K-means results
│   │   ├── boxplot.png       # Data distribution within each cluster based on emission variables
│   │   ├── correlation.png   # Comparison of correlations between each variable in each train type
│   │   ├── dbscan.jpg        # Results from the knee for DBSCAN
│   │   └── kmean.jpg         # Results from the elbow for K-means
│   └── IJC445/
│       ├── Fig1.png          # Visualisation of clusters' differences with a ridgeline chart 
│       ├── Fig2.png          # Sankey diagram illustrating the transition of train operator names to their respective clusters
│       ├── Fig3.png          # Differences in emissions from different clusters and train types using radar charts
│       ├── Fig4.png          # Comparison of average emissions resulting from the addition of lines denoting different train types 
│       └── composite.png     # Composite visualisations illustrating the characteristics and emissions generated by trains in the UK
├── IJC437/
│   └── train.R               # R code used in this project for IJC437
└── IJC445/
    └── vis.R                 # R code used in this project for IJC445

🚄 IJC437 Introduction to Data Science

Methodology

This study used annual data from the ORR Data Portal, combining passenger rail usage and rail environment datasets after removing trains that were no longer operating or lacked complete information, resulting in 24 trains labelled by type. Data preparation was carried out in Excel due to the varied structure of the tables. Trains were grouped using K-means and DBSCAN, with data normalised and clustering performed in RStudio to identify meaningful patterns. K-means was tested with different K values, supported by the Elbow method and manual inspection, and K = 4 was selected. For DBSCAN, eps and MinPts were determined through knee-plot inspection and manual checking, resulting in eps = 0.7 and MinPts = 3. Additional analysis using boxplots and correlation matrices was conducted to explore variable relationships within each train type and to link these patterns back to the identified clusters.
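The clustering steps described above can be sketched in R as follows. This is a minimal illustration and not the code from train.R: the data are synthetic, the column names (journeys, train_km, stations) are hypothetical, and nstart = 25 and the use of the dbscan package are assumptions. Only K = 4, eps = 0.7, and MinPts = 3 come from the text.

```r
# Synthetic stand-in for the 24 trains; the three features are made up.
set.seed(42)
toy <- data.frame(
  journeys = runif(24, 1, 300),
  train_km = runif(24, 1, 50),
  stations = runif(24, 5, 500)
)

# Normalise each column to mean 0 and standard deviation 1.
scaled <- scale(toy)

# Elbow method: total within-cluster sum of squares for K = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:8, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

# K-means with the K = 4 reported above.
km <- kmeans(scaled, centers = 4, nstart = 25)
print(table(km$cluster))

# DBSCAN with the eps and MinPts reported above.
# Assumes the 'dbscan' package; skipped if it is not installed.
if (requireNamespace("dbscan", quietly = TRUE)) {
  dbscan::kNNdistplot(scaled, k = 2)  # knee plot, with k = MinPts - 1
  db <- dbscan::dbscan(scaled, eps = 0.7, minPts = 3)
  print(table(db$cluster))            # cluster 0 holds the noise points
}
```

On random data like this the elbow and knee are unlikely to be clear, so the sketch only shows the mechanics; the parameter choices reported above depend on the real ORR data.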

Load data

Before running the code, download the data files (listed under data/ in the file structure above) and save them in your working directory.

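library(openxlsx)  # provides read.xlsx()
# Read the first sheet of each processed workbook from the working directory
data_train <- read.xlsx("train.xlsx", sheet = 1)
emission <- read.xlsx("emission.xlsx", sheet = 1)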

Saving data for visualisation

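The clustered data is needed to create the visualisations for IJC445 Data Visualisation. Either save it yourself as shown here, or download the ready-made c_train.Rda listed under data/ in the file structure above.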

# Save the clustered train data so it can be loaded again for IJC445
save(c_train, file = "c_train.Rda")
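
To reuse the saved object later, it can be read back with load(). The sketch below is a hedged illustration of the save and load round trip: the toy c_train data frame and its columns are made up, and a temporary path is used so no real file is overwritten.

```r
# Toy object standing in for the clustered train data (values are made up).
c_train <- data.frame(operator = c("A", "B", "C"), cluster = c(1, 2, 1))

# Save to a temporary path so the sketch does not overwrite a real file.
path <- file.path(tempdir(), "c_train.Rda")
save(c_train, file = path)

# load() restores the object under its original name; a fresh environment
# keeps it separate from the copy already in the workspace.
restored <- new.env()
load(path, envir = restored)
print(identical(restored$c_train, c_train))
```

In a real session, load("c_train.Rda") with no envir argument places c_train directly in the global environment.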

Findings

This study finds that K-means clustering performs more effectively than DBSCAN, despite some previous studies suggesting that DBSCAN may be superior. The results indicate that train clusters with high operational intensity are predominantly composed of bi-mode trains, whereas short-distance services are more evenly distributed across different train types. Diesel trains are shown to generate relatively high emissions and are also associated with lower passenger occupancy, highlighting their inefficiency. In contrast, both bi-mode and electric trains demonstrate competitive performance in terms of lower operational emissions.

Future research could incorporate additional factors, such as emissions from electricity suppliers, to provide a more comprehensive assessment of emissions associated with electric and bi-mode trains. Further analysis could also be conducted under comparable operating conditions, including track length, passenger volumes, and train configurations, to better isolate emission differences. In addition, predictive methods could be developed to estimate the potential emissions reductions achieved when transitioning from diesel to bi-mode or electric trains.

🚄 IJC445 Data Visualisation

Visualisation

Report

Colour report

Report

🚂🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃🚃
