This project is part of IJC437 Introduction to Data Science and IJC445 Data Visualisation. The visualisations produced for IJC445 differ from those produced for IJC437; please refer to each folder. The aim of this study is to cluster train operators based on their characteristics and then analyse the differences in emissions produced among the trains. The datasets used are the Passenger rail usage tables 1223, 1243, and 1253 (Office of Rail and Road, 2025) and the Rail environment tables 6103, 6108, and 6123 (Office of Rail and Road, 2024).
trainORR/
├── README.md # General information about the project and how to use it
├── LICENSE # Licence applied to this project
├── data/
│ ├── train.xlsx # Processed train characteristics data for IJC437 and IJC445
│ ├── emission.xlsx # Processed train emissions data for IJC437 and IJC445
│ └── c_train.Rda # Generated train cluster data from IJC437 for IJC445
├── img/
│ ├── colour_report.png # Colour report generated using Viz Palette
│ ├── methods.jpg # Methodology used in IJC437
│ ├── IJC437/
│ │ ├── 3d-dbscan.png # 3D visualisation to show DBSCAN results
│ │ ├── 3d-kmean.png # 3D visualisation to show K-means results
│ │ ├── boxplot.png # Data distribution within each cluster based on emission variables
│ │ ├── correlation.png # Comparison of correlations between each variable in each train type
│ │ ├── dbscan.jpg # Results from the knee for DBSCAN
│ │ └── kmean.jpg # Results from the elbow for K-means
│ └── IJC445/
│   ├── Fig1.png # Visualisation of clusters' differences with a ridgeline chart
│   ├── Fig2.png # Sankey diagram illustrating the transition of train operator names to their respective clusters
│   ├── Fig3.png # Differences in emissions from different clusters and train types using radar charts
│   ├── Fig4.png # Comparison of average emissions resulting from the addition of lines denoting different train types
│   └── composite.png # Composite visualisations illustrating the characteristics and emissions generated by trains in the UK
├── IJC437/
│ └── train.R # R code files utilised in this project for the IJC437
└── IJC445/
  └── vis.R # R code files utilised in this project for the IJC445
This study used annual data from the ORR Data Portal, combining the passenger rail usage and rail environment datasets. Train operators that were no longer operating or lacked complete information were removed, leaving 24 train operators labelled by train type. Data preparation was carried out in Excel because the source tables vary in structure. The data were then normalised and clustered in RStudio using K-means and DBSCAN to identify meaningful patterns. K-means was tested with different values of K, supported by the Elbow method and manual inspection, and K = 4 was selected. For DBSCAN, eps and MinPts were determined through knee-plot inspection and manual checking, resulting in eps = 0.7 and MinPts = 3. Additional analysis using boxplots and correlation matrices was conducted to explore the relationships between variables within each train type and to link these patterns back to the identified clusters.
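The clustering steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in train.R: the data are synthetic, the column names are hypothetical, z-score scaling is assumed as the normalisation, and the DBSCAN part assumes the dbscan package is installed (it is skipped otherwise). Only K = 4, eps = 0.7, and MinPts = 3 come from the study.

```r
# Synthetic stand-in for the 24 operators; values and column names are hypothetical.
set.seed(42)
toy <- data.frame(
  journeys     = runif(24, 1, 300),
  train_km     = runif(24, 1, 50),
  passenger_km = runif(24, 0.1, 10)
)

# Normalise so that no single variable dominates the distances
# (z-scores are an assumption; the study only says the data were normalised).
toy_scaled <- scale(toy)

# Elbow method: total within-cluster sum of squares for K = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(1:8, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

# K-means with the K reported in the study.
km <- kmeans(toy_scaled, centers = 4, nstart = 25)
toy$kmeans_cluster <- km$cluster

# DBSCAN with the reported parameters, run only if the dbscan package is available.
if (requireNamespace("dbscan", quietly = TRUE)) {
  # Knee plot of k-nearest-neighbour distances; k = MinPts - 1 is a common choice.
  dbscan::kNNdistplot(toy_scaled, k = 2)
  db <- dbscan::dbscan(toy_scaled, eps = 0.7, minPts = 3)
  toy$dbscan_cluster <- db$cluster  # cluster 0 marks noise points
}
```

On random data like this, the elbow and knee plots will not show the structure found in the real dataset; the sketch only shows the order of the steps and where each reported parameter enters.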
Don't forget to download the files from here and save them in your working directory.
library(openxlsx) # assumed: the read.xlsx(file, sheet = ...) call matches the openxlsx package
data_train <- read.xlsx("datatrain.xlsx", sheet = 1) # train characteristics
emission <- read.xlsx("emission.xlsx", sheet = 1) # train emissions
The cluster data generated in IJC437 is needed to create the visualisations for IJC445 Data Visualisation. Save it with the command below, or simply download it from here.
save(c_train, file = "c_train.Rda")
This study finds that K-means clustering performs more effectively than DBSCAN on this dataset, despite some previous studies suggesting that DBSCAN may be superior. The results indicate that train clusters with high operational intensity are predominantly composed of bi-mode trains, whereas short-distance services are more evenly distributed across train types. Diesel trains generate relatively high emissions and are also associated with lower passenger occupancy, which highlights their inefficiency. In contrast, both bi-mode and electric trains perform competitively, with lower operational emissions.
Future research could incorporate additional factors, such as emissions from electricity suppliers, to provide a more comprehensive assessment of emissions associated with electric and bi-mode trains. Further analysis could also be conducted under comparable operating conditions, including track length, passenger volumes, and train configurations, to better isolate emission differences. In addition, predictive methods could be developed to estimate the potential emissions reductions achieved when transitioning from diesel to bi-mode or electric trains.


