Install the package using the following commands 
# for now you can install the developemental version of ltc
# first you need to install the devtools package
# in case you have not already installed
install.packages("devtools")
# and load it
library(devtools)
# then you can install the dev version of the ltc
devtools::install_github("loukesio/BarcoSim")
# and load it
library(BarcoSim)Parameters:
-
num_sequences: An integer specifying the number of DNA sequences to generate. -
seq_length: An integer representing the length of each DNA sequence. -
range_start: An integer indicating the start position of the barcoded sequence. -
range_end: An integer indicating the end position of the barcoded sequence.
library(Biostrings) # Provides tools for working with biological sequences, such as DNA, RNA, and protein sequences
library(BarcoSim) # BarcoSim: A package for simulating barcoded sequencing data
library(dplyr) # A powerful package for data manipulation and transformation,
set.seed(123) # sets the random seed to ensure the reproducibility of a random processes (generation of sequences)
# This function creates 5 parent sequences, each with 10 base pairs and a single barcode area spanning from base 3 to base 6.
df1.1 <- gpseq(num_sequences=5, seq_length=10, range_start=3, range_end=6)
df1.1 %>%
DNAStringSet()
#> DNAStringSet object of length 10:
#> width seq
#> [1] 10 CGCAGCGTAA
#> [2] 10 CGTGTTGTAA
#> [3] 10 CGGCAAGTAA
#> [4] 10 CGTCGGGTAA
#> [5] 10 CGATGCGTAA
# Create five parent sequences, each consisting of 10 base pairs, with multiple barcoded regions spanning from base 2 to base 3 and
# from base 6 to base 8.
df1.2 <- gpseq(5,10,range_start=c(2,6), range_end=c(3,8))
df1.2 %>%
DNAStringSet()
DNAStringSet object of length 5:
width seq
[1] 10 GCTTAGGACG
[2] 10 GTGTATGGCG
[3] 10 GCGTACTCCG
[4] 10 GGGTATGTCG
[5] 10 GATTAGCTCG
Created on 2023-04-15 with reprex v2.0.2
The outcome of the gpseq contains the conserved sequences from 1-2 and 7-10, and the barcode sequences from 3-6 (see Figure1). In addition with the help of the function calcSeqSim we can quantify the similarity among sequences at each base pair.
parameters:
dna_seq: A character vector of DNA sequences obtained as the output of thegpseqfunction.
Plotting sequence similarity in a sequence with a single barcoded area:
library(BarcoSim)
library(ggplot2)
library(dplyr)
df1.1 = c("CGCAGCGTAA", "CGTGTTGTAA", "CGGCAAGTAA", "CGTCGGGTAA", "CGATGCGTAA")
df1.1
#> [1] "CGCAGCGTAA" "CGTGTTGTAA" "CGGCAAGTAA" "CGTCGGGTAA" "CGATGCGTAA"
#______________________________________________
# Find sequence similarity at each position
#______________________________________________
calSeqSim(df1.1)
#> [1] 100 100 40 40 60 40 100 100 100 100
#___________________________
# plotting example :)
#____________________________
# Create the tibble
df1.1_data <- calSeqSim(df1.1) %>%
tibble() %>%
dplyr::rename(similarity=1)
# Create the geom area plot
ggplot(df1.1_data, aes(x = 1:nrow(df1.1_data), y = similarity, fill = similarity)) +
geom_area(color = "#333333", fill = "#edae49") +
xlab("Base pair (bp)") +
ylab("Percentage of Similarity") +
ggtitle("Sequnce similarity plot per bp") +
scale_x_continuous(breaks=seq(1:10)) +
theme(panel.grid = element_blank(),
panel.background = element_rect(fill = "white"),
plot.title = element_text(hjust=0.5),
axis.text = element_text(size=12),
axis.ticks.length = unit(.2, "cm")) +
geom_vline(xintercept =c(3,6), linetype="dashed")
# similarly you can plot the df1.2 data
Parameters:
-
dna_seq: A character vector of DNA sequences, obtained as the output of thegpseqfunction. -
num_replicates: An integer specifying the number of times each parent sequence should be replicated. -
error_rate: A numeric value representing the probability error rate during the replication process.
library(BarcoSim)
print(df1.1)
#> [1] "CGCAGCGTAA" "CGTGTTGTAA" "CGGCAAGTAA" "CGTCGGGTAA" "CGATGCGTAA"
#With the current parameters of the r_gpseq function, you can replicate each parent DNA sequence in dna_seq twice with a 0.1 probability error rate.
r_gpseq(dna_seq=df1.1,num_replicates=2,error_rate=0.1)
#> parent parent_seq offspring
#> 1 1 CGCAGCGTAA CGGTGCGTAA
#> 2 1 CGCAGCGTAA CGCAGCGTAA
#> 3 2 CGTGTTGTAA CGTGTTGTAC
#> 4 2 CGTGTTGTAA CGTGTTGTAA
#> 5 3 CGGCAAGTAA CGGCAAGTAA
#> 6 3 CGGCAAGTAA CCGCAAGTAA
#> 7 4 CGTCGGGTAA CGTCGGGTAC
#> 8 4 CGTCGGGTAA GTCGGGTAA
#> 9 5 CGATGCGTAA GATGCGTAA
#> 10 5 CGATGCGTAA CCATGCGTAACreated on 2023-06-24 with reprex v2.0.2
4. r_gpseq_csub replicates parent sequences with a specified error rate while allowing the specification of substitution rates for each base. It is an extension of the r_gpseq function.
dna_seq: A character vector of DNA sequences, obtained as the output of thegpseqfunction.num_replicates: An integer specifying the number of times each parent sequence should be replicated.error_rateA numeric value between 0 and 1 representing the probability error rate during the replication process.substitution_probs(list of length 5): Includes substitution probabilities for each base (A, C, G, T, and empty string).
library(BarcoSim)
df1.1 = c("GCTTAGGACG", "GTGTATGGCG", "GCGTACTCCG", "GGGTATGTCG", "GATTAGCTCG")
substitution_probs <- list("A" = 0.1, "C" = 0.2, "G" = 0.3, "T" = 0.4, " " = 0.1)
r_gpseq_csub(dna_seq=df1.1,num_replicates=2,error_rate = 0.1, substitution_probs)
#> parent parent_seq offspring
#> 1 1 GCTTAGGACG GCTT AGACG
#> 2 1 GCTTAGGACG GCTTACGACG
#> 3 2 GTGTATGGCG GTGTAGGGCG
#> 4 2 GTGTATGGCG GTGTATAGCG
#> 5 3 GCGTACTCCG GCGTGCTCCG
#> 6 3 GCGTACTCCG GCGTACGCCG
#> 7 4 GGGTATGTCG GGGTATGTCG
#> 8 4 GGGTATGTCG GGGTATGTCG
#> 9 5 GATTAGCTCG GGTTGGCTCG
#> 10 5 GATTAGCTCG GAT GGC CGCreated on 2023-06-24 with reprex v2.0.2
The classical Lotka-Volterra competition model describes how multiple species compete for limited resources:
dNᵢ/dt = rᵢNᵢ(1 - (Nᵢ + Σⱼ≠ᵢ αᵢⱼNⱼ)/Kᵢ)
Where:
- Nᵢ = Population size of species i
- rᵢ = Intrinsic growth rate of species i
- Kᵢ = Carrying capacity for species i
- αᵢⱼ = Competition coefficient (effect of species j on species i)
- t = Time
In our DNA sequence simulation, each unique sequence type represents a "species" competing for resources:
dNᵢ/dt = (bᵢ - dᵢ - cᵢ)Nᵢ
Where:
- Nᵢ = Number of individuals with sequence type i
- bᵢ = Birth rate (reproduction rate) for sequence i
- dᵢ = Death rate for sequence i
- cᵢ = Competition-induced mortality for sequence i
bᵢ = base_reproduction_rate × fitness_multiplier
Parameters:
num_replicates: Base number of offspring per parentfitness: Calculated from sequence properties (GC content, length)fitness_variance: Random variation in fitness
In the code:
# Fitness calculation
calculate_fitness <- function(sequence, base_fitness = 1.0) {
gc_content <- (str_count(sequence, "G") + str_count(sequence, "C")) / nchar(sequence)
length_factor <- nchar(sequence) / mean(nchar(dna_seq))
gc_fitness <- 1 - abs(gc_content - 0.5) # Optimal GC ≈ 50%
length_fitness <- exp(-0.1 * abs(length_factor - 1))
fitness <- base_fitness × gc_fitness × length_fitness × rnorm(1, 1, fitness_variance)
return(max(0.1, fitness))
}dᵢ = baseline_death_rate
Parameters:
death_rate: Either fixed value for all sequences or sequence-specific rates- Can be constant or vary by sequence type
In the code:
# Fixed death rate
death_rate = 0.1 # 10% baseline mortality
# Or sequence-specific
death_rate = c("AAGA" = 0.1, "AATC" = 0.3, "GGCC" = 0.05)This is where the Lotka-Volterra competition comes in:
cᵢ = α × max(0, (Σⱼ Nⱼ - K)/K)
cᵢ = α × Σⱼ≠ᵢ (sᵢⱼ × Nⱼ)
Parameters:
- α =
competition_strength(0-1): How strongly competition affects mortality - K =
carrying_capacity: Maximum sustainable population - sᵢⱼ = Sequence similarity between types i and j
- Nⱼ = Population of competing sequence type j
In the code:
# Resource competition
if (resource_competition) {
pressure <- competition_strength × (total_pop - carrying_capacity) / carrying_capacity
}
# Similarity-based competition
else {
for (j in seq_len(nrow(generation_data))) {
if (i != j) {
similarity <- sequence_similarity(seq_i, seq_j)
competitor_effect <- similarity × pop_j × competition_strength
pressure <- pressure + competitor_effect
}
}
}total_death_probᵢ = min(0.95, dᵢ + cᵢ)
The survival for each sequence type follows:
survivors = Binomial(Nᵢ, 1 - total_death_probᵢ)
Since we simulate discrete generations, the continuous differential equation becomes:
Nᵢ(t+1) = Binomial(offspring_produced, 1 - total_death_probᵢ)
Where:
offspring_produced = Nᵢ(t) × reproduction_rate × fitness_multiplier
total_death_probᵢ = death_rate + competition_pressure
| Parameter | Symbol | Code Variable | Description |
|---|---|---|---|
| Population size | Nᵢ | population_size |
Number of individuals per sequence |
| Growth rate | rᵢ | num_replicates × fitness |
Reproduction rate × fitness |
| Death rate | dᵢ | death_rate |
Baseline mortality |
| Competition strength | α | competition_strength |
Intensity of competitive effects |
| Carrying capacity | K | carrying_capacity |
Maximum sustainable population |
| Sequence similarity | sᵢⱼ | sequence_similarity() |
Genetic similarity between sequences |
| Fitness variance | σ² | fitness_variance |
Random variation in fitness |
- Discrete generations instead of continuous time
- Stochastic processes (mutations, binomial survival) rather than deterministic
- Sequence similarity as basis for competition coefficients
- Dynamic fitness based on sequence properties
- Genealogical tracking of lineages over time
- Strong selection pressure
- Rapid extinction of unfit lineages
- Winner-takes-all dynamics
- Gradual selection
- Multiple lineages can coexist
- Balanced population dynamics
- Pure drift and mutation
- All lineages persist longer
- Population grows exponentially until death rate limits
This implementation captures the essence of Lotka-Volterra competition while adding the complexity of genetic evolution, mutation, and stochastic processes that make it realistic for studying sequence evolution!



