Skip to content

loukesio/barcosim

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

License: MIT Last Commit GitHub Contributors Codecov test coverage Lifecycle: Experimental

Install the BarcoSim package

Install the package using the following commands

# for now you can install the developemental version of ltc
# first you need to install the devtools package 
# in case you have not already installed
install.packages("devtools") 
# and load it
library(devtools)

# then you can install the dev version of the ltc
devtools::install_github("loukesio/BarcoSim")
# and load it
library(BarcoSim)

1. Use the gpseq command to generate the parent sequences.

Parameters:

  • num_sequences: An integer specifying the number of DNA sequences to generate.

  • seq_length: An integer representing the length of each DNA sequence.

  • range_start: An integer indicating the start position of the barcoded sequence.

  • range_end: An integer indicating the end position of the barcoded sequence.

library(Biostrings) # Provides tools for working with biological sequences, such as DNA, RNA, and protein sequences
library(BarcoSim)   # BarcoSim: A package for simulating barcoded sequencing data
library(dplyr)      # A powerful package for data manipulation and transformation,


set.seed(123)       # sets the random seed to ensure the reproducibility of a random processes (generation of sequences)

# This function creates 5 parent sequences, each with 10 base pairs and a single barcode area spanning from base 3 to base 6.
df1.1 <- gpseq(num_sequences=5, seq_length=10, range_start=3, range_end=6)

df1.1 %>% 
  DNAStringSet()
  
#> DNAStringSet object of length 10:
#>      width seq
#> [1]    10 CGCAGCGTAA
#> [2]    10 CGTGTTGTAA
#> [3]    10 CGGCAAGTAA
#> [4]    10 CGTCGGGTAA
#> [5]    10 CGATGCGTAA

# Create five parent sequences, each consisting of 10 base pairs, with multiple barcoded regions spanning from base 2 to base 3 and 
# from base 6 to base 8.
df1.2 <- gpseq(5,10,range_start=c(2,6), range_end=c(3,8))

df1.2 %>% 
DNAStringSet()

DNAStringSet object of length 5:
    width seq
[1]    10 GCTTAGGACG
[2]    10 GTGTATGGCG
[3]    10 GCGTACTCCG
[4]    10 GGGTATGTCG
[5]    10 GATTAGCTCG

Created on 2023-04-15 with reprex v2.0.2

The outcome of the gpseq contains the conserved sequences from 1-2 and 7-10, and the barcode sequences from 3-6 (see Figure1). In addition with the help of the function calcSeqSim we can quantify the similarity among sequences at each base pair.

2. Use the calcSeqSim function to plot sequence similarity across the parent sequences

parameters:

  • dna_seq: A character vector of DNA sequences obtained as the output of the gpseq function.

Plotting sequence similarity in a sequence with a single barcoded area:

library(BarcoSim)
library(ggplot2)
library(dplyr)

df1.1 = c("CGCAGCGTAA", "CGTGTTGTAA", "CGGCAAGTAA", "CGTCGGGTAA", "CGATGCGTAA")
df1.1
#> [1] "CGCAGCGTAA" "CGTGTTGTAA" "CGGCAAGTAA" "CGTCGGGTAA" "CGATGCGTAA"

#______________________________________________
# Find sequence similarity at each position
#______________________________________________
calSeqSim(df1.1)
#> [1] 100 100  40  40  60  40 100 100 100 100

#___________________________
# plotting example :)
#____________________________
# Create the tibble
df1.1_data <- calSeqSim(df1.1)  %>% 
  tibble() %>% 
  dplyr::rename(similarity=1)

# Create the geom area plot
ggplot(df1.1_data, aes(x = 1:nrow(df1.1_data), y = similarity, fill = similarity)) +
  geom_area(color = "#333333", fill = "#edae49") +
  xlab("Base pair (bp)") +
  ylab("Percentage of Similarity") +
  ggtitle("Sequnce similarity plot per bp") +
  scale_x_continuous(breaks=seq(1:10)) +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "white"), 
        plot.title = element_text(hjust=0.5), 
        axis.text = element_text(size=12), 
        axis.ticks.length = unit(.2, "cm")) +
  geom_vline(xintercept =c(3,6), linetype="dashed")

# similarly you can plot the df1.2 data
  

3. Use the r_gpseq command to replicate parent sequences and make a barcode data set.

Parameters:

  • dna_seq: A character vector of DNA sequences, obtained as the output of the gpseq function.

  • num_replicates: An integer specifying the number of times each parent sequence should be replicated.

  • error_rate: A numeric value representing the probability error rate during the replication process.

library(BarcoSim)

print(df1.1)
#> [1] "CGCAGCGTAA" "CGTGTTGTAA" "CGGCAAGTAA" "CGTCGGGTAA" "CGATGCGTAA"

#With the current parameters of the r_gpseq function, you can replicate each parent DNA sequence in dna_seq twice with a 0.1 probability error rate.

r_gpseq(dna_seq=df1.1,num_replicates=2,error_rate=0.1)
#>    parent parent_seq  offspring
#> 1       1 CGCAGCGTAA CGGTGCGTAA
#> 2       1 CGCAGCGTAA CGCAGCGTAA
#> 3       2 CGTGTTGTAA CGTGTTGTAC
#> 4       2 CGTGTTGTAA CGTGTTGTAA
#> 5       3 CGGCAAGTAA CGGCAAGTAA
#> 6       3 CGGCAAGTAA CCGCAAGTAA
#> 7       4 CGTCGGGTAA CGTCGGGTAC
#> 8       4 CGTCGGGTAA  GTCGGGTAA
#> 9       5 CGATGCGTAA  GATGCGTAA
#> 10      5 CGATGCGTAA CCATGCGTAA

Created on 2023-06-24 with reprex v2.0.2

4. r_gpseq_csub replicates parent sequences with a specified error rate while allowing the specification of substitution rates for each base. It is an extension of the r_gpseq function.

  • dna_seq: A character vector of DNA sequences, obtained as the output of the gpseq function.
  • num_replicates: An integer specifying the number of times each parent sequence should be replicated.
  • error_rate A numeric value between 0 and 1 representing the probability error rate during the replication process.
  • substitution_probs (list of length 5): Includes substitution probabilities for each base (A, C, G, T, and empty string).
library(BarcoSim)

df1.1 = c("GCTTAGGACG", "GTGTATGGCG", "GCGTACTCCG", "GGGTATGTCG", "GATTAGCTCG")
substitution_probs <- list("A" = 0.1, "C" = 0.2, "G" = 0.3, "T" = 0.4, " " = 0.1)

r_gpseq_csub(dna_seq=df1.1,num_replicates=2,error_rate = 0.1, substitution_probs)
#>    parent parent_seq  offspring
#> 1       1 GCTTAGGACG GCTT AGACG
#> 2       1 GCTTAGGACG GCTTACGACG
#> 3       2 GTGTATGGCG GTGTAGGGCG
#> 4       2 GTGTATGGCG GTGTATAGCG
#> 5       3 GCGTACTCCG GCGTGCTCCG
#> 6       3 GCGTACTCCG GCGTACGCCG
#> 7       4 GGGTATGTCG GGGTATGTCG
#> 8       4 GGGTATGTCG GGGTATGTCG
#> 9       5 GATTAGCTCG GGTTGGCTCG
#> 10      5 GATTAGCTCG GAT GGC CG

Created on 2023-06-24 with reprex v2.0.2

Lotka-Volterra Competition Model in DNA Sequence Evolution

Classical Lotka-Volterra Competition Equations

The classical Lotka-Volterra competition model describes how multiple species compete for limited resources:

For Species i:

dNᵢ/dt = rᵢNᵢ(1 - (Nᵢ + Σⱼ≠ᵢ αᵢⱼNⱼ)/Kᵢ)

Where:

  • Nᵢ = Population size of species i
  • rᵢ = Intrinsic growth rate of species i
  • Kᵢ = Carrying capacity for species i
  • αᵢⱼ = Competition coefficient (effect of species j on species i)
  • t = Time

Adaptation to DNA Sequence Evolution

In our DNA sequence simulation, each unique sequence type represents a "species" competing for resources:

Modified Growth Equation:

dNᵢ/dt = (bᵢ - dᵢ - cᵢ)Nᵢ

Where:

  • Nᵢ = Number of individuals with sequence type i
  • bᵢ = Birth rate (reproduction rate) for sequence i
  • dᵢ = Death rate for sequence i
  • cᵢ = Competition-induced mortality for sequence i

Implementation Components

1. Birth Rate (bᵢ)

bᵢ = base_reproduction_rate × fitness_multiplier

Parameters:

  • num_replicates: Base number of offspring per parent
  • fitness: Calculated from sequence properties (GC content, length)
  • fitness_variance: Random variation in fitness

In the code:

# Fitness calculation
calculate_fitness <- function(sequence, base_fitness = 1.0) {
    gc_content <- (str_count(sequence, "G") + str_count(sequence, "C")) / nchar(sequence)
    length_factor <- nchar(sequence) / mean(nchar(dna_seq))
    
    gc_fitness <- 1 - abs(gc_content - 0.5)  # Optimal GC ≈ 50%
    length_fitness <- exp(-0.1 * abs(length_factor - 1))
    
    fitness <- base_fitness × gc_fitness × length_fitness × rnorm(1, 1, fitness_variance)
    return(max(0.1, fitness))
}

2. Base Death Rate (dᵢ)

dᵢ = baseline_death_rate

Parameters:

  • death_rate: Either fixed value for all sequences or sequence-specific rates
  • Can be constant or vary by sequence type

In the code:

# Fixed death rate
death_rate = 0.1  # 10% baseline mortality

# Or sequence-specific
death_rate = c("AAGA" = 0.1, "AATC" = 0.3, "GGCC" = 0.05)

3. Competition-Induced Mortality (cᵢ)

This is where the Lotka-Volterra competition comes in:

Resource Competition Model:

cᵢ = α × max(0, (Σⱼ Nⱼ - K)/K)

Similarity-Based Competition Model:

cᵢ = α × Σⱼ≠ᵢ (sᵢⱼ × Nⱼ)

Parameters:

  • α = competition_strength (0-1): How strongly competition affects mortality
  • K = carrying_capacity: Maximum sustainable population
  • sᵢⱼ = Sequence similarity between types i and j
  • Nⱼ = Population of competing sequence type j

In the code:

# Resource competition
if (resource_competition) {
    pressure <- competition_strength × (total_pop - carrying_capacity) / carrying_capacity
}

# Similarity-based competition  
else {
    for (j in seq_len(nrow(generation_data))) {
        if (i != j) {
            similarity <- sequence_similarity(seq_i, seq_j)
            competitor_effect <- similarity × pop_j × competition_strength
            pressure <- pressure + competitor_effect
        }
    }
}

4. Total Death Probability

total_death_probᵢ = min(0.95, dᵢ + cᵢ)

The survival for each sequence type follows:

survivors = Binomial(Nᵢ, 1 - total_death_probᵢ)

Discrete Time Implementation

Since we simulate discrete generations, the continuous differential equation becomes:

Nᵢ(t+1) = Binomial(offspring_produced, 1 - total_death_probᵢ)

Where:

offspring_produced = Nᵢ(t) × reproduction_rate × fitness_multiplier
total_death_probᵢ = death_rate + competition_pressure

Parameter Summary

Parameter Symbol Code Variable Description
Population size Nᵢ population_size Number of individuals per sequence
Growth rate rᵢ num_replicates × fitness Reproduction rate × fitness
Death rate dᵢ death_rate Baseline mortality
Competition strength α competition_strength Intensity of competitive effects
Carrying capacity K carrying_capacity Maximum sustainable population
Sequence similarity sᵢⱼ sequence_similarity() Genetic similarity between sequences
Fitness variance σ² fitness_variance Random variation in fitness

Key Differences from Classical Model

  1. Discrete generations instead of continuous time
  2. Stochastic processes (mutations, binomial survival) rather than deterministic
  3. Sequence similarity as basis for competition coefficients
  4. Dynamic fitness based on sequence properties
  5. Genealogical tracking of lineages over time

Competition Scenarios

High Competition (α = 0.5, K = 100):

  • Strong selection pressure
  • Rapid extinction of unfit lineages
  • Winner-takes-all dynamics

Moderate Competition (α = 0.1, K = 500):

  • Gradual selection
  • Multiple lineages can coexist
  • Balanced population dynamics

No Competition (α = 0, K = ∞):

  • Pure drift and mutation
  • All lineages persist longer
  • Population grows exponentially until death rate limits

This implementation captures the essence of Lotka-Volterra competition while adding the complexity of genetic evolution, mutation, and stochastic processes that make it realistic for studying sequence evolution!

About

BarcoSim, a specialized R package designed for simulating barcoded amplicon sequences. Whether you're testing algorithms or forecasting experimental outcomes, BarcoSim provides a robust framework to mimic real-world sequencing scenarios, ensuring preparedness and enhancing the efficiency of your genomic projects."

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages