Obtain a gurobipy license (the commercial trial works for 30 days).
Install the required dependencies:

```shell
pip3 install pandas gurobipy numpy matplotlib
```

First, download and decompress the traces from Moirai-SOSP25-logs.
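After installation, a quick sanity check (a minimal stdlib-only sketch, not part of the Moirai tooling) confirms the required modules are importable:

```shell
python3 - <<'EOF'
import importlib.util
# Report whether each required module resolves in the current environment.
for mod in ("pandas", "gurobipy", "numpy", "matplotlib"):
    status = "found" if importlib.util.find_spec(mod) else "MISSING"
    print(f"{mod}: {status}")
EOF
```

Any module reported as MISSING should be reinstalled before proceeding.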
Second, organize the traces into the following folders:

- Table size files (`report-table-size-*.csv`): place under `Moirai/`
- Workload files (from 2024/10 onward, `report-abFP-volume-table-*.csv`): place under `Moirai/newTraces/`
- Job traces (`%Y%m%d-Presto.csv` or `%Y%m%d-Spark.csv`): place under `Moirai/jobTraces/`
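The layout above can be set up with a few moves (a minimal sketch; it assumes the decompressed CSVs sit in your current directory — adjust the source paths to wherever you extracted Moirai-SOSP25-logs):

```shell
# Create the expected folder layout, then sort the traces into it.
mkdir -p Moirai/newTraces Moirai/jobTraces
mv report-table-size-*.csv Moirai/
mv report-abFP-volume-table-*.csv Moirai/newTraces/
mv ./*-Presto.csv ./*-Spark.csv Moirai/jobTraces/
```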
Run the following command to verify outputs under `sample_0.050`:

```shell
python3 tests.py --test=samplek --c=30 --k=0.05 --num_week=2 --rep_rate=0.002 --Spark
```

Job traces in `jobTraces/` contain per-job information rather than aggregated optimizer traces. Run the scheduler after optimization:

```shell
python3 scheduler.py --c=30 --num_week=2 --opt_path="sample_0.050"
```

Note: this process takes ~30 minutes per week of job traces; since the example runs for two weeks, expect ~1 hour.
Another flag is `--simple`, which runs the scheduler without simulating the per-minute traffic rate. This saves time if you do not care about the traffic rate.
Note: re-running these commands will not overwrite existing results.
```shell
python3 tests.py --test=samplek --k=1 --num_week=13 --rep_rate=0.002 --Spark --c=30
python3 scheduler.py --num_week=13 --opt_path="sample_1.000" --c=30
python3 tests.py --test=long_term --Spark
python3 tests.py --test=reorg_unaware --Spark
```

Other useful flags (see more in `--help`):
- `tests.py`
  - `--view`: displays parameters without running the optimization.
  - `--opt_start_date`: specifies the start date for optimization (default: 2024-10-22).
- `scheduler.py`
  - `--debug`: runs a smaller subset of traces for debugging.
The `cputime` column in Spark traces (both under `jobTraces/` and `newTraces/`) already gives the job's total CPU time in seconds. Therefore, you should not sum these values to compute a job's total CPU time.
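To illustrate the pitfall (a minimal sketch on synthetic rows — the column layout and the assumption that a job can appear on multiple rows are illustrative, not taken from the real traces):

```python
import pandas as pd

# Synthetic stand-in for a Spark trace: a job may appear on several rows,
# but cputime on each row already holds that job's TOTAL CPU seconds.
rows = pd.DataFrame({
    "job_id": ["j1", "j1", "j2"],
    "cputime": [120.0, 120.0, 45.5],
})

# Wrong: rows["cputime"].sum() double-counts j1 (gives 285.5).
# Right: take one cputime value per job, then aggregate across jobs.
per_job = rows.groupby("job_id")["cputime"].first()
total_cpu = per_job.sum()
print(total_cpu)  # 165.5
```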
- Code related to `Yugong`, our baseline from VLDB 2019:

```shell
python3 tests.py --test=yugong --num_week=13 --rep_rate=0.004 --c=30
python3 scheduler.py --yugong --num_week=13 --opt_path="yugong_results" --c=30
```

Other baselines:
- Without pre-selecting replication, can we achieve enough speedup with sampling? Try k = 0.001, 0.01, 0.05, 0.1:

```shell
python3 tests.py --test=samplek --k=0.001 --num_week=13 --rep_rate=0 --Spark --c=30
python3 tests.py --test=samplek --k=0.01 --num_week=13 --rep_rate=0 --Spark --c=30
python3 tests.py --test=samplek --k=0.05 --num_week=13 --rep_rate=0 --Spark --c=30
python3 tests.py --test=samplek --k=0.1 --num_week=13 --rep_rate=0 --Spark --c=30
python3 scheduler.py --num_week=13 --opt_path="sample_0.050" --c=30
```

- How do other scheduling policies perform?
```shell
python3 scheduler.py --num_week=13 --opt_path="sample_1.000_rep0.002" --policy="size-aware" --c=30
python3 scheduler.py --num_week=13 --opt_path="sample_1.000_rep0.002" --policy="size-unaware" --simple --c=30
```

- How do other replication strategies perform?

```shell
python3 tests.py --test=samplek --k=1 --num_week=1 --rep_rate=0.002 --Spark --c=50 --rep_strategy="read_traffic_density"
```

- Baselines
```shell
python3 baselines.py --baseline="rep_x_month" --rep_rate=0.21 --c=30  # Rep 3 month
python3 baselines.py --baseline="rep_rtd" --rep_rate=0 --c=30         # No rep
```

- Customized test

```shell
python3 tests.py --test=samplek --k=1 --num_week=1 --rep_rate=0.001 --c=10 --opt_start_date="2025-03-04" --table_size_file="report-table-size-20250310.csv"
```