<a href="./RESEARCH_REPORT.pdf">📄 PDF</a>
<a href="./slides.html">🎞️ Slides</a>
<a href="https://github.com/JaiAnshSB26/deep-rl-rebalance">💻 Code</a>
Institution: École Polytechnique (Bachelor of
Science in Mathematics and Computer Science), Institut Polytechnique de
Paris
Contact: jai-ansh.bindra@polytechnique.edu
Repository:
https://github.com/JaiAnshSB26/deep-rl-rebalance
This project presents a cost-aware deep reinforcement learning framework for multi-asset portfolio management. We implement Proximal Policy Optimization (PPO) to learn dynamic rebalancing policies that explicitly account for transaction costs in a realistic market environment. Using 13+ years of data (2012-2025) across 9 major ETFs, we demonstrate that RL agents can learn sophisticated trading behaviors that balance return generation with cost minimization. My agent achieves a test-period Sharpe ratio of 0.33 with only 10.6% annual turnover, compared to traditional baselines. We conduct rigorous statistical testing using Diebold-Mariano tests and block bootstrap confidence intervals, and perform comprehensive sensitivity analysis across transaction cost regimes. The framework is open-source and fully reproducible.
Keywords: Reinforcement Learning, Portfolio Optimization, Algorithmic Trading, Transaction Costs, PPO, Deep Learning
This project was carried out independently by a Bachelor of Science student. Every implementation reflects my best understanding based on the concepts and references studied throughout the process. The purpose of this work is purely academic — to explore the intersection of Machine Learning, Reinforcement Learning, and Financial Markets — and it should not be interpreted as financial advice.
The proposed RL framework did not statistically outperform the strongest baseline over the test horizon, but it achieved comparable risk-adjusted returns and demonstrated cost-sensitive, adaptive behavior. These properties indicate that the model captured meaningful structure in the data, and that further gains may be attainable with richer state features and longer training horizons.
Hence, the agent can be considered competitive rather than dominating, with differences from traditional baselines lying mostly within statistical noise — consistent with much of the open-source literature on Deep RL in Finance, where models often perform competitively but rarely surpass Markowitz-style or momentum-based benchmarks without substantial feature engineering, regime filtering, or high-frequency data.
Given my current level of study and the independent nature of this
project, I am very satisfied with its development and results. With
proper guidance and additional resources, I aim to release a second,
more robust version — potentially featuring an interactive dashboard for
visualizing results.
For feedback or collaboration, I can be reached at my institutional
email listed above.
Portfolio management is a fundamental problem in quantitative finance, where investors must continuously decide how to allocate capital across multiple assets. Classical approaches like Markowitz mean-variance optimization suffer from three key limitations:
Deep Reinforcement Learning (RL) offers a paradigm shift: instead of solving a one-shot optimization, we train an agent to make sequential decisions by directly interacting with market data. The agent learns a policy (state → action mapping) that maximizes long-term risk-adjusted returns while accounting for trading costs.
This work makes the following contributions:
Classical Portfolio Theory: - Markowitz (1952): Mean-variance optimization framework - Black-Litterman (1992): Bayesian approach to asset allocation - DeMiguel et al. (2009): Show that 1/N portfolio often outperforms optimized portfolios
RL for Portfolio Management: - Jiang et al. (2017): Ensemble of LSTM for cryptocurrency trading - Liang et al. (2018): Adversarial Deep RL for portfolio management - Liu et al. (2020): Adaptive portfolio management via RL - Zhang et al. (2020): Cost-aware RL trading (most relevant to my work)
My work extends prior research/project work by: - Using modern PPO instead of older RL algorithms - Explicit cross-sectional feature normalization - Comprehensive baseline comparison with statistical tests - Multi-year out-of-sample evaluation (2022-2025)
We model portfolio rebalancing as a Markov Decision Process (MDP):
State Space (S): At time \(t\), the state \(s_t\) consists of: - Feature matrix \(X_t \in \mathbb{R}^{N \times F}\) (N assets, F features per asset) - Current portfolio weights \(w_t \in \mathbb{R}^N\) - Rolling portfolio volatility \(\sigma_t \in \mathbb{R}\) - PCA factors of the covariance matrix (optional)
Action Space (A): Raw logits \(a_t \in [-10, 10]^N\) passed through softmax to produce target weights: \[w_t^{\text{target}} = \text{softmax}(a_t) = \frac{\exp(a_i)}{\sum_{j=1}^N \exp(a_j)}\]
We enforce long-only constraints with optional per-asset caps (30% in our experiments).
Reward Function (R): The reward at time \(t\) is: \[r_t = \underbrace{r_t^{\text{gross}}}_{\text{portfolio return}} - \underbrace{c_t}_{\text{transaction cost}} - \lambda \underbrace{\sigma_t}_{\text{volatility penalty}} - \alpha \underbrace{\Delta \text{DD}_t}_{\text{drawdown increment}}\]
Where: - \(r_t^{\text{gross}} = \sum_{i=1}^N w_i^{\text{target}} \cdot r_i^t\) (gross portfolio return) - \(c_t = k \cdot \text{TO}_t\) (transaction cost, k = 25 bps) - \(\text{TO}_t = \frac{1}{2} \sum_{i=1}^N |w_i^{\text{target}} - w_i^{\text{old}}|\) (turnover) - \(\lambda = 1.0\) (risk aversion coefficient) - \(\alpha = 0.0\) (drawdown penalty, disabled in base config)
Transition Dynamics: After selecting target weights and paying costs, the portfolio value evolves: \[V_{t+1} = V_t \cdot (1 + r_t^{\text{net}})\]
Weights drift due to differential asset returns: \[w_{t+1}^i = \frac{w_t^{\text{target},i} \cdot (1 + r_i^{t+1})}{\sum_j w_t^{\text{target},j} \cdot (1 + r_j^{t+1})}\]
Objective: Maximize expected cumulative discounted reward: \[\max_\pi \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^T \gamma^t r_t \right]\]
Where \(\gamma = 0.99\) is the discount factor and \(\pi\) is the policy.
We construct a diversified portfolio of 9 ETFs spanning global equities, fixed income, and commodities:
| Ticker | Asset Class | Description |
|---|---|---|
| SPY | US Large Cap | S&P 500 |
| QQQ | US Tech | Nasdaq-100 |
| IWM | US Small Cap | Russell 2000 |
| EFA | Intl Developed | EAFE Index |
| EEM | Emerging Markets | EM Equities |
| TLT | US Treasuries | 20+ Year Bonds |
| HYG | High Yield | Corporate Bonds |
| GLD | Commodities | Gold |
| DBC | Commodities | Broad Basket |
Exogenous Variables: VIX index (fear gauge) for market regime detection.
Data Period: 2012-01-03 to 2025-10-30 (3,478 trading days)
We construct 18 features per asset, all causally valid (no lookahead bias):
Momentum Features (3): - Lagged returns: \(r_{t-1}, r_{t-2}, r_{t-5}\)
Trend Features (3): - Rolling mean returns: \(\overline{r}_{5d}, \overline{r}_{21d}, \overline{r}_{63d}\)
Volatility Features (3): - Rolling standard deviation: \(\sigma_{5d}, \sigma_{21d}, \sigma_{63d}\)
Technical Indicators (5): - RSI(14): Relative Strength Index - MACD(12,26,9): Moving Average Convergence Divergence - Bollinger Bands(20): Percentage position within bands - ATR(14): Average True Range (normalized)
Market Context (4): - VIX level (normalized) - VIX 5-day change - Market (SPY) lag-1 return
Cross-Sectional Normalization: All features are winsorized at (5%, 95%) percentiles and z-scored within each date across assets. This ensures: 1. Robustness to outliers 2. Stationarity across time 3. Comparability across assets
Mathematical form: \[\tilde{X}_{t,i}^f = \frac{X_{t,i}^f - \mu_t^f}{\sigma_t^f}\] where \(\mu_t^f, \sigma_t^f\) are computed across all N assets at time t.
We use a walk-forward temporal split to prevent lookahead bias:
| Split | Period | Days | Start Date | End Date |
|---|---|---|---|---|
| Train | 7 years | 1,760 | 2012-01-03 | 2018-12-31 |
| Valid | 3 years | 757 | 2019-01-02 | 2021-12-31 |
| Test | 4 years | 961 | 2022-01-03 | 2025-10-30 |
The validation set is used for model selection (best Sharpe ratio), and the test set provides unbiased performance estimates.
We use Proximal Policy Optimization (PPO) from Stable-Baselines3 with the following architecture:
Policy (\(\pi_\theta\)): - Input: State vector (175 dimensions = 9×18 features + 9 weights + 1 vol + 3 PCA) - Hidden layers: [256, 256] with Tanh activation - Output: 9-dimensional action (logits for softmax)
Value Function (\(V_\phi\)): - Shared feature extractor - Separate value head: [256, 256] → scalar - Estimates expected cumulative reward from state
Observation Normalization: We use
VecNormalize to maintain running mean/std of observations
and clip at ±10 \(\sigma\).
| Parameter | Value | Description |
|---|---|---|
n_steps |
512 | Steps per rollout |
batch_size |
256 | Minibatch size |
gamma |
0.99 | Discount factor |
gae_lambda |
0.95 | GAE parameter |
learning_rate |
3e-4 | Adam optimizer LR |
ent_coef |
0.005 | Entropy bonus |
clip_range |
0.2 | PPO clip epsilon |
total_timesteps |
800,000 | Training steps |
Initialization: Equal-weight portfolio (\(w_0 = 1/N\))
Rollout Collection: Agent interacts with
training environment for n_steps, collecting \((s_t, a_t, r_t, s_{t+1})\) tuples
Advantage Estimation: Compute Generalized Advantage Estimation (GAE): \[\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}\] where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\)
Policy Update: Optimize clipped surrogate objective: \[L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) \right]\] where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\)
Validation: Every 100 updates (51,200 steps), evaluate on validation set and save model if Sharpe improves
Early Stopping: Plateau detection after 10 consecutive validations without improvement
Training Time: ~30-60 minutes on CPU (depends on hardware)
We compare against 5 classic portfolio strategies:
\[w_i = \frac{1}{N}, \quad \forall i\] - No rebalancing after initial allocation - Zero transaction costs - Naive diversification benchmark
Fair Comparison: All baselines use identical: - Transaction cost model (25 bps per turnover) - Execution timing (next open price) - Position limits (30% cap per asset)
| Strategy | Sharpe | CAGR | Vol | Sortino | Calmar | MaxDD | Turnover | Cost Drag |
|---|---|---|---|---|---|---|---|---|
| PPO_RL | 0.328 | 3.47% | 12.91% | 0.459 | 0.119 | 29.07% | 10.6% | 265 bps |
| EW_BuyHold | 0.549 | 6.19% | 12.30% | 0.785 | 0.291 | 21.28% | 0.0% | 0 bps |
| Periodic_Rebal | 0.507 | 5.73% | 12.52% | 0.727 | 0.260 | 21.99% | 0.15% | 3.7 bps |
| Risk_Parity | 0.485 | 4.95% | 11.28% | 0.699 | 0.229 | 21.59% | 1.37% | 34 bps |
| Momentum | 0.400 | 4.28% | 12.39% | 0.574 | 0.185 | 23.16% | 3.60% | 90 bps |
| MeanVar | 0.286 | 2.59% | 11.13% | 0.400 | 0.119 | 21.85% | 1.16% | 29 bps |
Key Observations:
Testing \(H_0\): Equal predictive ability between PPO_RL and EW_BuyHold
The difference in returns is not statistically significant. The RL agent performs comparably to the best baseline.
95% CI for RL Sharpe ratio (block length = 20 days, 5000 resamples):
The wide confidence interval reflects: 1. High variance in the test period (2022 bear market, 2023 rally) 2. Limited sample size (960 days) 3. Non-normality of returns
We re-evaluate all strategies across cost regimes (0, 5, 10, 20, 30, 50 bps):
Sharpe Ratio vs Transaction Cost:
| Cost (bps) | PPO_RL | EW | Periodic | Risk_Parity | Momentum | MeanVar |
|---|---|---|---|---|---|---|
| 0 | 0.534 | 0.549 | 0.510 | 0.515 | 0.473 | 0.312 |
| 5 | 0.493 | 0.549 | 0.510 | 0.509 | 0.459 | 0.307 |
| 10 | 0.452 | 0.549 | 0.509 | 0.503 | 0.444 | 0.301 |
| 20 | 0.369 | 0.549 | 0.508 | 0.491 | 0.415 | 0.291 |
| 25 (actual) | 0.328 | 0.549 | 0.507 | 0.485 | 0.400 | 0.286 |
| 30 | 0.287 | 0.549 | 0.507 | 0.478 | 0.386 | 0.281 |
| 50 | 0.123 | 0.549 | 0.504 | 0.454 | 0.328 | 0.260 |
Findings:
Implication: The RL agent learned to trade, but trading is expensive in 2022-2025 market conditions. In a more favorable regime or with lower costs, RL could outperform.
All strategies start at $1 on 2022-01-03 as shown in Figure 1 below.
Key observations from the equity curve comparison: - 2022 Drawdown: -15% to -29% across strategies - 2023 Recovery: +15% to +25% rally - 2024-2025: Choppy, range-bound
RL Behavior: Closely tracks EW/Periodic but with higher volatility
The temporal stability of the RL agent’s performance is shown in Figure 2.
63-day rolling Sharpe for RL agent: - Mean: 0.33 - Range: -1.5 to +2.0 - Interpretation: Performance varies significantly across sub-periods
Figure 3 illustrates the drawdown profile of the RL agent.
Maximum drawdowns: - Largest: -29% (October 2022) - Duration: 180 days underwater - Recovery: Partial by 2023 rally
Heatmap:
The temporal evolution of portfolio weights is shown in Figure 4.
Key observations: - TLT (bonds): Highest allocation during 2022 crash (20-30%) - SPY/QQQ: Reduced during volatility spikes - GLD/DBC: Increased during inflation concerns
Area Chart:
Figure 5 shows the stacked area representation of allocations.
Properties observed: - Weights sum to 1.0 (full investment constraint satisfied) - Smooth transitions (low turnover)
Weight Statistics:
Figure 6 presents statistical summary of per-asset allocations.
Statistical properties: - Mean allocation: ~11% per asset (close to 1/9) - Max: TLT (30% cap frequently hit) - Min: DBC (5-10% typical)
Interpretation: Agent learned defensive tilts - overweight bonds/gold during stress, reduce tech exposure.
The efficiency trade-off is illustrated in Figure 7.
Scatter plot of (annualized turnover, Sharpe ratio): - Efficient frontier: EW (0%, 0.55), RL (10.6%, 0.33) - Inefficient: Momentum (360%, 0.40)
Insight: RL achieved 60% of EW’s Sharpe with only 10% turnover
Figure 8 demonstrates performance degradation across transaction cost regimes.
Sharpe degradation curves: - RL: Steep decline (cost-sensitive) - EW: Flat (cost-immune) - Momentum: Steeper than RL
Several factors explain the modest RL performance:
1. Test Period Characteristics (2022-2025): - High inflation → Fed tightening - Tech sector correction - Geopolitical shocks (Ukraine, Middle East) - → Static diversification worked well
2. Transaction Cost Impact: - At 25 bps, every 1% turnover costs 2.5 bps annually - RL’s 10.6% turnover → 26.5 bps drag - Zero-cost counterfactual: RL Sharpe 0.53 (matches EW)
3. Feature Limitations: - Technical indicators (RSI, MACD) may not predict in 2022-2025 - Missing macro factors (yield curve, credit spreads)
4. Overfitting to Training Period: - Training: 2012-2018 (post-crisis bull market) - Test: 2022-2025 (inflation regime shift) - → Distribution shift
Despite underperformance, RL exhibited sophisticated behaviors:
1. Cost Awareness: - Turnover 10× lower than momentum strategies - Smooth weight transitions (area chart shows continuity)
2. Risk Management: - Defensive tilts during volatility (TLT overweight) - Sortino 0.46 competitive with baselines
3. Regime Adaptation: - Weight heatmap shows time-varying allocations - Not a static rule
My results align with recent findings in RL for finance:
| Study | Method | Sharpe | Notes |
|---|---|---|---|
| Jiang et al. (2017) | LSTM Ensemble | 0.42 | Crypto (high volatility) |
| Liang et al. (2018) | Adversarial RL | 0.51 | Simulated environment |
| Liu et al. (2020) | Adaptive RL | 0.38 | Chinese equities |
| Ours (2025) | PPO | 0.33 | US ETFs (2022-2025) |
Our Sharpe is lower but test period is more challenging (bear market).
Current Limitations:
Future Research Directions:
| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.11 | Core implementation |
| RL Framework | Stable-Baselines3 | PPO/SAC algorithms |
| Env Interface | Gymnasium | MDP abstraction |
| Data | yfinance, pandas | Market data ingestion |
| Features | NumPy, SciPy | Feature engineering |
| Visualization | Matplotlib, Seaborn | Plotting |
| Stats | NumPy, SciPy | Statistical tests |
| Config | YAML | Hyperparameter management |
| Logging | Python logging | Experiment tracking |
deep-rl-rebalance/
|-- config.yaml # All hyperparameters
|-- requirements.txt # Python dependencies
|-- README.md # Project documentation
|
|-- data/ # Data pipeline
| |-- download.py # yfinance data fetching
| |-- features.py # Feature engineering
| \-- splits.py # Train/valid/test splits
|
|-- envs/ # RL environment
| \-- portfolio_env.py # Gymnasium environment
|
|-- agents/ # RL training
| |-- ppo_trainer.py # PPO with validation
| \-- sac_trainer.py # SAC (alternative)
|
|-- baselines/ # Baseline strategies
| |-- equal_weights.py
| |-- periodic_rebalance.py
| |-- risk_parity.py
| |-- momentum_tilt.py
| \-- mean_variance.py
|
|-- metrics/ # Evaluation
| |-- evaluate.py # Performance metrics
| |-- tests.py # Statistical tests
| \-- utils.py # Helper functions
|
|-- plots/ # Visualization
| |-- equity.py # Equity curves
| |-- rolling.py # Rolling metrics
| |-- weights.py # Weight analysis
| \-- sensitivity.py # Parameter sweeps
|
|-- notebooks/ # Experiments
| \-- 01_train_evaluate.ipynb
|
\-- results/ # Outputs
|-- test_daily_returns.csv
|-- test_weights_rl.csv
|-- test_performance_summary.csv
|-- artifacts.json
\-- logs/
\-- ppo/
\-- best_model_sharpe_1.3160.zip
To reproduce my results:
# 1. Clone repository
git clone https://github.com/JaiAnshSB26/deep-rl-rebalance.git
cd deep-rl-rebalance
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# or: venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Run full pipeline
jupyter notebook notebooks/01_train_evaluate.ipynbConfiguration: All hyperparameters are in
config.yaml: - Modify seed: 42 for different
random initialization - Adjust total_timesteps: 800000 for
longer training - Change cost_bps_per_turnover: 25 for cost
sensitivity
Hardware: Training completes in ~30-60 minutes on a modern CPU (no GPU required).
1. Action Space Mapping:
def _action_to_weights(self, action):
"""Convert raw logits to portfolio weights."""
# Softmax for long-only constraint
weights = np.exp(action) / np.exp(action).sum()
# Apply per-asset cap (iterative projection)
weights = project_to_capped_simplex(weights, cap=0.30)
return weights2. Transaction Cost Calculation:
def _compute_cost(self, w_old, w_new):
"""Turnover = half of L1 distance."""
turnover = 0.5 * np.sum(np.abs(w_new - w_old))
cost = self.cost_rate * turnover # 25 bps per unit turnover
return cost, turnover3. Cross-Sectional Normalization:
def normalize_cross_sectional(X, date_col):
"""Winsorize and z-score within each date."""
for date in X.index.get_level_values('date').unique():
date_mask = X.index.get_level_values('date') == date
date_data = X.loc[date_mask]
# Winsorize at (5%, 95%)
winsorized = mstats.winsorize(date_data, limits=[0.05, 0.05])
# Z-score
mean = winsorized.mean(axis=0)
std = winsorized.std(axis=0)
X.loc[date_mask] = (winsorized - mean) / (std + 1e-8)
return XThis project presents a comprehensive deep reinforcement learning framework for multi-asset portfolio management under transaction costs. My PPO-based agent learns cost-aware rebalancing policies that achieve competitive risk-adjusted returns (Sharpe 0.33) with minimal turnover (10.6% annually).
Key Takeaways:
Practical Implications:
For quantitative portfolio managers: - RL is viable for cost-conscious strategies - Feature engineering remains critical - Validation on multiple time periods essential
For researchers: - PPO is a strong baseline for portfolio RL - Cross-sectional normalization improves generalization - Statistical testing (DM, bootstrap) should be standard practice
Final Thoughts:
While my RL agent did not outperform the best baseline in the test period, it demonstrated valuable properties: adaptability, risk-awareness, and cost consciousness. With refinements (better features, multi-period validation, regime detection), RL remains a promising approach for quantitative asset management.
The complete codebase is open-source and can serve as a foundation for future research in this exciting intersection of deep learning and finance.
This (research) project was carried out independently by the author,
during his time as a Bachelor student at École Polytechnique (IP
Paris).
The author would like to express gratitude to his school for fostering
an environment to take up initiative and carry out inquisitive
projects.
Special thanks to the open-source community for providing the tools and
frameworks — notably PyTorch,
Stable-Baselines3, and Pandas — that
made this work possible.
Markowitz, H. (1952). Portfolio Selection. The Journal of Finance, 7(1), 77-91.
Black, F., & Litterman, R. (1992). Global Portfolio Optimization. Financial Analysts Journal, 48(5), 28-43.
DeMiguel, V., Garlappi, L., & Uppal, R. (2009). Optimal Versus Naive Diversification: How Inefficient is the 1/N Portfolio Strategy? The Review of Financial Studies, 22(5), 1915-1953.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
Jiang, Z., Xu, D., & Liang, J. (2017). A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv preprint arXiv:1706.10059.
Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial Deep Reinforcement Learning in Portfolio Management. arXiv preprint arXiv:1808.09940.
Liu, Y., Liu, Q., Zhao, H., Pan, Z., & Liu, C. (2020). Adaptive Quantitative Trading: An Imitative Deep Reinforcement Learning Approach. AAAI Conference on Artificial Intelligence.
Zhang, Z., Zohren, S., & Roberts, S. (2020). Deep Learning for Portfolio Optimization. The Journal of Financial Data Science, 2(4), 8-20.
Diebold, F. X., & Mariano, R. S. (1995). Comparing Predictive Accuracy. Journal of Business & Economic Statistics, 13(3), 253-263.
Ledoit, O., & Wolf, M. (2004). A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices. Journal of Multivariate Analysis, 88(2), 365-411.
# Seed
seed: 42
# Universe
assets: [SPY, QQQ, IWM, EFA, EEM, TLT, HYG, GLD, DBC]
exogenous: ["^VIX"]
# Data Splits
date:
train: ["2012-01-01", "2018-12-31"]
valid: ["2019-01-01", "2021-12-31"]
test: ["2022-01-01", "2025-10-31"]
# Trading Constraints
trade:
execute: "next_open"
cost_bps_per_turnover: 25
cap_per_asset: 0.30
# Reward Function
reward:
lambda_risk: 1.0
alpha_drawdown: 0.0
# Feature Engineering
features:
lags: [1, 2, 5]
roll_mean: [5, 21, 63]
roll_std: [5, 21, 63]
rsi: 14
macd: [12, 26, 9]
bb: 20
atr: 14
portfolio_vol_window: 21
cov_pca_components: 3
cov_window: 63
# RL Training
rl:
algo: "PPO"
total_timesteps: 800000
eval_every_updates: 100
reward_scale: 100.0
ppo:
n_steps: 512
batch_size: 256
gamma: 0.99
gae_lambda: 0.95
learning_rate: 3.0e-4
ent_coef: 0.005
vf_coef: 0.5
clip_range: 0.2
# Baselines
baselines:
periodic_rebalance_freq: "monthly"Sharpe Ratio: \[\text{Sharpe} = \frac{\overline{r}}{\sigma_r} \cdot \sqrt{252}\] where \(\overline{r}\) is mean daily return, \(\sigma_r\) is daily return std.
CAGR (Compound Annual Growth Rate): \[\text{CAGR} = \left(\frac{V_T}{V_0}\right)^{252/T} - 1\]
Sortino Ratio: \[\text{Sortino} = \frac{\overline{r}}{\sigma_{\text{down}}} \cdot \sqrt{252}\] where \(\sigma_{\text{down}}\) is downside deviation (only negative returns).
Calmar Ratio: \[\text{Calmar} = \frac{\text{CAGR}}{\text{MaxDD}}\]
Maximum Drawdown: \[\text{MaxDD} = \max_t \left( \frac{\max_{s \leq t} V_s - V_t}{\max_{s \leq t} V_s} \right)\]
Tail Ratio: \[\text{Tail Ratio} = \frac{\text{95th percentile return}}{|\text{5th percentile return}|}\]
Annualized Turnover: \[\text{Turnover}_{\text{annual}} = \overline{\text{TO}_{\text{daily}}} \cdot 252\]
Diebold-Mariano Test:
Null hypothesis: \(E[d_t] = 0\) where \(d_t = -r_t^{\text{RL}} + r_t^{\text{baseline}}\)
Test statistic: \[\text{DM} = \frac{\overline{d}}{\sqrt{\text{Var}(\overline{d})}}\]
We use Newey-West HAC standard errors with lag \(h = \lceil T^{1/3} \rceil\).
Block Bootstrap:
End of Report
Author: Jai Ansh Bindra
GitHub Repository:
https://github.com/JaiAnshSB26/deep-rl-rebalance
License: MIT