In this tutorial, we are going to learn how to use ray with TensorTrade in order to create a profitable algorithm on a predictable sine curve. You may be asking yourself, why use something so simple when the real world data is much more difficult to predict? Now this is a very good question and there is a simple answer.

“The man who moves a mountain begins by carrying away small stones.”

- Confucius

Before trying to jump into the world of complex trading environments, a simple sine curve can be used to perform a sanity check on your trading algorithm. The reward and action scheme you use should be able to make money on the predictable pattern of a sine curve. If it does not, then you know there is no possibility success will be found on a more complex environment. There are some answers we would like to know fast before we waste time and resources in developing an algorithm which are and one of them is, does our RewardScheme correctly specify the goal we are after?

In this tutorial we will propose a new reward scheme and action scheme and show that you can actually have the reward scheme be dependent on the action scheme. This will be done through the use of a DataFeed. We first, however, need to install ray.

!pip install ray==0.8.7
!pip install symfit


We will begin by defining the two instruments we want to define our portfolio with. We will be using the U.S. dollar and fake coin called TensorTrade Coin.

from tensortrade.oms.instruments import Instrument

USD = Instrument("USD", 2, "U.S. Dollar")
TTC = Instrument("TTC", 8, "TensorTrade Coin")


Now let us look at the curve we will be using to define our price. png

Ideally, what we will be expecting from our agent is that it will be able to sell at the peaks and buy at the troughs. We will move on to defining our action scheme. The ActionScheme we are going to build is going to be extremely simple. There are only 3 states that our agent can be in which are buy, sell, and hold. We will make use of a new function in the library, proportion_order. This function enables the user to make an order that can take a percentage of funds at a particular wallet and send it to another. Therefore, we want to structure a way to only have two actions in our scheme and use them as indicators to move our funds to the opposite wallet.

from gym.spaces import Discrete

Order,
proportion_order,
)

registered_name = "bsh"

def __init__(self, cash: 'Wallet', asset: 'Wallet'):
super().__init__()
self.cash = cash
self.asset = asset

self.listeners = []
self.action = 0

@property
def action_space(self):
return Discrete(2)

def attach(self, listener):
self.listeners += [listener]
return self

def get_orders(self, action: int, portfolio: 'Portfolio'):
order = None

if abs(action - self.action) > 0:
src = self.cash if self.action == 0 else self.asset
tgt = self.asset if self.action == 0 else self.cash
order = proportion_order(portfolio, src, tgt, 1.0)
self.action = action

for listener in self.listeners:
listener.on_action(action)

return [order]

def reset(self):
super().reset()
self.action = 0


Next, we want to create our reward scheme to reflect how well we are positioned in the environment. Essentially, we want to make a mapping that shows how we would like our rewards to be reflected for each state we are in. The following shows such a mapping:

State Price Up Price Down
All funds in cash wallet (0) - +
All funds in asset wallet (1) + -

The signs in the table show what we would like the sign of the rewards to be. The position-based reward scheme (PBR) achieves this mapping.

from tensortrade.env.default.rewards import TensorTradeRewardScheme

registered_name = "pbr"

def __init__(self, price: 'Stream'):
super().__init__()
self.position = -1

r = Stream.sensor(price, lambda p: p.value, dtype="float").diff()
position = Stream.sensor(self, lambda rs: rs.position, dtype="float")

reward = (r * position).fillna(0).rename("reward")

self.feed = DataFeed([reward])
self.feed.compile()

def on_action(self, action: int):
self.position = -1 if action == 0 else 1

def get_reward(self, portfolio: 'Portfolio'):
return self.feed.next()["reward"]

def reset(self):
self.position = -1
self.feed.reset()


Finally, we would like to make sure we can see if the agent is selling at the peaks and buying at the troughs. We will make a quick Renderer that can show this information using Matplotlib.

import matplotlib.pyplot as plt

class PositionChangeChart(Renderer):

def __init__(self, color: str = "orange"):
self.color = "orange"

def render(self, env, **kwargs):
history = pd.DataFrame(env.observer.renderer_history)

actions = list(history.action)
p = list(history.price)

sell = {}

for i in range(len(actions) - 1):
a1 = actions[i]
a2 = actions[i + 1]

if a1 != a2:
if a1 == 0 and a2 == 1:
else:
sell[i] = p[i]

sell = pd.Series(sell)

fig, axs = plt.subplots(1, 2, figsize=(15, 5))

fig.suptitle("Performance")

axs.plot(np.arange(len(p)), p, label="price", color=self.color)
axs.scatter(sell.index, sell.values, marker="^", color="red")

env.action_scheme.portfolio.performance.plot(ax=axs)
axs.set_title("Net Worth")

plt.show()


Train
Now in order to use our custom environment in ray we must first write a function that creates an instance of the TradingEnv from a configuration dictionary.

import ray
import numpy as np
import pandas as pd

from ray import tune
from ray.tune.registry import register_env

def create_env(config):
x = np.arange(0, 2*np.pi, 2*np.pi / 1001)
y = 50*np.sin(3*x) + 100

x = np.arange(0, 2*np.pi, 2*np.pi / 1000)
p = Stream.source(y, dtype="float").rename("USD-TTC")

coinbase = Exchange("coinbase", service=execute_order)(
p
)

cash = Wallet(coinbase, 100000 * USD)
asset = Wallet(coinbase, 0 * TTC)

portfolio = Portfolio(USD, [
cash,
asset
])

feed = DataFeed([
p,
p.rolling(window=10).mean().rename("fast"),
p.rolling(window=50).mean().rename("medium"),
p.rolling(window=100).mean().rename("slow"),
p.log().diff().fillna(0).rename("lr")
])

reward_scheme = PBR(price=p)

action_scheme = BSH(
cash=cash,
asset=asset
).attach(reward_scheme)

renderer_feed = DataFeed([
Stream.source(y, dtype="float").rename("price"),
Stream.sensor(action_scheme, lambda s: s.action, dtype="float").rename("action")
])

environment = default.create(
feed=feed,
portfolio=portfolio,
action_scheme=action_scheme,
reward_scheme=reward_scheme,
renderer_feed=renderer_feed,
renderer=PositionChangeChart(),
window_size=config["window_size"],
max_allowed_loss=0.6
)
return environment



Now that the environment is registered we can run the training algorithm using the Proximal Policy Optimization (PPO) algorithm implemented in rllib.

analysis = tune.run(
"PPO",
stop={
"episode_reward_mean": 500
},
config={
"env_config": {
"window_size": 25
},
"log_level": "DEBUG",
"framework": "torch",
"ignore_worker_failures": True,
"num_workers": 1,
"num_gpus": 0,
"clip_rewards": True,
"lr": 8e-6,
"lr_schedule": [
[0, 1e-1],
[int(1e2), 1e-2],
[int(1e3), 1e-3],
[int(1e4), 1e-4],
[int(1e5), 1e-5],
[int(1e6), 1e-6],
[int(1e7), 1e-7]
],
"gamma": 0,
"observation_filter": "MeanStdFilter",
"lambda": 0.72,
"vf_loss_coeff": 0.5,
"entropy_coeff": 0.01
},
checkpoint_at_end=True
)


After training is complete, we would now like to get access to the agents policy. We can do that by restoring the agent using the following code.

import ray.rllib.agents.ppo as ppo

# Get checkpoint
checkpoints = analysis.get_trial_checkpoints_paths(
trial=analysis.get_best_trial("episode_reward_mean"),
metric="episode_reward_mean"
)
checkpoint_path = checkpoints

# Restore agent
agent = ppo.PPOTrainer(
config={
"env_config": {
"window_size": 25
},
"framework": "torch",
"log_level": "DEBUG",
"ignore_worker_failures": True,
"num_workers": 1,
"num_gpus": 0,
"clip_rewards": True,
"lr": 8e-6,
"lr_schedule": [
[0, 1e-1],
[int(1e2), 1e-2],
[int(1e3), 1e-3],
[int(1e4), 1e-4],
[int(1e5), 1e-5],
[int(1e6), 1e-6],
[int(1e7), 1e-7]
],
"gamma": 0,
"observation_filter": "MeanStdFilter",
"lambda": 0.72,
"vf_loss_coeff": 0.5,
"entropy_coeff": 0.01
}
)
agent.restore(checkpoint_path)


Now let us get a visualization of the agent’s decision making on our sine curve example by rendering the environment.

# Instantiate the environment
env = create_env({
"window_size": 25
})

# Run until episode ends
episode_reward = 0
done = False
obs = env.reset()

while not done:
action = agent.compute_action(obs)
obs, reward, done, info = env.step(action)
episode_reward += reward

env.render() png

From the rendering, you can see that the agent is making near optimal decisions on the environment. Now we want to put the agent in an environment it is not used to and see what kind of decisions it makes. We will use for our price, an order 5 Fourier series fitted to a randomly generated Geometric Brownian Motion (GBM). We will use the symfit library to do this.

from symfit import parameters, variables, sin, cos, Fit

def fourier_series(x, f, n=0):
"""Creates a symbolic fourier series of order n.

Parameters
----------
x : symfit.Variable
The input variable for the function.
f : symfit.Parameter
Frequency of the fourier series
n : int
Order of the fourier series.
"""
# Make the parameter objects for all the terms
a0, *cos_a = parameters(','.join(['a{}'.format(i) for i in range(0, n + 1)]))
sin_b = parameters(','.join(['b{}'.format(i) for i in range(1, n + 1)]))

# Construct the series
series = a0 + sum(ai * cos(i * f * x) + bi * sin(i * f * x)
for i, (ai, bi) in enumerate(zip(cos_a, sin_b), start=1))
return series

def gbm(price: float,
mu: float,
sigma: float,
dt: float,
n: int) -> np.array:
"""Generates a geometric brownian motion path.

Parameters
----------
price : float
The initial price of the series.
mu : float
The percentage drift.
sigma : float
The percentage volatility.
dt : float
The time step size.
n : int
The number of steps to be generated in the path.

Returns
-------
np.array
The generated path.
"""
y = np.exp((mu - sigma ** 2 / 2) * dt + sigma * np.random.normal(0, np.sqrt(dt), size=n).T)
y = price * y.cumprod(axis=0)
return y

def fourier_gbm(price, mu, sigma, dt, n, order):

x, y = variables('x, y')
w, = parameters('w')
model_dict = {y: fourier_series(x, f=w, n=order)}

# Make step function data
xdata = np.arange(-np.pi, np.pi, 2*np.pi / n)
ydata = np.log(gbm(price, mu, sigma, dt, n))

# Define a Fit object for this model and data
fit = Fit(model_dict, x=xdata, y=ydata)
fit_result = fit.execute()

return np.exp(fit.model(x=xdata, **fit_result.params).y)


Now we can make the evaluation environment and see how the agent performs.

def create_eval_env(config):
y = config["y"]

x = np.arange(0, 2*np.pi, 2*np.pi / 1000)
p = Stream.source(y, dtype="float").rename("USD-TTC")

coinbase = Exchange("coinbase", service=execute_order)(
p
)

cash = Wallet(coinbase, 100000 * USD)
asset = Wallet(coinbase, 0 * TTC)

portfolio = Portfolio(USD, [
cash,
asset
])

feed = DataFeed([
p,
p.rolling(window=10).mean().rename("fast"),
p.rolling(window=50).mean().rename("medium"),
p.rolling(window=100).mean().rename("slow"),
p.log().diff().fillna(0).rename("lr")
])

reward_scheme = PBR(price=p)

action_scheme = BSH(
cash=cash,
asset=asset
).attach(reward_scheme)

renderer_feed = DataFeed([
Stream.source(y, dtype="float").rename("price"),
Stream.sensor(action_scheme, lambda s: s.action, dtype="float").rename("action")
])

environment = default.create(
feed=feed,
portfolio=portfolio,
action_scheme=action_scheme,
reward_scheme=reward_scheme,
renderer_feed=renderer_feed,
renderer=PositionChangeChart(),
window_size=config["window_size"],
max_allowed_loss=0.6
)
return environment

# Instantiate the environment
env = create_eval_env({
"window_size": 25,
"y": fourier_gbm(price=100, mu=0.01, sigma=0.5, dt=0.01, n=1000, order=5)
})

# Run until episode ends
episode_reward = 0
done = False
obs = env.reset()

while not done:
action = agent.compute_action(obs)
obs, reward, done, info = env.step(action)
episode_reward += reward


The following are a few examples of the rendering that came from the evaluation environment. png png png png

As you can see, the agent has been able to make correct decisions on some of the price curves, but not all of them. The last curve shows some of the shortcomings of the agent. The agent seemed to have stop making decisions to move its wealth to capitalize on the changes in price. In the first three, however, it was able to make those decisions. The reason for this is most likely the change in the frequency of the curve. The last chart has a price curve that is volatile, containing many local maxima and minima as opposed to the first three charts.

What have we learned?

• Make an environment in TT.
• Create a custom ActionScheme, RewardScheme, and Renderer in TT.
• Using a simple price curve to understand more about actions and rewards in our use case.
• Using ray to train and restore an agent.
• Making evaluation environments for providing insight into the decisions of an agent.

See you in the next tutorial!