Lessons learned building an ML trading system that turned $5k into $200k

One of my recent side projects was building an automated trading system for the crypto markets. To be fair, I probably spent more time on this than on my full-time job, so calling it a side project may not be completely accurate. The internet is full of people ready to teach you about trading. Most are trying to sell you something, and many are mistaking random chance for skill. Coming from a technical background in scientific research and software engineering, I tried to ignore anything with little scientific validity, like technical analysis, or anything that looked like marketing BS. After a lot of iterations, I managed to build and deploy a system that turned my $5k investment into around $200k of pre-tax profit over a 12-month period while staying largely market neutral, i.e. not relying on ups or downs. The best run was a 4-month period without a single losing day. I did have losses on shorter time scales, but very rarely on a daily level.

In this post I want to share some of the problems I encountered and the lessons I learned. I will try to strike a balance between providing useful information and not revealing specific implementation details.

Can we predict the market?

A common misconception is that the market cannot be predicted and that hedge fund managers are no better than dart-throwing monkeys. Many academic research papers back up this claim with data. This is an overly simplistic view. Just because some markets cannot be predicted under some experimental settings, such as equities traded on a daily basis, this does not mean no market can be predicted in any setting. Let us try to get an intuitive understanding of what it means to predict the market. To do so, we have to understand the market participants.

  • Retail investors: The average person. He or she may be buying BTC because a friend recommended it, or selling mined BTC to get some cash. 
  • Institutional Investors: Organizations or high net-worth individuals who trade large quantities. They can be responsible for big market movements.
  • Professional Human Traders: These people are actively trying to beat the market. They may make trades based on news, some combination of technical analysis indicators, or gut feeling.
  • Algorithms: Trading algorithms receive market data, make decisions, and place orders automatically.

To anyone looking at the data, retail and institutional investor activity looks random. A Bitcoin miner may be cashing out, or someone may be buying up a large quantity of BTC due to insider information. There is no way we could ever predict such market activity. For our purposes, it is just random noise. Sometimes we will be lucky and be on the right side of the market when such random activity moves the price, and sometimes we will end up on the wrong side. Over a big enough time period this should come out to net zero.

Professional human traders and algorithms are more interesting to us. They are essentially the same, as both of them follow a set of rules to make decisions. As a result, they leave patterns in the data. If we knew their rules, it would be trivial to come up with an exploitative strategy to make money. For example, if we knew that some algorithm buys X amount when a MACD signal, a nonsensical but widely used technical analysis indicator, reaches its threshold, we would just need to slightly modify the parameters to buy before the algorithm does, and then sell after the algorithm has driven up the price with its buy. Of course, this is an overly simplified example. Most algorithms are more complicated, such as ML-based models, and we are ignoring liquidity, latency, fees, and other aspects. The point is that any market participant making consistent rule-based decisions can be exploited if we know how.

An important aspect of the above is time scale. Looking at daily prices, market activity looks more random than if we looked at the data on a per-second scale. The reason for that is quite intuitive. Over longer time scales, such as days and weeks, market activity is a result of complex interactions between political news, legal rulings, public sentiment, social hype, business decisions, and so on. Large random trades by institutional investors can also have a big impact, but they don't happen very often. These are not things that can be modeled or exploited by algorithms. We don't have data for black-swan events, making it impossible to model and predict them algorithmically. However, if we zoom into the market activity for a single hour, minute, or second, we can often see patterns. These patterns are a result of algorithmic rule-based trading. Our goal is exploiting such patterns to make a profit.

In other words, instead of thinking of beating the market, let's think of making a profit as exploiting a large-enough population of other market participants.

What's your edge?

Buy and sell transactions happen between two or more market participants and in order for us to make a profit, someone else must make a loss. What makes us better than all the other traders who are also trying to be profitable? In trading, a competitive advantage is called an edge and may come from various places:

  • Latency: We have a faster connection to the exchange than others. This means we can observe new data faster and submit orders before others. In the financial markets, institutions spend millions of dollars to minimize latency to exchanges.
  • Infrastructure: Our infrastructure may be more fault-tolerant, higher performance, or handle edge cases better than the competition.
  • Data: We may have better data than others, where better can mean many things. Our data may be collected more reliably with fewer outages, come from a different API, or be cleaned and post-processed more carefully. Reconstructing the Limit-Order Book from raw real-time API data can be an error-prone process due to noisy, delayed, or duplicate data.
  • Model: We may be able to build a better predictive model based on patterns in the data. Perhaps we use some fancy new Deep Learning techniques, have a better optimization function (more on that below), better features, or a different training algorithm.
  • Market Access: We may have access to a market not everyone can trade in. For example, certain exchanges in South Korea require you to be a citizen to get access. Many international exchanges don't accept U.S. citizens as customers to avoid dealing with the IRS.

A common mistake is to focus on the model because it's sexy. I have talked to many people who tried to build a profitable trading system using a fancy (and now commoditized) Deep Learning and/or Reinforcement Learning algorithm. Most of them fail. They believe their edge comes from the model and neglect the other ingredients. They don't optimize server placement, they rely on open source software to place orders and collect data, and they train on easily accessible public datasets. Their model may in fact give them a small edge, but it's not big enough to make up for all the other mistakes.

It's hard to pin down where my own edge comes from. It's likely a combination of all the above. I do have some fancy Machine Learning models, but the biggest edge probably comes from the effort put into building the infrastructure. Almost any trading-related open source software is suboptimal. It's useful to play around with for learning purposes, but not suited for serious production usage. I started out with using open source components, but after many iterations I ended up building custom components for everything, including real-time data collection and cleaning, backtesting and simulation, order management and normalization, monitoring, and live trading.

Picking a market

A market is an asset traded on a specific exchange. For example, one market is BTC/USDT traded on Binance. Another one is BTC/USD traded on Coinbase. There exist hundreds of different cryptocurrency exchanges, each trading dozens of assets. How do we choose where to trade? The ideal market has high liquidity, low trading fees, fast and reliable APIs, and good security.

Let's talk about market liquidity. Whole books have been written on defining and measuring liquidity, but it can be roughly understood as the volume we can trade without significantly affecting the market price. Liquid markets have a small spread and little slippage. They offer low trading costs while allowing us to trade large volumes. An easy-to-measure proxy metric for liquidity is trade volume. Markets with high trade volume often, but not always, have high liquidity. Because it is such a commonly used metric to make decisions, many cryptocurrency exchanges use fake volumes to make themselves look better than they are. Unfortunately, there is no public ranking of exchanges that's reliable, even though attempts such as cer.live exist. A common mistake is to rely on sites such as CMC's exchange ranking, which is useless and driven by exchanges paying advertising fees to get listed. In the end, any publicly available ranking is prone to being gamed by the exchanges, either by outright paying money to the maintainers, or by manipulating their data. Whenever an exchange ranking becomes popular, it's probably only a matter of time before the exchanges, many of which are swimming in cash, are offering enough to the owners to get listed. On top of that, exchange ranking sites don't care how accurate their data is. They are not trading based on their data, they are marketing machines.

The only reliable way to evaluate markets is to collect and analyze data yourself. What does the order book look like? Do trades look real or fake? What are the spread and slippage distributions? While some exchanges are blatant in their use of algorithms to fake trade data, others employ more sophisticated techniques to make their data look real. In such cases, we won't find out that we are dealing with fake data until we actually start trading on the exchange ourselves.
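As a rough illustration of what such an analysis can look like, here is a minimal sketch that summarizes the spread distribution from top-of-book snapshots we collected ourselves. The snapshot format, a list of (best_bid, best_ask) pairs, and the function name are assumptions made for this example, not any exchange's API.

```python
import statistics
from typing import Dict, List, Tuple


def spread_summary_bps(snapshots: List[Tuple[float, float]]) -> Dict[str, float]:
    """Summarize the bid-ask spread in basis points of the midprice.

    snapshots: (best_bid, best_ask) pairs sampled over time (hypothetical format).
    Needs at least two snapshots.
    """
    spreads = []
    for best_bid, best_ask in snapshots:
        mid = (best_bid + best_ask) / 2.0
        spreads.append((best_ask - best_bid) / mid * 1e4)
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(spreads, n=20, method="inclusive")[-1]
    return {
        "median": statistics.median(spreads),
        "p95": p95,
        "max": max(spreads),
    }
```

A large gap between the median and the tail values is a hint that the market is cheap to trade most of the time but expensive at exactly the moments when liquidity dries up.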

Understanding trading costs

To be profitable, our trades must be good enough to offset all trading costs. Exchange trading fees are obvious to most people, but costs such as slippages tend to be neglected despite being crucial. Let's say we buy a quantity qty of BTC and sell it again after some time period. Here is what we would pay in pure trading costs:

trade_cost(qty) = (2 * qty * exchange_fee) + (qty * spread) + slippage_buy(qty) + slippage_sell(qty)

(2 * qty * exchange_fee) is the exchange fee, where exchange_fee is the fee paid per unit traded, i.e. the fee rate multiplied by the price. For example, an exchange may charge 0.01% trading fees on each trade. Since we are buying and selling, we are making two trades and paying the fee twice.

(qty * spread) is the bid-ask spread we are paying for buying at the ask and selling at the bid. Even if the market does not move at all, we are still buying at a slightly higher price than we are selling at. The spread changes over time, and incorporating it into our trading decisions is crucial. If we are trading BTC/USDT on Binance, the spread distribution over a single day may look something like this, ranging from less than $1 most of the time up to $5+ at illiquid times.

[Figure: Binance spread distribution]

slippage_buy(qty) and slippage_sell(qty) are price slippages as a result of insufficient market liquidity. The larger the quantity we are trading, the more slippage cost we are paying because the full quantity cannot be filled at the best price. For BTC/USDT on Binance, the slippage distribution looks something like the figure below. Each row corresponds to a specific trade size, ranging from 0.1 to 2.0 BTC, and the x-axis shows the cost as % of price paid. Trading during times of low liquidity means that slippage costs can easily dominate exchange fees or spreads, as we can see from the extremes of the distribution.

[Figure: Binance slippage cost distribution]

How much do these costs matter? It depends. If we are trading once a day and betting on large market movements we can ignore most of these costs. 0.5% trading fees are bearable when we are betting on 10%+ price movements. The shorter the time scales we are trading on, the more crucial these costs become. The market does not make large movements within a few seconds and our trades can easily be dominated by trading costs, wiping out any profit. The other important factor is the quantity we are trading. Larger quantities mean more profit, but also more fees. Exchange fees and spread may scale linearly with quantity, but the slippage does not and can lead to bad surprises. Using book-crossing limit orders instead of market orders is one way to protect oneself against large slippage costs, but requires additional infrastructure to manage partial order fills and cancellations. Again, the shorter the time scale we are trading on, the lower the quantity we can profitably trade without getting wiped out by trading costs.
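The cost formula from this section can be sketched in code by walking an order book snapshot. This is a simplified illustration, not my production code: it assumes the book does not change between the buy and the sell, that fees are charged as a fraction of traded value, and that the book is given as lists of (price, quantity) levels sorted best first. All names are made up for the example.

```python
from typing import Dict, List, Tuple

Level = Tuple[float, float]  # (price, quantity available at that price)


def fill_value(levels: List[Level], qty: float) -> float:
    """Walk one side of the book and return the quote-currency value of filling qty."""
    remaining = qty
    value = 0.0
    for price, available in levels:
        take = min(remaining, available)
        value += take * price
        remaining -= take
        if remaining <= 1e-12:
            return value
    raise ValueError("not enough depth in the book to fill the requested quantity")


def trade_cost(bids: List[Level], asks: List[Level], qty: float,
               fee_rate: float) -> Dict[str, float]:
    """Round-trip cost of buying qty at the asks and selling it into the bids.

    fee_rate is a fraction of traded value, e.g. 0.0001 for 0.01%.
    """
    best_bid, best_ask = bids[0][0], asks[0][0]
    buy_value = fill_value(asks, qty)    # what we pay when buying
    sell_value = fill_value(bids, qty)   # what we receive when selling
    costs = {
        "fees": fee_rate * (buy_value + sell_value),
        "spread": qty * (best_ask - best_bid),
        "slippage_buy": buy_value - qty * best_ask,
        "slippage_sell": qty * best_bid - sell_value,
    }
    costs["total"] = sum(costs.values())
    return costs
```

Calling trade_cost with increasing quantities on the same snapshot shows the effect described above: fees and spread grow linearly with quantity, while slippage jumps whenever the order eats through another price level.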

Time scale

How do we decide if we should trade based on high frequency data or make a single trade per day? To understand the tradeoffs, let's look at the extremes.

On short time scales, such as milliseconds, large market movements don't occur. Each of our trades can result in only a tiny profit, but we can make a lot of them. And as discussed above, high trading costs are likely going to destroy us. Even if we could perfectly predict the market on a millisecond scale, such a model would not be useful. Sending an HTTP request to the exchange and waiting for it to be processed by the exchange matching engine typically takes tens to hundreds of milliseconds. By the time our order is processed, the market has changed significantly and our prediction is outdated.

The other extreme would be trading based on something closer to daily prices. Market movements from one day to the next are large enough for us to completely ignore trading costs. Latencies don't matter either. However, such long-term market movements are probably driven by complex real-world interactions such as news and social behavior or other random events like institutional investor activity. If we are relying on pure pattern matching (Machine Learning), we can't hope to make good predictions on such time scales. We also don't have much data. A few hundred examples are not enough data to train any modern algorithms. We also can't reliably test and evaluate our algorithms. With so much noise in the data, testing on a few hundred data points is like throwing darts.

This means we need to find a balance between sufficiently large returns to cover the trading costs, big enough datasets, and the ability to recognize patterns in the data. Short time scales tend to have more patterns and examples, but we need to be careful about trading costs and latencies, which in turn depend on market liquidity and exchange APIs.

An alternative to using intervals based on a natural clock (seconds) is to use intervals based on some other measure, such as trade volume. For example, instead of defining a tick as 1 second, we could define it as 1.0 BTC traded, which could happen within one second or one minute, depending on how busy the market is. The intuition here is that we want to act more frequently during high-activity periods (high volume traded) and less frequently during low-activity periods (low volume traded). Aggregating data based on volume also results in more normalized data distributions of features and labels, which is helpful for training ML algorithms. Such approaches come with their own drawbacks. For example, when acting based on volume traded we may already be too late. Acting after a volume spike means that the market has already moved. Ideally we want to place an order before the other market participants, i.e. before the trade volume picks up.
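To make the volume clock concrete, here is a minimal sketch that aggregates a stream of trades into bars of equal traded volume. The trade tuple format and the choice to split a large trade across bars are assumptions of this example. Other conventions, such as assigning the whole trade to one bar, are just as valid.

```python
from typing import Dict, Iterable, List, Tuple

Trade = Tuple[float, float, float]  # (timestamp, price, quantity)


def volume_bars(trades: Iterable[Trade], bar_volume: float) -> List[Dict[str, float]]:
    """Group trades into bars that each contain bar_volume of traded quantity.

    A trade larger than the space left in the current bar is split across bars.
    A trailing bar that never fills up is discarded.
    """
    bars: List[Dict[str, float]] = []
    current = None
    filled = 0.0
    for ts, price, qty in trades:
        remaining = qty
        while remaining > 1e-12:
            if current is None:
                current = {"start": ts, "open": price, "high": price, "low": price}
                filled = 0.0
            take = min(remaining, bar_volume - filled)
            filled += take
            remaining -= take
            current["high"] = max(current["high"], price)
            current["low"] = min(current["low"], price)
            current["close"] = price
            current["end"] = ts
            if filled >= bar_volume - 1e-12:
                current["volume"] = bar_volume
                bars.append(current)
                current = None
    return bars
```

With bar_volume set to 1.0, a busy second can produce several bars while a quiet minute produces none, which is the behavior we want from a volume clock.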

In the end, time scale, and how to define time, is a hyperparameter that must be optimized on a per-market basis. A highly liquid market with low fees and low API latencies allows us to trade profitably on much shorter time scales than a less liquid market with higher fees.

Optimization function

To train a Machine Learning model on market data we need to pick an optimization metric. An obvious choice would be to train a regression model on raw prices. The problem with prices is that they are nonstationary. The vast majority of modern Machine Learning techniques require, or work best with, stationary data and assume that the data distribution does not change over time, both within the training set, and across training, validation and test sets. That's why in finance we typically model returns instead of prices, where the return r(t) at time t is defined as:

r(t) = (p(t) / p(t-1)) - 1

It is simply the percentage the price has moved. A return greater than 0 means the price moved up, and a return less than 0 means the price went down. You can define the timescale t however you like, as discussed in the previous section. You may calculate minutely returns, daily returns, or returns based on a volume clock.

An even better metric is the log-return. Log-returns measure the same thing, but are closer to normally distributed and have a few convenient statistical properties useful for training ML algorithms:

logr(t) = log(p(t)) - log(p(t-1))
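The two definitions are easy to compare in code. The sketch below uses plain Python and made-up function names. It also shows one of the convenient properties: log-returns are additive over time, so the sum over a window equals the log-return of the whole window.

```python
import math
from typing import List, Sequence


def simple_returns(prices: Sequence[float]) -> List[float]:
    """r(t) = p(t) / p(t-1) - 1 for consecutive prices."""
    return [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]


def log_returns(prices: Sequence[float]) -> List[float]:
    """logr(t) = log(p(t)) - log(p(t-1)) for consecutive prices."""
    return [math.log(prices[i]) - math.log(prices[i - 1]) for i in range(1, len(prices))]


prices = [100.0, 110.0, 99.0]
total_log_return = sum(log_returns(prices))
# Additivity: the summed log-returns equal log(last price / first price).
assert math.isclose(total_log_return, math.log(prices[-1] / prices[0]))
```

Simple returns do not have this property: a move of +10% followed by -10% does not bring the price back to where it started.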

Training a regression model on log-returns on some fixed time scale is one optimization function we could pick. It's a pretty standard one. But there are many other possibilities. For example, in Advances in Financial Machine Learning, the author discusses how to pick sensible thresholds and transform the data to convert the regression into a classification problem.
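To illustrate the idea of turning the regression into a classification problem, here is a deliberately simple fixed-threshold labeling. It is not the method from the book, which is more involved. The function name and the symmetric threshold are assumptions of this sketch. One plausible choice for the threshold is the estimated round-trip trading cost, so that a non-zero label means a move large enough to be worth trading.

```python
from typing import List, Sequence


def label_returns(log_rets: Sequence[float], threshold: float) -> List[int]:
    """Map each log-return to a class: 1 (up), -1 (down), or 0 (too small to matter)."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    labels = []
    for r in log_rets:
        if r > threshold:
            labels.append(1)
        elif r < -threshold:
            labels.append(-1)
        else:
            labels.append(0)
    return labels
```

A fixed threshold ignores that volatility changes over time, which is one of the reasons more careful labeling schemes exist.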

There is one aspect of the above formula that we conveniently glossed over. It's the price p(t). In practice, there are several ways we could define p(t). When we see a single price for an asset such as BTC, it typically refers to the midprice. However, the midprice is a synthetic quantity, not a price we can actually trade at. When buying, we are paying more than the midprice. When selling, we are getting less than the midprice. As discussed earlier, we also pay slippage costs as a function of order quantity. Thus, the price is really a function of time, side (buy or sell) and the order quantity, p(t, s, q). For a trade that buys at t-1 and sells at t, the above formula should really be something like this:

logr(t, quantity) = log(p(t, SELL, quantity)) - log(p(t-1, BUY, quantity))

There are other definitions of price we could use for modeling purposes, such as the microprice. How much does picking an accurate representation of price matter? Again, it depends on time scale and market liquidity. Modeling returns based on midprice may be good enough in very liquid markets with low slippage costs, but completely useless in illiquid ones.
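Here is a minimal sketch of a side- and quantity-aware return, compared against the midprice return. It assumes a long round trip, i.e. we buy at t-1 and sell at t, and it assumes the book at each time is available as lists of (price, quantity) levels sorted best first. The book format and function names are made up for this example.

```python
import math
from typing import Dict, List, Tuple

Level = Tuple[float, float]          # (price, quantity)
Book = Dict[str, List[Level]]        # {"bids": [...], "asks": [...]}, best level first


def exec_price(book: Book, side: str, qty: float) -> float:
    """Average price per unit for crossing the book with a BUY or SELL of size qty."""
    if qty <= 0:
        raise ValueError("qty must be positive")
    levels = book["asks"] if side == "BUY" else book["bids"]
    remaining, value = qty, 0.0
    for price, available in levels:
        take = min(remaining, available)
        value += take * price
        remaining -= take
        if remaining <= 1e-12:
            return value / qty
    raise ValueError("not enough depth to fill the requested quantity")


def round_trip_log_return(book_prev: Book, book_now: Book, qty: float) -> float:
    """Log-return of buying qty at t-1 and selling it at t, at executable prices."""
    return math.log(exec_price(book_now, "SELL", qty)) - math.log(exec_price(book_prev, "BUY", qty))


def mid_log_return(book_prev: Book, book_now: Book) -> float:
    """Log-return of the midprice, which ignores spread and slippage."""
    def mid(book: Book) -> float:
        return (book["bids"][0][0] + book["asks"][0][0]) / 2.0
    return math.log(mid(book_now)) - math.log(mid(book_prev))
```

If the book does not change at all, the midprice return is zero while the round-trip return is negative. That difference is exactly the spread and slippage a midprice-based model never sees.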

Training vs. Backtesting vs. Live Trading

The typical workflow for building a trading algorithm looks something like this:

Data collection
-> Data preprocessing and cleaning
-> Feature construction
-> Model training
-> Backtesting
-> Live trading

What we optimize for during model training, such as the accurate prediction of log-returns, is only a proxy metric for what we truly care about, which is Profit or Loss (PnL). That's why backtesting is crucial. In the context of automated trading, backtesting refers to running a full-fledged simulation of the market using a trained model and a historical data stream. Commercially available backtesting software can be quite expensive, especially if geared towards high-frequency trading. At the very least, such backtesting software can simulate latencies, non-standard order types, exchange commission structures and slippages. It may also automatically optimize hyperparameters and output charts and statistics to evaluate the model.

But no matter how good the backtesting software, it is still fundamentally different from a live environment.

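|                   | Training     | Backtesting | Live            |
|-------------------|--------------|-------------|-----------------|
| Metric            | Proxy Metric | PnL         | PnL             |
| Slippage          | -            | Predictable | Unpredictable   |
| Latencies         | -            | Fixed       | Highly variable |
| Market Impact     | No           | No          | Yes             |
| Data Distribution | Estimated    | Estimated   | Real            |
| API Issues        | No           | No          | Yes             |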

While backtesters may simulate latencies, the real world is significantly more unpredictable. Latencies in the real world may be stable during low-activity periods and spike during high-activity periods. They can also have a seasonality to them. The same is true for APIs. In simulation everything works perfectly, but in the real world we run into API issues, request throttling, and random order rejections during busy periods. And usually it's those busy periods when our actions matter the most. In the real world we also have market impact - we influence other market participants. We can't simulate this. Backtesting is also fundamentally limited by the data we have. The historical data we obtain from exchange APIs is often noisy and incomplete - there is no guarantee it truly reflects the current state of the exchange.
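To make the latency point concrete, here is a minimal sketch of how a backtester might pick the price an order sees once latency is taken into account. The quote stream format, the function name, and the latency values are assumptions for illustration. Real latencies would have to be measured against the exchange.

```python
import bisect
from typing import Sequence


def price_on_arrival(quote_times_ms: Sequence[int], quote_prices: Sequence[float],
                     decision_time_ms: int, latency_ms: int) -> float:
    """Return the last quoted price at or before the moment our order reaches the exchange.

    quote_times_ms must be sorted in ascending order.
    """
    arrival = decision_time_ms + latency_ms
    idx = bisect.bisect_right(quote_times_ms, arrival) - 1
    if idx < 0:
        raise ValueError("order arrives before the first recorded quote")
    return quote_prices[idx]


# Hypothetical quote stream: the price drifts up by 0.5 every 100 ms.
quote_times_ms = [0, 100, 200, 300, 400]
quote_prices = [100.0, 100.5, 101.0, 101.5, 102.0]

# Same decision at t=50 ms, once with a calm-period latency and once with a spike.
calm_price = price_on_arrival(quote_times_ms, quote_prices, 50, latency_ms=50)
busy_price = price_on_arrival(quote_times_ms, quote_prices, 50, latency_ms=300)
```

A backtest that always uses the calm-period latency would systematically report better fills than we would get live, because the spikes tend to happen exactly when the market is moving.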

Backtesting thus serves mostly as a filter, or an optimistic estimate. If our model does not perform well in backtesting, there is little chance it would do well in a live scenario. But a model performing well in a controlled backtest is not guaranteed to do well in the real world. Unless you are careful, backtesting is also prone to overfitting and can yield spurious results. Live trading will punish you for this.

This is one reason why many academic papers on trading are not very useful in practice. The final step in finance research is often backtesting on historical data. If the model does well, the researchers declare success, conveniently ignoring the fact that their model would probably never be profitable in a production environment. Academic researchers don't have access to live trading infrastructure to test their models. If they did, and their new algorithm performed well in the real world, they certainly would not publish a paper about it and give away their edge.

Training (supervised) Machine Learning models for trading is hard. In many other ML use cases, train-test performance directly correlates with live performance. For example, if we train a recommendation system with proper train/validation/test splits and the data distribution does not change significantly over time, we can be pretty sure that a model performing well on the test set also does so in production. In trading, our training, backtesting, and live environments are so different that we can't make any guarantees. We can only hope that a trained model, which uses some kind of proxy metric, does well in backtesting. And then we need to hope again that the model still does well in a live environment.

Other challenges

This post is already longer than I wanted it to be, but there are still many challenges we have not touched upon. Upcoming blog posts may go into more detail on some of these:

  • Non-IID noisy data: Market data is not independent and identically distributed, making it more challenging to train accurate ML models. There is also a large amount of noise. There may be patterns in market activity, but they are hidden within a lot of random activity.
  • Order book reconstruction: Order book reconstruction is a common bottleneck in trading and backtesting infrastructure. How can we do this efficiently? Exchange APIs are often unreliable and have jitter. How can we deal with this?
  • Minimizing latencies: What are some of the tricks to minimize end-to-end latency between receiving data and sending orders to the exchanges?
  • Feature construction: Which features are the most useful for our ML algorithms to model the data, and how can we efficiently construct them in real-time?
  • Order Management: With spotty and unreliable exchange APIs we must manage orders ourselves. What does a typical order management system look like?
  • Fault Tolerance: What happens when things go wrong in a live setting and how can we recover?
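To give a flavor of the order book reconstruction problem from the list above, here is a minimal sketch that applies incremental updates to a snapshot while guarding against duplicate and missing messages. The message format with a seq number, the field names, and the GapError type are all hypothetical. Every exchange has its own protocol, and this version is written for clarity, not speed.

```python
from typing import Dict


class GapError(Exception):
    """Raised when an update was missed and the book must be rebuilt from a fresh snapshot."""


class OrderBook:
    def __init__(self, snapshot: dict) -> None:
        # Hypothetical snapshot: {"seq": int, "bids": [(price, qty)], "asks": [(price, qty)]}
        self.seq: int = snapshot["seq"]
        self.bids: Dict[float, float] = dict(snapshot["bids"])
        self.asks: Dict[float, float] = dict(snapshot["asks"])

    def apply(self, delta: dict) -> bool:
        """Apply one update. Returns False if it was a duplicate or stale message."""
        # Hypothetical delta: {"seq": int, "side": "bid" | "ask", "price": float, "qty": float}
        if delta["seq"] <= self.seq:
            return False  # already applied, or older than our snapshot
        if delta["seq"] != self.seq + 1:
            raise GapError(f"expected seq {self.seq + 1}, got {delta['seq']}")
        side = self.bids if delta["side"] == "bid" else self.asks
        if delta["qty"] == 0:
            side.pop(delta["price"], None)  # quantity 0 removes the price level
        else:
            side[delta["price"]] = delta["qty"]
        self.seq = delta["seq"]
        return True

    def best_bid(self) -> float:
        return max(self.bids)

    def best_ask(self) -> float:
        return min(self.asks)
```

Finding the best price with max and min scans every level, which is fine for a sketch but too slow for deep books and high message rates. A sorted structure is the usual next step.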

Thoughts on Arbitrage and Market Making strategies

In this post we discussed one specific type of trading strategy, a liquidity-taking strategy that tries to profit from price movements. Two other common types of trading strategies are arbitrage and market making. Without going into too much detail, I want to share some thoughts on their viability in the current crypto markets.

Arbitrage, taking advantage of price differences between exchanges, is perhaps the most popular trading strategy in the crypto markets. Most exchanges trade the same assets, so arbitrage makes sense. It is intuitive, easy to understand, and easy to implement. A quick Google search will flood you with crypto arbitrage bots, SaaS services, tutorials, books, and gurus ready to explain how to make a quick buck. That alone should make you skeptical. There is nothing wrong with arbitrage in general, but you must ask yourself: What is your edge? The barrier to entry is low, and thousands of people, as well as some very sophisticated trading companies, are doing the same thing. From what I've seen, informative prices are often mistaken for arbitrage opportunities. When the BTC price on exchange A is lower than on exchange B, it's likely a reflection of risk, not an arbitrage opportunity. Exchange A may be less secure than exchange B, less regulated, or less reputable. The lower price then reflects the risk you are taking for storing money on that exchange. For international arbitrage, price differences often reflect the volatility of a country's fiat currency, or the regulations and limitations around cashing out and moving large amounts of fiat out of the country. These are not arbitrage opportunities. These are informative prices.

Market Making is the opposite of a liquidity-taking strategy. Instead of taking liquidity, betting on market movements, and paying the spread, we can provide liquidity, protect against market movements, and profit from the spread. In the financial markets, professional market making firms are some of the most profitable operations in existence. Market Making in the crypto markets is a viable strategy, but can be difficult to pull off if you don't have professional Market Making experience. Due to the accumulation of inventory it can be risky, and unreliable exchange APIs, high latencies, and jitter are less forgivable in a market making scenario than they are in a liquidity-taking strategy. Market making also requires significantly more complex infrastructure for inventory and risk management. The competition here is also stiff. Many professional market making firms from the financial markets have moved into crypto. On the bright side, many exchanges are actively trying to recruit market makers with favorable fee structures and commissions.

Resources

The trading industry is one of the most secretive industries I've ever been involved in. Most of what you find online is noise, or gurus trying to sell something. Most successful traders I've talked to have worked for a professional trading company, and that's where they learned the ropes. They have no incentive to share any of their knowledge online, and sharing has never been part of the culture in finance. As a result, the whole field can seem complex and overwhelming to newcomers. The best way to learn is probably by doing. Unlike in the financial markets, where trading infrastructure and high-frequency data can cost millions of dollars, trading in the crypto markets is available to anyone and can be used as a learning environment. Nevertheless, here are a few resources I found helpful.

  • Advances in Financial Machine Learning - A no-nonsense book that covers the application of Machine Learning techniques in the financial markets. It is more on the academic side and some of it is not very practical. While I don't agree with everything in this book, it's an excellent introduction to various challenges and pitfalls you encounter when building trading systems.
  • An Introduction to High-Frequency Finance - Covers a lot of common terminology and methods used in automated trading. Can be quite academic and hard to digest at times, but worth a read.
  • Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets - Not about the technical aspects of trading, but written by an options trader, it teaches you how to think about random chance, both in the markets and in life.
  • arXiv q-fin - Reading recent research papers is a great way to come up with new ideas or learn how to think about a problem. Read them because they are interesting and educational, but don't pay too much attention to the results. As mentioned earlier, academics have little incentive to publish something that works in practice.

Closing Thoughts

I hope that I was able to give some insight into problems that may come up when building automated trading systems. You may be disappointed that this post was more focused on problems than solutions. But that's for a reason. There are no universal solutions to complex problems that work in all cases. What's important is to fully understand the facets of a problem, and then make a reasonable decision specific to your context.

This is my first post, so I am not sure where to go from here. I would love to hear your feedback in the comments.


