April 26, 2021
In our previous blog post we introduced a benchmarking tool for Lightning node implementations. The primary goal of the benchmark was to get a feel for how suitable the current implementations are for high-frequency payment processing. Assuming the test setup is representative of real-life (future) use cases such as streaming podcasts by the minute, we concluded that the required performance level isn’t met yet.
As a follow-up we wanted to zoom in on node performance for lnd specifically, as lnd is the node software that we run ourselves and, according to some metrics, captures the larger part of the market. The purpose of this post is to shed light on the still unrealized potential of lnd as a node implementation and the Lightning protocol in general.
Before moving any further, we first need to make a note about the benchmark itself. The previously reported results were obtained with 100 concurrent processes that launch payments. We chose this number because increasing it led to failed payments (timeouts) and/or a lower throughput (contention between the payments). It was a practical choice that gave the implementations under test the opportunity to show their best performance.
Ideally this kind of tuning shouldn’t be necessary: nodes would throttle traffic on their own to always realize the best possible performance. That would allow the benchmark to run many more processes and prevent a node from underperforming simply because it isn’t load tested to the maximum.
The question is: how many processes are needed for maximum stress testing? We think that 966 per channel is a good number. In an optimal Lightning implementation, payments can be processed in batches of 483 payments, the protocol limit on concurrent HTLCs in one direction. With 966 processes per channel, there is always a new batch of payments ready to go when the previous batch completes. In our test configuration with 10 channels, that means 9660 processes. To avoid running into TCP connection limits, processes can share TCP connections. Because the gRPC protocol multiplexes multiple (streaming) calls over a single connection, this doesn’t affect the degree of parallelism.
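To make the arithmetic explicit (483 is the BOLT #2 cap on concurrent HTLCs in one direction):

```python
MAX_HTLCS_IN_FLIGHT = 483  # BOLT #2 cap on concurrent HTLCs per direction
CHANNELS = 10              # channels in our test configuration

# Two full batches per channel: one batch in flight on the commitment
# transaction, one batch queued so it can start the moment slots free up.
processes_per_channel = 2 * MAX_HTLCS_IN_FLIGHT
total_processes = processes_per_channel * CHANNELS

print(processes_per_channel, total_processes)  # 966 9660
```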
For the performance experiments that we wanted to conduct, we expected higher throughputs and thus a risk of under-loading with the original 100 processes. Therefore we increased the process count to 9660 and applied a patch to lnd to prevent the aforementioned payment errors and contention.
This patch adds a so-called overflow queue. It is a queue where payments are stored temporarily until a slot on the commitment transaction becomes available. This queueing is basically free in terms of performance. Without the patch, overflowing payments would keep looking for new routes and build new onion packets, which has a significant effect on performance.
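A minimal sketch of the idea, with hypothetical names rather than lnd’s actual types: payments that don’t fit on the commitment transaction wait in a plain FIFO queue instead of redoing pathfinding and onion construction.

```python
from collections import deque

MAX_HTLCS = 483  # commitment transaction slot limit per direction

class OverflowQueue:
    """Holds payments until a commitment-transaction slot frees up.

    Queued payments keep their route and onion packet, so releasing one
    later is essentially free; without the queue, an overflowing payment
    would retry route finding and onion building from scratch.
    """

    def __init__(self):
        self.in_flight = 0
        self.waiting = deque()

    def add(self, payment):
        if self.in_flight < MAX_HTLCS:
            self.in_flight += 1
            return payment       # forwarded immediately
        self.waiting.append(payment)
        return None              # parked until a slot opens

    def settle_one(self):
        """Called when an in-flight HTLC resolves, freeing a slot."""
        self.in_flight -= 1
        if self.waiting:
            self.in_flight += 1
            return self.waiting.popleft()  # released payment
        return None
```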
The overflow queue was part of lnd prior to v0.10.0 and was removed in this PR. All we needed to do was revert that removal.
As mentioned above, Lightning allows batching of payments. The degree of batching is controlled by parameters that aren’t configurable in the latest lnd release v0.12.1. The next major release will include these parameters as configuration options.
As we are looking for maximum performance, we found it justified to cherry-pick those commits onto our test branch and increase the values to promote batching. We chose a 100 ms commit time with a maximum of 300 channel updates per batch. These parameters are probably not optimal, but will at least give a higher level of batching than what the defaults accomplish.
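In lnd.conf terms, our settings would look roughly like this (option names as they appear in lnd releases that include these commits; verify against your version):

```ini
[Application Options]
; Wait up to 100 ms to accumulate channel updates before signing
channel-commit-interval=100ms
; Sign early once 300 updates are pending, even if the interval hasn't elapsed
channel-commit-batch-size=300
```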
With the overflow queue re-added and updated batching parameters, we re-ran the lnd-bbolt-keysend benchmark. We also used longer run-in times and measurement intervals to be less sensitive to start-up effects and speed fluctuations.
For payments 20,000 through 30,000, this results in a score of 51 transactions per second. We take this number as our baseline for the evaluation of various modifications.
The execution profiles that we showed previously pointed to cryptographic operations as one of the main factors that influence transaction throughput. Crypto operations are largely dictated by the protocol, and there isn’t much that can be done to get around that without compromising the current security guarantees. What remains is optimizing the cryptographic functions themselves, employing parallelization where possible, and avoiding the pre-calculation of anything that may turn out to be unneeded.
The other important factor that emerged from the profile is database access. One element of database access is the so-called sync call. What sync does is wait for previous write operations to fully complete. Writing to disk is usually fast because the data to be written is cached in memory, but that doesn’t guarantee the data has actually reached the disk. If a power outage happens, the data can be lost. For Lightning in particular, this can have dramatic consequences. If a node forgets that it revoked a previous commitment transaction, for example, it risks a penalty being applied with the channel capacity as its maximum amount. In today’s post-wumbo world, this could be a multi-BTC loss. That is why it is imperative for node software to use the sync call.
Unfortunately, sync is expensive. This post shows the sync performance of various storage devices. Rotational disks score badly, which isn’t surprising given that a mechanical arm needs to move into position to flip magnetic fields on a platter in order to durably store data. SSDs do better, but for the highest rates one needs devices with battery-backed caches. With the Python script from the post above, we measured the sync performance of our test machine’s storage device (a 100 GB zonal pd-ssd on Google Cloud) and got 650 syncs/second. And that is just for single-byte writes.
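The measurement is easy to reproduce; a sketch in the spirit of that script (ours, not the original):

```python
import os
import tempfile
import time

def measure_syncs_per_second(iterations=200):
    """Write a single byte and fsync it, repeatedly; return the sync rate."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.monotonic()
        for _ in range(iterations):
            os.write(fd, b"x")
            os.fsync(fd)  # block until the byte is durably on disk
        elapsed = time.monotonic() - start
        return iterations / elapsed
    finally:
        os.close(fd)
        os.unlink(path)

print(f"{measure_syncs_per_second():.0f} syncs/second")
```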
It is safe to say that sync calls are a scarce resource, and ideally a node implementation would do as few as possible per payment. How does lnd fare in terms of sync requirements in our benchmark? One way to find out is to snoop on the sync calls at the operating system level (1). We measured 700 syncs/second for the sender and receiver nodes combined, at a rate of 51 transactions/second. That means that on average a single payment requires ~14 sync calls. Interestingly, the sync rate is higher than what we got from the Python script. This could possibly be explained by ‘empty’ syncs, either because nothing was actually written or because another thread already happened to have synced the data.
Because not all sync calls are equal and some could even be empty, the sync rate alone isn’t directly indicative of performance. Intuitively it does seem high, in particular because the Lightning protocol allows payments to be batched and because bbolt can group multiple write transactions together. More on that later.
A more direct way to measure the impact of sync is simply not to sync. All data is still written, but the checkpoints where the node waits for the writing to be fully completed are skipped. A small patch to lnd accomplishes this. Needless to say, this should never be done in production, because it is unsafe and can make you lose funds.
The effect of no sync is spectacular. On the benchmark, the rate goes up from 51 to 361 transactions/second, a more than 7-fold increase. Even though the absolute sync rate may not tell the full story, it is clear that syncing plays a major role in node performance. It is probably safe to say that for a node implementation to reach optimal performance, it must be very restrictive about the use of sync calls.
The alternative way to approach this performance level today is the aforementioned low-latency, battery-backed storage devices. But they come at a price, and it may not be easy to find them as a standard cloud product. This makes it a less attractive or even infeasible option for smaller players in the space, potentially leading to more centralization.
Before continuing the examination of lnd, it is worth taking a moment to think about what the absolute minimum number of syncs per payment is in the optimal case.
To do so, we need to pull in a bit of low-level protocol knowledge. The basic idea of Lightning is that there is a bitcoin transaction, the “commitment” transaction, that is updated over and over again without being broadcasted to the chain. This is where Lightning’s speed and fee savings come from. To discourage the publication of an outdated transaction, there is a penalty clause on the transaction that can be activated via a secret. This secret must be revealed after a new version of the transaction has been received.
The cycle that updates a transaction consists of two steps:
create new transaction (“Commitment Signed”, “CS”)
revoke old transaction by revealing its secret (“Revoke and Acknowledge”, “RAA”)
There is however not a single transaction. Because of the penalty mechanism, each party needs to have its own transaction. So to update the state of a channel, it actually takes two of the above cycles, one for each party’s version of the commitment transaction.
To complete a payment, the channel state needs to be updated twice. In the first update, an HTLC is offered to the recipient. Then in the second update the recipient settles the HTLC with the sender.
If we’re talking about a series of payments, the final update cycle, in which the settle becomes fully locked in, can be used to already add the next HTLC. Putting it all together results in a repeated sequence of messages being exchanged.
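In BOLT #2 message names, one repetition of that sequence could look roughly like this (a sketch with A paying B; the exact interleaving may differ):

```
A -> B: update_add_htlc        (offer the HTLC for the next payment)
A -> B: commitment_signed      (CS: sign B's new commitment tx)
B -> A: revoke_and_ack         (RAA: revoke B's old commitment tx)
B -> A: commitment_signed      (CS: sign A's new commitment tx)
A -> B: revoke_and_ack         (RAA: revoke A's old commitment tx)
B -> A: update_fulfill_htlc    (settle the HTLC with the preimage)
B -> A: commitment_signed      (second state update: lock in the settle)
A -> B: revoke_and_ack
A -> B: commitment_signed      (this CS can already include the next add)
B -> A: revoke_and_ack
```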
When does a node need to write its state to disk? It isn’t necessary to write to disk when messages are received: the node could just as well not have received them (network connection lost), and the protocol defines how to re-establish a connection when (some) messages didn’t make it across. Writing to disk does need to happen before sending messages. Otherwise a node may forget what it signed or revoked and suffer the potentially costly consequences. In total this makes for 4 sync calls, for the two nodes together, to make a single payment.
A typical node writes to the database for more than just updating the channel state. In lnd for example, separate records are kept for payments and invoices. But by combining these updates with the channel state updates, there is no need for extra disk syncs.
Another feature of Lightning is update batching. A channel state update can add or settle up to 483 HTLCs at once. This means that in the ideal case, we can complete 483 payments with just 4 syncs. At a theoretical 0.008 syncs/payment, the gap with the actual sync requirement of 14 syncs per payment is massive.
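To make the gap explicit:

```python
SYNCS_PER_BATCH = 4   # both nodes together: each writes before its CS and its RAA
MAX_BATCH = 483       # HTLCs that a single channel state update can add or settle
measured = 14         # syncs per payment observed in our benchmark

theoretical = SYNCS_PER_BATCH / MAX_BATCH

print(f"theoretical minimum: {theoretical:.4f} syncs/payment")  # ~0.0083
print(f"gap vs. measured: roughly {measured / theoretical:.0f}x")
```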
Further reduction of this number is possible by operating multiple channels. If the state updates of all channels are lined up (possibly via a simple clock tick), all disk writes can be covered under a single sync call.
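A sketch (ours, not lnd code) of the clock-tick idea: gather the pending writes of all channels and cover them with a single sync.

```python
import os
import threading

class SyncBatcher:
    """Collects pending writes from many channels and flushes them
    under a single fsync on a fixed clock tick."""

    def __init__(self, fd, tick_ms=100):
        self.fd = fd
        self.tick = tick_ms / 1000.0  # a timer would call flush_once() at this interval
        self.pending = []
        self.lock = threading.Lock()

    def write(self, data: bytes):
        """Queue a channel state update; durability comes at the next tick."""
        with self.lock:
            self.pending.append(data)

    def flush_once(self):
        """One tick: write everything queued, then fsync exactly once."""
        with self.lock:
            batch, self.pending = self.pending, []
        if not batch:
            return 0
        os.write(self.fd, b"".join(batch))
        os.fsync(self.fd)  # one sync covers every channel's update
        return len(batch)
```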
The benchmark that we developed aims to fully saturate the available channels, so at least in theory the cost of syncing should be negligible. In reality however it seems to be the biggest bottleneck that currently exists. There must be enormous potential to increase transaction rates.
Of course it is easier said than done to refactor an application for minimum syncs. Database transactions are often executed in different subsystems, and in general it isn’t straightforward to unify those in a single transaction.
We’ve tried our hand at optimizing lnd to see how much effort it would take to realize a substantial improvement. We wanted to identify low-hanging fruit and discover the changes that would require deeper cuts. The general approach has been to look at CPU profiles of both sender and receiver, try to understand why some code shows up heavily, and figure out a way to reduce this.
During the process, we didn’t limit ourselves to only the reduction of sync calls. We took a shot at anything that looked like a candidate for optimization. The complete changeset can be seen here. This is a summary of the types of changes that we looked at:
Don’t read data from the database if that data is already present in memory. In its most extreme form, only read data from disk on startup. The size of the data that lnd actively needs is limited and should easily fit in memory. This allows the database layer to be implemented as a write-through cache.
At high transaction rates it is important to keep the reporting of progress information on payments over RPC lightweight: no database lookups.
Update the database in a single transaction if possible. Creating an invoice and then immediately updating that same invoice causes unnecessary sync calls. The same applies to initiating a payment and then updating it with information about the first payment attempt.
Try to pull all database changes into the channel state update transaction. Update the invoice database together with the settlements of the HTLCs. The same should be possible for updates to the payment records.
A stream of payments will probably consist of many successes once an initial route has been found. It isn’t necessary to keep updating node reputations on disk.
Use database batch calls to group concurrent transactions and flush them to disk in one go.
Minimize non-database operations within the context of a database transaction.
Parallelize (crypto) processing of payments in the same batch.
Cache private keys to avoid pricey key derivation. There is a tradeoff with security, but whether the associated threat model applies to lnd is debatable.
Lazy execution of crypto operations. For example the costly initialization of the code that encrypts a failure message only needs to be executed on failure.
Allow HTLC additions and resolutions to be batched together in a single channel state update.
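To illustrate the first item in the list, a much-simplified sketch (a dict stands in for the durable store; this does not mirror lnd’s actual database layer): after a one-time load at startup, reads never touch disk and writes go to both cache and database.

```python
class WriteThroughStore:
    """Database wrapper that serves every read from memory.

    All records are loaded once at startup; afterwards, reads never hit
    disk, and writes go to both the cache and the underlying database.
    """

    def __init__(self, db: dict):
        self.db = db
        self.cache = dict(db)  # one-time full load at startup

    def get(self, key):
        return self.cache.get(key)  # no disk read, ever

    def put(self, key, value):
        self.cache[key] = value
        self.db[key] = value  # write-through to durable storage
```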
We didn’t implement all of these changes completely. And the part that we did implement is definitely not production ready. So do not run this in production. But we do believe that the proposed changes are fair in the sense that they could be made production ready without unreasonable effort and also without loss of functionality.
Running the benchmark on the optimized lnd code increases the transaction throughput from our baseline of 51 tps to 237 tps. At this rate, lnd is executing around 120 syncs/s. This works out to 0.5 sync per payment, down from 14.
With these optimizations in place, we can take a look at the receiver-side CPU profile again. Database operations, including sync calls, are shown in purple. For those who have studied the flame graph from the previous blog post in detail, it will be clear that the proportion of time spent on accessing the database has become a lot smaller. Had we skipped the crypto optimizations, this effect would have been even more pronounced.
Because we didn’t just optimize for syncs and also implemented changes like parallelized crypto, it is also interesting to measure transaction throughput with syncing disabled. The underlying thought is that this may give an indication of a lower bound of lnd’s payment processing capability once syncing is fully optimized.
Without syncing on an lnd version that is still not fully optimized, we measure 925 tps on the benchmark.
Even though the transaction rate that can be achieved with the latest lnd release may not be overly impressive, the future does look bright.
Because batching was designed into the Lightning protocol, there is potential to complete payments with an absolute minimum number of sync calls. This would move the performance spotlight to CPU usage for crypto operations. And fortunately CPUs are easy to scale, especially on cloud deployments.
With a modest amount of effort, we were able to realize realistic and significant speed ups in lnd. Moving these speed ups to production however will require developer attention and prioritization. Lightning development resources are extremely scarce and there are tons of other areas that need work as well. It is totally understandable that hard decisions need to be made.
For us though as a company building on Lightning, it is comfortable to know that there doesn’t seem to be a fundamental roadblock to scaling transaction throughput to the next order of magnitude. More concretely, we have indications that 1000+ transactions per second with lnd on a standard cloud machine is reachable given sufficient dedication.