Update 31: Stress Test Results

Hi all!

First off: THANK YOU SO MUCH TO EVERYONE!

The participation and vibe during the tests were amazing, and we were blown away by the response.

It was a unique experience for all of us.

For this (April’s) update, we’re focusing on the results of the Public Stress Test (April 28th and 29th).

May’s update will contain all of the pretty images and nuanced update details you’re used to. Apologies in advance if you’re just here for the pictures, but rejoice in the fact that there’ll be twice as many in the next update.

The truth is… there are a lot of images and stuff to wrangle from last month (and even more coming this month), and rather than try and get it all in here, I’ll leave it for future-me to deal with in a couple of weeks.

 

Pre-Stressing

Prior to the Stress Test, we asked folks on Discord to give us an indication of whether they’d be interested in participating.

As the number grew beyond 500 or so, we started to take notice, but assumed that not everyone would actually show.

We continued to get the word out via our mailing list and other channels, but were unsure of the impact prior to opening up registration and pre-install.

We watched the registration database quickly climb past 2,000 and realized that we were definitely going to achieve a fair bit of Stress during our test.

Given that we’d simulated several hundred clients in a zone before, but only had around 100 true clients in a zone prior to the test, we discussed a variety of contingencies to deal with the potential smash of traffic into our Login Server and Game Servers. This included considering different datacenters & prepping additional servers.

We also gained some good data from the initial push of registrations early in pre-registration, and used it to update that process & fix a few bugs.

 

The Stress Test

When we opened the doors on Friday night (Saturday morning for some of us), the influx was pretty amazing to watch. The first server’s Night Harbor zone quickly surpassed 500 people and, to our delight, nothing was outright crashing. Technical details are listed below, but essentially, as the number hit roughly 1,000 players in the zone, things degraded significantly; this helped us understand the adjustments we’d enact in the next stress test session on Saturday (still Saturday for some of us… it was a long day).

Our contingencies for the volume of players in the first stress test session included increasing the capacity of our servers (increased memory, CPUs, and network configs) and opening up Servers 2 & 3. During this time we ran 3 different server configs to get a better understanding of what adjustments to make in the future.

We also tweaked our logging, and increased our DB size and configs to address some of our bottlenecks that were creating lag with certain activities. Again, you can read more details below.

Folks distributed themselves pretty well, and while there were still issues with lag, overall we were really happy with the test and technical performance.

Saturday we tested a new build based on the learnings from the previous test. We continued to run different configs on the servers in an effort to learn more.

One of the changes we made introduced some positional update lag, so we fixed that early in the second test. This stabilized things a bit, and we made the call later in the session to extend the test for 4 more hours.

Overall, the Open Stress Test went much better than expected and it’s put us in a good position to drive things forward.

We’ll continue to focus on the development of the Proof of Concept, and with the Stress Test behind us, we’re in a position to do more gameplay-focused testing in the future.

For now, here are some fun stats, followed by a technical breakdown of the two tests.


 

Technical Analysis

Findings from the Stress Test:

  • Segmenting server clusters too much was problematic due to extended query times.

  • Logging was not optimized enough for a high user count.

  • Server requirements were a bit higher than expected.

  • Database requirements were MUCH higher than expected (mainly in the number of active connections required).

  • TradeComplete had an issue. (This was debugged: it’s a problem with MySQL not allowing a specific database query that MariaDB allows, so we will move to MariaDB for session 2.)

  • TradeComplete is bugged when coin is part of the trade.

Mitigations within the first Stress Test session:

  • Consolidated servers to:

  • Server 1 -> Digital Ocean VMs with dedicated CPU cores (the only two-server cluster)

    • world-1 Basic Regular Intel Shared vCores, 1 core / 1GB RAM ($6/mo)

    • zone-3 CPU-Optimized - Regular Intel Dedicated vCores, 8 core / 16GB RAM ($168/mo)

  • Server 2 -> Single Digital Ocean VM

    • zone-4 CPU-Optimized - Regular Intel Dedicated vCores, 8 core / 16GB RAM ($168/mo)

  • Server 3 -> The old FnF Server (single physical box)

    • Physical 1x Intel Quad-Core Xeon E3-1230v2 3.30GHz / 16GB RAM (EUR 34.99/mo)

  • Upgraded Database:

    • From: Managed Database - 1vCPU / 1GB RAM / 75 Connections ($15/mo)

    • To: Managed Database - 2vCPU / 4GB RAM / 225 Connections ($60/mo)

  • Disabled Detailed Logging

Investigations within the Stress Test session:

  • Created a small addon that looks at the flow of packets: which packets are being sent and received, and how often. (A simplified sketch of this kind of counter follows this list.)

  • Observed server internal FPS (holding steady at >30 fps, so no issues here).

  • Identified systems that were being underserved by throttling (Navigation, Aggro).

  • Identified systems that did not scale well (positional packet creation and sending; it performed much better than in the FnF stress test, but needed serious work).

  • Observed packets being sent without really needing to be sent (either too often, or when no change had occurred).

  • Observed clients spamming the server with Request Spawn Entity packets, which occurs when a client sees updates for an entity it does not have. This was happening because the client wasn’t happy with the time it took to receive packets, so it requested more, creating more of a traffic jam.
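
For context, the packet-flow addon boils down to a counter keyed by packet type and direction, reported as a rate. Here is a minimal sketch of that idea in Python; the names and structure are illustrative, not our actual addon code:

```python
import time
from collections import defaultdict

class PacketFlowObserver:
    """Counts packets per (direction, packet type) and reports rates."""

    def __init__(self):
        self.counts = defaultdict(int)       # (direction, packet_type) -> count
        self.started = time.monotonic()

    def record(self, direction, packet_type):
        """direction is 'sent' or 'received'; call this from the send/receive hooks."""
        self.counts[(direction, packet_type)] += 1

    def report(self):
        """Print packet counts and rates, busiest packet types first."""
        elapsed = max(time.monotonic() - self.started, 1e-6)
        for (direction, packet_type), count in sorted(
            self.counts.items(), key=lambda kv: kv[1], reverse=True
        ):
            print(f"{direction:9s} {packet_type:24s} {count:8d}  {count / elapsed:7.1f}/s")

# Example usage (packet names are just illustrations):
# observer = PacketFlowObserver()
# observer.record("received", "RequestSpawnEntity")
# observer.record("sent", "PositionalUpdate")
# observer.report()
```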

Mitigations for session 2:

  • Slowed down player -> server packet updates.

  • Time-sliced the generation of sector packets and time-sliced server -> client updates, while speeding them up slightly. (A simplified time-slicing sketch follows this list.)

  • Increased the time allocation for Navigation and made it path-find more aggressively.

  • Did not touch Aggro YET. The theory was that optimizations to the other systems would allow Aggro enough time within a frame to behave as expected.

  • Throttled the Request Spawn Entity packets. They are now only allowed ONCE every 10 seconds (per entity), as they’re a fallback for when the client really deems an entity either truly missing or its data completely messed up. (A simplified throttling sketch also follows this list.)
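
The time-slicing change can be pictured as a round-robin queue of sectors, where each server frame only spends a fixed time budget generating packets before handing control back. A minimal sketch of that pattern (hypothetical names and budget, not our actual server code):

```python
import time
from collections import deque

class SectorPacketScheduler:
    """Round-robins through sectors, generating update packets only until a
    per-frame time budget is spent, then resumes the next frame."""

    def __init__(self, sectors, frame_budget_ms=2.0):
        self.queue = deque(sectors)               # sectors awaiting packet generation
        self.frame_budget = frame_budget_ms / 1000.0

    def run_slice(self, build_packets_for):
        """Call once per server frame; build_packets_for(sector) does the actual work."""
        deadline = time.monotonic() + self.frame_budget
        while self.queue and time.monotonic() < deadline:
            sector = self.queue.popleft()
            build_packets_for(sector)
            self.queue.append(sector)             # rotate so every sector gets a turn

# Example usage:
# scheduler = SectorPacketScheduler(zone_sectors, frame_budget_ms=2.0)
# each frame: scheduler.run_slice(generate_sector_packets)   # hypothetical callback
```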
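
The Request Spawn Entity throttle is essentially a per-entity cooldown on the client: once a request for a given entity goes out, further requests for that same entity are dropped for 10 seconds. A minimal sketch (hypothetical names, not the actual client code):

```python
import time

REQUEST_COOLDOWN_SECONDS = 10.0   # one request per entity per 10 seconds, as above

class SpawnRequestThrottle:
    """Client-side fallback: only re-request a missing/garbled entity once per window."""

    def __init__(self):
        self._last_request = {}   # entity_id -> time the last request went out

    def should_request(self, entity_id, now=None):
        now = time.monotonic() if now is None else now
        last = self._last_request.get(entity_id)
        if last is not None and now - last < REQUEST_COOLDOWN_SECONDS:
            return False          # still cooling down; drop the request
        self._last_request[entity_id] = now
        return True

# Example usage:
# if throttle.should_request(entity_id):
#     send_request_spawn_entity(entity_id)   # hypothetical send call
```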

Modifications to server setup for Session 2:

  • Server 1 (DO Region NYC1):

    • Migrate world-1 files and settings to zone-3 and delete world-1.

    • zone-3 (rename to world-1) CPU-Optimized - Regular Intel Dedicated vCores, 8 core / 16GB RAM ($168/mo)

  • Server 2 (DO Region NYC1):

    • zone-4 (rename to world-2) CPU-Optimized - Regular Intel Dedicated vCores, 8 core / 16GB RAM ($168/mo)

  • Server 3 (Leaseweb Region Amsterdam-1):

    • Migrate to a new server box

    • Physical 2x AMD 24-Core EPYC 7401 / 64GB RAM / 2x 480GB Mirror SSD

  • Databases in NYC region moved to a DO VM in NYC1 and moved from MySQL to MariaDB.

  • Change server names to identify region (US 1, US 2, EUW 1, and EUW 2)

Fixes and Observations during the second Stress Test:

  • Found that positional updates were too slow from server -> client (throttled too much), so we allowed more frame time for these updates in the patches.

  • Found that the positional updates were still not fast enough. We did the needed profiling during the test; this has been fixed post-test and is being merged to FnF for further testing.

  • Found that positional updates sometimes resend a stale packet. Tracking whether a packet is stale is very slow; however, optimizations made after the stress test make it very unlikely to hit this scenario. This was the cause of the mob stuttering observed in Night Harbor.

  • Positional update latency is directly affected by how many sectors of a zone are “active”, i.e. visible to a player. Simulated testing that activated Night Harbor (the most populous zone) with all of its sectors marked as “active” (similar to Valley of Death) showed a worst-case scenario, which has since been optimized and now runs with ample headroom. (A simplified per-sector update sketch follows this list.)

  • No other lag was observed. Actions, combat, and the new Navigation all seemed adequate for this phase of development.
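
To make the last two points concrete: the cost of a positional update pass scales with the number of active (player-visible) sectors, and an entity whose position hasn’t changed since the last send can simply be skipped rather than resending a stale packet. A simplified sketch of that loop, using hypothetical structures rather than our actual server code:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    entity_id: int
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    heading: float = 0.0
    last_sent: tuple | None = None   # last (x, y, z, heading) actually sent to clients

@dataclass
class Sector:
    active: bool = False             # True only if at least one player can see this sector
    entities: list = field(default_factory=list)

def send_positional_updates(sectors, send_packet):
    """Send positional updates for active sectors only, skipping unchanged entities."""
    sent = 0
    for sector in sectors:
        if not sector.active:
            continue                              # inactive sector: no work at all
        for e in sector.entities:
            snapshot = (e.x, e.y, e.z, e.heading)
            if snapshot == e.last_sent:
                continue                          # unchanged: don't resend stale data
            send_packet(e.entity_id, snapshot)    # hypothetical send callback
            e.last_sent = snapshot
            sent += 1
    return sent
```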

 

Coverage within the Community

During the tests the team tried to hop around to some of the streams that were taking place.

We want to thank everyone that took the time to stream or make content related to the test. It was wild to see the Monsters & Memories Twitch category light up and we’ve tried to watch the various videos as we’ve found them.

Thanks to Brown Butcher for this image.

Here are a few links to some of the videos of the test. Apologies if we’ve missed any:

Disclaimer: opinions, associations, and language shared in the videos linked above belong solely to the content creators, and were not solicited or endorsed by Niche Worlds Cult, LLC - despite being pretty cool for the most part.

 

In closing…

Again THANK YOU for participating in the test!

Hopefully, the data provided above helps reiterate our commitment to transparency and open development.

Of the various critiques that can be leveled at any project during development (design decisions, art style, probability of success, etc.), our goal is to ensure that us BS’ing the community is never one of them.

On that note, you can follow our progress via our new Twitch category (five of our team members stream) and the (320+) VODs we post on YouTube.

Join the Discord and Mailing List if you haven’t already.

See you again in a few weeks!
