Infrastructure overhaul of 2017



Vlada
12-05-2017, 09:22 AM
Infrastructure overhaul of 2017

December 2017
lordkator



Infrastructure overhaul of 2017

As many in the community are aware, on August 9, 2017 Basilisk experienced an extended unplanned outage (https://www.swgemu.com/forums/showthread.php?t=198679) due to disk issues on the server. As that process unfolded, TheAnswer promised (https://www.swgemu.com/forums/showthread.php?t=198679&page=43&p=1432292#post1432292) the community that we would share what happened and what we planned to do about it.

TL;DR (Summary)

We lost disks on the original Basilisk server, which forced us to do manual work to restore the game server databases; that work resulted in the restoration of the server on August 16, 2017.

As of December 4, 2017 all services have been moved to a new environment that is much more robust, has considerably more resources and is designed to let us handle hardware outages much more easily in the future. Our new environment includes redundant servers, faster internet access, more CPU power, more RAM and improved storage redundancy and speed.

The completion of this migration provides a stable platform for the community for many years into the future.

What happened?

When the original server for Basilisk was deployed in 2006 it was a state-of-the-art machine, and it received a number of hardware upgrades over the years.

The system was configured with multiple Solid State Disks (SSDs) set up to keep two on-line live copies of the data (RAID 1 mirror (https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_1)). The goal of such a setup is that when one disk fails we can replace it before data is lost and rebuild the copies. This setup also provides higher read rates because the system can ask both disks for different data at the same time.

The week before this incident one of the mirrored disks failed; our hosting provider failed to notify us, and meanwhile we did not get the emails from the server alerting us to disk issues. In a sad twist of fate, the second disk that was mirroring the failed drive also started to fail a week later. Two disks failing within such a short timeframe is fairly rare.
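
For those curious what catching the first failure earlier would involve: on Linux software RAID (assuming an md-managed mirror, which may or may not match the old server's exact setup), a degraded array shows up as a missing member in /proc/mdstat. A minimal sketch of that kind of check, with the actual alert delivery left out:

#!/usr/bin/env python3
# Minimal sketch: warn when a Linux md RAID array is running degraded.
# Assumes Linux software RAID; a hardware controller needs its own tools.
import re
import sys

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return names of md arrays whose member status shows a missing disk."""
    failed = []
    current = None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # A status like "[UU]" means both mirror members are up; "[U_]" means degraded.
            status = re.search(r"\[([U_]+)\]\s*$", line.strip())
            if status and "_" in status.group(1) and current:
                failed.append(current)
    return failed

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("ALERT: degraded RAID arrays:", ", ".join(bad))
        sys.exit(1)
    print("All RAID arrays healthy.")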

Because of the size of Basilisk's database (over 600 Gigs) we were doing backups on an ad-hoc basis to the Nova server's disks. This meant any restore would lose significant amounts of player progress. With this in mind, TheAnswer worked with low-level filesystem debugging tools to extract the database files from the failing drive. This was a painful, slow process that required many iterations to get the data back to a usable state. Much of it was manual, and each step could take many hours to run before the results were known and a decision could be made on the next step. After many sleepless nights TheAnswer was able to get Basilisk back online on August 16.


How do we avoid this in the future?

In response to this event we took inventory of all our services and analyzed our current setup. As you can imagine, after ten-plus years the project had accumulated many services and servers to run our community. This setup was very difficult to maintain due to the many dependencies between the various services and the underlying software, operating systems and hardware.

After debating various paths forward, the team decided it was time to overhaul our infrastructure. We decided to rebuild from scratch on new bare metal servers from packet.net (https://www.packet.net/features/) and use an open-source technology called kubernetes (https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/) to manage the services as individual, movable containers. We would deploy our servers on top of ZFS (https://github.com/zfsonlinux/zfs/wiki/FAQ#what-is-zfs-on-linux) storage pools, which would give us modern data safety and management tools.

Deploying on packet.net gives us an incredible amount of flexibility: rather than opening tickets asking for new machines or emailing back and forth, we can simply launch new resources using the packet.net API. In addition, we have reserved three servers that run our infrastructure and provide online, ready-to-run redundancy for our services.
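
As a rough illustration of what "launch new resources using the API" can look like, here is a sketch in Python using the requests library. The endpoint path, field names and plan/facility values below are assumptions for the sake of example, not a transcript of our actual provisioning scripts:

# Rough sketch of provisioning a bare metal server via the packet.net REST API.
# Endpoint, field names and plan/facility values are illustrative assumptions.
import os
import requests

API = "https://api.packet.net"
HEADERS = {"X-Auth-Token": os.environ["PACKET_API_TOKEN"]}

def launch_server(project_id, hostname):
    body = {
        "hostname": hostname,
        "plan": "c1.small.x86",          # example plan name
        "facility": "ewr1",              # example datacenter
        "operating_system": "ubuntu_16_04",
    }
    r = requests.post(f"{API}/projects/{project_id}/devices",
                      json=body, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()

# Usage (hypothetical project id and hostname):
# device = launch_server("your-project-uuid", "swgemu-worker-1")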

Containerizing our services and using kubernetes to manage them gives us the ability to quickly reschedule services on other hardware if we lose a node or it becomes overloaded with work. The industry is rapidly moving to support kubernetes (originally a Google technology), and by standardizing on this system we can leverage other providers if needed in the future or quickly expand our footprint in any of packet.net's datacenters.
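
For the technically inclined, the basic idea can be seen with the official kubernetes Python client: every service runs as a pod that the scheduler has placed on some node, and if a node dies those pods are simply rescheduled onto a surviving one. A tiny sketch that lists where everything is currently running (assuming a kubeconfig on the machine it runs from):

# Tiny sketch using the official kubernetes Python client (pip install kubernetes)
# to show which node the scheduler has placed each pod on.
from kubernetes import client, config

config.load_kube_config()   # reads the local kubeconfig
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name} -> {pod.spec.node_name}")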

By utilizing ZFS for our storage system we are able to take instantaneous snapshots of the data underlying a service. We set up these storage volumes using very high speed non-volatile memory (PCIe NVMe) and joined them in a redundant, high-speed configuration (RAID 10). For most services we were able to deploy packet.net's block store volumes. These are high-speed (PCIe NVMe) network-attached volumes that let us quickly move a service between servers if a server crashes or becomes overloaded with work.
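
To make "instantaneous snapshots" concrete, here is a minimal sketch of taking one from a script; the pool/dataset name (tank/basilisk) is made up for illustration:

# Minimal sketch of taking a ZFS snapshot from a script.
# The dataset name is a placeholder, not our real pool layout.
import subprocess
from datetime import datetime, timezone

def snapshot(dataset):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    snap = f"{dataset}@{stamp}"
    # ZFS snapshots are copy-on-write, so this returns almost instantly
    # even for a multi-hundred-gigabyte database volume.
    subprocess.run(["zfs", "snapshot", snap], check=True)
    return snap

# Usage:
#   snap = snapshot("tank/basilisk")
#   subprocess.run(["zfs", "rollback", snap], check=True)   # roll back if needed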

This combination of our hosting, containerization and storage strategy provides us with many options that were not available to us before the overhaul. This investment should power the project's needs for many years to come and will make it easier for the team to manage existing services and provide new and exciting capabilities to the community in the future.

We expect the short-term financial impact to be a bit higher while services transition and overlap, plus the bandwidth needed to copy files into the new environment. Over time we predict the costs will be about the same as our previous setup, with easily a 10x increase in capacity and capabilities.

Status?

As of December 2, 2017, all services have been moved to the new infrastructure. Basilisk has been happily running on the new hardware since September 24, 2017, and Nova followed not long after. We have moved everything (forums, support site, Jenkins, Gerrit and various other servers) to the new infrastructure.

We have daily snapshot backups that are pushed to external block store volumes, so even if we lose a host completely the worst case is one day of lost progression. Meanwhile, we have deployed database logging on Basilisk and Nova so that every transaction is saved to storage in a way that lets us roll forward from a database crash, if needed, by replaying the changes that happened since the prior copy of the database.
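
Roll-forward recovery is conceptually simple: restore the last known-good copy, then replay every logged change made after that copy was taken. The database engine handles this for us at the storage layer, but a toy sketch of the idea looks like this:

# Toy illustration of roll-forward recovery: start from the last good copy,
# then replay transaction log entries recorded after it. The real database
# engine does this for us; this only shows the concept.
import json

def roll_forward(snapshot_state, log_path, snapshot_seq):
    state = dict(snapshot_state)           # last good copy of the data
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)       # one logged change per line
            if entry["seq"] > snapshot_seq:
                state[entry["key"]] = entry["value"]
    return state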

We maintain daily snapshots for a week, both locally on the server's disks and remotely on block store volumes attached over the network.
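
As a rough sketch of that retention cycle (dataset and remote pool names are placeholders, not our real layout), a daily job might snapshot the dataset, ship the increment to the block store backed pool, and prune anything older than a week:

# Rough sketch of a daily snapshot / replicate / prune cycle.
# Dataset and remote pool names are placeholders.
import subprocess
from datetime import datetime, timedelta, timezone

DATASET = "tank/basilisk"          # local dataset (placeholder)
REMOTE = "blockstore/basilisk"     # dataset on the block store volume (placeholder)
KEEP_DAYS = 7

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def daily_cycle():
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    new_snap = f"{DATASET}@daily-{today}"
    run("zfs", "snapshot", new_snap)

    # All daily snapshots of this dataset, oldest first.
    snaps = [s for s in run("zfs", "list", "-H", "-t", "snapshot",
                            "-o", "name", "-s", "creation").splitlines()
             if s.startswith(f"{DATASET}@daily-")]

    # Ship an incremental stream based on the previous daily snapshot, if any.
    if len(snaps) >= 2:
        prev = snaps[-2]
        send = subprocess.Popen(["zfs", "send", "-i", prev, new_snap],
                                stdout=subprocess.PIPE)
        subprocess.run(["zfs", "receive", "-F", REMOTE],
                       stdin=send.stdout, check=True)
        send.wait()

    # Prune local snapshots older than the retention window.
    cutoff = (datetime.now(timezone.utc) - timedelta(days=KEEP_DAYS)).strftime("%Y-%m-%d")
    for snap in snaps:
        if snap.split("@daily-")[1] < cutoff:
            run("zfs", "destroy", snap)

if __name__ == "__main__":
    daily_cycle()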


What's it look like?

Here is a simplified diagram of our current environment for your viewing pleasure:

https://www.swgemu.com/images/swgemu-infrastructure.png (https://www.swgemu.com/images/swgemu-infrastructure-large.png)

Random Stats

Migrated 16 distinct services
Over 3 Terabytes of data migrated
3,009,329 lines of PHP
18,640,175 lines of C++
102,682,949 lines of Lua
12 copies of sceneobjects.db in various folders
4 people actually read this far in this post.
Next Steps

We've started creating alert bots that send messages to a channel the staff can monitor for issues so they can help escalate as needed.
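
A tiny sketch of the alert-bot idea, with the webhook URL and payload format as placeholders rather than the real bots:

# Tiny sketch of an alert bot: post a message to a chat webhook the staff watch.
# The webhook URL (environment variable) and payload format are placeholders.
import os
import requests

def alert(message):
    requests.post(os.environ["STAFF_ALERT_WEBHOOK"],
                  json={"text": message}, timeout=10)

# Usage: alert("ALERT: degraded RAID array on node-2")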

We will be adding more alerts, testing some deep storage solutions (AWS S3 Glacier and the like) and adding more tools so other members of the staff can help with various tasks without being Unix admin experts.



~lordkator DevOps Engineer

bigevil
12-05-2017, 11:05 AM
Very sexy big V. Don't know what much of that means, but what I understand sounds very good. Thanks team for all the great work, and to TA for that painful, painful-sounding data rebuild. All respect. :)

Toglco
12-05-2017, 11:12 AM
Thank you to TA and all the staff for all of your work over the years!

Lolindir
12-05-2017, 11:20 AM
TA in beastmode... Good stuff.


Oooh, I'm one of the 3....

fortuneseeker
12-05-2017, 11:28 AM
You can make that 4 now!

A lot of that was too technical for me, but one thing that did come through was the amount of gratitude we owe TheAnswer, lordkator and all members of the SWGemu team. SWG was my favourite game for the duration of its life and to be able to play the best version of it years after it shut down is a privilege. Thanks to everyone involved :)

lordkator
12-05-2017, 11:51 AM
I can go deeper into details if people are interested, I was trying to balance not making the update too technical and making it accessible for the broader community.

algebuckina
12-05-2017, 12:13 PM
Looks and sounds pretty brilliant, also thanks a bunch TA for spending so much time fixing our crap.
I can go deeper into details if people are interested, I was trying to balance not making the update too technical and making it accessible for the broader community.

Keen for this.

nee2earth
12-05-2017, 03:03 PM
4 people actually read this far in this post.



Make that 5.

-----------


I can go deeper into details if people are interested,

I'm always interested in details, so please do.

---

p.s. I realize i'm "persona non grata" now but i'm still a longtime devoted member of this Community and therefore would still like to sincerely THANK YOU TA (and oru) for your tireless efforts & execution.

1.0 or bust

Newsound
12-05-2017, 03:12 PM
Thank you TA!! Monuments should be erected dedicated to his amazingness. Nothing too crazy, something tasteful. Like a plaque inside Theed library or so. Thank you guys so much~!!

Miztah
12-05-2017, 03:28 PM
While TA definitely deserves to be commended for his work recovering the database, this infrastructure rework is lordkator's baby. The time and effort he has put into improving our setup for the community is much greater than most realize, and a large factor in why basilisk has been so stable lately.

Vlada
12-05-2017, 03:36 PM
While TA definitely deserves to be commended for his work recovering the database, this infrastructure rework is lordkator's baby. The time and effort he has put into improving our setup for the community is much greater than most realize, and a large factor in why basilisk has been so stable lately.

QFE!!!

Walking carpet
12-05-2017, 04:11 PM
Awesome work folks :)

bigevil
12-05-2017, 06:12 PM
While TA definitely deserves to be commended for his work recovering the database, this infrastructure rework is lordkator's baby. The time and effort he has put into improving our setup for the community is much greater than most realize, and a large factor in why basilisk has been so stable lately.

Lordkator for president of interwebz. :) Good work friend. Seems to be so many layers of improvement.

Lolindir
12-05-2017, 06:17 PM
Lordkator for president of interwebz. :) Good work friend. Seems to be so many layers of improvement.

QFE

shilo
12-05-2017, 06:31 PM
Thanks team!

Telex
12-05-2017, 06:50 PM
Wow! For volunteers, you guys really know how to let everyone know how a project is going. I wish I had you guys available to work on something that I have to deal with in RL. Excellent descriptions/explanations indeed. I haven't had a chance to play in ages, but it's probably time I took another look soon =)

aucob
12-05-2017, 08:11 PM
Wow, amazing work. Thanks to everyone involved!!

RM706
12-05-2017, 08:19 PM
Impressive, most impressive!

Ellyssia
12-05-2017, 10:29 PM
Very nice. Very nice indeed. Good work to all.

Dakkus
12-05-2017, 11:03 PM
freakin amazing, thanks @TheAnswer for the hard work on recovering the data at that time. Mastered every prof on my main and would hate to regrind that any time soon ;_; Cheers to the team for your continued work over the years!

Evil Cyborg 10
12-06-2017, 02:43 AM
Damn, this just proves the insane amount of work that goes into this project behind the scenes. I've said it before but thank you guys for all the hard work you put in.

Scurby
12-06-2017, 04:51 AM
Very nice. Very nice indeed. Good work to all.

Hi Elly :)

bigevil
12-06-2017, 11:45 AM
I've taken to Googleing (Googling? LOL) some of what Lordkator wrote about to understand it more. Pretty impressive stuff. Not to gush repetitively, but feels very sleek and very pro.

Lordkator....you are shiny and chrome. :)

http://xtupload.com/image.php?id=F941_5A27E5DD&gif (http://xtupload.com/share.php?id=F941_5A27E5DD)

Jackleware
12-06-2017, 02:49 PM
Excellent write-up! I appreciate the work that goes into upgrading the environment on a scale such as this. There are a lot of moving parts that are beyond just having a play-test server (Basilisk) and a test-test server (Nova) running SWGEmu code. It was killing me this past August when there were posts getting all upset over backups and notifications. Crap happens. It is what you do after the crap happens that makes the difference.

navaho
12-08-2017, 01:18 AM
6 people

susanmarblesoft
12-08-2017, 02:38 AM
Beautiful and elegant infrastructure :)

I used to be a DBA for a major motion picture studio here in the US, so I totally appreciate how you've designed things. Having a roll-forward feature is priceless.

First-class work, ~lordkator

stevengw
12-08-2017, 11:12 PM
Would be an amazing system to work with and probably expensive. Hope it improves paths and pings for us Western Europeans.

7 people!

Thanks

Serrabell
12-09-2017, 10:16 AM
Awesome stuff, thank all of you so much :o

naveed
12-09-2017, 03:19 PM
Reading the description it mentioned that Bas has been moved to the new hardware but the diagram shows Bas outside the k8s cluster? Has Bas not been containerized yet?

lordkator
12-09-2017, 11:53 PM
Reading the description it mentioned that Bas has been moved to the new hardware but the diagram shows Bas outside the k8s cluster? Has Bas not been containerized yet?

No, due to the incident we decided it was a higher priority to move it first and test the containerized version of Nova and other stuff.

Also, the basilisk db is 600G and requires an incredible amount of IO bandwidth; we wanted to make sure we could bring it up clean first and have the ability to A/B test it in a container, with network volumes etc.

That said basilisk in a container is on my backlog...

lordkator
12-09-2017, 11:54 PM
Would be an amazing system to work with and probably expensive. Hope it improves paths and pings for us Western Europeans.

Basilisk moved on September 24, 2017, so you should be getting the new pings now; if you're still having issues I would look to your local ISP. It's hosted in the same DC as many of the high-frequency financial trading systems and has massive pipes all over the world.

naveed
12-10-2017, 04:05 PM
No, due to the incident we decided it was a higher priority to move it first and test the containerized version of Nova and other stuff.

Also, the basilisk db is 600G and requires an incredible amount of IO bandwidth; we wanted to make sure we could bring it up clean first and have the ability to A/B test it in a container, with network volumes etc.

That said basilisk in a container is on my backlog...

I feel your pain, currently implementing OpenShift at work, which is Red Hat's PaaS built on Kubernetes. Our newer microservice-style apps migrated nicely but our older monolithic apps not so much. Nice work!

SweetPea
12-14-2017, 08:39 AM
Infrastructure overhaul of 2017


4 people actually read this far in this post.



~lordkator DevOps Engineer

i read it all....don't understand what i read but....i did read it! 😁

IxeStarwind
12-14-2017, 01:46 PM
so what's the monthly cost for all this compared to the old system?

Lolindir
12-14-2017, 02:39 PM
so what's the monthly cost for all this compared to the old system?

It's in the post....


Infrastructure overhaul of 2017

December 2017
lordkator



Infrastructure overhaul of 2017



We expect the short-term financial impact to be a bit higher while services transition and overlap, plus the bandwidth needed to copy files into the new environment. Over time we predict the costs will be about the same as our previous setup, with easily a 10x increase in capacity and capabilities.

Random Stats

4 people actually read this far in this post.

PRONG
12-16-2017, 12:18 PM
Make that 6 people (or wherever we are now)

Sound
12-19-2017, 01:11 AM
Very insightful. Thank you Vlada. Always nice reading about the behind the scenes. Thank you for the transparency and incredible work and effort committed to the project.

dauko
12-19-2017, 09:51 AM
Infrastructure overhaul of 2017

Random Stats

4 people actually read this far in this post.




I'm one of those ;)

Qui-larek
12-20-2017, 11:51 PM
LordKator - Maybe you're the one qualified enough to work out why I started to DC all the time about 2yrs ago, after previously never DC'ing at all.

lordkator
12-26-2017, 04:36 PM
LordKator - Maybe you're the one qualified enough to work out why I started to DC all the time about 2yrs ago, after previously never DC'ing at all.

Most likely you're having internet provider issues; we've not seen widespread DC issues either at the old data center or our new location.

jradar71
12-28-2017, 03:31 AM
Excellent news... Standing by for Basilisk wipe and Production release :) Oh how I dream of the day!

SlickSith
12-29-2017, 07:06 AM
You guys are awesome.