    by Vlada, Published on 12-05-2017 11:37 AM

    Infrastructure overhaul of 2017

    December 2017
    lordkator




    Infrastructure overhaul of 2017

    As many in the community are aware, on August 9, 2017 Basilisk experienced an extended unplanned outage due to disk issues on the server. As that process unfolded, TheAnswer promised the community that we would share what happened and what we planned to do about it.

    TL;DR (Summary)

    We lost disks on the original Basilisk server, forcing us to do manual work to restore the game server databases; that work resulted in the restoration of the server on August 16, 2017.

    As of December 4, 2017 all services have been moved to a new environment that is much more robust, has considerably more resources, and is designed to let us handle hardware outages much more easily in the future. Our new environment includes redundant servers, faster internet access, more CPU power, more RAM, and improved storage redundancy and speed.

    The completion of this migration provides a stable platform for the community for many years into the future.

    What happened?

    When the original server for Basilisk was deployed in 2006 it was a state-of-the-art machine, and it received a number of hardware upgrades over the years.

    The system was configured with multiple solid state disks (SSDs) set up to keep two live, online copies of the data (a RAID 1 mirror). The goal of such a setup is that when one disk fails, we can replace it and rebuild the mirror before any data is lost. It also provides higher read rates because the system can ask both disks for different data at the same time.

    The week before this incident one of the mirrored disks failed; our hosting provider failed to notify us, and meanwhile we did not receive the emails from the server alerting us to disk issues. In a sad twist of fate, the second disk mirroring the failed drive also started to fail a week later. The odds of two disks failing within such a short timeframe are fairly low.

    Because of the size of Basilisk's database (over 600 GB) we were only doing backups on an ad-hoc basis to the Nova server's disks, which meant any restore would lose significant amounts of player progress. With this in mind, TheAnswer worked with low-level filesystem debugging tools to extract the database files from the failing drive. This was a painful, slow process that required many iterations to get the data back to a usable state. Much of it was manual, and each step could take many hours to run before the results were known and a decision could be made on the next step. After many sleepless nights, TheAnswer was able to get Basilisk back online on August 16.


    How do we avoid this in the future?


    In response to this event we took inventory of all our services and analyzed our existing setup. As you can imagine, after more than ten years the project had accumulated many services and servers to run our community. This setup was very difficult to maintain due to the many dependencies between the various services and the underlying software, operating systems and hardware.

    After debating various paths forward, the team decided it was time to overhaul our infrastructure. We decided to rebuild from scratch on new bare metal servers from packet.net and to use an open-source technology called Kubernetes to manage the services as individual, movable containers. We would deploy our servers on top of ZFS storage pools, which would give us modern data-safety and management tools.

    Deploying on packet.net gives us an incredible amount of flexibility: rather than opening tickets and emailing back and forth to ask for new machines, we can launch new resources using the packet.net API. In addition, we have "reserved" three servers that run our infrastructure and provide online, ready-to-run redundancy for our services.
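
    To give a feel for what "just launch new resources" means in practice, below is a minimal, hypothetical sketch of provisioning one more bare metal node through the packet.net REST API from Python. The endpoint and field names follow the public Packet API of that era and may have changed since; the token, project ID, plan and facility are placeholders rather than our real configuration.

    Code:
    import requests

    PACKET_API = "https://api.packet.net"
    PROJECT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder project id
    TOKEN = "REPLACE_WITH_API_TOKEN"                      # read/write API token

    def launch_server(hostname):
        """Ask packet.net to provision one more bare metal node."""
        resp = requests.post(
            "{}/projects/{}/devices".format(PACKET_API, PROJECT_ID),
            headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
            json={
                "hostname": hostname,
                "plan": "baremetal_1",             # example server class
                "facility": "ewr1",                # example datacenter
                "operating_system": "ubuntu_16_04",
            },
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        device = launch_server("swgemu-worker-01")   # hypothetical hostname
        print(device.get("id"), device.get("state"))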

    Containerizing our services and using Kubernetes to manage them lets us quickly reschedule services onto other hardware if we lose a node or it becomes overloaded with work. The industry is rapidly adopting Kubernetes (originally a Google technology), and by standardizing on this system we can leverage other providers if needed in the future or quickly expand our footprint in any of packet.net's datacenters.
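
    As an illustration of that rescheduling, here is a minimal sketch (not our actual tooling) that drains a sick or overloaded node using the official Kubernetes Python client, so that its pods are recreated on the remaining healthy nodes. The node name is hypothetical.

    Code:
    from kubernetes import client, config

    def drain_node(node_name):
        config.load_kube_config()        # or config.load_incluster_config() in-cluster
        v1 = client.CoreV1Api()

        # Cordon: mark the node unschedulable so no new pods land on it.
        v1.patch_node(node_name, {"spec": {"unschedulable": True}})

        # Delete the pods currently on the node; their Deployments/StatefulSets
        # recreate them automatically on the other nodes.
        pods = v1.list_pod_for_all_namespaces(
            field_selector="spec.nodeName={}".format(node_name))
        for pod in pods.items:
            v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)

    if __name__ == "__main__":
        drain_node("packet-node-02")     # hypothetical node name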

    By utilizing ZFS for our storage we are able to take instantaneous snapshots of the data underlying a service. We set up these storage volumes on very high-speed non-volatile memory (PCIe NVMe) drives joined in a redundant, high-speed configuration (RAID 10). For most services we were also able to deploy packet.net's block store volumes: high-speed (PCIe NVMe) network-attached volumes that allow us to quickly move a service between servers if a host crashes or becomes overloaded with work.
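
    For readers who have not used ZFS, the snapshots we lean on are point-in-time, effectively instantaneous copies of a dataset. The sketch below shows the general idea by shelling out to the standard zfs command line tool; the pool and dataset names are hypothetical, not our real layout.

    Code:
    import subprocess
    from datetime import datetime

    def snapshot(dataset):
        """Create a timestamped ZFS snapshot of the given dataset."""
        stamp = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
        name = "{}@auto-{}".format(dataset, stamp)
        subprocess.run(["zfs", "snapshot", name], check=True)
        return name

    if __name__ == "__main__":
        # e.g. the dataset backing a game server database (hypothetical name)
        print(snapshot("tank/basilisk/db"))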

    This combination of hosting, containerization and storage strategies provides us with many options that were not available to us before the overhaul. This investment should power the project's needs for many years to come and will make it easier for the team to manage existing services and to provide new and exciting capabilities to the community in the future.

    We expect the short-term financial impact to be a bit higher while services transition and overlap, plus the bandwidth needed to copy files into the new environment. Over time we predict the costs will be about the same as our previous setup, with easily a 10x increase in capacity and capabilities.

    Status?


    As of December 2, 2017, all services have been moved to the new infrastructure. Basilisk has been happily running on the new hardware since September 24, 2017, and Nova followed not long after. We have moved everything, from the forums and support site to Jenkins, Gerrit and various other services, to the new infrastructure.

    We have daily snapshot backups that are pushed to external block store volumes, so we can lose a host completely and have, at worst, one day of lost progression. We have also deployed database logging on Basilisk and Nova so that every transaction is saved to storage in a way that lets us roll a database forward after a crash by replaying the changes that happened since the prior copy of the database.

    We maintain daily snapshots for a week both locally on the server's disks and remotely on blockstore volumes attached over the network.
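
    Pushing those snapshots off the host is conceptually simple: stream the snapshot to another machine with zfs send piped into zfs receive over ssh. The sketch below shows the shape of it; the host, pool and snapshot names are placeholders rather than our real topology.

    Code:
    import subprocess

    def replicate(snapshot, remote_host, remote_dataset):
        """Stream a local ZFS snapshot to a remote machine over ssh."""
        send = subprocess.Popen(["zfs", "send", snapshot], stdout=subprocess.PIPE)
        subprocess.run(
            ["ssh", remote_host, "zfs", "receive", "-F", remote_dataset],
            stdin=send.stdout, check=True)
        send.stdout.close()
        if send.wait() != 0:
            raise RuntimeError("zfs send failed for {}".format(snapshot))

    if __name__ == "__main__":
        replicate("tank/basilisk/db@auto-20171204-000000",   # example snapshot
                  "backup01.example.net",                    # placeholder backup host
                  "backup/basilisk/db")                      # placeholder target dataset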


    What's it look like?


    Here is a simplified diagram of our current environment for your viewing pleasure:



    Random Stats
    • Migrated 16 distinct services
    • Over 3 Terabytes of data migrated
    • 3,009,329 lines of PHP
    • 18,640,175 lines of C++
    • 102,682,949 lines of Lua
    • 12 copies of sceneobjects.db in various folders
    • 3 people actually read this far in this post.

    Next Steps

    We've started creating alert bots that send messages to a channel the staff can monitor for issues so they can help escalate as needed.

    We will be adding more alerts, testing some deep storage solutions (AWS S3 Glacier and the like) and adding more tools so other members of the staff can help with various tasks without being Unix admin experts.


    ~lordkator, DevOps Engineer
    Published on 11-17-2017 12:04 AM

    Q. What is SWGEmu?


    A. SWGEmu is an acronym/abbreviation for Star Wars Galaxies Emulator.
    Star ...
    Published on 09-03-2017 11:16 AM

    Configuring mIRC


    It usually only takes a few minutes to get started with mIRC.
    The following
    ...
    Published on 08-14-2017 02:18 PM

    Basilisk extended downtime

    Posted August 9th 2017
    The SWGEmu Staff




    Basilisk extended downtime


    While doing a routine backup before the merge, one of the HDDs died; unfortunately, that was the second one we have lost in a matter of weeks. Right now TheAnswer and the rest of the development staff are discussing the options and trying to figure out how best to recover all of Basilisk's data. With two HDDs gone, there is a strong possibility that what they recover will be too corrupt to use, or that they will not manage to recover all the data necessary to restore Basilisk to its original state before merging it with the Publish 9 code, in which case we would be forced to do a full database wipe. It is still too early to give any predictions, but one thing is for sure: we are not yet at a point where we can definitively say whether we will be able to move forward one way or the other.


    We apologize for the extended downtime that is probably going to last for days.


    No matter the outcome, we thank you for your dedication and your continued support. Of course we will keep you posted as things progress; as soon as we have any info, you will have it too.

    Update 6:

    Quote Originally Posted by Miztah
    Alright folks, here's our current status. From what we can tell, our database is in good enough shape to proceed with another Publish 9 merge attempt. There's still a lot of potential for corruption or database issues to crop up that we haven't seen, so we're going to be taking some extra precautions in case those issues do pop up. Because of this, when the server is taken down, expect an extended downtime before it comes back up with the new publish.

    Quote Originally Posted by Miztah
    Our current schedule is looking like Basilisk will be taken down approximately 12 hours from now. That timeframe may change between now and then. Once it's taken down, we'll start the process of backing up the database again. With the size of the database, this alone will take a fair amount of time, so we're not going to give an ETA on when it will be back up. Once Basilisk does come up, the server's new navmesh system will begin building meshes for all static and player cities as well as POI's, which will also take some time. Expect the server to remain locked for a while after it comes up.

    TLDR: When we take the server down, expect the update process to take a while to complete, especially with the issues we ran into over the past few days. We'll get the server up and running with Pub9 as quickly as we can.


    Update 5:


    Quote Originally Posted by TheAnswer
    It's a 48 hour test, so it's ongoing. But no bad news so far.
    UPDATE 4:

    Quote Originally Posted by Miztah
    Alright folks, update time. We've managed a recovery of the database and we've got everything copied over to our new SSDs. We're letting Bas run for the weekend as a stress test on the database to ensure there is no corruption that didn't pop up initially. This means that for this weekend, Basilisk is going to remain on Publish 8. We're looking to give the Publish 9 merge another shot on Monday, assuming all goes well over the next 48 hours. We'll update regarding Pub9 closer to Monday when we know for sure.

    In the meantime, continue on Basilisk as you normally would. We'll be keeping an eye out on our side for any potential stability issues.

    Keep in mind that we're still doing integrity checks on the database; if we run into any database corruption between now and then, it's very likely that any progress you make this weekend will be lost when we revert to a backup. This is not guaranteed to occur, but be aware that it is definitely a possibility.

    UPDATE 3:

    Quote Originally Posted by TheAnswer
    We finished uploading 1.1TB of data and now we're going through it.
    UPDATE 2:

    Quote Originally Posted by Miztah
    To clarify a bit, Basilisk was up but still in the process of validating the integrity of most of the database objects. The server is currently running off of our backup HDDs, which is causing the server's loading to take a lot longer than it normally would on SSDs. We shut down the login server because any actions triggered on Bas will delay the database check, and it's going to take long enough as it is.
    Quote Originally Posted by Miztah
    The good news is, the server didn't explode using the recovered database. There's no current bad news, but as of right now, that's the only good news. We still have to verify that all of the objects in the database are actually valid and load properly, and it will take time. The possibility of a wipe still remains if we find too much corrupted data. Time will tell, but in the meantime you'll see Basilisk go online and offline multiple times, but you won't be able to access it. Those still connected will not be able to reconnect when they disconnect, and in the interest of getting this testing done as quickly as possible, they should disconnect on their own anyway.


    UPDATE:


    Quote Originally Posted by Ivojedi
    About the loading status: We are attempting to load a test db instance to check for data integrity. This should give us an indication of whether we'll need to wipe or not. It's loading on the HDDs so it's slow and will take time. Thanks for the continued patience.


    ~The SWGEmu Staff
    Published on 04-18-2017 08:13 AM

    SWGEmu Publish 9 - "Secrets of the Force, Redux" loading screen competition

    April 2017
    The SWGEmu Staff




    The SWGEmu Publish 9 - "Secrets of the Force, Redux" loading screen competition


    Hear ye! Hear ye!

    Publish 9 is looming
    It's just around the corner.
    You have waited far too long,
    SWGEmu Staff has worked too damn hard.
    And now... Now we are so close.
    We can almost taste it.
    So we need to spice things up.
    We need to make it even more... grand!
    But we need your help,
    We need your personal touch.

    So come one, come all.
    Release your inner artist.
    And let your Star Wars geekiness shine.
    Make us a damn fine Publish 9 themed loading screen.
    Winner gets a potato!!!
    And bragging rights.
    Gets to be immortalized through their masterpiece.

    Oh and, you might want to throw in a bit of lore
    to accompany your work of art.
    You know, just to explain
    where the inspiration for it came from.


    Rules are:


    1. Loading screen has to be Star Wars (Galaxies) themed
    2. Has to be Publish 9 themed (I'm gonna go with glowbats and space wizards and ****)
    3. No profanity
    4. No anything against SWGEmu rules and policies
    5. If you use someone else's work, give credit where credit is due
    6. Don't be sore losers



    NOTE: As you are well aware, we are a fickle bunch, so this announcement is still and always will be subject to change... Or not.




    Winner(s) will be announced same time Publish 9 is announced. Probably.




    ~The SWGEmu Staff