lordkator
10-22-2018, 02:27 AM
TL;DR: We survived. It took a lot of blood and sweat, but things are better, and we're generally happier for having made the time. We're also glad the community was so supportive of our extended window (https://www.swgemu.com/forums/showthread.php?t=216206) of downtime.
Details below:
As many of you know, we had a planned extended maintenance window of 36 hours this weekend. I had a lot of lofty goals for that window; we didn't hit everything, but we got the critical stuff done. While working through the process, we discovered one node was having issues with system hangs under heavy network load. I put that behind me and plowed on, with the goal of addressing that system on Sunday.
Much of what we achieved during the extended maintenance window should be invisible to the community. That said, over time we will all reap the rewards, as the new setup will let us move faster, have more flexibility, and automate more of what we have to do to keep the Basilisk beast well fed and cared for as we trudge on towards the goal posts.
A lot of what we do on the back end is magic to many people, so without going into a ton of detail it's hard to enumerate all the specifics in a way that gives a broad audience useful context. With that in mind, let me outline some of the significant changes we achieved and the impact we expect them to have:
1) We cleaned up databases that had been bloated since the great SSD corruption event of 2017. This included rebuilding the database of all those little objects you play with (houses, backpacks, weapons), which was running north of 576 gigabytes with 447 million objects! We had to dump it and reload it from scratch, a process that took 16 hours of wall-clock time. However, it shrank the database down to 376 gigabytes! Reducing the database file size helps us in many ways: when we clone Basilisk to TC-Prime for staging tests we won't have to move as much data, our backups will be smaller, and the pressure on the servers' internal disks will be lighter.
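For scale, here is the back-of-the-envelope math on those numbers (a quick illustration only, using the figures quoted above):

```python
# Back-of-the-envelope math on the numbers quoted above.
before_gb = 576            # object database size before the rebuild
after_gb = 376             # size after the dump-and-reload
objects = 447_000_000      # object count quoted in the post

saved_gb = before_gb - after_gb            # space reclaimed by the rebuild
saved_pct = saved_gb / before_gb * 100     # fraction of bloat removed
avg_bytes = after_gb * 10**9 / objects     # rough average size per object

print(f"{saved_gb} GB freed ({saved_pct:.0f}% smaller), "
      f"~{avg_bytes:.0f} bytes per object")
# -> 200 GB freed (35% smaller), ~841 bytes per object
```

In other words, roughly a third of the file was dead space accumulated since the corruption event.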
2) We wanted to clean up the mail.db file. It holds over 53 million emails, and due to the volume of mail on Basilisk, the purging of old emails has not been keeping up. We took a couple of big bites out of it, but it became clear we needed a better strategy, so we stopped and put that one on the back burner.
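The "big bites" approach has a standard shape: delete in small batches so no single pass locks the store for long. Basilisk's mail.db is not SQLite and these table/column names are invented for illustration, but a minimal sketch of the batched-purge idea looks like this:

```python
import sqlite3

# Illustrative only: a toy mail store with invented schema. The point is
# the purge loop, which removes old rows in small batches so the database
# is never tied up in one giant delete.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mail (id INTEGER PRIMARY KEY, sent_at INTEGER)")
conn.executemany("INSERT INTO mail (sent_at) VALUES (?)",
                 [(t,) for t in range(1000)])

CUTOFF = 500   # purge anything "sent" before this timestamp
BATCH = 100    # small bites, as described in the post

purged = 0
while True:
    cur = conn.execute(
        "DELETE FROM mail WHERE id IN "
        "(SELECT id FROM mail WHERE sent_at < ? LIMIT ?)",
        (CUTOFF, BATCH))
    conn.commit()  # release the store between bites
    if cur.rowcount == 0:
        break
    purged += cur.rowcount

print(f"purged {purged} old mails in batches of {BATCH}")
# -> purged 500 old mails in batches of 100
```

At Basilisk's scale (53 million mails) the batching and the commit between bites are what keep the galaxy responsive while the cleanup grinds away in the background.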
3) After Pub 9 there were some issues with vendors going poof, so the team commented out some code in an attempt to debug the issue. A side effect was that it left lots of "ghosted" items in the database (145 thousand), and we had hoped to clean those up this weekend. After some discussion, and given all the other things we were changing, we decided to hold off; we will circle back on this in a future window, perhaps before the next big publish.
4) Migrating to a container: This is hard to describe for people who don't work in this field, but a container is like a suitcase into which you can throw everything you need for a vacation. These suitcases (containers) can then be shuffled around, piled up, and spread out across physical servers without the headaches of managing all the dependencies, security, and other typical issues we run into when managing systems. Until now, Basilisk has been the last galaxy left running outside in the cold, sort of like dumping all your vacation clothes on the bed before heading to the airport and trying to walk through security carrying them: it's messy, hard to manage, and easy to lose things, not to mention embarrassing when people see your ugly underwear! This weekend we achieved the near impossible by building a full pipeline that lets us create images of Basilisk's code base that run in a container alongside its massive, almost 400 GB of database files.
5) Stretch goal, automated builds: The container achievement was significant because we now have within our grasp the ability to run a full clone of Basilisk in TC-Prime for staging new changes, so people can test and verify things are working before we push to Basilisk. We also worked together to hammer out the build process (how code goes from a developer's brain, to files, to running code on the server) and automate most of it in our build system. Most people are not aware of this, but for quite some time our core devs have been able to push code at Nova through its source code repository, and the system would build, unit-test, and restart Nova automatically. That setup has been a core part of enabling the many changes Miztah (https://www.swgemu.com/forums/member.php?u=76132) has made in recent months: he could quickly test locally, push to Nova, and work with the QA team to verify in a very short timeframe without any intervention from other members of the team. Now that we have that foundation for Basilisk, we can push code changes to it in a much more reliable and straightforward way, including taking automatic snapshots in case a push breaks things and we need to roll back or compare data from a previous state of the system. This change will empower more of the core dev team to push changes to Basilisk (after review, of course) and gives us a much more robust framework for automation. It will also give us a bit more freedom to assign other team members to debug Basilisk, restart it, and handle other routine tasks. Being able to manage Basilisk this way is the thing I'm proudest of, because it brings us squarely into modern deployment practice and will help us move faster as a team.
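The flow described above can be sketched roughly like this. The real SWGEmu build tooling isn't public, so every step name here is a hypothetical stand-in; the point is the shape: build, test, snapshot, then restart, aborting the moment any step fails so a bad build never ships.

```python
# Hypothetical sketch of a push-to-deploy pipeline. Each step must
# succeed before the next one runs; a failing build or failing unit
# tests stop the whole push, and nothing reaches the live galaxy.
def run_pipeline(steps):
    for name, step in steps:
        print(f"==> {name}")
        if not step():
            print(f"aborted at '{name}'; nothing was deployed")
            return False
    print("deploy complete")
    return True

# Stand-in steps; in reality each would shell out to git, the build
# system, the unit-test runner, and the snapshot/restart tooling.
steps = [
    ("fetch pushed source",  lambda: True),
    ("build server binary",  lambda: True),
    ("run unit tests",       lambda: True),
    ("snapshot databases",   lambda: True),  # rollback point if the push breaks
    ("restart galaxy",       lambda: True),
]
ok = run_pipeline(steps)
```

The snapshot step is what makes the "roll back or compare a previous state" part of the post possible: because it runs before the restart, there is always a known-good state to return to.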
Now for the not-so-great news: while we were working through this process, we discovered that one of our servers is having problems with lockups under heavy load. Operations like cloning TC-Prime or running backups were causing that server to lock up at random. It would often recover, but sometimes the lockup lasted long enough to cause services to fail. While I was working on Basilisk tasks, this happened at 2:30 am right in front of me. I worked with our infrastructure provider, and we decided it's time to retire that machine. These issues do not put us at any immediate risk, so we continued on our prepared path and put it on the back burner. Today (Sunday, October 21st) I spent the bulk of my day deploying a new server and pushing some containers (suitcases?) to it to test. If that system proves stable over the next couple of days, I will shuffle things onto the new server. Once that's complete, we can return the old server to our provider and let them tear into it with a blowtorch. It's an unfortunate situation, but it's also an example of why we try to keep at least two beefy machines up and running: if things get bad, we can quickly shuffle the containers onto the other machine and figure out the right path forward. I don't expect most of you will even notice the moves; in general there might be a couple of minutes of forum downtime here or there, and other blips you might not see if you blink. That said, stuff happens, and we will be on the lookout for issues as we migrate. If anything, this is a big plus for our new containerized deployment: moving things is much easier if something goes wrong, and we are already taking advantage of those investments with this process.
Much of this work was originally intended to let us optimize and downsize our operations; clearly the community surprised all of us with its generosity and outpouring of support. With that in mind, we want to reiterate a big thank you to all of you who support us with your time and even your donations. I can't speak for others on staff, but for me the reaction of the extended community really fired me up and made me feel better about the countless hours I work behind the scenes to keep everything "magic" for the community and our developers.
Thank you, and let me remind you all: we're a community. Stop for a second before you lash out in a post or a chat channel, think about the effect of your words, and maybe rewrite your post a couple of times before hitting submit. It might take you an extra 30 seconds, but it might also save someone from quitting the community, and all of us from losing out on their contributions. I guess what I'm saying is: be gentle with each other. Real life is hard enough; we don't need that here. This is where we come to dream of the hope the Star Wars stories tell of and enjoy a small break from the brutal reality of real life.
[Cue video of a ghostly figure walking off into the distance...]