PDA

View Full Version : Maintenance Overview for October 20/21, 2018



lordkator
10-22-2018, 02:27 AM
TL;DR: We survived. Much blood and sweat, but things are better, we're generally happier for having made the time, and we're glad the community was so supportive of our extended window (https://www.swgemu.com/forums/showthread.php?t=216206) of downtime.

Details below:

As many of you know, we had a planned extended maintenance of 36 hours this weekend. I had a lot of lofty goals for what I hoped to achieve during that window; we didn't hit everything, but we got the critical stuff done. While working through the process, we discovered one node was having issues with system hangs under heavy network load. I put that behind me and plowed on, with the goal of addressing that system on Sunday.

Much of what we achieved during the extended maintenance window should be invisible to the community. That said, over time we will all reap the rewards, as the new setups will allow us to move faster, have more flexibility, and automate more of what we have to do to keep the Basilisk beast well fed and cared for as we trudge ahead towards the goal posts.

A lot of what we do in the back end is magic to many people, so without going into a ton of detail it's hard to enumerate the specifics in a way that gives a broad audience context. With that in mind, let me outline some significant changes we achieved and what we expect their impact to be:

1) We cleaned up databases that had been bloated since the great SSD corruption event of 2017. This included rebuilding the database of all those little objects you play with (houses, backpacks, weapons), which was running north of 576 gigabytes with 447 million objects! We had to dump it and reload it from scratch, a process that took 16 hours of wall-clock time, but it shrank the database down to 376 gigabytes! Reducing the database file size helps us in many ways: when we clone Basilisk to TC-Prime for staging tests we won't have to move as much data, our backups will be smaller, and the pressure on the servers' internal disks will be lighter.
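For readers curious why a dump-and-reload shrinks a database so dramatically: most storage engines never return freed pages to the filesystem, so a file keeps the high-water mark of its history until it is rebuilt. Here is a minimal sketch of the effect, using SQLite's VACUUM as a stand-in (Basilisk's object store is a different engine, and the table and column names below are made up for illustration):

```python
import os
import sqlite3
import tempfile

def compacted_sizes(rows=5000, blob_bytes=1024):
    """Fill a table, delete most of it, then rebuild the file.

    Returns (size_after_deletes, size_after_vacuum). SQLite stands in
    for Basilisk's object store; the dump/reload effect is the same idea.
    """
    path = os.path.join(tempfile.mkdtemp(), "objects.db")
    con = sqlite3.connect(path, isolation_level=None)  # autocommit mode
    con.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, data BLOB)")
    con.executemany("INSERT INTO objects (data) VALUES (?)",
                    [(b"x" * blob_bytes,) for _ in range(rows)])
    # Deleting rows frees pages *inside* the file but does not shrink it.
    con.execute("DELETE FROM objects WHERE id % 10 != 0")
    before = os.path.getsize(path)
    # VACUUM rewrites the database into a fresh, densely packed file,
    # the same reason Basilisk's dump-and-reload reclaimed ~200 GB.
    con.execute("VACUUM")
    after = os.path.getsize(path)
    con.close()
    return before, after

if __name__ == "__main__":
    before, after = compacted_sizes()
    print(f"before: {before} bytes, after: {after} bytes")
```

The 16-hour wall-clock cost scales with data volume, which is why smaller files also mean faster future maintenance windows.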

2) We wanted to clean up the mail.db file. It has over 53 million emails in it, and due to the volume of email on Basilisk, it has not been keeping up with purging old emails. We attempted to take a couple of big bites out of that, but it became clear we needed a better strategy, so we decided to stop and put that on the back burner.

3) After Pub 9 there were some issues with vendors going poof, so the team commented out some code in an attempt to debug the issue. A side effect was that it left lots of "ghosted" items in the database (145 thousand), and we had hoped to clean those up this weekend. After some discussion, and given all the other things we were changing, we decided to hold off; we will circle back on this one in a future window, perhaps before the next big publish.

4) Migrate to a container: This is hard to describe for people who don't work in this field, but a container is like a suitcase into which you can throw everything you need for vacation. These suitcases (containers) can then be shuffled around, piled up, and spread out across physical servers without the headaches of managing all the dependencies, security, and other typical issues we run into when managing systems. To date, Basilisk has been the last galaxy to stay running outside in the cold, sort of like throwing all your vacation clothes on the bed before you leave for the airport and trying to walk through security with them: it's messy, hard to manage, and easy to lose things, not to mention embarrassing when people see your ugly underwear! This weekend we achieved the near impossible by building a full pipeline that lets us create images of Basilisk's code base that can run in a container with its massive, almost 400 GB of database files.

5) Stretch goal: Automated builds - The container achievement was significant because we now have within our grasp the ability to run a full clone of Basilisk in TC-Prime for staging new changes, so people can test and verify things are working before we push to Basilisk. We also worked together to hammer out the build process (how code goes from a developer's brain, to files, to code on the server) and automate most of it in our build system. Most people are not aware, but for quite some time we have been able to empower our core devs to push code at Nova through its source code repository, and the system would build, unit test, and restart Nova automatically. This setup has been a core part of empowering the many changes Miztah (https://www.swgemu.com/forums/member.php?u=76132) has made in recent months, because he was able to quickly test locally, push to Nova, and work with the QA team to test in a very short timeframe without any intervention from other members of the team. Now that we have that foundation for Basilisk, we can push code changes to it in a much more reliable and straightforward way, including setting up automatic snapshots in case a push breaks things and we need to roll back or compare data from a previous state of the system. This change will empower more of the core dev team to push changes to Basilisk (after review, of course) and give us a much more robust framework for automation. It will also give us a bit more freedom to assign other team members to debug, restart Basilisk, and handle other routine tasks. Being able to manage Basilisk this way is the thing I'm proudest of, because it brings us squarely into modern-day deployment processes that will help us move faster as a team.
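For the technically curious, the push-to-deploy loop described above reduces to a simple gate: build, unit test, snapshot, deploy, and roll back if the post-deploy health check fails. A rough sketch of that control flow follows; every function name here is a hypothetical stand-in, not SWGEmu's actual tooling:

```python
def deploy_with_rollback(build, unit_test, snapshot, deploy, healthy, rollback):
    """Minimal CI/CD gate: each stage must pass before the next runs,
    and a failed health check restores the pre-deploy snapshot.
    Every callable here is a hypothetical stand-in for real tooling."""
    artifact = build()                 # e.g. compile the server binaries
    if not unit_test(artifact):
        return "rejected: tests failed"
    snap = snapshot()                  # e.g. container image + database snapshot
    deploy(artifact)
    if not healthy():                  # post-deploy smoke check
        rollback(snap)                 # restore, or diff against the old state
        return "rolled back"
    return "deployed"

# Example run with stubbed stages: a deploy whose health check fails
# ends up rolled back rather than left broken.
result = deploy_with_rollback(
    build=lambda: "image-v2",
    unit_test=lambda a: True,
    snapshot=lambda: "snap-v1",
    deploy=lambda a: None,
    healthy=lambda: False,
    rollback=lambda s: None,
)
print(result)  # rolled back
```

The pre-deploy snapshot is what makes pushing to a live galaxy safe: a bad publish becomes a restore rather than an outage.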

Now for the not-so-great news: while we were working through this process, we discovered one of our servers is having problems with lockups under heavy load. Operations like cloning TC-Prime or running backups were causing that server to lock up randomly. It would often recover, but sometimes the lock-up lasted long enough to cause services to fail. While I was working on Basilisk tasks, this happened at 2:30 am right in front of me. I worked with our infrastructure provider, and we decided it's time to retire that machine. These issues don't put us at any immediate risk, so we continued on our prepared path and put the replacement on the back burner. Today (Sunday, October 21st) I spent the bulk of my day deploying a new server and pushing some containers (suitcases?) to it to test. If that system proves stable over the next couple of days, I will shuffle stuff around onto the new server. Once complete, we can return the old server to our provider and let them tear into it with a blowtorch. It's unfortunate, but it's an example of why we try to keep at least two beefy machines up and running: if things get bad, we can quickly shuffle the containers onto the other machine and figure out the right path forward. I don't expect most of you will even notice the moves; in general, there might be a couple of minutes of forums downtime here or there and other things you might not see if you blink. That said, stuff happens, and we will be on the lookout for issues as we migrate. If anything, this is a big plus of our new containerized deployment: moving things is much easier if something goes wrong, and we are now taking advantage of those investments with this process.

We had originally intended much of this work to help us optimize and downsize our operations; clearly, the community surprised all of us with its generosity and outpouring of support. With that in mind, we want to reiterate a big Thank You to all of you who support us with your time and even your donations. I can't speak for others on staff, but for me, the reaction of the extended community really fired me up and made me feel better about the countless hours I work behind the scenes to keep everything "magic" for the community and our developers.

Thank you, and let me remind you all: we're part of a community. Stop for a second before you lash out in a post or in a chat channel, think about the effects of your statements, and maybe re-write your post a couple of times before hitting submit. It might take you an extra 30 seconds, but it might also save someone from quitting the community and us all losing out on their contributions. I guess what I'm saying is, be gentle to each other. Real life is hard enough; we don't need that here. This is where we all dream in the hope the Star Wars stories tell of and enjoy a small break from the brutal reality that is real life.

[Cue video of a ghostly figure walking off into the distance...]

Misk Brebran
10-22-2018, 03:29 AM
Thanks for the synopsis! You are all Clarke’s magicians of technology and thank you so much for the effort you put forth to keep the project running full speed ahead!!

As for the community, it rocks! Yeah, some bicker and moan, but that's family, yo! And after looking at the donations so far, those gestures speak volumes about the community.

Much respect for you all and happy I had the opportunity and chose to become a part of the emu. Only wish I could have joined sooner.

/toast

Cheers!!

nee2earth
10-22-2018, 03:39 AM

Post of the Year ^ imo

Nothin' else to say but THANK YOU cuz I for one really really appreciate this level of explanation, elaboration, & enumeration.

./tiphat LK & Team.

Hakry
10-22-2018, 05:12 AM
Thank you and let me remind you all, we're part of a community... I guess what I'm saying is be gentle to each other...



QFE on this last part for sure!!


A huge thank you to you, LK, for the hard work you put in overall and within the last week; that includes helping me make sure my event went without issue!


Also a big thank you to the rest of the devs as well: Ivo, TA, Miz, Reoze. Everyone really came together this week and helped us all out for the good of the project and the community! More so than just preparing for new player events, but moving forward as a whole!


Truly a remarkable experience to be a part of.

Mistress Aerea
10-22-2018, 05:25 AM
Thank you for the extensive description. And thank you even more for all the hard work and dedication.

/icecream Devs

fortuneseeker
10-22-2018, 08:22 AM
I don't visit the forums very often and post even less but seeing continued hard work of a small number of people keep the game I love alive is one of the real joys in my life - I can't thank you guys enough :)

lgarms3348
10-22-2018, 04:55 PM
Thank you for everything!!

Lolindir
10-22-2018, 05:39 PM
Thank you lordkator for your hard work and thank you to the community who stepped up when the project needed it

Ps. Would it be possible to have the server automatically delete, let's say, 6-month-old in-game mail from the system, so that it doesn't accumulate so much? If people wish to keep old mail, they can just use the command /mailSave

jimmy
10-22-2018, 06:13 PM
Thank you for your time and hard work /tiphat.

Ps. Would it be possible to have the server automatically delete, let's say, 6-month-old in-game mail from the system, so that it doesn't accumulate so much? If people wish to keep old mail, they can just use the command /mailSave... good point.
Thanks, I never knew this command; just used it, perfect! Soon all my chars will have clutter-free mail now that I've got the mail I want to keep on my PC.

nee2earth
10-22-2018, 06:59 PM
Ps. Would it be possible to have the server automatically delete, let's say, 6-month-old in-game mail from the system, so that it doesn't accumulate so much?

I think it already does that...

https://review.swgemu.com/#/c/4518/

...however, from what I understand based on LK's post...



2) We wanted to clean up the mail.db file, it has over 53 Million emails in it, and due to the volume of email on Basilisk, it has not been keeping up with purging old emails.

...perhaps it just can't handle the sheer volume? Or maybe just needs some refinement?

Not sure, but I am sure that TA or Ivo will fix it by the time 1.0 comes out.

Ivojedi
10-22-2018, 09:24 PM
On the mail thing, we added mail expiration with publish 8. The way we implemented it, it checks and deletes old mails during server startup while the mail database is being loaded. We had to add a throttle to the number of mails it will delete each startup because of the sheer number of mails that were already in the database by that point and each would need to be loaded into RAM before deletion (garbage collection doesn't remove them from RAM right away so RAM is a limiting factor). I had to set the throttle fairly low when testing on Nova because it had less RAM than Bas (on our old setup, before we migrated to our current host). The intention was that the throttle would be upped when we merged to Bas, but that seems to have not happened. It has been deleting 25,000 mails each startup since publish 8. Given how infrequent Bas restarts have become, the expiration was not even keeping up with the number of new mails that were being sent so the database continued to grow.

The obvious first step was to up the throttle on Bas, which we have done. Bas was restarted a few times during the maintenance with different throttles to see what was feasible with our current setup. TA also attempted to modify the code so that the mail could be deleted directly from the db without being loaded into RAM but his first go at it didn't work as planned and, as LordKator mentioned, the decision was made to worry about other, more pressing matters so that the maintenance could conclude on schedule. The higher throttle should allow Bas to make headway in reducing the number of mails in the db, but it will still take a long time before it can catch up. The current implementation would likely be fine if it was in place from the start of Bas, but as LK mentioned, we need to come up with a better way to get the backlog cleaned up.
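Ivojedi's description boils down to a bounded purge at load time: expired mails are deleted oldest-first until a per-startup cap is reached, keeping peak RAM in check. A rough sketch of that logic (the real implementation is C++ inside the server; the names and data shapes below are made up for illustration):

```python
import time

def purge_expired_mail(mails, max_age, throttle, now=None):
    """Delete up to `throttle` expired mails in one startup pass.

    `mails` maps mail_id -> sent_timestamp (a made-up shape); `max_age`
    is the expiration age in seconds. Mirrors the publish 8 behaviour of
    a fixed per-boot cap (25,000) so memory use stays bounded.
    """
    now = time.time() if now is None else now
    deleted = 0
    # Oldest first, so the backlog drains from the back of the queue.
    for mail_id, sent in sorted(mails.items(), key=lambda kv: kv[1]):
        if deleted >= throttle:
            break                     # cap reached; resume next startup
        if now - sent > max_age:
            del mails[mail_id]
            deleted += 1
    return deleted
```

Run repeatedly across restarts, the backlog shrinks by at most `throttle` per boot, which is why infrequent Basilisk restarts let the database keep growing until the cap was raised.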

Serrabell
10-22-2018, 09:33 PM
Thanks so much for the hard work, you all are an amazing bunch :)

Misk Brebran
10-22-2018, 10:03 PM
Ps. Would it be possible to have the server automatically delete, lets say 6 months old in game mail from the system, so that it doesn't accumulate so much? If people wish to keep old mail, they can just use the command /mailSave

What about setting up a survey to see what ppl think? I think it's a good idea and I'd even argue to go to 3 months, especially with the /mailSave command. Are there other commands we as the community can use to reduce our footprints on the server, short of deleting our waypoints, datapad, or items/houses?

What about implementing an account login time limit before structures become removed? Or reduce the sheer amount of environmental flora (non-collidable), or is that just randomly generated by the client on startup? Just spitballing here...

nee2earth
10-22-2018, 11:22 PM
Are there other commands we as the community can use to reduce our footprints on the server?

Other than the /emptyMail command that Miztah implemented a while back, i don't believe so.

--

What about implementing an account login time limit before structures become removed?

As far as I'm aware, that type of functionality is governed by the credits already paid up (or taken from the owner's inventory) on the structure itself.

This is one of the reasons why SOE decided on the whole 'House Pack-Up (http://swgcraft.org/forums/viewtopic.php?t=1530)' event every few months, for any houses/structures that had been marked *CONDEMNED* by the Empire (due to lack of maintenance credits) for too long.

----

Or reduce the sheer amount of environmental flora (non-collidable), or is that just randomly generated by the client on startup?

That's client side, but IIRC everyone can individually reduce/adjust it within the 'Terrain' tab of the in-game options (Ctrl+O).

Praxi34
11-26-2018, 02:24 PM
Fantastic and detailed letter to the community, lordkator. Taking the time to explain things to us mere mortals shows your passion and devotion to the project! Well done to you and all the team. Keep up the good work, as always.

#GettingCloserTo1.0

Bishop Will
12-06-2018, 11:41 PM
One more step forward