CPU Spike of Death – Part 2

Out of memory – fixed?

After digging through many logs and memory dumps, and spending about an entire week debugging what was happening to the site, we discovered that the form fields for Image URL and Faction URL were being loaded with large amounts of data in the form of base64-encoded images.
Shout out to Ozzard, PostShalom, Brethern, DC, and tinkan in the community, who have been so very helpful in taking a look at this.

If you have experience with JVMs and diagnosing memory issues, please join us on Discord and help us out.

Analysis

Heap dump after heap dump showed many, many ‘data:image/png/jpg’ strings in the Java heap. The problem was: where were they coming from? These images were overloading the heap and causing the site to stop responding.

Sources of the images

At first I thought they were being collected by the character sheet PDF process and the PDF tools were not releasing the memory.

How it works – the character sheet PDF generation process uses the Image URL and Faction URL during the creation of the character sheet. The server loads a PDF template – then fetches the images and inserts them into the character sheet. Neat little function.

We have now put some checks in place to limit those inbound images to 128K.
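For anyone curious what a size check like that can look like, here is a minimal sketch. It is in Python purely for illustration and is not the site's actual code. The names `read_capped`, `ImageTooLargeError` and `MAX_IMAGE_BYTES` are made up for this example; the only thing taken from above is the 128K limit. The idea is to read the image in chunks and give up as soon as the cap is crossed, instead of pulling the whole thing into memory first.

```python
import io

# Assumption: "128K" is read here as 128 * 1024 bytes.
MAX_IMAGE_BYTES = 128 * 1024


class ImageTooLargeError(Exception):
    """Raised when an inbound image exceeds the size cap (hypothetical name)."""


def read_capped(stream, limit=MAX_IMAGE_BYTES, chunk_size=8192):
    """Read a binary stream in chunks, aborting once more than `limit` bytes arrive."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream reached without crossing the cap.
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > limit:
            # Stop early so an oversized image never sits fully in memory.
            raise ImageTooLargeError(f"image exceeds {limit} bytes")


# Usage with an in-memory stream standing in for a fetched image.
small = read_capped(io.BytesIO(b"x" * 1000))
```

Any file-like object with a `read` method would work in place of the `BytesIO` stand-in.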

Nope, I was wrong. It was not the PDFs. The issue still exists.

After a restart and a fresh heap dump, we were still seeing lots of images in the heap. There was no way this should be happening right after a restart, especially in a test environment where no users were hitting the site but me.

So where were they coming from? They were loading minutes after a restart and in a test environment that had no traffic.

Now we look at everything, digging everywhere that an image could be loaded.

We discovered that the form fields allowed any length of text and there were no input checks.

Apparently the original author didn’t think of what a user might do if they had unlimited size access to these fields.

Inheriting the code, we didn’t think to check it either.

Stupid us.

Query this

Now that we have wiped the egg off our faces, how do we find the bad data?

Climbing the thirty thousand foot wall that is Datomic (the database) and learning how to query it directly, we ran a few hundred queries.

Bingo. We found that the Image URLs and Faction URLs were the culprits. These were matching the images in the heap dumps.

Users were copying and pasting an image from their browsers directly into the Image URL and Faction URL fields!

Big long strings were being inserted into the database, more than fifteen thousand of them.

Well that starts to make sense.

The data:image entries were just a sample. There was other garbage in the table as well.

Lots of malformed URLs and just plain bad data that was being used to try to exploit browsers.

We even found some data from Wikipedia in the Image URL *laugh*

Goes to show you users will do anything to a free service.

This wasn’t the original intent of these text boxes, since their contents are loaded as you hit the site and remain in memory until you log off.

The intention was to use an image URL, like

https://imageserver/mycharacter.jpg

Not insert the base64 encoded image directly in the text box.

A quick fix: limit those text boxes to 255 characters, and the notes and description fields to 50k characters. You will now notice that these fields do not accept anything larger than that.
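A rough sketch of that kind of input check, in Python for illustration only. The function names are hypothetical, and the http/https scheme test is an extra assumption that goes beyond the plain length limit described above. The 255 and 50k numbers come from the fix itself, with 50k read as 50,000 characters.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 255       # limit for the Image URL and Faction URL boxes
MAX_TEXT_LENGTH = 50_000   # limit for notes and description (assumes 50k = 50,000)


def validate_image_url(value):
    """Return (ok, reason) for a value typed into a URL box."""
    if len(value) > MAX_URL_LENGTH:
        return False, "too long"
    try:
        parts = urlsplit(value.strip())
    except ValueError:
        # urlsplit rejects some malformed input, e.g. a broken IPv6 host.
        return False, "malformed"
    # Assumption: only http(s) links are wanted, which also rejects data: URIs.
    if parts.scheme not in ("http", "https"):
        return False, "not an http(s) URL"
    if not parts.netloc:
        return False, "missing host"
    return True, "ok"


def validate_text_field(value):
    """Return True when a notes or description value is within the length limit."""
    return len(value) <= MAX_TEXT_LENGTH


# Usage: a normal link passes, a pasted base64 image does not.
good = validate_image_url("https://imageserver/mycharacter.jpg")
bad = validate_image_url("data:image/png;base64,iVBORw0KGgo=")
```

Checking on the server side matters here, because a length limit that lives only in the browser form can be bypassed.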

Clean up

Using a quick script to dump the invalid URLs, we have purged everything that isn’t a URL from the Image URL and Faction URL fields.
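The triage behind a purge like this can be sketched as follows. This is Python for illustration, not the script we actually ran, and it does not talk to Datomic at all. It assumes the field values have already been exported as (entity id, value) pairs, and the names `classify` and `plan_purge` are made up for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def classify(value):
    """Label a stored field value as 'url', 'data-uri', or 'other'."""
    text = value.strip()
    if text.lower().startswith("data:"):
        return "data-uri"
    try:
        parts = urlsplit(text)
    except ValueError:
        return "other"
    if parts.scheme in ("http", "https") and parts.netloc:
        return "url"
    return "other"


def plan_purge(rows):
    """Split (entity_id, value) rows into ids to keep and ids to purge."""
    keep, purge, counts = [], [], Counter()
    for entity_id, value in rows:
        label = classify(value)
        counts[label] += 1
        (keep if label == "url" else purge).append(entity_id)
    return keep, purge, counts


# Usage with three made-up rows standing in for exported data.
rows = [
    (1, "https://imageserver/mycharacter.jpg"),
    (2, "data:image/png;base64,AAAA"),
    (3, "not a url"),
]
keep, purge, counts = plan_purge(rows)
```

Producing the keep and purge lists first, and only deleting afterwards, leaves room to eyeball the counts before anything is removed.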

Sorry – They were killing the memory on the server – had to be done.

Are we done?

No, unfortunately we are not. Taking another heap dump this morning, we are still seeing images being auto-loaded in the background.

Looks like Datomic (the database that never forgets) is holding these in its index, causing them to be loaded into the heap.

So now – we look into why this is happening and where are THESE coming from.

Good news is the site seems to be running better. We haven’t had any crashes in a 12 hour period and our heap memory hasn’t been exhausted, because these images are not being directly loaded anymore. Some background index is still holding them, though.

The real test will be this weekend, when everyone plays D&D and the site gets some serious traffic.

So this bleeds into Part 3.

We have discovered some issues – put some checks in – addressed them along the way – but still have more searching to do.

I think this is the same problem that Orcpub2.com had over the years – memory leak or just bad coding causing it to crash. Either way – we are moving towards a more stable site.

If you have experience with JVMs and deep diagnosis of memory issues, please join us on Discord and help us out. Ping me at @thDM and let me know you want to help.

Appreciate all the help from everyone on this.

Till next time…

-thDM.
