Friday, April 5, 2019

Alienvault and its database

A few weeks ago, I discovered the hard way that Alienvault 5.7.1 does not clean old events from the database. And like many others, I found out that the patch to fix that, 5.7.2, would not install if there were more than 200 million events in the database.

What ensued was a week-long marathon of cleaning the database: erasing files, removing the old events, and splitting the queries into several sub-queries as the database fought back. Or more precisely, there were so many rows to delete that the DELETE queries failed with "The total number of locks exceeds the lock table size".

Then I installed 5.7.2 and I thought my ordeal was over. It was not.

A couple of days ago, I got the dreaded email "disk space CRITICAL." Less than 10% of the disk space was available, which meant I had only a few hours before the system would start running into issues. So, clean-up time it was. Again.

Several people have run into the same issue, but I found no information on how to solve it for more than a short while, so I looked into the MySQL database to see if something could be done to keep the system from crashing every couple of weeks.

Of all the databases, the one that takes the most space is alienvault_siem, and specifically three tables: acid_event, extra_data and po_acid_event. Each weighed more than 70GB on disk, with extra_data occupying 174GB for about 196 million rows. My limit in Alienvault is set to 80 million events and 14 days of retention, which we now know is not correctly enforced.
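
One way to see where the space goes is a standard information_schema query; alienvault_siem is the schema name on my system, and keep in mind that table_rows is only an estimate for InnoDB tables:

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb,
       table_rows
FROM information_schema.tables
WHERE table_schema = 'alienvault_siem'
ORDER BY (data_length + index_length) DESC;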

DELETE-ing from a table does not reduce its size: it frees slots for new rows, but the size of the backing file stays the same. To shrink the file to its minimum, the OPTIMIZE TABLE command is required. Because these tables use InnoDB, an in-place operation is not possible: a temporary table is created, then it replaces the original table. This means that a fair amount of free space is required on the disk, as one needs up to twice the size of the table to optimize it. It also means that this may run into all kinds of fun problems if the system is really active.
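
For extra_data, the rebuild would be a single statement; on InnoDB, MySQL reports "Table does not support optimize, doing recreate + analyze instead", which is expected and harmless:

-- Rebuilds the table into a temporary copy, then swaps it in.
-- Check the free disk space before running this.
OPTIMIZE TABLE alienvault_siem.extra_data;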

But there was still the issue of locking too many rows at once and having the whole query fail. Time for some SQL.

By experimenting, I found that deleting about 5 million rows at once was fine, so I started by issuing commands such as DELETE FROM extra_data ORDER BY event_id LIMIT 2500000 and repeating them a certain number of times. This had drawbacks: for example, I needed to be present to submit more queries once the current set was exhausted. I could have created a source file for that, though, but that was not very dynamic and still required me to get a count of the rows: new rows are added to this table all the time, making 80 million a moving target.
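
Such a source file would simply repeat the statement, once per block to delete, and be fed to the client in one go (cleanup.sql is just an illustrative name):

-- cleanup.sql: one DELETE per block of 2.5 million rows.
DELETE FROM extra_data ORDER BY event_id LIMIT 2500000;
DELETE FROM extra_data ORDER BY event_id LIMIT 2500000;
DELETE FROM extra_data ORDER BY event_id LIMIT 2500000;

then mysql alienvault_siem < cleanup.sql runs them back to back.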

Instead, I opted for a SQL stored procedure, for two reasons: I could set the target in a WHILE condition and, more importantly, I could run it automatically on a schedule to prevent the database from filling up the disk in the first place, and do it consistently without needing to intervene manually.

Two tables need cleaning up: extra_data and acid_event. I saw that periodic queries are already run to clean po_acid_event, though the way it is done may lead to the query failing to complete if more than 10 million rows were added over the course of one day. Unfortunately, the WHERE clause is a simple timestamp < ......, and if it fails once, it may start failing consistently, leading again to the disk filling up.

The first procedure removes rows from extra_data, in blocks of 2.5 million, until there are fewer than 85 million rows left in the table. Interestingly, it is faster to delete 2.5 million rows than it is to get the count of all the rows in the table.

DELIMITER $$
DROP PROCEDURE IF EXISTS clean_extra_data$$
CREATE PROCEDURE clean_extra_data()
BEGIN
 DECLARE x INT;
 DECLARE xtarget INT;
 SET x = 0;
 -- Rows above the 85 million target; if negative, the loop never runs.
 SELECT count(event_id) - 85000000 INTO xtarget FROM extra_data;
 -- Delete the oldest rows (lowest event_id) in blocks of 2.5 million.
 WHILE xtarget > x DO
  DELETE FROM extra_data ORDER BY event_id LIMIT 2500000;
  SET x = x + 2500000;
 END WHILE;
END $$
DELIMITER ;

The second procedure removes entries from acid_event, in blocks of 2.5 million rows, for as long as the oldest timestamp in the table is older than the time the procedure was started minus 16 days.

DELIMITER $$
DROP PROCEDURE IF EXISTS clean_acid_event$$
CREATE PROCEDURE clean_acid_event()
BEGIN
 DECLARE xlimit DATETIME;
 DECLARE xmin DATETIME;
 -- Cut-off: anything older than 16 days goes.
 SELECT NOW() - INTERVAL 16 DAY INTO xlimit;
 SELECT MIN(timestamp) INTO xmin FROM acid_event;
 -- Delete blocks of 2.5 million until the oldest row is within the cut-off.
 WHILE xlimit > xmin DO
  DELETE FROM acid_event WHERE timestamp < xlimit LIMIT 2500000;
  SELECT MIN(timestamp) INTO xmin FROM acid_event;
 END WHILE;
END $$
DELIMITER ;
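
Both procedures can be called manually first, to check that they behave before putting them on a schedule:

CALL clean_extra_data();
CALL clean_acid_event();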

Initially, I thought about adding an OPTIMIZE TABLE statement to shrink the files. After consideration, this is a potential source of problems: for example, I have had MySQL shut down under very low disk space, and as the OPTIMIZE TABLE query requires loads of additional space, this was just begging for more trouble.

Now, let's schedule these. They should run often enough that each execution does not take too long and the backing files do not grow out of control, but not so often that MySQL is impacted. I settled on every Saturday around 10pm: my instance collects a maximum of 5 million events on a normal day, which totals a maximum of 35 million rows to delete from each table every week. On my system, deleting 2.5 million rows from extra_data takes about 2 minutes, and counting the number of rows in the table around 15 minutes. For 35 million rows, that should run in under an hour. Removing entries from acid_event takes more time, around 10 minutes per block of 2.5 million rows: deleting 35 million rows will take around three hours, still doable.

DELIMITER $$
DROP PROCEDURE IF EXISTS clean_old_data$$
CREATE PROCEDURE clean_old_data()
BEGIN
 -- Time-based cleaning first, then the row-count-based one.
 CALL clean_acid_event();
 CALL clean_extra_data();
END $$
DELIMITER ;

CREATE EVENT alienvault_is_worse_than_a_teen
 ON SCHEDULE EVERY 1 WEEK
  STARTS '2019-04-06 22:00:00'
  ENDS '2019-06-30 00:00:00'
 DO
  CALL clean_old_data();
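
One caveat with CREATE EVENT: it only fires if MySQL's event scheduler is running, and depending on the configuration it may be off. Checking and enabling it is standard MySQL, nothing Alienvault-specific:

-- See whether the scheduler is ON or OFF.
SHOW VARIABLES LIKE 'event_scheduler';
-- Enable it for the running instance; to survive a restart,
-- event_scheduler=ON also needs to be set in my.cnf.
SET GLOBAL event_scheduler = ON;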

Lastly, the words of caution: this works for me as a temporary fix until Alienvault proposes a real solution to this problem. I am pretty certain this band-aid may create a lot of other issues down the road, such as orphaned entries or events without data. There is no guarantee that this won't break something or won't render your system unusable; use it at your own risk.