Todd Lyons Chronicle of fixing broken replication, and a question
Jul 26, 2011; 07:18
Yeah, it happens to all of us. Some master->slave replication setup breaks and goes unnoticed. In my case, I had a monitoring script, but it was being called with a --verbose flag; internally that "verbose" was being mapped to "debug", and the routine that notifies sysadmins of broken replication doesn't fire if debug is set, so the warning emails were never sent out. You might ask: why is this a problem? Just restart replication, skipping the problem statements. Well, the big issue was that I have automatic purging of binlogs enabled. Over the course of a few days, the purging of binlogs surpassed the point where replication had stopped. Damn...
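For what it's worth, the alerting bug above boils down to something like this. A minimal Python sketch; the function and flag names are my invention, not the actual monitoring script:

```python
# Sketch of the alerting bug: mapping --verbose onto an internal
# "debug" flag silently disables the notification path.
# All names here are hypothetical.

def check_replication(slave_ok, verbose=False):
    sent = []
    debug = verbose  # the bug: "verbose" silently implies "debug"
    if not slave_ok:
        if debug:
            # debug mode only prints; it never mails anyone
            print("replication broken (debug mode, no mail sent)")
        else:
            sent.append("mail to sysadmins: replication broken")
    return sent

# Called with --verbose, a broken slave produces no alert at all:
assert check_replication(slave_ok=False, verbose=True) == []
# Without it, the alert goes out as intended:
assert check_replication(slave_ok=False) == ["mail to sysadmins: replication broken"]
```

The moral being: an option that only changes chattiness should never also change whether alerts fire.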
This is a mysql 5.0.77 (CentOS 5.x) installation with maatkit and xtrabackup utilities available. So there's really only one thing to do. xtrabackup to the rescue!
1. I did a full copy of the running master database server using xtrabackup to a backup server via NFS. It took 2 hours, of which the last 15 minutes held a write lock on the entire server while it copied over the *.frm files and the few MyISAM tables. This was the most troublesome part, as it was visible to both users and website owners :-(
2. On the slave, I did the apply-logs step (via NFS to the backup server), taking another 2 hours.
3. I renamed the old directory on the slave server and created a new mysql directory, owned by the correct user.
4. I copied the backup data from the backup server to this new directory.
5. I deleted the ib_logfile* files so that mysql would recreate them on startup based on the sizes set in my.cnf.
6. I started it up. It still wanted to do a crash recovery, since the ibdata and ib_logfile files didn't match, but that was expected. After a few minutes of scanning (we use innodb_file_per_table, so it had a lot of tablespace spread out over lots of files), it started up and was accepting connections. Looking good so far.
7. I did 'cat xtrabackup_slave_info'. For some reason it had this: CHANGE MASTER TO MASTER_LOG_FILE='', MASTER_LOG_POS *UH OH* *CRAP* *$#^@%&!*
** See note below
8. Calm down, think about things a bit. Obviously I didn't compensate for the fact that I was doing the original backup on the master but the apply-log on the slave (and the --slave-info option doesn't have that data available when the backup is taken on the master).
9. I know roughly what time I started the backup. I know that this is a database server dedicated to ecommerce shopping carts. As a result, there is a steady flow of data coming in as customers and web robots access the shopping carts.
10. Knowing the data in #9, I looked at the statements being replicated with mysqlbinlog on the master with --start-datetime="2011-07-25 12:00:00". A quick query of the data in that database on the newly-recovered-but-not-yet-replicating slave found me a last timestamp of 13:16:49.
11. I adjusted my timestamp and traced data from that point until I found a statement that didn't get inserted into the database by the replication process (meaning, this was when I started the backup). I determined that the timestamp of data not yet in the database was 13:16:52.
12. Comments in the mysqlbinlog output at that timestamp indicated it was at position 487727085.
13. I repeated the mysqlbinlog run using --start-position=487727085 and verified that it produced the same output as #12.
14. I did a CHANGE MASTER statement on the slave, setting the position to 487727085, started replication, and it caught up after about 20 minutes.
15. If #14 had had issues, I did have mk-slave-restart available to force it to skip problems and restart replication.
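Steps 10-12 above are essentially a linear scan of the binlog for the first event newer than the last row present on the slave. A minimal Python sketch of that idea, using made-up (timestamp, position) tuples rather than real mysqlbinlog output:

```python
# Sketch of steps 10-12: given binlog events as (timestamp, position)
# pairs (as one might parse out of mysqlbinlog comment lines) and the
# newest timestamp found on the slave, find the position to resume
# from. The event data below is synthetic, not real binlog output.

def find_resume_position(events, last_slave_ts):
    """Return the log position of the first event the slave never saw."""
    # Fixed-format "YYYY-MM-DD HH:MM:SS" timestamps compare correctly
    # as strings; events are assumed ordered by position.
    for ts, pos in events:
        if ts > last_slave_ts:
            return pos
    return None  # the slave already has everything

events = [
    ("2011-07-25 13:16:49", 487726001),
    ("2011-07-25 13:16:49", 487726544),
    ("2011-07-25 13:16:52", 487727085),  # first row missing on the slave
]

# Slave's newest row is from 13:16:49, so we resume at 487727085:
assert find_resume_position(events, "2011-07-25 13:16:49") == 487727085
```

This only works safely because the workload has a steady stream of timestamped inserts; on a quieter or timestamp-free workload there is nothing to anchor the comparison to.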
So everything is alright. There are no issues. But there is a question.
Is there a better way to do this? Is there something automated that will "compare THIS binlog to THAT server and tell me where the data was no longer being inserted"? I'm looking to see how others deal with a complete restart of a running system like this.
** It turns out that xtrabackup gave me confidence in the log position I had determined. Running the backup on the master, it also made this file:

# cat xtrabackup_binlog_info
mysqld-bin.000259	487727085

which of course confirmed the log position I determined in step #12.
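Since xtrabackup_binlog_info is just a two-field text file (log file name, position), turning it into the CHANGE MASTER statement is trivial. A small Python sketch, with the file contents mirroring the output quoted above:

```python
# Sketch: build a CHANGE MASTER statement from the two-field
# xtrabackup_binlog_info file (log file name, position). The file
# contents below mirror the output quoted in the post.

def change_master_from_binlog_info(text):
    logfile, pos = text.split()[:2]
    return ("CHANGE MASTER TO MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d;"
            % (logfile, int(pos)))

info = "mysqld-bin.000259\t487727085\n"
stmt = change_master_from_binlog_info(info)
assert stmt == ("CHANGE MASTER TO MASTER_LOG_FILE='mysqld-bin.000259', "
                "MASTER_LOG_POS=487727085;")
```

(In a real CHANGE MASTER you would also supply MASTER_HOST and credentials, or rely on the values already configured on the slave.)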
To be honest, the biggest step in the sequence above is likely #8. The ability to assess a situation without freaking out (for too long) is a big part of a sysadmin's job, IMHO.
Regards... Todd -- If Americans could eliminate sugary beverages, potatoes, white bread, pasta, white rice and sugary snacks, we would wipe out almost all the problems we have with weight and diabetes and other metabolic diseases. -- Dr. Walter Willett, Harvard School of Public Health
-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/mysql?unsub=listsearcharchive@lassosoft.com