Greenplum 激活standby master失败后的异常修复
背景
激活standby master失败后,主库和备库都起不来了。
如下,修改了MASTER_DATA_DIRECTORY和PGPORT环境变量为新的主库,启动主库。
$gpstart -a
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -a
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode
20151222:16:49:43:073138 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode
20151222:16:49:43:073138 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1
Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-2 -l /disk1/digoal/gpdata/gpseg-2/pg_log/startup.log -w -t 600 -o " -p 1922 -b 48 -z 0 --silent-mode=true -i -M master -C -1 -x 0 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start...... stopped waiting
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-2/postmaster.pid" does not exist
pg_ctl: could not start server
Examine the log output.
'
失败
手工执行命令当然也不行
env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-2 -l /disk1/digoal/gpdata/gpseg-2/pg_log/startup.log -w -t 600 -o " -p 1922 -b 48 -z 0 --silent-mode=true -i -M master -C -1 -x 0 -c gp_role=utility " start
使用master only模式启动当然也是不行的。
$gpstart -m
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -m
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Master-only start requested in configuration without a standby master.
Continue with master-only startup Yy|Nn (default=N):
> y
20151222:16:58:06:077478 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode
20151222:16:58:08:077478 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode
20151222:16:58:08:077478 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1
Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-2 -l /disk1/digoal/gpdata/gpseg-2/pg_log/startup.log -w -t 600 -o " -p 1922 -b 48 -z 0 --silent-mode=true -i -M master -C -1 -x 0 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start...... stopped waiting
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-2/postmaster.pid" does not exist
pg_ctl: could not start server
Examine the log output.
'
限制模式启动也不行
$gpstart -R
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -R
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode
20151222:16:57:24:076997 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode
20151222:16:57:24:076997 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1
Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-1 -l /disk1/digoal/gpdata/gpseg-1/pg_log/startup.log -w -t 600 -o " -p 1921 -b 1 -z 0 --silent-mode=true -i -M master -C -1 -x 48 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start...... stopped waiting
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-1/postmaster.pid" does not exist
pg_ctl: could not start server
Examine the log output.
'
修改了MASTER_DATA_DIRECTORY和PGPORT环境变量为老的主库
然后试图激活原来的主库也失败
$gpactivatestandby -f
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:------------------------------------------------------
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Standby data directory = /disk1/digoal/gpdata/gpseg-1
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Standby port = 1921
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Standby running = no
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Force standby activation = yes
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:------------------------------------------------------
Do you want to continue with standby master activation? Yy|Nn (default=N):
> y
20151222:16:51:29:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Starting standby master database in utility mode...
20151222:16:51:31:074293 gpactivatestandby:digoal_host:digoal-[CRITICAL]:-Error activating standby master: ExecutionError: 'non-zero rc: 2' occured. Details: 'GPSTART_INTERNAL_MASTER_ONLY=1 $GPHOME/bin/gpstart -a -m -v' cmd had rc=2 completed=True halted=False
stdout='20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -a -m -v
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Setting level of parallelism to: 64
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Checking if GPHOME env variable is set.
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Checking if MASTER_DATA_DIRECTORY env variable is set.
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Checking if LOGNAME or USER env variable is set.
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:---Checking that current user can use GP binaries
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Obtaining master's port from master data directory
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Read from postgresql.conf port=1921
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Read from postgresql.conf max_connections=48
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-gp_external_grant_privileges is None
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Reading the gp_dbid file - /disk1/digoal/gpdata/gpseg-1/gp_dbid...
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Parsing : # Greenplum Database identifier for this master/segment. ...
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Parsing : # Do not change the contents of this file. ...
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Parsing : dbid = 1 ...
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Found match for dbid: 1.
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Parsing : standby_dbid = 48 ...
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Found match for standby_dbid: 48.
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Check if Master is already running...
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Master-only start requested for management utilities.
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode
20151222:16:51:31:074365 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode
20151222:16:51:31:074365 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1
Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-1 -l /disk1/digoal/gpdata/gpseg-1/pg_log/startup.log -w -t 600 -o " -p 1921 -b 1 -z 0 --silent-mode=true -i -M master -C -1 -x 48 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start...... stopped waiting
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-1/postmaster.pid" does not exist
pg_ctl: could not start server
Examine the log output.
'
'
stderr=''
老的主库,以master only启动也失败。
$gpstart -m
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -m
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-****************************************************************************
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-Master-only start requested in a configuration with a standby master.
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-This is advisable only under the direct supervision of Greenplum support.
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-This mode of operation is not supported in a production environment and
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-may lead to a split-brain condition and possible unrecoverable data loss.
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-****************************************************************************
Continue with master-only startup Yy|Nn (default=N):
> y
20151222:16:57:44:077229 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode
20151222:16:57:46:077229 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode
20151222:16:57:46:077229 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1
Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-1 -l /disk1/digoal/gpdata/gpseg-1/pg_log/startup.log -w -t 600 -o " -p 1921 -b 1 -z 0 --silent-mode=true -i -M master -C -1 -x 48 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start...... stopped waiting
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-1/postmaster.pid" does not exist
pg_ctl: could not start server
Examine the log output.
'
老的主库启动时,报错如下:
2015-12-22 16:57:45.959837 CST,,,p77246,th273340192,,,,0,,,seg-1,,,,,"LOG","00000","Found recovery.conf file, checking appropriate parameters for recovery in standby mode",,,,,,,0,,"xlog.c",5663,
2015-12-22 16:57:46.010953 CST,,,p77246,th273340192,,,,0,,,seg-1,,,,,"FATAL","XX000","recovery command file ""recovery.conf"" request for standby mode not specified (xlog.c:5756)",,,,,,,0,,"xlog.c",5756,"Stack trace:
1 0xb04cde postgres errstart (elog.c:502)
2 0x54afe7 postgres XLogReadRecoveryCommandFile (xlog.c:5754)
3 0x560a84 postgres StartupXLOG (xlog.c:6441)
4 0x564966 postgres StartupProcessMain (xlog.c:10970)
5 0x5f4675 postgres AuxiliaryProcessMain (bootstrap.c:463)
6 0x8eacd4 postgres <symbol not found> (postmaster.c:7589)
7 0x8eaefd postgres StartMasterOrPrimaryPostmasterProcesses (postmaster.c:1576)
8 0x8fce76 postgres doRequestedPrimaryMirrorModeTransitions (primary_mirror_mode.c:1735)
9 0x8f4122 postgres <symbol not found> (postmaster.c:2272)
10 0x8f76f0 postgres PostmasterMain (postmaster.c:7589)
11 0x7fa58f postgres main (main.c:206)
12 0x7f1b11cb4cdd libc.so.6 __libc_start_main (??:0)
13 0x4c2cf9 postgres <symbol not found> (??:0)
"
2015-12-22 16:57:46.012709 CST,,,p77240,th273340192,,,,0,,,seg-1,,,,,"LOG","00000","startup process (PID 77246) exited with exit code 1",,,,,,,0,,"postmaster.c",5854,
2015-12-22 16:57:46.012735 CST,,,p77240,th273340192,,,,0,,,seg-1,,,,,"LOG","00000","aborting startup due to startup process failure",,,,,,,0,,"postmaster.c",4706,
进入老的主库数据目录,发现多了两个文件
cd /disk1/digoal/gpdata/gpseg-1
-rw-r--r-- 1 digoal users 0 Dec 22 16:51 promote
-rw-r--r-- 1 digoal users 0 Dec 22 16:51 recovery.conf
promote代表要激活它,recovery.conf没有用。
把这两个文件删掉。
现在要做的时,把老的主库起来,然后删掉不能起来的standby master。
$gpstart -m
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -m
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-****************************************************************************
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-Master-only start requested in a configuration with a standby master.
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-This is advisable only under the direct supervision of Greenplum support.
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-This mode of operation is not supported in a production environment and
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-may lead to a split-brain condition and possible unrecoverable data loss.
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-****************************************************************************
Continue with master-only startup Yy|Nn (default=N):
> y
20151222:18:20:29:116706 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode
20151222:18:20:32:116706 gpstart:digoal_host:digoal-[INFO]:-Obtaining Greenplum Master catalog information
20151222:18:20:32:116706 gpstart:digoal_host:digoal-[INFO]:-Obtaining Segment details from master...
20151222:18:20:32:116706 gpstart:digoal_host:digoal-[INFO]:-Setting new master era
20151222:18:20:32:116706 gpstart:digoal_host:digoal-[INFO]:-Master Started...
删除standby master
$gpinitstandby -r
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:------------------------------------------------------
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Warm master standby removal parameters
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:------------------------------------------------------
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum master hostname = digoal_host.sqa.zmf
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum master data directory = /disk1/digoal/gpdata/gpseg-1
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum master port = 1921
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum standby master hostname = digoal_host.sqa.zmf
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum standby master port = 1922
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum standby master data directory = /disk1/digoal/gpdata/gpseg-2
Do you want to continue with deleting the standby master? Yy|Nn (default=N):
> y
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Removing standby master from catalog...
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Database catalog updated successfully.
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Removing standby entry from gp_transaction_files_filespace flat file
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Removing standby entry from gp_temporary_files_filespace flat file
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Removing filespace directories on standby master...
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Successfully removed standby master
现在可以关闭并启动主库了。
$gpstop -M fast -a
$gpstart -a