Greenplum 激活standby master失败后的异常修复

6 minute read

背景

激活standby master失败后,主库和备库都起不来了。

如下,修改了MASTER_DATA_DIRECTORY和PGPORT环境变量为新的主库,启动主库。

$gpstart -a  
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -a  
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...  
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'  
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'  
20151222:16:49:41:073138 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode  
20151222:16:49:43:073138 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode  
20151222:16:49:43:073138 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1  
 Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-2 -l /disk1/digoal/gpdata/gpseg-2/pg_log/startup.log -w -t 600 -o " -p 1922 -b 48 -z 0 --silent-mode=true -i -M master -C -1 -x 0 -c gp_role=utility " start'  
rc=1, stdout='waiting for server to start...... stopped waiting  
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-2/postmaster.pid" does not exist  
pg_ctl: could not start server  
Examine the log output.  
'  

失败

手工执行命令当然也不行

env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-2 -l /disk1/digoal/gpdata/gpseg-2/pg_log/startup.log -w -t 600 -o " -p 1922 -b 48 -z 0 --silent-mode=true -i -M master -C -1 -x 0 -c gp_role=utility " start  

使用master only模式启动当然也是不行的。

$gpstart -m  
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -m  
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...  
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'  
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'  
20151222:16:58:05:077478 gpstart:digoal_host:digoal-[INFO]:-Master-only start requested in configuration without a standby master.  
  
Continue with master-only startup Yy|Nn (default=N):  
> y  
20151222:16:58:06:077478 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode  
20151222:16:58:08:077478 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode  
20151222:16:58:08:077478 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1  
 Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-2 -l /disk1/digoal/gpdata/gpseg-2/pg_log/startup.log -w -t 600 -o " -p 1922 -b 48 -z 0 --silent-mode=true -i -M master -C -1 -x 0 -c gp_role=utility " start'  
rc=1, stdout='waiting for server to start...... stopped waiting  
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-2/postmaster.pid" does not exist  
pg_ctl: could not start server  
Examine the log output.  
'  

限制模式启动也不行

$gpstart -R  
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -R  
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...  
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'  
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'  
20151222:16:57:21:076997 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode  
20151222:16:57:24:076997 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode  
20151222:16:57:24:076997 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1  
 Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-1 -l /disk1/digoal/gpdata/gpseg-1/pg_log/startup.log -w -t 600 -o " -p 1921 -b 1 -z 0 --silent-mode=true -i -M master -C -1 -x 48 -c gp_role=utility " start'  
rc=1, stdout='waiting for server to start...... stopped waiting  
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-1/postmaster.pid" does not exist  
pg_ctl: could not start server  
Examine the log output.  
'  

修改了MASTER_DATA_DIRECTORY和PGPORT环境变量为老的主库

然后试图激活原来的主库也失败

$gpactivatestandby  -f  
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:------------------------------------------------------  
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Standby data directory    = /disk1/digoal/gpdata/gpseg-1  
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Standby port              = 1921  
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Standby running           = no  
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Force standby activation  = yes  
20151222:16:51:28:074293 gpactivatestandby:digoal_host:digoal-[INFO]:------------------------------------------------------  
Do you want to continue with standby master activation? Yy|Nn (default=N):  
> y  
20151222:16:51:29:074293 gpactivatestandby:digoal_host:digoal-[INFO]:-Starting standby master database in utility mode...  
20151222:16:51:31:074293 gpactivatestandby:digoal_host:digoal-[CRITICAL]:-Error activating standby master: ExecutionError: 'non-zero rc: 2' occured.  Details: 'GPSTART_INTERNAL_MASTER_ONLY=1 $GPHOME/bin/gpstart -a -m -v'  cmd had rc=2 completed=True halted=False  
  stdout='20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -a -m -v  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Setting level of parallelism to: 64  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Checking if GPHOME env variable is set.  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Checking if MASTER_DATA_DIRECTORY env variable is set.  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Checking if LOGNAME or USER env variable is set.  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:---Checking that current user can use GP binaries  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Obtaining master's port from master data directory  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Read from postgresql.conf port=1921  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Read from postgresql.conf max_connections=48  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-gp_external_grant_privileges is None  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Reading the gp_dbid file - /disk1/digoal/gpdata/gpseg-1/gp_dbid...  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Parsing : # Greenplum Database identifier for this master/segment. ...  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Parsing : # Do not change the contents of this file. ...  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Parsing : dbid = 1 ...  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Found match for dbid: 1.  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Parsing : standby_dbid = 48 ...  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Found match for standby_dbid: 48.  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[DEBUG]:-Check if Master is already running...  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Master-only start requested for management utilities.  
20151222:16:51:29:074365 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode  
20151222:16:51:31:074365 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode  
20151222:16:51:31:074365 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1  
 Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-1 -l /disk1/digoal/gpdata/gpseg-1/pg_log/startup.log -w -t 600 -o " -p 1921 -b 1 -z 0 --silent-mode=true -i -M master -C -1 -x 48 -c gp_role=utility " start'  
rc=1, stdout='waiting for server to start...... stopped waiting  
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-1/postmaster.pid" does not exist  
pg_ctl: could not start server  
Examine the log output.  
'  
'  
  stderr=''  

老的主库,以master only启动也失败。

$gpstart -m  
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -m  
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...  
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'  
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'  
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-****************************************************************************  
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-Master-only start requested in a configuration with a standby master.  
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-This is advisable only under the direct supervision of Greenplum support.   
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-This mode of operation is not supported in a production environment and   
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-may lead to a split-brain condition and possible unrecoverable data loss.  
20151222:16:57:43:077229 gpstart:digoal_host:digoal-[WARNING]:-****************************************************************************  
  
Continue with master-only startup Yy|Nn (default=N):  
> y  
20151222:16:57:44:077229 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode  
20151222:16:57:46:077229 gpstart:digoal_host:digoal-[CRITICAL]:-Failed to start Master instance in admin mode  
20151222:16:57:46:077229 gpstart:digoal_host:digoal-[CRITICAL]:-Error occurred: non-zero rc: 1  
 Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /disk1/digoal/gpdata/gpseg-1 -l /disk1/digoal/gpdata/gpseg-1/pg_log/startup.log -w -t 600 -o " -p 1921 -b 1 -z 0 --silent-mode=true -i -M master -C -1 -x 48 -c gp_role=utility " start'  
rc=1, stdout='waiting for server to start...... stopped waiting  
', stderr='pg_ctl: PID file "/disk1/digoal/gpdata/gpseg-1/postmaster.pid" does not exist  
pg_ctl: could not start server  
Examine the log output.  
'  

老的主库启动时,报错如下:

2015-12-22 16:57:45.959837 CST,,,p77246,th273340192,,,,0,,,seg-1,,,,,"LOG","00000","Found recovery.conf file, checking appropriate parameters  for recovery in standby mode",,,,,,,0,,"xlog.c",5663,  
2015-12-22 16:57:46.010953 CST,,,p77246,th273340192,,,,0,,,seg-1,,,,,"FATAL","XX000","recovery command file ""recovery.conf"" request for standby mode not specified (xlog.c:5756)",,,,,,,0,,"xlog.c",5756,"Stack trace:  
1    0xb04cde postgres errstart (elog.c:502)  
2    0x54afe7 postgres XLogReadRecoveryCommandFile (xlog.c:5754)  
3    0x560a84 postgres StartupXLOG (xlog.c:6441)  
4    0x564966 postgres StartupProcessMain (xlog.c:10970)  
5    0x5f4675 postgres AuxiliaryProcessMain (bootstrap.c:463)  
6    0x8eacd4 postgres <symbol not found> (postmaster.c:7589)  
7    0x8eaefd postgres StartMasterOrPrimaryPostmasterProcesses (postmaster.c:1576)  
8    0x8fce76 postgres doRequestedPrimaryMirrorModeTransitions (primary_mirror_mode.c:1735)  
9    0x8f4122 postgres <symbol not found> (postmaster.c:2272)  
10   0x8f76f0 postgres PostmasterMain (postmaster.c:7589)  
11   0x7fa58f postgres main (main.c:206)  
12   0x7f1b11cb4cdd libc.so.6 __libc_start_main (??:0)  
13   0x4c2cf9 postgres <symbol not found> (??:0)  
"  
2015-12-22 16:57:46.012709 CST,,,p77240,th273340192,,,,0,,,seg-1,,,,,"LOG","00000","startup process (PID 77246) exited with exit code 1",,,,,,,0,,"postmaster.c",5854,  
2015-12-22 16:57:46.012735 CST,,,p77240,th273340192,,,,0,,,seg-1,,,,,"LOG","00000","aborting startup due to startup process failure",,,,,,,0,,"postmaster.c",4706,  

进入老的主库数据目录,发现多了两个文件

cd /disk1/digoal/gpdata/gpseg-1  
-rw-r--r-- 1 digoal users     0 Dec 22 16:51 promote  
-rw-r--r-- 1 digoal users     0 Dec 22 16:51 recovery.conf  

promote代表要激活它,recovery.conf没有用。

把这两个文件删掉。

现在要做的时,把老的主库起来,然后删掉不能起来的standby master。

$gpstart -m  
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[INFO]:-Starting gpstart with args: -m  
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[INFO]:-Gathering information and validating the environment...  
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.6.1 build 2'  
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[INFO]:-Greenplum Catalog Version: '201310150'  
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-****************************************************************************  
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-Master-only start requested in a configuration with a standby master.  
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-This is advisable only under the direct supervision of Greenplum support.   
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-This mode of operation is not supported in a production environment and   
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-may lead to a split-brain condition and possible unrecoverable data loss.  
20151222:18:20:28:116706 gpstart:digoal_host:digoal-[WARNING]:-****************************************************************************  
  
Continue with master-only startup Yy|Nn (default=N):  
> y  
20151222:18:20:29:116706 gpstart:digoal_host:digoal-[INFO]:-Starting Master instance in admin mode  
20151222:18:20:32:116706 gpstart:digoal_host:digoal-[INFO]:-Obtaining Greenplum Master catalog information  
20151222:18:20:32:116706 gpstart:digoal_host:digoal-[INFO]:-Obtaining Segment details from master...  
20151222:18:20:32:116706 gpstart:digoal_host:digoal-[INFO]:-Setting new master era  
20151222:18:20:32:116706 gpstart:digoal_host:digoal-[INFO]:-Master Started...  

删除standby master

$gpinitstandby -r  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:------------------------------------------------------  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Warm master standby removal parameters  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:------------------------------------------------------  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum master hostname               = digoal_host.sqa.zmf  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum master data directory         = /disk1/digoal/gpdata/gpseg-1  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum master port                   = 1921  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum standby master hostname       = digoal_host.sqa.zmf  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum standby master port           = 1922  
20151222:18:20:51:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Greenplum standby master data directory = /disk1/digoal/gpdata/gpseg-2  
Do you want to continue with deleting the standby master? Yy|Nn (default=N):  
> y  
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Removing standby master from catalog...  
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Database catalog updated successfully.  
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Removing standby entry from gp_transaction_files_filespace flat file  
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Removing standby entry from gp_temporary_files_filespace flat file  
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Removing filespace directories on standby master...  
20151222:18:20:52:116968 gpinitstandby:digoal_host:digoal-[INFO]:-Successfully removed standby master  

现在可以关闭并启动主库了。

$gpstop -M fast -a  
$gpstart -a  

Flag Counter

digoal’s 大量PostgreSQL文章入口