The PostgreSQL rename Code Fix Controversy


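Background

The recommended permission for the PostgreSQL data directory, including all files under it, is 700, and the owner should be the operating system user that starts the database cluster.

If the permissions or the owner are wrong, opening files may fail, which creates a security risk and causes avoidable trouble.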
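The recommendation above can be checked with a few lines of C. This is a minimal sketch assuming a POSIX system. The function name check_datadir and its return codes are hypothetical and are not part of PostgreSQL.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * check_datadir -- hypothetical helper, not PostgreSQL code.
 *
 * Returns 0 if path is a directory owned by the effective user with mode
 * exactly 0700, 1 if the owner differs, 2 if the mode differs, and -1 if
 * stat() fails or path is not a directory.
 */
int
check_datadir(const char *path)
{
	struct stat st;

	assert(path != NULL);

	if (stat(path, &st) != 0 || !S_ISDIR(st.st_mode))
		return -1;
	if (st.st_uid != geteuid())
		return 1;
	/* Compare only the permission bits (rwx for user, group, other). */
	if ((st.st_mode & 0777) != 0700)
		return 2;
	return 0;
}
```

A caller could run check_datadir on the data directory path before starting the cluster and refuse to continue on any non-zero result.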

Example

For example, PostgreSQL's fsync_fname_ext opens regular files in read/write mode by default. From the open(2) man page:

The argument flags must include one of the following access modes: O_RDONLY, O_WRONLY, or O_RDWR.

These request opening the file read-only, write-only, or read/write, respectively.

	/* Author's note: the file is opened with O_RDWR; if its permissions are wrong, the open can fail with a permission error. */
	/*  
	 * Some OSs require directories to be opened read-only whereas other  
	 * systems don't allow us to fsync files opened read-only; so we need both  
	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are  
	 * not writable by our userid, but we assume that's OK.  
	 */  
	flags = PG_BINARY;  
	if (!isdir)  
		flags |= O_RDWR;  
	else  
		flags |= O_RDONLY;  
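The effect of the access mode can be observed directly. The sketch below assumes a POSIX system. The helper try_open is hypothetical, not PostgreSQL code, and simply reports the errno that open() produces for a given mode.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * try_open -- hypothetical helper, not PostgreSQL code.
 *
 * Opens path with the given access mode and closes it again.
 * Returns 0 on success, or the errno value set by open() on failure.
 */
int
try_open(const char *path, int flags)
{
	int fd;

	assert(path != NULL);

	fd = open(path, flags);
	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

For a file with mode 0400 owned by the caller, try_open(path, O_RDONLY) returns 0, while try_open(path, O_RDWR) returns EACCES for an unprivileged user. A process running as root normally bypasses the permission check, so both calls succeed there.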

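Security risk

For example, when recovery finishes and the instance is activated, recovery.conf has to be renamed to recovery.done.

Relevant code

PostgreSQL's rename wrapper first opens the file in O_RDWR mode (via fsync_fname_ext) before renaming it.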

/*  
 * durable_rename -- rename(2) wrapper, issuing fsyncs required for durability  
 *  
 * This routine ensures that, after returning, the effect of renaming file  
 * persists in case of a crash. A crash while this routine is running will  
 * leave you with either the pre-existing or the moved file in place of the  
 * new file; no mixed state or truncated files are possible.  
 *  
 * It does so by using fsync on the old filename and the possibly existing  
 * target filename before the rename, and the target file and directory after.  
 *  
 * Note that rename() cannot be used across arbitrary directories, as they  
 * might not be on the same filesystem. Therefore this routine does not  
 * support renaming across directories.  
 *  
 * Log errors with the caller specified severity.  
 *  
 * Returns 0 if the operation succeeded, -1 otherwise. Note that errno is not  
 * valid upon return.  
 */  
int  
durable_rename(const char *oldfile, const char *newfile, int elevel)  
{  
        int                     fd;  
  
        /*  
         * First fsync the old and target path (if it exists), to ensure that they  
         * are properly persistent on disk. Syncing the target file is not  
         * strictly necessary, but it makes it easier to reason about crashes;  
         * because it's then guaranteed that either source or target file exists  
         * after a crash.  
         */  
        if (fsync_fname_ext(oldfile, false, false, elevel) != 0)  
                return -1;  
  
        fd = OpenTransientFile((char *) newfile, PG_BINARY | O_RDWR, 0);  
        if (fd < 0)  
        {  
                if (errno != ENOENT)  
                {  
                        ereport(elevel,  
                                        (errcode_for_file_access(),  
                                         errmsg("could not open file \"%s\": %m", newfile)));  
                        return -1;  
                }  
        }  
  
...  
  
        /* Time to do the real deal... */  
        if (rename(oldfile, newfile) < 0)  
        {  
                ereport(elevel,  
                                (errcode_for_file_access(),  
                                 errmsg("could not rename file \"%s\" to \"%s\": %m",  
                                                oldfile, newfile)));  
                return -1;  
        }  
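The sequence that durable_rename follows can be sketched with plain POSIX calls. This is a simplified illustration, not the PostgreSQL implementation: the names fsync_path and simple_durable_rename are hypothetical, error logging is reduced to return codes, and the fsync of an already existing target file is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open path with the given flags and fsync it; 0 on success, -1 on failure. */
static int
fsync_path(const char *path, int flags, int isdir)
{
	int fd = open(path, flags);
	int rc;

	if (fd < 0)
		return -1;
	rc = fsync(fd);
	/* Some systems cannot fsync directories; treat that as success. */
	if (rc != 0 && isdir && (errno == EBADF || errno == EINVAL))
		rc = 0;
	close(fd);
	return rc;
}

/*
 * simple_durable_rename -- hypothetical, simplified sketch.
 * Both names are expected to live in the same directory.
 */
int
simple_durable_rename(const char *oldfile, const char *newfile)
{
	char dirbuf[1024];

	assert(oldfile != NULL && newfile != NULL);
	if (strlen(newfile) >= sizeof(dirbuf))
		return -1;

	/* 1. Flush the source file before it receives its new name. */
	if (fsync_path(oldfile, O_RDWR, 0) != 0)
		return -1;

	/* 2. Replace the name atomically. */
	if (rename(oldfile, newfile) != 0)
		return -1;

	/* 3. Flush the file under its new name. */
	if (fsync_path(newfile, O_RDWR, 0) != 0)
		return -1;

	/* 4. Flush the directory entry; dirname() may modify its argument. */
	strcpy(dirbuf, newfile);
	return fsync_path(dirname(dirbuf), O_RDONLY, 1);
}
```

Step 1 is where the O_RDWR open happens, so a source file that the calling user cannot write makes the whole operation fail before rename() is ever reached.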
  

Renaming recovery.conf goes through this wrapper, durable_rename:

#define RECOVERY_COMMAND_FILE   "recovery.conf"  
#define RECOVERY_COMMAND_DONE   "recovery.done"  
  
  
        /*  
         * Rename the config file out of the way, so that we don't accidentally  
         * re-enter archive recovery mode in a subsequent crash.  
         */  
        unlink(RECOVERY_COMMAND_DONE);  
        durable_rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE, FATAL);  

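fsync_fname_ext follows the same pattern: a regular file is opened with O_RDWR. Note that the excerpt below has a three-argument signature, while durable_rename above calls it with four arguments, so the two excerpts appear to come from different source versions.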

/*  
 * fsync_fname_ext -- Try to fsync a file or directory  
 *  
 * Ignores errors trying to open unreadable files, or trying to fsync  
 * directories on systems where that isn't allowed/required, and logs other  
 * errors at a caller-specified level.  
 */  
static void  
fsync_fname_ext(const char *fname, bool isdir, int elevel)  
{  
	int			fd;  
	int			flags;  
	int			returncode;  
  
	/* Author's note: opening with O_RDWR here can fail when the file's permissions are insufficient. */
	/*  
	 * Some OSs require directories to be opened read-only whereas other  
	 * systems don't allow us to fsync files opened read-only; so we need both  
	 * cases here.  Using O_RDWR will cause us to fail to fsync files that are  
	 * not writable by our userid, but we assume that's OK.  
	 */  
	flags = PG_BINARY;  
	if (!isdir)  
		flags |= O_RDWR;  
	else  
		flags |= O_RDONLY;  
  
	/*  
	 * Open the file, silently ignoring errors about unreadable files (or  
	 * unsupported operations, e.g. opening a directory under Windows), and  
	 * logging others.  
	 */  
	fd = OpenTransientFile((char *) fname, flags, 0);    
	if (fd < 0)  
	{  
		if (errno == EACCES || (isdir && errno == EISDIR))  
			return;  
		ereport(elevel,  
				(errcode_for_file_access(),  
				 errmsg("could not open file \"%s\": %m", fname)));    /* open failures are reported here; this excerpt skips EACCES above */
		return;  
	}  
  
	returncode = pg_fsync(fd);  
  
	/*  
	 * Some OSes don't allow us to fsync directories at all, so we can ignore  
	 * those errors. Anything else needs to be logged.  
	 */  
	if (returncode != 0 && !(isdir && errno == EBADF))  
		ereport(elevel,  
				(errcode_for_file_access(),  
				 errmsg("could not fsync file \"%s\": %m", fname)));  
  
	(void) CloseTransientFile(fd);  
}  

OpenTransientFile is a wrapper around open; the flags passed when durable_rename calls it also include O_RDWR.

/*  
 * Like AllocateFile, but returns an unbuffered fd like open(2)  
 */  
int  
OpenTransientFile(FileName fileName, int fileFlags, int fileMode)  
{  
        int                     fd;  
  
        DO_DB(elog(LOG, "OpenTransientFile: Allocated %d (%s)",  
                           numAllocatedDescs, fileName));  
  
        /* Can we allocate another non-virtual FD? */  
        if (!reserveAllocatedDesc())  
                ereport(ERROR,  
                                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),  
                                 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",  
                                                maxAllocatedDescs, fileName)));  
  
        /* Close excess kernel FDs. */  
        ReleaseLruFiles();  
  
        fd = BasicOpenFile(fileName, fileFlags, fileMode);  
  
        if (fd >= 0)  
        {  
                AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];  
  
                desc->kind = AllocateDescRawFD;  
                desc->desc.fd = fd;  
                desc->create_subid = GetCurrentSubTransactionId();  
                numAllocatedDescs++;  
  
                return fd;  
        }  
  
        return -1;                                      /* failure */  
}  

BasicOpenFile is what OpenTransientFile calls underneath; it opens the file with open().

/*  
 * BasicOpenFile --- same as open(2) except can free other FDs if needed  
 *  
 * This is exported for use by places that really want a plain kernel FD,  
 * but need to be proof against running out of FDs.  Once an FD has been  
 * successfully returned, it is the caller's responsibility to ensure that  
 * it will not be leaked on ereport()!  Most users should *not* call this  
 * routine directly, but instead use the VFD abstraction level, which  
 * provides protection against descriptor leaks as well as management of  
 * files that need to be open for more than a short period of time.  
 *  
 * Ideally this should be the *only* direct call of open() in the backend.  
 * In practice, the postmaster calls open() directly, and there are some  
 * direct open() calls done early in backend startup.  Those are OK since  
 * this module wouldn't have any open files to close at that point anyway.  
 */  
int  
BasicOpenFile(FileName fileName, int fileFlags, int fileMode)  
{  
        int                     fd;  
  
tryAgain:  
        fd = open(fileName, fileFlags, fileMode);  
  
        if (fd >= 0)  
                return fd;                              /* success! */  
  
        if (errno == EMFILE || errno == ENFILE)  
        {  
                int                     save_errno = errno;  
  
                ereport(LOG,  
                                (errcode(ERRCODE_INSUFFICIENT_RESOURCES),  
                                 errmsg("out of file descriptors: %m; release and retry")));  
                errno = 0;  
                if (ReleaseLruFile())  
                        goto tryAgain;  
                errno = save_errno;  
        }  
  
        return -1;                                      /* failure */  
}  
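The retry logic in BasicOpenFile can be reduced to a small standalone pattern. The sketch below is hypothetical: open_with_retry and the release_fn callback are illustrative names, with the callback standing in for PostgreSQL's ReleaseLruFile.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback: frees one cached descriptor, returns 1 if it did. */
typedef int (*release_fn) (void);

/*
 * open_with_retry -- hypothetical sketch of the BasicOpenFile pattern.
 *
 * Behaves like open(2), but when the process or system is out of file
 * descriptors it asks release_one to free one and tries again.
 * Returns the descriptor, or -1 with errno set by the failed open().
 */
int
open_with_retry(const char *path, int flags, mode_t mode,
				release_fn release_one)
{
	int fd;

	assert(path != NULL);

	for (;;)
	{
		fd = open(path, flags, mode);
		if (fd >= 0)
			return fd;

		if (errno == EMFILE || errno == ENFILE)
		{
			int save_errno = errno;

			/* Retry only if the callback actually freed a descriptor. */
			if (release_one != NULL && release_one())
				continue;
			errno = save_errno;
		}
		return -1;
	}
}
```

Errors other than EMFILE and ENFILE, including EACCES from an O_RDWR open on a file the user cannot write, are passed straight back to the caller.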

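Is rename's design unreliable?

Renaming does not check the owner of the file being renamed. What matters is the directory: a user with write and search permission on the containing directory can rename the files in it, whoever owns them. The exception is a directory with the sticky bit set, such as /tmp, where only the file's owner, the directory's owner, or root may rename an entry.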

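man 1 rename (the warning below comes from the rename command's manual, not from the rename(2) system call page)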

The  renaming  has no safeguards.    
If the user has permission to rewrite file names, the command will perform the action without any questions.    
For example, the result can be quite drastic when the command is run as root in the /lib directory.    
Always make a backup before running the command, unless you truly know what you are doing.    

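Example: an ordinary user renaming a file created by the superuser. In /tmp, which normally has the sticky bit set, the rename is refused; in the user's own home directory it succeeds.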

[root@   ~]# cd /tmp  
[root@   tmp]# touch abc  
[root@   tmp]# chmod 600 abc  
[root@   tmp]# ll abc  
-rw------- 1 root root 0 Aug 29 23:33 abc  
[root@   tmp]# su - digoal  
Last login: Mon Aug 29 23:18:41 CST 2016 on pts/1  
[digoal@   ~]$ cd /tmp  
[digoal@   tmp]$ ll abc  
-rw------- 1 root root 0 Aug 29 23:33 abc  
[digoal@   tmp]$ mv abc d  
mv: cannot move ‘abc’ to ‘d’: Operation not permitted  
[digoal@   tmp]$ mv abc e  
mv: cannot move ‘abc’ to ‘e’: Operation not permitted  
[digoal@   tmp]$ mv abc a  
mv: cannot move ‘abc’ to ‘a’: Operation not permitted  
[digoal@   tmp]$ exit  
logout  
  
[root@   tmp]# cd /home/digoal  
[root@   digoal]# touch abc  
[root@   digoal]# chmod 600 abc  
[root@   digoal]# ll abc  
-rw------- 1 root root 0 Aug 29 23:33 abc  
[root@   digoal]# su - digoal  
Last login: Mon Aug 29 23:33:04 CST 2016 on pts/1  
[digoal@   ~]$ ll abc  
-rw------- 1 root root 0 Aug 29 23:33 abc  
[digoal@   ~]$ mv abc abcd  
[digoal@   ~]$ ll abcd  
-rw------- 1 root root 0 Aug 29 23:33 abcd  


digoal's index of PostgreSQL articles