Use PostgreSQL to collect and analyze operating system statistics

As the number of servers you manage keeps growing, which ones most need your attention?
Ranking the servers along each monitored dimension tells you which servers you need to watch, and which to watch first.
Monitoring tools such as Nagios and Cacti can certainly do this,
but not very flexibly, because servers differ in configuration and in the dimensions that matter. PostgreSQL's recursive queries and window functions are well suited to producing all kinds of statistics, and once the data is in the database, presenting it becomes easy.
Test environment for this article:
OS : RHEL5
DB : PostgreSQL 9.1.2
Suppose I have 1,000 servers running databases such as MongoDB, PostgreSQL, Oracle, and MySQL, and I want to centralize their SAR reports into a single PostgreSQL database.
Once centralized, the reports can be analyzed to find the servers that need attention.
I. First, what information do we collect?
Here is a simple example that collects the previous day's sar statistics, as follows:
1. sar -b
Reports the system's read and write I/O requests per second and related transfer rates. Details:
   
   
-b Report I/O and transfer rate statistics. The following values are displayed:
tps
Total number of transfers per second that were issued to physical devices. A transfer is an I/O
request to a physical device. Multiple logical requests can be combined into a single I/O request
to the device. A transfer is of indeterminate size.
rtps
Total number of read requests per second issued to physical devices.
wtps
Total number of write requests per second issued to physical devices.
bread/s
Total amount of data read from the devices in blocks per second. Blocks are equivalent to sec-
tors with 2.4 kernels and newer and therefore have a size of 512 bytes. With older kernels, a
block is of indeterminate size.
bwrtn/s
Total amount of data written to devices in blocks per second.
2. sar -B
Reports the number of pages the system writes to or reads from disk per second. Details:
   
   
-B Report paging statistics. The following values are displayed:
pgpgin/s
Total number of kilobytes the system paged in from disk per second. Note: With old kernels
(2.2.x) this value is a number of blocks per second (and not kilobytes).
pgpgout/s
Total number of kilobytes the system paged out to disk per second. Note: With old kernels
(2.2.x) this value is a number of blocks per second (and not kilobytes).
fault/s
Number of page faults (major + minor) made by the system per second (post 2.5 kernels only).
This is not a count of page faults that generate I/O, because some page faults can be resolved
without I/O.
majflt/s
Number of major faults the system has made per second, those which have required loading a memory
page from disk (post 2.5 kernels only).
3. sar -c
Reports the number of processes created per second. A large value may indicate that applications connect to the database with short-lived connections and issue requests frequently. Since PostgreSQL forks a new server process for each incoming client connection and that process then serves the client, such a workload makes the database server create and destroy processes constantly, which sar -c reflects. Another downside of short connections: when a sequence is accessed by many of these short-lived sessions, its configured sequence cache is wasted and large gaps appear in the sequence values (a short demonstration follows the excerpt below).
Details:
   
   
-c Report process creation activity.
proc/s
Total number of processes created per second.
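To illustrate the sequence point above, here is a minimal sketch; the sequence name seq_demo is hypothetical and not part of the setup in this article. Each session pre-allocates "cache" values, and whatever it does not use is discarded when the session exits.

create sequence seq_demo cache 100;
-- session 1 (a short-lived connection):
select nextval('seq_demo');   -- returns 1; this session reserves values 1..100
-- the session disconnects, so 2..100 are thrown away
-- session 2 (another short-lived connection):
select nextval('seq_demo');   -- returns 101, not 2: a gap of 99 values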
4. sar -q
Reports the system load. Details:
   
   
-q Report queue length and load averages. The following values are displayed:
runq-sz
Run queue length (number of processes waiting for run time).
plist-sz
Number of processes and threads in the process list.
ldavg-1
System load average for the last minute.
ldavg-5
System load average for the past 5 minutes.
ldavg-15
System load average for the past 15 minutes.
5. sar -r
Reports memory and swap usage. Details:
   
   
-r Report memory and swap space utilization statistics. The following values are displayed:
kbmemfree
Amount of free memory available in kilobytes.
kbmemused
Amount of used memory in kilobytes. This does not take into account memory used by the kernel
itself.
%memused
Percentage of used memory.
kbbuffers
Amount of memory used as buffers by the kernel in kilobytes.
kbcached
Amount of memory used to cache data by the kernel in kilobytes.
kbswpfree
Amount of free swap space in kilobytes.
kbswpused
Amount of used swap space in kilobytes.
%swpused
Percentage of used swap space.
kbswpcad
Amount of cached swap memory in kilobytes. This is memory that once was swapped out, is swapped
back in but still also is in the swap area (if memory is needed it doesn't need to be swapped out
again because it is already in the swap area. This saves I/O).
6. sar -R
Reports the memory freed per second, the memory newly used as buffers, and the memory newly used as cache. Details:
   
   
-R Report memory statistics. The following values are displayed:
frmpg/s
Number of memory pages freed by the system per second. A negative value represents a number of
pages allocated by the system. Note that a page has a size of 4 kB or 8 kB according to the
machine architecture.
bufpg/s
Number of additional memory pages used as buffers by the system per second. A negative value
means fewer pages used as buffers by the system.
campg/s
Number of additional memory pages cached by the system per second. A negative value means fewer
pages in the cache.
7. sar -u
Reports how CPU time is split across user, nice, system, iowait, steal, and idle. Details:
   
   
-u Report CPU utilization. The following values are displayed:
%user
Percentage of CPU utilization that occurred while executing at the user level (application).
%nice
Percentage of CPU utilization that occurred while executing at the user level with nice priority.
%system
Percentage of CPU utilization that occurred while executing at the system level (kernel).
%iowait
Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk
I/O request.
%steal
Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hyper-
visor was servicing another virtual processor.
%idle
Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk
I/O request.
Note: On SMP machines a processor that does not have any activity at all (0.00 for every field) is a
disabled (offline) processor.
8. sar -v
Reports on inode, file, and other kernel tables. Details:
   
   
-v Report status of inode, file and other kernel tables. The following values are displayed:
dentunusd
Number of unused cache entries in the directory cache.
file-sz
Number of used file handles.
inode-sz
Number of used inode handlers.
super-sz
Number of super block handlers allocated by the kernel.
%super-sz
Percentage of allocated super block handlers with regard to the maximum number of super block
handlers that Linux can allocate.
dquot-sz
Number of allocated disk quota entries.
%dquot-sz
Percentage of allocated disk quota entries with regard to the maximum number of cached disk quota
entries that can be allocated.
rtsig-sz
Number of queued RT signals.
%rtsig-sz
Percentage of queued RT signals with regard to the maximum number of RT signals that can be
queued.
9. sar -w
Reports the number of context switches per second. Details:
   
   
-w Report system switching activity.
cswch/s
Total number of context switches per second.
10. sar -W
Reports the number of swap pages swapped in or out per second. Details:
   
   
-W Report swapping statistics. The following values are displayed:
pswpin/s
Total number of swap pages the system brought in per second.
pswpout/s
Total number of swap pages the system brought out per second.

II. Database table design
I use a sar user, a sar database, and a tbs_sar tablespace here. First, initialize the database:
   
   
create role sar nosuperuser nocreatedb nocreaterole noinherit login encrypted password 'DIGOAL';
create tablespace tbs_sar owner digoal location '/home/sar/tbs_sar';
create database sar with owner digoal template template0 encoding 'UTF8' tablespace tbs_sar;
grant all on database sar to sar;
grant all on tablespace tbs_sar to sar;
\c sar sar
create schema sar authorization sar;
Create the sequence, functions, and tables:

create sequence seq_server_id start with 1 increment by 1;

Table holding server information. This example only illustrates the approach, so the table is kept simple; in practice you can add fields such as data center (IDC), maintainer, project name, and so on.
   
   
create table server(
id int primary key,
ip inet not null unique,
info text);
Function that returns the server ID for an IP address, allocating a new one if none exists yet:
   
   
create or replace function get_server_id (i_ip inet) returns int as $BODY$
declare
v_id int;
begin
select id into v_id from server where ip=i_ip;
if not found then
insert into server(id, ip) values(nextval('seq_server_id'::regclass), i_ip);
select id into v_id from server where ip=i_ip;
end if;
return v_id;
exception
when others then
return -1;
end
$BODY$ language plpgsql;
Function that returns the IP address for a server ID:
   
   
create or replace function get_ip (i_id int) returns inet as $BODY$
declare
v_ip inet;
begin
select ip into v_ip from server where id=i_id;
return v_ip;
exception
when others then
return '0.0.0.0/0'::inet;
end
$BODY$ language plpgsql;
Function that returns the server info for a server ID:
   
   
create or replace function get_info (i_id int) returns text as $BODY$
declare
v_info text;
begin
select info into v_info from server where id=i_id;
return v_info;
exception
when others then
return 'no info';
end
$BODY$ language plpgsql;
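A quick sanity check of the three helper functions; the IP address, description, and the assumption that the first allocated id is 1 are all hypothetical:

-- hypothetical example: register a server and read its attributes back
select get_server_id('192.168.1.10'::inet);   -- allocates and returns a new id on the first call
update server set info = 'crm mysql master' where ip = '192.168.1.10'::inet;  -- hypothetical description
select get_ip(1), get_info(1);                -- reverse lookups, assuming the allocated id was 1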
Function that reports which servers have no SAR data collected for yesterday:
    
    
create or replace function get_server_nodata_yesterday() returns setof text as $BODY$
declare
v_result text;
begin
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_context where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_context: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_context where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_cpu where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_cpu: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_cpu where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_inode where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_inode: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_inode where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_io where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_io: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_io where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_load where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_load: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_load where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_mem where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_mem: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_mem where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_mem_swap where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_mem_swap: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_mem_swap where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_page where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_page: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_page where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_proc where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_proc: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_proc where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
perform 1 from (select s1.* from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_swap where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null) t;
if found then
return next 'sar_swap: ';
return query select s1.ip||','||s1.info from server s1 left outer join
(select * from (select server_id,row_number() over (partition by server_id order by s_date desc) from sar_swap where s_date=current_date-1) t1
where row_number=1) t2 on (s1.id=t2.server_id) where t2.server_id is null;
end if;
return;
end
$BODY$ language plpgsql;
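Once the sar_* tables below are created and the collectors have run for at least a day, the function can be called directly; for each table that is missing yesterday's data it returns a header line followed by ip,info rows:

-- list servers whose sar logs were not collected yesterday
select * from get_server_nodata_yesterday();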
Tables that store the sar data:
   
   
create table sar_io
(server_id int not null,
s_date date not null,
s_time time not null,
tps numeric,
rtps numeric,
wtps numeric,
bread_p_s numeric,
bwrtn_p_s numeric,
unique(server_id,s_date,s_time));
create table sar_page
(server_id int not null,
s_date date not null,
s_time time not null,
pgpgin_p_s numeric,
pgpgout_p_s numeric,
fault_p_s numeric,
majflt_p_s numeric,
unique(server_id,s_date,s_time));
create table sar_proc
(server_id int not null,
s_date date not null,
s_time time not null,
proc_p_s numeric,
unique(server_id,s_date,s_time));
create table sar_load
(server_id int not null,
s_date date not null,
s_time time not null,
runq_sz numeric,
plist_sz numeric,
ldavg_1 numeric,
ldavg_5 numeric,
ldavg_15 numeric,
unique(server_id,s_date,s_time));
create table sar_mem_swap
(server_id int not null,
s_date date not null,
s_time time not null,
kbmemfree numeric,
kbmemused numeric,
percnt_memused numeric,
kbbuffers numeric,
kbcached numeric,
kbswpfree numeric,
kbswpused numeric,
percnt_swpused numeric,
kbswpcad numeric,
unique(server_id,s_date,s_time));
create table sar_mem
(server_id int not null,
s_date date not null,
s_time time not null,
frmpg_p_s numeric,
bufpg_p_s numeric,
campg_p_s numeric,
unique(server_id,s_date,s_time));
create table sar_cpu
(server_id int not null,
s_date date not null,
s_time time not null,
percnt_user numeric,
percnt_nice numeric,
percnt_system numeric,
percnt_iowait numeric,
percnt_steal numeric,
percnt_idle numeric,
unique(server_id,s_date,s_time));
create table sar_inode
(server_id int not null,
s_date date not null,
s_time time not null,
dentunusd numeric,
file_sz numeric,
inode_sz numeric,
super_sz numeric,
percnt_super_sz numeric,
dquot_sz numeric,
percnt_dquot_sz numeric,
rtsig_sz numeric,
percnt_rtsig_sz numeric,
unique(server_id,s_date,s_time));
create table sar_context
(server_id int not null,
s_date date not null,
s_time time not null,
cswch_p_s numeric,
unique(server_id,s_date,s_time));
create table sar_swap
(server_id int not null,
s_date date not null,
s_time time not null,
pswpin_p_s numeric,
pswpout_p_s numeric,
unique(server_id,s_date,s_time));
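One optional addition, not part of the original schema: the reporting queries in section IV filter on s_date alone, while the unique constraints index (server_id, s_date, s_time). Once the tables hold months of data, a plain index on s_date may help, for example:

-- optional, not in the original design: speed up date-only filters in the reports
create index idx_sar_io_s_date on sar_io (s_date);
create index idx_sar_load_s_date on sar_load (s_date);
create index idx_sar_cpu_s_date on sar_cpu (s_date);
-- ...and likewise for the remaining sar_* tables whose reports you use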

III. On every operating system whose sar reports are to be collected, set up the following program:
Collection uses PostgreSQL's psql program, so the PostgreSQL client must be installed on the system; installation is omitted here.
Assume the PostgreSQL database connection information is: IP 10.10.10.1, port 1931, dbname sar, user sar, password DIGOAL.
Configure the ~/.pgpass file:

10.10.10.1:1931:sar:sar:DIGOAL

chmod 400 ~/.pgpass


Write the sar_collect.sh script, which collects yesterday's SAR report:

vi /home/postgres/sar_collect.sh

   
   
#!/bin/bash
# Environment variables and database connection; sleep a random 0-59 seconds to avoid a collection storm
. /home/postgres/.bash_profile
DB_URL="-h 10.10.10.1 -p 1931 -U sar -d sar"
sleep $(($RANDOM%60))
NET_DEV="`/sbin/route -n|grep UG|awk '{print $8}'|head -n 1`"
IP_ADDR="'`/sbin/ip addr show $NET_DEV|grep inet|grep "global $NET_DEV$"|awk '{print $2}'`'"
SAR_FILE="/var/log/sa/sa`date -d -1day +%d`"
SAR_DATE="'`date -d -1day +%Y-%m-%d`'"
SERVER_ID="`psql -A -t $DB_URL -c "select * from get_server_id($IP_ADDR)"`"
# sar -b, sar_io : tps rtps wtps bread/s bwrtn/s
SQL=`sar -b -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_io(server_id,s_date,s_time,tps,rtps,wtps,bread_p_s,bwrtn_p_s) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3","$4","$5","$6","$7");"}'`
psql $DB_URL -c "$SQL"
# sar -B, sar_page : pgpgin/s pgpgout/s fault/s majflt/s
SQL=`sar -B -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_page(server_id,s_date,s_time,pgpgin_p_s,pgpgout_p_s,fault_p_s,majflt_p_s) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3","$4","$5","$6");"}'`
psql $DB_URL -c "$SQL"
# sar -c, sar_proc : proc/s
SQL=`sar -c -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_proc(server_id,s_date,s_time,proc_p_s) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3");"}'`
psql $DB_URL -c "$SQL"
# sar -q, sar_load : runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
SQL=`sar -q -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_load(server_id,s_date,s_time,runq_sz,plist_sz,ldavg_1,ldavg_5,ldavg_15) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3","$4","$5","$6","$7");"}'`
psql $DB_URL -c "$SQL"
# sar -r, sar_mem_swap : kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
SQL=`sar -r -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_mem_swap(server_id,s_date,s_time,kbmemfree,kbmemused,percnt_memused,kbbuffers,kbcached,kbswpfree,kbswpused,percnt_swpused,kbswpcad) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3","$4","$5","$6","$7","$8","$9","$10","$11");"}'`
psql $DB_URL -c "$SQL"
# sar -R, sar_mem : frmpg/s bufpg/s campg/s
SQL=`sar -R -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_mem(server_id,s_date,s_time,frmpg_p_s,bufpg_p_s,campg_p_s) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3","$4","$5");"}'`
psql $DB_URL -c "$SQL"
# sar -u, sar_cpu : %user %nice %system %iowait %steal %idle
SQL=`sar -u -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_cpu(server_id,s_date,s_time,percnt_user,percnt_nice,percnt_system,percnt_iowait,percnt_steal,percnt_idle) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$4","$5","$6","$7","$8","$9");"}'`
psql $DB_URL -c "$SQL"
# sar -v, sar_inode : dentunusd file-sz inode-sz super-sz %super-sz dquot-sz %dquot-sz rtsig-sz %rtsig-sz
SQL=`sar -v -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_inode(server_id,s_date,s_time,dentunusd,file_sz,inode_sz,super_sz,percnt_super_sz,dquot_sz,percnt_dquot_sz,rtsig_sz,percnt_rtsig_sz) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3","$4","$5","$6","$7","$8","$9","$10","$11");"}'`
psql $DB_URL -c "$SQL"
# sar -w, sar_context : cswch/s
SQL=`sar -w -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_context(server_id,s_date,s_time,cswch_p_s) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3");"}'`
psql $DB_URL -c "$SQL"
# sar -W, sar_swap : pswpin/s pswpout/s
SQL=`sar -W -f $SAR_FILE|grep -E 'AM[ ]+([0-9]+|\.+|all|-)|PM[ ]+([0-9]+|\.+|all|-)'|awk '{print "insert into sar_swap(server_id,s_date,s_time,pswpin_p_s,pswpout_p_s) values('$SERVER_ID','$SAR_DATE',","\47"$1$2"\47,"$3","$4");"}'`
psql $DB_URL -c "$SQL"
# Author : Digoal.Zhou
# THE END

Set permissions:

chmod 500 sar_collect.sh

Create the cron schedule:

crontab -e

1 2 * * * /home/postgres/sar_collect.sh
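After the first scheduled run, a quick sanity check on the central database can confirm that yesterday's samples arrived; this is only a sketch using the sar_io table, and get_server_nodata_yesterday() gives the same information across all tables:

-- how many sar_io samples each registered server delivered yesterday (0 = did not report)
select s.ip, s.info, count(i.server_id) as io_samples
from server s
left join sar_io i on i.server_id = s.id and i.s_date = current_date - 1
group by s.ip, s.info
order by io_samples;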


IV. Next, a few simple reporting SQL statements using the row_number window function:
   
   
# Top 10 by yesterday's peak 15-minute load average; the second query ranks by the daily average.
# High load: check whether the databases and applications on these servers are healthy, e.g. whether indexes are missing or bind variables should be used.
select get_ip(server_id),* from (select *,row_number() over (partition by server_id order by ldavg_15 desc) from sar_load where s_date=current_date-1) t where row_number=1 order by ldavg_15 desc limit 10;
select get_info(server_id),get_ip(server_id),round(avg(ldavg_15),2) ldavg_15 from sar_load where s_date=current_date-1 group by server_id,s_date order by ldavg_15 desc limit 10;
# Top 10 by yesterday's peak read requests per second; the second query ranks by the daily average.
# High read request rates: check whether the databases and applications are healthy, whether more memory is needed, or whether the storage needs to be scaled.
select get_ip(server_id),* from (select *,row_number() over (partition by server_id order by rtps desc) from sar_io where s_date=current_date-1) t where row_number=1 order by rtps desc limit 10;
select get_info(server_id),round(avg(rtps),2) rtps from sar_io where s_date=current_date-1 group by server_id,s_date order by rtps desc limit 10;
# Top 10 by yesterday's peak write requests per second; the second query ranks by the daily average.
# High write request rates: check whether the databases and applications are healthy, e.g. whether some indexes can be dropped, whether asynchronous I/O should be used, or whether the storage needs to be scaled.
select get_ip(server_id),* from (select *,row_number() over (partition by server_id order by wtps desc) from sar_io where s_date=current_date-1) t where row_number=1 order by wtps desc limit 10;
select get_info(server_id),round(avg(wtps),2) wtps from sar_io where s_date=current_date-1 group by server_id,s_date order by wtps desc limit 10;
# Top 10 by yesterday's peak iowait; the second query ranks by the daily average.
# High iowait: check whether the databases and applications are healthy, e.g. whether to add memory, keep hot data in memory, or scale the storage.
select get_ip(server_id),* from (select *,row_number() over (partition by server_id order by percnt_iowait desc) from sar_cpu where s_date=current_date-1) t where row_number=1 order by percnt_iowait desc limit 10;
select get_info(server_id),round(avg(percnt_iowait),2) percnt_iowait from sar_cpu where s_date=current_date-1 group by server_id,s_date order by percnt_iowait desc limit 10;
# Top 10 by yesterday's peak swap pages in/out; the second query ranks by the daily average.
# Heavy swap paging: check whether the databases and applications are healthy and whether hot data should fit in memory.
select get_ip(server_id),* from (select *,row_number() over (partition by server_id order by pswpin_p_s+pswpout_p_s desc) from sar_swap where s_date=current_date-1) t where row_number=1 order by pswpin_p_s+pswpout_p_s desc limit 10;
select get_info(server_id),round(avg(pswpin_p_s+pswpout_p_s),2) pswpin_out_p_s from sar_swap where s_date=current_date-1 group by server_id,s_date order by pswpin_out_p_s desc limit 10;
# Top 10 by yesterday's peak swap usage percentage; the second query ranks by the daily average.
# High swap usage: check whether the databases and applications are healthy, whether database parameters need tuning, or whether huge pages should be used.
select get_ip(server_id),* from (select *,row_number() over (partition by server_id order by percnt_swpused desc) from sar_mem_swap where s_date=current_date-1) t where row_number=1 order by percnt_swpused desc limit 10;
select get_info(server_id),round(avg(percnt_swpused),2) percnt_swpused from sar_mem_swap where s_date=current_date-1 group by server_id,s_date order by percnt_swpused desc limit 10;
# Top 10 by yesterday's peak processes created per second; the second query ranks by the daily average.
# High process creation rates: check whether the databases and applications are healthy, e.g. whether a connection pool with long-lived connections is needed, whether Oracle should use shared server connections, or whether the application can switch from short to long connections.
select get_ip(server_id),* from (select *,row_number() over (partition by server_id order by proc_p_s desc) from sar_proc where s_date=current_date-1) t where row_number=1 order by proc_p_s desc limit 10;
select get_info(server_id),round(avg(proc_p_s),2) proc_p_s from sar_proc where s_date=current_date-1 group by server_id,s_date order by proc_p_s desc limit 10;
Sample report output (figure omitted).


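The maximum and average rankings above can also be produced in a single pass by combining two window functions; here is a sketch for the 15-minute load average, using the same tables but not taken from the original queries:

select get_info(server_id), get_ip(server_id), ldavg_15 as max_ldavg_15, avg_ldavg_15
from (select *,
             row_number() over (partition by server_id order by ldavg_15 desc) as rn,
             round(avg(ldavg_15) over (partition by server_id), 2) as avg_ldavg_15
      from sar_load
      where s_date = current_date - 1) t
where rn = 1
order by max_ldavg_15 desc
limit 10;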
V. Finally, email the report to yourself.
The mail script:
  
  
#!/bin/bash
. /home/postgres/.bash_profile
EMAIL="digoal@126.com"
echo -e `date +%F\ %T` >/tmp/sar_report.log
echo -e "\n---- WeeklyAvgValue TOP10: ----\n" >>/tmp/sar_report.log
echo -e "\n1. ldavg_15 TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),get_ip(server_id),round(avg(ldavg_15),2) ldavg_15 from sar_load where s_date<=current_date-1 and s_date>=current_date-7 group by server_id order by ldavg_15 desc limit 10;" >>/tmp/sar_report.log
echo -e "\n2. rtps TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(rtps),2) rtps from sar_io where s_date<=current_date-1 and s_date>=current_date-7 group by server_id order by rtps desc limit 10;" >>/tmp/sar_report.log
echo -e "\n3. wtps TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(wtps),2) wtps from sar_io where s_date<=current_date-1 and s_date>=current_date-7 group by server_id order by wtps desc limit 10;" >>/tmp/sar_report.log
echo -e "\n4. iowait TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(percnt_iowait),2) percnt_iowait from sar_cpu where s_date<=current_date-1 and s_date>=current_date-7 group by server_id order by percnt_iowait desc limit 10;" >>/tmp/sar_report.log
echo -e "\n5. swap_page_in_out TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(pswpin_p_s+pswpout_p_s),2) pswpin_out_p_s from sar_swap where s_date<=current_date-1 and s_date>=current_date-7 group by server_id order by pswpin_out_p_s desc limit 10;" >>/tmp/sar_report.log
echo -e "\n6. swap_usage TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(percnt_swpused),2) percnt_swpused from sar_mem_swap where s_date<=current_date-1 and s_date>=current_date-7 group by server_id order by percnt_swpused desc limit 10;" >>/tmp/sar_report.log
echo -e "\n7. newproc_p_s TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(proc_p_s),2) proc_p_s from sar_proc where s_date<=current_date-1 and s_date>=current_date-7 group by server_id order by proc_p_s desc limit 10;" >>/tmp/sar_report.log
echo -e "\n---- DailyAvgValue TOP10: ----\n" >>/tmp/sar_report.log
echo -e "\n1. ldavg_15 TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(ldavg_15),2) ldavg_15 from sar_load where s_date=current_date-1 group by server_id order by ldavg_15 desc limit 10;" >>/tmp/sar_report.log
echo -e "\n2. rtps TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(rtps),2) rtps from sar_io where s_date=current_date-1 group by server_id order by rtps desc limit 10;" >>/tmp/sar_report.log
echo -e "\n3. wtps TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(wtps),2) wtps from sar_io where s_date=current_date-1 group by server_id order by wtps desc limit 10;" >>/tmp/sar_report.log
echo -e "\n4. iowait TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(percnt_iowait),2) percnt_iowait from sar_cpu where s_date=current_date-1 group by server_id order by percnt_iowait desc limit 10;" >>/tmp/sar_report.log
echo -e "\n5. swap_page_in_out TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(pswpin_p_s+pswpout_p_s),2) pswpin_out_p_s from sar_swap where s_date=current_date-1 group by server_id order by pswpin_out_p_s desc limit 10;" >>/tmp/sar_report.log
echo -e "\n6. swap_usage TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(percnt_swpused),2) percnt_swpused from sar_mem_swap where s_date=current_date-1 group by server_id order by percnt_swpused desc limit 10;" >>/tmp/sar_report.log
echo -e "\n7. newproc_p_s TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),round(avg(proc_p_s),2) proc_p_s from sar_proc where s_date=current_date-1 group by server_id order by proc_p_s desc limit 10;" >>/tmp/sar_report.log
echo -e "\n---- DailyMaxValue TOP10: ----\n" >>/tmp/sar_report.log
echo -e "\n1. ldavg_15 TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),ldavg_15 from (select *,row_number() over (partition by server_id order by ldavg_15 desc) from sar_load where s_date=current_date-1) t where row_number=1 order by ldavg_15 desc limit 10;" >>/tmp/sar_report.log
echo -e "\n2. rtps TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),tps,rtps,wtps,bread_p_s,bwrtn_p_s from (select *,row_number() over (partition by server_id order by rtps desc) from sar_io where s_date=current_date-1) t where row_number=1 order by rtps desc limit 10;" >>/tmp/sar_report.log
echo -e "\n3. wtps TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),tps,rtps,wtps,bread_p_s,bwrtn_p_s from (select *,row_number() over (partition by server_id order by wtps desc) from sar_io where s_date=current_date-1) t where row_number=1 order by wtps desc limit 10;" >>/tmp/sar_report.log
echo -e "\n4. iowait TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),percnt_user,percnt_nice,percnt_system,percnt_iowait,percnt_steal,percnt_idle from (select *,row_number() over (partition by server_id order by percnt_iowait desc) from sar_cpu where s_date=current_date-1) t where row_number=1 order by percnt_iowait desc limit 10;" >>/tmp/sar_report.log
echo -e "\n5. swap_page_in_out TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),pswpin_p_s,pswpout_p_s from (select *,row_number() over (partition by server_id order by pswpin_p_s+pswpout_p_s desc) from sar_swap where s_date=current_date-1) t where row_number=1 order by pswpin_p_s+pswpout_p_s desc limit 10;" >>/tmp/sar_report.log
echo -e "\n6. swap_usage TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),kbswpfree,kbswpused,percnt_swpused,kbswpcad from (select *,row_number() over (partition by server_id order by percnt_swpused desc) from sar_mem_swap where s_date=current_date-1) t where row_number=1 order by percnt_swpused desc limit 10;" >>/tmp/sar_report.log
echo -e "\n7. newproc_p_s TOP10 :\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select get_info(server_id),proc_p_s from (select *,row_number() over (partition by server_id order by proc_p_s desc) from sar_proc where s_date=current_date-1) t where row_number=1 order by proc_p_s desc limit 10;" >>/tmp/sar_report.log
echo -e "\n---- get_server_nodata_yesterday: ----\n" >>/tmp/sar_report.log
psql -h 127.0.0.1 sar sar -c "select * from get_server_nodata_yesterday();" >>/tmp/sar_report.log
cat /tmp/sar_report.log|mutt -s "`date +%F` DB Servers RS Consume Top10" $EMAIL
# Author : Digoal.Zhou
# THE END
Configure the mutt environment. This assumes the database encoding is UTF-8; otherwise Chinese text may be garbled.
  
  
vi ~/.muttrc
set envelope_from=yes
set from=digoal@126.com
set realname="德哥"
set use_from=yes
set charset="UTF-8"
VI. Miscellaneous
1. Presentation could be done through a web interface; this article only gives a simple collection and statistics example and does not go into web development.
2. One day of these sar logs from 1,000 servers amounts to roughly 200 MB. With current disk capacities, keeping several years of data is not a problem.
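To check the actual footprint on your own installation, the standard size functions can be used; this is only a sketch, with the database and tablespace names matching the setup above:

select pg_size_pretty(pg_database_size('sar')) as sar_db_size,
       pg_size_pretty(pg_tablespace_size('tbs_sar')) as tbs_sar_size;
-- per-table breakdown of the sar_* tables
select relname, pg_size_pretty(pg_total_relation_size(oid)) as total_size
from pg_class
where relkind = 'r' and relname like 'sar\_%'
order by pg_total_relation_size(oid) desc;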
Reference
man sar

