
CPU Capacity planning from OEM metrics

The CPU used by your Oracle Database is expensive, because it is the metric used for licensing. The more control you have, and the better you know what you actually need, the more freedom you have to optimize costs. With instance caging, available in all editions, you can set a soft limit (see the sketch after the list below). This means that:

  • you run on a limited number of threads, and after a while this gives a good idea of what you really need, so you can forecast the capacity for a future consolidation.
  • you monitor ‘resmgr: cpu quantum’ waits, and if this throttling is high you can decide to scale up immediately, throttle some services, or do some query or design tuning.
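As a minimal sketch of the mechanics (DEFAULT_PLAN and the CPU_COUNT value of 8 are placeholders to adapt to your environment), instance caging is enabled with two dynamic parameters, and the throttling can be checked from the session wait events:

-- instance caging = a resource manager plan + a CPU_COUNT soft limit
alter system set resource_manager_plan='DEFAULT_PLAN' scope=both;
alter system set cpu_count=8 scope=both;
-- sessions currently throttled by the cage:
select count(*) from v$session where event like 'resmgr%cpu quantum';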

In order to set instance caging, you need to define a value for CPU_COUNT according to past activity. This post is the detail behind the following tweet:

Oracle Enterprise Manager Cloud Control collects many metrics from the target databases and maintains a history of them, with daily minimum, maximum, and average. The Oracle documentation mentions the view MGMT$METRIC_DAILY, but I prefer MGMT_METRICS_1DAY because the ‘underscore’ views are not mentioned in the Licensing Information documentation, whereas the ‘dollar’ metric views require the Diagnostic Pack. In any case, these scripts are meant to be used on databases where the Diagnostic Pack is enabled.
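If you want to see which metrics are available in your repository before running the main query, you can browse SYSMAN.MGMT_METRICS (the same table joined in the query below). This is just an exploratory sketch, and the LIKE filter is an arbitrary example:

select distinct metric_name,metric_label,column_label
from sysman.mgmt_metrics
where column_label like '%CPU%'
order by metric_name,column_label;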

Average, Maximum or Percentile?

The goal is to define a CPU_COUNT that accepts the peak activity, so I’ll start from the maximum value of the daily metric: the maximum number of sessions on CPU observed during each day. However, we may have experienced an issue where some queries take a lot of resources. If that was caused by an application bug or a bad execution plan, we need to fix it: we don’t want to pay for CPU to cope with those issues. So, rather than taking the maximum, I’ll look at a percentile. Here is a graph from 4 instances over 2 years:

Do you want to pay for 120 threads just because of one peak? Probably not. Taking the 99th percentile gives a value that covers 99% of the days.

My query for these instances returns the following: 125 threads is the maximum, but 74 threads are sufficient for 99% of the days observed, and 50 threads are enough for 90% of the days:

You see the same numbers for multiple instances because they are nodes of the same database.
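As a quick sanity check on a single instance, the same idea can be expressed without all the CTEs. Here is a sketch, where 'MYDB_1' is a hypothetical instance name to replace with yours:

select ceil(max(d.value_maximum/100)) "From Max"
,ceil(percentile_cont(0.99) within group(order by d.value_maximum/100)) "From percentile 99%"
from sysman.mgmt_metrics_1day d
join sysman.mgmt_metrics m on m.metric_guid=d.metric_guid
join sysman.mgmt_targets t on t.target_guid=d.target_guid
where m.column_label='CPU Usage (per second)'
and t.target_name='MYDB_1';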

RAC instances

Here is another example, with two nodes from the same database. In addition to the peaks, where we can accept being throttled for a short time, there is a service that has been relocated several times:

So which CPU_COUNT do you want to set on each instance? The safest would be to accept, on each node, the possibility of running the load observed on the whole cluster. This can happen in the exceptional case where only one node remains, but it costs a lot. Rather than setting the sum of the cluster load, we can set the maximum observed on any node, so that each node can accept a peak observed on one of the nodes. Or, if we don’t expect service relocation, or if we accept some response-time degradation in those cases, we can just set what has been observed on the node itself.
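For example, with hypothetical numbers: if node 1 peaks at 40 sessions on CPU and node 2 at 30 on the same day, the sum is 70 (each node sized to absorb the whole cluster), the maximum is 40 (each node sized to absorb the largest single-node peak), and the per-node values are 40 and 30 (no headroom for relocation).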

The following query joins the daily metrics with some target information: “Line of Business” and “Department” to group the databases from a business point of view, and “LifeCycle Status” to differentiate production, test, and development.

The metric selected is the per-instance CPU seconds per second, which is the average number of sessions on CPU.

“CPU load/instance/day” does the join and calculates the value, the max, and the sum across the cluster. With a single instance, you don’t have to worry about this. With RAC, you decide in “Target CPU load/instance/day” which one to use. Then “Proposed instance caging” calculates the percentiles. The main query, at the end, does a ‘group by rollup’ to add some aggregates per business and environment. Of course, you will customize it for your needs; my goal is to show the useful sources of information and how to mine them.

with 
TARGET_PROPERTIES as (
select target_guid,property_name,property_value
from sysman.MGMT_TARGET_PROPERTIES
where property_type='INSTANCE'
),
TARGET_PROP_DEFS as (
select property_name,property_display_name
from sysman.MGMT$ALL_TARGET_PROP_DEFS
where defined_by='SYSTEM'
),
TARGET_LINE_OF_BUSINESS as (
select target_guid,property_value "Line of Business"
from TARGET_PROP_DEFS
natural left outer join
TARGET_PROPERTIES
where property_display_name='Line of Business'
),
TARGET_DEPARTMENT as (
select target_guid,property_value "Department"
from TARGET_PROP_DEFS
natural left outer join
TARGET_PROPERTIES
where property_display_name='Department'
),
TARGET_LIFECYCLE_STATUS as (
select target_guid,property_value "LifeCycle Status"
from TARGET_PROP_DEFS
natural left outer join
TARGET_PROPERTIES
where property_display_name='LifeCycle Status'
),
TARGET_RAC_DATABASES as (
select member_target_name target_name
,member_target_guid target_guid
,composite_target_name "RAC Database"
from sysman.MGMT_TARGET_MEMBERSHIPS
where composite_target_type='rac_database'
and member_target_type='oracle_database'
),
TARGETS_INSTANCES as (
select target_guid, target_type, type_meta_ver
,category_prop_1, target_name
from sysman.mgmt_targets
where target_type='oracle_database'
),
METRICS_CPU as (
select metric_guid
,target_type, type_meta_ver, category_prop_1, metric_name
,metric_label, key_column, num_keys, column_label
,description, short_name, source, eval_func
from sysman.mgmt_metrics
where column_label = 'CPU Usage (per second)'
),
METRICS_1DAY as (
select target_guid
,metric_guid,rollup_timestamp "Day"
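-- the metric is collected in centiseconds of CPU per second, so /100 gives the average number of sessions on CPU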
,value_maximum/100 "Max CPU load"
from sysman.mgmt_metrics_1day
),
"CPU load/instance/day" as (
select "Line of Business","Department","LifeCycle Status"
,"RAC Database",target_name "Instance","Day","Max CPU load"
-- sums over the RAC cluster, because we should be able to afford running all services on one instance
-- (but the sum of maxima can be large when a service has been relocated during the day)
,sum("Max CPU load") over (partition by "Line of Business","Department","LifeCycle Status","RAC Database","Day") "Sum instances CPU load"
-- or just ensure that each node can run the maximum observed on any single node
,max("Max CPU load") over (partition by "Line of Business","Department","LifeCycle Status","RAC Database","Day") "Max instances CPU load"
from
METRICS_1DAY
natural join
METRICS_CPU
natural join
TARGETS_INSTANCES
natural left outer join
TARGET_LINE_OF_BUSINESS
natural left outer join
TARGET_RAC_DATABASES
natural left outer join
TARGET_LIFECYCLE_STATUS
natural left outer join
TARGET_DEPARTMENT
),
"Target CPU load/instance/day" as (
select "Line of Business","Department","LifeCycle Status","RAC Database","Instance","Day"
-- choice of the metric used when in RAC:
-- - "Max CPU load" when not accounting for service relocation,
-- - "Max instances CPU load" when ensuring that each node can accept the maximum load observed on any single node,
-- - "Sum instances CPU load" when ensuring that each node can accept the load observed on the whole cluster
,"Max CPU load" "Max CPU"
from "CPU load/instance/day"
),
"Proposed instance caging" as (
select
"Line of Business","Department","LifeCycle Status","RAC Database","Instance"
,ceil(max("Max CPU")) "From Max"
-- here we add percentile calculations because we do not want to size for exceptional peaks
,ceil(percentile_cont(0.99) within group(order by "Max CPU")) "From percentile 99%"
,ceil(percentile_cont(0.90) within group(order by "Max CPU")) "From percentile 90%"
from "Target CPU load/instance/day"
group by "Line of Business","Department","LifeCycle Status","RAC Database","Instance"
)
select
"Line of Business","Department","LifeCycle Status"
,"RAC Database","Instance"
,sum("From Max")
,sum("From percentile 99%"),sum("From percentile 90%")
from "Proposed instance caging"
group by rollup (
"Line of Business","Department","LifeCycle Status","RAC Database","Instance"
)
order by
grouping("Line of Business") desc,grouping("Department") desc,grouping("LifeCycle Status") desc,grouping("RAC Database") desc,grouping("Instance") desc,
"Line of Business","Department","LifeCycle Status","RAC Database","Instance"
;

You may or may not like my way of writing SQL queries. I use Common Table Expressions to define each source of data, I name each column according to its role in the final result, and I use natural joins because the column names clearly define the join columns. I find this very easy to develop and test step by step.
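For example, when developing such a query, you can keep the whole WITH clause unchanged and just point the final select at an intermediate step. A sketch (FETCH FIRST requires 12c; use ROWNUM on older repositories):

-- replace the final select with:
select * from "CPU load/instance/day"
order by "Day"
fetch first 10 rows only;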

The “CPU load/instance/day” result can easily be exported to an Excel pivot chart to look at the whole picture, as above, before deciding which percentile and which cluster aggregation to use (a spool sketch follows this paragraph). You may even have to look at your logs to see whether the peaks are related to business activity (where you want to scale up) or to a problem (which you want to cage). And remember that the goal is to set a baseline for instance caging, which can easily be adapted later. Once you have run for a while with a controlled number of threads, you can consider consolidation and licensing optimization. And don’t forget to check whether some tuning can reduce the CPU_COUNT you need. Please don’t hesitate to comment, here or on Twitter, with remarks or improvements.
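On the export itself: if you run the query from SQL*Plus (12.2 or later, which supports SET MARKUP CSV) or SQLcl, a simple sketch to produce a CSV file for the pivot chart is:

set markup csv on
spool cpu_load_per_day.csv
-- run the query above, with the final select replaced by:
-- select * from "CPU load/instance/day";
spool off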

Franck Pachot (@FranckPachot) | Twitter