Configuring Hugepages For Oracle on Linux
NOTE: I have recently discovered that Oracle, hugepages, and NUMA are incompatible, at least on Linux. NUMA must be disabled to use hugepages with Oracle.
RAM is managed in 4k pages in 64-bit Linux. When memory sizes were limited, and systems with more than 16G RAM were rare, this was not as much of an issue. However, as systems get more memory, and the increasing demand on performance for memory increased and become less manageable. Hugepages can make managing the large amounts of memory available in modern servers much less CPU intensive. In particular, with the number of memory pages reduced by typically three orders of magnitude, the chance that a particular page pointer will be available in the processor cache goes up dramatically.
First some caveats on using hugepages: Hugepages are not swappable, thus Oracle SGA memory must either be all hugepages are no hugepages. If you allocate hugepages for Oracle, and don’t allocate enough for the entire SGA, Oracle will not use any hugepage memory. If there is not enough non-hugepage memory, your database will not start. Finally, enabling hugepages will require a server restart, so if you do not have the ability to restart your server, do not attempt to enable hugepages.
Oracle Metalink note 1134002.1 says explicitly that AMM (MEMORY_TARGET/MEMORY_MAX_TARGET) is incompatible with hugepages. However, I have found at least one blog that says that AMM is compatible with hugepages when using the USE_LARGE_PAGES parameter in 11g (where AMM is available). Until further confirmation is found, I do not recommend trying to combine hugepages with MEMORY_TARGET/MEMORY_MAX_TARGET.
There are both Oracle database settings and Linux OS settings that must be adjusted in order to enable hugepages. The Linux and oracle settings of concern are below:
Linux OS settings:
oracle soft memlock
oracle hard memlock
Oracle Database spfile/init.ora:
SGA_TARGET = Size of the SGA for use currently
SGA_MAX_SIZE = Size the SGA *could* be increased to without restarting the server
MEMORY_TARGET = These parameters should not be used with hugepages
MEMORY_MAX_TARGET = These parameters should not be used with hugepages
First, calculate the Linux OS settings. Kernel.shmmax should be set to the size of the largest SGA_TARGET on the server plus 1G, to account for other processes. For a single instance with 180G RAM, that would be 181G.
Kernel.shmall should be set to the sum of the SGA_TARGET values divided by the pagesize. Use ‘getconf pagesize’ command to get the page size. Units are bytes. The standard pagesize on Linux x86_64 is 4096, or 4k.
Oracle soft memlock and oracle hard memlock should be set to slightly less than the total memory on the server, I chose 230G. Units are kbytes, so the number is 230000000. This is the total amount of memory Oracle is allowed to lock.
Now for the hugepage setting itself: vm.nr_hugepages is the total number of hugepages to be allocated on the system. The number of hugepages required can be determined by finding the maximum amount of SGA memory expected to be used by the system (the SGA_MAX_SIZE value normally, or the sum of them on a server with multiple instances) and dividing it by the size of the hugepages, 2048k, or 2M on Linux. To account for Oracle process overhead, add five more hugepages . So, if we want to allow 180G of hugepages, we would use this equation: (180*1024*1024/2048)+5. This gives us 92165 hugepages for 180G. Note: I took a shortcut in this calculation, by using memory in MEG rather than the full page size. To calculate the number in the way I initial described, the equation would be: (180*1024*1024*1024)/(2048*1024).
In order to allow the Oracle database to use up to 180G for the SGA_TARGET/SGA_MAX_SIZE, below are the settings we would use for the OS:
/etc/security/limits.conf oracle soft memlock 230000000 oracle hard memlock 230000000 /etc/sysctl.conf vm.nr_hugepages = 92165 kernel.shmmax = 193273528320+1g = 194347270144 kernel.shmall = 47448064
In the Oracle database there is a new setting in 11gR2. This is USE_LARGE_PAGES, with possible values of ‘true’, ‘only’, and ‘false’. True is the default and current behavior, ‘False’ means never use hugepages, use only small pages. ‘Only’ forces the database to use hugepages. If insufficient pages are available the instance will not start. Regardless of this setting, it must use either all hugepages or all smallpages. According to some blogs, using this setting is what allows the MEMORY_MAX_TARGET and MEMORY_TARGET to be used with hugepages. As I noted above, I have not verified this with a Metalink note as yet.
Next, set SGA_TARGET and SGA_MAX_SIZE to the desired size. I generally recommend setting both to the same size. Oracle recommends explicitly setting the MEMORY_TARGET and MEMORY_MAX_TARGET to 0 when enabling hugepages. So these are the values in the spfile that we change:
In order to verify that hugepages are being used, run this command:
cat /proc/meminfo | grep Huge
It will show HugePages_Total, HugePages_Free, and HugePages_Rsvd. The HugePages_Rsvd value is the number of hugepages that are in use.
Note that this example uses Linux hugepage size of 2M (2048k).
On Itanium systems the hugepage size is 256M.
These instructions show you how to successfully implement huge pages in Linux. Note that everything would be the same for Oracle 10gR2, with the exception that the USE_LARGE_PAGES parameter is unavailable.
1.Reasons for Using Hugepages 1.Use hugepages if OLTP or ERP. Full stop.
2.Use hugepages if DW/BI with large numbers of dedicated connections or a large SGA. Full stop.
3.Use hugepages if you don’t like the amount of memory page tables are costing you (/proc/meminfo). Full stop.
2.SGA Memory Management Models 1.AMM does not support hugepages. Full stop.
2.ASMM supports hugepages.
3.Instance Type 1.ASM uses AMM by default. ASM instances do not need hugepages. Full stop.
2.All non-ASM instances should be considered candidate for hugepages. See 1.1->1.3 above.
4.Configuration 1.Limits (multiple layers) 1./etc/security/limits.conf establishes limits for hugepages for processes. Note, setting these values does not pre-allocate any resources.
2.Ulimit also establishes hugepages limits for processes.
5.Allocation 1./etc/sysctl.conf vm.nr_hugepages allocates memory to the hugepages pool.
6.Sizing 1.Read MOS 401749.1 for information on tools available to aid in the configuration of vm/nr_hugepages
To make the point of how urgently Oracle DBAs need to qualify their situation against list items 1.1 through 1.3 above, please consider the following quote from an internal email I received. The email is real and the screen output came from a real customer system. Yes, 120+ gigabytes of memory wasted in page tables. Fact is often stranger than fiction!
And here is an example of kernel pagetables usage, with a 24GB SGA, and 6000+ connections .. with no hugepages in use .
# grep PageT /proc/meminfo
PageTables: 123,731,372 kB
In part I of my recent blog series on Linux hugepages and modern Oracle releases I closed the post by saying that future installments would materialize if I found any pitfalls. I don’t like to blog about bugs, but in cases where there is little material on the matter provided elsewhere I think it adds value. First, however, I’d like to offer links to parts I and II in the series:
• Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – I.
• Configuring Linux Hugepages for Oracle Database Is Just Too Difficult! Isn’t It? Part – II.
The pitfall I’d like to bring to readers’ attention is a situation that can arise in the case where the Oracle Database 11g Release 2 188.8.131.52 parameter USE_LARGE_PAGES is set to “only” thus forcing the instance to either successfully allocate all shared memory from the hugepages pool or fail to boot. As I pointed out in parts I and II this is a great feature. However, after an instance is booted it stands to reason that other processes (e.g., Oracle instances) may in fact use hugepages thus drawing down the amount of free hugepages. In fact, it stands to reason that other uses of hugepages could totally deplete the hugepages pool.
So what happens to a running instance that successfully allocated its shared memory from the hugepages pool and hugepages are later externally drawn down? The answer is nothing. An instance can plod along just fine after instance startup even if hugepages continue to get drawn down to the point of total depletion. But is that the end of the story?
What Goes Up, Must (be able to) Come Down
OK, so for anyone that finds themselves in a situation where an instance is up and happy but HugePages_Free is zero the following is what to expect:
$ sqlplus ‘/ as sysdba’ SQL*Plus: Release 184.108.40.206.0 Production on Wed Sep 29 17:32:32 2010 Copyright (c) 1982, 2010, Oracle. All rights reserved. Connected to an idle instance. SQL>SQL> HOST grep -i huge /proc/meminfoHugePages_Total: 4663HugePages_Free: 0HugePages_Rsvd: 10Hugepagesize: 2048 kB SQL> shutdown immediateORA-01034: ORACLE not availableORA-27102: out of memoryLinux-x86_64 Error: 12: Cannot allocate memoryAdditional information: 1Additional information: 6422533SQL>
Pay particular attention to the fact that sqlplus is telling us that it is attached to an idle instance! I assure you, this is erroneous. The instance is indeed up.
Yes, this is bug 10159556 (I filed it for what it is worth). The solution is to have ample hugepages as opposed to precisely enough. Note, in another shell a privileged user can dynamically allocate more hugepages (even a single hugepage) and the instance will be then able to be shutdown cleanly. As an aside, an instance in this situation can be shutdown with abort. I don’t aim to insinuate that this is some sort of zombie instance that will not go away.
/* Created: Version 1: 01/30/2012 Jed S. Walker Updated: Version 2: 03/20/2006 Jed S. Walker Notes : does not work on Oracle < 9i due to MView check To do: does not handle "" table names Usage: @jdesc user.table @jdesc table */ set serveroutput on size 1000000 set verify off set feedback off set pages 50 linesize 120 set long 10000 col owner format a20 col object_name format a35 declare n_exist number; v_param1 varchar2(200); v_seploc number; v_schema_name user_tables.table_name%type; v_table_name user_tables.table_name%type; v_longmsg varchar2(4000); v_type_atts varchar2(500); cursor curs_syns (owner_val varchar2, tabname_val varchar2) is select owner, synonym_name from all_synonyms where owner = owner_val and table_name = tabname_val order by owner, synonym_name; cursor curs_tcols (owner_val varchar2, tabname_val varchar2) is select column_name, data_type, data_length, data_precision, data_scale, decode(nullable,'Y','NULL','NOT NULL') nullable, data_default from all_tab_columns where owner = owner_val and table_name = tabname_val order by column_id; v_lenprec varchar2(20); cursor curs_cons (owner_val varchar2, tabname_val varchar2) is select constraint_name, decode(constraint_type, 'P', 'Primary', 'U', 'Unique', 'C', 'Check', 'R', 'References', constraint_type) constraint_type, search_condition, r_constraint_name, decode(delete_rule, 'NO ACTION', '', delete_rule) delete_rule, decode(constraint_type, 'P', 1, 'U', 2, 'R', 3, 'C', 4, 5) sortorder from all_constraints where owner = owner_val and table_name = tabname_val order by sortorder, constraint_name; v_fk_table_name user_tables.table_name%type; v_temp varchar2(2000); cursor curs_ccollist (owner_val varchar2, consname_val varchar2) is select column_name from all_cons_columns where owner = owner_val and constraint_name = consname_val order by position; v_criteria varchar2(2000); cursor curs_inds (owner_val varchar2, tabname_val varchar2) is select index_name, index_type, uniqueness from all_indexes where owner = owner_val and table_name = tabname_val order by uniqueness desc, index_name; cursor curs_icollist (owner_val varchar2, indname_val varchar2) is select column_name from all_ind_columns where index_owner = owner_val and index_name = indname_val order by column_position; v_columns varchar2(2000); cursor curs_trigs (owner_val varchar2, tabname_val varchar2) is select trigger_name, trigger_type, triggering_event from all_triggers where owner = owner_val and table_name = tabname_val order by trigger_type, triggering_event, trigger_name; begin -- get parameters and determine if this is local schema or other schema v_param1:='&1'; v_seploc:=instr(v_param1,'.'); if v_seploc = 0 then v_schema_name:=upper(SYS_CONTEXT('USERENV','CURRENT_SCHEMA')); v_table_name:=upper(v_param1); else v_schema_name:=upper(substr(v_param1,1,v_seploc-1)); v_table_name:=upper(substr(v_param1,v_seploc+1,length(v_param1)-v_seploc)); end if; -- check for table existence or view select count(1) into n_exist from all_tables where owner = v_schema_name and table_name = v_table_name; if n_exist = 0 then select count(1) into n_exist from all_views where owner = v_schema_name and view_name = v_table_name; if n_exist = 0 then dbms_output.put_line('Table ' || v_schema_name || '.' || v_table_name || ' does not exist'); else dbms_output.put_line(chr(10) || v_schema_name || '.' || v_table_name || ' is a view:' || chr(10)); select text into v_longmsg from all_views where owner = v_schema_name and view_name = v_table_name; dbms_output.put_line(v_longmsg); end if; else -- show table name and type attributes dbms_output.put_line(chr(10) || v_schema_name || '.' || v_table_name || chr(10)); -- show table type attributes select decode(temporary,'Y','YES','N','NO') into v_temp from all_tables where owner = v_schema_name and table_name = v_table_name; dbms_output.put_line('Temporary Table => ' || v_temp); select decode(iot_type,'IOT','YES','NO') into v_temp from all_tables where owner = v_schema_name and table_name = v_table_name; dbms_output.put_line('Index Organized Table (IOT) => ' || v_temp); select partitioned into v_temp from all_tables where owner = v_schema_name and table_name = v_table_name; dbms_output.put_line('Partitioned => ' || v_temp); select decode(cluster_name,null,'NO','YES') into v_temp from all_tables where owner = v_schema_name and table_name = v_table_name; dbms_output.put_line('Clustered Table => ' || v_temp); select decode(compression,'ENABLED','YES','NO') into v_temp from all_tables where owner = v_schema_name and table_name = v_table_name; dbms_output.put_line('Compressed Table => ' || v_temp); select decode(count(1),0,'NO','YES') into v_temp from all_base_table_mviews where owner = v_schema_name and master = v_table_name; dbms_output.put_line('Materialized Views => ' || v_temp); select decode(count(1),0,'NO','YES') into v_temp from all_policies where object_owner = v_schema_name and object_name = v_table_name; dbms_output.put_line('Fine Grained Access Control => ' || v_temp); -- show synonyms dbms_output.put_line(chr(10)); n_exist:=0; for rec_syn in curs_syns (v_schema_name,v_table_name) loop dbms_output.put_line('Synonym: ' || rec_syn.owner || ' : ' || rec_syn.synonym_name); n_exist:=n_exist+1; end loop; if n_exist = 0 then dbms_output.put_line('There are no synonyms for this table.'); end if; -- list columns, data type, null?, default dbms_output.put_line(chr(10) || rpad('COLUMN',32) || chr(9) || rpad('TYPE',14) || chr(9) || rpad('NULLS',10) || chr(9) || 'DEFAULT'); dbms_output.put_line(rpad('------',32) || chr(9) || rpad('----',14) || chr(9) || rpad('-----',10) || chr(9) || '-------'); for rec_col in curs_tcols (v_schema_name,v_table_name) loop if rec_col.data_type in ('CHAR','VARCHAR2','NCHAR','NVARCHAR2') then v_lenprec:='(' || rec_col.data_length || ')'; elsif rec_col.data_type in ('NUMBER') then v_lenprec:='(' || rec_col.data_precision || ',' || rec_col.data_scale || ')'; else v_lenprec:=''; end if; dbms_output.put_line(rpad(rec_col.column_name,32) || chr(9) || rpad(rec_col.data_type || v_lenprec,14) || chr(9) || rpad(rec_col.nullable,10) || chr(9) || rec_col.data_default); end loop; -- constraints dbms_output.put_line(chr(10) || rpad('CONSTRAINT',32) || chr(9) || rpad('TYPE',14) || chr(9) || 'CRITERIA'); dbms_output.put_line(rpad('----------',32) || chr(9) || rpad('----',14) || chr(9) || '--------'); n_exist:=0; for rec_con in curs_cons (v_schema_name,v_table_name) loop if rec_con.constraint_type in ('Primary','Unique') then v_criteria:='('; for rec_collist in curs_ccollist(v_schema_name,rec_con.constraint_name) loop if v_criteria != '(' then v_criteria:=v_criteria||','; end if; v_criteria:=v_criteria || rec_collist.column_name; end loop; v_criteria:=v_criteria || ')'; elsif rec_con.constraint_type = 'References' then --v_criteria:=rec_con.r_constraint_name; select table_name into v_fk_table_name from all_constraints where owner = v_schema_name and constraint_name = rec_con.r_constraint_name; v_temp:='('; for rec_collist in curs_ccollist(v_schema_name,rec_con.r_constraint_name) loop if v_temp != '(' then v_temp:=v_temp||','; end if; v_temp:=v_temp || rec_collist.column_name; end loop; v_temp:=v_temp || ')'; v_criteria:=v_fk_table_name || v_temp || ' ' || rec_con.delete_rule; elsif rec_con.constraint_type = 'Check' then v_criteria:=rec_con.search_condition; if length(v_criteria) > (250-32-14) then v_criteria:=substr(v_criteria,1,(250-32-14)) || '...'; end if; else v_criteria:='Unknown'; end if; n_exist:=n_exist+1; dbms_output.put_line(rpad(rec_con.constraint_name,32) || chr(9) || rpad(rec_con.constraint_type,14) || chr(9) || v_criteria); end loop; if n_exist = 0 then dbms_output.put_line('<none>'); end if; -- indexes dbms_output.put_line(chr(10) || rpad('INDEX',32) || chr(9) || rpad('TYPE',20) || chr(9) || 'COLUMNS'); dbms_output.put_line(rpad('-----',32) || chr(9) || rpad('----',20) || chr(9) || '-------'); n_exist:=0; for rec_ind in curs_inds (v_schema_name,v_table_name) loop v_columns:='('; for rec_collist in curs_icollist (v_schema_name,rec_ind.index_name) loop if v_columns != '(' then v_columns:=v_columns||','; end if; v_columns:=v_columns || rec_collist.column_name; end loop; v_columns:=v_columns || ')'; dbms_output.put_line(rpad(rec_ind.index_name,32) || chr(9) || rpad(rec_ind.uniqueness || ':' || rec_ind.index_type,20) || chr(9) || v_columns); n_exist:=n_exist+1; end loop; if n_exist = 0 then dbms_output.put_line('<none>'); end if; -- triggers dbms_output.put_line(chr(10) || rpad('TRIGGER',32) || chr(9) || rpad('TYPE',20) || chr(9) || rpad('EVENT',20) ); dbms_output.put_line(rpad('-----',32) || chr(9) || rpad('----',20) || chr(9) || rpad('------',20) ); n_exist:=0; for rec_trig in curs_trigs (v_schema_name,v_table_name) loop dbms_output.put_line(rpad(rec_trig.trigger_name,32) || chr(9) || rpad(rec_trig.trigger_type,20) || chr(9) || rpad(rec_trig.triggering_event,20) ); n_exist:=n_exist+1; end loop; if n_exist = 0 then dbms_output.put_line('<none>'); end if; -- some space before prompt dbms_output.put_line(chr(10)); end if; -- check for table end; / set feedback on
go to the oracle home and type the following:
opatch lsinventory -details > lsinv.txt
then copy to /tmp
then ftp from /tmp
Clone DB OracleBase article
This article is a test
When doing any type of upgrade/patch I’m often asked about backing up the existing binaries and how to save the customer from minimal downtime. One of the ways this can be achieved is by simply making another oracle home tar copied from the existing home. This now gives several options that allow flexibity in many ways.
# tar -cvf /tmp/u01.tar /u01
First parameter is the location of the tar file, second paramater is the directory where the oracle home is located
# gzip /tmp/u01.tar
You can then gzip it to make it smaller/easier to move it
Once moved to the destination server
# gunzip /tmp/u01.tar.gz
# cd /
# tar -xvf /tmp/u01.tar
Root Configuration Scripts
Run the root configutation scripts, generated as part of the original installation, on the destination server as the “root” user.
Modify Config Files
If you have not prepared the “.bash_profile” as part of the destination server setup, copy the file from the source to the destination server.
# scp /home/oracle/.bash_profile email@example.com:/home/oracle/.bash_profile
Edit the “.bash_profile” file on the destination server, giving it the correct value for the ORACLE_HOSTNAME environment variable.
Amend any hostname or IP references in the “listener.ora” and “tnsnames.ora” files in the “$ORACLE_HOME/network/admin” directory.
Edit the “/etc/oratab” making sure all instances are referenced correctly. In my case I need to add the following entry.
Tested a table on Exadata