limit cpu count to fix boot issue on machines with high core count #107

Closed

@Felk Felk commented May 5, 2022

This fixes the issues I faced in #64

With this change I am successfully able to start the image on a 96-core host machine, which previously errored with ORA-00821: Specified value of sga_target 1808M is too small, needs to be at least 2320M.

According to the documentation on CPU_COUNT, the value is treated as a maximum when set in the initialization parameter file, so it shouldn't be a problem for machines with fewer than 16 cores:

However, if CPU_COUNT is set to a value greater than the current number of CPUs in the initialization parameter file, then CPU_COUNT is capped to the current number of CPUs.

I don't have a machine with fewer than 16 cores at hand, but I'll set this to 999 locally and see if that boots successfully too to verify it works. Building the image just takes a while each time 😅

Author

Felk commented May 5, 2022

Rebased to sign the commit, and I was able to start the docker image on my machine with cpu_count set to 999, so it does work as a maximum.

Owner

gvenzl commented May 21, 2022

Hi @Felk,

Thanks for this PR. I have to say I'm somewhat surprised that this works,
but I think you have just given me the single most important clue for fixing this issue.
See, CPU_COUNT is fixed to 2 for Oracle XE, as it only allows for 2 CPU cores:

SQL> alter system set cpu_count=999 scope=spfile;

System altered.

SQL> shutdown immediate;
startup;
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> ORACLE instance started.

Total System Global Area 1241512272 bytes
Fixed Size		    9685328 bytes
Variable Size		  603979776 bytes
Database Buffers	  620756992 bytes
Redo Buffers		    7090176 bytes
Database mounted.
Database opened.
SQL> show parameter cpu_count

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
cpu_count			     integer	 2

However, the Oracle software clearly seems to calculate the memory requirements based on the CPU count before the CPU count is set to 2. Otherwise, we would never run into that issue of not having enough memory to begin with.

So your fix made me curious, what's actually in that pfile once you create it?
Turns out, CPU_COUNT isn't in there:

SQL> CREATE PFILE='/tmp/pfile.ora' FROM SPFILE;

File created.

SQL> exit
Disconnected
bash-4.4$ cat /tmp/pfile.ora
XE.__data_transfer_cache_size=0
XE.__db_cache_size=553648128
XE.__inmemory_ext_roarea=0
XE.__inmemory_ext_rwarea=0
XE.__java_pool_size=150994944
XE.__large_pool_size=16777216
XE.__oracle_base='/opt/oracle'#ORACLE_BASE set from environment
XE.__pga_aggregate_target=419430400
XE.__sga_target=1241513984
XE.__shared_io_pool_size=67108864
XE.__shared_pool_size=436207616
XE.__streams_pool_size=0
XE.__unified_pga_pool_size=0
*._dmm_blas_library='libora_netlib.so'
*.audit_file_dest='/opt/oracle/admin/XE/adump'
*.audit_trail='db'
*.common_user_prefix=''
*.compatible='21.0.0'
*.control_files='/opt/oracle/oradata/XE/control01.ctl','/opt/oracle/oradata/XE/control02.ctl'
*.db_block_size=8192
*.db_name='XE'
*.diagnostic_dest='/opt/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=XEXDB)'
*.enable_pluggable_database=true
*.local_listener=''
*.nls_language='AMERICAN'
*.nls_territory='AMERICA'
*.open_cursors=300
*.pga_aggregate_target=392m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=1178m
*.shared_servers=0
*.spatial_vector_acceleration=FALSE
*.undo_tablespace='UNDOTBS1'

And if CPU_COUNT isn't there, it defaults to 0, which per the doc enables dynamic CPU count sampling:

If CPU_COUNT is set to 0 (its default setting), then Oracle Database continuously monitors the number of CPUs reported by the operating system and uses the current count. If CPU_COUNT is set to a value other than 0, then Oracle Database will use this count rather than the actual number of CPUs, thus disabling dynamic CPU reconfiguration.

So now I wonder, could it be that this is what happens:

  1. Oracle DB starts up
  2. Oracle DB checks the memory requirements
    i. Oracle DB checks for CPU_COUNT in the parameter file
    ii. CPU_COUNT is not set, hence 0, hence the DB checks /proc/cpuinfo or similar
    iii. Oracle DB calculates the memory requirements
  3. Along the line, Oracle DB says "hey, I'm XE, CPU_COUNT needs to be set to 2"

So what if I just explicitly set CPU_COUNT to 2 in the image itself during the build, so that the value is present in the SPFILE to begin with?
Perhaps then 2.ii. will just be "oh, CPU_COUNT = 2, ok, 1.1GB it is."
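
A minimal sketch of what such a build-time step might look like, assuming the image build runs a shell script that feeds SQL*Plus. The helper name cpu_count_sql is hypothetical, and the actual change made to the image is not shown in this thread; the ALTER SYSTEM ... SCOPE=SPFILE statement has the same form as the one used in the test above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the SQL that pins CPU_COUNT in the SPFILE.
# SCOPE=SPFILE stores the value in the SPFILE; it applies at the next startup.
cpu_count_sql() {
  local count="${1:-2}"
  printf 'ALTER SYSTEM SET CPU_COUNT=%s SCOPE=SPFILE;\n' "$count"
}

# Assumed usage during the image build (not executed here):
#   cpu_count_sql 2 | sqlplus -s / as sysdba
cpu_count_sql 2
```

Printing the statement instead of running it keeps the sketch usable without a database; in a real build the output would be piped into SQL*Plus as shown in the comment.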

Could you do me a favor and check the following two things in your env (as I do not have a machine with 96 cores :)):

Once the DB is up, could you please go into the container (docker exec ... bash), start a SQL*Plus session (sqlplus / as sysdba), and run show parameter cpu?

And likewise, could you please recreate the /tmp/pfile.ora afterwards and give me its output (from within SQL*Plus still)?

CREATE PFILE='/tmp/pfile.ora' FROM SPFILE;
HOST cat /tmp/pfile.ora

I bet it shows up as CPU_COUNT=2 via show parameter and CPU_COUNT=16 in the pfile.ora.
Let me, in the meantime, work on a commit that explicitly sets CPU_COUNT=2; perhaps I can ask you later on to just try to start that image on your machine as well.
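
For anyone who wants to check a generated pfile without reading the whole dump, here is a small sketch. The helper name pfile_cpu_count is hypothetical, and the demo writes a throwaway file rather than touching /tmp/pfile.ora.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the explicit cpu_count from a pfile,
# or "unset" when the parameter is absent (Oracle then defaults to 0).
pfile_cpu_count() {
  local pfile="$1" value
  # Match lines such as "*.cpu_count=16", case-insensitively.
  value="$(grep -i '^\*\.cpu_count=' "$pfile" | tail -n 1 | cut -d= -f2 || true)"
  echo "${value:-unset}"
}

# Demo with a temporary file standing in for a real pfile.
demo="$(mktemp)"
printf '%s\n' "*.db_name='XE'" '*.cpu_count=16' '*.sga_target=1178m' > "$demo"
pfile_cpu_count "$demo"
rm -f "$demo"
```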

The reason why I want to get it into the image itself btw is so that we don't pay that penalty of recreating the parameter file during container startup. It's probably just a couple of milliseconds that we save but hey, whatever can be done during image build means people don't have to wait during container startup, right? :)


Sanne commented May 22, 2022

I just tested the test image gvenzl/oracle-xe:test and got:

SQL> show parameter cpu_count

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
cpu_count			     integer	 2

SQL> show parameter sga_target

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
sga_target			     big integer 1184M

SQL> create pfile='/tmp/pfile.ora' from spfile;

File created.

SQL> host cat /tmp/pfile.ora

XE.__data_transfer_cache_size=0
XE.__db_cache_size=520093696
XE.__inmemory_ext_roarea=0
XE.__inmemory_ext_rwarea=0
XE.__java_pool_size=150994944
XE.__large_pool_size=16777216
XE.__oracle_base='/opt/oracle'#ORACLE_BASE set from environment
XE.__pga_aggregate_target=419430400
XE.__sga_target=1241513984
XE.__shared_io_pool_size=67108864
XE.__shared_pool_size=469762048
XE.__streams_pool_size=0
XE.__unified_pga_pool_size=0
*._dmm_blas_library='libora_netlib.so'
*.audit_file_dest='/opt/oracle/admin/XE/adump'
*.audit_trail='db'
*.common_user_prefix=''
*.compatible='21.0.0'
*.control_files='/opt/oracle/oradata/XE/control01.ctl','/opt/oracle/oradata/XE/control02.ctl'
*.cpu_count=2
*.db_block_size=8192
*.db_name='XE'
*.diagnostic_dest='/opt/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=XEXDB)'
*.enable_pluggable_database=true
*.local_listener=''
*.nls_language='AMERICAN'
*.nls_territory='AMERICA'
*.open_cursors=300
*.pga_aggregate_target=392m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=1178m
*.shared_servers=0
*.spatial_vector_acceleration=FALSE
*.undo_tablespace='UNDOTBS1'

With gvenzl/oracle-xe:21.3.0-slim instead (I'm testing on a 48-core workstation), I got:

SQL> show parameter cpu_count

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
cpu_count			     integer	 48

SQL> show parameter sga_target

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
sga_target			     big integer 1504M

SQL> create pfile='/tmp/pfile.ora' from spfile;

File created.

SQL> host cat /tmp/pfile.ora

XE.__data_transfer_cache_size=0
XE.__db_cache_size=1006632960
XE.__inmemory_ext_roarea=0
XE.__inmemory_ext_rwarea=0
XE.__java_pool_size=0
XE.__large_pool_size=83886080
XE.__oracle_base='/opt/oracle'#ORACLE_BASE set from environment
XE.__pga_aggregate_target=419430400
XE.__sga_target=1577058304
XE.__shared_io_pool_size=83886080
XE.__shared_pool_size=385875968
XE.__streams_pool_size=0
XE.__unified_pga_pool_size=0
*._dmm_blas_library='libora_netlib.so'
*.audit_file_dest='/opt/oracle/admin/XE/adump'
*.audit_trail='db'
*.common_user_prefix=''
*.compatible='21.0.0'
*.control_files='/opt/oracle/oradata/XE/control01.ctl','/opt/oracle/oradata/XE/control02.ctl'
*.db_block_size=8192
*.db_name='XE'
*.diagnostic_dest='/opt/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=XEXDB)'
*.enable_pluggable_database=true
*.local_listener=''
*.nls_language='AMERICAN'
*.nls_territory='AMERICA'
*.open_cursors=300
*.pga_aggregate_target=400m
*.processes=300
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=1500m
*.shared_servers=0
*.spatial_vector_acceleration=FALSE
*.undo_tablespace='UNDOTBS1'
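
As a quick cross-check of the two dumps, the XE.__sga_target entries are byte counts, and converting them to MiB gives the figures that show parameter sga_target reported above (1184M and 1504M). This is only a sketch; the helper name bytes_to_mib is made up, and it assumes SQL*Plus's "M" means 1048576 bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a byte count to whole MiB (1 MiB = 1048576 bytes).
bytes_to_mib() {
  echo $(( $1 / 1048576 ))
}

bytes_to_mib 1241513984   # XE.__sga_target from the :test image       -> 1184
bytes_to_mib 1577058304   # XE.__sga_target from the 21.3.0-slim image -> 1504
```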

Owner

gvenzl commented May 23, 2022

Awesome, thanks so much @Sanne, that confirms my theory!
The whole memory (mis)calculation issue can be avoided by just setting CPU_COUNT=2 explicitly.

I also seem to have been wrong that CPU_COUNT is always set to 2 in XE environments. That's certainly what it showed during my tests but given that my test env only has 2 cores, that was probably misleading.

Author

Felk commented May 23, 2022

> So your fix made me curious, what's actually in that pfile once you create it? Turns out, CPU_COUNT isn't in there

Yep, that's what I saw too. That's part of the reason I used HOST echo '*.cpu_count=16' >> /tmp/pfile.ora instead of HOST sed -i 's/\*\.cpu_count.*/\*\.cpu_count=16/g' /tmp/pfile.ora.
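
The two approaches can be combined so the edit works whether or not the parameter is already in the file. This is a sketch only: the helper name set_pfile_cpu_count is hypothetical, it is not what the PR did, and it assumes GNU sed for the in-place -i flag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set cpu_count in a pfile whether or not it is already there.
set_pfile_cpu_count() {
  local pfile="$1" count="$2"
  if grep -q '^\*\.cpu_count=' "$pfile"; then
    # Parameter present: rewrite the existing line in place (GNU sed -i).
    sed -i "s/^\*\.cpu_count=.*/*.cpu_count=${count}/" "$pfile"
  else
    # Parameter absent: append it, as the echo approach does.
    echo "*.cpu_count=${count}" >> "$pfile"
  fi
}

# Demo on a temporary file standing in for /tmp/pfile.ora.
demo="$(mktemp)"
echo "*.db_name='XE'" > "$demo"
set_pfile_cpu_count "$demo" 16   # appends the line
set_pfile_cpu_count "$demo" 2    # rewrites the same line
cat "$demo"
rm -f "$demo"
```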

> Could you do me a favor and check the two following on your env...

Using my 16-core test image, because the original one does not boot on that machine:

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
cpu_count			     integer	 16
cpu_min_count			     string	 16
parallel_threads_per_cpu	     integer	 1
resource_manager_cpu_allocation      integer	 0
resource_manager_cpu_scope	     string	 INSTANCE_ONLY

Not sure this is even useful to you then.

> And likewise, could you please recreate the /tmp/pfile.ora afterwards and give me its output (from within SQL*Plus still)?

File created.
XE.__data_transfer_cache_size=0
XE.__db_cache_size=671088640
XE.__inmemory_ext_roarea=0
XE.__inmemory_ext_rwarea=0
XE.__java_pool_size=150994944
XE.__large_pool_size=33554432
XE.__oracle_base='/opt/oracle'#ORACLE_BASE set from environment
XE.__pga_aggregate_target=536870912
XE.__sga_target=1610612736
XE.__shared_io_pool_size=67108864
XE.__shared_pool_size=671088640
XE.__streams_pool_size=0
XE.__unified_pga_pool_size=0
*._dmm_blas_library='libora_netlib.so'
*.audit_file_dest='/opt/oracle/admin/XE/adump'
*.audit_trail='db'
*.common_user_prefix=''
*.compatible='21.0.0'
*.control_files='/opt/oracle/oradata/XE/control01.ctl','/opt/oracle/oradata/XE/control02.ctl'
*.cpu_count=16
*.db_block_size=8192
*.db_name='XE'
*.diagnostic_dest='/opt/oracle'
*.dispatchers='(PROTOCOL=TCP) (SERVICE=XEXDB)'
*.enable_pluggable_database=true
*.local_listener=''
*.nls_language='AMERICAN'
*.nls_territory='AMERICA'
*.open_cursors=300
*.pga_aggregate_target=512m
*.processes=1280
*.remote_login_passwordfile='EXCLUSIVE'
*.sga_target=1536m
*.shared_servers=0
*.undo_tablespace='UNDOTBS1'

> Let me, in the meantime work on a commit that explicitly sets the CPU_COUNT=2, perhaps I can ask you later on to just try to start that image on your machine as well.

Thank you, that sounds like it will fix the issue. I'll gladly try out the image.

Owner

gvenzl commented May 23, 2022

Thanks a lot, @Felk!

I have uploaded gvenzl/oracle-xe:test, could you please pull that image and see whether it works on your machine?

Author

Felk commented May 24, 2022

gvenzl/oracle-xe:test successfully starts on the 96-core machine! Thank you very much.

Owner

gvenzl commented May 25, 2022

Awesome, thanks so much for verifying!

Author

Felk commented May 25, 2022

I'll close this pull request since the changes are obsolete now. I'm thrilled to finally have our CI pipeline run Oracle once this makes it into the official image.

@Felk Felk closed this May 25, 2022
Owner

gvenzl commented May 28, 2022

The fix is now present in all images!

Thanks once more @Felk for this great discovery!

@Felk Felk deleted the limit_cpu_count branch May 30, 2022 07:07