Blame it on the network

One of the most frustrating oracle errors we hate as DBA’s is the famous ora-3113 (often accompanied by ora-3114 and ora-12170). :)

Let’s take a scenario where we start hearing complaints about network errors and sporadic connection break downs from the end users.

The following example is taken from a windows 64bit and client windows 32 bit.

I’m hoping that the next procedure might help you better understand where the error stems from and debug it:

  • First step in debugging such a behavior is looking from which side the errors come from. It could come from the server’s side, the client’s side or both. If such error comes from the server side we would probably see some evidence in the alert log and observe some traces in the user dump destination. We should also take a look at the listener.log as it might also give us a hint.
  • In case the alert log is empty of errors and no traces were gathered in the udump directory we should check the client for errors. Now, we might see some network substrate (ns) related errors in the sqlnet.log such as  ” ns main err code: 12547″ but it does not point out the real issue as it indicates only a lost connection of the client. At this point we would want to trace the sql*net further.

In order to get some detailed information let’s add the para  meter

TRACE_LEVEL_CLIENT = 16 to the sqlnet.ora.

  • Our trace file is going to grow very fast and so enough disk space is necessary. If we do get an error when the trace is running it should look like the following:
(636) [13-NOV-2009 07:21:52:294] ntt2err: entry
(636) [13-NOV-2009 07:21:52:294] ntt2err: soc 7480 error - operation=5,
ntresnt[0]=517, ntresnt[1]=54, ntresnt[2]=0
(636) [13-NOV-2009 07:21:52:294] ntt2err: exit
(636) [13-NOV-2009 07:21:52:294] nttrd: exit
(636) [13-NOV-2009 07:21:52:294] nsprecv: error exit
(636) [13-NOV-2009 07:21:52:294] nserror: entry(636) [13-NOV-2009 07:21:52:294] nserror:
nsres:, op=68, ns=12547, ns2=12560; nt[0]=517, nt[1]=54, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
(636) [13-NOV-2009 07:21:52:294] nsrdr: error exit
(636) [13-NOV-2009 07:21:52:294] snsbitts_ts: entry
(636) [13-NOV-2009 07:21:52:294] snsbitts_ts: acquired the bit
(636) [13-NOV-2009 07:21:52:294] snsbitts_ts: normal exit
(636) [13-NOV-2009 07:21:52:294] snsbitcl_ts: entry
(636) [13-NOV-2009 07:21:52:294] snsbitcl_ts: normal exit
(636) [13-NOV-2009 07:21:52:294] snsbitts_ts: entry
(636) [13-NOV-2009 07:21:52:294] snsbitts_ts: acquired the bit
(636) [13-NOV-2009 07:21:52:294] snsbitts_ts: normal exit
(636) [13-NOV-2009 07:21:52:294] nsdo: nsctxrnk=0
(636) [13-NOV-2009 07:21:52:294] snsbitcl_ts: entry
(636) [13-NOV-2009 07:21:52:294] snsbitcl_ts: normal exit
(636) [13-NOV-2009 07:21:52:294] nsdo: error exit
(636) [13-NOV-2009 07:21:52:294] nioqrc: wanted 1 got 0, type 0
(636) [13-NOV-2009 07:21:52:294] nioqper: error from nioqrc
(636) [13-NOV-2009 07:21:52:294] nioqper: ns main err code: 12547
(636) [13-NOV-2009 07:21:52:294] nioqper: ns (2) err code: 12560
(636) [13-NOV-2009 07:21:52:294] nioqper: nt main err code: 517
(636) [13-NOV-2009 07:21:52:294] nioqper: nt (2) err code: 54
(636) [13-NOV-2009 07:21:52:294] nioqper: nt OS err code: 0
(636) [13-NOV-2009 07:21:52:294] nioqer: entry
(636) [13-NOV-2009 07:21:52:294] nioqer: incoming err = 12151
(636) [13-NOV-2009 07:21:52:294] nioqce: entry
(636) [13-NOV-2009 07:21:52:294] nioqce: exit
(636) [13-NOV-2009 07:21:52:294] nioqer: returning err = 3135
(636) [13-NOV-2009 07:21:52:294] nioqer: exit
(636) [13-NOV-2009 07:21:52:294] nioqrc: exit
(636) [13-NOV-2009 07:21:52:294] nioqds: entry
(636) [13-NOV-2009 07:21:52:294] nioqds: disconnecting...
(636) [13-NOV-2009 07:21:52:294] nsclose: entry
(636) [13-NOV-2009 07:21:52:294] nstimarmed: entry
(636) [13-NOV-2009 07:21:52:294] nstimarmed: no timer allocated
(636) [13-NOV-2009 07:21:52:294] nstimarmed: normal exit
  • The highlighted line is what we need to focus on. The actual error is the ntt2error and the following arguments  – operation=5, ntresnt[0]=517, ntresnt[1]=54, ntresnt[2]=0 mean that a the network is stressed.

The error means that a packet being sent from the client does not receive ACK for 93 seconds, which is the default of the parameter TcpMaxDataRetransmissions

(look at Microsoft TechNet for further details –  http://technet.microsoft.com/en-us/library/cc938210.aspx ). The retransmission mechanism works in the way that when a packet is being send it waits for 3 seconds for ack before resending it. If it does not receive ack it will wait 3 seconds times 2, then times 4, times 8, and times 16 (5 times total).

The workaround for a busy network is to increase the default of the TcpMaxDataRetransmissions parameter. This parameter is located in the registry under

HKEY_LOCAL_MACHINE > system > CurrentControlSet > Services > Tcpip > Parameters

If the parameter does not exsist, just add it but bear in mind that adjusting the parameter can lead to a longer timeout wait rather than actually solve the problem.

So what you really need to do is to get the guy with the sniffer over and have him search for the problem.

Good luck!

BMR – one more important RMAN benefit

Still uncertain whether to backup your database with RMAN or not?

RMAN significantly improved database backup by giving the following advantages:

  1. Faster backup.
  2. Internal management of backup and restore which are independent of OS or PLSQL scripts.
  3. Single interface for all OS.
  4. Internal Oracle capabilities such as binaries shrinking algorithm, parallelism in backup and restore procedures.

But here is another major benefit that might help you make the right decision:

BMR – Block Media Recovery

Follow this link to the original article posted in Hebrew:

http://www.oracle.com/global/il/technology/ido_ben_zeev.html

Install Management Agent on windows cluster

Installing the management agent (formerly known as intelligent agent) is usually a simple task as long as you asked your system administrator to open all the required ports.

When it comes to a windows cluster it involves a little bit more caution and precision. The agent works on a cluster environment no matter if you are using Oracle FailSafe or not. The following pdf simplifies the procedure:

Install agent on windows cluster

Upgrading to 11g – The dark side of db_file_multiblock_read_count

db_file_multiblock_read_count parameter helps us controlling the numbers of blocks being brought up to the buffer cache in one I/O action. In older oracle releases, this parameter was usually set between 4 and 16 on OLTP environments and up to 64 on DWH environment.

Starting 10gR2, db_file_multiblock_read_count  is being set automatically depends on your OS I/O size and buffer cache size.

What happens is that even when you tune your system by optimizing the multiblock parameter, when you then upgrade to 11g (Fresh installation) Oracle might set up the default to an irrelevant number.  So, for instance, if you work on RAC OLTP based system, and your multiblock is set on 8 you should expect a change of this parameter with major performance degradation.

I recently visited a customer who suffered from some nasty waits on “gc buffer busy acquire” as well as other GC related wait events right after upgrading to 11g. Further investigation and a stress test validated the assumption that this is due to the change of the multiblock parameter (from 16 to 128).

The reason for this behavior is clear- since one node keeps a large amount of blocks in the buffer cache, and the second node tries to acquire access to the same blocks (intense OLTP operations) it will wait on cache fusion events.

So be conscious about it!

Ido

Take advantage of the 11g ASM cp and easily backup your RAC databases

Oracle introduced the asmcd utility already in 10g database, but with limited shell commands. Although we expected oracle to come up with advanced shell capabilities in 11g , a little has been done, but still very helpful.

A very nice feature allows us to migrate (export) asm files out to an external file system.  The big plus here is that no database needs to be mounted on the machine where we would like to mount the asm disk groups on. For instance, you can use a storage utility to create a snapshot of your ASM disks and present them on a different machine where you want to them to be backed up. This is nice especially where you have a backup tool (evault , for example) that does not support ASM.

In the attached pdf, you will find all the steps and scripts to design such a configuration.

To make it easier on everyone, I put everything down in the following pdf:

Using ASM CP command

Sharing Knowledge is powerful – Welcome to my new Blog

So here it is – Ido’s technological Blog!

yes I know.. here is yet another blog, you say.. there are zillion of blogs.. you say..  Why would I need yet another one to read… you say.

So what’s the difference anyhow? well, let me tell you..

I will make my best to expose you all to the most recent technological updates out there.

I will make the best to publish some scripts from time to time to help you improve your work flow.

I will make the best to expose you all to different configurations on various platforms.

So let’s all share our ora insights because sharing knowledge is fun (or at least helpful..)

Cheers

Ido

Copyright © OraInsights
Ido Ben Zeev

Built on Notes Blog Core / Powered by WordPress
Enhanced by Lucid