System Replication Implementation and Testing (part 2)
My name is Man-Ted Chan and I’m from the SAP HANA product support team. This is part 2 of my High Availability/System Replication blog; part 1 can be found here.
This post continues where the last one left off.
How to turn off replication
First we will unregister the secondary server, which means no more data from the primary will go to this server.
After the secondary has been unregistered, we can run hdbnsutil -sr_state to confirm this.
However, if you check the primary node you will see that replication is still enabled, but no secondary server is listed.
Next we can disable replication on the primary.
Once this is done you can check the replication tab and the output of hdbnsutil -sr_state.
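The steps above can be sketched as a small Python wrapper around hdbnsutil. This is only a sketch under assumptions: it assumes the hdbnsutil options -sr_unregister (on the secondary) and -sr_disable (on the primary), that you run it as the <sid>adm user on the respective host, and the function names and the injectable runner are hypothetical helpers of my own. Depending on your revision the secondary may need to be stopped before unregistering; see SAP Note 1945676.

```python
import subprocess

# Assumed hdbnsutil invocations for each step (verify against your revision).
UNREGISTER_SECONDARY = ["hdbnsutil", "-sr_unregister"]  # run on the secondary
DISABLE_PRIMARY = ["hdbnsutil", "-sr_disable"]          # run on the primary
CHECK_STATE = ["hdbnsutil", "-sr_state"]                # run on either host


def run_step(command, runner=subprocess.run):
    """Run one hdbnsutil step and return its standard output as text."""
    result = runner(command, capture_output=True, text=True, check=True)
    return result.stdout


def turn_off_on_secondary(runner=subprocess.run):
    """Unregister the secondary, then return the reported replication state."""
    run_step(UNREGISTER_SECONDARY, runner)
    return run_step(CHECK_STATE, runner)


def turn_off_on_primary(runner=subprocess.run):
    """Disable replication on the primary, then return the reported state."""
    run_step(DISABLE_PRIMARY, runner)
    return run_step(CHECK_STATE, runner)
```

Calling print(turn_off_on_secondary()) on the secondary and then print(turn_off_on_primary()) on the primary would show the hdbnsutil -sr_state output after each step. The runner argument exists only so the sequence can be rehearsed with a fake runner before touching a real system.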
Other things tested during this phase
As a test, I stopped the primary to see what happens to the replication. No automated takeover occurs, but the following network communication errors appear in the trace files:
e Stream NetworkChannelCompletion.cpp(00524) : NetworkChannelCompletionThread #2 NetworkChannel FD 28 [0x00007fc028072818] {refCnt=3, idx=2} 10.97.22.172/0_tcp->10.97.22.172/30103_tcp ConnectWait,[—c]
: Error in asynchronous stream event: exception 1: no.2110001 (Basis/IO/Stream/impl/NetworkChannelCompletion.cpp:450)
Generic stream error: getsockopt, Event=EPOLLERR – , rc=111: Connection refused
Please note that if you stop the replication server, the primary server will raise the following alert:
ReplicationError with state INFO with event ID 1 occurred at on xxxx36f509:30007. Additional info: Communication channel closed
Associated with Alert ID 78
The following error will be found in the trace files
e TNS TNSClient.cpp(00671) : sendRequest dr_getremotereplicationinfo to xxxx301545c:30001 failed with NetException. data=(I)drsender=1|
e sr_nameserver TNSClient.cpp(06880) : error when sending request ‘dr_getremotereplicationinfo’ to xxxx301545c:30102: connection refused,location=xxxx301545c:30001
i EventHandler EventManagerImpl.cpp(00602) : acknowledge: ReplicationEvent(): Communication channel closed
If you run into this alert in your own system, check whether the secondary node is down (can you start it, or was there a crash?).
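To spot these messages quickly in a large trace file, a minimal Python sketch like the following can help. It only looks for the two message fragments quoted in the excerpts above ("Connection refused" and "Communication channel closed"); the function names and labels are my own, and real trace files may contain other variants that this does not catch.

```python
def classify_trace_line(line):
    """Return a short label for a replication-related trace line, or None."""
    lowered = line.lower()
    if "communication channel closed" in lowered:
        return "channel closed"
    if "connection refused" in lowered:
        return "connection refused"
    return None


def summarize(lines):
    """Count how often each known replication error appears in the lines."""
    counts = {}
    for line in lines:
        label = classify_trace_line(line)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

You could feed it the lines of a nameserver or indexserver trace, for example summarize(open(path, errors="replace")), where path is whatever trace file you are looking at.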
How to perform a takeover
*Please note that a takeover should be performed only if there is an issue with the primary or if you would like zero downtime during a HANA upgrade.
Right-click the secondary node and open “Configure System Replication”.
At an OS level you will see the takeover process
To perform the takeover from the command line, run the following on the secondary server:
hdbnsutil -sr_takeover
*After the takeover a new server had to be set up, so the server name changes from 301545c to 59e3753f1.
Please note that on your replication server you will now be able to open the full administration panel and not just the diagnosis mode (in diagnosis mode only the ‘Processes’, ‘Diagnosis Files’, and ‘Emergency Information’ tabs are available).
On the old primary server and the old replication server we can check Landscape->System Replication and see that there is no replication.
Since replication hasn’t been disabled, we will see the communication errors again on the original primary:
i EventHandler EventManagerImpl.cpp(00780) : –removeAllEvents: ReplicationEvent(): Communication channel closed
On the old replication server the nameserver trace will show the following during the takeover if it was successful
i sr_nameserver TREXNameServer.cpp(15647) : re-assign for databaseId 2 volume 2 returned successfully
i sr_nameserver TREXNameServer.cpp(15647) : re-assign for databaseId 2 volume 4 returned successfully
i sr_nameserver TREXNameServer.cpp(15647) : re-assign for databaseId 2 volume 3 returned successfully
i sr_nameserver TREXNameServer.cpp(15703) : issueing "/usr/sap/MV1/SYS/global/hdb/install/bin/hdbupdrep -s MV1 --user_store_key=SRTAKEOVER -b"
i sr_nameserver TREXNameServer.cpp(15686) : reconfiguring all services
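If you want to confirm from the trace that every volume was re-assigned, a small Python sketch along these lines could be used. It assumes the message wording shown in the excerpt above ("re-assign for databaseId N volume M returned successfully"); the helper name is hypothetical.

```python
import re

# Matches the success message from the nameserver trace excerpt above.
REASSIGN_RE = re.compile(
    r"re-assign for databaseId (\d+) volume (\d+) returned successfully"
)


def reassigned_volumes(lines):
    """Map each database ID to the sorted list of volumes re-assigned."""
    found = {}
    for line in lines:
        match = REASSIGN_RE.search(line)
        if match:
            database_id = int(match.group(1))
            volume = int(match.group(2))
            found.setdefault(database_id, set()).add(volume)
    return {db: sorted(volumes) for db, volumes in found.items()}
```

Comparing the result with the volumes you expect for your landscape gives a quick sanity check after a takeover.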
Check the global.ini and nameserver.ini on the secondary node (the primary will not change)
/usr/sap/MV1/global/hdb/custom/config> cat global.ini
[system_replication]
site_id = 2
mode = sync
actual_mode = primary
site_name = rep
mo-59e3753f1:/usr/sap/MV1/global/hdb/custom/config> cat nameserver.ini
[landscape]
id = 55de6934-1b45-7f0a-e100-00000a6116ac
master = mo-59e3753f1:30001
worker = mo-59e3753f1
active_master = mo-59e3753f1:30001
idsr = 55f36543-7352-8161-e100-00000a61131b
roles_mo-59e3753f1 = worker
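The same check can be scripted. Below is a minimal sketch that reads the text of global.ini with Python's standard configparser and reports whether the [system_replication] section shows actual_mode = primary, as in the listing above. The function names are my own, and I am assuming the file parses as a plain INI file.

```python
import configparser


def replication_settings(ini_text):
    """Return the [system_replication] section of global.ini as a dict."""
    # interpolation=None keeps any '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_section("system_replication"):
        return {}
    return dict(parser["system_replication"])


def is_primary_after_takeover(ini_text):
    """True if the site reports actual_mode = primary."""
    return replication_settings(ini_text).get("actual_mode") == "primary"
```

For example, is_primary_after_takeover(open("global.ini").read()), run in the custom config directory of the former secondary, should return True after a successful takeover.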
Memory
In order to minimize memory consumption, the following parameters should be set in the secondary system:
1) global.ini/[system_replication]/preload_column_tables = false
2) global.ini/[memorymanager]/global_allocation_limit =
If the parameter “preload_column_tables” is set to “true” on the secondary side, the secondary system will dynamically load tables into memory according to the preload information shipped from the primary side.
During the takeover procedure, the “global_allocation_limit” should be increased on the secondary side to the same value as on the primary side.
In async mode, memory is also consumed on the primary: log entries are first written to a log buffer and then sent over to the secondary. The amount of memory this buffer takes up is set by:
- global.ini -> [system_replication] -> logshipping_async_buffer_size =
Tracing
For additional information during a takeover please run the following
alter system alter configuration ('nameserver.ini','SYSTEM') SET ('trace','failover')='debug' with reconfigure;
alter system alter configuration ('nameserver.ini','SYSTEM') SET ('trace','ha_provider')='debug' with reconfigure;
Perform the failover test. Once done, you can turn off this tracing:
alter system alter configuration ('nameserver.ini','SYSTEM') UNSET ('trace','failover') with reconfigure;
alter system alter configuration ('nameserver.ini','SYSTEM') UNSET ('trace','ha_provider') with reconfigure;
For general tracing during replication, you can set global.ini -> trace -> sr_dataaccess = debug and global.ini -> trace -> stream = debug in the SAP HANA studio. This adds additional tracing to the indexserver trace.
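If you switch these traces on and off often, the statements can be generated rather than typed. The following is a minimal Python sketch that only builds the SQL text in the form shown above; the helper names are hypothetical, and actually executing the statements (for example through a SQL client of your choice) is not shown here.

```python
import re

# Only allow simple names so nothing unexpected ends up inside the quotes.
_NAME_RE = re.compile(r"^[A-Za-z0-9_.]+$")


def _checked(name):
    """Return name unchanged, or raise if it contains unexpected characters."""
    if not _NAME_RE.match(name):
        raise ValueError(f"unexpected characters in {name!r}")
    return name


def set_trace_sql(component, level="debug", ini_file="nameserver.ini"):
    """Build the statement that sets a trace component to the given level."""
    return (
        f"ALTER SYSTEM ALTER CONFIGURATION ('{_checked(ini_file)}', 'SYSTEM') "
        f"SET ('trace', '{_checked(component)}') = '{_checked(level)}' "
        "WITH RECONFIGURE"
    )


def unset_trace_sql(component, ini_file="nameserver.ini"):
    """Build the statement that removes a trace setting again."""
    return (
        f"ALTER SYSTEM ALTER CONFIGURATION ('{_checked(ini_file)}', 'SYSTEM') "
        f"UNSET ('trace', '{_checked(component)}') WITH RECONFIGURE"
    )
```

For the takeover traces above you would generate statements for the components failover and ha_provider, and for general replication tracing set_trace_sql("sr_dataaccess", ini_file="global.ini").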
References
System Replication Configuration Parameters
https://help.sap.com/saphelp_hanaplatform/helpdata/en/0c/d257970d514abd8ddf9ee1f45f3bca/content.htm?fullscreen=true
Issues Encountered
Misc.
-After SP9, users ran into Alert 79, Configuration Parameter Mismatch. To resolve this you can set global.ini->system_replication->keep_old_style_alert = false
The ini files will still be mismatched, but the alert will stop appearing. You can check the mismatches manually, or go to /usr/sap/<SID>/global/hdb/custom/config, copy the files from the primary and paste them to the secondary, but do not overwrite the global.ini->system_replication and nameserver.ini->landscape sections, as this will break replication. Another option is to run the SQL script that finds the differences:
HANA_Replication_SystemReplication_ParameterDeviations
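For the manual check, a rough Python sketch like the one below can list the differences between the primary and secondary copy of an ini file while skipping the two sections that must stay different. It is not a replacement for the SQL script; the function names are my own, and I am assuming the files parse as plain INI files.

```python
import configparser

# Sections that must differ between the sites and must never be copied over.
PROTECTED = {
    ("global.ini", "system_replication"),
    ("nameserver.ini", "landscape"),
}


def load_ini(ini_text):
    """Parse INI text into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return {section: dict(parser[section]) for section in parser.sections()}


def parameter_deviations(file_name, primary_text, secondary_text):
    """List (section, key, primary_value, secondary_value) for mismatches."""
    primary = load_ini(primary_text)
    secondary = load_ini(secondary_text)
    deviations = []
    for section in sorted(set(primary) | set(secondary)):
        if (file_name, section) in PROTECTED:
            continue  # expected to differ, leave these alone
        p_keys = primary.get(section, {})
        s_keys = secondary.get(section, {})
        for key in sorted(set(p_keys) | set(s_keys)):
            if p_keys.get(key) != s_keys.get(key):
                deviations.append((section, key, p_keys.get(key), s_keys.get(key)))
    return deviations
```

A value of None in the result means the parameter is missing on that side. Parameters you set differently on purpose (such as the memory settings discussed earlier) will show up too, so treat the output as a list to review rather than a list to fix.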
Network Related
-‘Communication Channel Closed’ errors: the replication server is either down or there is a networking error. (Check whether the HANA services are running; if they are, talk to your networking team about blocked ports.)
-(DataAccess/impl/DisasterRecoveryProtocol.cpp:3478) Asynchronous Replication Buffer is Overloaded exception throw location:
This error occurs only if you choose ASYNC replication, and it can occur if the network is slow. You can check your network statistics with the following table
HOST_VOLUME_IO_TOTAL_STATISTICS or run the SQL script
HANA_Replication_SystemReplication_Bandwidth
If you need to resolve this issue before looking into your network, you can do one of the following:
1) Change the replication mode: hdbnsutil -sr_changemode --mode=sync|syncmem
2) Set global.ini->system_replication->logshipping_async_wait_on_buffer_full = false; this will temporarily decouple the synchronization.
Registration fails
Issue:
Unable to contact primary site error: at 30001
Solution:
Check the host names you have entered. Some things to check:
The hostnames are unique
The secondary host name is not a substring of the primary
Do not use the IP address
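The three checks above can be run against your host lists before you register. This is a minimal Python sketch; the function names and the wording of the messages are my own, and it only covers the rules listed here.

```python
import ipaddress


def is_ip_address(name):
    """True if the given name is an IPv4 or IPv6 address literal."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def hostname_problems(primary_hosts, secondary_hosts):
    """Return a list of human-readable problems with the host names."""
    primary = list(primary_hosts)
    secondary = list(secondary_hosts)
    problems = []
    for host in primary + secondary:
        if is_ip_address(host):
            problems.append(f"'{host}' is an IP address, use the host name")
    for host in sorted(set(primary) & set(secondary)):
        problems.append(f"'{host}' is used on both sites")
    for sec in secondary:
        for pri in primary:
            if sec != pri and sec in pri:
                problems.append(
                    f"secondary host '{sec}' is a substring of primary host '{pri}'"
                )
    return problems
```

An empty list means none of the three rules is violated; it does not prove that name resolution between the sites works.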
Issue:
f sr_nameserver TREXNameServer.cpp(10651) : remoteHost does not match with any host of the source site. Please ensure that all hosts of source and target site can resolve all hostnames of both sites correctly.
Solution:
Run the following query and compare the host names it returns with the ones used during registration:
select name from m_topology_tree where path = '/host/'
Startup of secondary fails
Issue:
Secondary nameserver startup fails after registration of secondary to primary: TREXNameServer.cpp(02876) : source site is not active, cannot start secondary site. Please run hdbnsutil -sr_takeover in case of a disaster or start primary site first. -> stopping instance ..
Solution:
Do not use secondary hostnames that are substrings of primary hostnames.
Issue:
nameserver server:30001 not responding.
collecting information …
error: source system and target system have overlapping logical
hostnames; each site must have a unique set of logical hostnames.
hdbrename can be used to change names;
failed.
Solution:
This is caused by connection timeouts, but if you see it only for a few services, check whether the landscapes are the same.
MultiDB issue
Issue:
“unhandled ltt exception: exception 1000003:
Index 1 out of range [0, 0)” when I check the sr_state after running
Solution:
Resolved in 97.01 and 102
Takeover
Issue:
i LogReplay RowStoreTransactionCallback.cc(00226) : starting master-slave DTX consistency check
e LogReplay RowStoreTransactionCallback.cc(00264) : Slave volume 3 is not available
Solution:
Resolved in revisions 74.04 and 82
Workaround:
1) Set the following INI parameters to ‘false’ in indexserver.ini and statisticsserver.ini
[transaction]
check_slave_on_master_restart = false
check_global_trans_consistency = false
2) Then restart your system.
Issue:
From time to time the takeover process hangs
w Backup BackupMonitor_TransferQueue.cpp(00048) : Master index server not available!
The following trace entries are written to the trace file, and there is a time gap of 30 minutes in the trace:
[11596]{-1}[-1/-
i PersistenceManag PersistenceManagerImpl.cpp(02359) : Activating periodic savepoint, frequency 300
e TrexNet Channel.cpp(00362) : active channel 33 from 53223 to 127.0.0.1:30001: reading failed with timeout error; timeout=1800000ms elapsed
Solution:
There is no workaround; this issue is fixed in revisions 85.02 and 90.
Issue:
If a takeover is performed on a secondary system where not all tenants could be taken over (e.g. because they were not initialized yet), then the takeover flag is not removed from the topology (/topology/datacenters/takeover/*)
Solution:
Resolved in HANA 10.1
Crash on secondary
indexserver crash at DataRecovery::LoggerImpl::IsSecondaryBackupHistoryComplete on the secondary system.
The bug is fixed as of revision 90 so a permanent solution is available via an upgrade.
In the interim the workaround to the issue is the setting of the parameter [system_replication] ensure_backup_history = false within the global.ini file.
The setting of this parameter disables the maintenance of the backup history. The takeover process is not affected by this parameter but full recovery scenarios after takeover (using old primary data/log backups with new primary log backups) may be impacted.
SAP Notes
1995412 – Secondary site of System Replication runs out of disk space due to closed data shipping connection
1945676 – Correct usage of hdbnsutil -sr_unregister
2057595 – FAQ: SAP HANA High Availability
2100052 – How to disable parameter mismatch alert for system replication
2050830 – Registering a secondary system via HANA Studio fails with error ‘remoteHost does not match with any host of the source site’
2021186 – Garbage collection takes a long time during HANA service restart
2075771 – SAP HANA DB: System Replication – Possible persistence corruption on secondary site
1852017 – Error 10061 when connecting SAP Instances to failed over HANA nodes
2063657 – HANA System Replication takeover decision guideline
2062631 – high availability limitation for SAN storage
2129651 – Indexserver crash caused by inconsistent log position when startup
1681092 – Multiple SAP HANA DBMSs (SIDs) on one SAP HANA system
2033624 – System replication: Secondary system hangs during takeover
2081563 – secondary system’s replication mode and replication status changed to “UNKNOWN”
2135107 – Log segment for backup history is still missing after reconnect with log shipping