Aux Copy Speed problem

Last post 05-24-2016, 11:45 AM by farmer. 32 replies.
  • Aux Copy Speed problem
    Posted: 10-27-2015, 1:48 PM
    I'm having some difficulty in troubleshooting why my Aux Copy process is not fully utilizing the bandwidth I have available.
    I've read up on a lot of the posts here and through the available documentation and still haven't found the fix.  Additionally, I have a ticket open and am working with an engineer as well.

    Scenario:
    Simpana 10, SP11 all latest hotfixes applied.
    2 Media Agents, 1Gig Ethernet, plenty of CPU, RAM and Disk.
    7 different storage policies, each with between 1 and 60 associated subclients, though most have fewer than 10. All policies are set to Disk Optimized for the DASH copy setting.

    For a period of time, we had the Media Agents sitting next to each other and connected to the same network. When we ran the scheduled aux copy jobs, they finished within a few hours, disk usage was low, and they fully utilized the 1 Gbit/s link.

    We then moved our secondary media agent to a remote site with only a 200 Mbit/s line. Without changing any settings, now we're only seeing a throughput of 100 Mbit/s. Again, disk usage etc remains low, but we never see the network usage we expect. The Media Agent is the only system using this connection so it's not fighting for availability.

    I've verified we're not doing any network throttling in any of the multiple places you can set that. We have the aux copy schedule set to allow maximum readers. We flipped the DASH copy to Network Optimized and have all the speed improvement keys in place, and didn't see a noticeable difference other than our source disks taking a pounding. Completion time was still similar.

    So, what other things can I look at to troubleshoot where the slowness is coming from? My Media Agent hardware is capable of sending/receiving data at 1 Gbit/s and of handling that data on disk as well. Why would I not be utilizing my full bandwidth on a smaller pipe?
  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 2:17 PM
    • isle is not online. Last active: 06-04-2019, 3:28 PM isle
    • Top 25 Contributor
    • Joined on 08-21-2012
    • NJ
    • Adept
    • Points 308

    How are the destination Q&I times?

    DASH copy with Disk Read Optimized should be faster, since Network Optimized causes more reads on the source disk.

    Have you enabled stream randomization and "distribute streams evenly" on the storage policy?

    Have you used any of the additional settings? There are a few you can change, but I would change them one at a time, and monitor before adjusting them as well.

    http://documentation.commvault.com/commvault/v10/article?p=features/auxiliary_copy/best_practices.htm

     

    Additional Settings:

     

    UseCacheDB: Value 0 (default): the client-side deduplication database is kept in memory. Value 1: the client-side deduplication database is kept on disk, in the Job Results directory of the source Media Agent.

    DataMoverUseLookAheadLinkReader: Value 1: reads multiple data signatures from the deduplication database. This is useful to optimize network performance and reduce the time taken to read the data during a backup or Auxiliary Copy operation.

    DataMoverLookAheadLinkReaderSlots: Default: 4 (max 16). Improves DASH copy and DASH full efficiency by performing more lookups on the destination deduplication databases (DDBs) for the deduplication block signatures.

     

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 2:21 PM

    Hello,

    By design, DASH Copy is not supposed to saturate a WAN pipe. Typically, if you're getting a very good dedupe ratio, fewer data blocks are sent.

    When you say you're seeing slowness, how are you seeing it?

    Are you looking at the job throughput?

    Are you looking at backup window time (i.e. getting the primary daily changes across to the secondary)?

    Are you purely looking at network traffic?

    Have you tried increasing the number of streams you use on the primary (source side) policy? That in turn lets you increase the number of streams used for the DASH copy, which will help to use more of the pipe.

    Regards

    SG1

     

     

     

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 3:12 PM

    isle:

    How are the destination Q&I times?

    DASH copy with Disk Read Optimized should be faster, since Network Optimized causes more reads on the source disk.

    Have you enabled stream randomization and distribute streams evenly on the storage policy?

    Additional Settings:

    UseCacheDB:...

    DataMoverUseLookAheadLinkReader: ...

    DataMoverLookAheadLinkReaderSlots: ...

     

    Sorry, not sure what you mean by Q&I.  The destination MA is quite idle regarding cpu, memory, disk, etc.

    We saw no significant completion time improvement when switching between Disk Read and Network Optimized, all we saw was a large increase in source disk activity.

    Stream randomization and distribute streams evenly is enabled on our Windows file storage policies but not our SQL ones.

    We have all the additional settings mentioned above in place on our media agents.  From what support told me, the UseCacheDB is only applicable for the Network Optimized setting.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 3:18 PM
    • isle

    Q&I = query and insert

    http://documentation.commvault.com/commvault/v10/article?p=features/deduplication/t_monitoring_ddb.htm

     

     

    Disk read optimized should be used if the source is deduped.

    Why are you using network??

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 3:24 PM

    SG1:

    By design, DASH Copy is not supposed to saturate a WAN pipe. Typically, if you're getting a very good dedupe ratio, fewer data blocks are sent.

    When you say you're seeing slowness, how are you seeing it?

    Are you looking at the job throughput?

    Are you looking at backup window time (i.e. getting the primary daily changes across to the secondary)?

    Are you purely looking at network traffic?

    Have you tried increasing the number of streams you use on the primary (source side) policy? That in turn lets you increase the number of streams used for the DASH copy, which will help to use more of the pipe.

    Slowness in that my aux copy jobs indicate they'll take 14 days to copy what I backed up the night before, and they're not utilizing all the bandwidth they could. I guess what I'm looking for is the slow link in this chain. When I was on a 1 Gbit/s pipe, the jobs used all of it to copy data and the MAs weren't really stressed, so the assumption is that the pipe was the limiting factor: it was saturated, and the MAs appeared able to handle more data than they were receiving.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 3:29 PM

    isle:

    Q&I = query and insert

    http://documentation.commvault.com/commvault/v10/article?p=features/deduplication/t_monitoring_ddb.htm

     

     

    Disk read optimized should be used if the source is deduped.

    Why are you using network??

     

    I'll check on the Q&I.

    We've been set up on Disk Read Optimized. We flipped it over to Network Optimized to test whether that improved the speed, since it's over a 200 Mbit/s link vs a 1 Gbit/s link. We determined that it did not and have since changed the copies back to Disk Read Optimized.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 3:40 PM

    Average Q&I time for my Source Side DDB is 273.  Destination side is 264.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 3:48 PM
    • isle

    Give these a try and restart the aux copy 

    DataMoverUseLookAheadLinkReader:  Value 1

    DataMoverLookAheadLinkReaderSlots: Value: 16

     

     

    I would take a screen shot of the streams before and after

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 3:54 PM

    Those settings have been in place since probably February this year on all our Media Agents.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 4:00 PM
    • isle

    Have you tried removing them and seeing if there is a difference?

    What kind of speeds do you see with a Windows copy to a UNC share at the destination?

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 4:04 PM

    With a UNC share copy I can get over 189 Mbit/s.

    I haven't tried removing them.  I can certainly do so though.  Once they're removed, they take effect immediately?  I understand I'll have to kill the current jobs and run new aux copies.  Just want to verify I don't need to restart services on the MAs or anything like that.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 4:07 PM
    • isle

    Restarting the job will honor the new settings.

    In certain environments the "additional settings" have performance issues.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 4:09 PM

    The Additional settings I currently have are:

     

    DataMoverLookAheadLinkReaderSlots: 16

    DataMoverUseLookAheadLinkReader: 1

    SignaturePerBatch: 16

    UseAuxcopyReadlessPlus: 1

    UseCacheDB: 1 (I believe this only takes effect when using Network Optimized)

     

    Going to remove the top 2.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 4:12 PM
    • isle

    Where does your job results directory sit for the source media agent?

    I would actually remove these:

    UseAuxcopyReadlessPlus: 1

    UseCacheDB: 1 (I believe this only takes effect when using Network Optimized)

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 4:31 PM

    I'll run this for a bit with the top 2 disabled and see what happens.

     

    Job results directory sits on C:\Program Files\CommVault\Simpana\iDataAgent\JobResults for everything in our environment.

    The C drive for our MAs is spinning disk.  Index Cache and DDB are SSDs.  Not sure if that's pertinent.

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 4:34 PM
    • isle

    Depending on the Disk Queue you could be slowing it down.

     

    Without the key, or with a value of zero, it's hashed in memory.

    On disk it provides local lookups, in your case on the C drive.

     

  • Re: Aux Copy Speed problem
    Posted: 10-27-2015, 4:56 PM

    Disk activity is pretty minimal on C:\.  Which key are you referring to regarding "Without the key or a value of zero its hashed in memory."?

  • Re: Aux Copy Speed problem
    Posted: 10-28-2015, 2:24 PM
    • isle

    UseCacheDB with value = 1 stores it on disk instead of memory.

     

    How are things running after changing the settings?

  • Re: Aux Copy Speed problem
    Posted: 10-28-2015, 3:57 PM

    Mixed.  Some seem to be slightly better, ie they have a lower expected completion time, but some are higher.

     

    Currently running with DataMoverLookAheadLinkReaderSlots at 16,

    DataMoverUseLookAheadLinkReader at 1,

    SignaturePerBatch at 16,

    UseAuxcopyReadlessPlus is disabled

    UseCacheDB is disabled.

  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 3:59 PM

    If you ping the destination media agent from the source media agent, what's the response time (latency)? And what's the throughput of the job on average?
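
    A rough way to sample this from the source MA without ICMP, as a sketch only: time a TCP connect to a port the destination MA is listening on. The function name, host name, and port in the usage comment are hypothetical placeholders rather than Commvault defaults, and a connect time includes a little handshake overhead, so treat it as an approximation of the ping round trip.

```python
import socket
import time
from typing import List


def tcp_connect_rtt_ms(host: str, port: int, samples: int = 5,
                       timeout: float = 2.0) -> List[float]:
    """Return the time, in milliseconds, taken by each of `samples` TCP connects."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection returns once the three-way handshake has completed,
        # which takes roughly one network round trip.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        rtts.append(elapsed * 1000.0)
    return rtts


# Hypothetical usage (host and port are placeholders, pick a port that is open):
#   rtts = tcp_connect_rtt_ms("destination-ma.example.com", 445)
#   print(f"min {min(rtts):.1f} ms, avg {sum(rtts) / len(rtts):.1f} ms")
```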

  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 4:05 PM

    12ms is consistently what we're getting back.

  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 4:17 PM

    What's your usual aux throughput right now? 250-400GB/Hr?

  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 4:27 PM

    Looking at the graph my router is showing me around 95-110Mbps.  The remote media agent is the only device on this connection.

  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 4:30 PM

    Adding up all the Aux Copy Average Throughputs, I'm seeing about 213 GB/hr.
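
    The router figure (Mbit/s) and the job figure (GB/hr) are in different units, so a quick conversion helps when comparing them. This is a minimal sketch assuming decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits); the function names are made up for illustration.

```python
def mbit_per_s_to_gb_per_hr(mbit_per_s: float) -> float:
    """Convert a link rate in Mbit/s to GB/hr (decimal units)."""
    bytes_per_s = mbit_per_s * 1e6 / 8
    return bytes_per_s * 3600 / 1e9


def gb_per_hr_to_mbit_per_s(gb_per_hr: float) -> float:
    """Convert a throughput in GB/hr to Mbit/s (decimal units)."""
    bits_per_s = gb_per_hr * 1e9 * 8 / 3600
    return bits_per_s / 1e6


if __name__ == "__main__":
    # What the wire can carry per hour at the observed and the purchased rate.
    print(f"100 Mbit/s = {mbit_per_s_to_gb_per_hr(100):.1f} GB/hr")
    print(f"200 Mbit/s = {mbit_per_s_to_gb_per_hr(200):.1f} GB/hr")
    # What the summed job throughput would need if every byte crossed the wire.
    print(f"213 GB/hr  = {gb_per_hr_to_mbit_per_s(213):.1f} Mbit/s")
```

    By this arithmetic 100 Mbit/s is about 45 GB/hr, while 213 GB/hr would need roughly 473 Mbit/s on the wire. One possible reading, which is an assumption and not something confirmed in this thread, is that the job throughput counts data that deduplication avoided sending.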

  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 4:31 PM

    What does the aux copy job show? Can you give me the numbers for "Average Throughput", "Total Data Processed" and "Data Transferred on Network" in the general tab of the job's properties, and also "Over Throughput" under the progress tab?

  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 4:38 PM

    Screenshot of all our aux copies and throughput attached.


    Attachment: Aux Copies.PNG
  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 4:50 PM

    With our other media agent, which is exactly the same hardware wise, these jobs take maybe a couple hours.  They're on a 1 Gbps connection with a 1ms latency.

    I'm wondering whether the round trip time is what's restricting our ability to fully utilize the 200 Mbit/s connection we have.  We had this same server on a 100 Mbit/s connection and it would peg that one out.
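
    To put numbers on the round trip question, here is a sketch of the standard window-limited estimate: a single TCP stream cannot move more than one window of data per round trip. The 64 KiB window is an assumption chosen for illustration (the classic limit without window scaling); real stacks usually scale the window, and nothing in this thread says what the MAs actually negotiate.

```python
def window_limited_mbit_s(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one stream's throughput: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def bandwidth_delay_product_bytes(link_mbit_s: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a link of this speed and RTT full."""
    return link_mbit_s * 1e6 / 8 * rtt_ms / 1000.0


if __name__ == "__main__":
    window = 64 * 1024  # assumed window size, not a measured value
    for rtt in (1, 12):
        rate = window_limited_mbit_s(window, rtt)
        print(f"RTT {rtt:>2} ms: at most {rate:.1f} Mbit/s per stream")
    bdp = bandwidth_delay_product_bytes(200, 12)
    print(f"In-flight data needed for 200 Mbit/s at 12 ms: {bdp:.0f} bytes")
```

    Under that assumption a stream tops out near 44 Mbit/s at 12ms versus about 524 Mbit/s at 1ms, and filling 200 Mbit/s at 12ms needs about 300,000 bytes in flight across all streams.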

  • Re: Aux Copy Speed problem
    Posted: 11-05-2015, 4:55 PM

    This graph is pretty typical of what I see.

     


    Attachment: traffic.PNG
  • Re: Aux Copy Speed problem
    Posted: 11-06-2015, 10:55 AM

    Yeah, I had a couple of sites with a similar latency and was seeing aux at about 250GB/hr as well.


    I didn't get CV to confirm this, but I believe that the fact that you were using all of your 100 Mbit/s pipe before was kind of a coincidence, as it equates to roughly 250GB/Hr. In my testing, I found that this limit is mostly caused by a 12ms latency between MAs (at least in my environments). One of my sites actually had a 6 Gbit/s link and auxes were not really faster.


    *************** DISCLAIMER ***************
    Now, I did manage to find some sort of workaround which I think is pretty good but you will definitely not find it on any best practices / white papers on Commvault's website. I'll give you the steps but you should probably contact Commvault about it (I would ask your sales rep to put you in contact with an engineer as support obviously won't help you here). You can also do it by yourself like I did but I make no guarantee that this will work in your environment. Also, as I did not validate any of this with Commvault, anything that appears as being a statement below is really just what I think based on my own finding/testing.
    *************** DISCLAIMER ***************


    Now that this is out of the way, here we go!


    When you do an aux, the source MA keeps asking the destination MA to check if each block signature is present in its DDB. This causes a lot of small chatter on the link which is very bad on high latency links (in this case I consider 12ms as high latency).
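
    As a toy model of why this chatter hurts, and only that: if the source has to wait one round trip for each batch of signature answers, the lookup rate is capped by the latency regardless of link speed. Every name and parameter below is hypothetical, the 128 KiB block size is an assumption, and this is not a description of Commvault's actual wire protocol.

```python
def lookup_bound_gb_per_hr(rtt_ms: float,
                           signatures_per_round_trip: int = 1,
                           round_trips_in_flight: int = 1,
                           block_kib: int = 128) -> float:
    """Upper bound on data processed per hour when each round trip
    resolves a fixed number of block signatures (toy model)."""
    lookups_per_s = (signatures_per_round_trip * round_trips_in_flight
                     / (rtt_ms / 1000.0))
    bytes_per_s = lookups_per_s * block_kib * 1024
    return bytes_per_s * 3600 / 1e9


if __name__ == "__main__":
    for rtt in (1, 12, 52):
        serial = lookup_bound_gb_per_hr(rtt)
        batched = lookup_bound_gb_per_hr(rtt, signatures_per_round_trip=16)
        print(f"RTT {rtt:>2} ms: {serial:7.1f} GB/hr one at a time, "
              f"{batched:8.1f} GB/hr with 16 per round trip")
```

    In this model the bound scales inversely with latency, so going from 1ms to 12ms cuts it by a factor of 12 even if the link itself is never full. Batching or more concurrent requests raise the bound by the same factor they add.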


    So, to help reduce this, I needed a way to remove the latency from the aux process. The first attempt was to copy the destination DDB back to my source MA.


    This didn't help much, as the default MA for the destination maglib was still the destination MA, so my guess was that the chatter was still occurring over the network because the destination MA was still identified as the default MA. Unfortunately, you can't change that to a MA that doesn't have access to the maglib, so I then shared all the destination maglib paths with my source MA. Since my storage was all block at the time, I basically used \\DestinationMA\d$\MagLibPaths with a specific user account and added all the destination paths to my local MA going through the destination MA's UNC paths. Using a NAS would have been better but that was not an option for me.


    Anyway, once I added all the paths to the source MA, I was then able to set my source MA as the default MA for that storage policy copy and this in turn made my aux copies go past 1TB/hr (when no data needed to be copied which should be most of it when copying your pre-seeded fulls). Obviously, when copying actual new unique data, you're bound to the speed limit of the physical link.


    So, the final config looked like this:


    1. Source MA hosts both source and destination DDB on SSD.

    2. Source MA has all the destination maglib paths visible and is the default MA for the destination DDB / secondary copies pointing to that DDB.

    3. Destination MA is now only acting as a "proxy" to the block maglib (could be avoided if using CIFS NAS).

    4. Secondary copies are configured for disk optimized DASH copies.

    5. In v9, I was using the lookAheadReader key but in V10 I didn't find it to provide any improvements as the DDB was going much faster than the v9 DDB.


    Now, a few questions you might ask:


    1. If the DDB is on the source MA and I lose the whole site, am I screwed? No. DDB is not needed for restore. If you have a whole DR site at your destination environment, you can start backing it up using another DDB.

    2. I'm already using the destination DDB for local backups at that site. What now? I personally recommend not "sharing" the aux DDB with other backups. Here's my logic: the DDB has a front end / primary block limit. For simplicity's sake, let's say that this limit is 100TB. If you share your destination DDB with other backups at the destination site, you run into the possibility of "filling up" one DDB before the other. Say your source front end data was originally 50TB and your destination front end data 30TB. Now, your source has grown to 80TB and destination to 50TB. Well, in theory you're past the DDB limit and performance will suffer for both your auxes and remote backups. I prefer to keep a local and destination DDB dedicated to a "data set" to avoid this issue. There is a small hit on disk usage but it should usually be very minimal.

    3. I only have 1 SSD drive for my source MA, what do I do with the destination? If it has enough space, I would usually put it on the same drive if I know that backups won't really run at the same time as auxes (often the case) and/or if my testing shows the performance hit is minimal. Remember, you're probably still getting a better aux throughput because of it. Ideally, you might want to eventually put it on its own SSD.

    4. Wouldn't the UseCacheDB and UseAuxcopyReadlessPlus have the same impact as what you described? You'd think so but in my case it didn't help. Remember, your CacheDB will usually be a very small subset of the actual destination DDB. Unless you allow it to grow to hundreds of GBs, there should always be a bunch of signature requests going over the link.
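
    The arithmetic behind question 2 can be sketched as a simple sum against the limit. The 100TB figure is the illustrative limit from that answer, not a documented value, and the function name is made up.

```python
from typing import Iterable


def ddb_over_limit(front_end_tb: Iterable[float], limit_tb: float = 100.0) -> bool:
    """True if the data sets sharing one DDB exceed its front end limit."""
    return sum(front_end_tb) > limit_tb


if __name__ == "__main__":
    # One DDB shared by the aux copy (source data) and local destination backups.
    print("shared, originally:", ddb_over_limit([50, 30]))
    print("shared, after growth:", ddb_over_limit([80, 50]))
    # Dedicated DDBs: each data set is checked against its own limit.
    print("dedicated, source:", ddb_over_limit([80]))
    print("dedicated, destination:", ddb_over_limit([50]))
```

    With the numbers from the answer, the shared DDB goes from 80TB to 130TB and crosses the limit, while two dedicated DDBs at 80TB and 50TB each stay under it.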


    Wow, I think I wrote enough stuff for now. Sorry for the very long message but I prefer you know what you're getting into if you decide to try it out.


    Phil

     

  • Re: Aux Copy Speed problem
    Posted: 03-11-2016, 9:50 AM

    Phil,

    That is genius!  I have been battling slow DASH copies of highly deduped backups across a 1 Gbit/s, 52ms WAN link for over a year now, working with CV support. I am going to propose your solution and see what I get back from them.

       Thank you for the GREAT IDEA!

  • Re: Aux Copy Speed problem
    Posted: 03-21-2016, 3:07 PM

    farmer, let me know how it goes.  We're still crawling along getting further and further behind.  Looking at only copying fulls vs incrementals, diffs, etc.

  • Re: Aux Copy Speed problem
    Posted: 05-24-2016, 11:45 AM

    We did get it set up with the 2 DDBs at the source site, and the source media agent writing to the shared maglib at the remote site. For some reason that still did not help. The hunt continues to figure out why it will not run faster.
