Pruning & stats

Last post 07-10-2018, 5:30 AM by Ola Lawal. 3 replies.
  • Pruning & stats
    Posted: 05-17-2018, 12:58 AM

    Our largest issue with CV at the moment is getting pruning to play nicely with DASH copies|backups.  If a large pruning run overlaps the window in which backups|DASH copies operate, both pruning and the other operations slow down drastically and cascade into a slow mess of pruning & DASH copy backlog and queued jobs.

    We've got local SSDs with Q&I times ~300µs, and finalize times generally from under 1 second up to 5 seconds.  When pruning runs alongside DASH copies|backups, both of the above can spike drastically.

    We've got auto-kill set on our DASH copies before DDB backups and before the main backup window kicks off.  Operational windows only go so far, particularly if you've already got a backlog from a DDB recon; once a backlog's built up I find I need to use every available hour to get it back down again, and operational windows can result in wasted time not doing something that needs doing.  (Also, SMM_DONOTPRUNE only affects data aging and not pending deletes already sent out to the MAs.)

    Ideally we'd like more control and reporting of pruning.  Currently the most information we can get from the GUI is the slope of the graph.

    While I can grep finalize times from SIDBPrune and track current pending deletes per MA & the size thereof from SIDBEngine, this isn't enough information to make changes and see if something is helping or hindering.
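
    (Roughly what I'm grepping, from memory; the exact strings vary by CV version:)

    # rough patterns from memory; exact log text varies by CV version
    grep -i 'finalize' SIDBPrune*.log            # chunk finalize times
    grep -i 'Pending Deletes' SIDBEngine*.log    # pending deletes per MA and size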

    Stuff that'd be useful:

    Add the size (TB) of the pending deletes per GDSP and MA to the GUI somewhere.

    Add pending deletes and size reclaimed per hour counters to either the commserve or private reporting server (both preferred).

    Does CV support have a parser that grabs all the "Avg" times from SIDBPrune.log and does something useful with them (graphs|groups values by hour)?  If so, this would be good to release to customers.

     

    Also, while we're on a similar topic: the non-error "Unable to acquire the Lock to File" messages for pruning on partitioned GDSPs.  I thought AFIDs to prune were unique to an MA, so why are multiple MAs trying to prune the chunk of a stream at the same time?

  • Re: Pruning & stats
    Posted: 07-09-2018, 8:12 PM

    Point 1

    =======

    Our largest issue with CV at the moment is getting pruning to play nicely with DASH copies|backups.  If a large pruning run overlaps the window in which backups|DASH copies operate, both pruning and the other operations slow down drastically and cascade into a slow mess of pruning & DASH copy backlog and queued jobs.

    Answer

    =======

    I am really sorry about the issues you are experiencing with Commvault, and I hope my answers assist you in every way possible.

    a.) The three main Commvault operations that make intense use of IOPS on the Media Agents are synthetic fulls, aux copy jobs and data aging jobs (Media Agent physical pruning).

    b.) If any of the three listed schedules collide with each other, performance of both aux\DASH copy jobs and pruning jobs will suffer, with synthetic full jobs taking the highest precedence, then aux copy, then data aging respectively.

    c.) Commvault has provided a means to speed up the physical pruning threads on Media Agents; however, this drives higher IOPS, as more Media Agent resources are utilised during the operation.

    Please add the Additional Key to your Media Agent:

     

    MediaAgent --> DedupMaxDiskZerorefPrunerThreadsForStore

    Value --> 1-25

    Type --> DWORD|Integer

     

    Note: usage is for disk libraries only; increases pruning throughput on the Media Agent. The default value is 3 (start from a low value and work up).

    Helps generate more physical pruning threads when magnetic pruning is perceived to be slow.

     

    Reference: http://documentation.commvault.com/additionalsetting/details?name=DedupMaxDiskZerorefPrunerThreadsForStore
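
    On a Windows Media Agent you can add the key from an elevated command prompt. A minimal sketch, assuming Instance001 and the same Galaxy registry path shown under Point 6 below (the value 6 is just an illustration; valid range is 1-25):

    rem illustration only: start low (default is 3) and adjust; valid range is 1-25
    reg add "HKLM\SOFTWARE\CommVault Systems\Galaxy\Instance001\MediaAgent" /v DedupMaxDiskZerorefPrunerThreadsForStore /t REG_DWORD /d 6 /f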

     

    Point 2

    =======

    We've got local SSDs with Q&I times ~300µs, and finalize times generally from under 1 second up to 5 seconds.  When pruning runs alongside DASH copies|backups, both of the above can spike drastically.

     

    Answer

    =======

    Yes, this is true, as both DASH copies and pruning are resource-intensive operations. The best recommendation is to stagger the schedules so that the two operations don't run at the same time.

     

    Also, what's the front-end size of the data being backed up? The average Q&I time should be under 200 milliseconds.

    Please refer to the hardware specifications for deduplication mode and ensure the Media Agents hosting the DDBs meet the hardware specs:

    http://documentation.commvault.com/commvault/v11/article?p=1656.htm

     

    Point 3

    =======

    We've got auto-kill set on our DASH copies before DDB backups and before the main backup window kicks off.  Operational windows only go so far, particularly if you've already got a backlog from a DDB recon; once a backlog's built up I find I need to use every available hour to get it back down again, and operational windows can result in wasted time not doing something that needs doing.  (Also, SMM_DONOTPRUNE only affects data aging and not pending deletes already sent out to the MAs.)

     

    Ideally we'd like more control and reporting of pruning.  Currently the most information we can get from the GUI is the slope of the graph.

     

    While I can grep finalize times from SIDBPrune and track current pending deletes per MA & the size thereof from SIDBEngine, this isn't enough information to make changes and see if something is helping or hindering.

     

    Answer

    ======

    The already provided additional key DedupMaxDiskZerorefPrunerThreadsForStore will help to accelerate the pruning threads. However, if something is hindering pruning, SIDBPrune.log and SIDBPhysicalDeletes.log will show us errors that could be affecting physical pruning on the Media Agent.

     

    At the moment, the pending delete records are only visible via the SIDB store in the CommCell Console, the graph and SIDBEngine.log; we cannot see the application size of pending physical deletes at this stage.

     

    Point 4

    =======

    Stuff that'd be useful:

     

    Add the size (TB) of the pending deletes per GDSP and MA to the GUI somewhere.

     

    Answer

    ======

    Please raise a Commvault Support case to see if a Customer Modification Request could be raised for this feature.

     

    Point 5

    =======

     

    Add pending deletes and size reclaimed per hour counters to either the commserve or private reporting server (both preferred).

     

    Answer

    ======

    Please raise a Commvault Support case to see if a Customer Modification Request could be raised for this feature.

     

    Alternatively, you can change the "Interval between disk space check" setting to 15 (Control Panel --> Media Management --> Interval between disk space check).

     

    Then right-click the library --> Properties tab and check the amount of disk space reclaimed.

     

    Point 6

    =======

    Does CV support have a parser that grabs all the "Avg" times from SIDBPrune.log and does something useful with them (graphs|groups values by hour)?  If so, this would be good to release to customers.

     

    Yes, we do have a counter that allows analysis of the "Avg" times in SIDBPrune.log, but it is not used for graphing. Please enable the setting below on your Media Agent and it will log the average finalize time to prune a chunk of data in your CV magnetic library.

     

    Enable dedupe pruning counters on your Media Agent

    ----------------------------------------------

     

    HKEY_LOCAL_MACHINE\SOFTWARE\CommVault Systems\Galaxy\Instance001\MediaAgent

    Type: INTEGER

    Name: DedupEnablePruningCounters

    Value: 1
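
    The counter is not used for graphing, but as a rough sketch (assuming the usual CV log layout of pid/tid/date/time columns and an "Avg" token followed by its numeric value, which can differ between versions), something like this could group the values by hour:

    # sketch only: assumes log columns <pid> <tid> <MM/DD> <HH:MM:SS> ... and an
    # "Avg" token followed by its numeric value; adjust the patterns per version
    grep -ih 'Avg' SIDBPrune*.log | awk '{
        hour = substr($4, 1, 2)                      # HH from the HH:MM:SS column
        for (i = 5; i < NF; i++)
            if ($i ~ /^Avg/) { sum[hour] += $(i + 1); cnt[hour]++ }
    } END {
        for (h in sum)
            printf "%s:00  avg=%.2f  samples=%d\n", h, sum[h] / cnt[h], cnt[h]
    }' | sort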

     

    Point 7

    =======

    Also, while we're on a similar topic: the non-error "Unable to acquire the Lock to File" messages for pruning on partitioned GDSPs.  I thought AFIDs to prune were unique to an MA, so why are multiple MAs trying to prune the chunk of a stream at the same time?

     

     

    Answer

    ======

    As you already know, the SFILE.idx is locked so that two pruning MAs do not attempt to prune the same chunk at the same time. So, if there is already one MA that has locked an SFILE.idx, any other MAs trying to prune that chunk at the same time will hit this error.

     

    However, when it comes to pruning, multiple Media Agents have the ability to prune as long as they have read\write access to the mount paths of the library, regardless of whether a mount path is associated with a DDB. You can set each Media Agent to be a pruner for a particular library, but you cannot set a particular Media Agent to prune based on DDB. AFIDs are unique to the DDB, but pruning happens on the Media Agent, which is why all the Media Agents try to prune at a mount path\library level and not at a DDB level for data written to libraries.

     

    To set your Media Agent to prune for a specific library, please make use of the script below:

    qoperation execscript -sn SetDeviceControllerProps -si operation -si LibraryName -si MountPathName -si MediaAgentName

    http://documentation.commvault.com/commvault/v11/article?p=features/cli/qscripts/CommServ.QS_SetDeviceControllerProps.Readme.html
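
    For example, with hypothetical library, mount path and MA names (see the readme above for the valid values of the operation argument):

    qoperation execscript -sn SetDeviceControllerProps -si <operation> -si "DiskLib01" -si "MountPath01" -si "MA01"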

     

     

    I wish you all the best with Commvault. Please do let me know if you have any questions.

     

    Thanks.

     

    Ola.

  • Re: Pruning & stats
    Posted: 07-09-2018, 8:49 PM

    Heya Ola,

     

    Thanks for the detailed reply.

     

    Since I posted that, we've had a CV onsite engineer and spent a week working through our configuration and issues.  Short version is our NetApp media speeds (reads in particular) aren't fast enough, and our remediation plan to move to NFS 4.2 (sparse file support) boxes suffered from the same issue (writes twice as fast as reads).  We're almost halfway through the first migration and need to complete that to reclaim NetApp overhead (WAFL etc.), after which the plan is to move to DataServer-IP via direct RDMs, which will hopefully sort the disk read issues.  Our disk read speeds are helped by cache in the disk arrays; however, this helps much less for writes.

     

    1.  We've enabled that additional setting "DedupMaxDiskZerorefPrunerThreadsForStore" since the post, currently running at 15; I tried more and it looked to slow things down.  However, I'm measuring the speed of pruning only subjectively, by how fast SIDBPrune.log is scrolling past.  Some sort of counter to judge pending deletes processed per hour would be good.  I'll get a case in for a CMR.

    2.  Baseline size from the DDB is 250 TB, app size 323 TB.  4x partition DDBs; we're not hitting resource limits for CPU/RAM, the limiting factor looks to be DDB SSD random I/O speed first and mount path speed second (particularly read).

    3/4/5.  I'll get CMR requests in.  I can get the space used by the pending deletes still to be processed with a grep of the SIDBEngine logs:

    grep -hi 'Total]' SIDBEngine*.log | grep -E -o "Pending Deletes.{30}" | tr "[]-" " " | numfmt --to=iec --field 4

    Just more work to log into each DDB partition for something that ideally would be in the GUI.
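
    For now I loop it over the partitions with something like the below (hypothetical MA hostnames and log path; adjust for your install):

    # hypothetical MA hostnames and Linux log path; adjust for your install
    for ma in ddb-ma01 ddb-ma02 ddb-ma03 ddb-ma04; do
        echo "== $ma =="
        ssh "$ma" 'grep -hi "Total]" /var/log/commvault/Log_Files/SIDBEngine*.log |
            grep -E -o "Pending Deletes.{30}" | tr "[]-" " " |
            numfmt --to=iec --field 4'
    done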

    Talking to the onsite CV guy, we discussed using an automatic synth full schedule, which I hadn't heard of before and will be really useful.  Something similar for aging/pruning and DASH copies would be great: rather than static schedule times, as soon as the CommServe gets quiet, run maintenance activities to maximise the hours of the day available.

     

    6.  Yep, I've enabled that before and I can see the Avg finalise times.  It was more about trawling the logs and doing something useful with that info, if you already had a script etc. in use.

    7.  Very good explanation, thanks for explaining that.

     

     

  • Re: Pruning & stats
    Posted: 07-10-2018, 5:30 AM

    Hi,

    You are most welcome!

    Please find the automatic schedule for synthetic fulls below for your reference.

    Reference:

    https://documentation.commvault.com/commvault/v11/article?p=92404.htm

    At the moment, only the following backup types support automatic schedules in Commvault:

    • Laptop client data backups
    • Database log backups
    • Synthetic full backups

    You could also lodge a CMR request with Commvault Support to ask if automatic schedules could be incorporated for data aging and aux copy jobs.

     

    Finally, I am glad that your environment is getting sorted, and I hope your journey with Commvault goes smoother from here onwards.

     

    Regards

    Ola.

     
