Partitioned DDB - dataflow

Last post 07-14-2013, 9:31 AM by JayST. 13 replies.
  • Partitioned DDB - dataflow
    Posted: 07-01-2013, 9:22 AM

    I'm testing with partitioned DDB storage policies. I'm unsure about the dataflow.

    Let's say I have MA1 and MA2. Each MA has its own MagLib, which uses SAN volumes attached to the MA as a local drive letter.

    MA1-Maglib1

    MA2-Maglib2

    I could create a storage policy primary copy with partitioned DDBs and have a GridStor setup by sharing the mount paths on each MA/MagLib. MA1 would have read access to the mount path of MA2 and vice versa.

    I'm trying to create a scalable two-node deduplication configuration in which each media agent has its own DDB for backups going to each of those media agents.

    If I have 100 subclients starting, ideally 50 would go to one MA and 50 to the other MA (round robin).

    Questions:

    1.) Will the DDB on each MA be hammered for 50 subclients, or do both DDBs do work for all 100 subclients?

    2.) What's the data flow in this setup when I back up multiple subclients and have the data path configuration set to "round robin" between these two media agents?

    3.) Would this even be efficient?

    4.) Which media agent writes the actual data to the maglib for a single backup?

     

  • Re: Partitioned DDB - dataflow
    Posted: 07-01-2013, 12:02 PM

    What kind of storage is presented to your media agents: block, CIFS, or NFS? Also, do you have to run an aux copy to tape? If you do, setting this up might be a bad idea. Here is why:

    If you have your media agents load balanced, blocks will sometimes be written to one media agent and sometimes to another. When you do an aux copy, the media agent copies the blocks per job ID, not just the data it captured itself. If blocks are spread out across media agents, the media agent performing the aux copy to tape has to use a UNC path to get the other blocks off the other media agents. We had round robin turned on and had to turn it off because we have to dump to tape every day. If you have a tape requirement, the partitioned DDB will give you more load balancing, but your tape copy will suffer if you are using block-level storage.
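
    To illustrate that read path, here is a minimal sketch (made-up chunk names, mount paths and function names, not how the product is actually implemented): the MA doing the aux copy reads chunks it wrote from its own mount path, and anything the other MA wrote has to come over a UNC path.

    LOCAL_MA = "MA1"   # the media agent driving the aux copy to tape (assumption)

    # Chunks belonging to one aux-copy job, each tagged with the MA that wrote it.
    job_chunks = [
        {"chunk": "CHUNK_0001", "written_by": "MA1", "mount_path": r"E:\MagLib1\Folder1"},
        {"chunk": "CHUNK_0002", "written_by": "MA2", "mount_path": r"E:\MagLib2\Folder7"},
    ]

    def resolve_read_path(chunk):
        """Local read if this MA wrote the chunk, otherwise a UNC read from the other MA."""
        if chunk["written_by"] == LOCAL_MA:
            return chunk["mount_path"]                                   # fast local/SAN read
        drive, rest = chunk["mount_path"].split(":", 1)
        return "\\\\" + chunk["written_by"] + "\\" + drive + "$" + rest  # slower network read

    for c in job_chunks:
        print(c["chunk"], "->", resolve_read_path(c))

    The more jobs whose chunks ended up on the "other" MA, the more of the tape copy is dragged across the network instead of read locally.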

  • Re: Partitioned DDB - dataflow
    Posted: 07-01-2013, 1:34 PM

    Storage is block-based in this case. And yes, I'd like to run an aux copy to tape as well.

    "blocks will sometime write to one media agent and then another".
    But round robin works on job IDs as well right? If i kick off 2 subclients at the same time, round robin will evenly distribute the subclients over 2 media agents, at least, this is what i thought i saw on my tests :)

    When aux copying to tape using a tape drive attached to one of the MAs (let's say MA1), the first job will come from the maglib attached to MA1 and the second job will come from the maglib attached to MA2, but will also travel over the network (UNC path) to MA1 before going to tape. Is that correct?

    I think a CIFS/NFS share would be more efficient here, but I'm still curious how data flows when using block storage and how my media agents behave regarding the 100 jobs in my first example.

  • Re: Partitioned DDB - dataflow
    Posted: 07-02-2013, 9:00 AM

    We are new to CV in general and have our system set up basically how you have described. I haven't been able to see any sort of round robin configuration in use at all. All clients in a policy use one media agent, even if there are 100+ kicking off at once.


    -J. Mills
    Systems Engineer (Storage & Backup)
    CommVault Certified Master

  • Re: Partitioned DDB - dataflow
    Posted: 07-03-2013, 3:13 AM

    That's strange. Then you probably have a different configuration. Do you have round robin set on the data path configuration of the storage policy copy?

  • Re: Partitioned DDB - dataflow
    Posted: 07-03-2013, 5:55 AM

    Does anyone else have some input/details on partitioned DDBs and the dataflow when using them?

    Thanks.

  • Re: Partitioned DDB - dataflow
    Posted: 07-04-2013, 5:33 PM
    • Ali

    Hi Jay, when you 'partition' the DDB, you're not really going to have 2 DDBs; it's just 1 logical engine with 2 physical partitions, allowing for the scalability you mentioned regarding 'footprint' and 'backup speed'.

    Data will be round robin of course once you put in GridStor. However, both 'partitions' won't necessarily reference the same blocks: during the data flow, each partition updates its own block references, and the unique block reference for a new block is written to whichever partition gets it first.

    The GUI under Storage Resources --> Deduplication Engines can also show you each partition's respective performance, to get a good feel for how the data is being distributed.

    Guessing you may have already seen this:

    http://documentation.commvault.com/commvault/release_10_0_0/books_online_1/english_us/prod_info/dedup_disk.htm?var1=http://documentation.commvault.com/commvault/release_10_0_0/books_online_1/english_us/features/dedup_disk/faq.htm

  • Re: Partitioned DDB - dataflow
    Posted: 07-08-2013, 9:03 AM

    Thanks for the info, Ali.

    But when I look at my 4 questions in my opening post, I'm unsure how to answer some of them.

    1.) Will the DDB on each MA be hammered for 50 subclients, or do both DDBs do work for all 100 subclients?
    --> I think both MAs will do work for every backup. So when I start one backup, two media agents will be involved in the DDB processing.

    2.) What's the data flow in this setup when I back up multiple subclients and have the data path configuration set to "round robin" between these two media agents?
    --> 1.) data is sent from the client to MA1; 2.) DDB processing takes place on MA1 and MA2 for this client data; 3.) MA1 saves the data to its maglib; 4.) MA2 only updates its references to MA1's maglib data blocks, but does not save data.

    Am I close here?

    3.) Would this even be efficient?
    --> Regarding DDB processing, yes; in regards to saving blocks to the maglib, not so sure.

    4.) Which media agent writes the actual data to the maglib for a single backup?
    --> Not sure. Will MA1 save the blocks to its maglib for a single job ID, or does MA2 also save blocks to its maglib? Looking at the following drawing, I think only one media agent is writing the blocks to the maglib.
    http://documentation.commvault.com/commvault/release_10_0_0/books_online_1/english_us/images/dedup/overview/partitioned_ddb.png

  • Re: Partitioned DDB - dataflow
    Posted: 07-08-2013, 9:26 AM
    • Ali

    JayST, both MAs won't access the engine at the same time for the same blocks; each block signature will be unique to one partition. The goal is scaling and load balancing.

    For #2 that's correct. For #3, of course: a partitioned DDB will give you scale for a larger data set, as you're allowing for multiple partitions. For #4, both media agents could potentially write, depending on the available streams/writers per library.

  • Re: Partitioned DDB - dataflow
    Posted: 07-08-2013, 7:14 PM

    Hi Guys,

    I think there is a bit of confusion between the data mover and the DDB host.

    When you are using GridStor and load balancing streams across multiple media agents, that is solely the work of the data mover. So each stream will be written down to the data path configured in the storage policy for that media agent.

    With partitioned dedupe, which signature is processed by which partition depends on the outcome of the signature hash. For example, if you have two partitions, the signature generated will hash to a value of 0 or 1. If the value is 0, the signature is submitted to the DDB host for partition 1. If the hash value is 1, the signature is submitted to the DDB host hosting partition 2.

    In this regard, even if you have 20 streams going to MA1 and 50 streams to MA2, the signature processing is essentially still split 50/50 due to the way the hashing algorithm works. Each media agent will be responsible for half the load. The only time this is not true is if a partition goes offline; then all signature processing will be moved to the other available partition nodes.
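
    To make the two mechanisms concrete, here is a minimal sketch (hypothetical code, not the product's actual implementation): stream allocation is a simple round robin across the data paths, while signature routing hashes each block signature to a partition, independently of which MA moves the data.

    import hashlib
    from itertools import cycle

    MEDIA_AGENTS = ["MA1", "MA2"]   # data movers / data paths (round robin)
    NUM_PARTITIONS = 2              # DDB partition 1 hosted on MA1, partition 2 on MA2

    # 1) Stream/job allocation: purely the data mover's round robin across data paths.
    data_path = cycle(MEDIA_AGENTS)
    jobs = {"job_%d" % i: next(data_path) for i in range(1, 5)}

    # 2) Signature routing: independent of which MA moves the data; the hash of the
    #    block signature decides which partition handles the lookup/insert.
    def partition_for_block(block: bytes) -> int:
        """Illustrative only -- the real product computes its own signature and routing."""
        signature = hashlib.md5(block).digest()
        return signature[-1] % NUM_PARTITIONS   # 0 -> partition 1, 1 -> partition 2

    print(jobs)                                   # e.g. {'job_1': 'MA1', 'job_2': 'MA2', ...}
    print(partition_for_block(b"example data block"))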

    So do not get too hung up on the stream allocation, as that is not relevant to the signature processing.

    A similar concept was used back in the 8.0 days, where media agents were defined only to host DDBs and not do any data movement (This was coined an MA zero configuration).

  • Re: Partitioned DDB - dataflow
    Posted: 07-09-2013, 2:54 PM

    So a partitioned DDB allows me to scale my DDB further while also balancing some of the load regarding DDB processing during backups.

    Would you say DDB scaling is the primary purpose of DDB partitioning? More than the load sharing?

    The documentation mentions that partitioned DDBs improve throughput/backup performance, but doesn't this process add latency when having to distribute the hashes during backups?

    Which component actually submits the hashes to MA1 or MA2? The client? (when signature generation is enabled at the client level...)

    If the client submits directly to either MA1 or MA2, I can imagine this improves things regarding throughput/performance on the MAs.

     

     

  • Re: Partitioned DDB - dataflow
    Posted: 07-10-2013, 10:12 AM

    Hi JayST,

    Partitioned DDBs provide performance, scaling, and one you forgot ... redundancy.

    In terms of improving performance, this is due to the number of connections. Performance scales with connections. Once the connection count to a DDB hits around 50, we generally start to see a decline in performance, depending on the size of the store. With multiple partitions, you can run more connections and therefore increase overall throughput compared to a single partition/DDB.

    In terms of scaling, this applies to records within the DDB. All databases tend to slow down the larger they get. The limit in v9 for mechanical drives is somewhere along the lines of 500-750 million primary records. With SSD I believe it scales a bit further, to a billion primaries. So now we have the ability for a single engine to grow beyond these limits by utilizing partitions.
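
    As a rough back-of-the-envelope illustration (assuming the commonly used 128 KB deduplication block size, which may not match your configuration), those primary-record limits translate into unique, post-dedupe data per partition roughly as follows:

    # Rough capacity estimate from primary-record limits (illustrative assumptions only).
    BLOCK_SIZE_KB = 128   # assumed dedupe block size; adjust to your configuration

    def unique_data_tb(primary_records):
        """Approximate unique (post-dedupe) data a single DDB partition can reference."""
        return primary_records * BLOCK_SIZE_KB / (1024 ** 3)   # KB -> TB

    for label, records in [("mechanical disk, ~500M records", 500e6),
                           ("mechanical disk, ~750M records", 750e6),
                           ("SSD, ~1B records", 1e9)]:
        print("%s: ~%.0f TB of unique data" % (label, unique_data_tb(records)))

    # Two partitions roughly double these figures for the single logical engine.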

    In terms of redundancy: if a single partition fails in the middle of your backup window, backups can continue using the remaining available partitions (as opposed to having to run a reconstruction in v9). Additionally, backups can run while a reconstruction job is occurring on one partition. One thing to bear in mind is that the client will have to rewrite blocks that should exist as part of the offline partition, so the dedupe ratio will be affected whilst the partition is offline.

    In regards to the client signature inserts, I believe the client will make a connection to both SIDB processes on each media agent. Depending on the outcome of the hash generated, the signature will be checked in against MA1 or MA2, so latency should not be a factor.

  • Re: Partitioned DDB - dataflow
    Posted: 07-10-2013, 8:15 PM
    • Ali

    Thanks dandre, that's a good summary. I need some writing lessons...

  • Re: Partitioned DDB - dataflow
    Posted: 07-14-2013, 9:31 AM

    dandre:

    .....

    In regards to the client signature inserts, I believe the client will make a connection to both SIDB processes on each media agent. Depending on the outcome of the hash generated, the signature will be checked in against MA1 or MA2, so latency should not be a factor.

    Great info! Thanks!
