Wednesday, March 7, 2012

Data Auditing vs CDC

I first understood CDC (Change Data Capture) as being equivalent to auditing functionality, and I was excited to see that it exposed the data in a relational fashion*. After more reading I lost some illusions, and I would like to know if I understood things correctly.

In short:

- CDC is geared toward ETL and real-time BI (a nice goal, but not related to auditing);
- the capture tables are meant to be purged, which is the opposite of what auditing needs. With the default grace period of 72 hours, I would not be surprised to run into performance problems if they were never purged and were used for auditing instead.
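For reference, this is roughly what enabling CDC looks like (a sketch; dbo.[Order] is an illustrative table name, and the resulting change table cdc.dbo_Order_CT assumes the default capture instance name):

```sql
-- Enable CDC at the database level, then for one table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Order',
    @role_name     = NULL;   -- no gating role here; restrict access in production

-- Changes now accumulate in cdc.dbo_Order_CT until the cleanup job purges them.
```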

So CDC should not be used for auditing. Is that right?

Thanks, Eric

*BTW, the way I hoped to consume audited data would not be through clumsy functions. How do you expect applications to consume the data without lots of extra coding? An application must be able to load an object (built from N tables) as of any point in time without code changes. What we did with SQL 2005 (not that it is exactly the best way, but it does work fine) is to:

- have audit tables with the same structure as the audited tables (plus metadata) — the same concept as CDC;
- have an AUDIT schema that exposes views identical to the audited tables. E.g., for a dbo.Order table there is an Audit.Order view. The views expose only the rows as they were, based on the current session's "point in time" (more on that later);
- have a SET_POINT_IN_TIME stored procedure that allows the application to set a specific point in time for the current session in CONTEXT_INFO (XML/binary format);
- have auditing triggers. The downside is synchronous I/O (which CDC does not have), but we at least found a way to reduce the I/O by storing only column changes and making it the view's responsibility to expose fully restored rows (this works well but adds some overhead when reading audited data).

The end result is that any read-only ad hoc statement that can be run against the base tables can also be run against the Audit views. You do all JOINs the same way, and you can load the same objects and run the same reports.
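The scheme just described could be sketched as follows. This is purely illustrative: the object names (Audit.OrderHistory, AuditTime, the columns) are assumptions, and it stores full rows for brevity, ignoring the column-level storage optimization described above.

```sql
-- Record the session's point in time in CONTEXT_INFO (datetime is 8 bytes).
CREATE PROCEDURE dbo.SET_POINT_IN_TIME
    @PointInTime datetime
AS
BEGIN
    DECLARE @ci varbinary(128);
    SET @ci = CONVERT(binary(8), @PointInTime);
    SET CONTEXT_INFO @ci;
END
GO

-- A view that exposes each order as it was at the session's point in time.
CREATE VIEW Audit.[Order]
AS
SELECT h.OrderId, h.CustomerId, h.Amount
FROM Audit.OrderHistory AS h
WHERE h.AuditTime = (
        SELECT MAX(h2.AuditTime)
        FROM Audit.OrderHistory AS h2
        WHERE h2.OrderId = h.OrderId
          AND h2.AuditTime <= CONVERT(datetime,
                                      SUBSTRING(CONTEXT_INFO(), 1, 8)));
GO
```

After `EXEC dbo.SET_POINT_IN_TIME '2012-01-01';`, a `SELECT ... FROM Audit.[Order]` behaves like the same query against dbo.[Order] would have on that date, so existing JOINs and reports work unchanged.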

Hi Eric

As you say, CDC was created to help ETL scenarios by providing support for incremental data loads. It uses an asynchronous capture mechanism that reads the transaction log, populates change capture tables with the row data, and provides APIs that expose the captured data in a number of ways.
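A sketch of consuming those APIs, assuming a capture instance named dbo_Order: you map times to LSNs and call the generated table-valued function.

```sql
-- Read all changes to dbo.Order captured in the last day.
DECLARE @from_lsn binary(10), @to_lsn binary(10);

SET @from_lsn = sys.fn_cdc_map_time_to_lsn(
                    'smallest greater than or equal',
                    DATEADD(day, -1, GETDATE()));
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_Order(@from_lsn, @to_lsn, N'all');
-- __$operation: 1 = delete, 2 = insert, 4 = update
-- (pass N'all update old' to also get the before image, 3).
```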

Regarding the purge of the captured data: once CDC is enabled for a table, changes are captured in a corresponding capture table, which will grow over time. A cleanup agent job is created that removes data from the capture tables according to a retention period. Both the retention period and the configuration of the cleanup job are under user control. It would therefore be possible to disable the cleanup job if you wanted, understanding of course that the capture tables would continue to grow.
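As a sketch, the retention period (which defaults to 4320 minutes, i.e. the 72 hours mentioned above) and the cleanup job can be adjusted with the stock procedures:

```sql
-- Extend retention to roughly one year (the value is in minutes).
EXEC sys.sp_cdc_change_job
    @job_type  = N'cleanup',
    @retention = 525600;

-- Or remove the cleanup job entirely, accepting unbounded
-- capture-table growth:
-- EXEC sys.sp_cdc_drop_job @job_type = N'cleanup';
```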

Even though CDC was targeted at ETL scenarios it is possible that the functionality it offers would be useful in other scenarios where changes to table data need to be determined. CDC was not meant to be used for auditing, but the term "auditing" can mean different things to different people. I would recommend you look at the functionality offered by CDC and the characteristics of CDC (e.g. cleanup) and determine whether it can be used for your scenario or not.

I hope that helps.

Thanks, Mark


Thanks Mark. I recognized the auditing potential, but as I said, I am afraid to use the feature for a purpose for which it was not intended unless that use is officially supported.

If I have:

- a 300 GB database with 1000+ tables
- around 1000+ users
- an OLTP scenario, but also some complex reporting that can't be run on a mirror (some reports do updates too)
- auditing enabled on all tables at all times
- a retention period of 5+ years (with enough disk space to support it, of course)
