Data contract check reference

experimental
Last modified on 10-Jul-24

× Data contracts is an experimental project in Soda Core.

As the development team explores data contracts, expect minor imperfections, inconsistencies, and limited support, compatibility, and functionality if you download and use the soda-core-contracts package.

Soda data contracts is a Python library that verifies data quality standards as early and often as possible in a data pipeline so as to prevent negative downstream impact. Be aware, Soda data contracts checks do not use SodaCL. Learn more About Soda data contracts.

✖️    Requires Soda Core Scientific
✔️    Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Snowflake, and Spark
✖️    Supported in Soda Core CLI
✖️    Supported in Soda Library + Soda Cloud
✖️    Supported in Soda Cloud Agreements + Soda Agent

✖️ Available as a no-code check

What follows is reference documentation and examples of each type of data contract check.
Note that data contracts checks do not follow SodaCL syntax.

Duplicate
Freshness
Missing
Row count
SQL aggregation
SQL metric expression
SQL metric query
Validity
Check filters
List of threshold keys

Duplicate

Type of check	Accepts threshold values	Column config keys: required	Column config keys: optional
`no_duplicate_values`	no	-	`name` `columns`
`duplicate_count`	required	-	`name`
`duplicate_percent`	required	-	`name`

dataset: dim_employee

...

columns:
- name: id
  checks:
  - type: no_duplicate_values
- name: last_name
  checks:
  - type: duplicate_count
    must_be_less_than: 10
    name: Fewer than 10 duplicate names
- name: address_line1
  checks:
  - type: duplicate_percent
    must_be_less_than: 1

checks:
- type: no_duplicate_values
  columns: ['phone', 'email']

Freshness

This check compares the maximum value in the column to the time the scan runs; the check fails if that computed value exceeds the threshold you specified in the check.

Type of check	Accepts threshold values	Column config keys: required	Column config keys: optional
`freshness_in_days`	required	-	`name`
`freshness_in_hours`	required	-	`name`
`freshness_in_minutes`	required	-	`name`

dataset: dim_customer

...

columns:
- name: date_first_purchase
  checks:
    type: freshness_in_days
    must_be_less_than: 2
    name: New data arrived within the last 2 days

Missing

If you do not use an optional column configuration key to identify the values Soda ought to consider as missing, Soda uses NULL to identify missing values.

Type of check	Accepts threshold values	Column config keys: required	Column config keys: optional
`no_missing_values`	no	-	`name` `missing_values` `missing_sql_regex`
`missing_count`	required	-	`name` `missing_values` `missing_sql_ regex`
`missing_percent`	required	-	`name` `missing_values` `missing_sql_regex`

Row count

Type of check	Accepts threshold values	Column config keys: required	Column config keys: optional
`rows_exist`	no	-	`name`
`row_count`	required	-	`name`

dataset: dim_customer

...

columns: 
- name: first_name
  checks: 
  - type: row_count
    must_be_between: [100, 120]
    name: Verify row count range

checks: 
- type: rows_exist

SQL aggregation

Type of check	Accepts threshold values	Column config keys: required	Column config keys: optional
`avg`	required	-	`name`
`sum`	required	-	`name`

dataset: dim_customer

...

columns:
- name: yearly_income
  checks:
  - type: avg
    must_be_between: [50000, 80000]
    name: Average salary within expected range

- name: total_children
  checks:
  - type: sum
    must_be_less_than: 10

SQL metric expression

Type of check	Accepts threshold values	Column config keys: required	Column config keys: optional
`metric_expression`	required	-	`name`

Use a SQL metric expression check to monitor a custom metric that you define using a SQL expression.

You can apply a SQL metric check to one or more columns or to an entire dataset. As Soda data contracts pushes results to Soda Cloud, it associates column checks with the column name in Soda Cloud.

Relative to a SQL metric query check, a SQL metric expression check offers slightly better performance during contract verification. In the case where Soda must also compute other metrics during verification, it appends a SQL metric expression to the same query so that it only requires a single pass over the data to compute all the metrics. A SQL metric query executes independently of other queries during verification, essentially requiring a separate pass.

dataset: CUSTOMERS
...

columns:
  - name: id
  # SQL metric expression check for a column
  - name: country
    checks:
    - type: metric_expression
      metric: us_count
      expression_sql: COUNT(CASE WHEN country = 'US' THEN 1 END)
      must_be: 0

dataset: CUSTOMERS
...

columns:
  - name: id
  - name: country
checks:
# SQL metric expression check for a dataset
- type: metric_expression
  metric: us_count
  expression_sql: COUNT(CASE WHEN country = 'US' THEN 1 END)
  must_be: 0

SQL metric query

Type of check	Accepts threshold values	Column config keys: required	Column config keys: optional
`metric_expression`	required	-	`name`

Use a SQL metric query check to monitor a custom metric that you define using a SQL query.

You can apply a SQL metric check to one or more columns or to an entire dataset. As Soda data contracts pushes results to Soda Cloud, it associates column checks with the column name in Soda Cloud.

dataset: CUSTOMERS
...

columns:
  # SQL metric query check for a column
  - name: id
    checks:
    - type: metric_query
      metric: us_count
      query_sql: |
        SELECT COUNT(*)
        FROM {table_name}
        WHERE country = 'US'
      must_be_not_between: [0, 5]
  - name: country

dataset: CUSTOMERS
...

columns:
  - name: id
checks:
  # SQL metric expression check for a dataset
  - type: metric_query
    metric: us_count
    query_sql: |
      SELECT COUNT(*)
      FROM {table_name}
      WHERE country = 'US'
    must_be_not_between: [0, 5]

Validity

Type of check	Accepts threshold values	Column config keys: required	Column config keys: optional
`no_invalid_values`	no	At least one of: `valid_values` `valid_format` Valid formats `valid_sql_regex` `valid_min` `valid_max` `valid_length` `valid_min_length` `valid_max_length` `valid_values_reference_data` `invalid_values` `invalid_format` `invalid_sql_regex`	`name`
`invalid_count`	required	At least one of: `valid_values` `valid_format` Valid formats `valid_sql_regex` `valid_min` `valid_max` `valid_length` `valid_min_length` `valid_max_length` `valid_values_reference_data` `invalid_values` `invalid_format` `invalid_sql_regex`	`name`
`invalid_percent`	required	At least one of: `valid_values` `valid_format` Valid formats `valid_sql_regex` `valid_min` `valid_max` `valid_length` `valid_min_length` `valid_max_length` `valid_values_reference_data` `invalid_values` `invalid_format` `invalid_sql_regex`	`name`

dataset: dim_customer

...

columns: 
- name: first_name
  data_type: character varying
  checks: 
  - type: no_invalid_values
    valid_min_length: 2
- name: email_address
  checks: 
  - type: invalid_count
    must_be_less_than: 25
    valid_format: email
- name: id
  checks:
  - type: invalid_percent
    must_be_less_than: 5
    valid_sql_regex: '^ID.$'
    name: Less than 5% invalid
- name: total_children
  checks:
  - type: invalid_count
    # With multiple configurations, rows must meet ALL criteria
    valid_min: 0
    valid_max: 12
    must_be_less_than: 10
    name: Acceptable range of offspring count
  - name: comment
    checks:
    - type: no_invalid_values
      valid_min_length: 0
      valid_max_length: 160

Valid formats

For a list of the available formats to use with the valid_formats column configuration key, see: List of valid formats and Formats supported with Soda for MS SQL Server for SodaCL.

Validity reference

Also known as a referential integrity or foreign key check, Soda executes a validity check with a valid_values_reference_data column configuration key as a separate query, relative to other validity queries. The query counts all values that exist in the named column which also do not exist in the column in the referenced dataset.

The referential dataset must exist in the same data source as the dataset identified by the contract.

dataset: dim_employee

...

columns:
- name: country
  checks:
  - type: invalid_percent
    must_be_less_than: 3
    valid_values_reference_data: 
      dataset: countryID
      column: id

Combine missing and validity

You can combine column configuration keys to include both missing and validity parameters. Soda separately evaluates the parameters to prevent double-counting any rows that fail to meet the specified thresholds so that a row that fails both parameters only counts as one failed row.

dataset: dim_product

...

columns:
- name: size
  checks:
  - type: no_invalid_values
    missing_values: ['N/A']
    valid_values: ['S', 'M', 'L']

If you add both a missing and validity check to a single column, Soda leverages the results of preceding checks when evaluating subsequent ones.

In the example below, Soda considers any row that failed the no_missing_values check as one that will fail the second, no_invalid_values check without re-evaluating the row.

dataset: dim_product

...

columns:
- name: size
  checks:
  - type: no_missing_values
    missing_values: ['N/A']
  - type: no_invalid_values
    valid_values: ['S', 'M', 'L']

In the case where you have configured multiple missing checks that specify different missing values, Soda does not merge the results of the check evaluation; it only honors that last set of missing values. Not supported by valid_values_reference_data.

Check filters

Optionally, you can apply a filter_sql for the following checks:

numeric metrics, except duplicate_count and duplicate_percent
no_missing_values, missing_count and missing_percent
no_invalid_values, invalid_count and invalid_percent

The example below verifies that the only valid value for the column currency is pounds when the value of the country column for the fow is UK.

dataset: dim_product

...

columns:
- name: country
- name: currency
  checks:
  - type: no_invalid_values
    valid_values: ['pounds']
    filter_sql: country = 'UK'

List of threshold keys

Threshold key	Expected value	Example
`must_be`	number	`must_be: 0`
`must_not_be`	number	`must_not_be: 0`
`must_be_greater_than`	number	`must_be_greater_than: 100`
`must_be_greater_than_or_equal_to`	number	`must_be_greater_than_or_equal_to: 100`
`must_be_less_than`	number	`must_be_less_than: 100`
`must_be_less_than_or_equal_to`	number	`must_be_less_than_or_equal_to: 100`
`must_be_between`	list of 2 numbers	`must_be_between: [0, 100]`
`must_be_not_between`	list of 2 numbers	`must_be_not_between: [0, 100]`

Threshold boundaries

When you use must_be_between threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 100 or 120 each passes.

dataset: dim_customer

columns:
- name: first_name
- name: middle_name
- name: last_name

checks:
- type: row_count
  must_be_between: [100, 120]

When you use must_be_between threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 0 or 120 each fails.

dataset: dim_customer

columns:
- name: first_name
- name: middle_name
- name: last_name

checks:
- type: row_count
  must_be_not_between: [0, 120]

Use multiple thresholds to adjust the inclusion of boundary values.

dataset: dim_customer

columns:
- name: total_children
  # check passes if values are outside the range, inclusive of 20 
  checks:
  - type: avg
    must_be_less_than: 10
    must_be_greater_than_or_equal_to: 20
- name: yearly_income
  # check passes if values are inside the range, inclusive of 100
  checks:
  - type: avg
    must_be_greater_than_or_equal_to: 100
    must_be_less_than: 200

Go further

Learn how to prepare a data contract.
Need help? Join the Soda community on Slack.

Was this documentation helpful?

What could we do to improve this page?

Suggest a docs change in GitHub.
Share feedback in the Soda community on Slack.

Documentation always applies to the latest version of Soda products
Last modified on 10-Jul-24