Link Search Menu Expand Document

Data contract check reference

experimental
Last modified on 10-Jul-24

× Data contracts is an experimental project in Soda Core.

As the development team explores data contracts, expect minor imperfections, inconsistencies, and limited support, compatibility, and functionality if you download and use the soda-core-contracts package.

Soda data contracts is a Python library that verifies data quality standards as early and often as possible in a data pipeline so as to prevent negative downstream impact. Be aware, Soda data contracts checks do not use SodaCL. Learn more About Soda data contracts.

✖️    Requires Soda Core Scientific
✔️    Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Snowflake, and Spark
✖️    Supported in Soda Core CLI
✖️    Supported in Soda Library + Soda Cloud
✖️    Supported in Soda Cloud Agreements + Soda Agent

✖️    Available as a no-code check

What follows is reference documentation and examples of each type of data contract check.
Note that data contracts checks do not follow SodaCL syntax.

Duplicate
Freshness
Missing
Row count
SQL aggregation
SQL metric expression
SQL metric query
Validity
Check filters
List of threshold keys

Duplicate

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
no_duplicate_values no - name
columns
duplicate_count required - name
duplicate_percent required - name
dataset: dim_employee

...

columns:
- name: id
  checks:
  - type: no_duplicate_values
- name: last_name
  checks:
  - type: duplicate_count
    must_be_less_than: 10
    name: Fewer than 10 duplicate names
- name: address_line1
  checks:
  - type: duplicate_percent
    must_be_less_than: 1

checks:
- type: no_duplicate_values
  columns: ['phone', 'email']

Freshness

This check compares the maximum value in the column to the time the scan runs; the check fails if that computed value exceeds the threshold you specified in the check.

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
freshness_in_days required - name
freshness_in_hours required - name
freshness_in_minutes required - name
dataset: dim_customer

...

columns:
- name: date_first_purchase
  checks:
    type: freshness_in_days
    must_be_less_than: 2
    name: New data arrived within the last 2 days

Missing

If you do not use an optional column configuration key to identify the values Soda ought to consider as missing, Soda uses NULL to identify missing values.

See also: Combine missing and validity

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
no_missing_values no - name
missing_values
missing_sql_regex
missing_count required - name
missing_values
missing_sql_ regex
missing_percent required - name
missing_values
missing_sql_regex
dataset: dim_customer

...

columns: 
- name: title
  checks: 
  - type: no_missing_values 
- name: middle_name
  checks: 
  - type: missing_count
    must_be_less_than: 10
    # Soda includes 'NULL' in list of values by default
    missing_values: ['xxx', 'none', 'NA']
- name: last_name
  checks:
  - type: missing_count
    must_be_less_than: 5 
- name: first_name
  checks: 
  - type: missing_percent
    must_be_less_than: 1
    name: No whitespace entries
    # regular expression must match the dialect of your SQL engine
    missing_sql_regex: '[\s]'

Row count

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
rows_exist no - name
row_count required - name
dataset: dim_customer

...

columns: 
- name: first_name
  checks: 
  - type: row_count
    must_be_between: [100, 120]
    name: Verify row count range

checks: 
- type: rows_exist

SQL aggregation

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
avg required - name
sum required - name
dataset: dim_customer

...

columns:
- name: yearly_income
  checks:
  - type: avg
    must_be_between: [50000, 80000]
    name: Average salary within expected range

- name: total_children
  checks:
  - type: sum
    must_be_less_than: 10

SQL metric expression

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
metric_expression required - name

Use a SQL metric expression check to monitor a custom metric that you define using a SQL expression.

You can apply a SQL metric check to one or more columns or to an entire dataset. As Soda data contracts pushes results to Soda Cloud, it associates column checks with the column name in Soda Cloud.

Relative to a SQL metric query check, a SQL metric expression check offers slightly better performance during contract verification. In the case where Soda must also compute other metrics during verification, it appends a SQL metric expression to the same query so that it only requires a single pass over the data to compute all the metrics. A SQL metric query executes independently of other queries during verification, essentially requiring a separate pass.

dataset: CUSTOMERS
...

columns:
  - name: id
  # SQL metric expression check for a column
  - name: country
    checks:
    - type: metric_expression
      metric: us_count
      expression_sql: COUNT(CASE WHEN country = 'US' THEN 1 END)
      must_be: 0
dataset: CUSTOMERS
...

columns:
  - name: id
  - name: country
checks:
# SQL metric expression check for a dataset
- type: metric_expression
  metric: us_count
  expression_sql: COUNT(CASE WHEN country = 'US' THEN 1 END)
  must_be: 0


SQL metric query

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
metric_expression required - name

Use a SQL metric query check to monitor a custom metric that you define using a SQL query.

You can apply a SQL metric check to one or more columns or to an entire dataset. As Soda data contracts pushes results to Soda Cloud, it associates column checks with the column name in Soda Cloud.

dataset: CUSTOMERS
...

columns:
  # SQL metric query check for a column
  - name: id
    checks:
    - type: metric_query
      metric: us_count
      query_sql: |
        SELECT COUNT(*)
        FROM {table_name}
        WHERE country = 'US'
      must_be_not_between: [0, 5]
  - name: country
dataset: CUSTOMERS
...

columns:
  - name: id
checks:
  # SQL metric expression check for a dataset
  - type: metric_query
    metric: us_count
    query_sql: |
      SELECT COUNT(*)
      FROM {table_name}
      WHERE country = 'US'
    must_be_not_between: [0, 5]

Validity

Type of check Accepts
threshold values
Column config
keys: required
Column config
keys: optional
no_invalid_values no At least one of:
valid_values
valid_format Valid formats
valid_sql_regex
valid_min
valid_max
valid_length
valid_min_length
valid_max_length
valid_values_reference_data
invalid_values
invalid_format
invalid_sql_regex
name
invalid_count required At least one of:
valid_values
valid_format Valid formats
valid_sql_regex
valid_min
valid_max
valid_length
valid_min_length
valid_max_length
valid_values_reference_data
invalid_values
invalid_format
invalid_sql_regex
name
invalid_percent required At least one of:
valid_values
valid_format Valid formats
valid_sql_regex
valid_min
valid_max
valid_length
valid_min_length
valid_max_length
valid_values_reference_data
invalid_values
invalid_format
invalid_sql_regex
name
dataset: dim_customer

...

columns: 
- name: first_name
  data_type: character varying
  checks: 
  - type: no_invalid_values
    valid_min_length: 2
- name: email_address
  checks: 
  - type: invalid_count
    must_be_less_than: 25
    valid_format: email
- name: id
  checks:
  - type: invalid_percent
    must_be_less_than: 5
    valid_sql_regex: '^ID.$'
    name: Less than 5% invalid
- name: total_children
  checks:
  - type: invalid_count
    # With multiple configurations, rows must meet ALL criteria
    valid_min: 0
    valid_max: 12
    must_be_less_than: 10
    name: Acceptable range of offspring count
  - name: comment
    checks:
    - type: no_invalid_values
      valid_min_length: 0
      valid_max_length: 160


Valid formats

For a list of the available formats to use with the valid_formats column configuration key, see: List of valid formats and Formats supported with Soda for MS SQL Server for SodaCL.


Validity reference

Also known as a referential integrity or foreign key check, Soda executes a validity check with a valid_values_reference_data column configuration key as a separate query, relative to other validity queries. The query counts all values that exist in the named column which also do not exist in the column in the referenced dataset.

The referential dataset must exist in the same data source as the dataset identified by the contract.

dataset: dim_employee

...

columns:
- name: country
  checks:
  - type: invalid_percent
    must_be_less_than: 3
    valid_values_reference_data: 
      dataset: countryID
      column: id


Combine missing and validity

You can combine column configuration keys to include both missing and validity parameters. Soda separately evaluates the parameters to prevent double-counting any rows that fail to meet the specified thresholds so that a row that fails both parameters only counts as one failed row.

dataset: dim_product

...

columns:
- name: size
  checks:
  - type: no_invalid_values
    missing_values: ['N/A']
    valid_values: ['S', 'M', 'L']


If you add both a missing and validity check to a single column, Soda leverages the results of preceding checks when evaluating subsequent ones.

In the example below, Soda considers any row that failed the no_missing_values check as one that will fail the second, no_invalid_values check without re-evaluating the row.

dataset: dim_product

...

columns:
- name: size
  checks:
  - type: no_missing_values
    missing_values: ['N/A']
  - type: no_invalid_values
    valid_values: ['S', 'M', 'L']

In the case where you have configured multiple missing checks that specify different missing values, Soda does not merge the results of the check evaluation; it only honors that last set of missing values. Not supported by valid_values_reference_data.


Check filters

Optionally, you can apply a filter_sql for the following checks:

  • numeric metrics, except duplicate_count and duplicate_percent
  • no_missing_values, missing_count and missing_percent
  • no_invalid_values, invalid_count and invalid_percent

The example below verifies that the only valid value for the column currency is pounds when the value of the country column for the fow is UK.

dataset: dim_product

...

columns:
- name: country
- name: currency
  checks:
  - type: no_invalid_values
    valid_values: ['pounds']
    filter_sql: country = 'UK'

List of threshold keys

Threshold key Expected value Example
must_be number must_be: 0
must_not_be number must_not_be: 0
must_be_greater_than number must_be_greater_than: 100
must_be_greater_than_or_equal_to number must_be_greater_than_or_equal_to: 100
must_be_less_than number must_be_less_than: 100
must_be_less_than_or_equal_to number must_be_less_than_or_equal_to: 100
must_be_between list of 2 numbers must_be_between: [0, 100]
must_be_not_between list of 2 numbers must_be_not_between: [0, 100]

Threshold boundaries

When you use must_be_between threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 100 or 120 each passes.

dataset: dim_customer

columns:
- name: first_name
- name: middle_name
- name: last_name

checks:
- type: row_count
  must_be_between: [100, 120]


When you use must_be_between threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 0 or 120 each fails.

dataset: dim_customer

columns:
- name: first_name
- name: middle_name
- name: last_name

checks:
- type: row_count
  must_be_not_between: [0, 120]


Use multiple thresholds to adjust the inclusion of boundary values.

dataset: dim_customer

columns:
- name: total_children
  # check passes if values are outside the range, inclusive of 20 
  checks:
  - type: avg
    must_be_less_than: 10
    must_be_greater_than_or_equal_to: 20
- name: yearly_income
  # check passes if values are inside the range, inclusive of 100
  checks:
  - type: avg
    must_be_greater_than_or_equal_to: 100
    must_be_less_than: 200

Go further


Was this documentation helpful?

What could we do to improve this page?

Documentation always applies to the latest version of Soda products
Last modified on 10-Jul-24