Data contract check reference
Last modified on 10-Jul-24
As the development team explores data contracts, expect minor imperfections, inconsistencies, and limited support, compatibility, and functionality if you download and use the
soda-core-contracts
package. Soda data contracts is a Python library that verifies data quality standards as early and often as possible in a data pipeline so as to prevent negative downstream impact. Be aware, Soda data contracts checks do not use SodaCL. Learn more About Soda data contracts.
✖️ Requires Soda Core Scientific
✔️ Experimentally supported in Soda Core 3.3.3 or greater for PostgreSQL, Snowflake, and Spark
✖️ Supported in Soda Core CLI
✖️ Supported in Soda Library + Soda Cloud
✖️ Supported in Soda Cloud Agreements + Soda Agent
✖️ Available as a no-code check
What follows is reference documentation and examples of each type of data contract check.
Note that data contracts checks do not follow SodaCL syntax.
Duplicate
Freshness
Missing
Row count
SQL aggregation
SQL metric expression
SQL metric query
Validity
Check filters
List of threshold keys
Duplicate
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
no_duplicate_values | no | - | name columns |
duplicate_count | required | - | name |
duplicate_percent | required | - | name |
dataset: dim_employee
...
columns:
- name: id
checks:
- type: no_duplicate_values
- name: last_name
checks:
- type: duplicate_count
must_be_less_than: 10
name: Fewer than 10 duplicate names
- name: address_line1
checks:
- type: duplicate_percent
must_be_less_than: 1
checks:
- type: no_duplicate_values
columns: ['phone', 'email']
Freshness
This check compares the maximum value in the column to the time the scan runs; the check fails if that computed value exceeds the threshold you specified in the check.
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
freshness_in_days | required | - | name |
freshness_in_hours | required | - | name |
freshness_in_minutes | required | - | name |
dataset: dim_customer
...
columns:
- name: date_first_purchase
checks:
type: freshness_in_days
must_be_less_than: 2
name: New data arrived within the last 2 days
Missing
If you do not use an optional column configuration key to identify the values Soda ought to consider as missing, Soda uses NULL to identify missing values.
See also: Combine missing and validity
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
no_missing_values | no | - | name missing_values missing_sql_regex |
missing_count | required | - | name missing_values missing_sql_ regex |
missing_percent | required | - | name missing_values missing_sql_regex |
dataset: dim_customer
...
columns:
- name: title
checks:
- type: no_missing_values
- name: middle_name
checks:
- type: missing_count
must_be_less_than: 10
# Soda includes 'NULL' in list of values by default
missing_values: ['xxx', 'none', 'NA']
- name: last_name
checks:
- type: missing_count
must_be_less_than: 5
- name: first_name
checks:
- type: missing_percent
must_be_less_than: 1
name: No whitespace entries
# regular expression must match the dialect of your SQL engine
missing_sql_regex: '[\s]'
Row count
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
rows_exist | no | - | name |
row_count | required | - | name |
dataset: dim_customer
...
columns:
- name: first_name
checks:
- type: row_count
must_be_between: [100, 120]
name: Verify row count range
checks:
- type: rows_exist
SQL aggregation
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
avg | required | - | name |
sum | required | - | name |
dataset: dim_customer
...
columns:
- name: yearly_income
checks:
- type: avg
must_be_between: [50000, 80000]
name: Average salary within expected range
- name: total_children
checks:
- type: sum
must_be_less_than: 10
SQL metric expression
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
metric_expression | required | - | name |
Use a SQL metric expression check to monitor a custom metric that you define using a SQL expression.
You can apply a SQL metric check to one or more columns or to an entire dataset. As Soda data contracts pushes results to Soda Cloud, it associates column checks with the column name in Soda Cloud.
Relative to a SQL metric query check, a SQL metric expression check offers slightly better performance during contract verification. In the case where Soda must also compute other metrics during verification, it appends a SQL metric expression to the same query so that it only requires a single pass over the data to compute all the metrics. A SQL metric query executes independently of other queries during verification, essentially requiring a separate pass.
dataset: CUSTOMERS
...
columns:
- name: id
# SQL metric expression check for a column
- name: country
checks:
- type: metric_expression
metric: us_count
expression_sql: COUNT(CASE WHEN country = 'US' THEN 1 END)
must_be: 0
dataset: CUSTOMERS
...
columns:
- name: id
- name: country
checks:
# SQL metric expression check for a dataset
- type: metric_expression
metric: us_count
expression_sql: COUNT(CASE WHEN country = 'US' THEN 1 END)
must_be: 0
SQL metric query
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
metric_expression | required | - | name |
Use a SQL metric query check to monitor a custom metric that you define using a SQL query.
You can apply a SQL metric check to one or more columns or to an entire dataset. As Soda data contracts pushes results to Soda Cloud, it associates column checks with the column name in Soda Cloud.
dataset: CUSTOMERS
...
columns:
# SQL metric query check for a column
- name: id
checks:
- type: metric_query
metric: us_count
query_sql: |
SELECT COUNT(*)
FROM {table_name}
WHERE country = 'US'
must_be_not_between: [0, 5]
- name: country
dataset: CUSTOMERS
...
columns:
- name: id
checks:
# SQL metric expression check for a dataset
- type: metric_query
metric: us_count
query_sql: |
SELECT COUNT(*)
FROM {table_name}
WHERE country = 'US'
must_be_not_between: [0, 5]
Validity
Type of check | Accepts threshold values | Column config keys: required | Column config keys: optional |
---|---|---|---|
no_invalid_values | no | At least one of:valid_values valid_format Valid formatsvalid_sql_regex valid_min valid_max valid_length valid_min_length valid_max_length valid_values_reference_data invalid_values invalid_format invalid_sql_regex | name |
invalid_count | required | At least one of:valid_values valid_format Valid formatsvalid_sql_regex valid_min valid_max valid_length valid_min_length valid_max_length valid_values_reference_data invalid_values invalid_format invalid_sql_regex | name |
invalid_percent | required | At least one of:valid_values valid_format Valid formatsvalid_sql_regex valid_min valid_max valid_length valid_min_length valid_max_length valid_values_reference_data invalid_values invalid_format invalid_sql_regex | name |
dataset: dim_customer
...
columns:
- name: first_name
data_type: character varying
checks:
- type: no_invalid_values
valid_min_length: 2
- name: email_address
checks:
- type: invalid_count
must_be_less_than: 25
valid_format: email
- name: id
checks:
- type: invalid_percent
must_be_less_than: 5
valid_sql_regex: '^ID.$'
name: Less than 5% invalid
- name: total_children
checks:
- type: invalid_count
# With multiple configurations, rows must meet ALL criteria
valid_min: 0
valid_max: 12
must_be_less_than: 10
name: Acceptable range of offspring count
- name: comment
checks:
- type: no_invalid_values
valid_min_length: 0
valid_max_length: 160
Valid formats
For a list of the available formats to use with the valid_formats
column configuration key, see: List of valid formats and Formats supported with Soda for MS SQL Server for SodaCL.
Validity reference
Also known as a referential integrity or foreign key check, Soda executes a validity check with a valid_values_reference_data
column configuration key as a separate query, relative to other validity queries. The query counts all values that exist in the named column which also do not exist in the column in the referenced dataset.
The referential dataset must exist in the same data source as the dataset identified by the contract.
dataset: dim_employee
...
columns:
- name: country
checks:
- type: invalid_percent
must_be_less_than: 3
valid_values_reference_data:
dataset: countryID
column: id
Combine missing and validity
You can combine column configuration keys to include both missing and validity parameters. Soda separately evaluates the parameters to prevent double-counting any rows that fail to meet the specified thresholds so that a row that fails both parameters only counts as one failed row.
dataset: dim_product
...
columns:
- name: size
checks:
- type: no_invalid_values
missing_values: ['N/A']
valid_values: ['S', 'M', 'L']
If you add both a missing and validity check to a single column, Soda leverages the results of preceding checks when evaluating subsequent ones.
In the example below, Soda considers any row that failed the no_missing_values
check as one that will fail the second, no_invalid_values
check without re-evaluating the row.
dataset: dim_product
...
columns:
- name: size
checks:
- type: no_missing_values
missing_values: ['N/A']
- type: no_invalid_values
valid_values: ['S', 'M', 'L']
In the case where you have configured multiple missing checks that specify different missing values, Soda does not merge the results of the check evaluation; it only honors that last set of missing values. Not supported by valid_values_reference_data
.
Check filters
Optionally, you can apply a filter_sql
for the following checks:
- numeric metrics, except
duplicate_count
andduplicate_percent
no_missing_values
,missing_count
andmissing_percent
no_invalid_values
,invalid_count
andinvalid_percent
The example below verifies that the only valid value for the column currency
is pounds
when the value of the country
column for the fow is UK
.
dataset: dim_product
...
columns:
- name: country
- name: currency
checks:
- type: no_invalid_values
valid_values: ['pounds']
filter_sql: country = 'UK'
List of threshold keys
Threshold key | Expected value | Example |
---|---|---|
must_be | number | must_be: 0 |
must_not_be | number | must_not_be: 0 |
must_be_greater_than | number | must_be_greater_than: 100 |
must_be_greater_than_or_equal_to | number | must_be_greater_than_or_equal_to: 100 |
must_be_less_than | number | must_be_less_than: 100 |
must_be_less_than_or_equal_to | number | must_be_less_than_or_equal_to: 100 |
must_be_between | list of 2 numbers | must_be_between: [0, 100] |
must_be_not_between | list of 2 numbers | must_be_not_between: [0, 100] |
Threshold boundaries
When you use must_be_between
threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 100
or 120
each passes.
dataset: dim_customer
columns:
- name: first_name
- name: middle_name
- name: last_name
checks:
- type: row_count
must_be_between: [100, 120]
When you use must_be_between
threshold keys, Soda includes the boundary values as acceptable. In the following example, a check result of 0
or 120
each fails.
dataset: dim_customer
columns:
- name: first_name
- name: middle_name
- name: last_name
checks:
- type: row_count
must_be_not_between: [0, 120]
Use multiple thresholds to adjust the inclusion of boundary values.
dataset: dim_customer
columns:
- name: total_children
# check passes if values are outside the range, inclusive of 20
checks:
- type: avg
must_be_less_than: 10
must_be_greater_than_or_equal_to: 20
- name: yearly_income
# check passes if values are inside the range, inclusive of 100
checks:
- type: avg
must_be_greater_than_or_equal_to: 100
must_be_less_than: 200
Go further
- Learn how to prepare a data contract.
- Need help? Join the Soda community on Slack.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 10-Jul-24