Distribution checks
Last modified on 10-Jul-24
Use a distribution check to determine whether the distribution of a column has changed between two points in time. For example, if you trained a model at a particular moment in time, you can use a distribution check to find out how much the data in the column has changed over time, or if it has changed all.
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
distribution reference file: ./cars_owned_dist_ref.yml
method: chi_square
# (optional) filter to a specific point in time or any other dimension
filter: purchase_date > 2022-10-01 and purchase_date < 2022-12-01
# (optional) database specific sampling query for continuous columns. For
# example, for PostgreSQL the following query randomly samples 50% of the data
# with seed 61.
sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
✔️ Requires Soda Core Scientific (included in a Soda Agent)
✔️ Supported in Soda Core
✔️ Supported in Soda Library + Soda Cloud
✖️ Supported in Soda Cloud Agreements + Soda Agent
✖️ Available as a no-code check
About distribution checks
Install Soda Scientific
Generate a distribution reference object (DRO)
Define a distribution check
Distribution check details
Bins and weights
Define the sample size
Distribution check examples
Optional check configurations
List of comparison symbols and phrases
Troubleshoot Soda Scientific installation
Go further
About distribution checks
To detect changes in the distribution of a column between different points in time, Soda uses approaches based on hypothesis testing and based on metrics that quantify the distance between samples.
When using hypothesis testing, a distribution check allows you to determine whether enough evidence exists to conclude that the distribution of a column has changed. It returns the probability that the difference between samples taken at two points in time would have occurred if they came from the same distribution (see p-value). If this probability is smaller than a threshold that you define, the check warns you that the column’s distribution has changed.
You can use the following statistical tests for hypothesis testing in your distribution checks.
- Kolmogorov-Smirnov for continuous data
- Chi-square for categorical data
When using a metric to measure distance between samples, a distribution check returns the value of the distance metric that you chose based on samples taken at two points in time. If the value of the distance metric is larger than a threshold that you define, the check warns that the column’s distribution has changed.
You can use the following distance metrics in your distribution checks.
- Population Stability Index (PSI) for continuous or categorical data
- Standardized Wasserstein Distance (SWD) (standardized using the sample standard deviation) for continuous or categorical data
- Standardized Earth Mover’s Distance (standardized using the sample standard deviation, this metric is equal to the SWD) for continuous or categorical data
Sample sizes in distribution checks
In hypothesis testing, the statistical power of a test refers to its ability to reject the null hypothesis when it is false. Specifically, the power of a test tells you how likely it is that the null hypothesis will be rejected if the true difference with the alternative hypothesis were of a particular size; see effect size. A very powerful test is able to reject the null hypothesis even if the true difference is small.Since distribution checks issue warnings based on the p-value alone and do not take effect size into account, having too much power can make the results of the checks hard to interpret. An extremely powerful test rejects the null hypothesis for effect sizes that are negligible. Because the power of a test increases as its sample size increases, there is a sample size limit of one million in distribution checks.
The default sample size limit of 1 million rows is based on simulations that used the Kolmogorov-Smirnov test. The simulation generated samples from a normal distribution, an exponential distribution, a laplacian distribution, a beta distribution, and a mixture distribution (generated by randomly choosing between two normal distributions). The Kolmogorov-Smirnov test compared these samples to samples that came from the same distributions, but with different means. For example, it compared samples from a normal distribution to samples from another normal distribution with a different mean.
For each distribution type, the Kolmogorov-Smirnov test rejected the null hypothesis 100% of the time if the effect size was equal to, or larger than, a shift to the mean of 1% of the standard deviation, when using a sample size of one million. Using such a sample size does not result in problems with local memory.
If you wish, you can define your own sample size using a SQL query. See Define the sample size section, below.
Distribution check thresholds for distance metrics
The values of the Population Stability Index (PSI) and the Standardized Wasserstein Distance (SWD) can be hard to interpret. Consider carefully investigating which distribution thresholds make sense for your use case.Some common interpretations of the PSI result are as follows:
- PSI < 0.1: no significant distribution change
- 0.1 < PSI < 0.2: moderate distribution change
- PSI >= 0.2: significant distribution change
Install Soda Scientific
To use a distribution check, you must install Soda Scientific in the same directory or virtual environment in which you installed Soda Library. Best practice recommends installing Soda Library and Soda Scientific in a virtual environment to avoid library conflicts, but you can Install Soda Scientific locally if you prefer.
- Set up a virtual environment, and install Soda Library in your new virtual environment.
- Use the following command to install Soda Scientific.
pip install -i https://pypi.cloud.soda.io soda-scientific
List of Soda Scientific dependencies
- pandas<2.0.0
- wheel
- pydantic>=1.8.1,<2.0.0
- scipy>=1.8.0
- numpy>=1.23.3, <2.0.0
- inflection==0.5.1
- httpx>=0.18.1,<2.0.0
- PyYAML>=5.4.1,<7.0.0
- cython>=0.22
- prophet>=1.1.0,<2.0.0
Refer to Troubleshoot Soda Scientific installation for help with issues during installation.
Generate a distribution reference object (DRO)
Not yet supported in Soda Cloud
Before defining a distribution check, you must generate a distribution reference object (DRO).
When you run a distribution check, Soda compares the data in a column of your dataset with a snapshot of the same column at a different point in time. This snapshot exists in the DRO, which serves as a point of reference. The distribution check result indicates whether the difference between the distributions of the snapshot and the actual datasets is statistically significant.
To create a DRO, you use the CLI command soda update-dro
. When you execute the command, Soda stores the entire contents of the column(s) you specified in local memory. Before executing the command, examine the volume of data the column(s) contains and ensure that your system can accommodate storing it in local memory.
- If you have not already done so, create a directory to contain the files that Soda uses for a distribution check.
-
Use a code editor to create a file called
distribution_reference.yml
(though, you can name it anything you wish) in your Soda project directory, then add the following example content to the file.dataset: your_dataset_name column: column_name_in_dataset distribution_type: categorical # (optional) filter to a specific point in time or any other dimension filter: "column_name between '2010-01-01' and '2020-01-01'" # (optional) database specific sampling query, for example for postgres\ # the following query randomly samples 50% of the data with seed 61
Optionally, you can define multiple DROs in your
distribution_reference.yml
file by naming them. The following example defines two DROs.dro_name1: dataset: your_dataset_name column: column_name_in_dataset distribution_type: categorical dro_name2: dataset: your_dataset_name column: column_name2_in_dataset distribution_type: continuous
- Change the values for
dataset
andcolumn
to reflect your own dataset’s identifiers. - (Optional) Change the value for
distribution_type
to capturecategorical
orcontinuous
data. - (Optional) Define the value of
filter
to specify the portion of the data in your dataset for which you are creating a DRO. If you trained a model on data in which thedate_first_customer
column contained values between 2010-01-01 and 2020-01-01, you can use a filter based on that period to test whether the distribution of the column has changed since then.
If you do not wish to define a filter, remove the key-value pair from the file. - (Optional) If you wish to define multiple DROs in a single
distribution_reference.yml
file, change the namesdro_name1
anddro_name2
. -
Save the file, then, while still in your Soda project directory, run the
soda update-dro
command to create a distribution reference object. For a list of options available to use with the command, runsoda update-dro --help
.soda update-dro -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.yml
If you defined multiple DROs in your
distribution_reference.yml
file, specify which DRO you want to update using the-n
argument.-n
indicates name. When multiple DROs are defined in a singledistribution_reference.yml
file, Soda requires all of them to be named. Thus, you must provide the DRO name with the-n
argument when using thesoda update-dro
command.soda update-dro -n dro_name1 -d your_datasource_name -c your_configuration_file.yml ./distribution_reference.yml
-
Review the changed contents of your
distribution_reference.yml
file. The following is an example of the information that Soda added to the file.dataset: dim_customer column: number_cars_owned distribution_type: categorical filter: date_first_purchase between '2010-01-01' and '2020-01-01' distribution reference: weights: - 0.34932914953473276 - 0.2641744211209695 - 0.22927937675827742 - 0.08899588833585804 - 0.06822116425016231 bins: - 2 - 1 - 0 - 3 - 4
Soda appended a new key called
distribution reference
to the file, together with an array ofbins
and a corresponding array ofweights
. Read more aboutbins
andweights
, and how Soda computes the number of bins for a DRO.
Define a distribution check
- If you have not already done so, create a
checks.yml
file in your Soda project directory. The checks YAML file stores the Soda Checks you write, including distribution checks; Soda Library executes the checks in the file when it runs a scan of your data. -
In your new file, add the following example content.
checks for your_dataset_name: - distribution_difference(column_name, dro_name) > your_threshold: method: your_method_of_choice distribution reference file: ./distribution_reference.yml # (optional) filter to a specific point in time, or any other dimension filter: column_name > min_allowed_column_value and column_name < max_allowed_value # (optional) database specific sampling query for continuous columns. For # example, for PostgreSQL, the following query randomly samples 50% of the data # with seed 61 sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
- Replace the following values with your own dataset and threshold details.
your_dataset_name
- the name of your datasetcolumn_name
- the column against which to compare the DROdro_name
- the name of the DRO (optional, required ifdistribution_reference.yml
contains named DROs)> your_threshold
- the threshold for the distribution check that you specify as acceptable
- Replace the value of
your_method_of_choice
with the type of test you want to use in the distribution check. If you do not specify amethod
, the distribution check defaults toks
for continuous data, orchi_square
for categorical data.ks
for the Kolmogorov-Smirnov testchi_square
for the Chi-square testpsi
for the Population Stability Index metricswd
for the Standardized Wasserstein Distance (SWD) metricsemd
for the Standardized Earth Mover’s Distance (SEMD) metric
SWD and the SEMD are the same metric.
-
(Optional), to filter the data in the distribution check, replace the value of
filter
with a filter that specifies the portion of the data in your dataset for which you are checking the distribution. -
(Optional), to sample the data in the distribution check for continuous columns only, replace the value of
sample
with a query that specifies the portion of the data in your dataset for which you are checking the distribution. The data source you are using must support the query you write.
For example, for PostgreSQL, you can use theTABLESAMPLE
clause to randomly sample 50% of the data with seed 61. Best practice dictates that you use sampling for large datasets that might not fit in memory. Refer to the define the sample size for details. If you do not usesample
orfilter
in a distribution check for continuous columns, Soda fetches up to 1 million records by applying alimit
clause for better memory management. For categorical columns, Soda does not supportsample
. -
Run a soda scan of your data source to execute the distribution check(s) you defined.
soda scan -d your_datasource_name checks.yml -c /path/to/your_configuration_file.yml your_check_file.yml
When Soda Library executes the distribution check above, it compares the values in
column_name
to a sample that Soda creates based on thebins
,weights
, anddata_type
indro_name
defined in thedistribution_reference.yml
file. Specifically, it checks whether the value ofyour_method_of_choice
is larger than0.05
.
Distribution check details
-
For continuous columns, When you execute the
soda scan
command, Soda stores up to one million records in local memory. If the column has more than one million records, then Soda applieslimit
SQL clause to make sure that your system can accommodate storing it in local memory. -
For continuous columns, as explained in Generate a Distribution Reference Object (DRO), Soda uses bins and weights to take random samples from your DRO. Therefore, it is possible that the original dataset that you used to create the DRO resembles a different underlying distribution than the dataset that Soda creates by sampling from the DRO. To limit the impact of this possibility, Soda runs the tests in each distribution check ten times and returns the median of the results, either as a p-value or a distance metric).
For example, if you use the Kolmogorov-Smirnov test and a threshold of 0.05, the distribution check uses the Kolmogorov-Smirnov test to compare ten different samples from your DRO to the data in your column. If the median of the returned p-values is smaller than 0.05, the check issues a warning. This approach does change the interpretation of the distribution check results. For example, the probability of a type I error is multiple orders of magnitude smaller than the significance level that you choose. -
For categorical columns, Soda fetches the aggregated calculated value counts of each category. If there are more than one million distinct categories, Soda skips the distribution check and issues a warning.
Bins and weights
Soda uses the bins
and weights
to generate a sample from the reference distribution when it executes the distribution check during a scan. By creating a sample using the DRO’s bins and weights, you do not have to save the entire – potentially very large - sample. The distribution_type
value impacts how the weights and bins will be used to generate a sample, so make sure your choice reflects the nature of your data (continuous or categorical).
To compute the number of bins for a DRO, Soda uses different strategies based on whether outlier values are present in the dataset.
By default Soda automatically computes the number of bins for each DRO by taking the maximum of Sturges and Freedman Diaconis Estimator methods. numpy.histogram_bin_edges(data, bins=’auto’) also applies this practice by default.
For datasets with outliers, such as in the example below, the default strategy does not work well. When taking the maximum of Sturges and Freedman Diaconis Estimator methods, it produces a great number of bins, 3466808
, while there are only nine elements in the array. The outlier value 10e6
result in a misleading bin size.
import numpy as np
arr = np.array([0, 0, 0, 1, 2, 3, 3, 4, 10e6])
number_of_bins = np.histogram_bin_edges(arr, bins='auto').size # return 3466808
If the number of bins is greater than the size of data, Soda uses interquantile range (IQR) to detect and filter the outliers. Basically, for data that is greater than Q3 + 1.5 IQR
and less than Q1 - 1.5 IQR
Soda removes the datasets, then recomputes the number of bins with the same method by taking the maximum of Sturges and Freedman Diaconis Estimator.
After removing the outliers, if the number of bins still exceeds the size of the filtered data, Soda takes the square root of the dataset size to set the number of bins. To cover edge cases, if the square root of the dataset size exceeds one million, then Soda sets the number of bins to one million to prevent it from generating too many bins.
Define the sample size
You can add a sample
parameter for both a distribution check and DRO to include a sample SQL clause that Soda passes when it executes the check during a scan.
Apply a sample to a distribution check for continuous columns
If the data to which you wish to apply distribution check does not fit in memory or involves a time constraint, use a sample
to specify a SQL query that returns a sample of the data. The SQL query that you provide is specific to the type of data source you use. In the example below, the SQL query for a PostgreSQL data source randomly samples 50% of the data with seed 61. You can customize the sample
SQL query to meet your needs.
Use sample
for continuous values, only. For categorical values refer to Distribution check details.
checks for dim_customer:
- distribution_difference(budget) < 0.05:
distribution reference file: ./dro_dim_customer.yml
method: ks
# (optional) data source-specific sampling query; for example for postgres\
# the following query randomly samples 50% of the data with seed 61
sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
Sampling Caveats
Some data sources do not have a built-in sampling function. For example, BigQuery does not support TABLESAMPLE BERNOULLI
. In such a case, add a filter
parameter to randomly obtain a sample of the data. The filter
parameter applies a data source-specific SQL WHERE
clause to the data. In the example below, the SQL query for a BigQuery data source randomly samples 50% of the data.
Distribution Check
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
distribution reference file: ./cars_owned_dist_ref.yml
method: chi_square
# (optional) data source-specific sampling query, for example for postgres\
# the following query randomly samples 50% of the data
filter: rand() < 0.5
DRO
dataset: your_dataset_name
column: column_name_in_dataset
distribution_type: categorical
# (optional) data source-specific sampling query; for example for postgres\
# the following query randomly samples 50% of the data
filter: rand() < 0.5
Distribution check examples
You can define multiple distribution checks in a single checks.yml
file. If you create a new DRO for another dataset and column in sales_dist_ref.yml
for example, you can define two distribution checks in the same checks.yml
file, as per the following.
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
method: chi_square
distribution reference file: ./cars_owned_dist_ref.yml
checks for fact_sales_quota:
- distribution_difference(calendar_quarter) < 0.2:
method: psi
distribution reference file: ./sales_dist_ref.yml
Alternatively you can define two DROs in distribution_reference.yml
, naming them cars_owned_dro
and calendar_quarter_dro
, and use both in a single checks.yml
file
checks for dim_customer:
- distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
method: chi_square
distribution reference file: ./distribution_reference.yml
checks for fact_sales_quota:
- distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
method: psi
distribution reference file: ./distribution_reference.yml
You can also define multiple checks for different columns in the same dataset by generating multiple DROs for those columns. Refer to the following example.
checks for dim_customer:
- distribution_difference(number_cars_owned, cars_owned_dro) > 0.05:
method: chi_square
distribution reference file: ./distribution_reference.yml
- distribution_difference(total_children, total_children_dro) < 0.2:
method: psi
distribution reference file: ./distribution_reference.yml
checks for fact_sales_quota:
- distribution_difference(calendar_quarter, calendar_quarter_dro) < 0.2:
method: psi
distribution reference file: ./distribution_reference.yml
Optional check configurations
Supported | Configuration | Documentation |
---|---|---|
✓ | Define a name for a distribution check; see example. | Customize check names |
✓ | Add an identity to a check. | Add a check identity |
Define alert configurations to specify warn and fail thresholds. | - | |
✓ | Apply an in-check filter to return results for a specific portion of the data in your dataset; see example. | Configure in-check filters |
✓ | Use quotes when identifying dataset or column names; see example. Note that the type of quotes you use must match that which your data source uses. For example, BigQuery uses a backtick (`) as a quotation mark. | Use quotes in a check |
Use wildcard characters ( % or * ) in values in the check. | - | |
✓ | Use for each to apply distribution checks to multiple datasets in one scan; see example. | Apply checks to multiple datasets |
✓ | Apply a dataset filter to partition data during a scan; see example. | Scan a portion of your dataset |
✓ | Instruct Soda to collect random samples. Because sampling SQL clauses vary significantly between data sources, consult your data source’s documentation; see example. | - |
Example with check name
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
method: chi_square
distribution reference file: dist_ref.yml
name: Distribution check
Example with quotes
checks for dim_customer:
- distribution_difference("number_cars_owned") < 0.2:
method: psi
distribution reference file: dist_ref.yml
name: Distribution check
Example with for each
for each dataset T:
dataset:
- dim_customer
checks:
- distribution_difference(number_cars_owned) < 0.15:
method: swd
distribution reference file: dist_ref.yml
Example with in-check filter
checks for dim_customer:
- distribution_difference(number_cars_owned) < 0.05:
method: swd
distribution reference file: dist_ref.yml
filter: date_first_purchase between '2010-01-01' and '2022-01-01'
Example with dataset filter
filter dim_customer [first_purchase]:
where: date_first_purchase between '2010-01-01' and '2022-01-01'
checks for dim_customer [first_purchase]:
- distribution_difference(number_cars_owned) < 0.05:
method: swd
distribution reference file: dist_ref.yml
Example with in-check sampling
The following example works for PostgreSQL data sources. It randomly samples 50% of the dataset with seed value 61.
checks for dim_customer:
- distribution_difference(number_cars_owned) > 0.05:
distribution reference file: ./cars_owned_dist_ref.yml
method: chi_square
sample: TABLESAMPLE BERNOULLI (50) REPEATABLE (61)
List of comparison symbols and phrases
=
<
>
<=
>=
!=
<>
between
not between
Troubleshoot Soda Scientific installation
While installing Soda Scientific works on Linux, you may encounter issues if you install Soda Scientific on Mac OS (particularly, machines with the M1 ARM-based processor) or any other operating system. If that is the case, consider using one of the following alternative installation procedures.
Need help? Ask the team in the Soda community on Slack.
Install Soda Scientific Locally
- Set up a virtual environment, and install Soda Library in your new virtual environment.
- Use the following command to install Soda Scientific.
pip install -i https://pypi.cloud.soda.io soda-scientific
List of Soda Scientific dependencies
- pandas<2.0.0
- wheel
- pydantic>=1.8.1,<2.0.0
- scipy>=1.8.0
- numpy>=1.23.3, <2.0.0
- inflection==0.5.1
- httpx>=0.18.1,<2.0.0
- PyYAML>=5.4.1,<7.0.0
- cython>=0.22
- prophet>=1.1.0,<2.0.0
Use Docker to run Soda Library
Use Soda’s Docker image in which Soda Scientific is pre-installed. You need Soda Scientific to be able to use SodaCL distribution checks or anomaly detection checks.
- If you have not already done so, install Docker in your local environment.
- From Terminal, run the following command to pull Soda Library’s official Docker image; adjust the version to reflect the most recent release.
docker pull sodadata/soda-library:v1.0.3
- Verify the pull by running the following command.
docker run sodadata/soda-library:v1.0.3 --help
Output:
Usage: soda [OPTIONS] COMMAND [ARGS]... Soda Library CLI version 1.0.x, Soda Core CLI version 3.0.xx Options: --version Show the version and exit. --help Show this message and exit. Commands: ingest Ingests test results from a different tool scan Runs a scan suggest Generates suggestions for a dataset test-connection Tests a connection update-dro Updates contents of a distribution reference file
When you run the Docker image on a non-Linux/amd64 platform, you may see the following warning from Docker, which you can ignore.
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
- When you are ready to run a Soda scan, use the following command to run the scan via the docker image. Replace the placeholder values with your own file paths and names.
docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
Optionally, you can specify the version of Soda Library to use to execute the scan. This may be useful when you do not wish to use the latest released version of Soda Library to run your scans. The example scan command below specifies Soda Library version 1.0.0.
docker run -v /path/to/your_soda_directory:/sodacl sodadata/soda-library:v1.0.0 scan -d your_data_source -c /sodacl/your_configuration.yml /sodacl/your_checks.yml
What does the scan command do?
docker run
ensures that the docker engine runs a specific image.-v
mounts your SodaCL files into the container. In other words, it makes the configuration.yml and checks.yml files in your local environment available to the docker container. The command example maps your local directory to/sodacl
inside of the docker container.sodadata/soda-library
refers to the image thatdocker run
must use.scan
instructs Soda Library to execute a scan of your data.-d
indicates the name of the data source to scan.-c
specifies the filepath and name of the configuration YAML file.
Error: Mounts denied
If you encounter the following error, follow the procedure below.
docker: Error response from daemon: Mounts denied:
The path /soda-library-test/files is not shared from the host and is not known to Docker.
You can configure shared paths from Docker -> Preferences... -> Resources -> File Sharing.
See https://docs.docker.com/desktop/mac for more info.
You need to give Docker permission to acccess your configuration.yml and checks.yml files in your environment. To do so:
- Access your Docker Dashboard, then select Preferences (gear symbol).
- Select Resources, then follow the Docker instructions to add your Soda project directory – the one you use to store your configuration.yml and checks.yml files – to the list of directories that can be bind-mounted into Docker containers.
- Click Apply & Restart, then repeat steps 2 - 4 above.
Error: Configuration path does not exist
If you encounter the following error, double check the syntax of the scan command in step 4 above.
- Be sure to prepend
/sodacl/
to both the congifuration.yml filepath and the checks.yml filepath. - Be sure to mount your files into the container by including the
-v
option. For example,-v /Users/MyName/soda_project:/sodacl
.
Soda Library 1.0.x
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Scan summary:
No checks found, 0 checks evaluated.
2 errors.
Oops! 2 errors. 0 failures. 0 warnings. 0 pass.
ERRORS:
Configuration path 'configuration.yml' does not exist
Path "checks.yml" does not exist
Go further
- Need help? Join the Soda community on Slack.
- Reference tips and best practices for SodaCL.
- Use a freshness check to gauge how recently your data was captured.
- Use reference checks to compare the values of one column to another.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 10-Jul-24