Connect Soda to Dask and Pandas
Last modified on 02-Oct-24
Connect configuration reference
Load CSV file into Dataframe
Load JSON file into Dataframe
Add optional parameter for COUNT
Add optional parameter for text data conversion
Troubleshoot
Connection configuration reference
For use with programmatic Soda scans, only.
Install package: soda-pandas-dask
Define a programmatic scan for the data in the DataFrames. You do not need to configure a connection to a data source, but you must still configure a connection to Soda Cloud using API Keys. Refer to the following example.
Why do I need a Soda Cloud account?
To validate your account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.Learn more
Load CSV file into Dataframe
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Read more info in "Note on new release" section
dask.config.set({"dataframe.convert-string": False})
# Create a Soda scan object
scan = Scan()
# Load timeseries data from dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "a@soda.io"
# Create an artificial pandas dataframe
df_employee = pd.DataFrame({"email": ["a@soda.io", "b@soda.io", "c@soda.io"]})
# Either add Dask dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders")
# OR, add Pandas dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_pandas_dataframe(dataset_name="employee", pandas_df=df_employee, data_source_name="orders")
# Optionally, add multiple dataframes as unique data sources. Note the change of
# the data_source_name parameter.
scan.add_dask_dataframe(dataset_name="inquiries", dask_df=[...], data_source_name="customers")
# Set the scan definition name and default data source to use
scan.set_scan_definition_name("test")
scan.set_data_source_name("orders")
# Add configuration YAML file
# You do not need connection to a data source; you must have a connection to Soda Cloud
# Choose one of the following two options:
# 1) From a file
scan.add_configuration_yaml_file(file_path="~/.soda/configuration.yml")
# 2) Inline in the code
# For host, use cloud.soda.io for EU region; use cloud.us.soda.io for US region
scan.add_configuration_yaml_str(
"""
soda_cloud:
host: cloud.soda.io
api_key_id: 2e0ba0cb-your-api-key-7b
api_key_secret: 5wd-your-api-key-secret-aGuRg
"""
# Define checks in yaml format
# Alternatively, refer to a yaml file using scan.add_sodacl_yaml_file(<filepath>)
checks = """
for each dataset T:
datasets:
- include %
checks:
- row_count > 0
profile columns:
columns:
- employee.%
checks for employee:
- values in (email) must exist in timeseries (email) # Error expected
- row_count same as timeseries # Error expected
checks for timeseries:
- avg_x_minus_y between -1 and 1:
avg_x_minus_y expression: AVG(x - y)
- failed rows:
samples limit: 50
fail condition: x >= 3
- schema:
name: Confirm that required columns are present
warn:
when required column missing: [x]
when forbidden column present: [email]
when wrong column type:
email: varchar
fail:
when required column missing:
- y
- invalid_count(email) = 0:
valid format: email
- valid_count(email) > 0:
valid format: email
"""
scan.add_sodacl_yaml_str(checks)
scan.set_verbose(True)
scan.execute()
Load JSON file into Dataframe
import pandas as pd
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load JSON file into DataFrame
df = pd.read_json('your_file.json')
...
Add optional parameter for COUNT
Prior to soda-pandas-dask
version 1.6.4, Soda only supported dask-sql
versions up to 2023.10
in which the COUNT(*)
clause behaved as COUNT(1)
by default. With dask-sql
versions greater than 2023.10
, Dask’s behavior changed so that COUNT(*)
behaves as COUNT(*)
. Therefore, upgrading your soda-pandas-dask
package, which supports newer versions of dask-sql
with the new behavior, might lead to unexpected differences in your check results.
To mitigate confusion, with soda-pandas-dask
version 1.6.4 or greater, use the optional use_dask_count_star_as_count_one
parameter when calling scan.add_dask_dataframe()
or scan.add_pandas_dataframe()
to explicitly set the behavior of the COUNT(*)
clause, as in the following example.
Parameter setting | behavior |
---|---|
use_dask_count_star_as_count_one=True | COUNT(*) behaves as SQL COUNT(1) operation |
use_dask_count_star_as_count_one=False | COUNT(*) behaves as SQL COUNT(*) operation |
If you do not add the parameter, Soda defaults to use_dask_count_star_as_count_one=True
.
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Create a Soda scan object
scan = Scan()
# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "a@soda.io"
# Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
# Dask uses SQL COUNT(*) operation, instead of COUNT(1)
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
Add optional parameter for text data conversion
In dask>=2023.7.1
and later, if you use pandas>=2
and pyarrow>=12
, Dask Dataframe automatically converts text data to string[pyarrow]
data type. With soda-pandas-dask
version 1.6.4, Soda’s updated codebase uses dask>=2023.7.1
but it still expects text data to be converted to object
data type.
Add the dask.config.set({"dataframe.convert-string": False})
parameter set to False
, as in the following example, to avoid KeyError: string[pyarrow]
errors. Access Dask documentation for further details.
import pandas as pd
import dask
import dask.datasets
from soda.scan import Scan
# Avoid string conversion errors
dask.config.set({"dataframe.convert-string": False})
# Create a Soda scan object
scan = Scan()
# Load timeseries data from Dask datasets
df_timeseries = dask.datasets.timeseries().reset_index()
df_timeseries["email"] = "a@soda.io"
# Add Dask Dataframe to scan and assign a dataset name to refer from checks.yaml
scan.add_dask_dataframe(dataset_name="timeseries", dask_df=df_timeseries, data_source_name="orders", use_dask_count_star_as_count_one=False)
Troubleshoot
Problem: You encounter errors when trying to install soda-pandas-dask
in an environment that uses Python 3.11. This may manifest as an issue with dependencies or as an error that reads, Pre-scan validation failed, see logs for details.
Workaround: Uninstall the soda-pandas-dask
package, then downgrade the version of Python your environment uses to Python 3.9. Install the soda-pandas-dask
package again.
Problem: The COUNT(*
) behavior in dask-sql
is behaving unexpectedly or yielding confusing check results.
Solution: Upgrade soda-pandas-dask
to version 1.6.4 or greater and use the optional use_dask_count_star_as_count_one=True
parameter when calling scan.add_dask_dataframe()
or scan.add_pandas_dataframe()
to persist old dask-sql
behavior. See Add optional parameter for COUNT
.
Problem: You encounter an error that reads KeyError: string[pyarrow]
.
Solution: Upgrade soda-pandas-dask
to version 1.6.4 or greater and use the dask.config.set({"dataframe.convert-string": False})
parameter set to False
. See Add optional parameter text data conversion.
Was this documentation helpful?
What could we do to improve this page?
- Suggest a docs change in GitHub.
- Share feedback in the Soda community on Slack.
Documentation always applies to the latest version of Soda products
Last modified on 02-Oct-24