One major discipline of data preparation is dealing with data quality issues. Before you can do anything meaningful with your data, you’ve to prepare the dataset. Why not starting with a data quality (short DQ) report.
Using pandas to setup a data quality report
Pandas is a reliable buddy on your journey to insights. In this example we’ll use a flatfile but can easily replace it with a database connection. For a better understanding: I used the Happiness Report 2020 dataset. The column titles of the flatfiles become the rows of the report – if you look at the picture above: how many values of the „data field“ country name are present, missing, unique… guess, you got it.
import pandas as pd
data_file = 'whatever.csv'
data = pd.read_csv(
data_file,
sep=','
)
The code above will read the data from the file. Then we’ll setup a few Data Frames which represent our DQ KPIs. By joining the DataFrames we’ll generate an easy-to-read DQ Report.
Adding the data types of our data:
data_types = pd.DataFrame(
data.dtypes,
columns=['Data Type']
)
Checking for missing data:
missing_data = pd.DataFrame(
data.isnull().sum(),
columns=['Missing Values']
)
Check if the values are unique:
unique_values = pd.DataFrame(
columns=['Unique Values']
)
for row in list(data.columns.values):
unique_values.loc[row] = [data[row].nunique()]
If the min or max values are relevant, we can add:
maximum_values = pd.DataFrame(
columns=['Maximum Value']
)
for row in list(data.columns.values):
maximum_values.loc[row] = [data[row].max()]
Now, generate the overview by df.join(other_df):
dq_report = data_type.join(missing_data).join(unique_values).join(maximum_values)
dq_report
What’s next?
- Add more useful data quality KPIs
- Find someone who takes care of the issues 😉
Just one idea: define the most relevant data attributes in your database and check them on a regular basis. Send the DQ Report to the data owner.
[…] wie man mit Python und Pandas einen DQ Report aufsetzt […]
[…] auf Konsistenz zu überprüfen. Sind die Adressdaten nicht identisch, wird der Datensatz in einem Report „Datenqualität“ aufgelistet und an die verantwortlichen Fachbereiche mit Bitte zur Korrektur geschickt. Dem […]