Preparing Your Data
When you load JSON data into Haute, it arrives with dot-notation column names like proposer.date_of_birth, vehicle.make, and additional_drivers.1.gender. Before you can build a model, you need to turn these into clean, simple column names.
This page shows you how to use clean_columns() to do that in one line.
What you're starting with
After connecting a Quote Input node (or any JSON data source), your data preview will show columns like these:
proposer.date_of_birth proposer.gender
proposer.licence.licence_type proposer.licence.licence_date
vehicle.make vehicle.model
vehicle.security.alarm vehicle.security.immobiliser
additional_drivers.1.gender additional_drivers.2.gender
add_ons.breakdown_cover.selected
address.postcode address.city
policy_details.cover_type policy_details.voluntary_excess
policy_details.cover_start_date policy_details.compulsory_excess
These dot-notation names come directly from the nested JSON structure. They're accurate, but they're awkward to work with.
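To illustrate where these names come from, here is a small sketch of flattening a nested record into dot-notation keys. The flatten function and the sample quote are illustrative assumptions, not Haute's loader; list items are numbered from 1 only to mirror names such as additional_drivers.1.gender.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into one dict with dot-notation keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, list):
            # Assumption: list items are numbered from 1, matching names
            # such as additional_drivers.1.gender.
            for index, item in enumerate(value, start=1):
                if isinstance(item, dict):
                    flat.update(flatten(item, f"{path}.{index}"))
                else:
                    flat[f"{path}.{index}"] = item
        else:
            flat[path] = value
    return flat


# Made-up sample record, for illustration only.
quote = {
    "proposer": {"gender": "F", "licence": {"licence_type": "Full"}},
    "vehicle": {"make": "Ford"},
    "additional_drivers": [{"gender": "M"}, {"gender": "F"}],
}
flat_quote = flatten(quote)
print(list(flat_quote))
```

Each key in the result records the full path to the value, which is why the column names get long as the JSON gets deeper.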
Using clean_columns()
clean_columns() is a utility function in your project's utility/features.py (generated by haute init). It replaces every . with _ in column names.
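Add a Polars node after your data source. In the code editor, import clean_columns from utility.features, call it on the node's input (for example, df = clean_columns(quotes)), and return df.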
That's it. Click Run and look at the data preview -- your columns are now clean.
How the naming works
The rename is fully mechanical: every . becomes _. No heuristics, no surprises.
| Original column | After clean_columns() |
|---|---|
| address.postcode | address_postcode |
| vehicle.make | vehicle_make |
| proposer.licence.licence_type | proposer_licence_licence_type |
| vehicle.security.alarm | vehicle_security_alarm |
| additional_drivers.1.gender | additional_drivers_1_gender |
| add_ons.breakdown_cover.selected | add_ons_breakdown_cover_selected |
The names preserve the full path from the JSON structure, so you always know exactly which field a column came from. Use the Columns tab to strip prefixes visually, or chain .rename() afterward to shorten specific names.
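The rule can be sketched in a few lines of plain Python. The function name dot_to_underscore is hypothetical and this is not the generated helper itself; it only shows the mapping described above, using a plain list of names so that it runs without Haute or Polars.

```python
def dot_to_underscore(columns: list[str]) -> dict[str, str]:
    """Map each column name to the same name with every '.' replaced by '_'."""
    return {name: name.replace(".", "_") for name in columns}


columns = [
    "address.postcode",
    "proposer.licence.licence_type",
    "additional_drivers.1.gender",
]
mapping = dot_to_underscore(columns)
for old, new in mapping.items():
    print(f"{old} -> {new}")
```

A mapping of this shape (old name to new name) is also what Polars' DataFrame.rename accepts, so the same idea carries over to a real DataFrame.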
Stripping prefixes
After clean_columns(), names like policy_details_cover_type are longer than you need. The Columns tab on the Quote Input node lets you strip prefixes visually:
- Click the group name (e.g. policy_details) to strike it through and remove it from all columns in that group
- Click individual segments within a column name to strip them (e.g. click licence in licence.licence_type to get licence_type)
You can also rename in code by chaining .rename() after clean_columns().
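One way to build the mapping for such a rename is sketched below on a plain list of column names. The strip_prefix function is a hypothetical helper, not part of utility/features.py; it assumes the names have already been through clean_columns(), so the prefix is followed by an underscore.

```python
def strip_prefix(columns: list[str], prefix: str) -> dict[str, str]:
    """Build a rename mapping that drops `prefix` plus '_' from matching columns."""
    marker = prefix + "_"
    return {
        name: name[len(marker):]
        for name in columns
        # Skip names that would become empty after stripping.
        if name.startswith(marker) and len(name) > len(marker)
    }


cleaned = [
    "policy_details_cover_type",
    "policy_details_voluntary_excess",
    "vehicle_make",
]
renames = strip_prefix(cleaned, "policy_details")
print(renames)
```

With a Polars DataFrame, a mapping like this can be passed to df.rename(renames); columns that do not start with the prefix are left untouched.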
Adding derived features
After clean_columns(), add derived columns with normal Polars code. Your project's utility/features.py has helpers for common operations:
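import polars as pl
from utility.features import clean_columns, to_date, years_between

# Replace every "." in the column names with "_"
df = clean_columns(quotes)

# Cover start date, reused in the calculations below
cover_start = to_date("cover_start_date")

df = df.with_columns(
    # Proposer's age in years at cover start
    years_between(to_date("proposer_date_of_birth"), cover_start).alias("proposer_age"),
    # Vehicle age in years at cover start
    (cover_start.dt.year() - pl.col("year_of_manufacture")).alias("vehicle_age"),
    # Part of the postcode before the first space
    pl.col("postcode").str.split(" ").list.first().alias("postcode_area"),
)
return df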
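To make the arithmetic concrete, here is a plain-Python sketch of what these three features work out to for a single row. The functions are illustrative stand-ins written for this example; the generated helpers operate on Polars expressions and may round or handle edge cases differently. The dates and postcode are made-up sample values.

```python
from datetime import date


def whole_years_between(start: date, end: date) -> int:
    """Completed years from start to end (for example, an age at the end date)."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet been reached in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def outward_code(postcode: str) -> str:
    """Part of the postcode before the first space."""
    return postcode.split(" ")[0]


cover_start = date(2025, 3, 1)
proposer_age = whole_years_between(date(1990, 6, 15), cover_start)
vehicle_age = cover_start.year - 2018
postcode_area = outward_code("SW1A 1AA")
print(proposer_age, vehicle_age, postcode_area)
```

Note that the vehicle age is a simple difference of calendar years, while the proposer age counts only completed years.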
Check your utility helpers
haute init generates utility/features.py with helpers for common tasks: clean_columns, to_date, years_between, months_between, days_between, postcode_area, and cols_matching. Open the file to see what's available -- they're short, readable functions you can modify or extend.
Use the column sidebar
The code editor has an Available Columns panel below it. Click the + button next to any column name to insert it at your cursor. If you're typing inside quotes ("..."), the editor will also suggest column names as you type.
When things go wrong
Column name not found
If you see ColumnNotFoundError, it usually means a typo. The code editor suggests column names as you type inside quotes -- use these suggestions. You can also check the Available Columns sidebar.
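If you want to track down a typo in code, the standard library can suggest the nearest match. This is a sketch using Python's difflib, not a description of how the editor's suggestions work; suggest_column and the sample names are assumptions made for illustration.

```python
from difflib import get_close_matches


def suggest_column(name: str, available: list[str]) -> list[str]:
    """Return up to three available column names that closely resemble `name`."""
    return get_close_matches(name, available, n=3, cutoff=0.6)


available = ["proposer_date_of_birth", "proposer_gender", "vehicle_make"]
# "brith" is a deliberate typo for "birth".
suggestions = suggest_column("proposer_date_of_brith", available)
print(suggestions)
```

With a real DataFrame, df.columns can be passed as the list of available names.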
Names are too long
Use the Columns tab to strip prefixes visually, or chain .rename() after clean_columns() to shorten specific names in code.
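When shortening specific names by hand, it is easy to mistype a source column or to map two columns to the same short name. The sketch below checks an explicit rename mapping against a plain list of column names before you apply it; check_renames is a hypothetical helper and the short names are examples only.

```python
def check_renames(columns: list[str], renames: dict[str, str]) -> list[str]:
    """Apply `renames` to a list of names, refusing missing or duplicate names."""
    missing = [old for old in renames if old not in columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    renamed = [renames.get(name, name) for name in columns]
    if len(set(renamed)) != len(renamed):
        raise ValueError("rename would produce duplicate column names")
    return renamed


columns = [
    "proposer_licence_licence_type",
    "add_ons_breakdown_cover_selected",
    "vehicle_make",
]
short_names = check_renames(
    columns,
    {
        "proposer_licence_licence_type": "licence_type",
        "add_ons_breakdown_cover_selected": "breakdown_cover",
    },
)
print(short_names)
```

Once the mapping passes the check, the same dictionary can be given to df.rename() on the Polars DataFrame.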