Preparing Your Data

When you load JSON data into Haute, it arrives with dot-notation column names like proposer.date_of_birth, vehicle.make, and additional_drivers.1.gender. Before you can build a model, you need to turn these into clean, simple column names.

This page shows you how to use clean_columns() to do that in one line.


What you're starting with

After connecting a Quote Input node (or any JSON data source), your data preview will show columns like these:

proposer.date_of_birth          proposer.gender
proposer.licence.licence_type   proposer.licence.licence_date
vehicle.make                    vehicle.model
vehicle.security.alarm          vehicle.security.immobiliser
additional_drivers.1.gender     additional_drivers.2.gender
add_ons.breakdown_cover.selected
address.postcode                address.city
policy_details.cover_type       policy_details.voluntary_excess
policy_details.cover_start_date policy_details.compulsory_excess

These dot-notation names come directly from the nested JSON structure. They're accurate, but they're awkward to work with.
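
To see where names like additional_drivers.1.gender come from, here is a minimal plain-Python sketch of flattening a nested record into dot-notation keys. It is an illustration only, not Haute's actual loader: the flatten function, the sample quote, and the choice to number list items from 1 (to match the preview above) are all assumptions.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a {dotted_name: value} dict."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        # Assumption: list positions are numbered from 1, as in
        # additional_drivers.1.gender in the preview above.
        items = ((str(i), v) for i, v in enumerate(obj, start=1))
    else:
        # A leaf value: the accumulated path becomes the column name.
        return {prefix: obj}

    flat = {}
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, name))
    return flat


# Made-up sample record, for illustration only.
quote = {
    "proposer": {"gender": "F", "licence": {"licence_type": "full"}},
    "vehicle": {"make": "Ford"},
    "additional_drivers": [{"gender": "M"}, {"gender": "F"}],
}

for name in flatten(quote):
    print(name)
# proposer.gender
# proposer.licence.licence_type
# vehicle.make
# additional_drivers.1.gender
# additional_drivers.2.gender
```

Each level of nesting adds one dot-separated segment, which is why deeply nested fields end up with long names.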


Using clean_columns()

clean_columns() is a utility function in your project's utility/features.py (generated by haute init). It replaces every . with _ in column names.

Add a Polars node after your data source. In the code editor, write:

df = clean_columns(quotes)
return df

That's it. Click Run and look at the data preview -- your columns are now clean.


How the naming works

The rename is fully mechanical: every . becomes _. No heuristics, no surprises.

Original column                    After clean_columns()
address.postcode                   address_postcode
vehicle.make                       vehicle_make
proposer.licence.licence_type      proposer_licence_licence_type
vehicle.security.alarm             vehicle_security_alarm
additional_drivers.1.gender        additional_drivers_1_gender
add_ons.breakdown_cover.selected   add_ons_breakdown_cover_selected

The names preserve the full path from the JSON structure, so you always know exactly which field a column came from. Use the Columns tab to strip prefixes visually, or chain .rename() afterward to shorten specific names.
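
The rename rule is simple enough to sketch in a few lines of plain Python. This is not the code in utility/features.py (open that file to see the real helper); clean_name_mapping is a hypothetical name used here only to show the rule.

```python
def clean_name_mapping(columns):
    """Map each original name to the same name with every '.' replaced by '_'."""
    return {name: name.replace(".", "_") for name in columns}


columns = [
    "address.postcode",
    "proposer.licence.licence_type",
    "additional_drivers.1.gender",
]
mapping = clean_name_mapping(columns)

for old, new in mapping.items():
    print(f"{old} -> {new}")
# address.postcode -> address_postcode
# proposer.licence.licence_type -> proposer_licence_licence_type
# additional_drivers.1.gender -> additional_drivers_1_gender

# With a Polars DataFrame, a mapping like this could be applied with
# df.rename(mapping).
```

Because the replacement is purely textual, two different source names such as a.b_c and a_b.c would both become a_b_c. That is worth checking if your JSON keys already contain underscores.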


Stripping prefixes

After clean_columns(), names like policy_details_cover_type are longer than you need. The Columns tab on the Quote Input node lets you strip prefixes visually:

  • Click the group name (e.g. policy_details) to strike it through and remove it from all columns in that group
  • Click individual segments within a column name to strip them (e.g. click licence in licence.licence_type to get licence_type)

You can also rename in code:

df = clean_columns(quotes)
df = df.rename({"address_postcode": "postcode", "vehicle_make": "make"})
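
If you want to drop one prefix from a whole group of columns in code rather than in the Columns tab, you could build the rename mapping programmatically. The sketch below is plain Python; strip_prefix_mapping is a hypothetical helper, not part of Haute or of the generated utility/features.py.

```python
def strip_prefix_mapping(columns, prefix):
    """Build a rename mapping that drops `prefix` from every column starting with it."""
    mapping = {c: c[len(prefix):] for c in columns if c.startswith(prefix)}

    # Refuse to produce duplicate column names after the rename.
    untouched = [c for c in columns if c not in mapping]
    new_names = untouched + list(mapping.values())
    if len(set(new_names)) != len(new_names):
        raise ValueError(f"stripping {prefix!r} would create duplicate column names")
    return mapping


columns = [
    "policy_details_cover_type",
    "policy_details_voluntary_excess",
    "vehicle_make",
]
mapping = strip_prefix_mapping(columns, "policy_details_")
print(mapping)
# {'policy_details_cover_type': 'cover_type',
#  'policy_details_voluntary_excess': 'voluntary_excess'}

# In a Polars node this could then be applied as:
#   df = df.rename(strip_prefix_mapping(df.columns, "policy_details_"))
```

The duplicate check matters because stripping vehicle_ from vehicle_make would clash with an existing make column, and it is easier to catch that before the rename than after.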

Adding derived features

After clean_columns(), add derived columns with normal Polars code. Your project's utility/features.py has helpers for common operations:

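import polars as pl
from utility.features import clean_columns, to_date, years_between

df = clean_columns(quotes)

# The short names used below (cover_start_date, year_of_manufacture, postcode)
# assume you have already stripped those prefixes. If you haven't, use the
# full cleaned names, e.g. policy_details_cover_start_date or address_postcode.
cover_start = to_date("cover_start_date")

df = df.with_columns(
    # Proposer's age in years on the day cover starts
    years_between(to_date("proposer_date_of_birth"), cover_start).alias("proposer_age"),
    # Vehicle age: year cover starts minus year of manufacture
    (cover_start.dt.year() - pl.col("year_of_manufacture")).alias("vehicle_age"),
    # First part of the postcode, i.e. everything before the space
    pl.col("postcode").str.split(" ").list.first().alias("postcode_area"),
)

return df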
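
To make the three derived features concrete, here is a worked example for a single made-up quote in plain Python. It assumes years_between counts completed years, which is a guess: the helper in your utility/features.py may use a different convention (for example fractional years), so check the file. completed_years is a hypothetical stand-in, not the real helper.

```python
from datetime import date


def completed_years(start, end):
    """Whole years elapsed between two dates (assumed convention)."""
    years = end.year - start.year
    # Subtract one if the anniversary has not yet been reached in the end year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


# Made-up values for one quote, for illustration only.
row = {
    "proposer_date_of_birth": date(1990, 6, 15),
    "cover_start_date": date(2024, 3, 1),
    "year_of_manufacture": 2018,
    "postcode": "SW1A 1AA",
}

proposer_age = completed_years(row["proposer_date_of_birth"], row["cover_start_date"])
vehicle_age = row["cover_start_date"].year - row["year_of_manufacture"]
postcode_area = row["postcode"].split(" ")[0]

print(proposer_age, vehicle_age, postcode_area)
# 33 6 SW1A
```

The proposer is 33 rather than 34 because cover starts in March, before the June birthday.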

Check your utility helpers

haute init generates utility/features.py with helpers for common tasks: clean_columns, to_date, years_between, months_between, days_between, postcode_area, and cols_matching. Open the file to see what's available -- they're short, readable functions you can modify or extend.

Use the column sidebar

The code editor has an Available Columns panel below it. Click the + button next to any column name to insert it at your cursor. If you're typing inside quotes ("..."), the editor will also suggest column names as you type.


When things go wrong

Column name not found

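If you see ColumnNotFoundError, the name in your code doesn't match any column in the data. Usually that means a typo, or a name still written in dot-notation (proposer.date_of_birth) after clean_columns() has replaced the dots. The code editor suggests column names as you type inside quotes -- use these suggestions. You can also check the Available Columns sidebar.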
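
If you'd rather track down the mismatch in code, one option is to compare the failing name against the real column names with Python's standard difflib module. This is a sketch, not a Haute feature; suggest_columns is a hypothetical helper, and in a real node the column list would come from df.columns.

```python
import difflib


def suggest_columns(name, columns, n=3):
    """Return up to n existing column names that look most like `name`."""
    return difflib.get_close_matches(name, columns, n=n, cutoff=0.6)


columns = [
    "proposer_date_of_birth",
    "proposer_gender",
    "vehicle_make",
    "address_postcode",
]

# A transposed-letter typo
print(suggest_columns("proposer_date_of_brith", columns, n=1))
# ['proposer_date_of_birth']

# A name still in dot-notation
print(suggest_columns("proposer.date_of_birth", columns, n=1))
# ['proposer_date_of_birth']
```

An empty list means nothing in the data is close to the name you typed, which usually points to a missing upstream step rather than a typo.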

Names are too long

Use the Columns tab to strip prefixes visually, or rename in code:

df = df.rename({"proposer_licence_licence_type": "proposer_licence_type"})