How to Clean Messy Data with ChatGPT

Standardize dates, fix inconsistent categories, and strip duplicates in a dirty dataset using ChatGPT's Python sandbox.

10 min · Intermediate

Real data is rarely tidy: dates in five formats, country values like USA, U.S.A., and United States that should be a single value, and duplicate rows. ChatGPT can fix these systematically with Python and hand you a clean file. This guide walks through the common cleaning steps and how to direct each one.

What you need

  • A ChatGPT account with file upload
  • A messy CSV or Excel file
  • A rough idea of which columns are unreliable

Step 1: Profile the mess

Upload the file and ask ChatGPT to report problems before changing anything: missing values per column, duplicate rows, and the unique values in any category column you suspect. This gives you a checklist to clean against.

ChatGPT — data profile
You
Profile this file: count nulls per column, count duplicate rows, and list unique values in 'country'.
Agent
37 nulls in email, 12 duplicate rows, and country has 9 spellings of 4 actual countries (e.g. UK, U.K., United Kingdom).
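
For reference, the profile requested above corresponds roughly to the pandas calls below. This is a minimal sketch, not the exact code ChatGPT runs: the sample DataFrame and the profile helper are hypothetical stand-ins for your uploaded file.

```python
import pandas as pd

# Hypothetical sample standing in for the uploaded file.
df = pd.DataFrame({
    "email": ["a@example.com", None, "c@example.com", "a@example.com"],
    "country": ["UK", "U.K.", "United Kingdom", "UK"],
})

def profile(frame, category_col):
    """Summarize nulls, exact duplicate rows, and distinct category values."""
    return {
        "nulls_per_column": frame.isna().sum().to_dict(),
        # duplicated() marks every repeat of an earlier, fully identical row.
        "duplicate_rows": int(frame.duplicated().sum()),
        "unique_values": sorted(frame[category_col].dropna().unique()),
    }

report = profile(df, "country")
print(report)
```

Asking ChatGPT to show the code it ran lets you compare it against a sketch like this and confirm it counted what you expected.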

Step 2: Standardize categories with an explicit mapping

Do not let the model guess silently. Ask it to propose a mapping from the messy values to canonical ones, review the mapping, then apply it. This keeps you in control of how United Kingdom is spelled.

category mapping
mapping = {
    "uk": "United Kingdom", "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
    "usa": "United States", "u.s.a.": "United States",
    "united states": "United States",
}
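# Normalize whitespace and case before the lookup. Values missing from
# the mapping keep their original (stripped) spelling for later review.
stripped = df["country"].str.strip()
df["country"] = stripped.str.lower().map(mapping).fillna(stripped)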
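
Values that are missing from the mapping pass through unchanged, so it can help to list them before applying it. The sketch below shows one way to do that; the sample data, including the misspelled "Untied Kingdom", is invented for illustration.

```python
import pandas as pd

# Hypothetical data; "Untied Kingdom" is a typo the mapping does not cover.
df = pd.DataFrame(
    {"country": [" UK", "u.s.a.", "Untied Kingdom", "United States"]}
)
mapping = {
    "uk": "United Kingdom", "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
    "usa": "United States", "u.s.a.": "United States",
    "united states": "United States",
}

# Apply the same normalization the mapping step uses.
normalized = df["country"].str.strip().str.lower()
# Non-null values whose normalized form has no mapping entry.
not_covered = ~normalized.isin(list(mapping)) & normalized.notna()
unmapped = sorted(df.loc[not_covered, "country"].unique())
print("Not covered by mapping:", unmapped)
```

Anything this reports is either a spelling to add to the mapping or a genuinely different country.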

Step 3: Parse dates and drop duplicates

Ask ChatGPT to parse the date column into a real datetime and flag any rows it could not parse, rather than silently dropping them. Then remove exact duplicate rows, keeping the first occurrence.

dates and duplicates
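import pandas as pd

raw_dates = df["signup_date"]  # keep the raw values for comparison
df["signup_date"] = pd.to_datetime(raw_dates, errors="coerce")
# Count only values that were present but failed to parse,
# not rows where the date was already missing.
unparsed = df[df["signup_date"].isna() & raw_dates.notna()]
print("Could not parse:", len(unparsed))

before = len(df)
# Exact duplicates only: every column must match.
df = df.drop_duplicates(keep="first")
print("Removed", before - len(df), "duplicate rows")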
Coerce hides errors
errors='coerce' turns unparseable dates into missing values (NaT) instead of raising an error. Always have ChatGPT report how many rows that affected so you do not lose data without noticing.
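
When a column really does mix several date formats, a single to_datetime call may coerce more rows than you expect. One option is to try a short list of explicit formats in turn and keep the first that parses each value. The sketch below assumes three formats and treats slash dates as day-first; both are assumptions to adjust for your own file.

```python
import pandas as pd

# Hypothetical raw values in mixed formats, plus one that is not a date.
raw = pd.Series(["2024-01-05", "05/02/2024", "March 3, 2024", "not a date"])

# Assumed formats, tried in order. Treating "05/02/2024" as day-first
# is a judgment call that depends on where the data came from.
formats = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    # Only fill values that earlier formats could not parse.
    attempt = pd.to_datetime(raw, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

# Values that were present but matched none of the formats.
failed = raw[parsed.isna() & raw.notna()]
print("Could not parse:", failed.tolist())
```

Listing the failed raw values, rather than only counting them, makes it easier to decide whether to add another format or fix the values by hand.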

Step 4: Export the cleaned file

Once the row counts and per-column null counts look right, ask ChatGPT to save the cleaned DataFrame to a new CSV and give you a download link. Keep the original file untouched so you can always rerun the cleaning.

Export
You
Save the cleaned data as customers_clean.csv and give me a download link.
Agent
Saved 1,153 rows (was 1,165). Download: customers_clean.csv
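
Behind that request, the save is likely a to_csv call. The following is a minimal sketch with a read-back check; the two-row DataFrame and the temporary output directory are placeholders, not part of the original workflow.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for the result of the earlier steps.
df = pd.DataFrame({
    "country": ["United Kingdom", "United States"],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-02-05"]),
})

# A temporary directory is used here so the original file is never touched.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "customers_clean.csv")

# index=False keeps the pandas row index out of the file.
df.to_csv(out_path, index=False)

# Read the file back to confirm the row count and columns survived.
check = pd.read_csv(out_path, parse_dates=["signup_date"])
print(len(check), list(check.columns))
```

Comparing the reloaded row count with the figure ChatGPT reports is a quick way to catch a truncated or mis-saved export.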

Result

Your dataset now has one spelling per country, real date values, no duplicate rows, and a clear record of what was changed, ready for analysis or import elsewhere.

Tags
#data-analysis #data-cleaning #python #pandas