How to Clean Messy Data with ChatGPT
Standardize dates, fix inconsistent categories, and strip duplicates in a dirty dataset using ChatGPT's Python sandbox.
Real data is rarely tidy: dates in five formats, country values like USA, U.S.A., and United States that should be one thing, and duplicate rows. ChatGPT can fix these systematically with Python and hand you a clean file. This guide walks through the common cleaning steps and how to direct each one.
What you need
- A ChatGPT account with file upload
- A messy CSV or Excel file
- A rough idea of which columns are unreliable
Step 1: Profile the mess
Upload the file and ask ChatGPT to report problems before changing anything: missing values per column, duplicate rows, and the unique values in any category column you suspect. This gives you a checklist to clean against.
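For reference, the profiling request above roughly corresponds to the pandas sketch below. It is only an illustration of what ChatGPT might run: the `profile` helper, the sample rows, and the column names are hypothetical stand-ins for your own file.

```python
import pandas as pd


def profile(df: pd.DataFrame, category_cols: list[str]) -> dict:
    """Summarize data problems without modifying the DataFrame."""
    return {
        # Count of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Distinct spellings in each suspect category column
        "unique_values": {
            col: sorted(df[col].dropna().astype(str).unique())
            for col in category_cols
        },
    }


# Hypothetical sample standing in for an uploaded file
sample = pd.DataFrame({
    "country": ["USA", "U.S.A.", "United States", "USA", None],
    "signup_date": ["2024-01-05", "01/06/2024", "Jan 7, 2024",
                    "2024-01-05", "2024-01-08"],
})

report = profile(sample, ["country"])
print(report)
```

Because the helper only reads the data, the report can be rerun after each cleaning step to see which items on the checklist are resolved.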
Step 2: Standardize categories with an explicit mapping
Do not let the model guess silently. Ask it to propose a mapping from the messy values to canonical ones, review the mapping, then apply it. This keeps you in control of how United Kingdom is spelled.
# Keys are the lowercased, trimmed variants; values are the canonical spelling
mapping = {
    "uk": "United Kingdom", "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
    "usa": "United States", "u.s.a.": "United States",
    "united states": "United States",
}
# Values with no entry in the mapping keep their original text
df["country"] = (df["country"].str.strip().str.lower()
                 .map(mapping).fillna(df["country"]))
Step 3: Parse dates and drop duplicates
Ask ChatGPT to parse the date column into a real datetime and flag any rows it could not parse, rather than silently dropping them. Then remove exact duplicate rows, keeping the first occurrence.
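import pandas as pd

# Unparseable dates become NaT instead of raising an error.
# On pandas 2.x, add format="mixed" if the column mixes several date formats.
df["signup_date"] = pd.to_datetime(df["signup_date"],
                                   errors="coerce")
# NaT rows include failed parses and values that were already missing
unparsed = df[df["signup_date"].isna()]
print("Could not parse:", len(unparsed))
before = len(df)
# Drop exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates(keep="first")
print("Removed", before - len(df), "duplicate rows")
Step 4: Export the cleaned file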
Once the row and column counts look right, ask ChatGPT to save the cleaned DataFrame to a new CSV and give you a download link. Keep the original file untouched so you can always rerun the cleaning.
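A minimal sketch of this export step follows, assuming the cleaned data is a pandas DataFrame. The `export_clean` helper, the `_clean` suffix, and the `customers.csv` file name are hypothetical; the point is only that the output goes to a new file name and the original is never overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_clean(df: pd.DataFrame, original_path: str,
                 suffix: str = "_clean") -> Path:
    """Write df next to the original file under a new name."""
    src = Path(original_path)
    out = src.with_name(f"{src.stem}{suffix}.csv")
    if out == src:
        raise ValueError("Refusing to overwrite the original file")
    df.to_csv(out, index=False)  # index=False avoids an extra unnamed column
    return out


# Hypothetical cleaned data standing in for the result of the earlier steps
cleaned = pd.DataFrame({
    "country": ["United States", "United Kingdom"],
    "signup_date": pd.to_datetime(["2024-01-05", "2024-01-06"]),
})

# Quick sanity checks before exporting
assert not cleaned.duplicated().any()
assert pd.api.types.is_datetime64_any_dtype(cleaned["signup_date"])

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "customers.csv"  # hypothetical original file
    original.write_text("placeholder for the untouched original\n")
    out_path = export_clean(cleaned, str(original))
    original_after = original.read_text()  # still the placeholder text
    # Read the export back to confirm it round-trips
    round_trip = pd.read_csv(out_path, parse_dates=["signup_date"])
    print(out_path.name, len(round_trip))
```

Reading the exported file back is a cheap way to confirm the row count and date column survived the write before you download it.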
Result
Your dataset now has one spelling per country, real date values, no duplicate rows, and a clear record of what was changed, ready for analysis or import elsewhere.