CleanCSV API Documentation
Upload a CSV, get clean data. No API key required.
Endpoints
| Method | Path | Description | Limits |
|---|---|---|---|
| POST | /api/demo/clean | Clean CSV (rate-limited demo) | 1 MB, 10 req/min |
| POST | /api/demo/suggest | Suggest dedup rules (rate-limited demo) | 1 MB, 10 req/min |
| POST | /api/v1/clean | Clean CSV (open, no auth) | 4.5 MB |
| POST | /api/v1/suggest | Suggest dedup rules (open, no auth) | 4.5 MB |
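The size limits in the table can be checked on the client before uploading. The following is a minimal sketch, not part of the API: `LIMITS` and `fits_limit` are hypothetical names, and it assumes "MB" means 1,000,000 bytes, which the table does not specify.

```python
# Size limits from the Endpoints table, in bytes.
# Assumption: "MB" means 1,000,000 bytes; the docs do not say whether
# the service counts decimal or binary megabytes.
MB = 1_000_000
LIMITS = {
    "/api/demo/clean": 1 * MB,
    "/api/demo/suggest": 1 * MB,
    "/api/v1/clean": int(4.5 * MB),
    "/api/v1/suggest": int(4.5 * MB),
}


def fits_limit(path: str, size_bytes: int) -> bool:
    """Return True if a file of size_bytes is within the documented limit for path."""
    return size_bytes <= LIMITS[path]
```

A caller could pass `os.path.getsize("data.csv")` as `size_bytes` and fall back from a demo endpoint to a v1 endpoint when the file is too large for the demo limit.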
Quick Start
Send a CSV as multipart form data with an optional JSON config:
curl -X POST https://your-domain.vercel.app/api/v1/clean \
  -F "file=@data.csv" \
  -F 'config={"format":"json","dedup":{"columns":["email"],"strategy":"most_complete"}}'
Configuration
Pass the configuration as a config field in the multipart form data, or as a JSON body:
{
  "format": "json",
  "dedup": {
    "columns": ["email"],
    "strategy": "most_complete",
    "fuzzy": false,
    "threshold": 0.85,
    "normalize": true,
    "preview": false,
    "nulls_equal": true
  },
  "encoding": {
    "encoding": "auto",
    "fallback": ["windows-1252", "iso-8859-1"],
    "fix_mojibake": true,
    "normalize_unicode": "NFC",
    "strip_control_chars": true,
    "strip_invisible": true,
    "custom_mojibake_mappings": [
      { "from": "Ã©", "to": "é" }
    ]
  },
  "null_handling": {
    "null_values": ["", "NULL", "N/A", "-"],
    "treat_empty_as_null": true,
    "type_inference": true,
    "fill": "leave",
    "per_column": {
      "age": { "cast": "int", "fill": "fill_median" },
      "status": { "fill": "fill_default:unknown" }
    }
  },
  "null_representation": ""
}
Parameters
| Field | Type | Default | Description |
|---|---|---|---|
| format | json \| csv | json | Output format |
| dedup.columns | string[] | all | Columns to deduplicate by |
| dedup.strategy | string | first | first \| last \| most_complete \| merge |
| dedup.fuzzy | boolean | false | Enable fuzzy matching |
| dedup.threshold | number | 0.85 | Fuzzy similarity threshold (0-1) |
| dedup.normalize | boolean | false | Normalize values before comparing |
| dedup.preview | boolean | false | Return all rows + duplicate_groups |
| dedup.nulls_equal | boolean | true | Treat two null values as equal in dedup |
| null_handling | object | {} | Null detection and fill config (see below) |
| null_representation | string | "" | How nulls appear in CSV output |
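The parameters above can be assembled and sanity-checked on the client before sending. This is a sketch only: `build_config` is a hypothetical helper, the service performs its own validation, and omitting `dedup.columns` to get the "all" default is an assumption based on the Default column.

```python
import json

VALID_FORMATS = {"json", "csv"}
VALID_STRATEGIES = {"first", "last", "most_complete", "merge"}


def build_config(columns=None, strategy="first", fuzzy=False,
                 threshold=0.85, output_format="json"):
    """Build a config dict using the field names from the Parameters table.

    Hypothetical client-side helper; it only mirrors the documented values.
    """
    if output_format not in VALID_FORMATS:
        raise ValueError(f"format must be one of {sorted(VALID_FORMATS)}")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"strategy must be one of {sorted(VALID_STRATEGIES)}")
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    dedup = {"strategy": strategy, "fuzzy": fuzzy, "threshold": threshold}
    if columns is not None:
        # Assumption: leaving the key out selects the documented default (all columns).
        dedup["columns"] = list(columns)
    return {"format": output_format, "dedup": dedup}


# JSON string suitable for the multipart "config" field.
config_json = json.dumps(build_config(columns=["email"], strategy="most_complete"))
```

Rejecting an unknown strategy locally avoids a round trip that would otherwise end in a 400 response.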
Null Handling
null_handling Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| null_values | string[] | 17 defaults | Strings treated as null |
| treat_empty_as_null | boolean | true | Convert empty strings to null |
| type_inference | boolean | true | Auto-detect and cast column types |
| fill | string | leave | Global fill strategy |
| per_column | object | {} | Per-column overrides (fill, cast, null_values) |
Fill Strategies
| Strategy | Description |
|---|---|
| leave | Keep null as-is |
| drop_row | Drop entire row if any configured column is null |
| drop_column | Drop column if >50% of values are null |
| forward_fill | Fill null with previous non-null value |
| fill_mean | Fill with column mean (numeric columns) |
| fill_median | Fill with column median (numeric columns) |
| fill_mode | Fill with most common value |
| fill_default:VALUE | Fill with a custom default value |
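To make the column-level strategies concrete, here is an illustrative sketch of how they behave on a single column held as a Python list with `None` for nulls. It is not the service's implementation: `fill_column` is a hypothetical function, and `drop_row` and `drop_column` are left out because they act on the whole table rather than on one column.

```python
from collections import Counter
from statistics import mean, median


def fill_column(values, strategy):
    """Fill None entries in a list of values according to a fill strategy name."""
    present = [v for v in values if v is not None]
    if strategy == "leave":
        return list(values)
    if strategy == "forward_fill":
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)  # leading nulls stay None: nothing to copy yet
        return filled
    if strategy == "fill_mean":
        replacement = mean(present)
    elif strategy == "fill_median":
        replacement = median(present)
    elif strategy == "fill_mode":
        replacement = Counter(present).most_common(1)[0][0]
    elif strategy.startswith("fill_default:"):
        replacement = strategy.split(":", 1)[1]
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [replacement if v is None else v for v in values]
```

For example, `fill_column([1, None, 3, 100], "fill_median")` fills the gap with 3, while `fill_mean` on the same column would use the mean of 1, 3, and 100, which the outlier pulls upward.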
Type Inference
When enabled, columns are auto-detected as int, float, boolean, or string using an 80% threshold.
Boolean values recognized: true/false, yes/no, 1/0, on/off, t/f, y/n
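One way to picture threshold-based inference is the sketch below. It rests on assumptions the text above does not confirm: that the 80% is measured over non-null values, that candidate types are tried in the order int, float, boolean, and that boolean tokens are matched case-insensitively. `infer_type` and its helpers are hypothetical names.

```python
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "1", "0", "on", "off", "t", "f", "y", "n"}


def _is_int(text):
    try:
        int(text)
        return True
    except ValueError:
        return False


def _is_float(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def _is_boolean(text):
    return text.strip().lower() in BOOLEAN_TOKENS


def infer_type(values, threshold=0.8):
    """Guess a column type from string values; None entries are ignored."""
    present = [v for v in values if v is not None]
    if not present:
        return "string"
    # Assumption: candidates are tried in this order and the first to reach
    # the threshold wins, so a column of only "1"/"0" is reported as int.
    for name, check in (("int", _is_int), ("float", _is_float), ("boolean", _is_boolean)):
        if sum(check(v) for v in present) / len(present) >= threshold:
            return name
    return "string"
```

Under these assumptions a column such as `["1", "2", "3", "4", "x"]` is inferred as int, because four of its five values parse as integers.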
Important Notes
- 0 and false are NEVER treated as null unless explicitly listed in null_values
- Pipeline order: fix encoding → normalize text → detect nulls → fill → cast types → dedup → output
- JSON output uses native null/number/boolean values; CSV uses configurable null_representation
- Per-column config overrides global settings
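Two of these notes can be sketched in a few lines: null detection that leaves 0 and false alone, and per-column settings taking precedence over the global fill. The functions `is_null` and `effective_fill` are hypothetical, the null list is a short sample rather than the 17 built-in defaults, and exact, case-sensitive matching is an assumption.

```python
# Sample placeholders only; the service's 17 default null strings are not listed here.
SAMPLE_NULL_VALUES = ["NULL", "N/A", "-"]


def is_null(value, null_values=SAMPLE_NULL_VALUES, treat_empty_as_null=True):
    """Return True if a raw cell should be treated as null."""
    if value is None:
        return True
    text = str(value)
    if treat_empty_as_null and text == "":
        return True
    # Assumption: exact, case-sensitive match against the configured strings.
    return text in null_values


def effective_fill(column, null_handling):
    """Resolve the fill strategy for a column: per_column wins over the global fill."""
    override = null_handling.get("per_column", {}).get(column, {})
    return override.get("fill", null_handling.get("fill", "leave"))
```

With `{"fill": "leave", "per_column": {"age": {"fill": "fill_median"}}}`, the age column resolves to fill_median and every other column to leave.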
Keep Strategies
| Strategy | Description |
|---|---|
| first | Keep the first occurrence in original order |
| last | Keep the last occurrence |
| most_complete | Keep the row with the most non-null/non-empty values |
| merge | Merge all rows in the group, taking the first non-null value per column |
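The four keep strategies can be illustrated on one group of duplicate rows. This is a sketch rather than the service's code: `resolve_group` is a hypothetical function, rows are plain dicts in original order, and breaking most_complete ties in favor of the earliest row is an assumption.

```python
def _is_empty(value):
    return value is None or value == ""


def resolve_group(rows, strategy):
    """Collapse a list of duplicate rows (dicts, in original order) into one row."""
    if strategy == "first":
        return rows[0]
    if strategy == "last":
        return rows[-1]
    if strategy == "most_complete":
        # max() returns the earliest row on ties (an assumption about tie-breaking).
        return max(rows, key=lambda row: sum(not _is_empty(v) for v in row.values()))
    if strategy == "merge":
        merged = {}
        for row in rows:
            for column, value in row.items():
                if merged.get(column) is None:  # keep the first non-null value seen
                    merged[column] = value
        return merged
    raise ValueError(f"unknown strategy: {strategy}")
```

The difference between most_complete and merge shows up when no single row is complete: most_complete still returns one of the original rows, while merge can produce a row that did not exist in the input.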
Encoding Issue Types
| Type | Description |
|---|---|
| mojibake_detected | Mojibake patterns found but not fully repaired |
| mojibake_repaired | Mojibake successfully repaired |
| double_encoding_repaired | Double/triple UTF-8 encoding reversed |
| custom_mappings_applied | User-defined mappings applied |
| bom_removed | Byte order mark stripped |
| control_chars_removed | Control characters removed |
| invisible_chars_removed | Zero-width/invisible characters removed |
| unicode_normalized | Unicode normalization applied (NFC/NFD/NFKC/NFKD) |
| line_endings_normalized | Line endings normalized |
| encoding_fallback_used | Fallback encoding used for better results |
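For readers unfamiliar with these issue types, the sketch below shows what three of them mean in practice: mojibake repair, BOM removal, and Unicode normalization. It uses only the Python standard library and is not the service's repair logic; `repair_mojibake` and `strip_bom` are hypothetical names, and only the common case of UTF-8 bytes mis-decoded as Windows-1252 is handled.

```python
import unicodedata


def repair_mojibake(text):
    """Undo UTF-8 bytes that were mis-decoded as Windows-1252, if possible."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave the text unchanged


def strip_bom(text):
    """Remove a leading byte order mark (U+FEFF) if present."""
    return text[1:] if text.startswith("\ufeff") else text


# "Caf" followed by U+00C3 U+00A9 is the classic mojibake for "Caf" + U+00E9.
repaired = repair_mojibake("Caf\u00c3\u00a9")

# NFC composes "e" + combining acute accent (U+0301) into the single code point U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")
```

Text that is already correct fails the round trip and is returned unchanged, which is why the function can be applied without first detecting whether repair is needed.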
Response Examples
Clean Response (JSON)
{
  "meta": {
    "rows": 142,
    "columns": 5,
    "removed": 8,
    "encoding_detected": "utf-8",
    "encoding_confidence": 0.99,
    "encoding_issues": ["mojibake_repaired: known_mojibake_replaced"],
    "null_profiles": [
      {
        "column": "email",
        "null_count": 3,
        "null_percentage": 0.021,
        "top_null_values": [{ "value": "N/A", "count": 2 }],
        "inferred_type": "string"
      }
    ],
    "null_warnings": [
      { "column": "age", "message": "Column \"age\" appears numeric but has non-numeric values", "severity": "warning" }
    ],
    "type_map": { "id": "int", "email": "string", "age": "int", "active": "boolean", "name": "string" }
  },
  "data": [
    { "id": 1, "email": "alice@example.com", "age": 30, "active": true, "name": "Alice" },
    { "id": 2, "email": "bob@example.com", "age": null, "active": false, "name": "Bob" }
  ],
  "duplicate_groups": [
    {
      "reason": "exact match on email",
      "confidence": 1.0,
      "rows": [
        { "id": 1, "email": "alice@example.com", "age": 30, "active": true, "name": "Alice" },
        { "id": 3, "email": "alice@example.com", "age": 30, "active": true, "name": "alice" }
      ],
      "kept": { "id": 1, "email": "alice@example.com", "age": 30, "active": true, "name": "Alice" },
      "kept_row_index": 0,
      "droppedIndices": [2]
    }
  ]
}
Suggest Response
{
  "row_count": 150,
  "column_count": 5,
  "columns": ["id", "email", "name", "age", "active"],
  "suggested_rules": [
    {
      "dedupBy": ["id"],
      "fuzzy": false,
      "normalize": false,
      "reason": "\"id\" looks like a unique identifier"
    },
    {
      "dedupBy": ["email"],
      "fuzzy": false,
      "normalize": true,
      "reason": "\"email\" contains email addresses — normalizing for dedup"
    }
  ],
  "null_profiles": [
    { "column": "age", "null_count": 12, "null_percentage": 0.08, "top_null_values": [{ "value": "N/A", "count": 8 }], "inferred_type": "int" }
  ],
  "warnings": [
    { "column": "age", "message": "Column \"age\" has null-like placeholders: N/A", "severity": "info" }
  ]
}
Error Codes
| Status | Meaning | Example |
|---|---|---|
| 400 | Bad request | Invalid config, missing file, bad strategy |
| 413 | File too large | Exceeds 1 MB (demo) or 4.5 MB (v1) |
| 429 | Rate limited | Demo endpoints: 10 requests/minute |
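A client can branch on these status codes as in the sketch below. The action names and `interpret_response` are hypothetical, the format of error response bodies is not documented here so they are not parsed, and treating any 2xx status with a JSON body as success is an assumption.

```python
import json

# Suggested client actions keyed by the status codes in the Error Codes table.
ERROR_ACTIONS = {
    400: "fix_request",       # invalid config, missing file, bad strategy
    413: "reduce_file_size",  # over 1 MB (demo) or 4.5 MB (v1)
    429: "retry_later",       # demo endpoints allow 10 requests/minute
}


def interpret_response(status, body_text):
    """Return (action, payload) for a response from a clean or suggest call.

    Assumption: success is any 2xx status with a JSON body (format "json").
    """
    if 200 <= status < 300:
        return "ok", json.loads(body_text)
    return ERROR_ACTIONS.get(status, "unexpected_status"), None
```

Given the 10 requests/minute limit on the demo endpoints, waiting before retrying after a 429 is a reasonable default, though the exact rate-limit window is not specified.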