DataRulesGPT

A rules-only GPT that designs evidence-based data preprocessing rulesets for public-sector research and outputs each ruleset as matching JSON and CSV specs.

Overview
Version
v1.0.0
Created
2025-12-16
Updated
2025-12-16
data-preprocessing, survey-methodology, statistics, research, governance, reproducibility
datarulesgpt, preprocess-rules
Key functions
  • Generate a draft preprocessing ruleset from the analysis goal, method, and variable metadata
  • Design rules in a standard order: missingness → outliers → category merges → recodes → transformations → scaling → method-specific rules
  • Document each rule as condition → action → parameters → rationale (survey methodology / statistics standards)
  • Produce aligned JSON and CSV rule outputs (same rule set and structure)
  • Support a user-approval workflow: draft → approval → final ruleset
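The rule structure and stage order listed above can be sketched in Python. This is a minimal illustration, not the GPT's actual export schema: the field names (`stage`, `variable`, and so on), the stage identifiers, and the sample rules are all assumptions made for the example. The rationale cells use the 'no data provided' placeholder because no real metadata is available here.

```python
import csv
import io
import json

# Stage order taken from the "Key functions" list; identifiers are hypothetical.
STAGE_ORDER = [
    "missingness", "outliers", "category_merges", "recodes",
    "transformations", "scaling", "method_specific",
]

# Hypothetical column layout; the real export schema is not documented here.
FIELDS = ["rule_id", "stage", "variable", "condition",
          "action", "parameters", "rationale"]


def order_rules(rules):
    """Sort rules by stage order; sorted() is stable, so ties keep input order."""
    return sorted(rules, key=lambda r: STAGE_ORDER.index(r["stage"]))


def to_json(rules):
    """Serialize the ordered ruleset as a JSON array."""
    return json.dumps(order_rules(rules), indent=2)


def to_csv(rules):
    """Serialize the same ordered ruleset as CSV, one rule per row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rule in order_rules(rules):
        row = dict(rule)
        # Nested parameters are stored as a JSON string so they fit one cell.
        row["parameters"] = json.dumps(rule["parameters"], sort_keys=True)
        writer.writerow(row)
    return buf.getvalue()


# Illustrative rules only; values are placeholders, not recommendations.
rules = [
    {"rule_id": "R002", "stage": "category_merges", "variable": "region",
     "condition": "level frequency < threshold", "action": "merge_into_other",
     "parameters": {"threshold": 0.02}, "rationale": "no data provided"},
    {"rule_id": "R001", "stage": "missingness", "variable": "income",
     "condition": "missing rate > threshold", "action": "flag_for_review",
     "parameters": {"threshold": 0.30}, "rationale": "no data provided"},
]

if __name__ == "__main__":
    print(to_json(rules))
    print(to_csv(rules))
```

Both serializers go through the same `order_rules` call, which is one simple way to keep the JSON and CSV outputs aligned on rule set and order.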
Technical details
_id
g-692c2dd44cdc8191b5b728e93e559980
gpt_id
g-692c2dd44cdc8191b5b728e93e559980
viz1
public
viz2
show_url
language
en
Other fields
additional_features
["Managed ruleset fields (rule_id, priority, status) for governance and review", "Enforces the prescribed rule-generation order for consistency"]
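The managed fields and the draft → approval → final workflow could be modeled as a small state machine. The sketch below is an assumption about how such governance might be enforced: the status values (`draft`, `approved`, `final`), the `ManagedRule` class, and the `advance` helper are hypothetical names, not part of the GPT's documented output.

```python
from dataclasses import dataclass, replace

# Hypothetical status values and the transitions allowed between them.
ALLOWED_TRANSITIONS = {
    "draft": {"approved"},
    "approved": {"final"},
    "final": set(),
}


@dataclass(frozen=True)
class ManagedRule:
    rule_id: str
    priority: int
    status: str = "draft"


def advance(rule, new_status):
    """Return a copy of the rule in new_status, or raise if the step is not allowed."""
    if new_status not in ALLOWED_TRANSITIONS.get(rule.status, set()):
        raise ValueError(f"{rule.rule_id}: cannot move {rule.status} -> {new_status}")
    return replace(rule, status=new_status)


if __name__ == "__main__":
    rule = ManagedRule(rule_id="R001", priority=1)
    rule = advance(rule, "approved")   # user approves the draft
    rule = advance(rule, "final")      # approved rule enters the final ruleset
    print(rule)
```

Making the rule immutable and returning a new copy on each transition keeps an auditable trail possible, since earlier states are never overwritten in place.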
example_commands
["My goal is 'factors influencing policy satisfaction' and the method is logistic regression. Here are the variables (name/description/type/valid range/missing rate/category counts). Create a draft preprocessing ruleset in JSON and CSV.", "Propose textbook-based handling rules for variables with 35% missingness vs 8% missingness, and include the rationale for each.", "Create a rule to merge categorical levels below 2% frequency and include the rationale for preventing perfect separation in logis
ideal_use_cases
["Writing an auditable, reproducible preprocessing plan for survey/administrative data", "Choosing missing-data and outlier-handling rules aligned with the intended method (regression, logistic, PCA, clustering, etc.)", "Merging sparse categories to prevent dummy-variable explosion or perfect separation in logistic models", "Exporting a machine-readable ruleset (JSON/CSV) for downstream execution by another agent"]
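Since the GPT only generates rules, applying them is left to a downstream executor. As a rough sketch of what executing a sparse-category merge rule might look like, the function below folds rare levels into a single bucket. The function name, the `Other` label, and the default threshold are assumptions for illustration; the 2% figure echoes the example command and is not a recommended value.

```python
from collections import Counter


def merge_sparse_levels(values, threshold=0.02, other_label="Other"):
    """Replace levels whose relative frequency is below threshold with other_label."""
    counts = Counter(values)
    total = len(values)
    # Levels rarer than the threshold are collected for merging.
    sparse = {level for level, n in counts.items() if n / total < threshold}
    return [other_label if v in sparse else v for v in values]


if __name__ == "__main__":
    # Illustrative data: level "C" makes up 1% of 100 observations.
    sample = ["A"] * 60 + ["B"] * 39 + ["C"] * 1
    print(Counter(merge_sparse_levels(sample)))
```

In a real pipeline the threshold and target label would come from the rule's parameters field rather than from function defaults.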
limitations
["Does not manipulate data or run code (rules generation only)", "Does not invent/guess unseen statistics or thresholds; uses 'no data provided' when information is missing", "Output quality depends heavily on the provided metadata (types, missing rates, distributions, valid ranges, category counts, etc.)"]
target_users
["Public-policy and government researchers", "Data analysts / statisticians working with survey or administrative data", "Users who need a ruleset to hand off to an execution agent (e.g., CleaningGPT)"]