> **Warning**
> This repository is under active development and may not be stable.
HashPrep is a Python library for intelligent dataset profiling and debugging that acts as a comprehensive pre-training quality assurance tool for machine learning projects. Think of it as "Pandas Profiling + PyLint for datasets", designed specifically for machine learning workflows.
It catches critical dataset issues before they derail your ML pipeline, explains the problems, and suggests context-aware fixes.
If you want, HashPrep can even apply those fixes for you automatically.
Key features include:
- Intelligent Profiling: Detect missing values, skewed distributions, outliers, and data type inconsistencies.
- ML-Specific Checks: Identify data leakage, dataset drift, class imbalance, and high-cardinality features.
- Automated Preparation: Get suggestions for encoding, imputation, scaling, and transformations.
- Rich Reporting: Generate statistical summaries and exportable reports (HTML/PDF/Markdown/JSON) with embedded visualizations.
- Production-Ready Pipelines: Output reproducible cleaning and preprocessing code (`fixes.py`) that integrates seamlessly with ML workflows.
- Modern Themes: Choose between "Minimal" (professional) and "Neubrutalism" (bold) report styles.
## Installation

Install from PyPI with pip:

```bash
pip install hashprep
```

Or with uv:

```bash
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install hashprep
uv pip install hashprep
```

For development from source:

```bash
git clone https://github.com/cachevector/hashprep.git
cd hashprep
uv sync
```

After installation, the `hashprep` command will be available directly in your terminal.
## Usage

HashPrep can be used both as a command-line tool and as a Python library.

### `hashprep scan`

Get a quick summary of critical issues in your terminal.
```bash
hashprep scan dataset.csv
```

Options:

- `--critical-only`: Show only critical issues
- `--quiet`: Minimal output (counts only)
- `--json`: Output in JSON format
- `--target COLUMN`: Specify target column for ML-specific checks
- `--checks CHECKS`: Run specific checks (comma-separated)
- `--sample-size N`: Limit analysis to N rows
- `--no-sample`: Disable automatic sampling
Example:
```bash
# Scan with target column and specific checks
hashprep scan train.csv --target Survived --checks outliers,high_missing_values,class_imbalance

# Quick scan with JSON output
hashprep scan dataset.csv --json --quiet
```

### `hashprep details`

Get comprehensive details about all detected issues.
```bash
hashprep details dataset.csv
```

Options: Same as the `scan` command.
Example:
```bash
hashprep details train.csv --target Survived
```

### `hashprep report`

Generate comprehensive reports in multiple formats with visualizations.

```bash
hashprep report dataset.csv --format html --theme minimal
```

Options:

- `--format {md,json,html,pdf}`: Report format (default: md)
- `--theme {minimal,neubrutalism}`: HTML report theme (default: minimal)
- `--with-code`: Generate Python scripts for fixes and pipelines
- `--full/--no-full`: Include/exclude full summaries (default: True)
- `--visualizations/--no-visualizations`: Include/exclude plots (default: True)
- `--target COLUMN`: Specify target column
- `--checks CHECKS`: Run specific checks
- `--comparison FILE`: Compare with another dataset for drift detection
- `--sample-size N`: Limit analysis to N rows
- `--no-sample`: Disable automatic sampling
Examples:
```bash
# Generate HTML report with minimal theme
hashprep report dataset.csv --format html --theme minimal --full

# Generate PDF report without visualizations (faster)
hashprep report dataset.csv --format pdf --no-visualizations

# Generate report with automatic fix scripts
hashprep report dataset.csv --with-code
# This creates:
# - dataset_hashprep_report.md (or .html/.pdf/.json)
# - dataset_hashprep_report_fixes.py (pandas script)
# - dataset_hashprep_report_pipeline.py (sklearn pipeline)

# Compare two datasets for drift detection
hashprep report train.csv --comparison test.csv --format html
```

### `hashprep version`

Check the HashPrep version.

```bash
hashprep version
```

## Available Checks

- `outliers` - Detect outliers using the IQR method
- `duplicates` - Find duplicate rows
- `high_missing_values` - Columns with >50% missing data
- `dataset_missingness` - Overall missing data patterns
- `high_cardinality` - Categorical columns with too many unique values
- `single_value_columns` - Constant columns with no variance
- `class_imbalance` - Imbalanced target variable (requires `--target`)
- `feature_correlation` - Highly correlated features
- `target_leakage` - Features that may leak target information
- `dataset_drift` - Distribution drift between datasets (requires `--comparison`)
- `uniform_distribution` - Uniformly distributed numeric columns
- `unique_values` - Columns where >95% of values are unique
- `many_zeros` - Columns with excessive zero values
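For intuition, the IQR rule named by the `outliers` check is typically applied like the standalone sketch below: values falling more than 1.5 × IQR outside the quartiles are flagged. This is an illustration of the general technique only; HashPrep's actual implementation and thresholds may differ.

```python
# Standalone sketch of the 1.5 * IQR outlier rule (illustrative only;
# not HashPrep's internal implementation).
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))  # the 95 stands out
```

Raising `k` widens the fences and flags fewer points; 1.5 is the conventional default from Tukey's box-plot rule.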
## Python API

### Basic Analysis

```python
import pandas as pd
from hashprep import DatasetAnalyzer

# Load your dataset
df = pd.read_csv("dataset.csv")

# Create analyzer
analyzer = DatasetAnalyzer(df)

# Run analysis
summary = analyzer.analyze()

# Access results
print(f"Critical issues: {summary['critical_count']}")
print(f"Warnings: {summary['warning_count']}")

# Iterate through issues
for issue in summary['issues']:
    print(f"{issue['severity']}: {issue['description']}")
```

### Target Column and Selected Checks

```python
# Specify target for ML-specific checks
analyzer = DatasetAnalyzer(
    df,
    target_col='target_column'
)
summary = analyzer.analyze()

# Only run specific checks
analyzer = DatasetAnalyzer(
    df,
    selected_checks=['outliers', 'high_missing_values', 'class_imbalance']
)
summary = analyzer.analyze()
```

### Plots

```python
# Generate analysis with plots
analyzer = DatasetAnalyzer(df, include_plots=True)
summary = analyzer.analyze()

# Plots are stored in summary['summaries']['plots']
```

### Dataset Comparison

```python
# Compare two datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

analyzer = DatasetAnalyzer(
    train_df,
    comparison_df=test_df,
    selected_checks=['dataset_drift']
)
summary = analyzer.analyze()
```

### Reports

```python
from hashprep.reports import generate_report

# Analyze dataset
analyzer = DatasetAnalyzer(df, include_plots=True)
summary = analyzer.analyze()

# Generate HTML report
generate_report(
    summary,
    format='html',
    full=True,
    output_file='report.html',
    theme='minimal'
)

# Generate PDF report
generate_report(
    summary,
    format='pdf',
    full=True,
    output_file='report.pdf'
)

# Generate JSON report
generate_report(
    summary,
    format='json',
    full=True,
    output_file='report.json'
)

# Generate Markdown report
generate_report(
    summary,
    format='md',
    full=True,
    output_file='report.md'
)
```

### Fix Scripts and Pipelines

```python
from hashprep.checks.core import Issue
from hashprep.preparers.codegen import CodeGenerator
from hashprep.preparers.pipeline_builder import PipelineBuilder
from hashprep.preparers.suggestions import SuggestionProvider

# After running analysis
analyzer = DatasetAnalyzer(df, target_col='target')
summary = analyzer.analyze()

# Convert issues to proper format
issues = [Issue(**i) for i in summary['issues']]
column_types = summary.get('column_types', {})

# Get suggestions
provider = SuggestionProvider(
    issues=issues,
    column_types=column_types,
    target_col='target'
)
suggestions = provider.get_suggestions()

# Generate pandas fix script
codegen = CodeGenerator(suggestions)
fixes_code = codegen.generate_pandas_script()
with open('fixes.py', 'w') as f:
    f.write(fixes_code)

# Generate sklearn pipeline
builder = PipelineBuilder(suggestions)
pipeline_code = builder.generate_pipeline_code()
with open('pipeline.py', 'w') as f:
    f.write(pipeline_code)
```

### Sampling Large Datasets

```python
from hashprep.utils.sampling import SamplingConfig

# Configure sampling for large datasets
sampling_config = SamplingConfig(max_rows=10000)

analyzer = DatasetAnalyzer(
    df,
    sampling_config=sampling_config,
    auto_sample=True
)
summary = analyzer.analyze()

# Check if sampling occurred
if 'sampling_info' in summary:
    info = summary['sampling_info']
    print(f"Sampled: {info['sample_fraction']*100:.1f}%")
```

## License

This project is licensed under the MIT License.
## Contributing

We welcome contributions from the community to make HashPrep better!
Before you get started, please:
- Review our CONTRIBUTING.md for detailed guidelines and setup instructions
- Write clean, well-documented code
- Follow best practices for the stack or component you’re working on
- Open a pull request (PR) with a clear description of your changes and motivation