138 changes: 138 additions & 0 deletions README.md
@@ -46,3 +46,141 @@ If you use DiffUTE in your research or wish to refer to the baseline results pub
Please feel free to contact us if you have any problems.

Email: [hx.chen@hotmail.com](mailto:hx.chen@hotmail.com) or [zhuoerxu.xzr@antgroup.com](mailto:zhuoerxu.xzr@antgroup.com)

# DiffUTE Training Scripts V2

This repository contains updated training scripts for the DiffUTE (Diffusion Universal Text Editor) model. The scripts have been modernized with improved data handling, better code organization, and MinIO integration for efficient data storage.

## Key Changes

1. Replaced `pcache_fileio` with MinIO for data handling
2. Removed `alps` dependencies
3. Improved code organization and readability
4. Enhanced error handling and logging
5. Better type hints and documentation
6. Modernized training loops

## Requirements

Install the required packages:

```bash
pip install -r requirements.txt
```

## Directory Structure

```
.
├── README.md
├── requirements.txt
├── train_vae_v2.py
├── train_diffute_v2.py
└── utils/
└── minio_utils.py
```

## Training Scripts

### VAE Training

Train the VAE component using:

```bash
python train_vae_v2.py \
--pretrained_model_name_or_path "path/to/model" \
--output_dir "vae-fine-tuned" \
--data_path "path/to/data.csv" \
--resolution 512 \
--train_batch_size 16 \
--num_train_epochs 100 \
--learning_rate 1e-4 \
--minio_endpoint "your-minio-endpoint" \
--minio_access_key "your-access-key" \
--minio_secret_key "your-secret-key" \
--minio_bucket "your-bucket-name"
```

### DiffUTE Training

Train the complete DiffUTE model using:

```bash
python train_diffute_v2.py \
--pretrained_model_name_or_path "path/to/model" \
--output_dir "diffute-fine-tuned" \
--data_path "path/to/data.csv" \
--resolution 512 \
--train_batch_size 16 \
--num_train_epochs 100 \
--learning_rate 1e-4 \
--guidance_scale 0.8 \
--minio_endpoint "your-minio-endpoint" \
--minio_access_key "your-access-key" \
--minio_secret_key "your-secret-key" \
--minio_bucket "your-bucket-name"
```

## Data Format

The training data should be specified in a CSV file with the following columns:

For VAE training:
- `path`: Path to the image file in MinIO storage

For DiffUTE training:
- `image_path`: Path to the image file in MinIO storage
- `ocr_path`: Path to the OCR results JSON file in MinIO storage
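
For example, a DiffUTE training CSV might look like the following (the object paths are illustrative; the VAE CSV is the same idea with a single `path` column):

```
image_path,ocr_path
images/train_0001.png,ocr/train_0001.json
images/train_0002.png,ocr/train_0002.json
```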

## MinIO Setup

1. Install and configure MinIO server
2. Create a bucket for storing training data
3. Upload your training images and OCR results
4. Configure access credentials in the training scripts
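
As a rough sketch of how the scripts can fetch a training image with the official `minio` client (the helper name `load_image_from_minio` and the bucket layout are illustrative, not the actual `utils/minio_utils.py` API):

```python
import io

from minio import Minio
from PIL import Image

def load_image_from_minio(client: Minio, bucket: str, object_path: str) -> Image.Image:
    """Fetch an object from MinIO and decode it as an RGB PIL image."""
    response = client.get_object(bucket, object_path)
    try:
        data = response.read()
    finally:
        response.close()
        response.release_conn()
    return Image.open(io.BytesIO(data)).convert("RGB")

client = Minio(
    "your-minio-endpoint",        # host:port, no scheme
    access_key="your-access-key",
    secret_key="your-secret-key",
    secure=False,                 # assumption: plain HTTP; set True behind TLS
)
image = load_image_from_minio(client, "your-bucket-name", "images/sample_0001.png")
```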

## Model Architecture

The DiffUTE model consists of three main components:

1. VAE (Variational AutoEncoder):
   - Handles image encoding/decoding
   - Pre-trained and frozen during DiffUTE training
   - Reduces computational complexity by working in latent space

2. UNet:
   - Main trainable component
   - Performs denoising in latent space
   - Conditioned on text embeddings
   - Takes concatenated input of noisy latents, mask, and masked image

3. TrOCR:
   - Pre-trained text recognition model
   - Provides text embeddings for conditioning
   - Frozen during training
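
The conditioning layout can be illustrated with a shape-level sketch (dimensions are assumptions for a 512x512 input and a standard 4-channel VAE; the TrOCR embedding width in particular may differ in practice):

```python
import torch

noisy_latents = torch.randn(1, 4, 64, 64)         # latents after noise is added
mask = torch.zeros(1, 1, 64, 64)                  # text-region mask, downsampled to latent size
masked_image_latents = torch.randn(1, 4, 64, 64)  # VAE latents of the masked image
text_embeds = torch.randn(1, 77, 768)             # TrOCR embeddings used as cross-attention context

# 4 + 1 + 4 = 9 input channels, matching an inpainting-style UNet
unet_input = torch.cat([noisy_latents, mask, masked_image_latents], dim=1)
print(unet_input.shape)  # torch.Size([1, 9, 64, 64])

# The denoising call would then look like:
# noise_pred = unet(unet_input, timestep, encoder_hidden_states=text_embeds).sample
```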

## Training Process

1. Data Preparation:
   - Images are loaded from MinIO storage
   - OCR results are used to identify text regions
   - Images are preprocessed and normalized

2. Training Loop:
   - VAE encodes images to latent space
   - Random noise is added according to the diffusion schedule
   - UNet predicts noise or velocity
   - Loss is calculated and the model is updated
   - Checkpoints are saved periodically
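
A condensed sketch of one such step using the `diffusers` API (variable names and the epsilon-prediction objective are assumptions; a v-prediction model would compare against `noise_scheduler.get_velocity(...)` instead):

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

# Scheduler loaded from the same pretrained pipeline as the UNet and VAE
noise_scheduler = DDPMScheduler.from_pretrained("path/to/model", subfolder="scheduler")

def training_step(vae, unet, batch, text_embeds):
    """One illustrative denoising step (epsilon-prediction objective)."""
    # Encode images into scaled latent space
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

    # Sample noise and a random timestep, then corrupt the latents
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Concatenate the conditioning channels described above
    unet_input = torch.cat(
        [noisy_latents, batch["mask"], batch["masked_image_latents"]], dim=1
    )
    noise_pred = unet(unet_input, timesteps, encoder_hidden_states=text_embeds).sample
    return F.mse_loss(noise_pred, noise)
```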

## Error Handling

The scripts include robust error handling:
- Graceful handling of failed image loads
- Fallback mechanisms for missing data
- Detailed logging of errors
- Proper cleanup of resources
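
For instance, a failed image load might be handled roughly like this (a sketch reusing the hypothetical `load_image_from_minio` helper from the MinIO section; the blank-image fallback is an assumption, not the scripts' exact behavior):

```python
import logging

from PIL import Image

def safe_load_image(client, bucket, object_path, fallback_size=(512, 512)):
    """Return the image from MinIO, or a blank fallback if the load fails."""
    try:
        return load_image_from_minio(client, bucket, object_path)
    except Exception as exc:
        # Log the failure and keep training going with a placeholder sample
        logging.warning("Failed to load %s from bucket %s: %s", object_path, bucket, exc)
        return Image.new("RGB", fallback_size, "black")
```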

## Contributing

All rights belong to the original DiffUTE authors.
19 changes: 12 additions & 7 deletions requirements.txt
@@ -1,7 +1,12 @@
-accelerate>=0.16.0
-torchvision
-transformers>=4.25.1
-datasets
-ftfy
-tensorboard
-Jinja2
+torch>=2.0.0
+accelerate>=0.20.0
+transformers>=4.30.0
+diffusers>=0.15.0
+albumentations>=1.3.0
+opencv-python>=4.7.0
+pandas>=2.0.0
+numpy>=1.24.0
+Pillow>=9.5.0
+tqdm>=4.65.0
+minio>=7.1.0
+scikit-image>=0.20.0
168 changes: 168 additions & 0 deletions stable_diffusion_text_inpaint/README.md
@@ -0,0 +1,168 @@
# Using Stable Diffusion for Text Inpainting

This guide explains how to use Stable Diffusion's inpainting capability to add text to specific regions in an image. While not as specialized as DiffUTE for text editing, this approach can still achieve decent results.

## Requirements

```bash
pip install diffusers transformers torch
```

## Basic Implementation

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image, ImageDraw

def create_text_mask(image, text_box):
    """Create a binary mask for the text region

    Args:
        image: PIL Image
        text_box: tuple of (x1, y1, x2, y2) coordinates
    """
    mask = Image.new("RGB", image.size, "black")
    draw = ImageDraw.Draw(mask)
    draw.rectangle(text_box, fill="white")
    return mask

# Load the model
model_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Load your image
image = Image.open("your_image.png")

# Define the text region (x1, y1, x2, y2)
text_box = (100, 100, 300, 150)  # Example coordinates

# Create the mask
mask = create_text_mask(image, text_box)

# Generate the inpainting
prompt = "Clear black text saying 'Hello World' on a white background"
negative_prompt = "blurry, unclear text, multiple texts, watermark"

result = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=image,
    mask_image=mask,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
```

## Tips for Better Results

1. **Mask Preparation**:
   - Make the mask slightly larger than the text area
   - Use anti-aliasing on mask edges for smoother blending (see the sketch after this list)
   - Consider the text baseline and x-height in mask creation

2. **Prompt Engineering**:
   - Be specific about text style: "sharp, clear black text"
   - Mention text properties: "centered, serif font"
   - Include context: "text on a white background"

3. **Negative Prompts**:
   - "blurry, unclear text"
   - "multiple texts, overlapping text"
   - "watermark, artifacts"
   - "distorted, warped text"

4. **Parameter Tuning**:
   ```python
   # For clearer text
   result = pipe(
       prompt=prompt,
       negative_prompt=negative_prompt,
       image=image,
       mask_image=mask,
       num_inference_steps=50,  # More steps for better quality
       guidance_scale=7.5,      # Higher for more prompt adherence
       strength=0.8,            # Control how much to change
   ).images[0]
   ```
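
As noted under mask preparation, softened mask edges blend better. A minimal sketch of what this package's exported `create_antialiased_mask` helper might do (the padding and blur defaults here are assumptions):

```python
from PIL import ImageFilter

def create_antialiased_mask(image, text_box, padding=4, blur_radius=2):
    """Rectangle mask with feathered edges for smoother blending."""
    x1, y1, x2, y2 = text_box
    padded_box = (x1 - padding, y1 - padding, x2 + padding, y2 + padding)
    mask = create_text_mask(image, padded_box)
    # Gaussian blur softens the hard rectangle edge into a gradient
    return mask.filter(ImageFilter.GaussianBlur(blur_radius))
```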

## Advanced Usage

### 1. Style Matching

To match existing text styles in the image:

```python
def match_text_style(image, text_region):
    """Analyze existing text style in the image"""
    # Add OCR or style analysis here
    return "style_description"

style = match_text_style(image, text_region)
prompt = f"Text saying 'Hello World' in style: {style}"
```
> 🛠️ **Refactor suggestion** (review comment on lines +100 to +108)
>
> **Replace placeholder function with actual implementation.**
>
> The `match_text_style` function is defined but only contains a placeholder comment. Consider implementing this function with a concrete example that uses the `TextStyleAnalyzer` from the related `utils.style_utils` module.
>
> ```diff
>  def match_text_style(image, text_region):
>      """Analyze existing text style in the image"""
> -    # Add OCR or style analysis here
> -    return "style_description"
> +    from stable_diffusion_text_inpaint.utils.style_utils import TextStyleAnalyzer, generate_style_prompt
> +
> +    # Initialize style analyzer
> +    analyzer = TextStyleAnalyzer()
> +
> +    # Analyze the region
> +    style_props = analyzer.analyze_text_region(image, text_region)
> +
> +    # Generate a descriptive prompt
> +    return generate_style_prompt(style_props)
> ```
### 2. Context-Aware Masking

```python
def create_context_mask(image, text_box, padding=10):
    """Create a mask with context awareness"""
    x1, y1, x2, y2 = text_box
    # Clamp the padded box to the image bounds so coordinates stay valid
    padded_box = (
        max(x1 - padding, 0),
        max(y1 - padding, 0),
        min(x2 + padding, image.width),
        min(y2 + padding, image.height),
    )
    return create_text_mask(image, padded_box)
```

### 3. Multiple Attempts

```python
def generate_multiple_attempts(pipe, image, mask, prompt, num_attempts=3):
    """Generate multiple versions and pick the best"""
    results = []
    for _ in range(num_attempts):
        result = pipe(
            prompt=prompt,
            image=image,
            mask_image=mask,
            num_inference_steps=50,
        ).images[0]
        results.append(result)
    return results
```

## Limitations

1. Less precise text control compared to DiffUTE
2. May require multiple attempts to get desired results
3. Text style matching is less reliable
4. May introduce artifacts around text regions

## Best Practices

1. **Preparation**:
   - Clean the text region thoroughly
   - Create precise masks
   - Use high-resolution images

2. **Generation**:
   - Start with lower strength values
   - Generate multiple variations
   - Use detailed prompts

3. **Post-processing**:
   - Check text clarity and alignment
   - Verify style consistency
   - Touch up edges if needed

## When to Use DiffUTE Instead

Consider using DiffUTE when:
- Precise text style matching is crucial
- Multiple text regions need editing
- Text needs to perfectly match surrounding context
- Working with complex backgrounds
19 changes: 19 additions & 0 deletions stable_diffusion_text_inpaint/__init__.py
@@ -0,0 +1,19 @@
"""Text inpainting package using Stable Diffusion."""

from .text_inpainter import TextInpainter
from .utils.mask_utils import (
    create_text_mask,
    create_context_mask,
    create_antialiased_mask,
)
from .utils.style_utils import TextStyleAnalyzer, generate_style_prompt

__version__ = "0.1.0"
__all__ = [
    "TextInpainter",
    "create_text_mask",
    "create_context_mask",
    "create_antialiased_mask",
    "TextStyleAnalyzer",
    "generate_style_prompt",
]