A full end-to-end analysis and prediction pipeline for the University of Wisconsin Corn Breeding Program.

JohnSearl007/BreedStream_App

BreedStream Shiny App

John Searl 2025-09-25

Introduction

This Shiny App was developed as a full end-to-end analysis and prediction pipeline for the University of Wisconsin - Madison Corn Breeding Programs led by Drs. Shawn Kaeppler and Natalia de Leon. It extends the functionality and ease of use of the underlying custom R package BreedStream. The functions in the BreedStream package are largely built on the foundations of the StageWise and COMA R packages, both of which were developed by Dr. Jeff Endelman at the University of Wisconsin - Madison and can be found on his GitHub. Another source of inspiration was SimpleMating, developed by the Resende Lab at the University of Florida.

Major emphasis was placed on ensuring that large datasets can be processed with limited computing resources. During development, 2545 Stiff Stalk (SS) and 1509 Non-Stiff Stalk (NSS) families (2135 and 1365 of which, respectively, were trialed in 5 environments) were used to confirm that RAM limits were avoided while computational performance was retained. A full half diallel of ~3.9 million hybrids between the SS and NSS families, along with the ~3.2 million SS and ~1.3 million NSS potential breeding populations, was successfully evaluated on a Dell XPS 15 9520 laptop with a 12th Gen Intel(R) Core(TM) i9-12900HK and 64 GB of 4800 MT/s DDR5 memory.

This pipeline:

  1. Allows the user to visually clean raw phenotypic files for erroneous plot values.
  2. Formats the cleaned phenotypic data in a standardized manner. Removes border/filler plots. Calculates yield from plot combine weight and moisture readings (adjustable for different crops and plot sizes).
  3. Generates diagnostic plots for quality control measures on the first stage model before the user proceeds further in analysis.
  4. Allows the user to generate in silico genotypes.
  5. Fits a Two-Stage model with many options, including allowing the user to utilize an optimized H Matrix based on AIC values.
  6. Supports the usage of multi-trait and restricted index selection indices.
  7. Allows for the calculation of genotypic values from the fitted Two-Stage model without the need for genotype or pedigree data.
  8. Estimates additive and optionally dominant marker effects.
  9. Estimates a usefulness criterion for use in the development of next cycle breeding crosses.
  10. Enables deployment of Optimal Mate Allocation for the selection of next cycle breeding crosses.
  11. Performs RR-BLUP predictions on tested and untested hybrids.

System Dependencies for the Shiny App

This Shiny App relies on Unix-like shell commands for file processing (e.g., combining CSVs). It runs natively on Linux and macOS. On Windows, you’ll need a compatibility layer (see below).

Ubuntu/Debian Linux

Most dependencies are pre-installed, but run this to ensure everything is available:

sudo apt update
sudo apt install bash coreutils diffutils xdg-utils

Verification: Open a terminal and run diff --version and xdg-open --version. If they output version info, you’re good.

macOS

macOS has most tools pre-installed, but use Homebrew for any missing ones. First, install Homebrew if you don’t have it:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then run:

brew install coreutils diffutils xdg-utils

Notes:

  • Add Homebrew to your PATH if needed (add export PATH="/opt/homebrew/bin:$PATH" to ~/.zshrc and restart Terminal).
  • macOS’s native open command can often substitute for xdg-open, but the app uses the latter for cross-platform consistency.

Verification: Open Terminal and run gdiff --version (note the g prefix for Homebrew’s GNU diffutils) and xdg-open --version.

Windows

The app’s shell commands aren’t native to Windows. Install a Unix-like environment:

  • Recommended: Git for Windows (includes Git Bash with all needed tools).
    1. Download from git-scm.com.
    2. Install with defaults (ensure “Add Git to PATH” is selected).
    3. Restart R/RStudio.
  • Alternative: Windows Subsystem for Linux (WSL).
    1. Enable via Windows Settings > Apps > Optional Features > “Windows Subsystem for Linux”.
    2. Install Ubuntu from Microsoft Store.
    3. Open Ubuntu terminal and run Ubuntu’s commands above.
    4. Run the app from within WSL (install R there if needed).

Verification: Open Git Bash (or WSL terminal) and run diff --version.

General Troubleshooting

  • Command not found? Check your PATH: Run echo $PATH in the terminal. Restart your terminal/R/RStudio after installs.
  • Permission errors? On Linux, run apt commands with sudo. On macOS, do not run brew with sudo; Homebrew refuses to run as root.
  • R Dependencies: Beyond the requirement for ASReml R (a commercially licensed package for advanced mixed model analysis, available from VSN International), ensure these R packages are installed: shiny, data.table, dplyr, bigmemory, BreedStream, StageWise, ibdsim2, shinyWidgets, shinyFiles. In R, run:
# Install from CRAN
install.packages(c("shiny", "data.table", "dplyr", "bigmemory", "ibdsim2", "shinyWidgets", "shinyFiles"))

# Install from GitHub
install.packages("devtools")
devtools::install_github("jendelman/StageWise", build_vignettes=FALSE)
devtools::install_github("JohnSearl007/BreedStream", build_vignettes=FALSE)
  • If issues persist, check the app’s console output for specific command errors.
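
The checks above can also be run from inside R, which is where the app looks for these tools. The following is a minimal sketch, assuming the command names used in this README (bash, diff, xdg-open) and assuming that ASReml R is installed under the package name asreml; adjust both lists for your platform.

```r
# Shell commands this README lists as system dependencies (assumed names).
required_cmds <- c("bash", "diff", "xdg-open")
cmd_paths <- Sys.which(required_cmds)            # "" means not found on PATH
missing_cmds <- required_cmds[cmd_paths == ""]

# R packages listed under "R Dependencies"; "asreml" is an assumed package name.
required_pkgs <- c("shiny", "data.table", "dplyr", "bigmemory", "BreedStream",
                   "StageWise", "ibdsim2", "shinyWidgets", "shinyFiles", "asreml")
# system.file() returns "" when a package is not installed (nothing is loaded).
is_installed <- vapply(required_pkgs,
                       function(p) nzchar(system.file(package = p)),
                       logical(1))
missing_pkgs <- required_pkgs[!is_installed]

cat("Missing commands:", if (length(missing_cmds)) missing_cmds else "none", "\n")
cat("Missing packages:", if (length(missing_pkgs)) missing_pkgs else "none", "\n")
```

Anything reported as missing can then be installed with the commands given earlier in this section.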

Installing and Running the App

To get started with the BreedStream Shiny App, follow these steps:

  1. Install R (version 4.0 or later recommended) from cran.r-project.org.
  2. Install RStudio (optional but recommended for development) from posit.co.
  3. Install the required R packages as detailed under “R Dependencies” in General Troubleshooting above.
  4. Download the Shiny App repository from GitHub and update the file paths in the main app script (app.r) to correctly point to where you have placed the individual panel files (e.g., Cleaning.R).
  5. Open the main app script (e.g., app.R) in RStudio and click “Run App,” or run shiny::runApp() in the R console from the app directory.

Note: Ensure your working directory has write permissions for output files. The app will prompt you to select a directory for saving outputs.
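
As a rough illustration of the launch step and the write-permission note, the sketch below checks that an output directory is writable before starting the app. The helper can_write() and the path in app_dir are hypothetical; only shiny::runApp() comes from this README.

```r
# Hypothetical helper: TRUE when `dir` exists and the current user can write to it.
can_write <- function(dir) {
  dir.exists(dir) && file.access(dir, mode = 2) == 0
}

app_dir <- "path/to/BreedStream_App"   # placeholder: where you downloaded the repository
out_dir <- tempdir()                   # placeholder: the directory you plan to select in the app

if (!can_write(out_dir)) {
  stop("Output directory is missing or not writable: ", out_dir)
}

# Launch only in an interactive session and only if the app folder exists.
if (interactive() && dir.exists(app_dir)) {
  shiny::runApp(app_dir)
}
```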

Getting Started

Upon launching the app, select a file path (working directory) where you would like output files to be saved. This directory will be used throughout the session for storing results like cleaned data, models, and predictions.

Navigate through the tabs on the top to access each step of the pipeline. It’s recommended to proceed sequentially, as later steps often depend on outputs from earlier ones (e.g., formatted phenotypic data is needed for modeling).

Clean & Format

This section handles the initial processing of raw phenotypic data. It consists of three sub-panels: Clean, Format, and Quality Check.

Clean

Purpose: Upload and clean raw phenotypic data for a single environment by visually identifying and removing outliers using histograms and slider adjustments. This helps remove erroneous plot values that could skew downstream analyses.

Step-by-Step Guide:

  1. Upload Phenotypic Data: Select a CSV file containing your raw phenotypic data (such as the output file from a Harvest Master equipped plot combine running the Mirus software) for one environment. The file should have numeric columns for traits.
  2. Enter Location Name: Provide a name for this environment (e.g., “WM_24”). This will be used in the output file name.
  3. Select Traits to Clean: Once uploaded, a checkbox list of numeric columns (traits) appears. Choose which traits to clean.
  4. Clean Each Trait: For each selected trait:
    • A histogram of the trait values is displayed.
    • Use the “Lower bound” and “Upper bound” sliders to trim outliers. The histogram updates in real-time to show only data within the bounds.
    • Click “Confirm bounds” to apply and move to the next trait.
  5. Manual NA Assignment (Optional): After cleaning all traits:
    • Optionally upload a CSV with columns “Id 1” and “trait” to set specific values to NA.
    • Or optionally enter manually in the text area (one per line: “Id 1,trait”).
    • Click “Apply NA” to update.
  6. Select Columns to Retain: Choose which columns from the cleaned data to keep (defaults include key ones like “Date/Time”, “Range”, “Row”, “Id 1”, and selected traits).
  7. Confirm and Save: Click “Confirm Columns”, then “Save Cleaned Data”. The file is saved as “[Location]_cleaned_data.csv” in your working directory.

Outputs: A cleaned CSV file per environment, with outliers set to NA and unnecessary columns removed.

Tips: Process one environment at a time. Repeat for multiple environments before moving to the Format panel.
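
The effect of the bound sliders and the manual NA step can be pictured with a few lines of base R. This is only a sketch of the idea, not the app's code: trim_to_na() is a hypothetical helper, and the plot IDs and weights are made-up values.

```r
# Hypothetical helper: set values outside [lower, upper] to NA, leaving existing NAs alone.
trim_to_na <- function(x, lower, upper) {
  x[!is.na(x) & (x < lower | x > upper)] <- NA
  x
}

# Made-up plot data; "Id 1" mirrors the column name used in the Clean panel.
plots <- data.frame(`Id 1` = c("A", "B", "C", "D"),
                    Weight = c(21.4, 0.3, 19.8, 95.0),
                    check.names = FALSE)

# Equivalent of confirming lower bound = 5 and upper bound = 40 for Weight.
plots$Weight <- trim_to_na(plots$Weight, lower = 5, upper = 40)

# Equivalent of a manual NA entry "C,Weight".
plots$Weight[plots$`Id 1` == "C"] <- NA

print(plots)
```

Only plot A keeps its weight here: B and D fall outside the bounds and C is removed by hand.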

Format

Purpose: Combine cleaned data from multiple environments, standardize formatting, calculate yield, apply nesting if needed, and filter out border/filler plots. This prepares a unified “Pheno.csv” for modeling.

Step-by-Step Guide:

  1. Enter Environment Names: Provide comma-separated names matching your cleaned files (e.g., “WM_24,ARL_24”).
  2. Processing Parameters:
    • Fixed Terms: Comma-separated terms to treat as fixed effects (default: “id_1”).
    • Random Terms: Comma-separated terms to treat as random effects (required; default: “rep,range,row”).
    • Moisture (%), Bushel Weight (lb), Plot Length (ft), Row Spacing (in), Number of Plot Rows: Values for yield calculation from combine data (defaults for corn; adjust for other crops).
    • Nest Terms: Semicolon-separated groups of terms to nest (e.g., “term1,term2;term3,term4”). This creates nested columns like “nested_rep” to distinguish reps across environments.
    • Nest Names: Comma-separated names for the nested columns.
  3. Plot Filtering: Enter comma-separated plot designations to set to NA (default: “B” for borders).
  4. File Uploads: For each environment:
    • Upload “Design File” (experimental design CSV).
    • Upload “Clean File” (from Clean panel).
  5. Format Data: Click “Format Data”. The app processes each environment using the harvest.master function, combines them, renames columns (e.g., “id_1” to “id”), filters NA plots, and orders by env, range, row.
  6. Preview and Save: A table preview shows the head of the combined data. Click “Save Formatted Data” to output “Pheno.csv”.

Outputs: “Pheno.csv” – a standardized phenotypic dataset ready for quality checks and modeling.

Tips: Ensure cleaned files match environment names. Yield is calculated as adjusted weight based on moisture and plot dimensions.
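
To make the yield parameters concrete, here is a sketch of a conventional bushels-per-acre calculation from plot weight and moisture. It assumes that the “Moisture (%)” parameter is the standard moisture to which weights are adjusted; the formula used inside harvest.master is not shown in this README and may differ. The function name yield_bu_ac() and the example values are hypothetical.

```r
# Hypothetical sketch: plot weight (lb) and grain moisture (%) to yield in bu/ac.
yield_bu_ac <- function(weight_lb, moisture_pct, std_moisture_pct,
                        bushel_weight_lb, plot_length_ft,
                        row_spacing_in, n_rows) {
  # Adjust harvested weight to the standard moisture content.
  adj_weight <- weight_lb * (100 - moisture_pct) / (100 - std_moisture_pct)
  # Harvested area in acres (43,560 square feet per acre).
  area_ac <- plot_length_ft * (row_spacing_in / 12) * n_rows / 43560
  adj_weight / bushel_weight_lb / area_ac
}

# Made-up example: a 2-row, 20 ft plot on 30 in rows, 20 lb of grain at 20% moisture,
# adjusted to 15.5% moisture with a 56 lb bushel.
y <- yield_bu_ac(weight_lb = 20, moisture_pct = 20, std_moisture_pct = 15.5,
                 bushel_weight_lb = 56, plot_length_ft = 20,
                 row_spacing_in = 30, n_rows = 2)
print(round(y, 1))
```

Changing the bushel weight, standard moisture, or plot dimensions is how the same calculation would be adapted to other crops and plot sizes.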

Quality Check

Purpose: Fit a first-stage model using SpATS to generate diagnostic plots, evaluate residuals per plot/environment, and compute heritability. This serves as a quality control step before full analysis.

Step-by-Step Guide:

  1. Upload Formatted Phenotypic Data: Select “Pheno.csv” from the Format panel.
  2. Model Specifications:
    • Select Traits: Choose traits to analyze (from numeric columns, excluding non-trait like “date_time”).
    • Fixed Effects: Select from available columns.
    • Random Effects: Select from available columns.
  3. Run Analysis: Click “Run Analysis”. The app uses Stage1_plots to fit models per trait, generating boxplots, QQ plots, heritability plots, and spatial residual plots (per environment if applicable).
  4. Review Outputs: Plots are displayed below, including:
    • Boxplot of raw data.
    • QQ plot of residuals.
    • Heritability estimates.
    • Spatial residual plots for each environment.
  5. If satisfied with diagnostics (e.g., normal residuals, no patterns), proceed; otherwise, revisit Clean/Format.

Outputs: Plots are written to temporary PNG files and displayed in the app. They are not saved permanently but can be regenerated by rerunning the analysis.

Tips: The model automatically includes environment, range, and row if present. High heritability and normal residuals indicate good data quality. Use this to catch issues early.

in silico Hybrids

Purpose: Generate in silico hybrid genotypes from parental data. This represents hybrids for prediction without additional genotyping.

Step-by-Step Guide:

  1. Enter Hybrid Set Names: Comma-separated names (e.g., “SS,NSS”) for different hybrid sets.
  2. Upload Files per Set: For each set:
    • Female Parent Genotype Data (CSV).
    • Male Parent Genotype Data (CSV).
  3. Generate Hybrids: Click “Generate Hybrids”. The app uses the insilico function to create hybrids, checks metadata consistency (marker, chrom, position), and combines if multiple sets.
  4. Save: Click “Save Hybrids” to output “insilicohybrids.csv”.

Outputs: “insilicohybrids.csv” – combined hybrid genotypes.

Tips: Ensure parental CSVs have matching metadata. For single sets, it’s a direct copy; for multiple, columns are pasted together.
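
The idea behind in silico hybrids can be sketched in base R as follows. This is not the insilico function itself: it assumes fully inbred parents with allele dosages coded 0 or 2, so that a hybrid's dosage is the average of its two parents, and it assumes a layout of three metadata columns (marker, chrom, position) followed by one column per line. All marker data below are made up.

```r
meta <- c("marker", "chrom", "position")

# Made-up parental genotypes (dosage 0/2 for inbred lines).
female <- data.frame(marker = c("m1", "m2", "m3"), chrom = c(1, 1, 2),
                     position = c(100, 200, 50),
                     SS1 = c(0, 2, 2), SS2 = c(2, 2, 0))
male <- data.frame(marker = c("m1", "m2", "m3"), chrom = c(1, 1, 2),
                   position = c(100, 200, 50),
                   NSS1 = c(2, 2, 0))

# Metadata must agree before any genotypes are combined.
stopifnot(identical(female[meta], male[meta]))

# Every female x male combination.
crosses <- expand.grid(f = setdiff(names(female), meta),
                       m = setdiff(names(male), meta),
                       stringsAsFactors = FALSE)

# Hybrid dosage = mean of the two inbred parents (assumption stated above).
dosage <- mapply(function(f, m) (female[[f]] + male[[m]]) / 2,
                 crosses$f, crosses$m)
colnames(dosage) <- paste(crosses$f, crosses$m, sep = "/")

hybrids <- cbind(female[meta], dosage)
print(hybrids)
```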

Analysis

This section fits models and derives key items such as marker effects.

Model

Purpose: Fit a two-stage mixed model (using ASReml via BreedStream/StageWise functions). Optionally optimize an H-matrix (blended G and A matrices) for improved accuracy with genomic and pedigree data.

Step-by-Step Guide:

  1. Upload Data:
    • Phenotypic Data (“Pheno.csv” from Format).
    • Optional: Genotypic Data (CSV) and Pedigree Data (CSV) for H-matrix or marker effects.
  2. Model Specifications:
    • Select Traits: Multiple traits for multi-trait models.
    • Fixed Effects: From phenotypic columns.
    • Random Effects: From phenotypic columns (range/row must be specified here, unlike Quality Check).
  3. H-Matrix Optimization (if geno/ped provided):
    • Ploidy, max iterations, blending bounds (lower near 0 for minimal blending).
    • Include dominance/map, workspace sizes, minor allele threshold.
    • Optional: Fixed effect markers, covariates, mask data (CSV to exclude specific id/env/trait), method (MME/Vinv), non-additive effects, tolerance.
  4. Run Analysis: Click “Run Two-Stage Analysis”. Uses H.Matrix_Optimized to fit models and find optimal blending based on AIC.
  5. Save: Click “Save Results” to output “two_stage_results.rds”.

Outputs: “two_stage_results.rds” – fitted model object for downstream use.

Tips: For no H-matrix, omit geno/ped or set both the lower and upper blending parameters to 1e-5. Environment is automatically applied to the model.
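
For readers unfamiliar with blending, the sketch below shows the usual form of an H matrix as a weighted combination of the genomic (G) and pedigree (A) relationship matrices, H = (1 - w)G + wA. It assumes this convention for the blending parameter w; the search over w by AIC that H.Matrix_Optimized performs is not reproduced, and the matrices are made-up 2 x 2 examples.

```r
# Blend genomic (G) and pedigree (A) relationships with weight w on A (assumed convention).
blend_H <- function(G, A, w) {
  stopifnot(identical(dim(G), dim(A)), w >= 0, w <= 1)
  (1 - w) * G + w * A
}

# Made-up relationship matrices for two individuals.
G <- matrix(c(1.0, 0.2,
              0.2, 1.0), nrow = 2, byrow = TRUE)
A <- matrix(c(1.0, 0.5,
              0.5, 1.0), nrow = 2, byrow = TRUE)

H_min  <- blend_H(G, A, w = 1e-5)   # essentially G, as in the "no H-matrix" tip
H_half <- blend_H(G, A, w = 0.5)    # equal weight on G and A
print(H_half)
```

With w fixed at 1e-5 the blended matrix is numerically almost identical to G, which is why setting both bounds to that value effectively disables blending.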

Index

Purpose: Calculate coefficients for a multi-trait selection index, with optional restrictions. Skip for single-trait.

Step-by-Step Guide:

  1. Upload Two Stage Results: “.rds” from Model panel.
  2. Restricted Index: Yes/No (default Yes).
  3. Trait Inputs: For each trait in the model:
    • Coefficient (default 1; economic/merit weight).
    • If restricted: Sign (>, =, <) for constraints.
  4. Calculate: Click “Calculate Index Coefficients”. Uses gain from StageWise.
  5. Preview and Save: Table shows coefficients. Click “Save Index Coefficients” for “index_coeff.rds”.

Outputs: “index_coeff.rds” – index coefficients.

Tips: Model must be multi-trait. Use positive coefficients for desired gains.

Values

Purpose: Compute genotypic values (GV) from the fitted model, optionally using index coefficients.

Step-by-Step Guide:

  1. Upload Files:
    • Two Stage Results “.rds”.
    • Optional: Multi-trait Index Coefficients “.rds”.
  2. Calculate: Click “Calculate Genetic Values”. Uses blup with what=“GV”.
  3. Preview and Save: Table shows head of values. Click “Save Results” for “GV_values.rds” and “GV_values.csv”.

Outputs: RDS and CSV of genotypic values.

Tips: No geno/ped needed here; based on model BLUPs.
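
When index coefficients are supplied, the result can be thought of as a weighted sum of per-trait genotypic values. The sketch below illustrates only that idea, assuming a plain linear index; any trait scaling applied by StageWise is not reproduced, and the hybrid IDs, trait values, and coefficients are made up.

```r
# Made-up genotypic values for two traits.
gv <- data.frame(id = c("H1", "H2", "H3"),
                 yield = c(1.2, -0.4, 0.3),
                 moisture = c(0.5, -0.2, -0.6))

# Made-up index coefficients, named by trait.
coeff <- c(yield = 1, moisture = -0.5)

# Index value = sum over traits of coefficient x genotypic value.
gv$index <- as.vector(as.matrix(gv[names(coeff)]) %*% coeff)

# Rank candidates from best to worst on the index.
print(gv[order(gv$index, decreasing = TRUE), ])
```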

Marker Effects

Purpose: Estimate additive (and optionally dominant) marker effects from the model.

Step-by-Step Guide:

  1. Upload Files:
    • Genotype Data (CSV).
    • Optimal Weight File (TXT; from H-matrix optimization).
    • Pedigree File (CSV).
    • Two Stage Results “.rds”.
    • Optional: Multi-trait Index Coefficients “.rds”.
  2. Parameters: Ploidy, map included, minor allele threshold, include dominance.
  3. Calculate: Click “Calculate Marker Effects”. Uses read_geno and blup (AM/DM).
  4. Preview and Save: Table shows additive effects head. Click “Save All Results” for “marker_effects.rds”, “add_effects.csv”, and optionally “dom_effects.csv”.

Outputs: RDS of full results, CSVs of effects.

Tips: Requires geno/ped from Model step.

Breeding Crosses

This section uses marker effects to simulate breeding outcomes and optimize crosses.

Usefulness Criterion

Purpose: Simulate recombination between markers to calculate a usefulness criterion (expected performance of superior progeny), used as merit for Optimal Mate Allocation. Requires chromosome maps for genetic positions.

Step-by-Step Guide:

  1. Number of Chromosomes: Enter the number (e.g., 10 for maize).
  2. Upload Chromosome Maps: For each chromosome, upload a TXT file with physical and genetic positions (cM). Use resources like MaizeGDB for maize.
  3. Group Names: Comma-separated (e.g., “SS,NSS”).
  4. Upload Genotype Files per Group: CSV files for each group.
  5. Marker Effects: Upload “.rds” from Marker Effects panel.
  6. Parameters: Ploidy, map included, minor allele threshold, include dominance.
  7. Calculate: Click “Calculate Marker Effects”. Processes genotypes, builds relationship matrices, simulates recombination via ibdsim2, and computes UC using getUsefA from BreedStream (a more efficient implementation of SimpleMating::getUsefA).
  8. Preview and Save: Tables show head of UC per group. Click “Save All Results” for “UC.rds” and per-group “_usef_add.csv”.

Outputs: RDS of full data, CSVs of usefulness criteria.

Tips: Ensure maps are in correct format (use ConvertToMap). Dominance requires corresponding effects.
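
The usefulness criterion is commonly written as the progeny mean plus the selection intensity times the progeny genetic standard deviation. The sketch below shows that arithmetic only, assuming truncation selection on a normally distributed trait; the recombination-based variance that getUsefA derives is not reproduced, and the function names and cross values are hypothetical.

```r
# Selection intensity for truncation selection of the top fraction p (normal trait).
sel_intensity <- function(p) dnorm(qnorm(1 - p)) / p

# Usefulness criterion: expected mean of the selected fraction of progeny.
usefulness <- function(mu, sigma, p) mu + sel_intensity(p) * sigma

# Made-up crosses: X has the higher mean, Y has the larger progeny variance.
crosses <- data.frame(cross = c("X", "Y"),
                      mu    = c(10.0, 9.5),
                      sigma = c(0.5, 1.0))
crosses$UC <- usefulness(crosses$mu, crosses$sigma, p = 0.10)
print(crosses)
```

In this made-up case cross Y ranks first on UC despite its lower mean, which is the situation the criterion is designed to expose.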

Optimal Mate Allocation

Purpose: Perform Optimal Mate Allocation (OMA) using UC as merit, balancing inbreeding and gain for long-term progress in closed programs. Improves on Optimal Contribution Selection by returning specific crosses.

Step-by-Step Guide:

  1. Upload Usefulness Criterion: “.rds” from Usefulness Criterion panel.
  2. Group Names: Comma-separated, matching UC groups.
  3. dF Values: Comma-separated inbreeding rates (e.g., “0.005,0.01”).
  4. Parameters: Ploidy, dF.adapt step/max (for adaptive inbreeding control), CVXR solver (ECOS/SCS), base method (RM/current).
  5. Process: Click “Process Data”. Uses oma_optimized per group, filtering non-zero contributions.
  6. Preview and Save: Tables show non-zero matings per group. Click “Save All Results” for “OMA.rds” and per-group “_om.csv”.

Outputs: RDS of full OMA, CSVs of optimal matings.

Tips: dF controls inbreeding; smaller values preserve diversity. Use adaptive for fine-tuning.

RR-BLUP Prediction

This section generates predictions for hybrids using Ridge Regression BLUP.

Transpose

Purpose: Generate transposed in silico hybrid genotypes for efficient RR-BLUP predictions. Avoids direct transposition of large matrices by regenerating in transposed format.

Step-by-Step Guide:

  1. Enter Hybrid Set Names: Comma-separated (e.g., “SS,NSS”).
  2. Upload Files per Set: Female and Male Parent Genotype Data (CSV) for each.
  3. Generate Hybrids: Click “Generate Hybrids”. Uses insilico_transposed, combines sets (cat/tail for multiples).
  4. Save: Click “Save Hybrids” to output “insilico_transposedhybrids.csv”.

Outputs: “insilico_transposedhybrids.csv” – transposed hybrids.

Tips: Similar to in silico Hybrids but optimized for prediction computation.
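
Because hybrids are rows in the transposed layout, sets can be combined by stacking files and dropping the repeated header, which is what cat and tail do in the shell. The sketch below does the same thing in base R by streaming lines in chunks, so no file is ever fully loaded; it is an illustration rather than the app's code, and combine_csvs() and the demo files are hypothetical.

```r
# Hypothetical helper: stack CSV files row-wise, keeping only the first header.
combine_csvs <- function(files, out, chunk = 10000L) {
  con_out <- file(out, open = "w")
  on.exit(close(con_out))
  for (i in seq_along(files)) {
    con_in <- file(files[i], open = "r")
    header <- readLines(con_in, n = 1L)
    if (i == 1L) writeLines(header, con_out)    # header written once
    repeat {
      lines <- readLines(con_in, n = chunk)     # stream a bounded number of lines
      if (length(lines) == 0L) break
      writeLines(lines, con_out)
    }
    close(con_in)
  }
  invisible(out)
}

# Demo with two tiny made-up transposed files (one hybrid per row).
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("id,m1,m2", "SS1/NSS1,1,2"), f1)
writeLines(c("id,m1,m2", "SS2/NSS2,2,0"), f2)
out <- tempfile(fileext = ".csv")
combine_csvs(c(f1, f2), out)
print(readLines(out))
```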

Prediction

Purpose: Compute RR-BLUP predictions using transposed genotypes and marker effects (additive/dominant).

Step-by-Step Guide:

  1. Upload Files:
    • Transposed Genotype CSV (from Transpose).
    • Additive Marker Effects CSV (from Marker Effects).
    • Optional: Check “Include Dominance” and upload Dominant Marker Effects CSV.
  2. Parameters: Ploidy, batch size (for memory-efficient processing), minor allele threshold.
  3. Process: Click “Process Predictions”. Processes genotypes column-wise, computes predictions in batches, writes directly to CSV.
  4. Preview and Save: Table shows top 10 predictions (sorted descending). Automatically saves “HybridPredictions.csv”.

Outputs: “HybridPredictions.csv” – id and predicted values.

Tips: A smaller batch size lowers peak memory use on large datasets. Errors are logged when problems occur (e.g., invalid inputs).
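
The batching idea can be sketched as a loop that multiplies a block of hybrid genotypes by the marker effects at a time. This is a simplified illustration, not the app's implementation: it uses an in-memory matrix, additive effects only, and raw dosages without the centering or filtering a real analysis may apply. The function predict_in_batches() and all values are hypothetical.

```r
# Hypothetical helper: additive prediction = genotype rows x marker effects, in batches.
predict_in_batches <- function(geno, effects, batch_size) {
  n <- nrow(geno)
  pred <- numeric(n)
  for (start in seq(1, n, by = batch_size)) {
    idx <- start:min(start + batch_size - 1, n)
    # Only this block of hybrids takes part in the matrix product.
    pred[idx] <- drop(geno[idx, , drop = FALSE] %*% effects)
  }
  pred
}

# Made-up transposed genotypes: hybrids in rows, markers in columns.
geno <- matrix(c(1, 2, 1,
                 2, 2, 0,
                 0, 1, 2), nrow = 3, byrow = TRUE,
               dimnames = list(c("H1", "H2", "H3"), c("m1", "m2", "m3")))
add_effects <- c(m1 = 0.5, m2 = -0.2, m3 = 0.1)   # made-up additive effects

pred <- predict_in_batches(geno, add_effects, batch_size = 2)
result <- data.frame(id = rownames(geno), prediction = pred)
print(result[order(result$prediction, decreasing = TRUE), ])
```

The batch size changes only how much data is processed at once, not the predictions themselves.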

Additional Notes

  1. Performance Tips: This Shiny App works around RAM limitations by writing intermediate results to disk and merging the individual files with terminal commands, so the files never have to be opened. For large datasets, it is still recommended to have ample RAM and to consider running on a server or high-performance machine.
  2. Additional Reading: This Shiny App drew heavy inspiration from:
    • Endelman, J. B. (2023). Fully efficient, two-stage analysis of multi-environment trials with directional dominance and multi-trait genomic selection. Theoretical and Applied Genetics, 136(4), 65.
    • Endelman, J. B. (2025). Genomic prediction of heterosis, inbreeding control, and mate allocation in outbred diploid and tetraploid populations. Genetics, 229(2), iyae193.
    • Peixoto, M. A., Amadeu, R. R., Bhering, L. L., Ferrão, L. F. V., Munoz, P. R., & Resende Jr, M. F. (2025). SimpleMating: R‐package for prediction and optimization of breeding crosses using genomic selection. The Plant Genome, 18(1), e20533.
  3. Citations: If using this app in research, please cite the BreedStream App GitHub Repository and underlying packages like StageWise, COMA, and SimpleMating.
