sudo97/charlie

Charlie 🕵️

Charlie is a tool for analyzing and visualizing your git history. It is based on ideas from "Your Code as a Crime Scene" by Adam Tornhill, which is highly recommended reading.

Motivation

In "Your Code as a Crime Scene", Adam Tornhill presents powerful techniques for mining insights from version control systems to identify problematic code patterns, architectural issues, and team dynamics. The book demonstrates these concepts using Code Maat, a command-line tool that extracts and analyzes VCS data. While Code Maat is excellent for research and deep analysis, it requires exporting git logs to files and often involves additional Python scripts to generate visualizations from CSV outputs.

CodeScene, the commercial evolution of these ideas, provides beautiful visualizations and automated analysis through GitHub integration. While it's a powerful tool, some developers prefer a completely local solution without any external service dependencies.

Charlie bridges this gap by providing a tool similar to bundle-analyzer or dependency-cruiser - you can run it locally with a single command and immediately see visual results in your browser. No file exports, no online services, just instant insights into your codebase's behavioral patterns.

Building this tool (with a little help from my friends) has been the best way to truly understand the concepts from the book. As they say, you don't really know something until you can build it yourself.

Installation and usage

Install from npm

This will make the charlie command available globally. You may also omit -g if you want to install it locally.

npm install -g charlie-git

Install from source

After cloning the repository:

npm install
npm run build
npm pack
npm i -g

Usage

After installing, you can run the tool with:

charlie

Don't forget to cd into your project directory before running the tool.

Alternatively, you can run the tool with:

charlie /path/to/your/project

After running the tool, in the root of your project you should see a file called charlie-report.html. Open it in your browser to see the report.

Core Concepts

Hotspots

A hotspot is a file or module that is both frequently modified and highly complex. Hotspots are the most problematic areas of your codebase: they change often (indicating active development or bug fixes) and are complex (making them risky to modify), so they should be your top priority for refactoring.

In the visualization, circle size represents the complexity of the file, and color represents how frequently it changes (from gray for low frequency, through blue-ish, to red-ish for high frequency).

Coupling Analysis

The Coupling view combines two powerful metrics to help you identify architectural problems and find clusters of tightly related files:

Sum of Coupling (SOC)

SOC is a metric calculated per file that counts how many times the file appears in commits with other files (i.e., it's not alone in the commit). Every time a file is committed alongside other files, we assume it might be coupled with them. A high SOC score indicates a file that's frequently involved in multi-file changes, which could signal architectural problems.
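As a minimal sketch of the idea (not Charlie's actual source), SOC can be computed from a list of commits, where each commit is just the list of file paths it touched. The `ChangedFiles` type and the `sumOfCoupling` function are hypothetical names, and the sketch assumes the literal reading of the definition above: a file gains one point for every commit it shares with at least one other file.

```typescript
// Hypothetical shape: one commit = the list of file paths it touched.
type ChangedFiles = string[];

// Counts, per file, the commits in which it appeared together with other files.
function sumOfCoupling(commits: ChangedFiles[]): Map<string, number> {
  const soc = new Map<string, number>();
  for (const commit of commits) {
    // De-duplicate in case a path is listed twice in the same commit.
    const files = Array.from(new Set(commit));
    if (files.length < 2) continue; // single-file commits carry no coupling signal
    for (const file of files) {
      soc.set(file, (soc.get(file) ?? 0) + 1);
    }
  }
  return soc;
}

const socExample = sumOfCoupling([
  ["src/a.ts", "src/b.ts"],
  ["src/a.ts"],
  ["src/a.ts", "src/c.ts", "src/b.ts"],
]);
// socExample: src/a.ts -> 2, src/b.ts -> 2, src/c.ts -> 1
```

Other tools define SOC slightly differently (for example, adding the number of other files in the commit instead of 1), so treat the exact weighting here as an assumption.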

Coupled Pairs Integration

Each file in the coupling view can be expanded to reveal its coupled pairs - files that frequently appear together in the same commits. When two files are consistently modified together, it suggests they're more tightly coupled than your architecture might indicate. High coupling can lead to ripple effects where changes in one file require changes in another.
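To make the idea of coupled pairs concrete, here is a rough sketch that counts how often each pair of files shares a commit. The names (`CommitFiles`, `CoupledPair`, `coupledPairs`) are hypothetical, and the `couplingPercentage` formula (shared commits divided by the average number of revisions of the two files) is an assumption for illustration, not necessarily the formula Charlie uses.

```typescript
// Hypothetical shape: one commit = the list of file paths it touched.
type CommitFiles = string[];

interface CoupledPair {
  fileA: string;
  fileB: string;
  sharedCommits: number;
  // Assumed definition: shared commits / average revisions of both files * 100.
  couplingPercentage: number;
}

function coupledPairs(commits: CommitFiles[]): CoupledPair[] {
  const revisions = new Map<string, number>();
  const shared = new Map<string, number>();

  for (const commit of commits) {
    // Sort so that (a, b) and (b, a) always produce the same key.
    const files = Array.from(new Set(commit)).sort();
    for (const file of files) {
      revisions.set(file, (revisions.get(file) ?? 0) + 1);
    }
    // Every unordered pair inside the commit counts as one shared change.
    for (let i = 0; i < files.length; i++) {
      for (let j = i + 1; j < files.length; j++) {
        const key = JSON.stringify([files[i], files[j]]);
        shared.set(key, (shared.get(key) ?? 0) + 1);
      }
    }
  }

  const pairs: CoupledPair[] = [];
  shared.forEach((sharedCommits, key) => {
    const [fileA, fileB] = JSON.parse(key) as [string, string];
    const average = ((revisions.get(fileA) ?? 0) + (revisions.get(fileB) ?? 0)) / 2;
    pairs.push({
      fileA,
      fileB,
      sharedCommits,
      couplingPercentage: (sharedCommits * 100) / average,
    });
  });
  // Most frequently co-changed pairs first.
  return pairs.sort((x, y) => y.sharedCommits - x.sharedCommits);
}
```

Note that the number of pairs grows quadratically with the size of a commit, so very large commits (mass renames, formatting runs) can dominate the result unless they are filtered out.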

Finding Clusters

By expanding high-SOC files, you can identify clusters of tightly coupled files that might benefit from:

  • Being moved into the same module or package
  • Being refactored to reduce dependencies
  • Being split if they're doing too many things

When a file is both a hotspot AND has high SOC with many coupled pairs, it becomes a critical refactoring priority. The expandable view helps you understand not just that coupling exists, but exactly which files are involved in the coupling relationships.

The Power of Data Over Time

These metrics might sound overly simplistic at first glance, but when you collect data over months or a full year, powerful patterns emerge. Individual commits might seem random, but aggregate behavior reveals the true structure and pain points of your codebase. Data is king - it shows you what's actually happening, not what you think is happening.

Complexity Calculation

Charlie calculates complexity using a simple but effective approach: it adds 1 to the complexity score for each line of code in the file, and adds another point whenever a line has more leading whitespace than the previous line (indicating nested code blocks). This method is language-agnostic and works well for identifying complex areas across different codebases. As long as the formatting is consistent, this approach will work.
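The rule above is small enough to sketch in a few lines. This is an illustration rather than Charlie's actual code: the `indentationComplexity` name is hypothetical, and skipping blank lines is an assumption (the description does not say how they are treated).

```typescript
// One point per line of code, plus one point whenever a line is indented
// deeper than the previous line of code.
function indentationComplexity(source: string): number {
  let score = 0;
  let previousIndent = 0;
  for (const line of source.split(/\r?\n/)) {
    // Assumption: blank lines are not code and do not reset the indentation.
    if (line.trim() === "") continue;
    const indent = line.search(/\S/); // index of the first non-whitespace character
    score += 1;
    if (indent > previousIndent) score += 1; // nesting got deeper
    previousIndent = indent;
  }
  return score;
}

const sample = [
  "function f(x) {",
  "  if (x) {",
  "    return 1;",
  "  }",
  "  return 0;",
  "}",
].join("\n");
// 6 lines of code + 2 indentation increases = 8
const sampleScore = indentationComplexity(sample);
```

Because the sketch only compares the number of leading whitespace characters, files that mix tabs and spaces can produce misleading scores, which is why consistent formatting matters.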

While cyclomatic complexity might be more academically accurate, this nesting-based approach is sufficient for the behavioral analysis goals of this tool. For individual file analysis, I still recommend measuring cyclomatic complexity, but for understanding large-scale patterns and trends, this simpler metric serves us well.

.charlie.config.json

The .charlie.config.json file allows you to customize Charlie's analysis behavior. This file should be placed in the root of your repository (the same directory where you run the charlie command). Additional analysis options like coupling thresholds and percentile filters are available through the interactive frontend.

Configuration Fields

include (optional)

Type: string[] (array of regex patterns)
Default: [] (includes all files)

An array of regular expression patterns to specify which files should be included in the analysis. If this field is empty or not provided, all files are included by default.

{
  "include": ["^src/", "^lib/", "\\.ts$", "\\.js$"]
}

exclude (optional)

Type: string[] (array of regex patterns)
Default: [] (excludes no files)

An array of regular expression patterns to specify which files should be excluded from the analysis. These patterns are applied after the include patterns.

{
  "exclude": ["node_modules/", "\\.test\\.", "\\.spec\\.", "dist/", "build/"]
}
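The way include and exclude interact can be sketched as follows. This is a hedged illustration, not Charlie's implementation: `FilterConfig` and `filterFiles` are hypothetical names, and the sketch assumes that a file is kept when it matches at least one include pattern (or when there are none) and matches no exclude pattern, with patterns tested against repository-relative paths.

```typescript
interface FilterConfig {
  include?: string[];
  exclude?: string[];
}

function filterFiles(paths: string[], config: FilterConfig): string[] {
  const include = (config.include ?? []).map((pattern) => new RegExp(pattern));
  const exclude = (config.exclude ?? []).map((pattern) => new RegExp(pattern));
  return paths.filter((path) => {
    // An empty include list means "include everything".
    const included = include.length === 0 || include.some((re) => re.test(path));
    // Exclude patterns are applied after the include patterns.
    const excluded = exclude.some((re) => re.test(path));
    return included && !excluded;
  });
}

const kept = filterFiles(
  ["src/a.ts", "src/a.test.ts", "node_modules/x/index.js", "README.md", "lib/b.js"],
  { include: ["^src/", "^lib/"], exclude: ["node_modules/", "\\.test\\."] },
);
// kept: ["src/a.ts", "lib/b.js"]
```

Patterns without a `^` or `$` anchor match anywhere in the path, so "dist/" would also exclude a nested directory such as "packages/app/dist/".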

after (optional)

Type: string (ISO date format)
Default: One year ago from the current date

Specifies the earliest date for git commits to include in the analysis. Only commits made after this date will be considered.

{
  "after": "2023-01-01T00:00:00.000Z"
}
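For illustration, resolving the effective cutoff date could look roughly like this. The `resolveAfterDate` helper is hypothetical, and rejecting unparsable dates with an error is an assumption; the sketch only shows the documented default of one year before the current date.

```typescript
// Hypothetical helper: returns the date after which commits are considered.
function resolveAfterDate(after?: string, now: Date = new Date()): Date {
  if (after === undefined) {
    // Default: one year before "now" (computed in UTC to stay deterministic).
    const oneYearAgo = new Date(now.getTime());
    oneYearAgo.setUTCFullYear(oneYearAgo.getUTCFullYear() - 1);
    return oneYearAgo;
  }
  const parsed = new Date(after);
  if (isNaN(parsed.getTime())) {
    throw new Error(`Invalid "after" date: ${after}`);
  }
  return parsed;
}

const explicitCutoff = resolveAfterDate("2023-01-01T00:00:00.000Z");
const defaultCutoff = resolveAfterDate(undefined, new Date("2024-06-15T12:00:00.000Z"));
// defaultCutoff.toISOString() === "2023-06-15T12:00:00.000Z"
```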

architecturalGroups (optional)

Type: Record<string, string> (regex pattern → group name mapping)
Default: undefined (no grouping)

Allows you to group files into architectural components for analysis. The key is a regex pattern that matches file paths, and the value is the name of the architectural group. Files matching the same group will be consolidated into single entries. Only the first group that matches a file is used.

When architecturalGroups is specified, Charlie generates both file-level and grouped visualizations in the report:

  1. File-level Hotspots - Shows individual files as separate hotspots
  2. Grouped Hotspots - Shows architectural groups as consolidated hotspots
  3. Coupling Analysis - Shows both file-level and group-level coupling relationships with expandable details

This allows you to see both the detailed file-level view and the higher-level architectural view simultaneously.

{
  "architecturalGroups": {
    "^src/components/": "UI Components",
    "^src/services/": "Business Logic",
    "^src/utils/": "Utilities",
    "^src/hooks/": "React Hooks"
  }
}
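The "first match wins" rule and the consolidation of files into groups can be sketched like this. The names (`FileMetrics`, `resolveGroup`, `groupHotspots`) are hypothetical, and two things are assumptions: that "first" means declaration order in the config, and that combining metrics means summing them.

```typescript
interface FileMetrics {
  path: string;
  complexity: number;
  revisions: number;
}

// Returns the name of the first group whose pattern matches the path.
// Object.keys preserves declaration order for non-numeric string keys.
function resolveGroup(path: string, groups: Record<string, string>): string | undefined {
  for (const pattern of Object.keys(groups)) {
    if (new RegExp(pattern).test(path)) return groups[pattern];
  }
  return undefined;
}

// Assumption: metrics are summed per group; files matching no group are skipped.
function groupHotspots(
  files: FileMetrics[],
  groups: Record<string, string>,
): Map<string, { complexity: number; revisions: number }> {
  const result = new Map<string, { complexity: number; revisions: number }>();
  for (const file of files) {
    const group = resolveGroup(file.path, groups);
    if (group === undefined) continue;
    const current = result.get(group) ?? { complexity: 0, revisions: 0 };
    result.set(group, {
      complexity: current.complexity + file.complexity,
      revisions: current.revisions + file.revisions,
    });
  }
  return result;
}
```

Because only the first match is used, more specific patterns should be listed before broader ones (for example "^src/components/" before "^src/").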

Complete Example

Here's a comprehensive example of a .charlie.config.json file:

{
  "include": ["^src/", "^lib/"],
  "exclude": [
    "node_modules/",
    "\\.test\\.",
    "\\.spec\\.",
    "dist/",
    "build/",
    "__tests__/",
    "\\.d\\.ts$"
  ],
  "after": "2023-06-01T00:00:00.000Z",
  "architecturalGroups": {
    "^src/components/": "UI Layer",
    "^src/services/": "Service Layer",
    "^src/store/": "State Management",
    "^src/utils/": "Utilities",
    "^src/types/": "Type Definitions"
  }
}

How It Works

  1. File Filtering: Charlie first applies the include patterns (if any), then applies the exclude patterns to filter which files are analyzed.

  2. Date Filtering: Git commits are filtered to only include those made after the specified after date.

  3. Architectural Grouping: If architecturalGroups is specified, files matching the regex patterns are grouped together and their complexity/revision metrics are combined. Both the original file-level hotspots and the grouped architectural hotspots are displayed in separate visualizations.

  4. Interactive Analysis: Additional filtering options for SOC analysis, coupled pairs, and other metrics are available through the interactive frontend, allowing you to adjust thresholds and percentiles dynamically without regenerating the analysis.

This configuration system allows you to focus your analysis on specific parts of your codebase and organize the results in a way that makes sense for your project's architecture.

Thoughts

On Architectural Grouping

I personally haven't yet found an easy and useful case for architectural grouping. Usually when the codebase is messy, it's very hard to group things properly, but these types of codebases are the ones you usually need to analyze with tools like Charlie. And the codebases where things are easy to group, well... things are usually obvious enough without needing to group them.

This creates an interesting paradox: the feature works best on codebases that need it least, and struggles most on codebases that would benefit from it the most. That said, your mileage may vary - if you have a reasonably well-organized codebase with clear architectural boundaries that just needs some fine-tuning, architectural grouping might provide valuable insights. Or, perhaps, you have a good codebase, but the number of files is so large that it's hard to see the forest for the trees.

Credits

Special thanks to Aleksandra Kozlova and Darya Losich for their contributions and support in making this project possible. I'm also thankful to Adam Tornhill for his book and for the inspiration. And special thanks to my wife, Olga, for making impressed faces when I show her the visualizations.

TODO:

  • Calculate the complexity of each file based on the number of lines and the number of peaks
  • Retrieve number of revisions for each file
  • Calculate the hotspots based on the complexity and the number of revisions
  • Produce D3 visualization of the hotspots inside an html page
  • Pick prettier colours for the visualization
  • Move colours into theme.ts file
  • Add a diagram (and calculation) for the coupling pairs
  • Add a diagram (and calculation) for the SOC
  • Add .charlie.config.json file support. It should support a list of files that should be excluded from the analysis, and a list of files that should be included, and a list of files that should be grouped into "architectural components".
  • Add a way to group files into "architectural components"
  • Coupled pairs and SOC should show only significant data. This filtering should be implemented in the backend using revisionsPercentile and minCouplingPercentage configuration options.
  • Rewrite frontend in react or similar, add dynamic filtering and grouping. Consider rendering svg directly instead of d3, consider using react-spring for animations.
    • Rewrite hotspots
    • Rewrite coupled pairs
      • Rewrite coupled pairs table
      • Add filtering on the frontend
      • Rewrite coupled pairs diagram
    • Add some form of routing
    • Rewrite SOC
    • Rewrite word count
    • Integrate coupling analysis: combine SOC and coupled pairs into unified expandable view
  • Default date should be 1 year before the last commit in the repo
  • Architectural groups should produce 1. hotspots 2. coupled pairs
  • Alternatively, maybe .charlie.config.json should be a starting point, but then in the webpage the user could change the config to see different slices of data.
  • Reading file for complexity should be done in chunks
  • Coupled pairs are sorted in the table by normalized coupling percentage times the number of revisions, but filtered by two thresholds: minCouplingPercentage and revisionsPercentile. Instead, we should use the sorting that we have, and filter by single percentile threshold.
  • Add a way to find file/module owners
  • Add a way to find teams that happen to form by analyzing authors
  • Add a way to show fractal diagrams for the files or modules based on their ownership
  • Add a way to show the "social graph" of the team based on the git history
  • Add a way to calculate truck-factor of a project (i.e. how many people can leave before 50% of the code is left without a knowledgeable maintainer)
