Structural features extracted from commits

Download link.

JSON files with structural features of 1.6 million commits in 622 Java repositories on GitHub. Those features are extracted from the corresponding AST differences using Coming.

Format

All the commits for a particular repository are stored in a single JSON file, one object per line. That JSON file is stored in the directory corresponding to that repository. For example, the features of the commits in ReactiveX/RxJava are stored in the directory ReactiveX/RxJava. Each directory also contains a stats.json file with feature extraction statistics. Finally, everything is xz-compressed; the uncompressed size is 49GB.
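Because each repository file holds one JSON object per line, it can be streamed commit by commit instead of being loaded whole. The following is a minimal sketch, assuming that an individual file has been obtained in xz-compressed form; the helper name `iter_commits` and the file name in the usage note are hypothetical, since the exact file names are not stated here.

```python
import json
import lzma
from pathlib import Path
from typing import Iterator


def iter_commits(path: Path) -> Iterator[dict]:
    """Yield one commit object per non-empty line of an xz-compressed file.

    Assumption: `path` points to a single xz-compressed file that contains
    one JSON object per line, as described above.
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

For example, `for commit in iter_commits(Path("ReactiveX/RxJava/commits.json.xz")): ...` would process the commits one at a time (the file name `commits.json.xz` is a placeholder). If the data has already been decompressed, the built-in `open` can be used in place of `lzma.open`.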

Format of the JSON objects

Each JSON object corresponds to a commit. For every file modified in that commit, it stores an array of edits that would produce the destination AST if applied to the source AST. Every edit records the type of the change (INS, DEL, MOV, UPD), the changed entity (the types of entities are listed below), the parent and child nodes in the AST, and the location in the file. Depending on the type of the change, some fields may be missing. For example, if the type of the change is DEL, the field location_dst (which corresponds to the location in the new version of the file) will not be present. Here is a sample pretty-printed JSON object:

{
    "id": "hash of the commit",
    "files": [{
        "file_name": "name of the modified file",
        "features": [{
            "label": "label of the modified element, e.g. java.util.Map$Entry#getKey()",
            "type": "type of the modified element, e.g. Invocation",
            "op": "short name of the edit action, i.e. INS, DEL, MOV, UPD",
            "children": "JSON representation of the AST subtree corresponding to this element",
            "location_src": [
                "number of starting line",
                "number of ending line",
                "number of starting character",
                "number of ending character"
            ],
            "location_dst": "same as for location_src but w.r.t. the file after the changes",
            "parents_src": {
                "parent_ids": "array of ids of parent nodes in the source AST; could be
                               used for matching the changes, i.e. some element may have
                               been deleted from a subtree which was moved; it's ordered
                               from the immediate parent up to the root",
                "parent_names": "array of names of parent nodes; same order as for parent_ids"
            },
            "parents_dst": "same as for parents_src but w.r.t. the AST after the changes",
            "upd_to_tree": "present only in the case of UPD action. This field contains
                            a JSON representation of the resulting AST subtree which
                            corresponds to the updated element"
        }]
    }]
}

A complete example is available in example.json.
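Since some fields are absent depending on the edit type, code that walks a commit object should not assume that every key exists. The sketch below flattens a commit into one record per edit and reads the optional fields with `dict.get`; the names `Edit` and `iter_edits` are hypothetical helpers, and only keys shown in the sample object above are used.

```python
from typing import Iterator, NamedTuple, Optional


class Edit(NamedTuple):
    """Flat view of one edit; optional fields are None when absent."""
    commit_id: str
    file_name: str
    op: str
    entity_type: str
    label: Optional[str]
    location_src: Optional[list]
    location_dst: Optional[list]


def iter_edits(commit: dict) -> Iterator[Edit]:
    """Yield every edit of every modified file in a commit object."""
    for changed_file in commit.get("files", []):
        for feature in changed_file.get("features", []):
            yield Edit(
                commit_id=commit["id"],
                file_name=changed_file["file_name"],
                op=feature["op"],
                entity_type=feature["type"],
                # These keys may be missing, e.g. no location_dst for DEL.
                label=feature.get("label"),
                location_src=feature.get("location_src"),
                location_dst=feature.get("location_dst"),
            )
```

Using `None` for a missing location keeps the record shape uniform, so downstream code can test `edit.location_dst is None` rather than catching `KeyError`.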

Types of edit actions

INS: Insertion of a node or a subtree in the AST
DEL: Deletion of a node or a subtree in the AST
MOV: Move of a node or a subtree within the AST
UPD: Update of a node or a subtree in the AST

Types of entities

Complete list of entity types.

```
Annotation
AnnotationFieldAccess
ArrayAccess
ArrayRead
ArrayWrite
Assert
Assignment
BinaryOperator
Block
Case
Catch
CatchVariableImpl
CFlowBreak
CodeSnippetExpression
Comment
Conditional
Constructor
ConstructorCall
Do
Enum
EnumValue
Field
FieldAccess
FieldRead
FieldWrite
For
ForEach
If
Import
Interface
Invocation
JavaDoag
LabelledFlowBreak
Lambda
Literal
LocalVariable
Method
NewArray
NewClass
OperatorAssignment
Parameter
Return
SuperAccess
Synchronized
TargetedExpression
ThisAccess
Throw
Try
TryWithResource
Type
TypeAccess
TypeMember
UnaryOperator
VariableRead
VariableWrite
While
```

There is a plot which shows the distribution of the different types of changes. Note that the x-axis has a log scale. The type of a change is the combination of the type of the edit action and the type of the entity. For example, INS/VariableRead means that a variable access was inserted into the code. The Jupyter Notebook with the code that produces this plot is available here.
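The change-type composition described above can be illustrated with a small counting routine. This is only a sketch under stated assumptions, not the code from the notebook: the function name `count_change_types` is hypothetical, and it relies solely on the `op` and `type` keys from the sample object.

```python
from collections import Counter
from typing import Iterable


def count_change_types(commits: Iterable[dict]) -> Counter:
    """Count edits per change type, written as '<op>/<entity type>'."""
    counts = Counter()
    for commit in commits:
        for changed_file in commit.get("files", []):
            for feature in changed_file.get("features", []):
                # e.g. op "INS" and type "VariableRead" -> "INS/VariableRead"
                counts[f"{feature['op']}/{feature['type']}"] += 1
    return counts
```

Calling `count_change_types(commits).most_common(10)` on an iterable of commit objects would list the ten most frequent change types. Because such counts tend to span several orders of magnitude, a logarithmic axis is a natural choice when plotting them.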

Origins

The included repositories have more than 500 stars and more than 1000 commits. We considered only the default branch. The features were extracted with a modified fork of Coming. Internally, this tool uses GumTreeDiff to compute the set of AST edits. Be aware that this algorithm is not perfect and may in some cases produce non-intuitive edits.

License

Open Data Commons Open Database License (ODbL)
