JSON files with structural features of 1.6 million commits in 622 Java repositories on GitHub. Those features are extracted from the corresponding AST differences using Coming.
All the commits for a particular repository are stored in a single JSON file, one object per line.
That JSON file is stored in the directory corresponding to that repository.
For example, the features of the commits in ReactiveX/RxJava are stored in the directory ReactiveX/RxJava.
Each directory also contains a stats.json file with feature extraction statistics.
Finally, everything is xz-compressed. The uncompressed size is 49 GB.
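The layout above (one JSON object per line, xz-compressed) can be read without unpacking the data to disk. The following is a minimal sketch using only the Python standard library. It assumes that each per-repository JSON file is compressed individually; the function name iter_commits and any path passed to it are illustrative and not part of the dataset.

```python
import json
import lzma


def iter_commits(path):
    """Yield one commit dict per non-empty line of an xz-compressed JSON-lines file."""
    # "rt" decompresses on the fly and decodes the bytes as text.
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between objects
                yield json.loads(line)
```

Because the function is a generator, only one commit is held in memory at a time, which matters given the 49 GB uncompressed size. A call would look like iter_commits("ReactiveX/RxJava/<file>.json.xz"), where the file name is a placeholder.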
Each JSON object corresponds to a commit. For every file modified in that commit, it stores an array of edits which would produce the destination AST if applied to the source AST. Every edit records the type of the change (INS, DEL, MOV, UPD), the changed entity (the entity types are listed below), the lists of parent and child nodes in the AST, and the location in the file. Depending on the type of the change, some fields may be missing. For example, if the type of the change is DEL, the field location_dst, which corresponds to the location in the new version of the file, will not be present. Here is a sample pretty-printed JSON object:
{
  "id": "hash of the commit",
  "files": [{
    "file_name": "name of the modified file",
    "features": [{
      "label": "label of the modified element, e.g. java.util.Map$Entry#getKey()",
      "type": "type of the modified element, e.g. Invocation",
      "op": "short name of the edit action: INS, DEL, MOV or UPD",
      "children": "JSON representation of the AST subtree corresponding to this element",
      "location_src": [
        "number of starting line",
        "number of ending line",
        "number of starting character",
        "number of ending character"
      ],
      "location_dst": "same as for location_src but w.r.t. the file after the changes",
      "parents_src": {
        "parent_ids": "array of ids of parent nodes in the source AST, ordered from the immediate parent up to the root; could be used for matching the changes, e.g. some element may have been deleted from a subtree which was moved",
        "parent_names": "array of names of parent nodes; same order as for parent_ids"
      },
      "parents_dst": "same as for parents_src but w.r.t. the AST after the changes",
      "upd_to_tree": "present only in the case of the UPD action; contains a JSON representation of the resulting AST subtree which corresponds to the updated element"
    }]
  }]
}
Besides, there is a full example.json.
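Since some fields are absent depending on the edit action, code that walks the edits should not index them unconditionally. The sketch below shows one defensive way to traverse a commit object with the keys from the sample. The helper names iter_edits and describe_edit and all values in sample_commit are made up for illustration; they are not taken from the dataset or from Coming.

```python
def iter_edits(commit):
    """Yield (file_name, edit) pairs for every edit recorded in one commit object."""
    for changed_file in commit.get("files", []):
        for edit in changed_file.get("features", []):
            yield changed_file.get("file_name"), edit


def describe_edit(edit):
    """Return (op, type, location_src, location_dst); absent fields become None."""
    return (
        edit.get("op"),
        edit.get("type"),
        edit.get("location_src"),  # location in the old version of the file
        edit.get("location_dst"),  # not present for DEL edits
    )


# Hypothetical commit with made-up values, shaped like the sample object.
sample_commit = {
    "id": "0000000",
    "files": [{
        "file_name": "Foo.java",
        "features": [
            {"op": "DEL", "type": "Invocation", "location_src": [3, 3, 10, 25]},
            {"op": "UPD", "type": "Literal",
             "location_src": [7, 7, 4, 6], "location_dst": [6, 6, 4, 7]},
        ],
    }],
}

summaries = [describe_edit(edit) for _, edit in iter_edits(sample_commit)]
```

Using dict.get keeps the traversal from raising KeyError on a DEL edit, whose location_dst is missing by design.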
INS: Insertion of a node or a subtree in the AST
DEL: Deletion of a node or a subtree in the AST
MOV: Move of a node or a subtree within the AST
UPD: Update of a node or a subtree in the AST
Complete list of entity types.
Annotation
AnnotationFieldAccess
ArrayAccess
ArrayRead
ArrayWrite
Assert
Assignment
BinaryOperator
Block
Case
Catch
CatchVariableImpl
CFlowBreak
CodeSnippetExpression
Comment
Conditional
Constructor
ConstructorCall
Do
Enum
EnumValue
Field
FieldAccess
FieldRead
FieldWrite
For
ForEach
If
Import
Interface
Invocation
JavaDoc
JavaDocTag
LabelledFlowBreak
Lambda
Literal
LocalVariable
Method
NewArray
NewClass
OperatorAssignment
Parameter
Return
SuperAccess
Synchronized
TargetedExpression
ThisAccess
Throw
Try
TryWithResource
Type
TypeAccess
TypeMember
UnaryOperator
VariableRead
VariableWrite
While
There is a plot which shows the distribution of different types of changes. Note that the x-axis has a log scale. The type of a change is a composition of the type of the edit action and the type of the entity. For example, INS/VariableRead means that a variable access was inserted into the code.
The Jupyter notebook with the code that produces this plot can be accessed here.
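The change types described above, i.e. the edit action combined with the entity type, can be tallied with a few lines of Python. This is a minimal sketch, not the code from the notebook; the function name change_type_counts and the demo data are assumptions made for illustration.

```python
from collections import Counter


def change_type_counts(commits):
    """Count change types of the form "<op>/<type>", e.g. "INS/VariableRead"."""
    counts = Counter()
    for commit in commits:
        for changed_file in commit.get("files", []):
            for edit in changed_file.get("features", []):
                counts[f"{edit['op']}/{edit['type']}"] += 1
    return counts


# Made-up commits, only to show the shape of the result.
demo_commits = [
    {"id": "c1", "files": [{"file_name": "A.java", "features": [
        {"op": "INS", "type": "VariableRead"},
        {"op": "INS", "type": "VariableRead"},
        {"op": "DEL", "type": "Invocation"},
    ]}]},
    {"id": "c2", "files": [{"file_name": "B.java", "features": [
        {"op": "UPD", "type": "Literal"},
    ]}]},
]

demo_counts = change_type_counts(demo_commits)
```

The resulting Counter can be passed to any plotting library; a logarithmic axis, as in the plot, is a reasonable choice when frequencies differ by orders of magnitude.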
The included repositories have more than 500 stars and more than 1000 commits. We considered only the default branch. We forked and modified Coming. Internally, this tool uses GumTreeDiff to compute the set of AST edits. Be aware that this algorithm is not perfect and in some cases may produce a few non-intuitive edits.