TNM is an open-source tool for mining socio-technical data from Git repositories and vizualizing it. Instead of implementing their own mining pipeline, researchers can use our tool or integrate it in their own mining pipelines.
TNM incorporates implementations of several established data mining techniques, or individual miners. Every class implements one interface or abstract class for each task type, which makes it easily extendable.
Base classes:
Mapper
- assign unique IDs to specific entities, e.g., users.Miner
- mine data from various sources for one task.GitMiner
- extend theMiner
interface. Use specifically Git repositories as the source of data.DataProcessor
- process mined data. Work as a buffer forMiner
classes.Calculation
- calculate complex dependencies in data. Use data from GitMiners.Visualization
- visualize processed data.
Git miners are classes implementing mining tasks.
All miners use a local Git repository for data extraction and extend the abstract class GitMiner
with two functions to process the commit history in chosen branches and save the results.
The current version of TNM includes the following GitMiner
implementations:
-
FilesOwnershipMiner
is based on the Degree of Knowledge (DOK) (paper). DOK quantifies the knowledge of a developer or a group of developers about a particular section of code. The miner yields a knowledge score for every developer to a file pair in the form of a nested map. It also extracts information about the authorship of code on a line level. -
CommitInfluenceGraphMiner
is based on an application of the PageRank to commits (paper). It finds bug-fixing commits by searching forfix
in commit messages. Then, using git blame, the miner finds previous commits in which the lines changed in the fix commit had been introduced. The output of the miner is a map of lists, with keys corresponding to fixing commit IDs and values corresponding to commits introducing the lines changed by the fixes. -
AssignmentMatrixMiner
yields a modification count for each developer to a file pair in the form of a nested map. The result is used to calculate socio-technical congruence (paper). -
FileDependencyMatrixMiner
processes commits to find the files that were changed in the same commit. For each pair of files, the miner yields a number of times they have been edited in the same commit. This data can be utilized to build the edges of the socio-technical software network based on Conway's law (paper) and calculation of socio-technical congruence. (paper). -
CoEditNetworksMiner
is based on git2net (github, paper). Yields a JSON file with a dict of commits information and a list of edits. Each edit includes pre/post file path, start line, length, number of chars, entropy of the changed block of code, Levenshtein distance between the previous and new block of code, type of edit. -
ComplexityCodeChangesMiner
is based on paper. Yields a JSON file with a dict of periods that includes the period's entropy and the stats of files changed in that period. Each file stat includes entropy and History Complexity Period Factors, such as HCPF2 and HCPF3. -
WorkTimeMiner
is a simple miner for mining the distribution of commits over time in the week. This data can be used, e.g., to improve work scheduling by finding intersections in the time distributions between different developers. -
UserChangedFilesMiner
is a simple miner for mining sets of changed files for each developer. It can be used, for example, to count how many times a certain file was edited by specific developers.
Some forms of data require non-trivial computations. To ensure extensibility, processing code is separated from the miners into dedicated classes.
-
CoordinationNeedsMatrixCalculation
computes the coordination needs matrix according to the algorithm of paper, using the data obtained byFileDependencyMatrixMiner
andAssignmentMatrixMiner
. The computation results are represented as a matrix C[i][j], where i, j are developer user IDs, and C[i][j] is the relative coordination need between the two individuals. -
MirrorCongruenceCalculation
computes the socio-technical congruence according to paper. Its output is a single number in the [0, 1] range with higher values corresponding to higher socio-technical congruence. -
PageRankCalculation
computes a PageRank vector according to the algorithm of Suzuki et al. paper. A PageRank vector contains importance rankings for each commit. The input data forPageRankCalculation
is the commit influence graph produced by theCommitInfluenceGraphMiner
. The output is a vector where each element represents the importance of a commit.
TNM includes a basic browser-based visualization class WeightedEdgesGraphHTML
for the output of
FileDependencyMatrixMiner
and CoordinationNeedsMatrixCalculation
.
- Run
./gradlew :cli:shadowJar
- Now you can use shell script to use cli
./run.sh
The script should be executed as:
./run.sh commandName options arguments
When run without arguments, run.sh
shows all available commands. Also, you can call ./run.sh commandName -h
to get information about necessary options and arguments.
Example of script usage:
./run.sh AssignmentMatrixMiner --repository ./local_repository/.git main
Modify build.gradle.kts
repositories {
maven {
url = uri("https://packages.jetbrains.team/maven/p/ictl-public/public-maven")
}
}
dependencies {
implementation("org.jetbrains.research.ictl:tnm:0.4.16")
}
val localGitPath = "./your_repository_dir/.git"
val repository = FileRepository(localGitPath)
val numThreads = 4
val branches = setOf("main", "dev")
val dataProcessor = WorkTimeDataProcessor()
val miner = WorkTimeMiner(repository, branches, numThreads = numThreads)
miner.run(dataProcessor)
val resultFile = File("./path_where_to_store_results")
val idToUserFile = File("./path_where_to_store_idToUser")
HelpFunctionsUtil.saveToJson(
resultFile,
dataProcessor.workTimeDistribution
)
HelpFunctionsUtil.saveToJson(
idToUserFile,
dataProcessor.idToUser
)
val repository = FileRepository(repositoryDirectory)
val numThreads = 4
val branches = setOf("main", "dev")
val dataProcessor = CommitInfluenceGraphDataProcessor()
val miner = CommitInfluenceGraphMiner(repository, branches, numThreads = numThreads)
miner.run(dataProcessor)
val resultFile = File("./path_where_to_store_results")
val idToCommitFile = File("./path_where_to_store_idToUser")
HelpFunctionsUtil.saveToJson(
resultFile,
dataProcessor.adjacencyMap
)
HelpFunctionsUtil.saveToJson(
idToCommitFile,
dataProcessor.idToCommit
)
Miners, calculation and mapper classes use the JSON output format. JSON is easy to read; objects (such as hash maps and arrays) serialized in the JSON format can be deserialized in other programming languages. Visualization classes generate an interactive HTML graph which can be viewed in any modern web browser and shared without worrying about the dependencies. The graph can also be edited manually to adjust its appearance if required.
// Mark processing data with marker interface InputData
data class UserName(val email: String) : InputData
// Extend data processor
class MyDataProcessor : DataProcessorMapped<UserName>() {
// Using Java Concurrent package for storing results
private val _result = ConcurrentSkipListSet<Int>()
// Backing field for immutable public field
val result : Set<Int>
get() = _result
override fun processData(data: UserName) {
val userId = userMapper.add(data.email)
_result.add(userId)
}
override fun calculate() {
println("Calculation called!")
}
}
// Extend GitMiner and override function [process]
class MyGitMiner(
repository: File,
neededBranches: Set<String>,
numThreads: Int = ProjectConfig.DEFAULT_NUM_THREADS
) : GitMiner<MyDataProcessor>(repository, neededBranches, numThreads = numThreads) {
override fun process(dataProcessor: MyDataProcessor, commit: RevCommit) {
val data = UserName(commit.authorIdent.emailAddress)
dataProcessor.processData(data)
}
}
fun main() {
val repository = File("./.git")
val branches = setOf("main")
val numThreads = 4
val miner = MyGitMiner(repository, branches, numThreads)
val dataProcessor = MyDataProcessor()
miner.run(dataProcessor)
println(dataProcessor.result)
}
You can create your own miner class by extending the Miner
interface with the necessary DataProcessor
as a generic
type. Then all you need to do is iteratively transmit data to DataProcessor
in method run(dataProcessor: T)
Yes, you can! Use the example above and when you extend GitMiner
, set the parameter numThreads
to 1. Also, you don't
need to use the Java Concurrent package for storing your results.
class MyGitMiner(
repository: FileRepository,
neededBranches: Set<String>
) : GitMiner<MyDataProcessor>(repository, neededBranches, numThreads = 1) {
// ...
}