clonedigger package¶
Submodules¶
clonedigger.abstract_syntax_tree module¶
abstract_syntax_tree module
Todo
What do free_variable_cost represent ?
Todo
Move free_variable_cost to FreeVariable.free_variable_cost
-
class
clonedigger.abstract_syntax_tree.
AbstractSyntaxTree
(name=None, line_numbers=[], source_file=None)[source]¶ Bases:
object
Tree structure representing code.
From (Bulychev et al., 2008): II. A. Strictly speaking, the abstract syntax trees we use are not always trees, since leaves containing the same variable references may be merged, […].
A statement (stmt) is a node in the tree that contains an particular tree of code. It is defined by the Language, see simple statements and compound statements. A (really) non exhaustive example (made by printing a tree):
- a class is a statement
- an assignement using only builtin operation is a statement
- __init__ function is a statement
- global is a statement
But:
- a += is not a statement
- return is not
- calling a function is not
- print is not
- defining a function is not
Todo
Why is += not a stmt, as it is a simple stmt according to the reference. Same for function definition.
Todo
self._line_numbers is only used to initialize _covered_line_numbers in self.propagateCoveredLineNumbers. Remove the attribute and initialize _covered_line_numbers ?
Parameters: - _name (str) – Name of the node
- _source_file (SourceFile) – SourceFile containing this tree
- _line_numbers (List[int]) – Index of lines covered by the tree.
- _covered_line_numbers (List[int]) – Line covered by the tree and subtrees
- _is_statement (bool) – This tree is a statement
- _hash (int) – Hash of the tree
- _mark (Cluster) – Used for clustering
- _parent (AbstractSyntaxTree) – Parent node
- _childs (List[AbstractSyntaxTree]) – List of child nodes
- _height (int) – Depth of the tree
- _size (float) – Number of node + free variable costs
- _none_count (int) – Number of None node in subtrees
-
addChild
(child, save_parent=False)[source]¶ Add a child and set its parent to self if save_parent
Parameters: - child (AbstractSyntaxTree) – Child to add
- save_parent (bool, optional) – Set child’s parent to self, defaults to False
-
getAllStatementSequences
()[source]¶ Return sequences of statement that cover at least arguments.size_threshold lines.
Recursively search for statements, a new sequence is created when the lines covered by the sequence exceeds arguments.size_threshold.
Todo
Are the sequence in a specific order and what do they represent ?
Todo
Is it possible that a subtree s of t and t are in the same sequence ?
Returns: todo: Return type: {List[StatementSequence]}
-
getAncestors
()[source]¶ Return ancestors which are statements.
Used only in StatementSequence.getAncestors.
Returns: Ancestors that are statements Return type: {List[AbstractSyntaxTree]}
-
getDCupHash
(level)[source]¶ Compute a hash using child nodes
The hash is computed by summing per node hashes. A node hash is made of the depth of the node, the name and the number of child.
Parameters: level (int) – maximum depth to compute hash Returns: a tree hash Return type: {int}
-
getHeight
()[source]¶ Return height for this tree
The height is the maximum depth of the tree
Returns: Height of tree Return type: {int}
-
getSize
(ignore_none=True)[source]¶ Return number of nodes and free variables cost
Parameters: ignore_none (bool, optional) – Do not account for None nodes, defaults to True Returns: size of tree Return type: {float}
-
getSourceLines
()[source]¶ Return source lines covered by the tree
Returns: A list of lines Return type: {List[str]}
-
getTokenCount
()[source]¶ Count certain tokens in tree
Tokens are listed below
Returns: Number of tokens Return type: {int} Todo
What is this ? t.getName()[0] != “’” and t.getName() != ‘Pass’
-
propagateCoveredLineNumbers
()[source]¶ Compute self._covered_line_numbers for self and childrens
Returns: A set of line numbers Return type: {Set[int]}
-
class
clonedigger.abstract_syntax_tree.
PairSequences
(sequences)[source]¶ Bases:
object
-
class
clonedigger.abstract_syntax_tree.
SourceFile
(file_name)[source]¶ Bases:
object
Abstract class that create AST from a code file.
Read a code file, parse it and create a corresponding AST.
Parameters: - _source_lines (List[str]) – Original lines of the source file.
- _file_name (str) – Name of the source file.
- _tree (AbstractSyntaxTree) – AST representing the source file.
-
distance_threshold
= 5¶ Maximum edit distance of a clone.
-
getSourceLine
(n)[source]¶ Return n-th line of the file
Parameters: n (int) – Line number Returns: n-th line of self._file_name Return type: {str}
-
size_threshold
= 5¶ Minimum number of covered lines of a clone.
-
class
clonedigger.abstract_syntax_tree.
StatementSequence
(sequence=[])[source]¶ Bases:
object
Holds a sequence of statements (AST)
Used in suffix_tree.SuffixTree to find patterns in code.
Parameters: - _sequence (List[AbstractSyntaxTree]) – A sequence of Statement.
- _source_file (SourceFile) – Source file Statements are from.
-
constructTree
()[source]¶ Create a tree where childrens are the element of the sequence.
Used in PairSequences to compute distance between two sequences.
Returns: A tree containing every element of the sequence. Return type: {AbstractSyntaxTree}
-
getAncestors
()[source]¶ Return ancestors of first element
Used only in clone_detection_algorithm.remove_dominated_clones. Depends on AST.getAllStatementSequences.
Todo
Why self[0], and not a set of all ancestors of every statement ?
-
getLineNumberHashables
()[source]¶ Return covered line numbers as (source_file, line_number)
Return tuples so aggregating line numbers from different files will not overwrite line numbers from different files.
Todo
rename function to getCoveredLineNumbersSourceFile ? or merge with getCoveredLineNumbers by adding a parameter ?
Returns: List of covered line numbers Return type: {List[Tuple[str, int]]}
-
clonedigger.abstract_syntax_tree.
filter_func
(s)[source]¶ Remove trailing whitespace if begins with non space characters
Todo
Equivalent of re.sub(r’[^s]+)s*’, r’’, s) ?
Parameters: s (str) – String to be filtered Returns: Trimmed string Return type: {str} >>> filter_func('def function(): \n') 'def function():' >>> filter_func('\n \n') '\n \n'
clonedigger.anti_unification module¶
anti_unification module
-
class
clonedigger.anti_unification.
Cluster
(tree=None)[source]¶ Bases:
object
Create a cluster consisting of AbstractSyntaxTree
TODO: how it is used
Parameters: tree (AbstractSyntaxTree, optional) – First element to add, defaults to None -
addWithoutUnification
(tree)[source]¶ Add tree to cluster without updating unifier_tree
Parameters: tree (AbstractSyntaxTree) – Tree
-
count
= 0¶
-
getAddCost
(tree)[source]¶ Compute the cost of adding a tree to the cluster.
Parameters: tree (AbstractSyntaxTree) – tree
-
unify
(tree)[source]¶ Add tree to cluster by unifying cluster’s unifier with tree.
Parameters: tree (AbstractSyntaxTree) – Tree
-
-
class
clonedigger.anti_unification.
FreeVariable
[source]¶ Bases:
clonedigger.abstract_syntax_tree.AbstractSyntaxTree
AST node used as placeholder for anti-unifiyng.
-
addChild
(child, save_parent=False)¶ Add a child and set its parent to self if save_parent
Parameters: - child (AbstractSyntaxTree) – Child to add
- save_parent (bool, optional) – Set child’s parent to self, defaults to False
-
free_variables_count
= 1¶ Count instanciation of this object
-
getAllStatementSequences
()¶ Return sequences of statement that cover at least arguments.size_threshold lines.
Recursively search for statements, a new sequence is created when the lines covered by the sequence exceeds arguments.size_threshold.
Todo
Are the sequence in a specific order and what do they represent ?
Todo
Is it possible that a subtree s of t and t are in the same sequence ?
Returns: todo: Return type: {List[StatementSequence]}
-
getAncestors
()¶ Return ancestors which are statements.
Used only in StatementSequence.getAncestors.
Returns: Ancestors that are statements Return type: {List[AbstractSyntaxTree]}
-
getChildCount
()¶
-
getChilds
()¶
-
getCoveredLineNumbers
()¶
-
getDCupHash
(level)¶ Compute a hash using child nodes
The hash is computed by summing per node hashes. A node hash is made of the depth of the node, the name and the number of child.
Parameters: level (int) – maximum depth to compute hash Returns: a tree hash Return type: {int}
-
getFullHash
()¶ Compute a hash for the whole tree
Returns: a tree hash Return type: {int}
-
getHeight
()¶ Return height for this tree
The height is the maximum depth of the tree
Returns: Height of tree Return type: {int}
-
getLineNumbers
()¶
-
getMark
()¶
-
getName
()¶
-
getParent
()¶
-
getSize
(ignore_none=True)¶ Return number of nodes and free variables cost
Parameters: ignore_none (bool, optional) – Do not account for None nodes, defaults to True Returns: size of tree Return type: {float}
-
getSourceFile
()¶
-
getSourceLines
()¶ Return source lines covered by the tree
Returns: A list of lines Return type: {List[str]}
-
getTokenCount
()¶ Count certain tokens in tree
Tokens are listed below
Returns: Number of tokens Return type: {int} Todo
What is this ? t.getName()[0] != “’” and t.getName() != ‘Pass’
-
isStatement
()¶
-
markAsStatement
(val=True)¶
-
propagateCoveredLineNumbers
()¶ Compute self._covered_line_numbers for self and childrens
Returns: A set of line numbers Return type: {Set[int]}
-
propagateHeight
()¶ Compute height for this tree
The height is the maximum depth of the tree
Returns: Height of self Return type: {int}
-
setMark
(mark)¶
-
setName
(name)¶
-
setParent
(parent)¶
-
storeSize
()¶ Compute the number of nodes and free variables cost recursively
The number of nodes and free_variable_cost, also compute the number of None node
Returns: Size of tree Return type: {float}
-
-
class
clonedigger.anti_unification.
Substitution
(initial_value=None)[source]¶ Bases:
object
A Dict[AbstractSyntaxTree, AbstractSyntaxTree] that replace recursively a tree by another.
TODO: This could extend dict
-
getSize
()[source]¶ Compute size of a substitution.
It is the sum of the size of the trees in substitution.
Returns: Size of substitution. Return type: {float}
-
substitute
(tree, without_copying=False)[source]¶ Recursively replace trees by substitution.
Parameters: - tree (AbstractSyntaxTree) – Tree to be substituted.
- without_copying (bool, optional) – Return tree pointers, defaults to False
Returns: Substituated tree
Return type: {AbstractSyntaxTree}
-
-
class
clonedigger.anti_unification.
Unifier
(t1, t2, ignore_parametrization=False)[source]¶ Bases:
object
An anti-unifier is a tree that generalize other trees (2) using placeholder.
Example
Tree 1: Add (Name (i), Name (j)) Tree 2: Add (Name (n), Const (1)) Anti-unifier: Add (Name (?1 ), ?2 )
- Used in:
- abstract_syntax_tree.PairSequences: used to compute distance between 2 trees
- anti_unification.Cluster: used to add trees to cluster
- report.Report: to know if there is a substitution between two code fragment and display them red or cyan
From (Bulychev et al., 2008) II. A.: As the name suggests, given two terms, it produces a more general one that covers both rather than a more specific one as in unification. Let E1 and E2 be two terms. Term E is a generalization of E1 and E2 if there exist two substitutions σ1 and σ2 such that σ1(E) = E1 and σ2(E) = E2. The most specific generalization of E1 and E2 is called anti-unifier. The process of finding an anti-unifier is called anti-unification. […] The anti-unifier tree of two trees T1 and T2 is obtained by replacing some subtrees in T1 and T2 by special nodes, containing term placeholders which are marked with integers. We will represent such nodes as ?n. For example, the anti-unifier of Add (Name (i), Name (j)) and Add (Name (n), Const (1)) will be Add (Name (?1 ), ?2 ). In some abstract syntax tree representations occurrences of the same variable refer to the same leaf in a tree. In this case the anti-unifier of Add(Name(i),Name(i)) and Add(Name(j),Name(j)) will be Add(Name(?1),Name(?1)).
Parameters: - t1 (AbstractSyntaxTree) – Tree 1
- t2 (AbstractSyntaxTree) – Tree 2
- ignore_parametrization (bool, optional) – todo:
-
_combineSubs
(node, s, t, ignore_parametrization=False)[source]¶ Aggregate substitutions.
Given an anti-unifier and two substitutions, modify node by replacing trees if the substittion are the same. If ignore_parametrization is True, this function morally performs t.update(s). Else: some keys are not updated, but they will modify the node.
Parameters: - node (AbstractSyntaxTree) – anti-unifier
- s (Tuple[Substitution,Substitution]]) – A substitution to use as update
- t (Tuple[Substitution,Substitution]]) – A substitution to update
- ignore_parametrization (bool, optional) – TODO:, defaults to False
Returns: an updated unifier
Return type: {Tuple[AbstractSyntaxTree, Tuple[Substitution,Substitution]]}
-
_unify
(node1, node2, ignore_parametrization)[source]¶ Create anti-unifier for node1 and node2.
Recursively create an anti-unifier using substitutions.
Parameters: - node1 (AbstractSyntaxTree) – Tree to be anti-unified
- node2 (AbstractSyntaxTree) – Tree to be anti-unified
- ignore_parametrization (bool) – todo:
Returns: An anti-unifier and the substitutions performed.
Return type: {Tuple[AbstractSyntaxTree, Tuple[Substitution,Substitution]]}
clonedigger.antlr_sourcefile module¶
antlk_sourcefile module
-
class
clonedigger.antlr_sourcefile.
ANTLRSourceFile
(file_name)[source]¶ Bases:
clonedigger.abstract_syntax_tree.SourceFile
-
_setTree
(tree)¶
-
distance_threshold
= 5¶
-
getFileName
()¶
-
getSourceLine
(n)¶ Return n-th line of the file
Parameters: n (int) – Line number Returns: n-th line of self._file_name Return type: {str}
-
getTree
()¶
-
size_threshold
= 5¶
-
-
class
clonedigger.antlr_sourcefile.
JavaANTLRSourceFile
(file_name)[source]¶ Bases:
clonedigger.antlr_sourcefile.ANTLRSourceFile
-
_setTree
(tree)¶
-
distance_threshold
= 7¶
-
extension
= 'java'¶
-
getFileName
()¶
-
getSourceLine
(n)¶ Return n-th line of the file
Parameters: n (int) – Line number Returns: n-th line of self._file_name Return type: {str}
-
getTree
()¶
-
parse
(file_name)¶
-
size_threshold
= 10¶
-
-
class
clonedigger.antlr_sourcefile.
JsANTLRSourceFile
(file_name)[source]¶ Bases:
clonedigger.antlr_sourcefile.ANTLRSourceFile
-
_setTree
(tree)¶
-
distance_threshold
= 5¶
-
extension
= 'js'¶
-
getFileName
()¶
-
getSourceLine
(n)¶ Return n-th line of the file
Parameters: n (int) – Line number Returns: n-th line of self._file_name Return type: {str}
-
getTree
()¶
-
parse
(file_name)¶
-
size_threshold
= 5¶
-
-
class
clonedigger.antlr_sourcefile.
LuaANTLRSourceFile
(file_name)[source]¶ Bases:
clonedigger.antlr_sourcefile.ANTLRSourceFile
-
_setTree
(tree)¶
-
distance_threshold
= 5¶
-
extension
= 'lua'¶
-
getFileName
()¶
-
getSourceLine
(n)¶ Return n-th line of the file
Parameters: n (int) – Line number Returns: n-th line of self._file_name Return type: {str}
-
getTree
()¶
-
parse
(file_name)¶
-
size_threshold
= 5¶
-
clonedigger.arguments module¶
arguments module
clonedigger.ast_suppliers module¶
ast_suppliers module
clonedigger.clone_detection_algorithm module¶
clone_detection_algorithm module
-
clonedigger.clone_detection_algorithm.
all_pairsubsequences_size_n_threshold
(n, pair_sequences)[source]¶
-
clonedigger.clone_detection_algorithm.
build_hash_to_statement
(statement_sequences, dcup_hash=True)[source]¶ Compute hash for every statement
Two statement can have the same hash
Parameters: - statement_sequences (List[StatementSequence]) – Statements to compute hash
- dcup_hash (bool, optional) – Use dcup hash, defaults to True
Returns: A map Hash -> List[Statement]
Return type: {Dict[int -> List[Statement]]}
-
clonedigger.clone_detection_algorithm.
build_unifiers
(hash_to_statement)[source]¶ Populate Cluster object with Statement.
Greedily add the cheapest statement to Clusters by unifying, thus creating Unifiers for each cluster. This clusters on hash, but finer.
Parameters: hash_to_statement (Dict[int, List[AbstractSyntaxTree]]) – Statements grouped by hash Returns: A list of unified Cluster Return type: {Dict[int, Cluster]}
-
clonedigger.clone_detection_algorithm.
clusterize
(hash_to_statement, clusters_map)[source]¶ Add statements to Cluster’s unifier
Greedily add statements to cluster based on distance to Cluster’s unifier
Parameters: - hash_to_statement (Dict[int, List[Statement]]) – Statements grouped by hash
- clusters_map (Dict[int, List[Cluster]]) – Clusters grouped by hash
-
clonedigger.clone_detection_algorithm.
filterOutLongEquallyLabeledSequences
(statement_sequences)[source]¶
-
clonedigger.clone_detection_algorithm.
filterOutLongSequences
(statement_sequences, max_length)[source]¶
-
clonedigger.clone_detection_algorithm.
findHugeSequences
(statement_sequences)[source]¶ Return candidate clones which cover at least arguments.size_threshold lines
Use a suffixTree to find candidate clones.
Parameters: statement_sequences (List[PairSequences]) – Candidate StatetementSequences
clonedigger.clonedigger module¶
clonedigger module
clonedigger.python_compiler module¶
python_compiler module
-
class
clonedigger.python_compiler.
PythonCompilerSourceFile
(file_name, func_prefixes=())[source]¶ Bases:
clonedigger.abstract_syntax_tree.SourceFile
-
_setTree
(tree)¶
-
distance_threshold
= 5¶
-
extension
= 'py'¶
-
getFileName
()¶
-
getSourceLine
(n)¶ Return n-th line of the file
Parameters: n (int) – Line number Returns: n-th line of self._file_name Return type: {str}
-
getTree
()¶
-
ignored_statements
= ['Import', 'From']¶
-
rec_build_tree
(node, is_statement=False)[source]¶ Build an AST from a compiler.ast.Node
Parameters: - node (Union[None, compiler.ast.Node]) – Compiler node to build the AST from.
- is_statement (bool, optional) – Direct childs of a ‘Stmt’ node are statements, defaults to False
Returns: An AST representing the file.
Return type: {AbstractSyntaxTree}
-
size_threshold
= 5¶
-
clonedigger.reports module¶
reports module
-
class
clonedigger.reports.
CPDXMLReport
[source]¶ Bases:
clonedigger.reports.Report
-
addClone
(clone)¶
-
addErrorInformation
(error_info)¶
-
addFileName
(file_name)¶
-
getTimerValues
()¶
-
getTotalTime
()¶
-
sortByCloneSize
()¶
-
startTimer
(descr)¶
-
stopTimer
(descr='')¶
-
-
class
clonedigger.reports.
HTMLReport
[source]¶ Bases:
clonedigger.reports.Report
-
addClone
(clone)¶
-
addErrorInformation
(error_info)¶
-
addFileName
(file_name)¶
-
getTimerValues
()¶
-
getTotalTime
()¶
-
sortByCloneSize
()¶
-
startTimer
(descr)¶
-
stopTimer
(descr='')¶
-
very_strange_const
= 'VERY_STRANGE_CONST'¶
-
clonedigger.suffix_tree module¶
suffix_tree module
-
class
clonedigger.suffix_tree.
SuffixTree
(f_code=None)[source]¶ Bases:
object
Data structure that holds suffixes of iterables
Exemple: t = SuffixTree() t.add(‘banana’)
(r (a (na (na)),banana,na (na)))t.add(‘ananas’)
(r (a (s,na (s,na (s))),banana,na (s,na (s)),s))Parameters: - _node (SuffixTreeNode) – Root node of the suffix tree
- _f_code (Function[E, K], optional) – Function acting as key to add elements in SuffixTree, defaults to identity
-
class
StringPosition
(string, position, prevelem)[source]¶ Bases:
object
Holds a position in a string
:todo what is it morally used for
Parameters: - string (Iterable[E]) – Original string of the suffix
- position (int) – Beginning position of the suffix in the original string
- prevelem (Union[K, None]) – Is this the first element of the string
-
class
SuffixTreeNode
[source]¶ Bases:
object
A node of a suffix tree
Parameters: - childs (Dict[K -> E]) – Child nodes
- string_positions (List[StringPosition]) – Information about the strings that uses this node
- ending_strings (List[StringPosition]) – Information about the strings that end in this node
-
_add
(string, prevelem)[source]¶ Add a suffix to the tree
[description]
Parameters: - string (Iterable[E]) – Suffix to add to tree
- prevelem (K) – Key of previous element
-
add
(string)[source]¶ Add all suffixes of string in the tree
[description]
Parameters: string (Iterable[E]) – String to add
-
getBestMaxSubstrings
(threshold, f=None, f_elem=None, node=None, initial_threshold=None)[source]¶ [summary]
[description]
Parameters: - threshold (int) – Used to know when to start adding candidate
- f (Function[K -> int], optional) – Used to lower the threshold when visiting a children, defaults to None
- f_elem (Function[List[E] -> int], optional) – Used to validate candidate according to initial_threshold, defaults to None
- node (SuffixTreeNode, optional) – Node to use as root, defaults to None
- initial_threshold ([type], optional) – Leave empty, keep original threshold in recursive calls, defaults to None
Returns: List of candidate clones
Return type: {List[Tuple[List[E], List[E]]]}