clonedigger package

Submodules

clonedigger.abstract_syntax_tree module

abstract_syntax_tree module

Todo

What do free_variable_cost represent ?

Todo

Move free_variable_cost to FreeVariable.free_variable_cost

class clonedigger.abstract_syntax_tree.AbstractSyntaxTree(name=None, line_numbers=[], source_file=None)[source]

Bases: object

Tree structure representing code.

From (Bulychev et al., 2008): II. A. Strictly speaking, the abstract syntax trees we use are not always trees, since leaves containing the same variable references may be merged, […].

A statement (stmt) is a node in the tree that contains an particular tree of code. It is defined by the Language, see simple statements and compound statements. A (really) non exhaustive example (made by printing a tree):

  • a class is a statement
  • an assignement using only builtin operation is a statement
  • __init__ function is a statement
  • global is a statement

But:

  • a += is not a statement
  • return is not
  • calling a function is not
  • print is not
  • defining a function is not

Todo

Why is += not a stmt, as it is a simple stmt according to the reference. Same for function definition.

Todo

self._line_numbers is only used to initialize _covered_line_numbers in self.propagateCoveredLineNumbers. Remove the attribute and initialize _covered_line_numbers ?

Parameters:
  • _name (str) – Name of the node
  • _source_file (SourceFile) – SourceFile containing this tree
  • _line_numbers (List[int]) – Index of lines covered by the tree.
  • _covered_line_numbers (List[int]) – Line covered by the tree and subtrees
  • _is_statement (bool) – This tree is a statement
  • _hash (int) – Hash of the tree
  • _mark (Cluster) – Used for clustering
  • _parent (AbstractSyntaxTree) – Parent node
  • _childs (List[AbstractSyntaxTree]) – List of child nodes
  • _height (int) – Depth of the tree
  • _size (float) – Number of node + free variable costs
  • _none_count (int) – Number of None node in subtrees
addChild(child, save_parent=False)[source]

Add a child and set its parent to self if save_parent

Parameters:
  • child (AbstractSyntaxTree) – Child to add
  • save_parent (bool, optional) – Set child’s parent to self, defaults to False
getAllStatementSequences()[source]

Return sequences of statement that cover at least arguments.size_threshold lines.

Recursively search for statements, a new sequence is created when the lines covered by the sequence exceeds arguments.size_threshold.

Todo

Are the sequence in a specific order and what do they represent ?

Todo

Is it possible that a subtree s of t and t are in the same sequence ?

Returns:todo:
Return type:{List[StatementSequence]}
getAncestors()[source]

Return ancestors which are statements.

Used only in StatementSequence.getAncestors.

Returns:Ancestors that are statements
Return type:{List[AbstractSyntaxTree]}
getChildCount()[source]
getChilds()[source]
getCoveredLineNumbers()[source]
getDCupHash(level)[source]

Compute a hash using child nodes

The hash is computed by summing per node hashes. A node hash is made of the depth of the node, the name and the number of child.

Parameters:level (int) – maximum depth to compute hash
Returns:a tree hash
Return type:{int}
getFullHash()[source]

Compute a hash for the whole tree

Returns:a tree hash
Return type:{int}
getHeight()[source]

Return height for this tree

The height is the maximum depth of the tree

Returns:Height of tree
Return type:{int}
getLineNumbers()[source]
getMark()[source]
getName()[source]
getParent()[source]
getSize(ignore_none=True)[source]

Return number of nodes and free variables cost

Parameters:ignore_none (bool, optional) – Do not account for None nodes, defaults to True
Returns:size of tree
Return type:{float}
getSourceFile()[source]
getSourceLines()[source]

Return source lines covered by the tree

Returns:A list of lines
Return type:{List[str]}
getTokenCount()[source]

Count certain tokens in tree

Tokens are listed below

Returns:Number of tokens
Return type:{int}

Todo

What is this ? t.getName()[0] != “’” and t.getName() != ‘Pass’

isStatement()[source]
markAsStatement(val=True)[source]
propagateCoveredLineNumbers()[source]

Compute self._covered_line_numbers for self and childrens

Returns:A set of line numbers
Return type:{Set[int]}
propagateHeight()[source]

Compute height for this tree

The height is the maximum depth of the tree

Returns:Height of self
Return type:{int}
setMark(mark)[source]
setName(name)[source]
setParent(parent)[source]
storeSize()[source]

Compute the number of nodes and free variables cost recursively

The number of nodes and free_variable_cost, also compute the number of None node

Returns:Size of tree
Return type:{float}
class clonedigger.abstract_syntax_tree.PairSequences(sequences)[source]

Bases: object

calcDistance()[source]
getLength()[source]

Return length of first sequence

Todo

Is len(self[0]) and len(self[1]) always equal ?

getMaxCoveredLineNumbersCount()[source]
getWeight()[source]

Unused

Todo

unused

subSequence(first, length)[source]
class clonedigger.abstract_syntax_tree.SourceFile(file_name)[source]

Bases: object

Abstract class that create AST from a code file.

Read a code file, parse it and create a corresponding AST.

Parameters:
  • _source_lines (List[str]) – Original lines of the source file.
  • _file_name (str) – Name of the source file.
  • _tree (AbstractSyntaxTree) – AST representing the source file.
_setTree(tree)[source]
distance_threshold = 5

Maximum edit distance of a clone.

getFileName()[source]
getSourceLine(n)[source]

Return n-th line of the file

Parameters:n (int) – Line number
Returns:n-th line of self._file_name
Return type:{str}
getTree()[source]
size_threshold = 5

Minimum number of covered lines of a clone.

class clonedigger.abstract_syntax_tree.StatementSequence(sequence=[])[source]

Bases: object

Holds a sequence of statements (AST)

Used in suffix_tree.SuffixTree to find patterns in code.

Parameters:
addStatement(statement)[source]
constructTree()[source]

Create a tree where childrens are the element of the sequence.

Used in PairSequences to compute distance between two sequences.

Returns:A tree containing every element of the sequence.
Return type:{AbstractSyntaxTree}
getAncestors()[source]

Return ancestors of first element

Used only in clone_detection_algorithm.remove_dominated_clones. Depends on AST.getAllStatementSequences.

Todo

Why self[0], and not a set of all ancestors of every statement ?

getCoveredLineNumbers()[source]
getCoveredLineNumbersCount()[source]
getLength()[source]
getLineNumberHashables()[source]

Return covered line numbers as (source_file, line_number)

Return tuples so aggregating line numbers from different files will not overwrite line numbers from different files.

Todo

rename function to getCoveredLineNumbersSourceFile ? or merge with getCoveredLineNumbers by adding a parameter ?

Returns:List of covered line numbers
Return type:{List[Tuple[str, int]]}
getLineNumbers()[source]
getSourceFile()[source]
getSourceLines()[source]
getWeight()[source]

Unused

Todo

AST.getCluster does not exist

Todo

unused

isEmpty()[source]
clonedigger.abstract_syntax_tree.filter_func(s)[source]

Remove trailing whitespace if begins with non space characters

Todo

Equivalent of re.sub(r’[^s]+)s*’, r’’, s) ?

Parameters:s (str) – String to be filtered
Returns:Trimmed string
Return type:{str}
>>> filter_func('def function(): \n')
'def function():'
>>> filter_func('\n \n')
'\n \n'

clonedigger.anti_unification module

anti_unification module

class clonedigger.anti_unification.Cluster(tree=None)[source]

Bases: object

Create a cluster consisting of AbstractSyntaxTree

TODO: how it is used

Parameters:tree (AbstractSyntaxTree, optional) – First element to add, defaults to None
addWithoutUnification(tree)[source]

Add tree to cluster without updating unifier_tree

Parameters:tree (AbstractSyntaxTree) – Tree
count = 0
eraseAllTrees()[source]
getAddCost(tree)[source]

Compute the cost of adding a tree to the cluster.

Parameters:tree (AbstractSyntaxTree) – tree
getCount()[source]
getMaxCoveredLines()[source]
getUnifierSize()[source]
getUnifierTree()[source]
unify(tree)[source]

Add tree to cluster by unifying cluster’s unifier with tree.

Parameters:tree (AbstractSyntaxTree) – Tree
class clonedigger.anti_unification.FreeVariable[source]

Bases: clonedigger.abstract_syntax_tree.AbstractSyntaxTree

AST node used as placeholder for anti-unifiyng.

addChild(child, save_parent=False)

Add a child and set its parent to self if save_parent

Parameters:
  • child (AbstractSyntaxTree) – Child to add
  • save_parent (bool, optional) – Set child’s parent to self, defaults to False
free_variables_count = 1

Count instanciation of this object

getAllStatementSequences()

Return sequences of statement that cover at least arguments.size_threshold lines.

Recursively search for statements, a new sequence is created when the lines covered by the sequence exceeds arguments.size_threshold.

Todo

Are the sequence in a specific order and what do they represent ?

Todo

Is it possible that a subtree s of t and t are in the same sequence ?

Returns:todo:
Return type:{List[StatementSequence]}
getAncestors()

Return ancestors which are statements.

Used only in StatementSequence.getAncestors.

Returns:Ancestors that are statements
Return type:{List[AbstractSyntaxTree]}
getChildCount()
getChilds()
getCoveredLineNumbers()
getDCupHash(level)

Compute a hash using child nodes

The hash is computed by summing per node hashes. A node hash is made of the depth of the node, the name and the number of child.

Parameters:level (int) – maximum depth to compute hash
Returns:a tree hash
Return type:{int}
getFullHash()

Compute a hash for the whole tree

Returns:a tree hash
Return type:{int}
getHeight()

Return height for this tree

The height is the maximum depth of the tree

Returns:Height of tree
Return type:{int}
getLineNumbers()
getMark()
getName()
getParent()
getSize(ignore_none=True)

Return number of nodes and free variables cost

Parameters:ignore_none (bool, optional) – Do not account for None nodes, defaults to True
Returns:size of tree
Return type:{float}
getSourceFile()
getSourceLines()

Return source lines covered by the tree

Returns:A list of lines
Return type:{List[str]}
getTokenCount()

Count certain tokens in tree

Tokens are listed below

Returns:Number of tokens
Return type:{int}

Todo

What is this ? t.getName()[0] != “’” and t.getName() != ‘Pass’

isStatement()
markAsStatement(val=True)
propagateCoveredLineNumbers()

Compute self._covered_line_numbers for self and childrens

Returns:A set of line numbers
Return type:{Set[int]}
propagateHeight()

Compute height for this tree

The height is the maximum depth of the tree

Returns:Height of self
Return type:{int}
setMark(mark)
setName(name)
setParent(parent)
storeSize()

Compute the number of nodes and free variables cost recursively

The number of nodes and free_variable_cost, also compute the number of None node

Returns:Size of tree
Return type:{float}
class clonedigger.anti_unification.Substitution(initial_value=None)[source]

Bases: object

A Dict[AbstractSyntaxTree, AbstractSyntaxTree] that replace recursively a tree by another.

TODO: This could extend dict

getMap()[source]
getSize()[source]

Compute size of a substitution.

It is the sum of the size of the trees in substitution.

Returns:Size of substitution.
Return type:{float}
substitute(tree, without_copying=False)[source]

Recursively replace trees by substitution.

Parameters:
  • tree (AbstractSyntaxTree) – Tree to be substituted.
  • without_copying (bool, optional) – Return tree pointers, defaults to False
Returns:

Substituated tree

Return type:

{AbstractSyntaxTree}

class clonedigger.anti_unification.Unifier(t1, t2, ignore_parametrization=False)[source]

Bases: object

An anti-unifier is a tree that generalize other trees (2) using placeholder.

Example

Tree 1: Add (Name (i), Name (j)) Tree 2: Add (Name (n), Const (1)) Anti-unifier: Add (Name (?1 ), ?2 )

Used in:
  • abstract_syntax_tree.PairSequences: used to compute distance between 2 trees
  • anti_unification.Cluster: used to add trees to cluster
  • report.Report: to know if there is a substitution between two code fragment and display them red or cyan

From (Bulychev et al., 2008) II. A.: As the name suggests, given two terms, it produces a more general one that covers both rather than a more specific one as in unification. Let E1 and E2 be two terms. Term E is a generalization of E1 and E2 if there exist two substitutions σ1 and σ2 such that σ1(E) = E1 and σ2(E) = E2. The most specific generalization of E1 and E2 is called anti-unifier. The process of finding an anti-unifier is called anti-unification. […] The anti-unifier tree of two trees T1 and T2 is obtained by replacing some subtrees in T1 and T2 by special nodes, containing term placeholders which are marked with integers. We will represent such nodes as ?n. For example, the anti-unifier of Add (Name (i), Name (j)) and Add (Name (n), Const (1)) will be Add (Name (?1 ), ?2 ). In some abstract syntax tree representations occurrences of the same variable refer to the same leaf in a tree. In this case the anti-unifier of Add(Name(i),Name(i)) and Add(Name(j),Name(j)) will be Add(Name(?1),Name(?1)).

Parameters:
_combineSubs(node, s, t, ignore_parametrization=False)[source]

Aggregate substitutions.

Given an anti-unifier and two substitutions, modify node by replacing trees if the substittion are the same. If ignore_parametrization is True, this function morally performs t.update(s). Else: some keys are not updated, but they will modify the node.

Parameters:
Returns:

an updated unifier

Return type:

{Tuple[AbstractSyntaxTree, Tuple[Substitution,Substitution]]}

_unify(node1, node2, ignore_parametrization)[source]

Create anti-unifier for node1 and node2.

Recursively create an anti-unifier using substitutions.

Parameters:
Returns:

An anti-unifier and the substitutions performed.

Return type:

{Tuple[AbstractSyntaxTree, Tuple[Substitution,Substitution]]}

getSize()[source]
getSubstitutions()[source]
getUnifier()[source]

clonedigger.antlr_sourcefile module

antlk_sourcefile module

class clonedigger.antlr_sourcefile.ANTLRSourceFile(file_name)[source]

Bases: clonedigger.abstract_syntax_tree.SourceFile

_setTree(tree)
distance_threshold = 5
getFileName()
getSourceLine(n)

Return n-th line of the file

Parameters:n (int) – Line number
Returns:n-th line of self._file_name
Return type:{str}
getTree()
parse(file_name)[source]
size_threshold = 5
class clonedigger.antlr_sourcefile.ExpatHandler(start_node, parent)[source]

Bases: object

end_element(name)[source]
start_element(xml_node_name, attrs)[source]
class clonedigger.antlr_sourcefile.JavaANTLRSourceFile(file_name)[source]

Bases: clonedigger.antlr_sourcefile.ANTLRSourceFile

_setTree(tree)
distance_threshold = 7
extension = 'java'
getFileName()
getSourceLine(n)

Return n-th line of the file

Parameters:n (int) – Line number
Returns:n-th line of self._file_name
Return type:{str}
getTree()
parse(file_name)
size_threshold = 10
class clonedigger.antlr_sourcefile.JsANTLRSourceFile(file_name)[source]

Bases: clonedigger.antlr_sourcefile.ANTLRSourceFile

_setTree(tree)
distance_threshold = 5
extension = 'js'
getFileName()
getSourceLine(n)

Return n-th line of the file

Parameters:n (int) – Line number
Returns:n-th line of self._file_name
Return type:{str}
getTree()
parse(file_name)
size_threshold = 5
class clonedigger.antlr_sourcefile.LuaANTLRSourceFile(file_name)[source]

Bases: clonedigger.antlr_sourcefile.ANTLRSourceFile

_setTree(tree)
distance_threshold = 5
extension = 'lua'
getFileName()
getSourceLine(n)

Return n-th line of the file

Parameters:n (int) – Line number
Returns:n-th line of self._file_name
Return type:{str}
getTree()
parse(file_name)
size_threshold = 5

clonedigger.arguments module

arguments module

clonedigger.ast_suppliers module

ast_suppliers module

clonedigger.clone_detection_algorithm module

clone_detection_algorithm module

clonedigger.clone_detection_algorithm.all_pairsubsequences_size_n_threshold(n, pair_sequences)[source]
clonedigger.clone_detection_algorithm.build_hash_to_statement(statement_sequences, dcup_hash=True)[source]

Compute hash for every statement

Two statement can have the same hash

Parameters:
  • statement_sequences (List[StatementSequence]) – Statements to compute hash
  • dcup_hash (bool, optional) – Use dcup hash, defaults to True
Returns:

A map Hash -> List[Statement]

Return type:

{Dict[int -> List[Statement]]}

clonedigger.clone_detection_algorithm.build_unifiers(hash_to_statement)[source]

Populate Cluster object with Statement.

Greedily add the cheapest statement to Clusters by unifying, thus creating Unifiers for each cluster. This clusters on hash, but finer.

Parameters:hash_to_statement (Dict[int, List[AbstractSyntaxTree]]) – Statements grouped by hash
Returns:A list of unified Cluster
Return type:{Dict[int, Cluster]}
clonedigger.clone_detection_algorithm.clusterize(hash_to_statement, clusters_map)[source]

Add statements to Cluster’s unifier

Greedily add statements to cluster based on distance to Cluster’s unifier

Parameters:
  • hash_to_statement (Dict[int, List[Statement]]) – Statements grouped by hash
  • clusters_map (Dict[int, List[Cluster]]) – Clusters grouped by hash
clonedigger.clone_detection_algorithm.filterOutLongEquallyLabeledSequences(statement_sequences)[source]
clonedigger.clone_detection_algorithm.filterOutLongSequences(statement_sequences, max_length)[source]
clonedigger.clone_detection_algorithm.findDuplicateCode(source_files, report)[source]
clonedigger.clone_detection_algorithm.findHugeSequences(statement_sequences)[source]

Return candidate clones which cover at least arguments.size_threshold lines

Use a suffixTree to find candidate clones.

Parameters:statement_sequences (List[PairSequences]) – Candidate StatetementSequences
clonedigger.clone_detection_algorithm.print_statistics(sequences_lengths, statement_count)[source]
clonedigger.clone_detection_algorithm.refineDuplicates(pairs_sequences)[source]
clonedigger.clone_detection_algorithm.remove_dominated_clones(clones)[source]

[summary]

[description]

Parameters:clones (List[PairSequence]) – [description]
Returns:[description]
Return type:{List[PairSequence]}

clonedigger.clonedigger module

clonedigger module

clonedigger.clonedigger.cli_arguments()[source]
clonedigger.clonedigger.main()[source]
clonedigger.clonedigger.parse_file(file_name, func_prefixes, report, lang, supplier)[source]

clonedigger.python_compiler module

python_compiler module

class clonedigger.python_compiler.PythonCompilerSourceFile(file_name, func_prefixes=())[source]

Bases: clonedigger.abstract_syntax_tree.SourceFile

_setTree(tree)
distance_threshold = 5
extension = 'py'
getFileName()
getSourceLine(n)

Return n-th line of the file

Parameters:n (int) – Line number
Returns:n-th line of self._file_name
Return type:{str}
getTree()
ignored_statements = ['Import', 'From']
rec_build_tree(node, is_statement=False)[source]

Build an AST from a compiler.ast.Node

Parameters:
  • node (Union[None, compiler.ast.Node]) – Compiler node to build the AST from.
  • is_statement (bool, optional) – Direct childs of a ‘Stmt’ node are statements, defaults to False
Returns:

An AST representing the file.

Return type:

{AbstractSyntaxTree}

size_threshold = 5
class clonedigger.python_compiler.PythonNodeLeaf(val)[source]

Store variable identifiers or numbers

Parameters:_val (str) – Variable, number identifiers
as_string()[source]
getVal()[source]
clonedigger.python_compiler.add_childs(childs, r, is_statement, source_file)[source]
clonedigger.python_compiler.add_leaf_child(child, name, r)[source]
clonedigger.python_compiler.add_leaf_childs(childs, name, r)[source]
clonedigger.python_compiler.add_leaf_string_childs(childs, r)[source]
clonedigger.python_compiler.flatten(lst)[source]

Recursively flatten a list

Parameters:lst (iterable) – An iterable which can contain other iterables
Returns:A flattened list without iterables
Return type:{List}

clonedigger.reports module

reports module

class clonedigger.reports.CPDXMLReport[source]

Bases: clonedigger.reports.Report

addClone(clone)
addErrorInformation(error_info)
addFileName(file_name)
getTimerValues()
getTotalTime()
setMarkToStatementHash(mark_to_statement_hash)[source]
sortByCloneSize()
startTimer(descr)
stopTimer(descr='')
writeReport(file_name)[source]
class clonedigger.reports.HTMLReport[source]

Bases: clonedigger.reports.Report

addClone(clone)
addErrorInformation(error_info)
addFileName(file_name)
getTimerValues()
getTotalTime()
setMarkToStatementHash(mark_to_statement_hash)[source]
sortByCloneSize()
startTimer(descr)
stopTimer(descr='')
very_strange_const = 'VERY_STRANGE_CONST'
writeReport(file_name)[source]
class clonedigger.reports.NewAsString(s)[source]

Bases: object

class clonedigger.reports.Report[source]

Bases: object

addClone(clone)[source]
addErrorInformation(error_info)[source]
addFileName(file_name)[source]
getTimerValues()[source]
getTotalTime()[source]
sortByCloneSize()[source]
startTimer(descr)[source]
stopTimer(descr='')[source]
clonedigger.reports.diff_highlight(seqs)[source]
clonedigger.reports.format_line_code(s)[source]
clonedigger.reports.highlight(s)[source]
clonedigger.reports.rec_correct_as_string(t1, t2, s1, s2)[source]
clonedigger.reports.set_as_string_node_parent(t)[source]
clonedigger.reports.use_diff(statements, indentations, source_lines)[source]

clonedigger.suffix_tree module

suffix_tree module

class clonedigger.suffix_tree.SuffixTree(f_code=None)[source]

Bases: object

Data structure that holds suffixes of iterables

Exemple: t = SuffixTree() t.add(‘banana’)

(r (a (na (na)),
banana,
na (na)))

t.add(‘ananas’)

(r (a (s,
na (s,
na (s))),
banana,
na (s,
na (s)),
s))
Parameters:
  • _node (SuffixTreeNode) – Root node of the suffix tree
  • _f_code (Function[E, K], optional) – Function acting as key to add elements in SuffixTree, defaults to identity
class StringPosition(string, position, prevelem)[source]

Bases: object

Holds a position in a string

:todo what is it morally used for

Parameters:
  • string (Iterable[E]) – Original string of the suffix
  • position (int) – Beginning position of the suffix in the original string
  • prevelem (Union[K, None]) – Is this the first element of the string
class SuffixTreeNode[source]

Bases: object

A node of a suffix tree

Parameters:
  • childs (Dict[K -> E]) – Child nodes
  • string_positions (List[StringPosition]) – Information about the strings that uses this node
  • ending_strings (List[StringPosition]) – Information about the strings that end in this node
_add(string, prevelem)[source]

Add a suffix to the tree

[description]

Parameters:
  • string (Iterable[E]) – Suffix to add to tree
  • prevelem (K) – Key of previous element
add(string)[source]

Add all suffixes of string in the tree

[description]

Parameters:string (Iterable[E]) – String to add
getBestMaxSubstrings(threshold, f=None, f_elem=None, node=None, initial_threshold=None)[source]

[summary]

[description]

Parameters:
  • threshold (int) – Used to know when to start adding candidate
  • f (Function[K -> int], optional) – Used to lower the threshold when visiting a children, defaults to None
  • f_elem (Function[List[E] -> int], optional) – Used to validate candidate according to initial_threshold, defaults to None
  • node (SuffixTreeNode, optional) – Node to use as root, defaults to None
  • initial_threshold ([type], optional) – Leave empty, keep original threshold in recursive calls, defaults to None
Returns:

List of candidate clones

Return type:

{List[Tuple[List[E], List[E]]]}

Module contents