What is unicity?

I wrote this library to test the code in the hundreds of Python files submitted by my programming class. It can also be used to spot code submissions that are suspiciously similar.

To get started, first install unicity using pip.

pip install unicity

Then download the examples from https://github.com/ddempsey/unicity. These will show you how to load a project, run a simple test, perform comparisons, and a few other tricks. Or if you prefer a tutorial, see here.

Some disclaimers:

  • I have discovered that students learning to code are very innovative in the ways in which they will break your workflow. Therefore, I have built in catches for directory changes and infinite loops. But there will be other things I have not anticipated. You may catch and handle these in your test function, but I’d also like to hear if you think something should be added to make the testing more robust.
  • The similarity checking should be treated as a screening tool to highlight potential instances of copying. It does not, and cannot, establish that copying has actually occurred; the best way to establish that is by interview.

Python classes

Each submitter is called a client. Each client submits a portfolio of Python code. The set of all portfolios is called a project. The set of all clients is called a cohort.

A project is a collection of Python files submitted by several different clients. They should be contained in a zip archive or folder. Each file should follow a particular naming convention:

{clientname}_*{expected_file}*.{ext}

where {expected_file}.{ext} is passed as an argument to the Project constructor.
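As a rough illustration of the naming convention above, the sketch below pulls the client name out of a submitted file name. The helper client_from_filename is hypothetical and not part of unicity. It assumes that client names contain no underscores and that the asterisks in the pattern act as wildcards; the library's own parsing may differ.

```python
import re


def client_from_filename(filename, expected):
    """Return the client name if filename matches the convention, else None.

    Hypothetical helper, not part of unicity. Assumes client names contain
    no underscores and that '*' in the convention means "any characters".
    """
    stem, ext = expected.rsplit(".", 1)
    pattern = re.compile(
        r"^(?P<client>[^_]+)_.*" + re.escape(stem) + r".*\." + re.escape(ext) + r"$"
    )
    match = pattern.match(filename)
    return match.group("client") if match else None


# Extra text around the expected file name is tolerated by the wildcards.
print(client_from_filename("jsmith_functions.py", "functions.py"))
print(client_from_filename("jsmith_lab1_functions_v2.py", "functions.py"))
print(client_from_filename("notes.txt", "functions.py"))
```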

Passing a path to a cohort file will give the Project additional data to tag each portfolio and report missing files.

Test parts of the submission using the test method. This will require you to define a test function, that in turn calls functions or classes from a client’s portfolio.

Check similarity metrics across the entire project with the compare method. You can compare particular methods, exclude similarity deriving from a common template, and compare with past projects.

Projects

class unicity.Project(project, expecting, ignore=['*'], cohort=None, root=None)

Class for all client portfolios.

project : str
Path to folder or zip archive containing client portfolios.
expecting : list (optional)
List of expected file names in each portfolio.
ignore : list (optional)
List of file names to ignore (Unix-style wildcard patterns are fine).
cohort : str (optional)
Path to file containing information about clients.
root : str (optional)
String for naming of output files.
client : dict
Dictionary of Client objects, keyed by client name.
clientlist : list
List of Client objects, ordered alphabetically by client name.
portfolio_status : dict
Contains completeness information about individual client portfolios.
test_status : dict
Contains testing information about individual client portfolios.
findstr
Look for string instances in client portfolios.
dump_docstrings
Write docstring information to file.
compare
Compute similarity metrics between client portfolios.
similarity_report
Produce a plot of similarity across the project.
summarise
Output a summary of the project.
test
Batch testing of the code.

The cohort file should contain client information in CSV format. The first row defines column names and must contain at least a ‘name’ column. Client information is given on separate rows.

Each column is provisioned as a separate attribute for each client as discovered. If the ‘email’ attribute is present, this will be reported in output.
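To make the cohort file layout concrete, here is a minimal sketch that reads a CSV with a ‘name’ column and exposes every column as an attribute of a per-client object. The column values, the load_cohort helper and the use of SimpleNamespace are illustrative assumptions; unicity's Client class handles this internally and may do so differently.

```python
import csv
import io
from types import SimpleNamespace

# Made-up cohort data: the first row names the columns, one client per row.
COHORT_CSV = """name,email,stream
jsmith,jsmith@example.com,A
akumar,akumar@example.com,B
"""


def load_cohort(text):
    """Map each client name to an object with one attribute per CSV column.

    Hypothetical helper, not part of unicity.
    """
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or "name" not in reader.fieldnames:
        raise ValueError("cohort file needs a 'name' column")
    return {row["name"]: SimpleNamespace(**row) for row in reader}


cohort = load_cohort(COHORT_CSV)
print(cohort["jsmith"].email)
```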

compare(routine, metric='command_freq', ncpus=1, template=None, prior_project=None, prior_routine=None)

Compares pairs of portfolios for similarity between implemented Python routines.

routine : str
Callable sequence in unicity format for comparison at three levels: {file}, {file}/{function} or {file}/{class}.{method}.
metric : str, callable (optional)
Distance metric for determining similarity. See below for more information on the different metrics available. Users can pass their own metric functions but these must adhere to the specification below. (default ‘command_freq’)
ncpus : int (optional)
Number of CPUs to use when running comparisons. (default 1, single thread)
template : str (optional)
File path to Python template file for referencing the comparison.
prior_project : unicity.Project (optional)
Project object for a different cohort, used to check the current cohort for similarities with a previous one. Comparisons are not made between clients of the prior project.
prior_routine : str (optional)
Callable sequence in unicity format for comparison from prior_project. If not given, will default to routine.
c : unicity.Comparison
Object containing comparison information.

Specify a template file when Python files are likely to exhibit similarity because the portfolios were developed from a common template. Each portfolio is compared to the template for similarity, and this is ‘subtracted’ from any similarity between a pair of portfolios.

Exercise caution when drawing conclusions of similarity between short code snippets, as short snippets are more likely to produce false positives.

command_freq (default) : Compares the frequency of distinct commands in two scripts.
Commands counted include all callables (function/methods), control statements (if/elif/else), looping statements (for/while/continue/break), boolean operators (and/or/not) and try/except.
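The idea behind a command-frequency comparison can be sketched with the standard ast module. The code below counts calls, control and looping statements, boolean operators and try blocks, then turns the two frequency tables into a distance between 0 and 1. The function names and the normalised absolute-difference formula are assumptions made for illustration; the text above does not state how unicity computes the distance.

```python
import ast
from collections import Counter

# Statement and operator node types counted alongside calls. elif appears in
# the tree as a nested If, and a bare else has no node of its own.
COUNTED_NODES = (ast.If, ast.For, ast.While, ast.Continue, ast.Break,
                 ast.And, ast.Or, ast.Not, ast.Try)


def command_counts(source):
    """Count callables and selected keywords in a piece of Python source."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):          # e.g. print(...)
                counts[func.id] += 1
            elif isinstance(func, ast.Attribute):   # e.g. values.append(...)
                counts[func.attr] += 1
        elif isinstance(node, COUNTED_NODES):
            counts[type(node).__name__] += 1
    return counts


def frequency_distance(source1, source2):
    """Return 0.0 for identical command frequencies, 1.0 for no overlap."""
    c1, c2 = command_counts(source1), command_counts(source2)
    total = sum(c1.values()) + sum(c2.values())
    if total == 0:
        return 0.0
    diff = sum(abs(c1[key] - c2[key]) for key in set(c1) | set(c2))
    return diff / total


loop_a = "for i in range(3):\n    print(i)\n"
loop_b = "while True:\n    break\n"
print(frequency_distance(loop_a, loop_a), frequency_distance(loop_a, loop_b))
```

Because only frequencies are compared, two scripts that use the same commands in a different order are indistinguishable under this sketch.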

jaro : Compares the order of the statements above for matches and transpositions.
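For readers unfamiliar with it, the standard Jaro similarity can be written for any pair of sequences, such as the ordered lists of commands from two scripts. The sketch below is a textbook implementation, offered under the assumption that a distance would then be taken as one minus the similarity; unicity's own implementation is not shown in this document and may differ in detail.

```python
def jaro_similarity(seq1, seq2):
    """Textbook Jaro similarity between two sequences (1.0 = identical)."""
    if not seq1 and not seq2:
        return 1.0
    if not seq1 or not seq2:
        return 0.0
    # Items only count as matching if they sit within this many positions.
    window = max(max(len(seq1), len(seq2)) // 2 - 1, 0)
    matched1 = [False] * len(seq1)
    matched2 = [False] * len(seq2)
    matches = 0
    for i, item in enumerate(seq1):
        start = max(0, i - window)
        stop = min(len(seq2), i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and seq2[j] == item:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched items that appear in a different order are transpositions.
    ordered1 = [x for x, m in zip(seq1, matched1) if m]
    ordered2 = [x for x, m in zip(seq2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(ordered1, ordered2)) // 2
    return (matches / len(seq1) + matches / len(seq2)
            + (matches - transpositions) / matches) / 3


def jaro_distance(seq1, seq2):
    """Assumed conversion to a distance: 0 = highly similar, 1 = dissimilar."""
    return 1.0 - jaro_similarity(seq1, seq2)


# Same commands with two neighbours swapped: still close, but not identical.
order_a = ["For", "range", "If", "print", "append"]
order_b = ["For", "range", "print", "If", "append"]
print(round(jaro_distance(order_a, order_b), 3))
```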

Users can pass their own callable that computes a distance metric between two files. The specification for this callable is:

file1 : File
Python File object for client 1.
file2 : File
Python File object for client 2.
template : File
Python File object for template file.
name1 : str
Routine name in first file for comparison. If not given, the comparison is assumed to operate on entire file.
name2 : str
Routine name in second file for comparison. If not given, the comparison is assumed to operate on entire file.
dist : float
Float between 0 and 1 indicating degree of similarity (0 = highly similar, 1 = highly dissimilar).

User metrics that raise exceptions are caught and passed, with the error message written to unicity_compare_errors.txt for debugging.
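A user-supplied metric following the specification above might look like the sketch below, which compares the sets of callables used in two files. It relies on the all_calls attribute documented for PythonFile further down. The function name, the default values, the way the template is handled and the stand-in objects used to exercise it are assumptions; in particular, this document does not say whether unicity passes the arguments positionally or by keyword.

```python
from types import SimpleNamespace


def call_set_distance(file1, file2, template=None, name1=None, name2=None):
    """Jaccard distance between the sets of callables used in two files.

    Returns a float in [0, 1]: 0 = highly similar, 1 = highly dissimilar.
    name1 and name2 are accepted but ignored, so this sketch always
    compares whole files.
    """
    calls1 = set(file1.all_calls)
    calls2 = set(file2.all_calls)
    if template is not None:
        # Assumed handling: ignore callables already present in the template.
        shared = set(template.all_calls)
        calls1 -= shared
        calls2 -= shared
    union = calls1 | calls2
    if not union:
        return 0.0
    return 1.0 - len(calls1 & calls2) / len(union)


# Stand-in objects exposing only the attribute this metric reads.
file_a = SimpleNamespace(all_calls=["print", "len", "open"])
file_b = SimpleNamespace(all_calls=["print", "len", "sum"])
starter = SimpleNamespace(all_calls=["print"])
print(call_set_distance(file_a, file_b))
```

A function like this would be passed as the metric argument of compare in place of a metric name.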

See example/similarity_check.py at https://github.com/ddempsey/unicity for more details and examples.

dump_docstrings(routine, save)

Writes docstring information out to file.

routine : str
Callable sequence in unicity format {file}/{function}, or {file}/{class}.{method}.
save : str
File path to write docstrings.

findstr(string, location=None, verbose=False, save=None)

Check portfolio for instances of string.

string : str
String to locate in portfolios.
location : str (optional)
Search limited to a particular file, class or routine, specified in unicity format {file}, {file}/{function}, or {file}/{class}.{method}.
verbose : bool (optional)
Flag to print discovered strings to screen (default False).
save : str (optional)
File to save results of string search.
found_clients : list
Client objects whose portfolios contain the search string.

similarity_report(comparison, client=None, save=None)

Creates a summary of similarity metrics.

comparison : unicity.Comparison
Comparison object produced by Project.compare.
client : str (optional)
Name of client to highlight or ‘anon’ for anonymised report.
save : str (optional)
Name of output file (defaults to {clientname}_{function}.png).

summarise(save, verbose=False)

Create log file with portfolio information.

save : str
File path to save output summary.
verbose : bool (optional)
Print output to screen as well.

test(routine, ncpus=1, language='python', client=None, timeout=None, **kwargs)

Runs the named unit test function against client portfolios.

routine : str
Name of unit test function (not a function handle).
ncpus : int (optional)
Number of CPUs to use when running tests. (default 1, single thread)
client : str (optional)
Run test for individual client, writing their code out to file. (default is to run test for all clients.)
timeout : float (optional)
Time after which the test should be exited. Only use a timeout if you think there is the possibility of infinite loops. Using a timeout will impact performance. The timeout is only implemented for serial execution. (default is no timeout)
language : str (optional)
Type of test to run. Base class supports ‘python’ only. Additional tests can be added by subclassing and defining a new method _test_{language}. (default ‘python’)
**kwargs
Passed to the language-specific method _test_{language}.

Testing proceeds by running a unit test function defined within the script from which the test method is called.

The unit test should contain at least one import statement for the generically named client file (the file name passed to ‘expecting’ when constructing the Project).

The unit test should raise an error if the client’s code is in error. For instance an assert statement can be used to check a result is returned correctly.

See example/batch_testing.py at https://github.com/ddempsey/unicity for more details and examples.

Clients

class unicity.Client(name, cohort=None)

Class for individual client.

name : str
Name associated with client.
files : dict
Dictionary of File objects, keyed by file name, associated with the client’s portfolio.
missing : list
Expected files missing from the client’s portfolio.

When created by a Project object with an associated cohort file, Client objects are provisioned with attributes corresponding to the columns in the cohort file.

Comparisons

class unicity.Comparison(loadfile, prior_project=None)

A class for batch comparisons of code.

loadfile : str
File path to previously saved comparison object.
prior_project : unicity.Project (optional)
If comparison was constructed against a prior project, pass this object to read prior client information.
matrix : array-like
Distance matrix of similarities.
routine : str
Code unit for comparison in unicity string format.
prior_routine : str
Code unit for prior comparison in unicity string format.
save
Write comparison data to file.

save(savefile)

Saves a precomputed comparison file.

savefile : str
Name of file to save similarity data.

PythonFiles

class unicity.PythonFile(filename, zipfile)

Class containing information about Python file.

all_calls : list
All callables in file, in order of appearance.
all_keywords : list
All callables and reserved keywords in file, in order of appearance.
classes : dict
ClassInfo objects corresponding to classes defined in file.
filename : str
Name of file.
functions : dict
FunctionInfo objects corresponding to callables defined in file.
import_froms : list
List of [module, thing] import from statements in from {module} import {thing} format.
import_lines : list
Line numbers of import statements.
lns : list
Text of file.
reserved : dict
Frequency of reserved keywords.

FunctionInfo and ClassInfo

class unicity.FunctionInfo(**kwargs)

Class containing information about Python callable.

all_keywords : list
All callables and reserved keywords in file, in order of appearance.
docstring : str
Docstring for the function.
funcs : list
Names of functions called within this function.
import_froms : list
List of [module, thing] import from statements in from {module} import {thing} format.
lineno : list
First and last line number of function in file.
lns : list
Text of function.
methods : list
Names of methods called within the function.
name : str
Name of the function.
names : list
A Python ast category I don’t fully understand, but useful for catching when the client invokes an Exception.
user_classes : list
Names of classes defined by the client and used within this function.
user_funcs : list
Names of functions defined by the client and used within this function.
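To give a feel for where attributes such as docstring, lineno and funcs can come from, the sketch below extracts similar information with the standard ast module. It is an illustration only: the describe_functions helper and its dictionary layout are assumptions and do not reflect how unicity builds FunctionInfo objects. The end_lineno attribute requires Python 3.8 or later.

```python
import ast

# Example source to analyse; the leading newline puts 'def' on line 2.
SOURCE = '''
def area(radius):
    """Return the area of a circle."""
    from math import pi
    return pi * pow(radius, 2)
'''


def describe_functions(source):
    """Collect docstring, line span and called function names per function."""
    info = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            # Only plain-name calls such as pow(...); method calls are skipped.
            calls = [n.func.id for n in ast.walk(node)
                     if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]
            info[node.name] = {
                "docstring": ast.get_docstring(node),
                "lineno": [node.lineno, node.end_lineno],
                "funcs": calls,
            }
    return info


details = describe_functions(SOURCE)
print(details["area"])
```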

class unicity.ClassInfo(**kwargs)

Class containing information about Python class.

all_keywords : list
All callables and reserved keywords in file, in order of appearance.
base : str
Base class.
defs : list
Names of methods defined within the class.
docstring : str
Docstring for the class.
lineno : list
First and last line number of class definition in file.
lns : list
Text of class.
methods : list
Names of methods called within the class.
name : str
Name of the class.
