The Threat Of False Positives

11.06.2019

By Urkund

Plagiarism is an ever-present threat to academic integrity and original thinking. If sources are not quoted and referenced correctly, earlier research cannot be identified properly, and academia would most likely be engulfed in constant fights over who came up with what. Plagiarism detection systems are therefore a helpful and crucial addition to all institutions creating knowledge. They also help us save time by automatically flagging suspected cases of plagiarism.

However, finding potential plagiarism in long texts, essays or even doctoral theses can be tricky. A common pitfall in the fight against plagiarism is failing to recognise so-called “false positives” and underestimating their importance. So keep on reading to learn what false positives are and why they are a big deal.

False positives are by definition something everyone wants to avoid: a false positive is a result that indicates a condition is present when it actually is not. In text comparison, however, false positives are created entirely by the plagiarism detection system itself.

Just have a look at the following examples. The matching words are noted in brackets after each pair, followed by the share of matching text.

  • “Salt and pepper” <> “Cats and dogs” [and] – 33%
  • “Three men in a boat” <> “Life in a Medieval City” [in a] – 40%
  • “The Adventures of Tom Sawyer” <> “The Adventures of Sherlock Holmes” [The Adventures of] – 60%

Often the amount of matching text in a document is expressed as a percentage, which suggests that the higher the percentage, the higher the likelihood of plagiarism. Or so it may seem. As the examples above show, counting every finding, including words like “and” or “in a”, means that while the percentage of detected matches rises, its relevance drops. That effect of displaying common words as potential plagiarism is what we refer to as false positives. Very often, false positives are “caused” by words that are extremely common in the language in question rather than by complicated conjunctions and appositions. The truth is that once we leave the 100% similarity mark, the lines become blurred. After all, how do you calculate the relevance of the different words that make up a text and translate it into a percentage?
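The percentages in the examples can be reproduced with a naive word count. The following Python snippet is only a minimal sketch of that arithmetic, not a description of how any particular detection system works. The function names and the hand-picked `COMMON_WORDS` list are assumptions made for illustration, and real systems compare sequences of words rather than isolated ones.

```python
import re


def words(text):
    """Lowercase a text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def match_percentage(text, source, ignore=frozenset()):
    """Share of words in `text` that also occur in `source`, in percent.

    Words listed in `ignore` are never counted as matches, but they
    still count towards the length of `text`.
    """
    text_words = words(text)
    source_words = set(words(source))
    if not text_words:
        return 0
    matches = [w for w in text_words if w in source_words and w not in ignore]
    return round(100 * len(matches) / len(text_words))


# Hypothetical, hand-picked list of very common English words.
COMMON_WORDS = frozenset({"a", "and", "in", "of", "the"})

pairs = [
    ("Salt and pepper", "Cats and dogs"),
    ("Three men in a boat", "Life in a Medieval City"),
    ("The Adventures of Tom Sawyer", "The Adventures of Sherlock Holmes"),
]

for text, source in pairs:
    naive = match_percentage(text, source)
    filtered = match_percentage(text, source, ignore=COMMON_WORDS)
    print(f"{text!r} <> {source!r}: naive {naive}%, common words ignored {filtered}%")
```

Counting every word gives 33%, 40% and 60% for the three pairs. With the common words ignored, the first two pairs drop to 0% and the third to 20%, because only “Adventures” remains as a match. A fixed word list is a crude stand-in for judging relevance, but it shows how much of a raw percentage can consist of words that say nothing about originality.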

Showing all matching texts and findings causes clutter and confusion, and risks drawing attention away from actual cases of plagiarism. It’s basically like googling the phrase “I don’t know”: you’ll simply be swamped with results. Try it. (It will give you around 7 billion search results.) Or think about the phrase “This page is intentionally left blank”. Would it even make sense to see this match in a plagiarism report? False positives also make you spend more time than necessary going through findings and could eventually lead to a wrong judgement of the student in question. What’s probably even worse is that they undermine the use of, and trust in, plagiarism detection software altogether. If you spend enough time sifting through false positives, the frustration will most probably make you give up on assessing the level of real plagiarism in the text you are reviewing.

False positives are a more common threat to originality than we think, and we need to address them properly. One way to minimise false positives is to use machine learning algorithms that improve over time, learning to recognise what a relevant text match is and what it isn’t. In the end, we should all understand that technology is here to help and support us in making more informed decisions, limiting cluttered and irrelevant data along the way. Because no matter the tools, the decision itself always needs to be made by us.