## GenText-Checker-0.31

An efficient plagiarism-checking tool for articles.

### Basic Usage

You can compile GenText-Checker (gtchecker) with the following command:

    make all

The output of the compiler will be:

    g++ -Wall -std=c++17 -o gtchecker gtchecker.cc
    g++ -Wall -std=c++17 -o test test.cc

Then run gtchecker on two documents:

    ./gtchecker plag ./data/paper_1.txt ./data/paper_1_D3.txt

The output of gtchecker will be:

    Doc_A: ./data/paper_1.txt  Doc_B: ./data/paper_1_D3.txt
    
    Doc_A: [0-50] Deeper neural networks are more difficult to train.
    Doc_B: [0-43] Deeper learning are more difficult to train.
    Doc_A: [52-182] We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.
    Doc_B: [45-178] We present a residual learning framework for easing the training of networks that are substantially deeper than those used previously.
    
    ...
    
    Doc_A: [948-1071] Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset.
    Doc_B: [955-1071] Due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset.
    Doc_A: [1073-1293] Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
    Doc_B: [1073-1299] Deep residual networks are foundations of our submissions to ILSVRC & COCO 2015 competitions1, where we also won the first places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
    
    Doc_A has 100% similarity with Doc_B.
    Doc_B has 100% similarity with Doc_A.
    
This indicates that every sentence in Doc_A is similar to a sentence in Doc_B, and every sentence in Doc_B is similar to a sentence in Doc_A.

    Doc_A: [0-50] Deeper neural networks are more difficult to train.
    Doc_B: [0-43] Deeper learning are more difficult to train.

`[0-50]` and `[0-43]` are the zero-based, inclusive character ranges that the matched sentence occupies in each document.
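The ranges can be used to pull a reported sentence back out of the original file. The sketch below is not part of gtchecker; `slice_inclusive` is a hypothetical helper, and it assumes the indices count single-byte characters (plain ASCII text), which matches the sample output above, where a 51-character sentence is reported as `[0-50]`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of gtchecker): returns the text covered by a
// zero-based, inclusive [first-last] range as printed in the gtchecker output.
// Assumes one byte per character.
std::string slice_inclusive(const std::string& text,
                            std::size_t first,
                            std::size_t last) {
    if (first > last || last >= text.size()) {
        throw std::out_of_range("range lies outside the document");
    }
    // Both ends are inclusive, so the length is last - first + 1.
    return text.substr(first, last - first + 1);
}
```

For example, calling `slice_inclusive(doc, 0, 50)` on the contents of `paper_1.txt` should return the first sentence shown in the report.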

In version 0.31, gtchecker uses a strict sentence-based plagiarism check by default. This means that gtchecker still detects a sentence even if only a few of its words have been changed.
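This README does not describe how the strict check scores two sentences, so the following is only an illustration of how a tolerant sentence comparison could work. It uses Jaccard word overlap as a stand-in technique; the function names and the `0.6` threshold are assumptions and are not taken from gtchecker.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: the set of lower-cased alphanumeric words in a sentence.
std::set<std::string> word_set(const std::string& sentence) {
    std::set<std::string> words;
    std::string current;
    for (unsigned char ch : sentence) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            words.insert(current);
            current.clear();
        }
    }
    if (!current.empty()) words.insert(current);
    return words;
}

// Jaccard index: shared words divided by all distinct words in both sentences.
double jaccard_similarity(const std::string& a, const std::string& b) {
    const std::set<std::string> wa = word_set(a);
    const std::set<std::string> wb = word_set(b);
    if (wa.empty() && wb.empty()) return 1.0;
    std::size_t shared = 0;
    for (const std::string& w : wa) shared += wb.count(w);
    const std::size_t total = wa.size() + wb.size() - shared;
    return static_cast<double>(shared) / static_cast<double>(total);
}

// Assumed threshold: sentences scoring at or above it count as similar.
bool similar_strict(const std::string& a, const std::string& b,
                    double threshold = 0.6) {
    return jaccard_similarity(a, b) >= threshold;
}
```

Under this stand-in metric, "Deeper neural networks are more difficult to train." and "Deeper learning are more difficult to train." share 6 of 9 distinct words, so they would still be reported as similar.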

If you think this check is overly strict, e.g., you do not consider the two sentences "*Deeper neural networks are more difficult to train.*" and "*Deeper learning are more difficult to train.*" similar, you can use the loose checking method with the following command:

    ./gtchecker plag ./data/paper_1.txt ./data/paper_1_D3.txt --loose

The output of gtchecker will be:

    Doc_A: ./data/paper_1.txt  Doc_B: ./data/paper_1_D3.txt
    
    Doc_A: [184-330] We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
    Doc_B: [180-326] We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
    Doc_A: [644-724] An ensemble of these residual nets achieves 3.57% error on the ImageNet test set.
    Doc_B: [649-729] An ensemble of these residual nets achieves 3.57% error on the ImageNet test set.
    Doc_A: [796-857] We also present analysis on CIFAR-10 with 100 and 1000 layers.
    Doc_B: [803-864] We also present analysis on CIFAR-10 with 100 and 1000 layers.
    Doc_A: [859-946] The depth of representations is of central importance for many visual recognition tasks.
    Doc_B: [866-953] The depth of representations is of central importance for many visual recognition tasks.
    
    Doc_A has 36.3636% similarity with Doc_B.
    Doc_B has 36.3636% similarity with Doc_A.

In this example, gtchecker uses the loose checking method, which compares two sentences word by word. gtchecker reports a pair only when the two sentences are exactly the same, i.e., every word matches after ignoring upper/lower case and noise characters.
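A minimal sketch of such a word-by-word comparison is shown below. It is not the gtchecker source: the function names are hypothetical, and it assumes that "noise characters" means anything that is not a letter or digit. The `similarity_percent` helper further assumes that the reported percentage is the share of matched sentences; that reading is consistent with the output above (4 matched sentences and 36.3636%, which equals 4/11), but the formula is not stated here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits a sentence into lower-cased alphanumeric words.
// Every other character is treated as noise and acts as a separator.
std::vector<std::string> normalized_words(const std::string& sentence) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char ch : sentence) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}

// Loose check: the two sentences match only if their word sequences are identical.
bool same_sentence_loose(const std::string& a, const std::string& b) {
    return normalized_words(a) == normalized_words(b);
}

// Assumed formula: percentage of sentences in a document that found a match.
double similarity_percent(std::size_t matched, std::size_t total) {
    if (total == 0) return 0.0;
    return 100.0 * static_cast<double>(matched) / static_cast<double>(total);
}
```

With this sketch, a single changed word is enough to make `same_sentence_loose` return `false`, which is why the modified first sentence no longer appears in the loose report.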

### Run Benchmark

We provide a benchmark that contains 60 cases to test the performance of gtchecker. To run the benchmark, use the following command:
    
    ./benchmark.sh

The output of the benchmark will be:

    make: `gtchecker' is up to date.
    Result of gtchecker is written to output.txt

You can then inspect the benchmark results in the `output.txt` file. By default, the benchmark uses strict mode to check for plagiarism.

### HFWS Signature

In version 0.31, gtchecker supports a novel method called *high frequency words signature (HFWS)*, which helps users single out likely plagiarism cases from a large number of documents. This method has the potential to reduce the computation and storage overhead of LLM services that check articles for plagiarism. Users can apply this method with the following command:

    ./gtchecker sig ./doc_A.txt

The output will be:

    Signature: w3u23059wu0ld305u9w31sc0ud25E3mL5c0q0c0ucd105f5b3u2c0w53u0c051

This command generates a digital signature for doc_A. Users can generate a signature for doc_B in the same way:

    ./gtchecker sig ./doc_B.txt

The output will be:

    Signature: w3u2b059wu0ld305u9w31sc0ud25E3mL5c0q0c0ucd105f5b3u2c0w53u0c051

Instead of checking the whole content of the input documents, we can compare their signatures as a first step to single out the likely plagiarism cases.
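How two signatures should be compared is not specified here, so the sketch below is only one possible pre-filter. It assumes that signatures can be compared position by position, and both function names and the `max_distance` default are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: number of positions at which two signatures differ.
// Any difference in length is counted as additional mismatches.
std::size_t signature_distance(const std::string& a, const std::string& b) {
    const std::size_t common = std::min(a.size(), b.size());
    std::size_t distance = std::max(a.size(), b.size()) - common;
    for (std::size_t i = 0; i < common; ++i) {
        if (a[i] != b[i]) ++distance;
    }
    return distance;
}

// Assumed policy: only pairs whose signatures are close enough are passed on
// to the full (and more expensive) sentence-level check.
bool needs_full_check(const std::string& a, const std::string& b,
                      std::size_t max_distance = 4) {
    return signature_distance(a, b) <= max_distance;
}
```

The two signatures shown above differ in a single character, so under this assumption the pair would be forwarded to `./gtchecker plag` for a full comparison.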

### Testing

To ensure that gtchecker produces the expected output, we suggest running the test cases. Make sure you have already compiled the programs using one of the following commands:

    make all
 
or

    make test

Then run the test program:
    
    ./test
    
The output of the test program will be:

    23 cases passed. 0 cases failed.