Exploring Supervision Levels for Patent Classification
Type
Master's thesis
Program
Data science and AI (MPDSC), MSc
Published
2022
Authors
van Hoewijk, Adam
Holmström, Henrik
Abstract
Machine learning can help automate monotonous work. However, most approaches
use supervised learning, requiring a labeled dataset. The consulting firm Konsert
Strategy & IP AB (Konsert) sees great value in automating its task of manually
classifying patents into a custom technology tree. But the ever-changing categories
leave a pre-labeled dataset unavailable. Can other forms of supervision be used for
machine learning to excel without extensive data? This thesis explores how weakly
supervised, semi-supervised, and supervised learning can help Konsert classify
patents with minimal hand-labeling. It also examines how class granularity affects
performance and whether exploiting patents’ unique characteristics
can help.
Two existing state-of-the-art methods at two supervision levels are employed:
LOTClass, a keyword-based weakly supervised approach, and MixText, a
semi-supervised approach. We also propose LabelLR, a supervised approach based
on patents’ Cooperative Patent Classification (CPC) labels. Each method, along with
an ensemble of all three, is tested on every granularity level of a technology tree
provided by Konsert. MixText receives all unlabeled patent abstracts together
with the same ten labeled documents per class that LabelLR receives. LOTClass,
on the other hand, receives the unlabeled abstracts along with class keywords.
Results reveal that the small training dataset of around 4,200 patents leaves LOTClass
struggling while MixText excels. LabelLR outperforms MixText on the rare
occasions when the CPC labels and the classifications closely match. The ensemble
proves more consistent than LabelLR but only outperforms MixText on some
granular classes. In conclusion, a semi-supervised approach appears to offer the best
balance of minimal manual work and classification proficiency, reaching an accuracy
of 60.7% on 33 classes using only ten labeled patents per class.
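
The abstract does not spell out how LabelLR works internally. As a rough illustration only, the sketch below assumes it can be read as a logistic regression over multi-hot CPC codes, trained on a handful of labeled patents per class. The toy patents, the two class names, and every function name are hypothetical and are not taken from the thesis; only the CPC subclass codes themselves (H01M, G06N, A61K) are real.

```python
import math

# Hypothetical toy data: real CPC subclass codes, invented class assignments.
VOCAB = ["H01M", "G06N", "A61K"]
CLASSES = ["batteries", "machine learning"]
TRAIN = [
    ({"H01M"}, 0),
    ({"H01M", "A61K"}, 0),
    ({"G06N"}, 1),
    ({"G06N", "A61K"}, 1),
]


def multi_hot(codes, vocab=VOCAB):
    """Encode a patent's set of CPC codes as a 0/1 vector over the vocabulary."""
    return [1.0 if code in codes else 0.0 for code in vocab]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, x):
    """Linear score per class: dot(W[k], x) + b[k]."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train(samples, n_classes, lr=0.5, epochs=200):
    """Fit multinomial logistic regression with plain stochastic gradient descent."""
    W = [[0.0] * len(VOCAB) for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for codes, label in samples:
            x = multi_hot(codes)
            probs = softmax(class_scores(W, b, x))
            for k in range(n_classes):
                # Gradient of the cross-entropy loss w.r.t. the score of class k.
                grad = probs[k] - (1.0 if k == label else 0.0)
                b[k] -= lr * grad
                for j, v in enumerate(x):
                    W[k][j] -= lr * grad * v
    return W, b


def predict(W, b, codes):
    """Return the index of the highest-scoring class for a set of CPC codes."""
    scores = class_scores(W, b, multi_hot(codes))
    return max(range(len(scores)), key=scores.__getitem__)


W, b = train(TRAIN, n_classes=len(CLASSES))
print(CLASSES[predict(W, b, {"H01M"})])
print(CLASSES[predict(W, b, {"G06N", "A61K"})])
```

Under this reading, the classifier can only do well when the CPC codes line up with the target classes, which is consistent with the abstract's remark that LabelLR wins only when the CPC labels and the classifications closely match.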
Subject/keywords
Patent, Weakly supervised learning, Semi-supervised learning, Supervised learning, BERT