主題分析自動化的可行性:深度學習技術應用於偏態分佈之中文文件分類的實證評估
The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification
曾元顯 Yuen-Hsien Tseng
DOI: 10.6120/JoEMLS.202003_57(1).0047.RS.CE
教育資料與圖書館學
Journal of Educational Media & Library Sciences
March 2020, Vol. 57, No. 1, pp. 121-144
Language: Chinese
Type: Research article
Abstract
Text classification (TC) is the task of assigning predefined categories (or labels) to texts for information organization, knowledge management, and many other applications. In library science applications the categories are normally topical, although they can be any labels suited to an application. TC therefore often requires topic analysis, which relies on human knowledge. In recent decades, however, machine learning (ML) techniques have been applied to TC for efficiency, provided a sufficient number of training texts is available for each category. Nevertheless, in real-world cases the number of texts (documents) for each category is often highly skewed for a given TC task. This leads to the problem of predicting labels for small categories, which is viable for humans but challenging for machines. Deep learning (DL) is an emerging class of ML inspired by the neural networks of the human brain. This study evaluates whether DL techniques are feasible for this problem by comparing the performance of four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional ML techniques on five skew-distributed datasets (four in Chinese, and one in English for comparison). Our results show that BERT is effective for moderately skewed datasets but is still not feasible for highly skewed TC tasks. The other three DL methods (CNN, RCNN, fastText) show no advantage over traditional methods such as SVM on the five TC tasks, even though they capture extra language knowledge through pre-trained word representations. To facilitate future study, all of the Chinese datasets used in this study have been released publicly, together with all of the adapted machine learning and evaluation source code, for verification and further study at https://github.com/SamTseng/Chinese_Skewed_TxtClf.
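
To ground the comparison, here is a minimal, hypothetical sketch in Python with scikit-learn of the kind of traditional baseline (TF-IDF features feeding a linear SVM) that the study pits the DL methods against. The toy texts and labels below are invented placeholders, not the released corpora, and Chinese text is assumed to be pre-segmented into space-separated tokens.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

# Toy skewed corpus (placeholder data, not the released datasets):
# category "A" dominates while "C" is rare, mimicking a skewed TC task.
train_texts = [
    "文件 分類 主題 分析", "知識 組織 資訊 管理", "圖書 資訊 主題 標引",
    "資訊 檢索 知識 管理", "機器 學習 模型 訓練", "支援 向量 機器 分類",
    "深度 學習 語意 理解",
]
train_labels = ["A", "A", "A", "A", "B", "B", "C"]
test_texts = ["主題 分析 文件 組織", "機器 學習 分類 訓練", "深度 學習 語意 網路"]
test_labels = ["A", "B", "C"]

# TF-IDF term weighting feeding a linear SVM: a strong traditional baseline.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
pred = clf.predict(test_texts)

# Micro-F1 is dominated by the large categories; macro-F1 averages the
# per-category F1 scores, so failure on rare categories drags it down.
print("micro-F1:", f1_score(test_labels, pred, average="micro"))
print("macro-F1:", f1_score(test_labels, pred, average="macro"))

Swapping the pipeline for a CNN, RCNN, fastText, or fine-tuned BERT classifier and re-running the same micro/macro-F1 evaluation yields the kind of comparison the abstract summarizes.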

APA citation format

Tseng, Y.-H. (2020). The feasibility of automated topic analysis: An empirical evaluation of deep learning techniques applied to skew-distributed Chinese text classification. Journal of Educational Media & Library Sciences, 57(1), 121-144. https://doi.org/10.6120/JoEMLS.202003_57(1).0047.RS.CE

Chicago citation format

Tseng, Yuen-Hsien. “The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification.” Journal of Educational Media & Library Sciences 57, no. 1 (March 2020): 121-44. https://doi.org/10.6120/JoEMLS.202003_57(1).0047.RS.CE.