段青玲, 魏芳芳, 张磊, 肖晓琰. 基于Web数据的农业网络信息自动采集与分类系统[J]. 农业工程学报, 2016, 32(12): 172-178. DOI: 10.11975/j.issn.1002-6819.2016.12.025
    引用本文: 段青玲, 魏芳芳, 张磊, 肖晓琰. 基于Web数据的农业网络信息自动采集与分类系统[J]. 农业工程学报, 2016, 32(12): 172-178. DOI: 10.11975/j.issn.1002-6819.2016.12.025
    Duan Qingling, Wei Fangfang, Zhang Lei, Xiao Xiaoyan. Automatic acquisition and classification system for agricultural network information based on Web data[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2016, 32(12): 172-178. DOI: 10.11975/j.issn.1002-6819.2016.12.025
    Citation: Duan Qingling, Wei Fangfang, Zhang Lei, Xiao Xiaoyan. Automatic acquisition and classification system for agricultural network information based on Web data[J]. Transactions of the Chinese Society of Agricultural Engineering (Transactions of the CSAE), 2016, 32(12): 172-178. DOI: 10.11975/j.issn.1002-6819.2016.12.025

    基于Web数据的农业网络信息自动采集与分类系统

    Automatic acquisition and classification system for agricultural network information based on Web data

    • 摘要: 为了快速、高效地获取农业Web信息,解决信息孤岛和信息不对称的问题,重点研究了农业Web数据自动采集与抽取、基于SVM(support vector machine)的文本分类、物联网异构数据采集等技术,并采用统一建模语言(unified modeling language,UML)描述了农业网络信息自动采集与分类系统。该系统实现了农业网站、物联网数据的自动抓取和共享,为用户提供农业资讯、农产品市场行情、供求信息在线查询,环境数据实时监测和个性化信息服务等功能。应用结果表明,该系统对样本集网站的信息抓取准确率为98.2%,资讯分类准确率为92.5%,具有数据采集实时性强、用户参与度好、通用性高等特点,该系统为农业信息整合和服务提供参考。

       

      Abstract: Abstract: The purpose of this study is to obtain agricultural web information efficiently, and to provide users with personalized service through the integration of agricultural resources scattered in different sites and the fusion of heterogeneous environmental data. The research in this paper has improved some key information technologies, which are agricultural web data acquisition and extraction technologies, text classification based on support vector machine (SVM) and heterogeneous data collection based on the Internet of things (IOT). We first add quality target seed site into the system, and get website URL (uniform resource locator) and category information. The web crawler program can save original pages. The de-noised web page can be obtained through HTML parser and regular expressions, which create custom Node Filter objects. Therefore, the system builds a document object model (DOM) tree before digging out data area. According to filtering rules, the target data area can be identified from a plurality of data regions with repeated patterns. Next, the structured data can be extracted after property segmentation. Secondly, we construct linear SVM classification model, and realize agricultural text classification automatically. The procedures of our model include 4 steps. First of all, we use segment tool ICTCLAS to carry out the word segment and part-of-speech (POS) tagging, followed by combining agricultural key dictionary and document frequency adjustment rule to choose feature words, and building a feature vector and calculating inverse document frequency (IDF) weight value for feature words; lastly we design adaptive classifier of SVM algorithm. Finally, the perception data of different format collected by the sensor are transmitted to the designated server as the source data through the wireless sensor network. Relational database in accordance with specified acquisition frequency can be achieved through data conversion and data filtering. The key step of data conversion can be implemented on the basis of mapping rules between source data and target data. The mapping rules include 3 kinds of rules. The first is the source data directly corresponding to the target data; the second is that we create a temporary table, which corresponds to target table if they have same field name; and the third is converting perception data of XML (extensible markup language) type to relational database. Besides, data filtering is required to process abnormal values of the measured value beyond the sensor range. In this paper, unified modeling language (UML) is used to describe the agricultural network information automatic acquisition and classification system. User requirement analysis is described by the system's use case diagram. Web data extraction process is described by the system activity diagram. These help the system's key function implement of automatic information acquisition from Internet. The IOT data sharing module is implemented based on the proposed data conversion and filtering rules. The system can supply the services of on-time agricultural news, agricultural product prices, supply and demand information browsing query, real-time agricultural environment monitoring and personalized information statistics. The preliminary application shows that the agricultural network information automatic acquisition and classification system improves the accuracy of information extraction and text classification. The information acquisition accuracy rate for sample web sets is 98.2%, and the accuracy rate of text classification with rules is 92.5%. Compared with sequential minimal optimization (SMO), Bayesian, C4.5 decision tree and radial basis function (RBF) based SVM algorithm, linear SVM is more suitable for agricultural news classification. The system has high real-time performance and good user participation for IOT applications, which will expect to be applied to agricultural information integration and intelligent processing.

       

    /

    返回文章
    返回