Automatic acquisition and classification system for agricultural network information based on Web data
Abstract
This study aims to acquire agricultural web information efficiently and to provide users with personalized services by integrating agricultural resources scattered across different sites and fusing heterogeneous environmental data. The paper improves three key information technologies: agricultural web data acquisition and extraction, text classification based on the support vector machine (SVM), and heterogeneous data collection based on the Internet of Things (IOT). First, high-quality target seed sites are added to the system, together with each site's URL (uniform resource locator) and category information. The web crawler saves the original pages. De-noised pages are obtained with an HTML parser and regular expressions, which are used to create custom Node Filter objects. The system then builds a document object model (DOM) tree and mines the data areas. According to filtering rules, the target data area is identified from among multiple data regions with repeated patterns, and structured data are extracted after property segmentation. Second, we construct a linear SVM classification model to classify agricultural text automatically. The procedure has 4 steps: we use the segmentation tool ICTCLAS for word segmentation and part-of-speech (POS) tagging; we combine an agricultural keyword dictionary with a document frequency adjustment rule to select feature words; we build a feature vector and calculate the inverse document frequency (IDF) weight of each feature word; and lastly we design an adaptive classifier based on the SVM algorithm. Finally, perception data in different formats collected by the sensors are transmitted through the wireless sensor network to a designated server as source data. Data conversion and data filtering then load these data into a relational database at the specified acquisition frequency. 
The key step, data conversion, is implemented through mapping rules between source data and target data. There are 3 kinds of mapping rules: the first maps source data directly to the target data; the second creates a temporary table, which corresponds to the target table where the field names are the same; and the third converts perception data in XML (extensible markup language) format into the relational database. In addition, data filtering is required to handle abnormal measured values that fall outside the sensor range. In this paper, the unified modeling language (UML) is used to describe the agricultural network information automatic acquisition and classification system: the use case diagram describes the user requirement analysis, and the activity diagram describes the web data extraction process. These diagrams support the implementation of the system's key function, automatic information acquisition from the Internet. The IOT data sharing module is implemented on the basis of the proposed data conversion and filtering rules. The system provides browsing and querying of timely agricultural news, agricultural product prices, and supply and demand information, as well as real-time agricultural environment monitoring and personalized information statistics. Preliminary application shows that the system improves the accuracy of information extraction and text classification. The information acquisition accuracy for the sample web sets is 98.2%, and the accuracy of text classification with rules is 92.5%. Compared with sequential minimal optimization (SMO), Bayesian, C4.5 decision tree, and radial basis function (RBF) based SVM algorithms, linear SVM is more suitable for agricultural news classification. 
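As an illustration of the feature-weighting step in the classification procedure, the following minimal sketch computes an IDF weight for each selected feature word and builds a weighted feature vector. It is not the system's implementation: the smoothed IDF formula, the function names `idf_weights` and `feature_vector`, and the English sample tokens (stand-ins for ICTCLAS output) are all assumptions, since the abstract does not give the exact weighting formula.

```python
import math
from collections import Counter


def idf_weights(docs, vocabulary):
    """Return an IDF weight for every feature word in `vocabulary`.

    `docs` is a list of token lists (already segmented text).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each feature word.
    df = Counter()
    for tokens in docs:
        for word in set(tokens) & vocabulary:
            df[word] += 1
    # Smoothed IDF (an assumption; the source does not state the formula).
    return {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocabulary}


def feature_vector(tokens, weights, order):
    """Build a term-frequency x IDF vector following the word order in `order`."""
    counts = Counter(tokens)
    return [counts[w] * weights[w] for w in order]


# Hypothetical, already segmented sample documents.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "pest", "control"],
    ["pork", "price", "fall"],
]
vocabulary = {"wheat", "price", "pest"}  # hypothetical feature words
weights = idf_weights(docs, vocabulary)
vector = feature_vector(["wheat", "wheat", "pest"], weights, ["wheat", "price", "pest"])
```

Words that occur in fewer documents receive larger weights, so here `pest` is weighted above `wheat` and `price`. Vectors of this kind would then be passed to a linear SVM trainer.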
The system offers high real-time performance and good user participation for IOT applications, and is expected to be applied to agricultural information integration and intelligent processing.
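The third mapping rule (XML perception data to a relational table) and the range-based data filtering can be sketched together as follows. This is a hedged example, not the system's code: the XML layout, the `reading` table, the sensor names, and the ranges in `SENSOR_RANGE` are hypothetical, and an in-memory SQLite database stands in for the unspecified relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical valid measurement ranges per sensor type: (min, max).
SENSOR_RANGE = {"air_temp": (-40.0, 80.0), "humidity": (0.0, 100.0)}

# Hypothetical XML perception data; the third value is outside the sensor range.
SAMPLE_XML = """
<readings>
  <reading sensor="air_temp" time="2014-05-01T08:00:00" value="21.5"/>
  <reading sensor="humidity" time="2014-05-01T08:00:00" value="63.0"/>
  <reading sensor="air_temp" time="2014-05-01T08:10:00" value="999.0"/>
</readings>
"""


def xml_to_rows(xml_text):
    """Map each <reading> element to a (sensor, time, value) tuple."""
    root = ET.fromstring(xml_text.strip())
    for node in root.iter("reading"):
        yield node.get("sensor"), node.get("time"), float(node.get("value"))


def in_range(sensor, value):
    """Return True when the measured value lies within the sensor's range."""
    low, high = SENSOR_RANGE[sensor]
    return low <= value <= high


def load(conn, xml_text):
    """Convert XML readings, drop out-of-range values, and store the rest."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reading (sensor TEXT, time TEXT, value REAL)"
    )
    rows = [r for r in xml_to_rows(xml_text) if in_range(r[0], r[2])]
    conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
stored = load(conn, SAMPLE_XML)
```

In this sketch the out-of-range temperature reading is discarded before insertion, so only the two valid readings reach the table. A production system might instead flag or log such values rather than drop them silently.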