Abstract: With the increasing demand for patent document analysis based on patent literature big data, more and more enterprises, patent agencies, government agencies, etc. have begun to analyze patent documents. This paper aims to discuss steps of carrying out patent document analysis based on the patent document big data with specific examples, thus providing suggestions and references for how to perform efficient and high-quality analysis of patent document big data.
With the rapid development of information technology, the demand for patent document analysis based on patent literature big data is growing. Patent document analysis is a multi-data analysis method that combines various kinds of patent document information and uses statistical methods and graphical means to produce analysis results, thereby supporting enterprises' decisions on the development of technology, products and services. More and more companies hope to obtain advice on their R&D direction, technology choices, and market entry through big data analysis of patent documents. However, such analysis requires experienced patent document analysts and involves a large amount of difficult work. Therefore, enterprises, patent document service organizations, and government agencies all need to know how to perform efficient and high-quality analysis of patent document big data.
II. Steps and Difficulty of Big Data Analysis in Patent Documents
Big data analysis of patent documents is normally implemented through the steps shown in Figure 1:
As shown in Figure 1, the big data analysis of patent documents goes through the following steps:
(1) Identifying requirements, and classifying technical areas into sub-areas
(2) Identifying keywords in each sub-area, determining the search formula, and filtering out noise
(3) Obtaining the search results and checking the recall and precision
(4) Determining indexing items
(5) Manual reading for indexing
(6) Manual reading analysis of key patent documents
(7) Drawing analysis charts such as patent document maps based on indexing items
(8) Obtaining conclusions and recommendations for patent literature analysis based on the analysis charts
(9) Writing a patent document analysis report
Among these steps, one difficult point is how to choose keywords and search terms so that the search results are both complete and accurate, and when to stop adjusting them. Determining the indexing items also requires a good grasp of the technology itself. Manual reading requires a great deal of manpower and is time-consuming and labor-intensive. The following concrete case illustrates our practical solutions for the big data analysis of patent documents.
III. Practice of Patent Document Big Data Analysis
1. Identifying requirements, and classifying technical areas to obtain sub-areas
Suppose that an enterprise wants to understand the patented technology of driverless cars in order to determine its direction of research and development. Because driverless cars are a large technical field, we must first divide it into sub-areas. If the enterprise's needs already concern a small area, this step can be omitted.
How can we classify this technical field accurately? We must first investigate the technical field comprehensively, including reading news, industry reports, some patent documents, and so on. There are many ways to classify. For example, in a broad sense, automotive technology can be divided into five categories: 1. Power, including engine and gearbox design, transmission systems, and new energy drive systems. 2. Navigation, including GPS satellite navigation and dedicated short-range communication systems. 3. Control, including braking, steering, suspension and other systems. 4. Safety/security, including overall safety/security systems, seats, seat belts, airbags, door locks, etc. 5. Entertainment, including in-vehicle communication, smartphone integration, head-up display (HUD), etc. If the field is classified by hardware and software, the hardware can be divided into camera, radar, laser radar, sensor, and GPS (Global Positioning System), and the software can be divided into, for example, environment perception technology, planning and decision-making technology, network communication technology, operation control technology, etc. If it is divided by function, it can be divided into perception and positioning, planning and decision-making, execution control, and so on. From the perspective of the system, it can be divided into vehicle positioning technology, vehicle control technology, vehicle stability system, automatic parking system, radar system, lane keeping system, collision prevention system, infrared camera equipment, stereo vision system, electromagnetic control system, etc. Therefore, the field of driverless cars can be classified from various perspectives, and we need to communicate with the enterprise to understand its needs, so as to determine the classification perspective appropriate to those needs.
2. Identifying keywords in each sub-area, determining the search formula, and filtering out noise
Assume that the enterprise chooses to classify from the perspective of vehicle manufacturing, so that the field is divided into vehicle positioning technology, vehicle control technology, vehicle stability system, automatic parking system, radar system, lane keeping system, collision prevention system, infrared camera equipment, stereo vision system, electromagnetic control systems, etc. Keywords are then identified for each sub-area. For example, for vehicle positioning technology, keywords such as positioning, GPS, satellite, and navigation may be used. Such keywords can be extended by searching and reading related materials and patent documents, using a database website, and the like. These keywords are then combined to determine the search formula, and the keywords and the search formula may be further adjusted based on the search results.
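As a rough illustration of how keywords might be combined into a search formula, the sketch below joins synonyms with OR, joins keyword groups with AND, and excludes noise terms with NOT. The function name `build_search_formula`, the extra keyword "autonomous vehicle", and the excluded term "power grid" are assumptions made for this example; the Boolean syntax actually accepted differs from one patent database to another.

```python
def build_search_formula(keyword_groups, exclude_terms=()):
    """Combine keyword groups into one Boolean search string.

    Terms inside a group are treated as synonyms and joined with OR,
    the groups are joined with AND, and noise terms are excluded with NOT.
    The operator syntax is generic, not that of any particular database.
    """
    def quote(term):
        # Quote multi-word phrases so they are searched as a unit.
        return f'"{term}"' if " " in term else term

    clauses = [
        "(" + " OR ".join(quote(t) for t in group) + ")"
        for group in keyword_groups
    ]
    formula = " AND ".join(clauses)
    if exclude_terms:
        formula += " NOT (" + " OR ".join(quote(t) for t in exclude_terms) + ")"
    return formula


# "autonomous vehicle" and "power grid" are illustrative assumptions.
vehicle_terms = ["driverless", "unmanned", "autonomous vehicle"]
positioning_terms = ["positioning", "GPS", "satellite", "navigation"]

formula = build_search_formula(
    [vehicle_terms, positioning_terms], exclude_terms=["power grid"]
)
print(formula)
```

Keeping each sub-area's keywords in its own list makes it easy to add or remove a term and regenerate the formula after each round of adjustment.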
Determining the search terms also needs to be done by personnel experienced in searching. Adjusting the search formula is an ongoing process. Preliminary results are obtained with a preliminary search, and several of them are read first. If the documents read contain too much noise (i.e., unrelated patent documents), the keywords and search terms should be modified so that the results are as relevant as possible. During this adjustment, adjusting the classification numbers is a very important step. In practice, we also use some of the patent document map functions provided by patent literature search websites to reduce noise. To generate a patent document map, citation analysis and cluster analysis are usually conducted, keywords are extracted, and related content is merged, forming a map in which the number of patent documents is shown as the height of a mountain peak. If the keyword of a higher peak is irrelevant to the results we are searching for, the corresponding classification number or keyword can be removed to eliminate noise. Some patent document search websites also rank the top ten applicants, and we may further remove noise by examining any implausible applicants in that ranking. For example, in the field of driverless cars, the appearance of State Grid among the top ten applicants indicates that some classification number or keyword is causing too much noise, so we can check the classification numbers and keywords of the patent documents filed by State Grid to further eliminate noise.
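The applicant-ranking check described above can be sketched as two simple counts: first rank applicants by number of documents, then list the classification codes used by an applicant that looks out of place. The record layout, the document identifiers, the applicant names other than State Grid, and the classification codes are all invented for illustration.

```python
from collections import Counter

# Hypothetical search results; every value here is made up for the example.
records = [
    {"id": "DOC-001", "applicant": "Carmaker A", "codes": ["B60W", "G05D"]},
    {"id": "DOC-002", "applicant": "Carmaker A", "codes": ["B60W"]},
    {"id": "DOC-003", "applicant": "State Grid", "codes": ["H02J", "G05D"]},
    {"id": "DOC-004", "applicant": "State Grid", "codes": ["H02J"]},
    {"id": "DOC-005", "applicant": "State Grid", "codes": ["H02J", "B64C"]},
    {"id": "DOC-006", "applicant": "Supplier B", "codes": ["G01S"]},
]


def top_applicants(records, n=10):
    """Rank applicants by the number of documents in the search results."""
    return Counter(r["applicant"] for r in records).most_common(n)


def codes_for_applicant(records, applicant):
    """Count the classification codes on one applicant's documents."""
    return Counter(
        code
        for r in records
        if r["applicant"] == applicant
        for code in r["codes"]
    ).most_common()


ranking = top_applicants(records)
# An analyst spots an implausible applicant in the ranking and inspects
# which codes its documents carry, as candidates for removal from the formula.
suspect_codes = codes_for_applicant(records, "State Grid")
print(ranking)
print(suspect_codes)
```

Deciding which applicant is implausible still relies on the analyst's knowledge of the field; the counts only point to where to look.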
3. Obtaining the search results and checking the recall and precision
It is well known that search results can always be improved further, so it is difficult to judge when to end the search. Many people rely on experience to decide, but after studying the problem we believe that some quantitative indicators are needed to determine more objectively when to end the search. Here, we use checks of the recall rate and the precision rate, which we have modified for efficiency. The two modified checks are described in detail below.
The recall rate and the precision rate are two contradictory indicators. How to achieve balance between the two and achieve efficient search is a problem to be considered.
The recall rate is an index that measures how successfully a search system retrieves relevant documents from a collection of documents, that is, the ratio of the relevant documents retrieved to the total number of relevant documents in the retrieval system. It is generally expressed as: recall rate = (amount of relevant information retrieved / total amount of relevant information in the system) × 100%. Using a more general search language (such as broad classes or broad keywords) can improve the recall rate, but the precision rate decreases.
However, a complete recall test on the above search results would require manually reading all the retrieved patent documents, marking those that are hits (i.e., relevant to the purpose of the search), knowing the number of all relevant documents in the whole database, and then calculating the ratio. Such reading is already the work of the fifth step, manual reading for indexing, and its volume is very large; to save labor and time, it should not be repeated merely to decide when to end the search. Moreover, the number of all relevant documents in the whole database is unknown.
Therefore, we have simplified the process in practice: we build a "full library" (a reference set of known hits) by a method completely different from the keywords and search formula above. For example, we search with broad keywords (such as unmanned or automatic) restricted to a specific applicant that specializes in driverless cars (for example, Tesla), manually read a portion of the results, and keep the patent documents that are definitely hits for the search target. The number of patent documents in the full library is preferably 1/10 or 1/100 of the total number of patent documents in the above search results, depending on that total. We then check what proportion of the patent documents in the full library also appear in the above search results. If at least a certain threshold, such as 90%, of the hit patent documents in the full library can be found in the search results, the search results are considered complete, because they cover 90% of a known library of hits, as shown in Figure 2. This scheme reduces the labor and time spent on reading patent documents and constructing the full library, while still providing an initial check of the recall rate of the search results.
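The simplified recall check amounts to a set intersection between the full library and the search results. The following is a minimal sketch under that reading; the function name `estimated_recall` and the document identifiers are hypothetical, and the 90% threshold is the example value given in the text.

```python
def estimated_recall(search_result_ids, full_library_ids):
    """Share of the independently built full library found in the search results."""
    library = set(full_library_ids)
    if not library:
        raise ValueError("the full library must contain at least one document")
    found = library & set(search_result_ids)
    return len(found) / len(library)


# Hypothetical identifiers: 10 known hits, one of which the search missed.
full_library = {f"DOC-{i:03d}" for i in range(1, 11)}
search_results = {f"DOC-{i:03d}" for i in range(2, 200)}

recall = estimated_recall(search_results, full_library)
search_is_complete = recall >= 0.9  # threshold taken from the example in the text
print(recall, search_is_complete)
```

The documents in the full library that are missing from the search results are worth reading in their own right, since they suggest which keywords or classification numbers the search formula lacks.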
For the precision rate, we have also simplified the process accordingly. The precision (accuracy) rate is an indicator of the signal-to-noise ratio of a search system, i.e., the ratio of the relevant documents retrieved to the total number of documents retrieved. It is generally expressed as: precision rate = (amount of relevant information retrieved / total amount of information retrieved) × 100%. Using a more specific search language (such as narrow classes or narrow keywords) can improve the precision rate, but the recall rate declines.
If a complete precision test were performed on the above search results, it would be necessary to manually read all the retrieved patent documents and mark those that are hits (that is, relevant to the purpose of the search), and the workload of manual reading would be very large. Therefore, we simplified the judgment of the precision rate by drawing a random sample and reading only the sample manually to determine the hit patent documents. The sample size can be 1/10 or 1/100 of the total number of search results, depending on that total. If the hits account for at least a certain threshold, such as 90%, of the sampled patent documents, the search results may be considered relatively accurate, that is, to contain little noise, as shown in Figure 3.
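The sampling-based precision check can be sketched as follows. The manual reading is represented by a callback `is_hit`, which in practice would be the analyst's judgement; here it is replaced by a made-up rule so that the example runs. The function names, the fixed random seed, and the data are assumptions for illustration.

```python
import random


def draw_sample(search_result_ids, rate=0.1, seed=0):
    """Draw a random sample (without replacement) of the search results."""
    ids = sorted(search_result_ids)
    size = max(1, round(len(ids) * rate))
    # A fixed seed makes the sample reproducible for later review.
    return random.Random(seed).sample(ids, size)


def estimated_precision(sample_ids, is_hit):
    """Share of sampled documents judged relevant by manual reading."""
    if not sample_ids:
        raise ValueError("the sample must contain at least one document")
    hits = sum(1 for doc_id in sample_ids if is_hit(doc_id))
    return hits / len(sample_ids)


# Hypothetical data: 1000 results, of which every 20th is pretended to be noise.
search_results = [f"DOC-{i:04d}" for i in range(1000)]
noise = {f"DOC-{i:04d}" for i in range(0, 1000, 20)}

sample = draw_sample(search_results, rate=0.1)
precision = estimated_precision(sample, lambda doc_id: doc_id not in noise)
is_accurate_enough = precision >= 0.9  # threshold taken from the example in the text
print(len(sample), precision, is_accurate_enough)
```

Because the estimate comes from a sample, it carries sampling error; a larger sampling rate narrows that error at the cost of more manual reading.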
After the search results pass these quantitative recall and precision checks, the search can be considered finished.
4. Determining indexing items
After the search results are obtained, the manual reading step begins. We need to determine the indexing items based on what information we want to obtain from manual reading. This step also requires communication with the company to arrive at the required indexing items. For example, in the case of unmanned vehicles, broad indexing items can be defined for the collision prevention system: object capturing, detection processing, vehicle stability control, alarm, and the like. Narrow indexing items can be created under the broad indexing item object capturing: radar, camera, laser, etc., and under the broad indexing item alarm: sound, display, tactile sensation, and so on. The above indexing items are defined by technical means; indexing items can also be defined by technical effect, such as pedestrian detection, vehicle detection, fast braking, alarming and reminder, emergency monitoring, steering control and so on. This lays the foundation for producing multi-dimensional charts later. Of course, these are just examples. In practice, indexing items can also be divided into more levels and cover other contents.
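One possible way to record such a two-level scheme is a nested mapping from dimension to broad indexing item to narrow indexing items, together with a check that the labels assigned during manual reading belong to the agreed scheme. The item names below come from the example in the text; the data layout, the function `is_valid_label`, and the sample document are assumptions.

```python
# Dimension -> broad indexing item -> narrow indexing items (may be empty).
INDEXING_ITEMS = {
    "technical means": {
        "object capturing": ["radar", "camera", "laser"],
        "detection processing": [],
        "vehicle stability control": [],
        "alarm": ["sound", "display", "tactile sensation"],
    },
    "technical effect": {
        "pedestrian detection": [],
        "vehicle detection": [],
        "fast braking": [],
        "alarming and reminder": [],
        "emergency monitoring": [],
        "steering control": [],
    },
}


def is_valid_label(dimension, broad, narrow=None):
    """Check that a label assigned by a reader exists in the agreed scheme."""
    broad_items = INDEXING_ITEMS.get(dimension, {})
    if broad not in broad_items:
        return False
    return narrow is None or narrow in broad_items[broad]


# A hypothetical indexed document with one label per dimension.
indexed_doc = {
    "id": "DOC-001",
    "labels": [
        ("technical means", "object capturing", "laser"),
        ("technical effect", "pedestrian detection", None),
    ],
}
all_valid = all(is_valid_label(*label) for label in indexed_doc["labels"])
print(all_valid)
```

Validating labels against a fixed scheme keeps several readers consistent with one another, which matters when the indexed results are later aggregated into charts.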
5. Manual reading for indexing
Then we can do manual reading and indexing work. This work is also the one that consumes the most manpower and time. However, because we have checked the recall and precision rates, the noise in the search results is controlled, and because the indexing items are also determined, the manual reading and indexing can be performed efficiently.
6. Manual reading analysis of key patent documents
Some companies hope to understand the details of certain key technologies in addition to general technology trends. This requires selecting key patent documents, reading and analyzing them manually, and extracting the technical issues, technical details, technical effects, etc. from them. There are many ways to determine key patent documents. For example, the number of times a patent document is cited can be used to evaluate its influence and importance.
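Selecting key patent documents by citation count can be sketched as a filter followed by a sort. The field name `cited_by`, the cut-off parameters, and the records are hypothetical; citation count is only one of the possible criteria mentioned above.

```python
def select_key_patents(records, top_n=3, min_citations=1):
    """Return the most frequently cited documents, most cited first."""
    cited = [r for r in records if r["cited_by"] >= min_citations]
    return sorted(cited, key=lambda r: r["cited_by"], reverse=True)[:top_n]


# Hypothetical records with the number of citing documents for each.
records = [
    {"id": "DOC-001", "cited_by": 4},
    {"id": "DOC-002", "cited_by": 57},
    {"id": "DOC-003", "cited_by": 0},
    {"id": "DOC-004", "cited_by": 23},
    {"id": "DOC-005", "cited_by": 12},
]

key_ids = [r["id"] for r in select_key_patents(records)]
print(key_ids)
```

Older documents have had more time to accumulate citations, so a raw count tends to favor them; an analyst may want to take the filing year into account when reading the ranking.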
7. Drawing analysis charts such as patent document maps based on indexing items
Using patent charting tools, general charts such as application volume by year, number of applicants by year, and classification number distribution can be obtained, and patent technology trend analysis can be obtained from the manual reading and indexing described above. The following shows how to make such charts.
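The general trend charts reduce to simple counts per year. As a minimal sketch, the function below computes, for each year, the number of applications and the number of distinct applicants; the record layout and the data are invented for illustration, and the resulting numbers could be passed to any charting tool.

```python
from collections import defaultdict

# Hypothetical records; years and applicants are made up for the example.
records = [
    {"id": "DOC-001", "year": 2015, "applicant": "Carmaker A"},
    {"id": "DOC-002", "year": 2015, "applicant": "Carmaker A"},
    {"id": "DOC-003", "year": 2016, "applicant": "Carmaker A"},
    {"id": "DOC-004", "year": 2016, "applicant": "Supplier B"},
    {"id": "DOC-005", "year": 2016, "applicant": "Supplier C"},
]


def yearly_trends(records):
    """Map each year to (number of applications, number of distinct applicants)."""
    applications = defaultdict(int)
    applicants = defaultdict(set)
    for r in records:
        applications[r["year"]] += 1
        applicants[r["year"]].add(r["applicant"])
    return {
        year: (applications[year], len(applicants[year]))
        for year in sorted(applications)
    }


trends = yearly_trends(records)
print(trends)
```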
For example, based on the previously determined indexing items for technical means and technical effects, we can obtain a bubble diagram of technical means against technical effects: the abscissa is the technical means, the ordinate is the technical effect, and each bubble corresponds to a technical means used to achieve a technical effect. The size of the bubble at a given abscissa and ordinate represents the number of patent documents. An example of such a chart is shown in Figure 4:
Of course, the above is only an example, and more charts can be obtained by using the indexing items and the parameters of the patent documents. For example, a timeline map of technology development can be constructed from key patents, and a map of the number of joint applications can be constructed from the inventors.
Figure 5 below shows most of the chart styles used (the figure is taken from the Internet).
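The data behind the bubble diagram of technical means against technical effects described in this section is a count of documents per (means, effect) pair. The sketch below tallies those counts and converts them into plot points; it assumes, for simplicity, that each indexed document carries exactly one means and one effect, and the sample pairs are invented.

```python
from collections import Counter

# Hypothetical indexing results: one (technical means, technical effect) per document.
indexed_pairs = [
    ("radar", "vehicle detection"),
    ("radar", "vehicle detection"),
    ("radar", "pedestrian detection"),
    ("camera", "pedestrian detection"),
    ("camera", "pedestrian detection"),
    ("camera", "vehicle detection"),
    ("laser", "vehicle detection"),
]


def bubble_sizes(pairs):
    """Count documents for each (technical means, technical effect) pair."""
    return Counter(pairs)


def as_plot_points(counts):
    """Convert the counts into (x, y, size) triples with integer axis positions."""
    means = sorted({m for m, _ in counts})      # abscissa categories
    effects = sorted({e for _, e in counts})    # ordinate categories
    return [
        (means.index(m), effects.index(e), n)
        for (m, e), n in sorted(counts.items())
    ]


counts = bubble_sizes(indexed_pairs)
points = as_plot_points(counts)
print(counts[("camera", "pedestrian detection")])
print(points)
```

The `(x, y, size)` triples can be handed to a scatter plot in whichever charting tool is used, with the size value scaled to the bubble area.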
8. Obtaining conclusions and recommendations for patent literature analysis based on the analysis charts
For example, from the bubble diagram of technical means against technical effects above, it may be seen that few patents use the technical means of laser to achieve the technical effect of pedestrian detection, so strengthening research and development in this respect may be worth considering. Of course, this is only an example, and many other aspects can be analyzed. Which conjectures and suggestions to draw from the charts depends on the experience and wisdom of the analysts: the charts show only objective phenomena, and identifying the essence behind those phenomena is both difficult and important.
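Finding sparsely populated combinations of technical means and technical effects can be partly automated, although interpreting them remains the analyst's task. The sketch below lists every combination whose document count is at or below a chosen limit, including combinations with no documents at all. The function name, the limit, and the counts are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def sparse_combinations(pairs, means, effects, max_count=1):
    """List (means, effect, count) for combinations with few or no documents."""
    counts = Counter(pairs)
    return [
        (m, e, counts[(m, e)])
        for m, e in product(means, effects)
        if counts[(m, e)] <= max_count
    ]


means = ["radar", "camera", "laser"]
effects = ["pedestrian detection", "vehicle detection"]

# Hypothetical indexing results, built so that laser/pedestrian detection is rare.
indexed_pairs = (
    [("radar", "pedestrian detection")] * 2
    + [("radar", "vehicle detection")] * 3
    + [("camera", "pedestrian detection")] * 4
    + [("camera", "vehicle detection")] * 2
    + [("laser", "pedestrian detection")] * 1
    + [("laser", "vehicle detection")] * 2
)

candidates = sparse_combinations(indexed_pairs, means, effects)
print(candidates)
```

A low count may indicate an unexplored opportunity, but it may equally indicate a combination that is technically impractical, so each candidate still needs to be judged by someone who knows the technology.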
9. Writing a patent document analysis report
The report comprehensively analyzes, through a variety of charts, the applicant distribution, geographical distribution, year distribution, classification number distribution, research and development trends, technology-effect trends, etc. in the field of autonomous vehicles.
With the development of science and technology and the spread of intellectual property awareness, more and more enterprises are aware of the impact of patent analysis on their research and development, production and operation, and they are paying more and more attention to patent analysis projects. Undertaking a large-scale patent analysis project requires not only knowledge of science and technology and experience in patent searching, but also practical experience in charting and in patent analysis projects. Differences in the technical background and experience of patent analysts may lead to completely different analysis results. Therefore, if an enterprise would like to commission a patent analysis to support its decision-making, it should carefully examine the technical background and experience of the patent analyst team before entrusting the work, in order to obtain a truly useful analysis result.
Longhui Zhang. Patent document analysis in big data period[J]. China Academic Journal Electronic Publishing House, 2014: 148-149.
Hancheng Deng, Ying Wang, Minfang Wang. Relationship between Recall and Precision Ratio in Terms of Retrieval Example[J]. Journal of the China Society for Scientific and Technical Information, 2000, Vol.19, No.3: 237-241.