Blog quality calculation (I)

Updated 1 day ago

A new departure

Last week I officially joined CSDN's NLP team. I am grateful to the organization for the opportunity to do work I enjoy alongside like-minded colleagues. Twenty-one years after graduating with a degree in mathematics and statistics, I am now moving into AI, which feels like life coming back around in a spiral.

Blog Title Quality Assessment

My first development task was content quality scoring. "Quality" here is not a strict, subjective judgment of the content; it is simply a basis for ranking, and can be considered part of the recommendation system. The aim is to recommend better content to users, in a statistical sense, by relying on statistical algorithms rather than manual intervention.
The result cannot be guaranteed to be the best for every reader; we can only get as close to that goal as possible. This is a permanent regret of recommendation systems.
Evaluating blog titles is one of the sub-tasks. Titles have their own characteristics: a title does not need to be a complete sentence, but it should state the key point of the article as clearly as possible. For title recommendation there are two main goals. First, the title should match the column theme or search keywords; here we need to watch for attempts to raise the hit rate by stuffing keywords into the title. Second, the title should match the content of the article itself, and we need to counter clickbait (the "title party").
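The two goals can each be approximated with a simple ratio. The following is only a minimal sketch, not the scoring actually used at CSDN: the function names, the regex tokenizer, and the idea of using plain token overlap are all assumptions made for illustration. Chinese titles would need a proper word segmenter in place of the regex.

```python
import re

# Hypothetical tokenizer: lowercase ASCII words, keeping "+" and "#" so that
# terms such as "c++" and "c#" survive. Chinese text would need a segmenter.
TOKEN_RE = re.compile(r"[a-z0-9+#]+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def keyword_stuffing_ratio(title, keywords):
    """Fraction of title tokens that are search keywords.

    A value close to 1.0 suggests the title is mostly stacked keywords.
    """
    tokens = tokenize(title)
    if not tokens:
        return 0.0
    keyword_set = {k.lower() for k in keywords}
    hits = sum(1 for t in tokens if t in keyword_set)
    return hits / len(tokens)


def title_body_overlap(title, body):
    """Fraction of distinct title tokens that also appear in the body.

    A value close to 0.0 suggests the title does not describe the content.
    """
    title_tokens = set(tokenize(title))
    if not title_tokens:
        return 0.0
    body_tokens = set(tokenize(body))
    return len(title_tokens & body_tokens) / len(title_tokens)
```

A high stuffing ratio and a low overlap are both only signals; any thresholds would have to be tuned on real titles.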
I manually collected the titles of more than 10,000 blog posts and read through them by hand. My impression is that the quality of CSDN blogs is still quite high, and that most titles are sincere.
What clickbait authors like most is to lure readers into clicking with "eye-catching" wording and punctuation, so my first idea was to find the most common subset of titles, which might turn out to be a ready-made clickbait collection. After reading more than 1,000 titles, however, I concluded that this approach would not work: technical blogs are a specialized vertical, and good titles also resemble one another because they discuss similar content. The largest subset obtained by aggregating title vocabulary is instead the set of technical terms.
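To illustrate why aggregating vocabulary surfaces technical terms rather than clickbait, here is a small sketch that counts in how many titles each token appears. The function name and the regex tokenizer are assumptions made for this example, and the sample titles in the usage note are invented.

```python
from collections import Counter
import re


def top_title_terms(titles, n=10):
    """Return the n tokens that occur in the most titles.

    Each token is counted once per title (document frequency), so a single
    title that repeats a word many times does not dominate the ranking.
    """
    counts = Counter()
    for title in titles:
        counts.update(set(re.findall(r"[a-z0-9+#]+", title.lower())))
    return counts.most_common(n)
```

For invented titles such as "Python list", "Python dict", "Python set" and "Java list", the most frequent tokens are "python" and "list", which are technical terms rather than attention-grabbing words.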
As a first pass, the title quality score for technical blogs should reflect the following goals as closely as possible:

- The title has a subject-predicate or verb-object structure that conforms to natural-language grammar
- Some of its vocabulary is in the term dictionary, with emphasis on the subject and object
- If a title consists only of words from the term dictionary, down-weight it somewhat
- Sentiment analysis of the title should come out close to neutral; down-weight the title if the sentiment is intense
- Down-weight titles containing negative scores
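Some of these rules can be combined into a single heuristic score. The sketch below is only an illustration under stated assumptions: the term dictionary, the list of intense words, and all weights are hypothetical, and a crude word-and-exclamation-mark count stands in for a real sentiment model. It covers only the term-dictionary and sentiment rules; the grammatical-structure check would need a parser and is left out.

```python
import re

# Hypothetical term dictionary; a real one would be far larger.
TERM_DICT = {"python", "java", "redis", "docker", "mysql"}

# Hypothetical stand-in for a sentiment model: words treated as "intense".
INTENSE_WORDS = {"shocking", "unbelievable", "insane", "amazing"}


def title_score(title):
    """Heuristic title score: higher is better, never below 0.0."""
    tokens = re.findall(r"[a-z0-9+#]+", title.lower())
    if not tokens:
        return 0.0

    score = 1.0
    term_hits = [t for t in tokens if t in TERM_DICT]

    # Reward titles that mention at least one dictionary term.
    if term_hits:
        score += 0.5

    # Down-weight titles made up of nothing but dictionary terms.
    if len(term_hits) == len(tokens):
        score -= 0.3

    # Down-weight intense wording and exclamation marks (capped at 3 signals).
    intensity = sum(t in INTENSE_WORDS for t in tokens) + title.count("!")
    score -= 0.2 * min(intensity, 3)

    return max(score, 0.0)
```

With these made-up weights, "How to configure Redis persistence" scores higher than the bare keyword list "Redis Docker MySQL", which in turn scores higher than "Shocking!!! Redis trick".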

To evaluate the quality of an article more fully, the content needs to be analyzed in addition to the title. The next article discusses the content quality of blog posts: /ccat/article/details/123911429