
PRIV P35c Algorithm


Page no: P35c

 

 

Definition: post stands for a post or a page

 

For our Stop Words plugin we have three procedures that need documentation.

  1. Restore old postname based on titles (and family name)
  2. Make postname unique procedure
  3. Remove stop words

 

Definition: Postname is the last part of the URL (see the full explanation in the Yoast/Matt Cutts permalink article)

 

 


 

 

 

Avoid duplicate URLs (pro and light version)

We must check that the generated URL does not exist yet. WordPress does this by default, but our plugin does not. It might call the core WP procedure for this check.

Decision: WordPress does not use the full URL, but only the postname, to identify a post. For performance reasons we prefer to append the post ID instead of numbering the postnames. The test whether the postname exists is done only once. If the postname already exists, we add the post ID.

Advantage of this procedure: we do not need to count the number of posts with the same postname.

 


 

 

 

 

Main Algorithms

Remove stop words

For Google Go-Live we start directly with step 3.

For steps 1 and 2 see the algo that removes the final S for plural, 3rd person and genitive.

 

This algorithm can be built with a finite state machine. See the linked page for more on PHP and state machines.

Figure: State Machine

This is the algorithm that removes stop words.

Input: A post or page with its postname

Procedure:

  • For all posts or pages in the DB do
    • newname := empty string
    • For each word of the postname (post), the word separator = “-”
      • if word is in the list of stop words then skip it and continue with the next word
      • else if newname is empty then newname := word
      • else newname := newname + “-” + word
    • postname := Make postname unique (newname)

Output: unique postname with stop words removed (for all posts/pages in the DB)
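
The inner loop of this procedure can be sketched in a few lines of Python. This is only an illustration, not the plugin's code: the stop word list is made up, the function name is hypothetical, and the fallback for a postname that consists only of stop words is an assumption that the procedure above does not cover. The call to "Make postname unique" would follow on the returned value.

```python
# Illustrative stop word list; the plugin's real list is not shown here.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def remove_stop_words(postname, stop_words=STOP_WORDS):
    """Return the postname with every stop word dropped.

    Words are separated by "-", as in the procedure above.
    """
    kept = [word for word in postname.split("-")
            if word and word not in stop_words]
    # Assumption: if every word is a stop word, keep the original
    # postname rather than return an empty one.
    return "-".join(kept) if kept else postname


print(remove_stop_words("the-rise-and-fall-of-the-franc"))  # rise-fall-franc
```

Collecting the kept words in a list and joining them once avoids a leading “-” in the result.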

Figure: Remove Stop Words

Algo: Restore old postname

Restore Postname based on title (and family name)

This is the algo for the button “Restore old URL based on titles”.

Input: A post or page with its title and its URL

 

Procedure:

  • For all posts or pages in the DB do
    • postname := familyname(author)
    • Take all words of the post title (no matter if stop words are in it)
    • For each word of the title (word separator = space)
      • postname := postname + “-” + word
    • postname := Make postname unique (postname)

Output: the procedure ensures that the postname is built from the post title
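
As a rough Python sketch of the restore step for a single post: the function name is hypothetical, the family name is passed in directly instead of being looked up from the author, and lowercasing the words is an assumption. Punctuation in titles is not handled here.

```python
def restore_postname(title, family_name):
    """Rebuild a postname from the post title, prefixed with the family name.

    Assumptions for this sketch: words are lowercased, and the family
    name is supplied by the caller.
    """
    words = [family_name] + title.split()  # title words are split on spaces
    return "-".join(word.lower() for word in words if word)


print(restore_postname("The Rise of the Franc", "Smith"))
# smith-the-rise-of-the-franc
```

Stop words stay in the result on purpose, since this procedure restores the URL as it was before stop word removal.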

 

Figure: Restore old URL based on Title

Algo: Make Postname unique

This is the algorithm that ensures unique postnames and URLs. It is used by both the “Restore old postname” procedure and the stop words algorithm.

 

Input: a post

Procedure:

  • Check if the postname already exists in the WordPress DB
  • If the postname already exists then postname := postname + “-” + Category Short Name
  • If the new postname also exists then postname := postname + “-” + PostID

Output: a unique postname
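
A minimal Python sketch of these steps follows. The `exists` argument is a hypothetical callable that stands in for the lookup in the WordPress DB, and the sketch assumes the post ID is appended to the postname that already carries the category.

```python
def make_postname_unique(postname, post_id, category_short_name, exists):
    """Return a postname that `exists` reports as unused.

    `exists` is a hypothetical callable (postname -> bool) standing in
    for the check against the WordPress DB.
    """
    if not exists(postname):
        return postname
    with_category = f"{postname}-{category_short_name}"
    if not exists(with_category):
        return with_category
    # As described in this document, no further check is made once the
    # post ID has been added.
    return f"{with_category}-{post_id}"


taken = {"franc-rises", "franc-rises-fx"}
print(make_postname_unique("franc-rises", 4711, "fx", taken.__contains__))
# franc-rises-fx-4711
```

At most two lookups are made per post, which matches the two checks discussed under Difficulties.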

Difficulties

There were some problems that we had to deal with.

1) The category

:???

All tags and categories are stored in one table, wp_terms.

 

Their relationships with the posts are stored in a second table (wp_term_relationships) and their type in a third (wp_term_taxonomy).

The relationships table, wp_term_relationships, is listed further down in this section.

The problem here was that we had to find the fastest way to look up the needed category.

Why do we have to check the category for these algorithms?
Our algo does not change the object_id, but only the postname.

 

Our plugin consumes a lot of time and resources, so we had to be careful with every new feature, because it could push us over our limits. We therefore had to use only one query, fast enough that it causes no problems when we run it more than 2800 times (the number depends on the number of posts).

The second issue was that one post can have multiple categories.

We had to use the main category (the one that is in the URL).

I thought that, thanks to the permalink mechanism, you do not have to care about the category inside the URL.
The only place where we add the category to the postname is when the postname is not unique.

 

This was a problem, because we again had to think about speed. It was not a good idea to run a new query only for that, so we implemented it in the first function.

Do these queries contain the category?

No, they don’t.

Table: wp_term_relationships

Field | Type | Null | Key | Default | Extra
object_id | bigint(20) unsigned | | PRI (part 1) | 0 |
term_taxonomy_id | bigint(20) unsigned | | PRI (part 2) & IND | 0 |
term_order | int(11) | | | |
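
To illustrate how a single query can fetch one category for a post across the three term tables, here is a sketch that uses Python's sqlite3 with an in-memory database in place of MySQL. It is not the plugin's query. The sample rows are invented, using the slug as the "Category Short Name" is an assumption, and picking the category with the lowest term_id as the main category is also an assumption made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wp_terms (term_id INTEGER PRIMARY KEY, name TEXT, slug TEXT);
CREATE TABLE wp_term_taxonomy (
    term_taxonomy_id INTEGER PRIMARY KEY, term_id INTEGER, taxonomy TEXT);
CREATE TABLE wp_term_relationships (
    object_id INTEGER, term_taxonomy_id INTEGER, term_order INTEGER DEFAULT 0,
    PRIMARY KEY (object_id, term_taxonomy_id));
-- Invented sample data: post 1001 has two categories and one tag.
INSERT INTO wp_terms VALUES (3, 'FX', 'fx'), (7, 'SNB', 'snb'), (9, 'Nestle', 'nestle');
INSERT INTO wp_term_taxonomy VALUES
    (30, 3, 'category'), (70, 7, 'category'), (90, 9, 'post_tag');
INSERT INTO wp_term_relationships VALUES (1001, 70, 0), (1001, 30, 0), (1001, 90, 0);
""")

# One query with two joins; tags are filtered out by the taxonomy column.
CATEGORY_QUERY = """
SELECT t.slug
FROM wp_term_relationships AS tr
JOIN wp_term_taxonomy AS tt ON tt.term_taxonomy_id = tr.term_taxonomy_id
JOIN wp_terms AS t ON t.term_id = tt.term_id
WHERE tr.object_id = ? AND tt.taxonomy = 'category'
ORDER BY t.term_id
LIMIT 1
"""


def category_slug(conn, post_id):
    """Return one category slug for the post, or None if it has none."""
    row = conn.execute(CATEGORY_QUERY, (post_id,)).fetchone()
    return row[0] if row else None


print(category_slug(conn, 1001))  # fx
```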

2) Two different queries for checking whether a postname is unique

 

Before, we made only one check: whether the postname was unique. If it was not, we added the ID to the postname.

 

Count queries are always slow and inefficient.

We must go through the whole posts table and check whether the postname already exists.

 

Now we had to make two checks: one before adding the category, to see whether the postname exists, and one right after adding the category.

If the postname exists, we need to go through the posts table twice, but there is an index on postname.

 

The speed issue came up again. We had to rewrite the count queries a little, so that the two of them are as fast as the query before.
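
One common alternative to a count query is an existence check that stops at the first match. The sketch below shows the idea with Python's sqlite3 and a reduced, invented wp_posts table. It is not the rewritten query the plugin actually uses, and the function and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_name TEXT);
CREATE INDEX idx_post_name ON wp_posts (post_name);
INSERT INTO wp_posts VALUES (1, 'franc-rises'), (2, 'franc-rises-fx');
""")


def postname_exists(conn, postname, own_id):
    """True if a post other than `own_id` already uses this postname."""
    row = conn.execute(
        # LIMIT 1 lets the database stop at the first matching row.
        "SELECT 1 FROM wp_posts WHERE post_name = ? AND ID <> ? LIMIT 1",
        (postname, own_id),
    ).fetchone()
    return row is not None


print(postname_exists(conn, "franc-rises", 3))  # True
print(postname_exists(conn, "franc-rises", 1))  # False
```

Excluding the post's own ID keeps a post from colliding with itself when its postname has not changed.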

 

Now, after some optimizing of the queries and the algo, it works the same way as before, but with a more complicated algo. For now it can be used on all our blogs. When we have more than 5000 or 10 000 posts, we will probably need to process the posts in chunks.
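
Processing in chunks could look like the following Python sketch, where the list of post IDs is split into batches and each batch is handled in its own run. The helper name and the batch size are hypothetical; nothing like this exists in the plugin yet.

```python
def chunks(post_ids, size):
    """Yield consecutive slices of at most `size` post IDs."""
    for start in range(0, len(post_ids), size):
        yield post_ids[start:start + size]


for batch in chunks(list(range(1, 11)), 4):
    print(batch)
# [1, 2, 3, 4]
# [5, 6, 7, 8]
# [9, 10]
```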

Acute Sign Stop Words

 

Acute Sign inside the
Open in WP Backend and Frontend

https://snbchf.com/2017/06/activist-aims-staid-nestleacute/

Wrong problem

The problem is caused by the é of Nestlé, which WordPress translates into “acute”. It is a strange bug, but it occurs before our stop words algo and plugin run.

Explanation what happens:

1) Input Title and URL

2) Then WordPress does;

3) then Stop Words, Output in URL, Title

Sorry, I received wrong information about this post from the team. There is no problem with è, à, î or other special characters. The problem here was that Vasil had edited the title, but forgot to edit the URL.

 

Keywords in Substrings

Figure: P03c Keywords in Substrings

 

 
