We must implement a google like search over a existing database. The database contains several tables for persons, corporations, organizations, addresses, etc.
It must be a custom solution (no thirdy-party solutions neither SQL FTS).
Its for a new web application (dozen to hundreds simultaneous users, response time under 1s for most cases, 5s for worst case).
We ill create a single table (registers) to point to the existing tables. We ill need also a table of tags and a relationship table (registers x tags).
A batch ill scan the relevant tables (persons, organizations, addresses, etc) and populate the tables (registers, tags, tag_register) once a day.
The search ill be performed in all registers at once.
A "register" is a abstraction for a person a organization or any other object.
The object's data lie in a "main" table (persons table, organizations table, etc)
and in many "members" tables (addresses, contacts, profile, etc).
A "tag" is a word wich ocurrs in any of the table's text fields.
In a search for tag1, tag2 and tag3
If exists: tag1 AND tag2-100 ocurrences, tag1 AND tag3-10 ocurrences
and there's no ocurrences of tag2 and tag3 for the any register we ignore tag3
since this returns mores results than ignoring tag2.
Some objects share many properties and the search can use some properties to filtering and/or ordering,
in this example we ill simple use the name property to populate a column (vc_register) to be used for ordering.
Below is the script to create the tables and procedure and populate the database with dummy data.
The procedure performs the search in a reasonable time and was implementd as a conceptual proof.
We gave a try to table variables and creating/droping indexes in temporary tables but we found no indexed temporary tables worked better.
Also the FK constraints increase performance is better than we can have guessed.
Increasing the number of tags dont impact in performance, increasing the number of registers dont impact directly in performance.
If a single tag is present in millions of registers returning all matches for it can slow down the response time.
The response time is dicted by the number os comparasions to find the intersection between to matches (join each tag search result).
Chosing wisely the words from the database to populate the tags table ill impact directly in performance.
Words like "the", "and", "a" must be avoided (low meaning, high ocurrence)
We ill must keep a eye in the most common used tags and the tags with the highest ocurrences.