The Goldilocks problem – Materialized Views

I’m sure you’ve all heard the tale of Goldilocks and the Three Bears, but I’d like to apply the Goldilocks principle to a database object-namely, a materialized view.

You might be thinking, how does that apply to database objects when it comes to things like not too cold, not too hot, or not too heavy, not too light? But bear with me (no pun intended) and we’ll see how all this pans out shortly.

Just about every web application has a public page. Often that public page is just the login screen that takes you into the more private parts of the application. But another common pattern I see is a public page that shows a dashboard or some basic information designed to entice the user to go further into the application. A simple example would be our very own AskTOM website, which shows some recently answered questions and the upcoming Ask TOM Live sessions on the home page, available for all public users to see. You don’t need a login to view that information.

One thing people often overlook is that the efficiency of these public pages is absolutely critical. By definition, because everyone can see them, it’s not just people accessing them. It could be web crawlers, AI bots, or even an army of denial-of-service attack agents. Anything that plunders these public pages because there are no credentials required is a potential threat to your system, whether it’s malicious or simply the result of site popularity.

If these pages are static, we can rely on various caching technologies, either in our own application servers or via service providers like Cloudflare and the like. But if the data on these public pages is being dynamically generated, then caching isn’t going to help you. My very rough rule of thumb is that if a public page is consuming a second of database CPU time, then you’re exposing yourself to significant risk. All it takes is 30 or 40 concurrent users hitting your system, and that’s a lot of CPU cores on your database being consumed just to serve up that public page.

So how does this relate to Goldilocks and materialized views?

Well, often these public pages start off as very simple content. They’re just a quick overview of what’s available to users if they were to dive deeper into the application. But of course, a public page can become very popular, especially inside your own organization. What this means is that something which starts simple begins to succumb to increasing demands for more and more data to be displayed. The more data you have to display, and the more complex those data requirements become, the greater the risk of slowdown. And the greater the risk of slowdown, the greater the risk to your application in terms of denial-of-service attacks, whether malicious or inadvertent.

A materialized view is often a great way to solve those issues by providing a very simple structure to query that represents-or masks-a lot of the complexity of the underlying data structures. But how does the materialized view relate to my comments about Goldilocks?

Well, if the materialized view is based on a simple query (where “simple” means it’s a candidate for fast refresh) then refreshing the view can be done very efficiently, perhaps even with an ON COMMIT materialized view definition will be possible assuming the data changes are not too frequent. But what if the materialized view is too complicated for that? What if it isn’t a candidate for fast refresh? That means you have to do a complete refresh.

And what if that complete refresh is expensive?

Now we’re trapped in the Goldilocks “conundrum”. I don’t want to refresh that materialized view too often because it’ll consume a lot of resources on my system. However, I also don’t want to present excessively stale data on my public page. Now we are getting into “I want my porridge not too cold, not too hot” territory. I don’t want to have a materialized view refresh job running (say) every minute because that’s far too expensive in terms of resource consumption. Conversely, I don’t want a job that runs only once an hour just to avoid consuming too many resources, because then my public page could be up to 59 minutes out of date.

So how do we solve the middle-ground problem? Let me present a potential solution you might want to consider.

I’ll start off with a simple dashboard that’s just querying from a calendar table for room bookings.

SQL> create table calendar (
  2    cal_id  int primary key,
  3    start_time   date,
  4    end_time     date,
  5    room         varchar2(30),
  6    event        varchar2(60)
  7  );
Table created.
SQL>
SQL> insert into calendar
  2    values (1000, date '2025-05-20' +  9/24, date '2025-05-20' + 10/24, '2nd Floor', 'Team meeting');
1 row created.
SQL> insert into calendar
  2    values (1001, date '2025-05-20' + 11/24, date '2025-05-20' + 12/24, '1st Floor', 'Standup');
1 row created.
SQL> insert into calendar
  2    values (1002, date '2025-05-21' + 12/24, date '2025-05-20' + 14/24, 'Conference Room', 'Spec Review');
1 row created.
SQL> insert into calendar
  2    values (1003, date '2025-05-21' + 14/24, date '2025-05-20' + 15/24, 'Room 6A', 'Logistics');
1 row created.
SQL> insert into calendar
  2    values (1004, date '2025-05-22' +  8/24, date '2025-05-20' + 09/24, 'Bill''s Office', 'Pre-Deployment');
1 row created.
SQL> insert into calendar
  2    values (1005, date '2025-05-22' + 11/24, date '2025-05-20' + 11.5/24, '2nd Floor', 'Brainstorming');
1 row created.
SQL> insert into calendar
  2    values (1006, date '2025-05-23' + 10/24, date '2025-05-20' + 10.5/24, 'External', 'Perf Review');
1 row created.
SQL> insert into calendar
  2   values (1007, date '2025-05-23' + 14/24, date '2025-05-20' + 16/24, 'Cafeteria', 'Launch date decision');
1 row created.

Because that’s a very simple query, I wouldn’t initially need any complexity in terms of a materialized view etc. That would be a great query for my initial public page.

However, let’s assume my application or, in particular, the public-facing page of this application, has suddenly become popular. New requirements come in saying, “I don’t want to just free-text the location names. We have genuine locations stored in another table, but they’re not in this system, so you need to retrieve them via a REST call.”

SQL> alter table calendar drop column room;
Table altered.
SQL> alter table calendar add location_id int;
Table altered.
SQL>
SQL> create table locations (
  2    loc_id   int primary key,
  3    loc_name varchar2(30));
Table created.
SQL>
SQL> create or replace
  2  function location_details(p_loc_id int) return json is
  3  begin
  4    --
  5    -- Some REST call to get some details about the room
  6    --
  7    -- apex_web_service.make_rest_request(...);  etc
  8    --
  9
 10    return json('{ "name": "Room 6",
 11                     "AV" : "yes",
 12                     "Capacity" : 20,
 13                     "Monitor" : "Yes",
 14                     "Interface" : "HDMI" }');
 15  end;
 16  /
Function created.

Similarly, for each room there might be an organiser for meetings on the calendar, and once again that organiser might need to be retrieved via a REST call to get all their details from the HR system.

SQL> alter table calendar add meeting_owner int;
Table altered.
SQL> create or replace
  2  function person_details(p_hr_id int) return json is
  3  begin
  4    --
  5    -- Some REST call to get some details about a person
  6    -- from the company HR system
  7    --
  8    -- apex_web_service.make_rest_request(...);  etc
  9    --
 10
 11     return json('{ "firstname": "Bill",
 12                     "lastname" : "Smith",
 13                     "Level" : 4,
 14                     "Title" : "Tech Manager"}');
 15  end;
 16  /
Function created.

Suddenly, my simple page has become a lot more complex and has much greater potential to slow down because the performance of the REST calls across the network are probably out of my control.

SQL> with cal as
  2  (
  3    select
  4         cal_id
  5        ,start_time
  6        ,end_time
  7        ,event
  8        ,location_details(location_id) as loc_json
  9        ,person_details(meeting_owner) as person_json
 10    from calendar c, [lots of other tables]
 11  )
 12  select
 13         cal_id
 14        ,start_time
 15        ,end_time
 16        ,event
 17        ,json_value(loc_json,'$.name') room_name
 18        ,json_value(loc_json,'$.Capacity') room_size
 19        ,json_value(person_json,'$.firstname') owner
 20        ,json_value(person_json,'$.Title') title
 21        --
 22        -- etc
 23        --
 24  from cal;

So I’m going to keep my public page efficient by using a materialized view

SQL> create materialized view calendar_mv
  2  refresh complete on demand as
  3  with cal as
  4  (
  5    select
  6    ...

But I’ve now got two requirements I need to resolve here. One is making sure my page isn’t overly stale, and the other is making sure I don’t refresh my materialized view so often that it creates a resource consumption issue on my database server.

Let’s start with the staleness requirement. Let’s assume I don’t want my materialized view to be more than, say, three minutes out of date. By that logic alone, I’d need a refresh job that runs every two minutes (to accommodate the refresh time itself). But as I said, if the refresh is expensive, then I’d rather refresh at a much less frequent interval. It’s possible to have both.

I’ll create a table called CHANGE_SEMAPHORE, which is really just a home-grown version of a materialized view log on my calendar table. The reason I’m doing it myself is that I want to make this as efficient as possible, so I’m only going to capture information at the statement level rather than row level. That means I don’t really care how many rows are inserted, updated, or deleted. I’m simply recording the fact that a statement occurred.

SQL> create table changes_semaphore ( occurrence date);
Table created.
SQL> create or replace
  2  trigger calendar_trg
  3  after insert or update or delete
  4  on calendar
  5  begin
  6    insert into changes_semaphore
  7    values (sysdate );
  8  end;
  9  /
Trigger created.

Of course, if my CALENDAR table already had a suitable date column that I could use in a materialized view log, or if operations on that table were generally single-row operations, then a materialized view log would be just as adequate-and probably a little more efficient-than using a trigger. But in this case, all the CHANGE_SEMAPHORE table is going to contain is the timestamp at which someone executed a statement against the calendar table.

Now I can build a SMART_REFRESH procedure that both keeps my materialized view from becoming too stale while also avoiding overloading my database server with continuous refreshes. I’ve got two thresholds

a maximum time at which I will always refresh my materialized view
the limit of staleness I’m prepared to accept

SQL> create or replace
  2  procedure smart_refresh is
  3    l_rid                 sys.odcivarchar2list:= sys.odcivarchar2list() ;
  4    l_last_refresh        date;
  5    l_max_threshold_secs  int := 1800;
  6    l_min_threshold_secs  int := 180;
  7

The first thing I’ll do is collect some information. Namely, have there been any changes recorded in my CHANGE_SEMAPHORE table, and how long has it been since I last refreshed the materialized view?

  8  begin
  9    select rowidtochar(rowid)
 10    bulk collect into l_rid
 11    from changes_semaphore
 12    for update;
 13
 14    select last_refresh_end_time
 15    into   l_last_refresh
 16    from   user_mviews
 17    where  mview_name = 'CALENDAR_MV';
 18    dbms_output.put_line('Last refresh was '||round((sysdate - l_last_refresh)*86400)||' seconds agoo');
 19

Obviously, you can adjust the timings to suit your own requirements, but for me, if it’s been more than 30 minutes since the last refresh, I’m going to perform a complete refresh anyway.

That’s not mandatory. I’ve chosen to do that because some of my external information-such as the location lookups or the HR lookups via REST calls-might have changed and I have no mechanism for detecting that. Or perhaps the tables other than CALENDAR are slowly changing over time and I’d like to make sure I’m getting those updates semi-regularly.

Either way, I’m effectively putting in a blanket rule that no more than 30 minutes will ever pass without refreshing the materialized view. After that, I clear out the semaphore data, which I no longer need, and I’m done.

 20    if (sysdate - l_last_refresh)*86400 > l_max_threshold_secs then
 21       dbms_output.put_line('Refreshing due to expiry');
 22       dbms_mview.refresh('CALENDAR_MV');
 23
 24      if l_rid.count > 0 then
 25        forall i in l_rid.first .. l_rid.last
 26          delete from changes_semaphore
 27          where  rowid = chartorowid(l_rid(i));
 28        commit;
 29      end if;
 30      return;
 31    end if;
 32

The other scenario is only wanting to refresh the materialized view when I’ve detected changes.

Because my job is going to run once every minute, I’m never going to let my materialized view become too stale.

So the first thing I do is check whether any changes have occurred. If they have, I then check how recently I last refreshed the materialized view.

If it’s been less than two minutes since the last refresh, then I’m not going to refresh again. I don’t want to overload my server. Yes, the data is stale, but it’s not more than two minutes stale.

However, if it’s been more than three minutes, then I’ll go ahead and perform the refresh.

 33    if l_rid.count > 0 then
 34      if (sysdate - l_last_refresh)*86400 > l_min_threshold_secs then
 35        dbms_output.put_line('Changes detected, '||l_min_threshold_secs||' exceeded, doing a refresh');
 36        dbms_mview.refresh('CALENDAR_MV');
 37        forall i in l_rid.first .. l_rid.last
 38          delete from changes_semaphore
 39          where  rowid = chartorowid(l_rid(i));
 40        commit;
 41      else
 42        dbms_output.put_line('Last refresh less than '||l_min_threshold_secs||' ago, skipping');
 43      end if;
 44    else
 45      dbms_output.put_line('No changes to refresh');
 46    end if;
 47  end ;
 48  /

Procedure created.

In this way, even if my system is being absolutely hammered, I might only be refreshing every few minutes. But when there aren’t many data changes, my materialized view is never more than a few minutes stale, while still avoiding an excessive number of refreshes.

Now all I need to do is schedule that SMART_REFRESH job to run every minute-or whatever frequency is appropriate-and I’ve achieved my Goldilocks materialized view refresh.

I should note that you could probably implement this entirely using Scheduler functionality itself with what we call event-based jobs. Event-based jobs are where the Scheduler listens on a queue.

For a solution like this, though, I find that approach to be a little overkill because you’d need to create a queue, add the queue logic, and still implement the “not too often” refresh logic.

However you choose to implement this, the two key points I’d stress are:

1. Public pages must be extremely efficient. Otherwise, they can become a considerable risk to your application.

2. Materialized view refresh doesn’t have to be an all-or-nothing choice. You can tailor refresh timing with flexible scheduling to get the best of both worlds-keeping staleness low without overloading your server.

The Goldilocks problem – Materialized Views

Rate

Share

Share

Rate