I think your problems might be associated with statistics.
It seems that you have automated statistics updating enabled. This means that when 20% of the rows in a table are updated, the statistics for the table is automatically updated.
This update can take a lot of time for very large tables. This might be one reason why your query is sometimes slow.
I Suggest that you turn off automatic statistics updating for this table, and instead add a job that manually updates statistics every night (for the active shard).