Issue #3012: Query on primary key is slow when using 'IN'

Type: Bug ReoprtVersion: 2.9.4_06Priority: NormalStatus: ActiveReplies: 3
#1

Hi! We have an entity with a composite primary key on "primaryVariantOid" and "storeOid" fields:

(I removed unnecessary fields and methods to make the code shorter and hopefully compile with just the JDK)

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.IdClass;
import java.io.Serial;
import java.io.Serializable;
import java.util.Objects;

@Entity
@IdClass(ProductVariantStoreData.DatabaseId.class)
public final class ProductVariantStoreData implements Serializable {
    
    @Serial
    private static final long serialVersionUID = -4357827914123980944L;
    
    @Id
    private String productVariantOid;
    
    @Id
    private String storeOid;
    
    private ProductVariantStoreData() {}
    
    public ProductVariantStoreData(String productVariantOid, String storeOid) {
        this.productVariantOid = productVariantOid;
        this.storeOid = storeOid;
    }
    
    public static final class DatabaseId {
        
        final String productVariantOid;
        final String storeOid;
        
        public DatabaseId(String productVariantOid, String storeOid) {
            this.productVariantOid = productVariantOid;
            this.storeOid = storeOid;
        }
        
        @Override
        public boolean equals(Object obj) {
            return obj instanceof DatabaseId other &&
                    Objects.equals(productVariantOid, other.productVariantOid) &&
                    Objects.equals(storeOid, other.storeOid);
        }
        
        @Override
        public int hashCode() {
            return Objects.hash(productVariantOid, storeOid);
        }
    }
}

We have about 200M instances of this entity.

The following query finishes instantly:

SELECT FROM ProductVariantStoreData d
WHERE d.storeOid = '1205450'
  AND (d.productVariantOid = '1344763' OR d.productVariantOid = '1344770')
Query plan 1/2 description
============================

[Step 1]
Scan type eu.extech.quant.product_variants.data.ProductVariantStoreData
locating ProductVariantStoreData (d) instances that satisfy:
  (d.productVariantOid='1344770').

[Step 2]
Scan type eu.extech.quant.product_variants.data.ProductVariantStoreData
locating ProductVariantStoreData (d) instances that satisfy:
  (d.productVariantOid='1344763').

[Step 3]
Merge the results of steps 1, 2 using OR.

[Step 4]
Filter the results of step 3
retaining only results that satisfy:
  (d.storeOid='1205450').

[Step 5]
Group the results of step 4 by:
  d.

[Step 6]
Remove duplicates from the groups of step 5.

[Step 7]
Apply selection and prepare final results.

<distinctPlan 1.7056 3,83/4,13 distinct(d)>
  <filterPlan 1.4503 3,83/4,13 (d.storeOid='1205450')>
    <mergePlan or 1.4292 3,81/4,13 0>
      <btreePlan 1.0142 2,72/4,13 d(ProductVariantStoreData) type(ProductVariantStoreData['1344770'->'1344770'*:notNull]) />
      <btreePlan 1.0142 2,72/4,13 d(ProductVariantStoreData) type(ProductVariantStoreData['1344763'->'1344763'*:notNull]) />
    </mergePlan>
  </filterPlan>
</distinctPlan>

However, if I change the two OR'd equalities to IN, the query suddenly takes 2.5 minutes:

SELECT FROM ProductVariantStoreData d
WHERE d.storeOid = '1205450'
  AND d.productVariantOid IN ('1344763', '1344770')
Query plan 1/1 description
============================

[Step 1]
Scan type eu.extech.quant.product_variants.data.ProductVariantStoreData
locating ProductVariantStoreData (d) instances that satisfy:
  and(contains(list('1344763','1344770'),d.productVariantOid),(d.storeOid='1205450')).

[Step 2]
Apply selection and prepare final results.

<btreePlan 23.4232 6,87/7,56 d(ProductVariantStoreData) type(ProductVariantStoreData[and(contains(list('1344763','1344770'),d.productVariantOid),[pos<1>'1205450'->'1205450':notNull])]) />

The same happens programmatically when using javax.jdo.Query, where I declare HashSet parameters for the searched values, and set the query filter to:

param_productVariantOids.contains(this.productVariantOid) &&
param_storeOids.contains(this.storeOid)

I would assume this would have the same query plan as the "IN" query, though I haven't checked because I don't know how to do that in Explorer's Query tab.

I was able to work around this by declaring a separate parameter for each value and OR-ing them together, instead of using the HashSet, but it would be nice to improve the optimizer to get the same performance for queries. Thanks!

#2

There is another issue that causes a complete scan:

SELECT FROM ProductVariantStoreData d
WHERE d.storeOid = '1205450'
  AND FALSE

This query will obviously not find anything, but it still takes 2 minutes to complete. Not being able to use "AND FALSE" makes it awkward to configure the query programmatically if some deeply nested condition is known to always be false.

Query plan 1/1 description
============================

[Step 1]
Scan type eu.extech.quant.product_variants.data.ProductVariantStoreData
locating ProductVariantStoreData (d) instances that satisfy: 
  and(false,(d.storeOid='1205450')).

[Step 2]
Apply selection and prepare final results.

<btreePlan 23.4232 6,86/7,56 d(ProductVariantStoreData) type(ProductVariantStoreData[and(false,[pos<1>'1205450'->'1205450':notNull])]) />

The same issue happens when a HashSet parameter is empty; then `param_productVariantOids.contains(this.productVariantOid)` will obviously never find anything, but the scan still happens. This is what I discovered first, and I was hoping that if the parameter is empty, I could substitute the condition with `&& false` to nudge ObjectDB into immediately returning an empty result, but it still did a full scan.

#3

These examples indeed demonstrate cases where the query optimizer needs improvement. We plan to rewrite the optimizer in future versions, and we hope to address these issues as part of that work.

In the meantime, your options are:

  • Manually rephrasing queries to avoid these issues, as you described above.

  • Keeping the original queries but using query hints to override suboptimal query planning decisions.

The first approach is usually much easier when it applies. When it doesn’t, there are forum posts discussing how to improve performance using query hints, but this requires more effort.

ObjectDB Support
#4

Hi, thanks for the response.

I found some more issues with generating a query by adding 'OR' clauses with one parameter per value:

  • Too many conditions (around 20 000) caused a StackOverflowError in ObjectDB.
  • If the query requested combinations of both fields, i.e. ((productVariantOid = 1 OR ...) AND (storeOid = 1 OR ...)), having too many parameters on both sides would make the query even slower than the full scan using HashSet parameters (I think 10 000 productVariantOid values, 200 storeOid values).

Rather than working around the inconsistent performance with heuristics, I found a different solution altogether - looping over every combination of both fields, and using entityManager.find to look up every entity by the primary key. This ended up being even faster than the fastest queries with the 'OR' clause, and gave consistent performance scaling.

Side note: I was also trying to find if there was any API like entityManager.find, but for bulk retrieval by primary keys, that could be even faster, but didn't find one. I thought PersistenceManager.getObjectsById could work, but it threw an exception if any of the primary keys was not found in the database, whereas I needed it to silently ignore any primary keys that were not found.

Reply