Status Submitted
Categories Database
Created by Guest
Created on Jun 18, 2022

Boost the performance of bioinformatic annotation queries

The documents to be selected look something like this:

{
  "_id": { "$oid": "6272c580d4400d8cb10d5406" },
  "#CHROM": 1,
  "POS": 286747,
  "ID": "rs369556846",
  "REF": "A",
  "ALT": "G",
  "QUAL": ".",
  "FILTER": ".",
  "INFO": [
    {
      "RS": 369556846,
      "RSPOS": 286747,
      "dbSNPBuildID": 138,
      "SSR": 0,
      "SAO": 0,
      "VP": "0x050100000005150026000100",
      "WGT": 1,
      "VC": "SNV",
      "CAF": [{ "$numberDecimal": "0.9381" }, { "$numberDecimal": "0.0619" }],
      "COMMON": 1,
      "TOPMED": [{ "$numberDecimal": "0.88411856523955147" }, { "$numberDecimal": "0.11588143476044852" }]
    },
    ["SLO", "ASP", "VLD", "G5", "KGPhase3"]
  ]
}

For a basic annotation (https://en.wikipedia.org/wiki/SNP_annotation) scenario, we need a query like this:

{'ID': {'$in': ['rs369556846', 'rs2185539', 'rs2519062', 'rs149363311', 'rs55745762', <...>]}}

Here <...> stands for hundreds or thousands of values. Such a query executes in a few seconds.

More complex annotation queries look like this:

{'$or': [{'#CHROM': 1, 'POS': 1499125}, {'#CHROM': 1, 'POS': 1680158}, {'#CHROM': 1, 'POS': 1749174}, {'#CHROM': 1, 'POS': 3061224}, {'#CHROM': 1, 'POS': 3589337}, <...>]}

{'$or': [{'ID': 'rs149434212', 'REF': 'C', 'ALT': 'T'}, {'ID': 'rs72901712', 'REF': 'G', 'ALT': 'A'}, {'ID': 'rs145474533', 'REF': 'G', 'ALT': 'C'}, {'ID': 'rs12096573', 'REF': 'G', 'ALT': 'T'}, {'ID': 'rs10909978', 'REF': 'G', 'ALT': 'A'}, <...>]}

Even though the query plan uses IXSCAN, these queries run for many hours. Please test the queries above thoroughly and improve their execution performance. This will help science!
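Until the server handles large $or lists better, a client-side workaround may be worth trying. The sketch below is only an illustration, not a tested fix: it rewrites the (#CHROM, POS) pairs as one $in clause per chromosome, which matches the same documents with far fewer $or branches, and splits long lists into batches. The helper names group_positions and batched are hypothetical, as are the batch size and the compound index on (#CHROM, POS) mentioned in the comments. Whether this is actually faster has not been measured here.

```python
from collections import defaultdict
from itertools import islice


def group_positions(pairs):
    """Build a filter equivalent to an $or over (chrom, pos) pairs,
    using one {'#CHROM': c, 'POS': {'$in': [...]}} clause per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, pos in pairs:
        by_chrom[chrom].append(pos)
    if not by_chrom:
        raise ValueError("at least one (chrom, pos) pair is required")
    # Chromosomes keep their first-seen order; positions are deduplicated and sorted.
    clauses = [
        {'#CHROM': chrom, 'POS': {'$in': sorted(set(positions))}}
        for chrom, positions in by_chrom.items()
    ]
    return clauses[0] if len(clauses) == 1 else {'$or': clauses}


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


# Hypothetical usage with pymongo (assumes a `collection` object and an
# index created beforehand; neither is part of the original report):
#   collection.create_index([('#CHROM', 1), ('POS', 1)])
#   for chunk in batched(pairs, 1000):
#       for doc in collection.find(group_positions(chunk)):
#           ...
```

The same grouping does not carry over directly to the (ID, REF, ALT) query, because each ID is tied to its own REF and ALT. One option there, again an assumption, is to query by {'ID': {'$in': [...]}}, which the report says is fast, and filter REF and ALT on the client.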