

Status Submitted
Categories Data Federation
Created by Guest
Created on Aug 21, 2023

Combine data lake snapshots into a single federated collection

A common use case for data analytics is to analyse how your data evolves over time. For example, imagine you have an e-commerce database and your product prices change every day. You may only store the current price in your database, but you'd like to make a chart that shows the evolution of your product prices over time (price on the y axis, time on the x axis).

It is possible to make this happen today with the combination of `Data Lake` and `Data Federation`, but the Storage Configuration JSON needs to be manually updated like this:

```
{
  "databases": [
    {
      "collections": [
        {
          "name": "collectionName",
          "dataSources": [
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230814T050424Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230813T050415Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$20230812T050424Z",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            },
            ...
          ]
        }
      ]
    }
  ],
  ...
}
```

The Data Federation configuration JSON will grow to thousands of lines and needs to be maintained daily, or perhaps with a script + API (3 lines of JSON per snapshot * 365 snapshots a year * 20 collections = roughly 22,000 lines of JSON a year).

One idea could be to use a simple wildcard instead of the timestamp, like this:

```
{
  "databases": [
    {
      "collections": [
        {
          "name": "collectionName",
          "dataSources": [
            {
              "datasetName": "v1$atlas$snapshot$Cluster0$env$collectionName$*",
              "provenanceFieldName": "provenance",
              "storeName": "..."
            }
          ]
        }
      ]
    }
  ]
}
```

P.S.: I know that a time series collection could be useful in the specific example I just gave. But sometimes you may want to analyse various properties over time, and that's where a Data Lake solution makes sense.
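To illustrate the "script + API" workaround mentioned above, here is a rough Python sketch of the part that expands a wildcard into the explicit `dataSources` list. It assumes you already have the list of available dataset names (for example fetched from the Atlas API, which is not shown here). The function names `expand_wildcard` and `build_collection` are hypothetical, and the `*` matching is emulated client-side with Python's `fnmatch`; it is not an existing Data Federation feature.

```python
import json
from fnmatch import fnmatchcase


def expand_wildcard(pattern, dataset_names):
    """Return the dataset names matching the pattern, newest snapshot first."""
    matches = [name for name in dataset_names if fnmatchcase(name, pattern)]
    # The trailing timestamp (e.g. 20230814T050424Z) sorts lexicographically,
    # so a reverse string sort on it puts the newest snapshot first.
    return sorted(matches, key=lambda name: name.rsplit("$", 1)[-1], reverse=True)


def build_collection(name, pattern, dataset_names, store_name,
                     provenance_field="provenance"):
    """Build one federated collection entry with one data source per snapshot."""
    return {
        "name": name,
        "dataSources": [
            {
                "datasetName": dataset,
                "provenanceFieldName": provenance_field,
                "storeName": store_name,
            }
            for dataset in expand_wildcard(pattern, dataset_names)
        ],
    }


# Assumed input: dataset names that currently exist (hard-coded for the sketch).
available = [
    "v1$atlas$snapshot$Cluster0$env$collectionName$20230812T050424Z",
    "v1$atlas$snapshot$Cluster0$env$collectionName$20230814T050424Z",
    "v1$atlas$snapshot$Cluster0$env$otherCollection$20230814T050424Z",
    "v1$atlas$snapshot$Cluster0$env$collectionName$20230813T050415Z",
]

collection = build_collection(
    name="collectionName",
    pattern="v1$atlas$snapshot$Cluster0$env$collectionName$*",
    dataset_names=available,
    store_name="...",  # left elided, as in the configuration above
)

config = {"databases": [{"collections": [collection]}]}
print(json.dumps(config, indent=2))

# Size estimate from the post: 3 JSON lines per snapshot, daily, 20 collections.
lines_per_year = 3 * 365 * 20
```

A script like this would still have to run after every snapshot and push the regenerated configuration, which is exactly the maintenance burden a server-side wildcard would remove.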