
Status: Submitted
Categories: Data Federation
Created by: Guest
Created on: Jun 8, 2023

Schema inference

Schemaless storage is flexible, but it has a significant impact on downstream consumers, especially for data exchange and DW/AI workloads. Deriving and inferring the schema from the actual documents is a must-have so that we can understand, track, evolve, and translate the document schema. https://www.mongodb.com/blog/post/engblog-implementing-online-parquet-shredder is a great article on this.

I'd like to propose an additional feature in ADL/ADF that makes schema inference a first-class citizen, with faster turnaround and lower operational cost:

1. After the $out operation of ADL/ADF completes, collect the Parquet schema from each data file and union/unify them into a single schema. Store this schema in a .schema.json or .schema.txt file in the same S3/GCS location (see the first sketch below).
2. Add a new flag/parameter to $out that scans all the documents matching the filter condition in the query, but instead of writing the Parquet files, writes only the .schema.json or .schema.txt file to S3 (see the second sketch below).

This could be a useful operational routine to run every week, with or without a rough datetime incremental filter, to infer the schema and then update the corporate/enterprise schema repository. I've discussed this idea in more detail with MGM and benjamin.flast. Thank you.
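
To make the first step concrete, here is a minimal sketch of the post-$out unification pass, assuming Python with pyarrow. The sidecar file name, the JSON layout, and the local glob standing in for an S3/GCS listing are my assumptions, not existing ADL/ADF behavior:

```python
# Sketch of the proposed post-$out step: read the footer schema of every
# Parquet file a $out run produced, unify them into one schema, and persist
# it as a .schema.json sidecar next to the data. Not existing ADL/ADF
# behavior; the file layout here is an assumption.
import glob
import json

import pyarrow as pa
import pyarrow.parquet as pq


def infer_union_schema(parquet_paths):
    """Union the footer schemas of all Parquet files into one schema."""
    schemas = [pq.read_schema(path) for path in parquet_paths]
    return pa.unify_schemas(schemas)  # raises if field types conflict


def write_schema_sidecar(schema, out_path):
    """Serialize the unified schema as JSON alongside the data files."""
    fields = [
        {"name": f.name, "type": str(f.type), "nullable": f.nullable}
        for f in schema
    ]
    with open(out_path, "w") as fh:
        json.dump({"fields": fields}, fh, indent=2)


if __name__ == "__main__":
    # In practice this listing would run against the S3/GCS prefix that
    # $out wrote to; a local glob stands in for that here.
    paths = glob.glob("out/*.parquet")
    write_schema_sidecar(infer_union_schema(paths), "out/.schema.json")
```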
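And a sketch of the second step as a weekly incremental run. The schemaOnly parameter is the hypothetical flag this post proposes (it does not exist today), shown inside the existing Atlas Data Federation $out-to-S3 stage; the connection string, namespace, and updatedAt field are placeholders:

```python
# Weekly schema-only run against a federated database instance.
# Everything except "schemaOnly" follows the documented $out-to-S3 shape.
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("<federated-database-connection-string>")  # placeholder
coll = client["VirtualDatabase"]["VirtualCollection"]  # placeholder namespace

# Rough datetime incremental filter: only scan the last week of documents.
since = datetime.now(timezone.utc) - timedelta(days=7)

coll.aggregate([
    {"$match": {"updatedAt": {"$gte": since}}},  # placeholder date field
    {"$out": {
        "s3": {
            "bucket": "corp-datalake",
            "region": "us-east-1",
            "filename": "exports/weekly/",
            "format": {"name": "parquet"},
        },
        # Hypothetical flag: scan the matched documents and emit only the
        # unified .schema.json sidecar, skipping the Parquet data files.
        "schemaOnly": True,
    }},
])
```

Run on a weekly schedule, the resulting sidecar could then be diffed against, and pushed into, the enterprise schema repository.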