
Status: Submitted
Categories: Data Federation
Created by: Guest
Created on: Jun 8, 2023

Schema inference

Schemaless storage is flexible, but it has a significant impact on downstream consumers, especially for data exchange and DW/AI workloads. Deriving and inferring the schema from the actual documents is a must-have so that we can understand, track, evolve, and translate the document schema. https://www.mongodb.com/blog/post/engblog-implementing-online-parquet-shredder is a great article on this.

I'd like to propose an additional feature in ADL/ADF that makes schema inference a first-class citizen, with faster turnaround and lower operational cost:

1. After the $out operation of ADL/ADF completes, collect the Parquet schema from each data file and union/unify them into a single schema. Store this schema in a .schema.json or .schema.txt file in the same S3/GCS location (see the first sketch below).
2. Add a new flag/parameter to $out that scans all the documents matching the filter condition in the query, but instead of writing the Parquet files, writes only the .schema.json or .schema.txt file to S3 (see the second sketch below).

This could be a useful operational routine to run every week, with or without a rough datetime incremental filter, to infer the schema and then update the corporate/enterprise schema repository. I've discussed this idea in more detail with MGM and benjamin.flast. Thank you.
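
To make the first step concrete, here is a minimal sketch of the post-$out unification pass, assuming Python with pyarrow. The sidecar file name, the JSON layout, and the local glob standing in for an S3/GCS listing are my assumptions, not existing ADL/ADF behavior:

```python
# Sketch of the proposed post-$out step: read the footer schema of every
# Parquet file a $out run produced, unify them into one schema, and persist
# it as a .schema.json sidecar next to the data. Not existing ADL/ADF
# behavior; the file layout here is an assumption.
import glob
import json

import pyarrow as pa
import pyarrow.parquet as pq


def infer_union_schema(parquet_paths):
    """Union the footer schemas of all Parquet files into one schema."""
    schemas = [pq.read_schema(path) for path in parquet_paths]
    return pa.unify_schemas(schemas)  # raises if field types conflict


def write_schema_sidecar(schema, out_path):
    """Serialize the unified schema as JSON alongside the data files."""
    fields = [
        {"name": f.name, "type": str(f.type), "nullable": f.nullable}
        for f in schema
    ]
    with open(out_path, "w") as fh:
        json.dump({"fields": fields}, fh, indent=2)


if __name__ == "__main__":
    # In practice this listing would run against the S3/GCS prefix that
    # $out wrote to; a local glob stands in for that here.
    paths = glob.glob("out/*.parquet")
    write_schema_sidecar(infer_union_schema(paths), "out/.schema.json")
```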
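And a sketch of the second step as a weekly incremental run. The schemaOnly parameter is the hypothetical flag this post proposes (it does not exist today), shown inside the existing Atlas Data Federation $out-to-S3 stage; the connection string, namespace, and updatedAt field are placeholders:

```python
# Weekly schema-only run against a federated database instance.
# Everything except "schemaOnly" follows the documented $out-to-S3 shape.
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("<federated-database-connection-string>")  # placeholder
coll = client["VirtualDatabase"]["VirtualCollection"]  # placeholder namespace

# Rough datetime incremental filter: only scan the last week of documents.
since = datetime.now(timezone.utc) - timedelta(days=7)

coll.aggregate([
    {"$match": {"updatedAt": {"$gte": since}}},  # placeholder date field
    {"$out": {
        "s3": {
            "bucket": "corp-datalake",
            "region": "us-east-1",
            "filename": "exports/weekly/",
            "format": {"name": "parquet"},
        },
        # Hypothetical flag: scan the matched documents and emit only the
        # unified .schema.json sidecar, skipping the Parquet data files.
        "schemaOnly": True,
    }},
])
```

Run on a weekly schedule, the resulting sidecar could then be diffed against, and pushed into, the enterprise schema repository.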