Our client focuses on showing the impact of governmental-policy decisions on real estate values. Their business model depends on real estate transaction data contained in deeds from county- and town-level recorders, each using a different data format. To support the growth and scalability of their business, the client needed an automated process to ingest this heterogeneous data and normalize it into a unified format.
Our team built data engineering pipelines to ingest the semi-structured, heterogeneous data files and trained machine learning models to infer data types and automatically normalize records to a standard data model. The models classify columns using morphology features, string matching, and value distributions: each incoming column is assigned the data type whose historical distribution it most closely resembles. We worked with the client's engineering team to incorporate the pipelines and automated normalization algorithms into their platform.
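The classification approach described above can be sketched in miniature. The snippet below is an illustrative toy, not the client's actual models: the field names (`parcel_id`, `owner_name`, `sale_price`, `sale_date`), the reference profiles, and the feature set (digit fraction, letter fraction, mean length) are all hypothetical stand-ins for profiles that would be learned from past labeled data. It shows the two mechanisms named in the text: string matching for unambiguous patterns, and nearest-distribution assignment for everything else.

```python
import re
from statistics import mean

# Hypothetical reference profiles learned from previously labeled columns:
# (mean digit fraction, mean letter fraction, mean string length).
REFERENCE_PROFILES = {
    "parcel_id":  (0.90, 0.05, 10.0),
    "owner_name": (0.00, 0.85, 18.0),
    "sale_price": (0.95, 0.00, 6.0),
}

# String-matching rules for unambiguous formats take priority.
PATTERN_RULES = [
    ("sale_date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]

def column_features(values):
    """Compute simple morphology features for a column of string values."""
    def digit_frac(s):
        return sum(c.isdigit() for c in s) / len(s) if s else 0.0
    def alpha_frac(s):
        return sum(c.isalpha() for c in s) / len(s) if s else 0.0
    return (
        mean(digit_frac(v) for v in values),
        mean(alpha_frac(v) for v in values),
        mean(len(v) for v in values),
    )

def classify_column(values):
    """Assign the data type whose reference profile is most similar."""
    # Pattern pass: if most values match a known format, use that label.
    for label, pattern in PATTERN_RULES:
        if sum(bool(pattern.match(v)) for v in values) / len(values) > 0.8:
            return label
    # Otherwise pick the nearest reference profile by Euclidean distance.
    feats = column_features(values)
    def distance(profile):
        return sum((a - b) ** 2 for a, b in zip(feats, profile)) ** 0.5
    return min(REFERENCE_PROFILES, key=lambda k: distance(REFERENCE_PROFILES[k]))
```

In a production pipeline the profiles would be far richer (token shapes, value histograms, vocabulary overlap with past columns), but the shape of the algorithm is the same: featurize the incoming column, then assign the type of the most similar previously seen distribution.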