Our client, a digital PR firm, needed to automatically discover concepts in the 6 million new articles published to the web each day, so that it could share them with its customers and drive real-time PR campaigns. Each day's articles had to be processed within 1 hour.
We built data engineering pipelines that preprocess article text by lemmatizing it and removing stop words. Our team then developed a specialized natural language processing (NLP) model to process, classify, and cluster web-based articles by primary purpose and content, which let us extract the most common themes across the full set of new articles. Finally, we optimized and parallelized the pipeline so that it processes 6 million articles daily within 1 hour.
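The general shape of such a pipeline can be illustrated with a minimal Python sketch. It is not the production system: the stop-word list is a tiny placeholder, `normalize` is a crude plural-stripping stand-in for a real lemmatizer, and counting how many articles mention each term stands in for the specialized clustering model. All function names and parameters here (`preprocess`, `count_terms`, `top_terms`, `workers`, `batch_size`) are hypothetical.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
import re

# Tiny illustrative stop-word list (assumption; a real pipeline would use a fuller one).
STOP_WORDS = frozenset(
    {"a", "an", "the", "and", "or", "of", "to", "in", "on", "for", "is", "are", "with"}
)
TOKEN_RE = re.compile(r"[a-z]+")


def normalize(token: str) -> str:
    """Crude stand-in for a lemmatizer: strips a trailing plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize, drop stop words, then normalize each token."""
    tokens = TOKEN_RE.findall(text.lower())
    return [normalize(t) for t in tokens if t not in STOP_WORDS]


def count_terms(batch: list[str]) -> Counter:
    """For one batch, count how many articles contain each term."""
    counts = Counter()
    for text in batch:
        counts.update(set(preprocess(text)))
    return counts


def top_terms(articles, n=5, workers=1, batch_size=1000):
    """Return the n terms that appear in the most articles.

    Batches are processed independently, so they can be spread across
    worker processes and the partial counts merged at the end.
    """
    batches = [articles[i:i + batch_size] for i in range(0, len(articles), batch_size)]
    if workers > 1:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(count_terms, batches))
    else:
        partials = [count_terms(b) for b in batches]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total.most_common(n)
```

Because each batch is counted independently and the partial counts are simply summed, the work scales out across processes or machines. When calling `top_terms` with `workers` greater than 1 on a platform that spawns new processes, run it under an `if __name__ == "__main__":` guard.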