Effective B2B database maintenance, driven by structured workflows, can process millions of records with over 99% accuracy. The results are reliable data for sales, compliance, and analytics, along with greater scalability, consistency, faster operations, and stronger overall business performance.
Those outside the field are often surprised to learn that B2B contact information decays at approximately 30% per year, a figure reported by Dun & Bradstreet and consistent with field experience. For a 50-million-record database, that is equivalent to 15 million records becoming outdated every 12 months. Multiple factors drive this decay and compound one another: executive turnover, mergers and acquisitions (M&A), office moves, and name changes, among others.
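The arithmetic behind that figure can be sketched in a few lines of Python. This is a minimal illustration, assuming decay is spread evenly across the year (a simplification, not something the Dun & Bradstreet figure states); the function name `stale_records` is hypothetical.

```python
def stale_records(total_records: int, annual_decay_rate: float, months: int = 12) -> int:
    """Estimate how many records go stale over `months`.

    Assumes decay is linear across the year, which is a simplification.
    """
    return round(total_records * annual_decay_rate * months / 12)


# 50-million-record database at roughly 30% annual decay
per_year = stale_records(50_000_000, 0.30)
per_quarter = stale_records(50_000_000, 0.30, months=3)
print(f"{per_year:,} stale per year, {per_quarter:,} stale per quarter")
```

Under the same assumption, the quarterly figure gives a rough sense of how many records drift out of date between two periodic cleansing runs.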
Each of these factors affects different fields at different rates. No static database can remain accurate against these compounding forces. B2B data aggregators need not debate whether to update their databases; the question is whether their current workflow can sustain the update speed needed to keep pace with data decay.
This article is meant for B2B data aggregation companies, market intelligence firms, and data platform companies. It covers three real-world examples of B2B database maintenance processes, which are representative of a production environment and include documented results.
These examples include a 50-million record business listings database, a 15-million record global foodservice database, and a 309,000-record legal database. Each example illustrates a unique set of maintenance issues and how the specific process design used addressed those issues.
Before examining these processes, it is important to understand the mechanisms of data decay and why standard cleansing processes are unable to address the root causes of data decay. These concepts drive nearly all design decisions related to the case studies illustrated below.
Data decay refers to the rate at which records in your database degrade (becoming inaccurate, incomplete, or undeliverable) due to real-world changes in the businesses those records represent. It is not caused by a system error; it is inherent in B2B data and occurs continuously across many fields.
It’s also important to note that a quarterly cleansing process will typically catch issues that have existed for up to three months. By that point, most organizations have already incurred significant downstream costs (missed campaign opportunities, incorrect contact information, etc.) from attempts that failed in the meantime.
In other words, the main reason periodic cleansing is ineffective is that too much time passes before any correction is attempted. Effective management of B2B data therefore requires continuous updates rather than periodic corrections.
As mentioned above, the seven causes listed below degrade different data fields. This is why a single validation pass usually can’t detect every type of degradation.
The implication for aggregators is clear: maintenance workflows should be field-aware, not merely record-aware. A validation pass may determine whether a record exists, but it cannot verify that each individual field holds the correct value. The difference between field-aware and record-aware maintenance drives the structure of each workflow detailed in the examples below.
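One way to picture a field-aware check is to track a last-verified date per field rather than per record. The sketch below is illustrative only: the field names and freshness windows in `MAX_AGE_DAYS` are hypothetical and are not taken from the case studies.

```python
from datetime import date, timedelta

# Hypothetical freshness windows in days; real windows would come from
# the decay rate observed for each field.
MAX_AGE_DAYS = {"company_name": 365, "address": 180, "executive_contact": 90}


def stale_fields(last_verified: dict[str, date], today: date) -> list[str]:
    """Return fields that were never verified or whose verification has expired."""
    return sorted(
        field
        for field, max_age in MAX_AGE_DAYS.items()
        if field not in last_verified
        or today - last_verified[field] > timedelta(days=max_age)
    )


today = date(2024, 6, 30)
record = {
    "company_name": date(2024, 3, 1),
    "address": date(2023, 11, 1),
    # "executive_contact" has never been verified
}
print(stale_fields(record, today))
```

A record-aware check would treat this record as present and therefore valid; the per-field view shows which values still need re-verification.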
Most aggregation companies are limited by their ability to scale data processing volumes and to be accountable for accuracy. Below is a mapping of these differences using five different dimensions.
Aggregators whose databases hold 5 million or more records tend to move toward outsourcing data operations. The fixed cost of employing an in-house team of 10-15 people does not fall as the database grows, whereas the cost of outsourced data operations is spread across the number of records processed.
The Challenge
A California-based business-to-business (B2B) data aggregator operated a 50-million-record business listings database containing contact information, firmographics, and executive profiles. This was the company’s core product. Clients accessed the database for marketing, sales intelligence, and account-based targeting. The problem was velocity.
CXO appointments, mergers and acquisitions, office relocations, and funding announcements were changing these records faster than the company’s internal staff could process the changes. Stale records were not only an accuracy issue; they also put client retention at risk.
The aggregator needed a workflow that would ingest, validate, and update more than 150,000 records per month from both structured and unstructured data sources without degrading accuracy or producing a backlog.
Workflow Solution
Our team began the project by separating data harvesting from data validation. The previous workflow had conflated these two functions, which resulted in bottlenecks at each stage.
Harvesting
Custom bots and scheduled crawlers served as the harvesting tools, deployed across multiple time zones against public and private sources.
Automated crawlers captured baseline firmographics and value-add intelligence.
Sources that denied automated access were forwarded to manual web researchers.
Manual researchers handled
Validation
Validation ran concurrently with harvesting. All captured records ran through multi-layered checks against business directories, news sites, and social platforms.
The validation stack consisted of rule-based checks and human review for all records that did not meet predefined confidence thresholds.
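The routing step can be sketched as a simple split on a confidence score. This is a minimal illustration: the `confidence` field, the `route_by_confidence` function, and the 0.90 cut-off are all assumptions, since the case study does not state how confidence was scored or where the thresholds were set.

```python
REVIEW_THRESHOLD = 0.90  # hypothetical cut-off; the case study does not state one


def route_by_confidence(records, threshold=REVIEW_THRESHOLD):
    """Split records into an auto-accepted queue and a human-review queue."""
    accepted, needs_review = [], []
    for record in records:
        if record["confidence"] >= threshold:
            accepted.append(record)
        else:
            needs_review.append(record)
    return accepted, needs_review


batch = [
    {"id": 1, "confidence": 0.97},
    {"id": 2, "confidence": 0.62},
    {"id": 3, "confidence": 0.90},
]
accepted, needs_review = route_by_confidence(batch)
print([r["id"] for r in accepted], [r["id"] for r in needs_review])
```

Keeping the threshold as a parameter lets the review queue be widened or narrowed per field or per source without touching the routing logic.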
Data Appending
Data appending was used to fill the gaps that validation exposed.
Outcome
The aggregator now processes 150,000+ verified, enriched records per month. The database meets industry accuracy benchmarks on firmographic and contact fields. The separation of harvesting, validation, and appending into discrete pipeline stages, each with its own quality gate, eliminated the backlog and reduced manual rework caused by upstream errors entering the validation stage.
Problem
A French firm providing data aggregation services and collecting market intelligence for the global food service and hospitality industries developed an international food service and hospitality database with 15M records across 60 countries.
There were over 50 data points per record. In addition, 2 million new records were added to the database annually and each of those required validation, enrichment, and categorization prior to delivery to their customers.
The in-house data team absorbed the entire cost of acquiring data, running quality control assessments, and enriching and maintaining the database. This left very little bandwidth for the product and commercial teams that generated revenue.
Both data currency and data hygiene were identified as key quality metrics. The global database could not deteriorate as volumes increased. These two conditions needed to exist simultaneously.
Process
This project required a hybrid process. It had to handle the sheer volume of new additions while maintaining the accuracy of the 15 million existing records. The two processes ran in parallel rather than sequentially.
Data Screening
First, the entire database of more than 15 million records was evaluated for clarity and completeness.
The remaining records went through a series of rule-based validation filters that compared them against pre-established reference authorities in the food service industry. After the validation assessment, manually validated records were returned to the clean database pool.
New Record Ingestion
New records came from two channels: regularly scheduled crawlers for structured sources (such as websites) and manual online research for unstructured sources.
Each new record entering the database underwent a layered series of validations including direct contact with the business named in the record via telephone to verify its factual accuracy. Although this method can be resource intensive, it also eliminates the category of errors that automated validation methods cannot identify (i.e., correct format but incorrect fact).
Segmentation
Industry codes were assigned to every record using a logical structure based on client-defined criteria and hospitality standards. How a record is classified has a significant impact on client search results and profile development.
Misclassifying a restaurant chain and misclassifying a hotel chain are both errors, but they affect different client use cases. Client-defined classification was therefore treated as part of the ingestion flow rather than as a post-processing activity.
Ongoing Hygiene
Ongoing hygiene used programmable macros and validation scripts to continuously check records for authenticity against the current version of the database. Records that failed these tests were flagged for review instead of being automatically deleted, which preserved the history of how each record came to be included in the database.
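The flag-instead-of-delete behavior can be sketched as follows. This is an illustration under stated assumptions: the `hygiene_pass` function, the `status` and `history` fields, and the two example checks are hypothetical, and the sample records are dummy data.

```python
from datetime import date


def hygiene_pass(records, checks, run_date):
    """Flag records that fail any check; never delete them."""
    for record in records:
        failed = [name for name, check in checks.items() if not check(record)]
        if failed:
            record["status"] = "flagged_for_review"
            # Append to the record's history so the audit trail is preserved.
            record.setdefault("history", []).append(
                {"date": run_date.isoformat(), "failed_checks": failed}
            )
    return records


# Hypothetical checks; real scripts would encode client-specific rules.
checks = {
    "has_phone": lambda r: bool(r.get("phone")),
    "has_country": lambda r: bool(r.get("country")),
}
records = [
    {"id": 1, "phone": "+33 1 23 45 67 89", "country": "FR", "status": "clean"},
    {"id": 2, "phone": "", "country": "FR", "status": "clean"},
]
hygiene_pass(records, checks, date(2024, 1, 15))
print([(r["id"], r["status"]) for r in records])
```

Because failing records stay in place with a dated history entry, a reviewer can see which check failed and when, rather than discovering that a record has silently disappeared.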
Outcome
An offshore partnership model allowed the aggregator to implement scalable, consistent data hygiene practices that were not feasible with a fixed-size internal team. In particular, direct-contact verification improved accuracy over automation alone for the categories of records that automated workflows systematically miss.
Result
A team at Hitech BPO collected over 309,000 California Bar attorney records, covering both active and deceased members, in less than 45 days and at an accuracy level above 99%. The project involved building an end-to-end data pipeline across numerous disparate data types and several secured state bar databases that did not allow automated access.
Challenge
A US-based B2B data aggregator had to create a production-ready repository of attorney member registrations maintained by the California State Bar Association. The repository had to include both active practicing members and deceased members, as well as historical registration information. In addition, the repository required a national license statistics layer that aggregated attorney license data by jurisdiction.
The data sources ranged from publicly available open directories to restricted, institutionally controlled databases. Some of the authoritative sources, including the state bar association’s own systems, could be accessed via automated crawlers, while others were restricted and could not be crawled automatically.
Workflow
Source Identification
The first phase of building the pipeline was identifying potential data sources, not capturing the data itself. Each identified source was evaluated for reliability and completeness before any data was extracted.
Data sources were categorized by their accessibility to automated crawlers: either they allowed automated crawling, or they required manual collection because of access restrictions or structural complexity. Combining these two categories into one pipeline would have resulted in accuracy gaps in the extracted data and additional manual labor.
Data Capture
Automated capture used customized crawlers and parsing scripts specific to the format of each source. Before the crawlers were deployed, field definitions (e.g., bar number, admission date, status, practice area, disciplinary history) were defined so that a consistent schema map existed across sources.
Manual collection covered the sources that were restricted from crawling or too structurally complex for the crawlers to parse. For these sources, researchers followed documented extraction methodologies to keep their output consistent with the crawlers’ output.
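A schema map of this kind can be sketched as a per-source dictionary that translates each source’s field names into one canonical set. The canonical fields below follow the examples named in the text; the source names, source-side field names, and sample values are hypothetical.

```python
CANONICAL_FIELDS = (
    "bar_number",
    "admission_date",
    "status",
    "practice_area",
    "disciplinary_history",
)

# Hypothetical source layouts; real sources would each need their own map.
FIELD_MAPS = {
    "source_a": {"BarNo": "bar_number", "Admitted": "admission_date", "Status": "status"},
    "source_b": {"license_id": "bar_number", "license_status": "status", "area": "practice_area"},
}


def normalize(source: str, raw: dict) -> dict:
    """Map one raw record onto the canonical schema; unmapped fields are dropped."""
    mapping = FIELD_MAPS[source]
    record = {field: None for field in CANONICAL_FIELDS}
    for source_key, value in raw.items():
        if source_key in mapping:
            record[mapping[source_key]] = value
    return record


# Dummy values for illustration only.
print(normalize("source_a", {"BarNo": "000000", "Admitted": "2001-06-01", "Status": "Active"}))
```

Fields a source does not provide stay as `None`, which makes the gaps visible to later validation and enrichment steps instead of hiding them.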
Data Validation
Validation checked the combined data from automated and manual sources against pre-defined business rules. An independent manual audit of a random sample of the validated data was then performed to identify systematic errors that rule-based validation would normalize but not flag.
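Drawing the audit sample and summarizing the reviewers’ findings can be sketched with Python’s standard `random` module. The sample size, the seed, and the function names are assumptions for illustration; the case study does not state how the sample was drawn or how large it was.

```python
import random


def audit_sample(record_ids, sample_size, seed=0):
    """Draw a reproducible random sample of record IDs for manual audit."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn
    return rng.sample(record_ids, min(sample_size, len(record_ids)))


def observed_accuracy(verdicts):
    """Share of audited records the reviewers marked correct (True)."""
    return sum(verdicts.values()) / len(verdicts)


ids = list(range(1, 1001))  # dummy record IDs
sample = audit_sample(ids, 50)

# Dummy reviewer verdicts, keyed by record ID.
verdicts = {101: True, 102: True, 103: False, 104: True}
print(len(sample), observed_accuracy(verdicts))
```

Sampling uniformly at random, rather than picking records the rules already flagged, is what lets the audit surface errors the rules are blind to.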
Data Enrichment
Enrichment updated records with missing or out-of-date fields using web research and other B2B data appending methods. Where possible, enrichment prioritized the fields most important to the client’s downstream use of the data.
Classification
Classification segmented all records into two broad categories: licensing data (e.g., admission status, bar number, jurisdiction, practice areas) and discipline data (e.g., complaint histories, sanctions, and reinstatement actions). Classification accuracy directly affects the usability of the database for compliance searches, targeted marketing to attorneys, and licensure analytics, which are the three main client use cases for this dataset.
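The two-way segmentation can be sketched as a split of each record’s fields into a licensing view and a discipline view. The field names mirror the examples in the text, but the exact names, the `split_record` function, and the sample values are hypothetical.

```python
LICENSING_FIELDS = {"admission_status", "bar_number", "jurisdiction", "practice_areas"}
DISCIPLINE_FIELDS = {"complaint_history", "sanctions", "reinstatement_actions"}


def split_record(record: dict) -> tuple[dict, dict]:
    """Split one attorney record into licensing data and discipline data."""
    licensing = {k: v for k, v in record.items() if k in LICENSING_FIELDS}
    discipline = {k: v for k, v in record.items() if k in DISCIPLINE_FIELDS}
    return licensing, discipline


# Dummy record for illustration only.
record = {
    "bar_number": "000000",
    "admission_status": "Active",
    "jurisdiction": "CA",
    "practice_areas": ["Tax"],
    "sanctions": [],
}
licensing, discipline = split_record(record)
print(sorted(licensing), sorted(discipline))
```

In practice the discipline view would likely also carry the bar number as a join key; that detail is left out here to keep the sketch short.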
Outcome
The 309,000-record database was delivered in 45 days at greater than 99% accuracy. Because licensure and discipline data were already segmented, the client could run compliance analysis as soon as it received the deliverable, with no further processing on its part. The accuracy metric was a contractual obligation under the client’s SLA with its own downstream clients.
Three data sets. Three different markets. Three different accuracy and throughput requirements. Yet each workflow shares the same structure: scheduled automation manages large volumes while human expertise addresses exceptions.
Neither alone can provide the level of accuracy that B2B aggregators need to protect both the quality of their products and the terms of their clients’ contracts. The examples described above should not be viewed as outliers; they are representative of workflows for large-scale databases.
The operational basis of outsourcing B2B database management does not revolve around cost. Rather, it revolves around elastic capacity. A data team working internally with adequate resources to manage 150,000 records per month will require a minimum of 6 months to hire and train sufficient personnel to increase its monthly processing to 300,000 records.
An outsourced model can increase or decrease processing capacity to meet volume requirements without that delay. For companies that operate B2B aggregation models where databases expand much faster than an internal data team can scale, the resulting scalability gap is a tangible threat to product quality.
Selecting a suitable vendor for outsourced B2B data enrichment services comes down to three requirements: accuracy guarantees backed by a service-level agreement (as opposed to stated benchmark accuracy rates), domain-specific knowledge of the types of data the aggregator maintains, and documentation of how the vendor combines automated processes with manual validation to verify data accuracy.
Hitech BPO provides outsourced database management services to B2B aggregators in the U.S., E.U. and APAC markets, including those managing 50 million-plus record listing databases and those maintaining highly regulated legal datasets.