We recently turned on some new features of a product we were using and would get this RTE on a few records within the application. The number of problem records were small, the total number of records worked with in a day made them a regular occurrence.
I went over to our test environment. It was up to current patch levels, but the data was a few months old. Almost all of my broken records in live were broken in test. That gave me hope that its an existing data issue and not something actively breaking records.
The nice thing about our test environment is that I am usually the only person using it. That makes it easy to toss the SQL Profiler on with out having to filter it to just my traffic. I started the profiler at the point just before clicking the button that generated the RTE and paused it just after the message came up. I closed the app and repeated this process a few times for both working and broken records.
Looking at the queries that each ran, the broken records all stopped at the same point after the same query was run. That tells me that the application is processing the queries as it runs them and the last one before it stopped is my suspect query. If both the broken and the working records had the same full list of queries I would have had to check them all. That would have indicated to me that it pre-loads most of the data before processing it. I was in luck.
The query was a large one but the only difference between all records was the record ID. So it is the same query running for both working and non-working records. Running the query for all of them gave me different sized record sets. Some had more a lot of records and some had very few. One thing did jump out at me as I looked at the raw data.
A few null values jumped out to me in a column that looked like it should have a value. With my experience with the data I knew it should have had a value. I saw one column that had data in it directly related to reference I expected instead of that null value. Because this looked like a complicated query, my first guess was a bad join.
That column was not the only one with nulls in it but every time it had a null the 3 columns to the right also had nulls. That's what I expect a bad join to look like. So I started reformatting the query with proper indents and white space. It did contain a few union calls. The first query in the union practically returned the same results. It looked the same except for a few less rows. My null values were still mixed into the good data.
I focused on the table columns that were null and the one column that contained directly related information. The joins between them looked correct. Exactly how I would have written them. After rewriting the query as simple 2 table join that used the same logic, I ended up with the same results. It was a join that I expected to be valid every time but in this case there are times where its not.
It turned out to be a flaw in the application logic where it allowed some changes to one set of tables that should have enforced changes in another. But they were enforcing them at the application level instead. Other parts of the system enforced it correctly, this one point in the application did not. It overlooked the fact that it needed to enforce it and I almost expect the developer to make the same assumption I made.
I know this is a little different from my usual problem/solution posts but this was mostly an internal product and I wanted to document the process I took to solve it.