Welcome toVigges Developer Community-Open, Learning,Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
2.0k views
in Technique[技术] by (71.8m points)

sql - MySQL huge tables JOIN makes database collapse

Following my recent question Select information from last item and join to the total amount, I am having some memory problems while generation tables

I have two tables sales1 and sales2 like this:

id | dates | customer | sale

With this table definition:

CREATE TABLE sales (
    id int auto_increment primary key, 
    dates date,
    customer int,
    sale int
);

sales1 and sales2 have the same definition, but sales2 has sale=-1 in every field. A customer can be in none, one or both tables. Both tables have around 300.000 records and much more fields than indicated here (around 50 fields). They are InnoDB.

I want to select, for each customer:

  • number of purchases
  • last purchase value
  • total amount of purchases, when it has a positive value

The query I am using is:

SELECT a.customer, count(a.sale), max_sale
FROM sales a
INNER JOIN (SELECT customer, sale max_sale 
        from sales x where dates = (select max(dates) 
                                    from sales y 
                                    where x.customer = y.customer
                                    and y.sale > 0
                                   )

       )b
ON a.customer = b.customer
GROUP BY a.customer, max_sale;

The problem is:

I have to get the results, that I need for certain calculations, separated for dates: information on year 2012, information on year 2013, but also information from all the years together.

Whenever I do just one year, it takes about 2-3 minutes to storage all the information.

But when I try to gather information from all the years, the database crashes and I get messages like:

InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')

It seems that joining such huge tables is too much for the database. When I explain the query, almost all the percentage of time comes from creating tmp table.

I thought in splitting the data gathering in quarters. We get the results for every three months and then join and sort it. But I guess this final join and sort will be too much for the database again.

So, what would you experts recommend to optimize these queries as long as I cannot change the tables structure?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

300k rows is not a huge table. We frequently see 300 million row tables.

The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.

It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:

  • Easier to code.
  • Easier to optimize.
  • Easier to debug.
  • Easier to read.
  • Easier to maintain if/when you have to implement new requirements.

Number of Purchases

SELECT customer, COUNT(sale) AS number_of_purchases
FROM sales 
GROUP BY customer;

An index on sales(customer,sale) would be best for this query.

Last Purchase Value

This is the greatest-n-per-group problem that comes up frequently.

SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
 ON a.customer=b.customer AND a.dates < b.dates
WHERE b.customer IS NULL;

In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.

An index on sales(customer,dates,sale) would be best for this query.

If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.

SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
 ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL;

Total Amount of Purchases, When It Has a Positive Value

SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE sale > 0
GROUP BY customer;

An index on sales(customer,sale) would be best for this query.

You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT() ignore NULLs, so you don't have to use a WHERE clause to exclude rows with sale < 0.


Re: your comment

What I have now is a table with fields year, quarter, total_sale (regarding to the pair (year,quarter)) and sale. What I want to gather is information regarding certain period: this quarter, quarters, year 2011... Info has to be splitted in top customers, ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?

Top Five Customers for Q4 2012

SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE (year, quarter) = (2012, 4) AND sale > 0
GROUP BY customer
ORDER BY total_purchases DESC
LIMIT 5;

I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.

Last Purchase for Customers with Total Purchases > 5

SELECT a.customer, a.sale as max_sale
FROM sales a
INNER JOIN sales c ON a.customer=c.customer
LEFT OUTER JOIN sales b
 ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL
GROUP BY a.id
HAVING COUNT(*) > 5;

As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.


These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to Vigges Developer Community for programmer and developer-Open, Learning and Share
...