MySQL: Should I use an INNER JOIN or a WHERE statement for a million+ records?

mslilafowler · October 2, 2018, 12:50pm

Hi

So let’s say we create a new table (which will eventually contain a million+ records):

CREATE TABLE duplicates (
id int(11) not null primary key auto_increment,
username varchar(256) not null,
time timestamp not null
);
INSERT INTO duplicates VALUES
(null, 'admin', '2018-08-02 12:30:40'),
(null, 'admin', '2018-08-03 11:20:40'),
(null, 'admin', '2018-09-02 00:30:40'),
(null, 'admin', '2018-09-02 12:30:40'),
(null, 'admin', '2018-10-02 12:30:40'),
(null, 'bill', '2018-08-02 12:30:40'),
(null, 'bill', '2018-08-03 11:20:40'),
(null, 'bill', '2018-09-02 00:30:40'),
(null, 'bill', '2018-09-02 12:30:40'),
(null, 'bill', '2018-10-02 12:30:40'),
(null, 'ryan', '2018-08-02 12:30:40'),
(null, 'ryan', '2018-08-03 11:20:40'),
(null, 'ryan', '2018-09-02 00:30:40'),
(null, 'ryan', '2018-09-02 12:30:40'),
(null, 'ryan', '2018-10-02 12:30:40');

Then to query the latest distinct record, I am using one of the statements below.
So the question is, which of the two statements below should be used for this purpose?

SELECT * 
FROM duplicates as t1
INNER JOIN (SELECT MAX(id) as id FROM duplicates GROUP BY username) as t2
ON t1.id = t2.id;

or:

SELECT * 
FROM duplicates 
WHERE id IN (SELECT MAX(id) FROM duplicates GROUP BY username);

The table might get up to a million+ rows, so speed & efficiency is necessary. After reading up a bit I assumed that the INNER JOIN was faster, but I have no explanation to back that assumption. Can anyone clarify this?

astonecipher · October 2, 2018, 1:18pm

I would wonder why? I mean, you are planning on this before you even start, so it isn’t to solve a problem, but to introduce one that shouldn’t be there to begin with.

Now, what is the point of the query? What is the point of the table?

mslilafowler · October 2, 2018, 2:29pm

Sorry if I didn’t mention it before, but I’ve already got a table with about 5200+ record sets (and that’s only the beginning). I used the above example for simplicity only.

What I’m asking is, if my table eventually reaches a large number like a million record sets, which query should I use to get the result fastest?

The record sets are going to be displayed on a web interface, kind of like a dashboard with all the records in your face.

To clarify, the point of the query is to select all the records from a MySQL database and display it in an HTML table for a user admin to monitor all his client’s activity.

astonecipher · October 2, 2018, 3:10pm

Then one issue is,

Will only retrieve a single users records. The IN clause isn’t even needed. If you want all of the records, you need to remove the max aggregate and just use the query. However, you will want to look into pagination, as you can easily take out your allotted memory with a large enough dataset being returned.

mslilafowler · October 2, 2018, 4:25pm

Actually both the queries give a nearly identical result…You could test it out and see. See this fiddle. It returns each user only once (which is the expected result).
The only question here is whether the INNER JOIN or WHERE statements execute faster…And I have no idea how to test the speed and efficiency of a query

astonecipher · October 2, 2018, 4:39pm

The where clause will be faster. Look at your fiddle results and it allows you to see the execution plan for both as well as the time it took.

astonecipher · October 2, 2018, 4:43pm

And further tweaking produced the following results and times. And it can be further modified,

SELECT DISTINCT * 
FROM duplicates 
GROUP BY username

mslilafowler · October 2, 2018, 5:13pm

Ah alright, that’s a handy feature on SQLfiddle. Tho I wish it was more detailed.

Btw, the results returned from your edited code is not the results I need… I explicitly selected the max(id) because I want the latest distinct record. The results above give the ID’s 1, 6 & 10, whereas the expected results are supposed to be 5, 10 & 15.

Thank you for your time tho, I do appreciate it