Import Excel spreadsheet into several tables

I have some Excel spreadsheets that I need to convert into a MySQL database. This will be an ongoing task.

A typical spreadsheet will have between 10 and 15 thousand rows. One column contains exactly the same data throughout, so it is an ideal candidate for a single row in its own MySQL table. Another column will have anywhere from 1 to 20 rows with identical data, so that is another candidate for its own table. Looking at the data, another 2 columns would be best split into yet another table, with the remaining data going into a final table.

What is the best way of processing this data?

I see two options:

1: Read the spreadsheet in one row at a time and process the record into the tables.
2: Read the spreadsheet into an array and then process the array.

Which is the best option, or is there another way that I have missed? I particularly want to minimise the impact on other users of the website whilst this data processing is taking place.

You could do this with prepared statements. A prepared statement will execute faster than normal statements, as the query is parsed once by the SQL server and only the values change on each execution.

Pseudo-ish code:
[php]// prepare the statement once; only the bound values change on each execute
$sth = $pdo->prepare('INSERT INTO table (column1, column2, column3) VALUES (?, ?, ?)');

while ($row = readLine()) { // readLine() stands in for however you fetch each spreadsheet row
    $sth->execute(array(
        $row['column1'],
        $row['column2'],
        $row['column3'],
    ));
}[/php]

[hr]

Using the native MySQL LOAD DATA INFILE would be much, much faster though. Example here
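For reference, a LOAD DATA statement for a CSV export might look roughly like this. This is only a sketch: the staging table, column names, and file path are made-up placeholders, and LOCAL loading has to be enabled on both the client and the server.

[php]// rough sketch only - staging table, columns, and file path are invented placeholders
// requires PDO::MYSQL_ATTR_LOCAL_INFILE => true when creating the PDO connection
$pdo->exec("
    LOAD DATA LOCAL INFILE '/path/to/spreadsheet.csv'
    INTO TABLE staging
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
    (column1, column2, column3, column4)
");[/php]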

So you are recommending my option 1 then JimL?

This will be going into a WordPress database, so I will be using the WordPress db commands rather than PDO.
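Something along these lines, I expect. This is just a sketch; the table and column names are made up:

[php]// rough sketch with WordPress's $wpdb instead of PDO - table and column names are placeholders
global $wpdb;

while ($row = readLine()) { // readLine() again stands in for reading one spreadsheet row
    $wpdb->insert(
        $wpdb->prefix . 'my_table',
        array(
            'column1' => $row['column1'],
            'column2' => $row['column2'],
            'column3' => $row['column3'],
        )
    );
}[/php]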

Generally I would recommend using MySQL's LOAD DATA; it's usually many times faster than manually inserting stuff.

How this should be implemented kinda depends on how it’s going to be used.

My reading of MySQL's LOAD DATA is that it puts all the CSV data into one table, whereas I want to split it into 4 tables as I said in my OP. That is, unless I am misunderstanding MySQL's LOAD DATA?

There will be 3 different types of front end search on the data in the tables. Once loaded it is unlikely to need updating (only if a typo is subsequently found).

Just curious. Are you loading this from a CSV file?

I have been using a package that can read the live .xls file and grab all the data.
You have to parse through the rows and save the data wherever you wish.
It is called PHPExcel and works great. It is free. It can read and write Excel sheets.
It works with PDFs, too, if you set it up correctly.

Currently, I am helping a country with their hydroplane boat racing clubs. The country needs a very
complicated Excel sheet that contains each club in different states listed with all their members and
boats and events. Quite a project, but, it was extremely simple to create the sheet from a database
just by a few loops and a lot of formatting code to draw boxes around areas and change box colors.

Anyway, just wanted you to know that library is free and out there… You can skip the saving as a CSV
file and read the XLS directly if you want to…
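If you do go that route, reading the sheet is only a few lines. This is just a rough sketch; the include path depends on how you installed the library, and the file name is made up:

[php]// rough sketch with PHPExcel (since superseded by PhpSpreadsheet) - file name is made up
require_once 'PHPExcel/IOFactory.php';

$workbook = PHPExcel_IOFactory::load('spreadsheet.xls');  // reads the .xls directly
$rows = $workbook->getActiveSheet()->toArray();           // all cells as a plain PHP array

foreach ($rows as $row) {
    // $row[0], $row[1], ... are the cell values for this spreadsheet row
}[/php]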

I don't see the advantage of PHPExcel over just coding it myself, as I am only reading from the Excel CSV and then loading into multiple MySQL tables, unless I have missed something.

No, not at all. If the user has to export the data as a CSV and then upload that, it is just fine your way.
Just wanted you to be aware that it can also be done directly from the XLS in case you didn’t know that.

It was just FYI…

Arrays are usually faster, but it depends on how you need to break up the data. You mentioned four cols with
15k rows. That is only 60k cells, which is not really much speed-wise. Either way would work. JimL showed a very
simple way to load three of the four cols of data into a DB table; change it to four cols. Then the data will
be in one table and you know the layout, so just pull it out as needed. If you need col1, which you wanted
to put into a separate table, just query JimL's table and only pull out column1. Did that make sense?

So, no array, directly to DB table and then query that spreadsheet table as needed to access the data.

Not sure, but I think that is what you are looking for. Now, if you need to keep the data in separate tables,
such as if you need to keep many spreadsheets' data stacked in a table, you can alter the code to prepare
the extra tables and change the execute array to handle just the parts you need. Lots of different ways to
handle this depending on what you need for a final output of table(s)…

Good luck!

Thanks guys for all the comments and suggestions.

I have attached a small segment of a spreadsheet. Hopefully this will give a better idea of what I am trying to achieve. As you can see, there is a lot of duplicated data that is ripe for normalising into a relational database.

The coloured columns indicate what I think the 4 tables should be and the dark blue rows are duplicated data that I need to ignore.

I do not intend to manually process these files but have a couple of trusted users upload them and then automatically process them into the tables.


Well, again, it might be a lot easier to just use JimL’s routine to read the file into one temporary table.
And then let queries steal whatever you need into the other tables.
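For example, pulling the distinct values out of that temporary table could look roughly like this. The table and column names are just made up to show the idea:

[php]// sketch only: copy distinct values from a staging table into the split-out tables
// (all table and column names here are invented placeholders)
global $wpdb;

$wpdb->query("INSERT INTO {$wpdb->prefix}players (player_name)
              SELECT DISTINCT player_name FROM {$wpdb->prefix}staging");

$wpdb->query("INSERT INTO {$wpdb->prefix}teams (team_name)
              SELECT DISTINCT team_name FROM {$wpdb->prefix}staging");

// the remaining rows can then be inserted referencing those tables by id,
// joining the staging table back on the player and team names[/php]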

So, the question is: are you planning on taking several of these and piling the data on top of each other?
Or do you just re-read the one file and rebuild all of the tables each time?

What I am asking is how often do you read this file, and when you do, does it replace all previous data?
Or is it added on each time? You gave us sample data, which helps a lot, but no titles on the cols.
So, we have no way to know what the data is except the obvious ones like names and teams.

I think by your colours that you don't want matches that are just listed with the players swapped. That
makes sense. So, that can be handled in one small query to delete dups if you swap the player names.
If this is all in one big DB table, you can use simple queries to fix it up and remove the junk. Then, all the
data is ready and no extra tables needed.

Why take the time to separate it into four tables? Is there a reason for that? If the data is there in one
table already, just pull out what you need in your query, no need to sort into four tables… Waste of time.

Again, how often is this to be run and does it replace the data or add onto it?

This will be multiple new files added to the database over time. The initial load, if I don't de-duplicate and split, will be approaching 1 million rows, and then new data will be added regularly. Data will need updating very rarely, only if an error is found.

As I said originally there will be 4 different regular queries that will return a different subset of the data.

Well, a million items with four cols is not much for MySQL to handle. Not an issue on that small DB.

I still would use JimL's routine, altered to grab the four entries and weed out the unneeded ones.
We can help you do that if you take the pix you sent and list what you would get out of it.

So, take the pix. Make a col name for each of them and tell us what they are. Mark the cols you do not need at all
as, let's say, "NA" (not applicable). You will need one DB table with the same field names. Then, write
JimL's code with the names in place. At that point, we can explain how to skip the dups, etc…

Thanks for the offer. I know how to do what you are suggesting; it's just that, with everything I have learnt over the years, I didn't think it was the best way. It isn't duplicate columns but duplicate rows that I am wanting to eliminate, and I need all the columns, not just 4 of them.

Well, I have done exactly that in a knowledge scraping system that I created for Google searches.

(I didn’t want to rerun the process on URL’s already checked.)

So, I just did a simple loop thru the list and did an inside query to check for the duplicates before
handling the process of grabbing the data. Works nice. It does mean that there is an extra query
for each of the rows. On my tests, which are not a million rows, maybe 5 thousand, there was not
any big added delay in the processing. I would guess with a million rows, it might take a bit of time
to check for the dups.

To speed that up a lot, you can use an INSERT … SELECT … WHERE NOT EXISTS query. Doing that lets the
MySQL system handle the checking for duplicates, and then you can just use a loop similar to JimL's
version to load all the data. Let the database system handle the dup checking; it should increase the speed
by many factors compared with doing the compares yourself. Since I decided to alter the code I mentioned to do
this, I found this example. It has nothing to do with your code or mine, but you will get the idea…

INSERT INTO AdminAccounts (Name)
SELECT t1.Name
FROM Matrix t1
WHERE NOT EXISTS (SELECT id
                  FROM AdminAccounts t2
                  WHERE t2.Name = t1.Name)

So, you only insert rows WHERE NOT EXISTS matching items that could be duplicated, such as your two names.
Why this is good is that you need no extra PHP code to implement it; the database engine handles it and does the
dup checking for you. Should work out nicely for you… What do you think?
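Since you said you will be going through WordPress's $wpdb, running a de-duplicating insert like that could look roughly like this. Again, the table and column names are invented placeholders:

[php]// sketch only: run the de-duplicating INSERT ... SELECT through WordPress's $wpdb
// (table and column names are invented placeholders)
global $wpdb;

$wpdb->query("
    INSERT INTO {$wpdb->prefix}matches (player1, player2, result)
    SELECT s.player1, s.player2, s.result
    FROM {$wpdb->prefix}staging s
    WHERE NOT EXISTS (
        SELECT 1
        FROM {$wpdb->prefix}matches m
        WHERE m.player1 = s.player1
          AND m.player2 = s.player2
    )
");[/php]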
