This topic is locked

Removing Duplicates after Search/Advance Search but Before Export

7/7/2009 12:47:30 PM
PHPRunner General questions
D
dblack6047 author

Greetings Everyone,
I am an mid-level user of PHPRunner and have been using it steadily since late last year. (BTW... Excellent Product!)
Usually, I am not the person to ask questions as I can figure out a problem myself or find an answer somewhere within the forums but this time I think I have run up against a brick wall on this one:
What I have is a large database with a single table (1.2 million records) which contains (File Names, File Categories, MD5, SHA1 and SHA256 hash values).
What I am attempting to do is speed up the removal of Duplicates hash values that is ultimately Exported out by the customer in CSV format and other formats. The current process that I have in place now is to perform a small SQL query that removes duplicates during the customer's Search and Advance Search process where I have a small SQL query that removes duplicates (based on hash value) which I have added to the "Edit SQL Query" screen.
As a result and based on the unique customer searches, it can take a very LONG time to present the results to the customer on the List page before they have the opportunity to Export out the results (non-duplicates) to a CSV and other formats. Remember, this is 1.2 million records that is initially searched and then duplicates removed and I have just about everything already Indexed to allegedly speed up the process.
What I am now thinking is to have a regular Search and/or Advance Search without the remove duplicate SQL query but instead have somewhere within the Export process the same or similar query to remove duplicates based on the customer's initial search results. I am thinking if they have already found 50,000 records that it is simpler to query removal of those duplicates prior to Exporting the results rather than removing duplicates from the initial 1.2 million records in the beginning.
Based on this line of thinking is this something that can be done by maybe adding an Event somewhere within the "Export page: Before SQL query" ?
I suspect that the List page results are itself a temporary table (kind of like a View) that can have Duplicates removed during the Export process. I could be wrong and if so how would I go about removing duplicates prior to Exporting?
P.S. Removing all duplicates within the database long before customers query the database is not an option at this point based on the various File Categories and the cross-category searching by the customer.
Any assistance would be greatly appreciated.
David

A
alang 7/7/2009

Is it possible to avoid putting duplicates in the table in the first place using unique index?

D
dblack6047 author 7/7/2009

Is it possible to avoid putting duplicates in the table in the first place using unique index?


Unfortunately not in this instance. I'm really needing a way to remove duplicates from the List page results as the customer begins to Export the results (minus duplicates) if they so choose which is most of the time but unfotunately not all of the time.

Sergey Kornilov admin 7/7/2009

I think it's a matter of adding DISTINCT to SQL query on the export page.
Here is the sample code that you can place to 'After table initialized' event:

if (substr($_SERVER["SCRIPT_NAME"],strrpos($_SERVER["SCRIPT_NAME"],"/")+1)=="Cars_export.php")

$gsqlHead="SELECT DISTINCT ID, Make, Model, YearOfMake, Horsepower, EPACity, EPAHighway, Price";

D
dblack6047 author 7/7/2009

I think it's a matter of adding DISTINCT to SQL query on the export page.

Here is the sample code that you can place to 'After table initialized' event:


I am just a little lost on the above coding but I can see that it is just an example but not an exact use for my situation. I am assuming that what you are saying is that I could make some simple statement that says "IF search_engine_export.php is being requested THEN gsqlHead="SELECT DISTINCT hash_value"; (FYI... hash_value is the MD5 field which I am wanting to make DISTINCT and subsequently have exported by the customer)
Is this correct? If so, what is a simple IF statement that I could use and place within the 'After Table Initialized' event that says IF search_engine_export.php THEN gsqlHead="SELECT DISTINCT hash_value";
Many Thanks in advance!

Sergey Kornilov admin 7/7/2009

When you open 'After table initialized' event in event editor you can see a hint area with variables like $gsqlHead.
Paste my code there, use correct name of export page, paste line with $gsqlHead from the hint area and insert DISTINCT after SELECT.
This is it.

D
dblack6047 author 7/8/2009

Unfortunately DISTINCT is not working as I have far more than 20 fields and I am only wanting DISTINCT on the MD5 field which I have named in my database table as "Hash_Value". To use DISTINCT I would have to create a more cpu intensive, lengthy and complicated command such as using SELECT MAX (Name), MAX (File_Size), Hash_Value, MAX (Created_Date), MAX (Written_Date) etc. etc and then GROUP BY at the tail end.
Could I use something more like...
if (substr($_SERVER["SCRIPT_NAME"],strrpos($_SERVER["SCRIPT_NAME"],"/")+1)=="search_engine_export.php")

$gsqlHead="SELECT * FROM search_engine WHERE 1 GROUP BY Hash_Value";
I tried this initially but when I tried to export the results of my advance search query it threw up an error. I suspect I have to break down the last line as I have it all starting after "$gsqlHead=" which may be wrong.

Sergey Kornilov admin 7/8/2009

If you think GROUP BY gives you better results than DISTINCT you should definitely proceed that way.
Check that event hint area for other variables you may want to override like $gsqlFrom;

D
dblack6047 author 7/8/2009

I've spent last night and this morning looking at the event hints but I cannot find the exact title lines that I need for each sql statement. I mean I have the following that I listed above:
$gsqlHead="SELECT FROM search_engine WHERE 1 GROUP BY Hash_Value";
but obviously this does not work. Do I need to change it to something more in line with:
if (substr($_SERVER["SCRIPT_NAME"],strrpos($_SERVER["SCRIPT_NAME"],"/")+1)=="search_engine_export.php")

$gsqlHead="SELECT
";

$gsqlFrom="FROM search_engine ";

$gsqlWhereExpr="WHERE 1 GROUP BY hash_Value";
I tried this also using the Event hints provided in the product Help Manual but something is still not right as I get a SQL syntax error on the screen when I try to Export the records from the "search_engine_export.php" process. Can you tell what looks incorrect that I am putting into the Event?

Sergey Kornilov admin 7/8/2009

dblack6047,
your code doesn't look good as GROUP BY cannot be added to WHERE clause.
I recommend to post your application to Demo Account (use 'Demo Account' button on the last screen in program). Then open a ticket at http://support.xlinesoft.com sending your Demo Account URL for investigation. We also need to see SQL query that you expect to be executed on the export page.

D
dblack6047 author 9/15/2009

Sergey,
After nearly two months I finally figured out the exact code. The exact code needed under "Table Events" and then within "After Table Initialized" which allowed me to remove duplicates (based on MD5 hash column of values) once the customer selected from the List Page the option to Export the data shown on the List Page was:
/ here goes EVENT_INIT_TABLE event

if (substr($_SERVER["SCRIPT_NAME"],strrpos($_SERVER["SCRIPT_NAME"],"/")+1)=="search_engine_export.php")

{

$gsqlHead="SELECT Name, MD5, SHA_1, SHA_256";

$gsqlFrom="FROM search_engine ";

$gsqlWhereExpr="1 ";

$gsqlTail="GROUP BY MD5 ";

}
I thought this would be best to provide the above to others in the Forum in case they could pull information from above code. Good Luck!