The gimmick I used to attract reader responses was to end each column with a SQL programming puzzle. Ten years later, those two magazines were consolidated into Intelligent Enterprise. My SQL puzzles moved to some smaller publications and then finally faded away. Today, I throw out a puzzle or two on the www. There is not much for database programmers to test themselves against, except my little puzzle book.
I would often find my puzzles showing up in homework assignments because I was the only source that teachers knew about for SQL problems. I would then get an e-mail from a lazy student wanting me to do his homework for him, unaware of the source of the assignment. Back in those early days, the de facto standard was SQL, and the SQL standard was a design goal for the database vendors.
Today, most vendors have gotten most of SQL into their products. SQL products were expensive and the best ones lived on mainframes. The Internet is full of newsgroups where you can get help for particular products.
This collection of puzzles includes the original puzzles, so that the original readers can look up their favorites. But now many of them have new solutions, some use the older syntax, and some use the newer features. Many of the original solutions have been cooked by other people over the years.
The original book contained 50 puzzles; this edition has 75 puzzles. In the first edition, I tried to organize the puzzles by categories rather than in chronological order or by complexity. This edition, I have given up my informal category scheme because it made no sense. I have tried to credit the people involved with each problem, but if I missed someone, I apologize.
Acknowledgments, Corrections, and Future Editions I will be glad to receive corrections, new tricks and techniques, and other suggestions for future editions of this book. Send your ideas to or contact me through the publisher, Morgan Kaufmann. Special thanks to Richard Romley of Smith Barney for cooking so many of my early puzzles, all the people on CompuServe and SQL newsgroups who sent me e-mail all these years, and the people who are posting on the newsgroups today I used your newsgroup handles, so people can search for your postings.
This little exercise is important because SQL is a declarative language and you need to learn how to specify things in the database instead of in the code.
For example, the federal government runs its fiscal year from October 1 until the end of September. Answer 1 1. The related Absenteeism row must be either explicitly or implicitly deleted as well. You could replicate the above deletion for the Absenteeism table. The current situation is that if you want to spend your weekends being sick, that is fine with the company. This is not a very nice place to work. If an employee reports in absent on Friday of week number 1, all of week number 2, and just Monday of week number 3, the UPDATE will catch only the five days from week number 2 as long-term illness.
The Friday and Monday will show up as sick days with severity points. I would avoid problems with weekends by having a code for scheduled days off weekends, holidays, vacation, and so forth that carry a severity point of zero.
A business that has people working weekend shifts would need such codes. This same trick would also prevent you from losing scheduled vacation time if you got the plague just before going on a cruise. While I agreed that I left out the aging on the dates missed, I will argue that it would be better to have another DELETE statement that removes the year-old rows from the Absenteeism table, to keep the size of the table as small as possible.
Answer 3 Another useful tool for this kind of problem is a Calendar table, which has the working days that can count against the employee. In the 10 years since this book was first written, this has become a customary SQL programming practice.
Holidays and weekends would carry the same Julian number as the preceding workday. Medal came up with this nifty little problem many years ago.
Anesthesiologists administer anesthesia during surgeries in hospital operating rooms. Information about each anesthesia procedure is recorded in a table. This is not a mistake. Anesthesiologists, unlike surgeons, can move from one operating room to another while surgeries are underway, checking on each patient in turn, adjusting dosages, and leaving junior doctors and anesthesia nurses to monitor the patients on a minute-to-minute basis. There is a sliding scale for remuneration for each procedure based on the maximum count of simultaneous procedures that an anesthesiologist has underway.
The higher this count, the lower the amount paid for the procedure. For example, for procedure 10, at each instant during that procedure Dr. Baker counts up how many total procedures in which he was concurrently involved.
This maximum count for procedure 10 is 2. The problem then is to determine for each procedure over its duration, the maximum, instantaneous count of procedures carried out by the anesthesiologist. We can derive the answer graphically at first to get a better understanding of the problem.
The lower graph Instantaneous Count of In-Progress Procedures shows how many procedures are underway at each moment. It helps to think of a slide rule hairline moving from left to right over the Gantt chart while each procedure start or end is plotted stepwise on the lower chart. We can see in Example 1 by inspection that the maximum is 2. Example 1—Dr. Baker, Proc 10 Example 2 shows a more complex set of overlapping procedures, but the principle is the same.
The maximum, which happens twice, is 3. Example 2—Dr. Dow, Proc 30 Note that the correct answer is not the number of overlapping procedures but the maximum instantaneous count. The puzzle is how to do this for each procedure using SQL.
The NOT condition eliminates procedures that do not overlap the subject procedure. Notice that the same anesthesiologist can start more than one procedure at the same time.
This series of backward-looking sums gives us the values represented by each step in the step charts. The idea is to loop through all procedures P1 ; for each procedure P1 you look at procedures P2 where their start time lies in the interval of procedure P2.
For each start time you found of P2, count the number of procedures P3 that are ongoing on that time. Then, take the maximum count for a certain procedure P1. How do you enforce the rule that you cannot add a reservation that has an arrival date conflicting with the prior departure date for a given room? That is, no double bookings. Answer 1 One solution requires a product to have the capability of using fairly complex SQL in the CHECK clause, so you will find that a lot of implementations will not support it.
Given cheap storage today, this might not be a problem, but redundancy always is. You will also need to find a way in the INSERT statements to be sure that you put in all the room days without any gaps. Only a few products have implemented it so far. This is a good trick for getting table-level constraints in a table on products without full SQL features. Perhaps I cannot see the forest for the trees here, but this seems like a real challenge to solve in an elegant manner that does not result in numerous circular references.
Although this puzzle seems to be an entire system, my question is whether or not there is a way to eliminate the apparent circular references within the table design phase.
Shame on him! The third document in the chain of succession is a second circle drawn around the first circle, and so forth. We show these nested sets with the next value, flattening the circles onto the number line starting at zero. You have to create the new document row in Portfolios, then the succession table entry. The value of next in the successor is one greater than the highest next value in the chain.
Nested sets!! My chains are not really all that linear. You can try it. No big deal with this schema. You did not say if the succession date column values have to be in increasing order, along with the next column values.
Is that true? Users in his shop usually end up using the wrong printer for printout, thus they decided to include a new table in the system that will derive the correct printer for each user at runtime. In the first case, a simple query can pull out one row and it works fine; in the second case, you get two rows and cannot use that result. Can you come up with a one-query solution? The name tells you that it should be unique, but it has multiple NULLs in it.
There is also another problem in the real world; you want to balance the printer loads between LPT4 and LPT5, so that one of them is not overused. The 'celko' user id qualified only for LPT4 because it falls alphabetically within that range of strings.
A user 'norman' is qualified only for LPT5. Careful choice of these ranges will allow you to distribute the printer loads evenly if you know what the user ids are going to be like. This is not true. This is a funny characteristic of the aggregate functions on empty sets. The bad news is that when I get back a row, this query is going to return the same printer over and over, instead of distributing the workload over all the unassigned printers.
You can add an update statement to replace the NULL with the guest user, so that the next printer will be used. If you went that route, you would just remove everything after the CASE statement from the second query above. This would mean, however, that you never record information about the printers in the database. If you drop or add printers, you have to change the query, not the database where you are supposed to keep this information. You want to write a query to produce a list of the available seats in the restaurant, set up in blocks by their starting and ending seat numbers.
Oh yes, the gimmick is that the database resides on a personal digital assistant and not a mainframe computer.
As part of the exercise, you must do this with the smallest amount of storage possible. Assume each seat number is an integer. The available seating query would be based on the flag. This would be 1, rows of one integer and one character for the whole restaurant and would work pretty well, but we have that minimal storage requirement.
Answer 1 The flag can be represented by a plus or minus on the seat number itself to save the extra column, but this is very bad relational practice; two attributes are being collapsed into one column.
But it does keep us at 1, rows. Answer 2 The second thought is to create a second table with a single column of occupied seating and to move numbers between the occupied and available tables. That would require a total of 1, rows in both tables, which is a little weird, but it leads to the next answer.
They act as sentries on the edge of the real seating to make the code easier. Delete each seat from the table as it is occupied and insert it back when it is free again. The Restaurant table can get as small as the two dummy rows if all the seating is taken, but no bigger than 1, rows 2, bytes if the house is empty.
This could be useful if the restaurant is broken into sections in some way. He was working on a pension fund problem. The longest period could be 60 years, with 1 month in each year. Some people might work four years and not the fifth, and thus not qualify for a pension at all. EID , z. ColorId , x. SKU , b. Price , b. YM ,x. YM AND x. ColorId and x. ColorId , z. SKU , z. SKU, a. ColorId AND x. Mins AND a. What values in which tables can we join together to get the data that we need? Ideally the data should be broken down such that we could easily graph the values to visualize two line plots of free vs paying users.
The x-axis would represent the date and the y-axis would represent the average number of downloads. Advanced SQL reporting questions for business analysts typically require some advanced SQL skills, as well as problem-solving ability.
Some example exercises include:. Given a table of job postings, write a query to breakdown the number of users that have posted their jobs once versus the number of users that have posted at least one job multiple times. This question is kind of complicated so it's helpful to break it into multiple steps. First let's visualize what the output would look like. We want the value of two different metrics, the number of users that have posted their jobs once versus the number of users that have posted at least one job multiple times.
What does that mean exactly? Well if a user has 5 jobs but only posted them once, then they are part of the first statement.
But if they have a 5 jobs and posted a total of 7 times, that means that they had to at least posted one job multiple times. To get to that point, we need a table with the count of user-job pairings and the number of times each job gets posted. Now we just need a way to differentiate the users that posted each job once from users that posted multiple times.
Let's go back to our example. We can deduce that if we take the sum of the number of rows for each user and the sum of the number of times each job is posted , they must be equal for the user to have posted each job only once , since we grouped by the user and the job id.
We're given a table of product purchases. Each row in the table represents an individual user product purchase. Write a query to get the number of customers that were upsold by purchasing additional products. Note: If the customer purchased two things on the same day that does not count as an upsell as they were purchased within a similar timeframe. Given a table of students and their SAT test scores, write a query to return the two students with the closest test scores with the score difference.
If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet. Analytics SQL interview questions, which you might hear referred to as SQL case studies , are some of the trickiest interview questions that you will face.
This is because they test two concepts:. Analytics SQL interview questions are designed to test how you would think about solving a problem, and are purposely left more ambiguous than other types of problems.
For example, an interviewer might ask you to write a SQL query given a few tables to understand which AB test variant won. But there might not even be any understanding of what winning actually means. Write a query to see which variant "won. First touch attribution is defined as the channel to which the converted user was associated with when they first discovered the website.
How do we figure out the beginning path of the Facebook ad and connect it to the end purchasing user? We need to do two actions: 1 subset all of the users that converted to customers and 2 figure out their first session visit to attribute the actual channel.
We can do that by creating a sub-query that only gets the distinct users that have actually converted. We're looking to understand the effect of a new Uber driver incentive promotion released in the past month on driver behavior. Write a query to figure out if the incentive worked as indicated. Let's say we want to build a naive recommender. Write an SQL query to create a metric to recommend pages for each user based on recommendations from their friends liked pages.
Let's solve this problem by visualizing what kind of output we want from the query. Then the max value on our metric will be the most recommendable page. The first thing we have to do is then to write a query to associate users to their friends liked pages. We can do that easily with an initial join between the two tables. Amazon released a new recommendation widget on their landing page. Write a query to determine the impact the recommendation widget made on user behavior for one metric.
Write a query to show the number of users, number of transactions placed, and total order amount per month in the year Assume that we are only interested in the monthly reports for a single year January-December. Given a table of transactions and products , write a query to return the product id and average product price for that id.
Only return the products where the average product price is greater than the average price of all transactions. Given a table of product subscriptions with a subscription start date and end date for each user, write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other user.
Write a query to get the post success rate for each day in the month of January Hint: Let's see if we can clearly define the metrics we want to calculate before just jumping into the problem. We want post success rate for each day over the past week. Additionally since the success rate must be broken down by day, we must make sure that a post that is entered must be completed on the same day. ETL stands for "Extract, Transfer, Load" and describes the process for which data flows between different data warehousing systems.
Extract does the process of reading data from a database. Transform converts data into a format that could be appropriate for reporting, analysis, machine learning, etc. In the interview, ETL tools and concepts are important to know for virtually all roles. The more difficult interview questions, however, will likely be focused and asked in data engineering, business intelligence, and related interviews.
If we set this query to run daily, it becomes a daily extract, transform, and load ETL process. Let's say you have a table with a billion rows. How would you add a column inserting data from the original source without affecting the user experience?
Before jumping into the question we should remember to clarify a few details that we can potentially get out of the interviewer. It helps to ask questions to understand and show that you can think holistically about the problem. Rushing too fast into a solution is a red flag for many interviewers. Due to an ETL error, the employees table instead of updating the salaries every year when doing compensation adjustments, did an insert instead. The head of HR still needs the current salary of each employee.
Let's say that we have two ETL jobs that feed into a single production table each day.
0コメント