database greenhorn (discuss.tchncs.de)
submitted 2 weeks ago* (last edited 2 weeks ago) by [email protected] to c/[email protected]
 

Hi my dears, I have an issue at work where we have to work with millions (~150 million) of product data points. We are using SQL Server because it was available in-house for development. However, with various tables growing beyond 10 million rows, the server becomes quite slow and wait/buffer time exceeds 7000 ms/sec, which is tearing down our complete setup of various microservices that continuously read, write, and delete from the tables. All the Stack Overflow answers lead to: it's complex, read a 2000-page book.

The thing is, my queries are not that complex. They simply go through the whole table to identify duplicates, which are then not processed further, because the processing takes time (which we thought would be the bottleneck). But the time saved by not processing duplicates now seems smaller than the time it takes to compare each batch against the SQL table. The other culprit is that our server runs on an HDD, which at ~150 MB/s read and write is probably at its limit.
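For concreteness, our duplicate check keys on a single ID column; I gather a unique index there would turn the whole-table comparison into an index seek, and SQL Server could even discard duplicates itself. A sketch with made-up table and column names (IGNORE_DUP_KEY is SQL Server specific):

```sql
-- Hypothetical schema: products deduplicated on product_id.
-- The unique index makes the duplicate check an index seek instead
-- of a full table scan; IGNORE_DUP_KEY = ON silently drops duplicate
-- rows on insert instead of raising an error.
CREATE UNIQUE INDEX ix_products_product_id
    ON products (product_id)
    WITH (IGNORE_DUP_KEY = ON);
```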

The question is: is there a wizard move to bypass any of my restrictions, or is a change to the setup and algorithm inevitable?

Edit: I know my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic is a lifetime of know-how in itself. Thanks for every bit of feedback.

[–] [email protected] 1 points 19 hours ago

First question: how many separate tables does your DB have? If fewer than, say, 20, you are probably in simple territory.

Currently about ~50. But like 30 of them are the result of splitting one big table on a common column like "country". In the beginning I assumed this amounts to the same thing as partitioning one large table?
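For comparison, what I now understand real partitioning to look like: the database still sees one logical table and handles the routing itself. A minimal PostgreSQL sketch with made-up columns:

```sql
-- One logical table, physically split by country. Queries,
-- constraints, and inserts all target "products"; the planner
-- touches only the relevant partition.
CREATE TABLE products (
    product_id bigint NOT NULL,
    country    text   NOT NULL,
    name       text
) PARTITION BY LIST (country);

CREATE TABLE products_de PARTITION OF products FOR VALUES IN ('DE');
CREATE TABLE products_us PARTITION OF products FOR VALUES IN ('US');
```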

Also, look at your slowest queries

The individual queries themselves don't take long per se. But due to the limitation of the HDD, SQL Server reads as much as possible from disk to go through a table, and given that there are now multiple connections all querying multiple tables, this leads to server overload. While I now see the issue with our approach, I hope that migrating from SQL Server to PostgreSQL on modern hardware, plus refactoring our approach in general, will give us a boost.
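Before we migrate I want to confirm the disk theory; as far as I know, SQL Server's cumulative wait stats should show IO waits (PAGEIOLATCH_*) on top if the HDD really is the bottleneck:

```sql
-- Top waits since the last restart; PAGEIOLATCH_* dominating
-- would confirm that reads from disk are the bottleneck.
SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```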

They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query?

Actually, no JOINs. The most "complex" query is an INSERT INTO with a WHERE NOT EXISTS check.
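Roughly like this, with made-up table and column names:

```sql
-- Insert only rows whose ID is not already present. Without an
-- index on product_id, this forces a scan of the target table
-- for every batch.
INSERT INTO products (product_id, name)
SELECT s.product_id, s.name
FROM   staging s
WHERE  NOT EXISTS (
    SELECT 1
    FROM   products p
    WHERE  p.product_id = s.product_id
);
```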

But thank you for your advice. I will incorporate the tips into our new design approach.