this post was submitted on 02 May 2025
19 points (100.0% liked)

database greenhorn (discuss.tchncs.de)
submitted 2 weeks ago* (last edited 2 weeks ago) by [email protected] to c/[email protected]
 

hi my dears, I have an issue at work where we have to work with millions (~150 million) of product data points. We are using SQL Server because it was available in-house for development. However, with various tables growing beyond 10 million rows the server becomes quite slow and wait/buffer time goes above 7000 ms/sec, which is tearing down our whole setup of microservices that continuously read, write and delete from the tables. All the Stack Overflow answers lead to: it's complex, read a 2000-page book.

The thing is, my queries are not that complex. They simply go through the whole table to identify any duplicates, which are then not processed further, because the processing takes time (which we thought would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare batches with the SQL table. The other culprit is that our server runs on an HDD, which at roughly 150 MB/s read and write is probably at its limit.

The question is: is there a wizard move to bypass any of my restrictions, or is a change in the setup and algorithm inevitable?

edit: I know that my question seems broad, but as I am new to database architecture I welcome any input and discussion, since the topic is a lifetime of know-how in itself. Thanks for all the feedback.

top 42 comments
[–] [email protected] 25 points 2 weeks ago* (last edited 2 weeks ago) (2 children)

All the Stack Overflow answers lead to: it's complex, read a 2000-page book.

This is an exceptionally good answer, and you're doing everything possible to avoid following it, when you could probably have been halfway through the book by now. Database administration is a profession, not a job. It requires specialized training to do well, and doing everything possible to avoid that training and knowledge won't help you one bit.

my queries are not that complex.

It doesn't matter. Your database is very complex.

they simply go through the whole table to identify any duplicates

You search 10 million records on every request and you wonder why it's slow?

is there a wizard move to bypass any of my restrictions, or is a change in the setup and algorithm inevitable?

No. Database administration is very difficult. Reading that 2000 page book is essential for setting up infrastructure to avoid a monolithic setup like this in the first place.

The other culprit is that our server runs on an HDD, which at roughly 150 MB/s read and write is probably at its limit.

lol wtf

Realistically, this setup is 10 years too old. How large is your database? Is there any reason why it can't be run in memory? 10 million rows isn't insurmountable. Full text with a moderate number of tables could be ~10 GB; no reason that can't be run in memory with Redis or another in-memory database, or by moving to a more modern in-memory solution like Dice.

Your biggest problem is the lack of deduplication and normalization in your database design. If it's not fixed now, it'll simply get worse YOY until it's unusable. Either spend the time and money now, or spend even more time and money later to fix it. 🤷‍♂️

tl;dr: RTFM.

[–] [email protected] 3 points 9 hours ago (1 children)

Realistically, this setup is 10 years too old

Thanks for this input. This was the winning argument for my boss for migrating to a modern server. While I admit that I see many flaws in our design, we are now working on refactoring our architecture and approach.

Thanks also to the numerous other answers pointing me in the right direction (hopefully).

[–] [email protected] 1 points 7 hours ago (1 children)

This was the winning argument for my boss for migrating to a modern server.

Exceptionally good news! Glad it's working out. Be sure to make a new post when you decide what you go with, I'm sure people here would enjoy hearing about your approach.

[–] [email protected] 1 points 7 hours ago

Well. For now the system is not yet running on the new hardware.

We are now pondering whether to migrate everything as it is to the new hardware and then optimize/refactor,

or to refactor first (or at least develop a plan) and then improve during migration....

Be sure to make a new post when you decide what you go with, I’m sure people here would enjoy hearing about your approach.

Nice to hear. Thanks. I will share updates.

[–] [email protected] 3 points 2 weeks ago (2 children)

Sort of harsh approach, but I get it.

Though I did learn the most when I had a lot of data and ran into performance issues.

Studying Postgres in that job was the absolute best part, I learned so much, and now I can't find a problem Postgres can't fix.

There was a running joke in my last office that I was paid to promote Pg because every time MySQL fucked something up, I would bring up how Postgres would solve it. I even did several presentations.

Then we migrated to Postgres and suddenly everything is stable as a rock, even under worse conditions and way more data.

I just love Postgres so much.

[–] [email protected] 2 points 9 hours ago

Sort of harsh approach, but I get it.

Yeah. To me it feels like we used a power tool as a hammer. Brute force in the wrong way.

As an update: I was able to convince my people to migrate to a modern server, and along with that we are also switching from SQL Server to PostgreSQL. During this migration we will also try to refactor our workflow, since it was flawed by design.

So, many thanks for the input.

[–] [email protected] 3 points 2 weeks ago (1 children)

Sometimes it feels like postgres is cheating (in a good way)

[–] [email protected] 2 points 2 weeks ago

Compared to MySQL most definitely.

Granted, Oracle has pushed some fresh air into it, but still it has a long way to go.

[–] [email protected] 8 points 2 weeks ago (1 children)
  • spend time to generate/optimize your indexes
  • faster storage/cpu/ram for your rdbms
  • get the data needed by specific services into the service, only get the data from a central place if you have to (spinning up a new instance, another service changes state of data you need, which is a warning sign in itself that your architecture is brittle...)
  • faster storage/cpu/ram
  • generate indexes
  • 2nd level cache shared between services
  • establish a faster datastore for often requested data that's used by multiple services (that might be something like redis, or another rdbms on beefier hardware) - see the sketch after this list
  • optimize queries
  • generate indexes
  • faster storage/cpu/ram
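A minimal sketch of that "faster datastore for often requested data" idea, assuming Python with redis-py and made-up key/ID names: keep the set of already-processed product IDs in Redis so services can check membership without hitting the big SQL table every time.

    import redis

    # hypothetical connection details; adjust to your environment
    r = redis.Redis(host="localhost", port=6379, db=0)

    PROCESSED_SET = "processed_product_ids"  # a Redis set used as a fast membership check

    def mark_processed(product_id: str) -> None:
        # call this once a product has landed in the "final" table
        r.sadd(PROCESSED_SET, product_id)

    def filter_new(batch_ids: list[str]) -> list[str]:
        # keep only the IDs that have not been processed yet
        return [pid for pid in batch_ids if not r.sismember(PROCESSED_SET, pid)]

Whether this is worth it depends on how big the ID set is and how tolerant you are of the cache and the table drifting apart.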
[–] [email protected] 1 points 9 hours ago (1 children)

First of all, many thanks for the bullets. It's good to have some guidance on where to start.

2nd level cache shared between services

I have read about this in relation to how FB does it. In general this means fetching from the DB and keeping it in memory to work with, right? So we accept that the cached data may be outdated to some extent?

faster storage/cpu/ram faster storage/cpu/ram faster storage/cpu/ram

I was able to convince management to put money into a new server (SSD thank god). So thank you for the emphasis. We are also migrating from SQL Server to PostgreSQL, and refactoring the whole approach and design in general.

generate indexes

How would indexes help me when I want to ensure that no duplicate row is added? Is this some sort of internal SQL constraint, or what is the difference compared to checking a given list of rows against an existing table (let's say on column id)?

[–] [email protected] 1 points 4 hours ago

I have read about this in relation to how FB does it. In general this means fetching from the DB and keeping it in memory to work with, right? So we accept that the cached data may be outdated to some extent?

Correct, introducing caching can result in returning outdated data for a while, which is usually not a huge deal. Those caches can get tricky, but they should take pressure off your DB if your scenario is read-heavy, which is often the case. Research existing caching solutions before running ahead and implementing something from scratch, especially if you need a cache distributed between multiple instances of your service. In the Java world that would be something like Infinispan, but your ecosystem might offer better integration with other solutions.
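If it helps, here is a bare-bones sketch of a read-through cache with a TTL in Python (in-process only, with made-up names; for a cache shared between service instances you would back it with something like Redis instead of a dict):

    import time

    CACHE_TTL_SECONDS = 60  # how long we tolerate possibly stale data
    _cache: dict[str, tuple[float, object]] = {}  # key -> (timestamp, value)

    def cached_query(key: str, run_query):
        """Return a cached result if it is fresh enough, otherwise hit the DB."""
        now = time.time()
        entry = _cache.get(key)
        if entry is not None and now - entry[0] < CACHE_TTL_SECONDS:
            return entry[1]  # possibly slightly outdated, which is usually acceptable
        value = run_query()  # the expensive round trip to the database
        _cache[key] = (now, value)
        return value

    # usage: cached_query("product_count", lambda: run_sql("SELECT COUNT(*) FROM products"))
    # (run_sql and the query are placeholders)

The TTL is the knob for how much staleness you accept in exchange for fewer hits on the database.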

I was able to convince management to put money into a new server (SSD thank god). So thank you for the emphasis. We are also migrating from SQL Server to PostgreSQL, and refactoring the whole approach and design in general.

Having management on board is great, and the new hardware should help a lot. Migrating to another RDBMS sounds scary, but it's probably worth it if your organisation has more expertise with it.

generate indexes

They won't help you with your duplicates; they will help speed up your reads, but could slow down writes. Building a good index is not trivial, but nothing is when it comes to performance tuning a database; it's tradeoff after tradeoff. The best way to handle identical rows of data is usually to not write them in the first place, but I don't know your system nor its history; maybe there is or was a good reason for its current state.
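Since you are moving to PostgreSQL anyway, one way to "not write them" is to let the database enforce it: a unique index plus INSERT ... ON CONFLICT DO NOTHING silently skips duplicates instead of you hunting them down afterwards. A rough sketch with psycopg2 and made-up table/column names:

    import psycopg2

    conn = psycopg2.connect("dbname=products user=app")  # hypothetical connection string

    with conn, conn.cursor() as cur:
        # one-time setup: a unique index on whatever identifies a duplicate
        cur.execute("""
            CREATE UNIQUE INDEX IF NOT EXISTS ux_products_source_id
            ON products (source_id);
        """)

        # inserts that collide with an existing source_id are simply skipped
        cur.execute("""
            INSERT INTO products (source_id, title, description)
            VALUES (%s, %s, %s)
            ON CONFLICT (source_id) DO NOTHING;
        """, ("ABC-123", "some title", "some description"))

That turns the dedupe from a full-table comparison into an index lookup the database does for you on every insert.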

[–] [email protected] 7 points 2 weeks ago (1 children)

Indexes and pagination would be good starts

[–] [email protected] 1 points 9 hours ago (1 children)

With pagination you mean splitting the query into chunks when comparing a given data set with a whole table?

[–] [email protected] 1 points 8 hours ago (2 children)

yes? maybe, depending on what you mean.

Let's say you're doing a job and that job will involve reading 1M records or something. Pagination means you grab N records at a time, say 1000, in multiple queries as the work gets done.
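For example (a Python-flavoured sketch with invented table/column names), instead of OFFSET-style pages you can paginate by remembering the last ID you saw, which stays fast even deep into a big table:

    def read_batches(cur, batch_size=1000):
        """Yield the table in chunks: everything after the last id we saw."""
        last_id = 0
        while True:
            cur.execute(
                "SELECT id, payload FROM products WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, batch_size),
            )
            rows = cur.fetchall()
            if not rows:
                break
            yield rows
            last_id = rows[-1][0]  # resume after the last id of this batch

(LIMIT and %s placeholders are PostgreSQL/psycopg2 flavoured; on SQL Server it would be OFFSET ... FETCH or TOP with ? placeholders.)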

Reading your post again to try and get context, it looks like you're identifying duplicates as part of a job.

I don't know what you're using to determine a duplicate, if it's structural or not, but since you're running on HDDs, it might be faster to get that information into ram and then do the job in batches and update in batches. This will also allow you to do things like writing to the DB while doing CPU processing.

BTW, your hard disks are going to be your bottleneck unless you're reaching out over the internet, so your best bet is to move that data onto an NVMe SSD. That'll blow any other suggestion I have out of the water.

BUT! there are ways to help things out. I don't know what language you're working in. I'm a dotnet dev, so I can answer some things from that perspective.

One thing you may want to do, especially if there's other traffic on this server:

  • use WITH (NOLOCK) so that you're not blocking other reads and writes on the tables you're looking at
  • use pagination, either with windowing or LIMIT/SKIP to grab only a certain number of records at a time

Use a HashSet (this can work if you have record types) or some other method of equality that's property based. Many Dictionary/HashSet types can take some kind of equality comparer.

So, what you can do is asynchronously read from the disk into memory and start some kind of processing job. If this job does also not require the disk, you can do another read while you're processing. Don't do a write and a read at the same time since you're on HDDs.

This might look something like:

    int offset = 0, limit = 1000;

    var task = readBatchFromDb(offset, limit);   // kick off the first read
    var result = await task;

    // if you only care about equality and not the data after use,
    // you can just store the hash codes instead
    var data = new HashSet<YourType>(new YourTypeEqualityComparer());

    while (result.Count > 0)
    {
        offset = advance(offset);
        task = readBatchFromDb(offset, limit);   // start the next read batch

        var dataToWork = result.Where(x => !data.Contains(x)).ToList();   // don't rework any objects
        data.UnionWith(result);

        var dataToWrite = doYourThing(dataToWork);

        // don't write while reading
        result = await task;

        await writeToDb(dataToWrite);   // so we never read and write at the same time
                                        // (lost optimization: no CPU work happens during this write)
    }

    // Let's say you can set up a read/write job queue to keep things busy
    // (readBatchFromDb, writeToDb, doYourThing, advance, YourType and
    //  YourTypeEqualityComparer are placeholders for your own code)
    abstract class IoJob
    {
        public sealed class ReadJob(int offset, int limit) : IoJob
        {
            public int Offset { get; } = offset;
            public int Limit { get; } = limit;
            public Task<List<YourType>> ReadTask { get; set; }
        }

        public sealed class WriteJob(List<YourType> data) : IoJob
        {
            public List<YourType> Data { get; } = data;
            public Task WriteTask { get; set; }
        }
    }

    // starts the IO for a job and hands the job back with its task attached
    IoJob executeJob(IoJob job)
    {
        switch (job)
        {
            case IoJob.ReadJob rj:
                rj.ReadTask = readBatchFromDb(rj.Offset, rj.Limit);
                return rj;
            case IoJob.WriteJob wj:
                wj.WriteTask = writeToDb(wj.Data);
                return wj;
            default:
                throw new InvalidOperationException();
        }
    }

    var jobs = new Queue<IoJob>();   // a queue (not a stack) so batches stay in order
    var seen = new HashSet<YourType>(new YourTypeEqualityComparer());

    jobs.Enqueue(new IoJob.ReadJob(offset, limit));
    offset = advance(offset);
    jobs.Enqueue(new IoJob.ReadJob(offset, limit));   // get the second read ready to start

    while (jobs.Count > 0)
    {
        var job = executeJob(jobs.Dequeue());   // kick off this job's IO

        if (job is IoJob.ReadJob rj)
        {
            var batch = await rj.ReadTask;
            if (batch.Count == 0) continue;

            offset = advance(offset);
            jobs.Enqueue(new IoJob.ReadJob(offset, limit));   // keep the pipeline fed

            var dataToWork = batch.Where(x => !seen.Contains(x)).ToList();   // don't rework any objects
            seen.UnionWith(batch);

            var dataToWrite = doYourThing(dataToWork);
            jobs.Enqueue(new IoJob.WriteJob(dataToWrite));
        }
        else if (job is IoJob.WriteJob wj)
        {
            await wj.WriteTask;
        }
    }

[–] [email protected] 2 points 7 hours ago (1 children)

BTW, your hard disks are going to be your bottleneck unless you’re reaching out over the internet, so your best bet is to move that data onto an NVMe SSD. That’ll blow any other suggestion I have out of the water.

Yes, we are currently in the process of migrating to PostgreSQL and to new hardware. Nonetheless, the approach we are using is a disaster, so we will refactor our approach as well. Appreciate your input.

I don’t know what language you’re working in.

All processing and SQL-related transactions are executed via Python. But that should not have any influence, since the SQL server is the bottleneck.

WITH (NOLOCK)

Yes, I have already considered this for the next update, since our setup can accept dirty reads - but I have not tested/quantified any benefits yet.

Don’t do a write and a read at the same time since you’re on HDDs.

While I understand the underlying issue here, I do not yet know how to control this, since we have multiple microservices connected to the DB which either fetch (read), write or delete from different tables. But to my understanding, since I am currently not using NOLOCK, such occurrences should be handled by SQL, no? What I mean is that during a process the object is locked, so no other process can interfere with the SQL object?

Thanks for putting this together, I will review it again tomorrow (Y).

[–] [email protected] 1 points 2 hours ago

Thanks for giving it a good read-through! If you're getting on NVMe SSDs, you may find some of your problems just go away. The difference could be insane.

I was reading something recently about databases or disk layouts meant for business applications vs. ones meant for reporting, and one difference was that on disk they were laid out either by row or by column.

[–] [email protected] 1 points 8 hours ago (1 children)

That was a bit of a hasty write, so there's probably some issues with it, but that's the gist

[–] [email protected] 0 points 7 hours ago (1 children)
[–] [email protected] 1 points 2 hours ago

Thanks haha

[–] [email protected] 7 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

What? Problems like this usually come down to some missing indexes. Can you view the query plan for your slow queries? See how long they are taking? IDK about SQL Server, but usually there is a command called something like ANALYZE that breaks down a query into the different parts of its execution plan, executes it, and measures how long each part takes. If you see something like "FULL TABLE SCAN" taking a long time, that can usually be fixed with an index.
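On PostgreSQL, for example, that command is EXPLAIN ANALYZE; a rough sketch of running it from Python with psycopg2 (the query and all names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=products user=app")  # hypothetical connection string
    with conn.cursor() as cur:
        # EXPLAIN ANALYZE executes the query and reports how long each plan node took
        cur.execute("EXPLAIN ANALYZE SELECT * FROM products WHERE source_id = %s", ("ABC-123",))
        for (line,) in cur.fetchall():
            print(line)
    # a 'Seq Scan on products' node with a large runtime is the full-table-scan smell
    # that usually points at a missing index (SQL Server shows the same thing in its
    # execution plans as a table or clustered index scan)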

If this doesn't make any sense to you, ask if there are any database gurus at your company, or book a few hours with a consultant. If you go the paid consultant route, say you want someone good at SQL Server query optimization.

By the way, I think some people in this thread are overestimating the complexity of this type of problem or are maybe unintentionally spreading FUD. I'm not a DB guru, but I would say that by now I'm somewhat clueful, and I got that way mostly by reading the SQLite docs including the implementation manuals over a few evenings. That's probably a few hundred pages but not 2000 or anything like that.

First question: how many separate tables does your DB have? If less than say 20, you are probably in simple territory.

Also, look at your slowest queries. They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query? If just one, you probably need an index; if two or three, it might take a bit of head scratching; and if 4 or more, something is possibly wrong with your schema or how the queries are written and you have to straighten that out.

Basically from having seen this type of thing many times before, there is about a 50% chance that it can be solved with very little effort, by adding indexes based on studying the slow query executions.

[–] [email protected] 1 points 9 hours ago

First question: how many separate tables does your DB have? If less than say 20, you are probably in simple territory.

Currently about ~50. But around 30 of them are the result of splitting tables by a common column like "country". In the beginning I assumed this leads to the same result as partitioning one large table?

Also, look at your slowest queries

The different queries themselves don't take long because of the query per se, but due to the limitation of the HDD: SQL reads as much as possible from disk to go through a table, and given that there are now multiple connections all querying multiple tables, this leads to a server overload. While I now see the issue with our approach, I hope that migrating from SQL Server to PostgreSQL and to modern hardware, plus refactoring our approach in general, will give us a boost.

They likely say SELECT something FROM this JOIN that JOIN otherthing bla bla bla. How many different JOINs are in that query?

Actually no JOIN. The most "complex" query is an INSERT INTO with a NOT EXISTS check.

But thank you for your advice. I will incorporate the tips into our new design approach.

[–] [email protected] 6 points 2 weeks ago* (last edited 2 weeks ago)

While I get that SO can be monstrously unhelpful, database optimization is a whole profession, so I think we need a bit more to help.

A few directions we could go here: Post your SQL query. This could be a structure or query issue. Best case, we could do some query optimization. Also, have you looked into indexing?

Where are your bottlenecks coming from? Is your server designed for an I/O-intensive workload like databases? Sequential read speed is not a good metric.

What about concurrency? If this is super read/write intensive, optimization could depend on where data is written while you're reading.

[–] [email protected] 5 points 2 weeks ago (1 children)

Lotta smarter people than me have already posted better answers in this thread, but this really stood out to me:

The thing is, my queries are not that complex. They simply go through the whole table to identify any duplicates, which are then not processed further, because the processing takes time (which we thought would be the bottleneck). But the time saved by not processing duplicates now seems to be less than the time it takes to compare batches with the SQL table

Why aren't you de-duping the table before processing? What's inserting these duplicates and why are they necessary to the table? If they serve no purpose, find out what's generating them and stop it, or write a pre-load script to clean it up before your core processing queries access that table. I'd start here - it sounds like what's really happening is that you've got a garbage query dumping dupes into your table and bloating your db.
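If a pre-load cleanup script is the route taken, the classic pattern is a window function that keeps one row per duplicate key and deletes the rest; a rough sketch in PostgreSQL syntax with invented table/column names, run from Python:

    import psycopg2

    DEDUPE_SQL = """
        -- keep the lowest id per source_id and delete the other copies
        DELETE FROM products p
        USING (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY source_id ORDER BY id) AS rn
            FROM products
        ) d
        WHERE p.id = d.id AND d.rn > 1;
    """

    conn = psycopg2.connect("dbname=products user=app")  # hypothetical connection string
    with conn, conn.cursor() as cur:
        cur.execute(DEDUPE_SQL)
        print(f"removed {cur.rowcount} duplicate rows")

On SQL Server the same idea is usually written as a deletable CTE over ROW_NUMBER().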

[–] [email protected] 1 points 10 hours ago

Why aren’t you de-duping the table before processing?

I need to dedupe the to-be-processed data against the data that's already processed in the "final" table. We are working with hundreds of millions of products, therefore we thought about "simply" using random batches from the data to be processed. But thanks to the many replies I've already learned that our approach was wrong from the beginning.

[–] [email protected] 4 points 2 weeks ago (1 children)

Do you remember the part of education where they talked about tradeoffs? How making decision A means x, y, z good things and a, b, c bad things? Because it's reading strongly like your system design methodology was "this is the path of least resistance so I'm doing that".

Most code is not complex. Good code is usually very easy to read and understand.

Just because you can read and understand the queries you wrote doesn’t mean they’re efficient or that you’re using good design.

So yes. Stack Overflow is going to tell you to RTFM. Because someone needs to sit down with this mess, determine the pros and cons of the system design, and figure out where to start overhauling.

[–] [email protected] 1 points 9 hours ago

So yes. Stack Overflow is going to tell you to RTFM. Because someone needs to sit down with this mess, determine the pros and cons of the system design, and figure out where to start overhauling.

Yes, that's me. But thanks to the numerous replies to this thread, I now have a clearer picture of the culprits and the steps to start with.

The tradeoffs you mentioned are exactly why we are in this mess. In the beginning, with no knowledge, we thought that certain measures would help us, but it turned out that those poor decisions led us in the wrong direction.

Thank you for the reply.

[–] [email protected] 4 points 2 weeks ago (1 children)

To paraquote H. L. Mencken: For every problem, there is a solution that's cheap, fast, easy to implement -- and wrong.

Silver bullets and magic wands don't really exist, I'm afraid. There are ample reasons for DBAs being well-paid people.

There are basically three options: either increase the hardware capabilities to handle the amount of data you want to deal with, decrease the amount of data so that the hardware you've got can handle it at the level of performance you want, or... live with the status quo.

If throwing more hardware at the issue was an option, I presume you would just have done so. As for how to viably decrease the amount of data in your active set, well, that's hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series? If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage? Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

[–] [email protected] 1 points 9 hours ago

To paraquote H. L. Mencken: For every problem, there is a solution that’s cheap, fast, easy to implement – and wrong.

This can be the new slogan of our development. :')

I have convinced management to switch to a modern server. In addition we hope refactoring our approach (no random reads, no dedupe processes for a whole table, etc.) will lead us somewhere.

As for how to viably decrease the amount of data in your active set, well, that’s hard to say without knowledge of the data and what you want to do with it. Is it a historical dataset or time series?

Actually, no. We are adding a layer of product processing to an already-in-production system which already handles multiple millions of products on a daily basis. Since we not only have to process the new/updated products but also have to catch up with processing the historical (older) products, it's a massive amount of products. We thought, since the order is not important, we would use a random approach to catch up. But I see now that this is a major bottleneck in our design.

If so, do you need to integrate the entire series back until the dawn of time, or can you narrow the focus to a recent time window and shunt old data off to cold storage?

so no. No narrowing.

Is all the data per sample required at all times, or can details that are only seldom needed be split off into separate detail tables that can be stored on separate physical drives at least?

Also no, IMO. Since we don't want a product to be processed twice, we want to ensure deduplication - this requires knowledge of all already-processed products, and therefore comparing against the whole table every time.

[–] [email protected] 3 points 2 weeks ago

What is the execution path? What indexes are being hit? What are the keys? Can you separate things by, for example, category since dupes wouldn't typically exist there? There are lots of potential things that might tell you more or improve performance, but this is super vague.

[–] [email protected] 3 points 2 weeks ago (1 children)

Database performance tuning is its own little world and there are lots of complexities and caveats depending on the database system.

With MSSQL, the first thing you should check is your indexes. You should have indexes on commonly queried fields and any foreign keys. It's the best place to start because indexing alone can often make or break database performance.
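To make that concrete, a couple of hypothetical indexes created from Python via pyodbc (Python since that's the rest of the stack here; table and column names are invented):

    import pyodbc

    # hypothetical connection string; adjust driver/server/database
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=products;Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    # index the column your lookup/dedupe queries filter on...
    cur.execute("CREATE INDEX IX_products_source_id ON dbo.products (source_id);")
    # ...and any foreign key column used in joins
    cur.execute("CREATE INDEX IX_products_category_id ON dbo.products (category_id);")

    conn.commit()

Creating an index on a 10M+ row table takes a while and adds load while it builds, so it is usually done in a maintenance window.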

[–] [email protected] 1 points 9 hours ago

With MSSQL, the first thing you should check is your indexes. You should have indexes on commonly queried fields and any foreign keys. It’s the best place to start because indexing alone can often make or break database performance.

Indexing is the most common answer. But foreign keys, to my understanding (I apologize if this is wrong), would mean splitting the data into separate tables all related by this key, right? What would be the difference compared to splitting the columns of a table into multiple tables, all related by a mutual column, let's say "id"?

[–] [email protected] 3 points 2 weeks ago (1 children)

sounds like some changes would be a good idea 😅

[–] [email protected] 1 points 2 weeks ago (1 children)

Haha. Relating to a switch to SSD? Or in which direction?

[–] [email protected] 1 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

sounds like lots of directions:

  • why are duplicates such a frequent problem, sounds like upstream solutions are needed there?
  • SSD would be faster read/write, yes (your data shouldn't be on a single hard-drive, it should be regularly backed up at least - make the HDD the backup and copy the main database to SSD?); you might even consider a cloud service like AWS RDS
  • for some use-cases, a noSQL database can be faster for reading - but it's contextual
[–] [email protected] 1 points 2 weeks ago

Yes. The problem is, we are fetching products from an API, and since we thought processing power would be the limiting factor, we thought that sorting out duplicates would reduce load.

But since the different microservices which process the data take different amounts of time, we are using the SQL tables as a pool. This should help with upscaling by using multiple microservices.

Cloud services are not yet a solution, as we are still in development.

[–] [email protected] 3 points 2 weeks ago

could you tell me what book it is 👀

[–] [email protected] 2 points 2 weeks ago (2 children)

Ms sql is trash

Indexes are great but probably don't get you far if it is already really slow.

Running anything on a Hdd is a joke

You read write and compare continuously? Did you try to split it into smaller chunks?

[–] [email protected] 1 points 9 hours ago

Ms sql is trash Running anything on a Hdd is a joke

Thank you. I convinced management to migrate to modern hardware, and we are switching to PostgreSQL together with refactoring our design and approach.

You read write and compare continuously? Did you try to split it into smaller chunks?

Locks are handled by SQL. But yes, multiple tables are read and written, and the final table is compared against, by multiple requests/transactions (connections?) simultaneously. Splitting into smaller chunks would nonetheless mean that the query loops through the whole table, just in chunks? How would this help with simultaneous transactions?

[–] [email protected] 2 points 2 weeks ago

I'd prefer MS SQL over Oracle SQL any day. And PG SQL over both of them.

[–] [email protected] 1 points 2 weeks ago
[–] [email protected] 1 points 2 weeks ago* (last edited 2 weeks ago) (1 children)

"They simply go through the whole table".. that's the problem. A full table scan should be avoided at all costs.

Learn: how to run and read an explain plan, indexes, keys, constraints, and query optimization (broadly you want to locate individual records as quickly as possible by using the most selective criteria).

You also need to learn basic schema design and to familiarize yourself with normalization.

Avoid processing huge result sets in your application. The database is good at answering questions about data it contains. It isn't just a big bucket to throw data into to retrieve later.
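To illustrate that last point: rather than pulling rows into the application to compare, you can ask the database which items of an incoming batch are actually new. A sketch with an anti-join, PostgreSQL-flavoured and with invented table/column names:

    def keep_only_new(cur, incoming_ids):
        """Let the database answer: which of these IDs are not in the final table yet?"""
        cur.execute(
            """
            SELECT candidate.id
            FROM unnest(%s::text[]) AS candidate(id)
            WHERE NOT EXISTS (
                SELECT 1 FROM final_products f WHERE f.source_id = candidate.id
            )
            """,
            (incoming_ids,),
        )
        return [row[0] for row in cur.fetchall()]

With an index on final_products.source_id this is an index probe per candidate rather than a scan of the whole table.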

[–] [email protected] 1 points 9 hours ago

broadly you want to locate individual records as quickly as possible by using the most selective criteria

What can be more selective than "ID = XXX"? Yet the whole table still has to be reviewed until XXX is found?

... and to familiarize yourself with normalization.

Based on a quick review of normalization, I doubt that this helps me, as we don't have such links in the data. We "simply" have many products with certain parameters (title, description, etc.), and based on those we process each product and store it with additional output in a table. However, to not process products which were already processed, we want to dismiss any product in the processing pipeline which is already stored in the "final" table.

It isn’t just a big bucket to throw data into to retrieve later.

That's probably the biggest enlightenment I have gotten since we started working with a database.

Anyway, I appreciate your input, so thank you for this.