
Data and loads are not well distributed in my benchmark #7349

@jaltabike

Description


Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.

I tried to insert tuples into TiDB using YCSB (Yahoo! Cloud Serving Benchmark).
My TiDB settings were as follows:

  • I used 9 AWS EC2 instances (each instance is i3.2xlarge: 8 vCPUs, 61 GiB RAM, 1 x 1900 GB NVMe disk)
    . node#1~#3: 1 x TiDB & 1 x PD were deployed on each instance (i.e., total 3 x TiDBs & 3 x PDs)
    . node#4~#9: 1 x TiKV was deployed on each instance (i.e., total 6 x TiKVs)
  • tidb_version: v2.0.5
  • I used the default TiDB config, with one exception:
    . prepared_plan_cache: enabled: true
  • I used the default PD config.
  • I used the default TiKV config, with the following exceptions:
    . raftstore: sync-log: false
    . rocksdb: bytes-per-sync: "1MB"
    . rocksdb: wal-bytes-per-sync: "512KB"
    . raftdb: bytes-per-sync: "1MB"
    . raftdb: wal-bytes-per-sync: "512KB"
    . storage: scheduler-concurrency: 1024000
    . storage: scheduler-worker-pool-size: 6
    . *: max-write-buffer-number: 10
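
For reference, the overrides above would look roughly like this in tikv.toml. This is a sketch: reading "*" as "all column families" and placing max-write-buffer-number under each per-CF section is my interpretation, so verify the exact key paths against the v2.0.5 config template.

```toml
[raftstore]
sync-log = false

[rocksdb]
bytes-per-sync = "1MB"
wal-bytes-per-sync = "512KB"

# "*: max-write-buffer-number: 10" read as: set it for every column family
[rocksdb.defaultcf]
max-write-buffer-number = 10

[rocksdb.writecf]
max-write-buffer-number = 10

[rocksdb.lockcf]
max-write-buffer-number = 10

[raftdb]
bytes-per-sync = "1MB"
wal-bytes-per-sync = "512KB"

[raftdb.defaultcf]
max-write-buffer-number = 10

[storage]
scheduler-concurrency = 1024000
scheduler-worker-pool-size = 6
```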
  • I created one table like follows:
    CREATE TABLE usertable (
    YCSB_KEY VARCHAR(255) PRIMARY KEY,
    FIELD0 TEXT, FIELD1 TEXT,
    FIELD2 TEXT, FIELD3 TEXT,
    FIELD4 TEXT, FIELD5 TEXT,
    FIELD6 TEXT, FIELD7 TEXT,
    FIELD8 TEXT, FIELD9 TEXT
    );

And YCSB settings were as follows:

  • I used the YCSB "load" command to insert tuples into TiDB
  • I used the following config for the YCSB "load" phase:
    . maxexecutiontime: 3600 (i.e., insert tuples for 3600 seconds)
    . target: 32000 (i.e., insert tuples at a rate of 32000 insertions per sec)
    . threads: 512 (i.e., use 512 client threads to insert tuples)
    . db.batchsize: 100 & jdbc.autocommit: false (i.e., commit every 100 inserts)
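
These load settings imply a rough data volume, sketched below. The sketch assumes the YCSB defaults fieldcount=10 and fieldlength=100 bytes were in effect (they match the 10 TEXT columns in the schema) and that TiKV kept its default 3 replicas; actual on-disk size will differ due to key/row encoding, compression, and RocksDB space amplification.

```python
# Back-of-envelope estimate of the data volume this YCSB load generates.
target_ops_per_sec = 32_000    # YCSB "target" setting
max_execution_time_s = 3_600   # YCSB "maxexecutiontime" setting
field_count = 10               # assumed YCSB default, matches the 10 TEXT columns
field_length = 100             # assumed YCSB default field size in bytes
replicas = 3                   # assumed TiKV default replica count

total_rows = target_ops_per_sec * max_execution_time_s
raw_bytes_per_row = field_count * field_length  # payload only, key excluded
raw_gb = total_rows * raw_bytes_per_row / 1e9
replicated_gb = raw_gb * replicas

print(f"rows inserted:   {total_rows:,}")           # 115,200,000
print(f"raw payload:     ~{raw_gb:.0f} GB")         # ~115 GB
print(f"with 3 replicas: ~{replicated_gb:.0f} GB")
```

So even at the raw-payload level, a balanced cluster would need to absorb on the order of a hundred gigabytes spread over 6 TiKV stores within the hour.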
  2. What did you expect to see?
    I expected the data and loads to be well distributed.

  3. What did you see instead?
    Data and loads were not well distributed. I think the performance (TPS and latency) was limited by this.

Data were not well distributed, as follows:

  • Store size => the store size of tikv_1 and tikv_7 was 3 times larger than that of the others
  • Leader and region distribution => it was unbalanced across stores

Loads on the TiKVs were not well distributed either, as follows:
(Overall, tikv_1 seemed to be overloaded.)

  • CPU => tikv_1 used about 200% more CPU (tikv_7 about 100% more) than the other tikvs
  • Load => the load on tikv1 (= tikv_1) was twice that of the others
  • IO util => the IO util of tikv1 (= tikv_1) was six times, and that of tikv5 (= tikv_7) twice, that of the others
  • Network traffic inbound => inbound traffic of tikv1 and tikv5 was twice that of the others
  • Network traffic outbound => outbound traffic of tikv1 was much larger than that of the others
  • Region average written keys => most keys seemed to be written to tikv_1
  • Scheduler pending commands => most scheduler pending commands were issued on tikv_1
  4. Questions
    Why were data not well distributed in my situation?
    Why were loads (CPU, IO, network) not well distributed in my situation?
    What is "scheduler pending commands"? Is it related to the unbalanced load or data?
    Can a region's size exceed region-max-size? The region-size graph showed an average region size of 20.3 GiB, which is far too big! It seems there was one very large region. If there was such a region, why did it not split?
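On the last question, for reference: to my understanding, region splitting in TiKV 2.x is governed by the [coprocessor] section of tikv.toml. A sketch of the relevant settings follows, with what I believe are the defaults; the key names and values should be verified against the config template shipped with your version.

```toml
[coprocessor]
# A region is split once it grows past region-max-size,
# into new regions of roughly region-split-size each.
region-max-size = "144MB"
region-split-size = "96MB"
```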

  5. What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?
    v2.0.5

Labels: type/question (The issue belongs to a question.)